OAI-PMH Best Practices
Verifying the connection to OAI-PMH service
After the OAI-PMH module has been set up, sending an Identify request to FOLIO's OAI-PMH service is the best way to confirm the connection between the harvesting client and the service.
https://<edge-url>/oai/<api-key>?verb=Identify
If the connection is successful, the following response should be returned:
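The connection check can also be scripted. The following is a minimal sketch, not FOLIO's own tooling: the function names are hypothetical, the edge URL is assumed to be a bare host name, and the only element it reads is repositoryName, which the OAI-PMH 2.0 protocol requires in every Identify response.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def identify_url(edge_url: str, api_key: str) -> str:
    """Build the Identify request URL from the edge host and the API key."""
    return f"https://{edge_url}/oai/{api_key}?verb=Identify"


def repository_name(response_xml: str):
    """Return repositoryName from an Identify response, or None if it is absent."""
    root = ET.fromstring(response_xml)
    node = root.find("oai:Identify/oai:repositoryName", OAI_NS)
    return node.text if node is not None else None


def check_connection(edge_url: str, api_key: str, timeout: float = 30.0):
    """Send the Identify request and return the repository name it reports."""
    url = identify_url(edge_url, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return repository_name(resp.read().decode("utf-8"))
```

A call such as check_connection("<edge-url>", "<api-key>") returns the repository name when the service answers, returns None when the response carries no Identify element, and raises an exception when the service cannot be reached.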
Starting harvest
When a harvest starts, the system first checks the inventory for updates (new or modified records) and then retrieves the underlying MARC bib records from SRS. If the harvest was triggered with the metadataPrefix set to marc21_withholdings, the holdings and items data are appended as described in MODOAIPMH-102.
Max number of records per response
The number of records returned in the ListRecords response is determined by the Max number of records per response setting. Its value can be configured under Settings → OAI-PMH → Technical and can be set between 1 and 500. The default value is 100, which is also the recommended value for libraries that have instances with thousands of items associated with them (even if the items are spread across multiple holdings).
URL
The URL for harvesting consists of the following elements:
- First request:
https://<edge-url>/oai/<api-token>?verb=ListRecords&metadataPrefix=<metadataPrefix>&from=<yyyy-mm-dd>&until=<yyyy-mm-dd>
The "from" and "until" parameters can also include a timestamp, in which case the format is yyyy-mm-ddThh:mm:ssZ. This is especially helpful during troubleshooting, when the harvest needs to be limited to the specific hours in which the records were updated. The specified time is in UTC (Coordinated Universal Time).
- Consecutive requests (with resumption token):
https://<edge-url>/oai/<api-token>?verb=ListRecords&resumptionToken=<resumption-token>
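The two request forms above can be combined into a simple harvesting loop. This is a minimal sketch under stated assumptions: the function names are hypothetical, base stands for https://<edge-url>/oai/<api-token>, the datetimes passed in are timezone-aware, and the resumptionToken element is located the way the OAI-PMH 2.0 protocol defines it (a child of ListRecords that is empty or absent on the last page).

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as yyyy-mm-ddThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def first_url(base: str, prefix: str, start: datetime, end: datetime) -> str:
    """Build the first request, limited to the from/until window."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": prefix,
        "from": utc_stamp(start),
        "until": utc_stamp(end),
    }
    return base + "?" + urllib.parse.urlencode(params, safe=":")


def next_url(base: str, token: str) -> str:
    """Build a consecutive request that carries only the resumption token."""
    params = {"verb": "ListRecords", "resumptionToken": token}
    return base + "?" + urllib.parse.urlencode(params)


def resumption_token(response_xml: str):
    """Return the token text, or None when there is nothing left to harvest."""
    root = ET.fromstring(response_xml)
    node = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
    return node.text if node is not None and node.text else None


def harvest(base, prefix, start, end, fetch=None):
    """Yield each response page, following resumption tokens until none is left."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read().decode("utf-8")
    url = first_url(base, prefix, start, end)
    while url:
        page = fetch(url)
        yield page
        token = resumption_token(page)
        url = next_url(base, token) if token else None
```

The fetch parameter is there so the loop can be exercised against canned responses before it is pointed at a live system.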
Resumption token
The resumption token is included in the response if the number of harvested records is larger than the configured Max number of records per response value.
Example of the resumption token:
The resumption token can be decoded with an online Base64 decoding tool. Among other things, the token contains the following elements:
- starting position of the harvest
Example: cursor="0" means that it is the first request
- metadataPrefix used in the request
Example: metadataPrefix=marc21_withholdings
- number of records in the harvested batch
Example: offset=100
- unique identifier of the instance that will be harvested next:
Example: nextRecordId=0fc89818-6a91-45da-974d-9b16773a7e4e
- how long the token is valid
Example: until=2021-10-21
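The token can also be decoded locally instead of pasting it into an online tool. The sketch below assumes that the decoded payload is a list of key=value pairs joined by "&"; that layout is an assumption based on the examples above, and the sample payload is assembled from those example values rather than taken from a real harvest.

```python
import base64
from urllib.parse import parse_qs


def decode_resumption_token(token: str) -> dict:
    """Base64-decode a token and split it into key/value pairs."""
    padded = token + "=" * (-len(token) % 4)  # restore padding if it was stripped
    payload = base64.urlsafe_b64decode(padded).decode("utf-8")
    # Assumption: the payload looks like "key=value&key=value".
    return {key: values[0] for key, values in parse_qs(payload).items()}


# Hypothetical payload built from the example values listed above.
sample_payload = (
    "metadataPrefix=marc21_withholdings&offset=100"
    "&nextRecordId=0fc89818-6a91-45da-974d-9b16773a7e4e&until=2021-10-21"
)
sample_token = base64.urlsafe_b64encode(sample_payload.encode("utf-8")).decode("ascii")

fields = decode_resumption_token(sample_token)
for name, value in fields.items():
    print(f"{name}: {value}")
```

If a real token decodes to a different layout, only the parsing line needs to change; the Base64 step stays the same.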
Initial full harvest
The full harvest consumes a significant amount of FOLIO's resources (inventory and SRS records) and should be carefully planned. Usually, libraries run their full harvest over a weekend. The following scenarios are not supported; they lead to extremely poor performance and might eventually crash the system:
- Running multiple full harvests concurrently. Note that sharing the harvesting link with multiple users will most likely lead to multiple harvests starting at the same time
- Running a full harvest during large updates to inventory and SRS records (importing data, reloading data)
Time required to complete
How long it will take for the harvest to complete depends on:
- Configured number of records per response. The number can be set up to 500 but the recommended value is 100 or 200 records
- Number of holdings and items associated with instance records when harvesting with metadataPrefix set to marc21_withholdings
In tests conducted on collections with 5 million records, the harvest took ~10 hours with 100 records per response and ~9 hours with 200 records per response. The harvest of 8 million records took ~15 hours with 200 records per response.
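To translate these figures into request counts and throughput when planning a harvest window, the arithmetic can be sketched as follows. The helper name is hypothetical, and because the reported durations are approximate, the results are rough estimates only.

```python
import math


def harvest_stats(total_records: int, per_response: int, hours: float) -> dict:
    """Derive request count and average rates from a reported harvest duration."""
    requests = math.ceil(total_records / per_response)
    seconds = hours * 3600
    return {
        "requests": requests,
        "records_per_second": round(total_records / seconds, 1),
        "seconds_per_request": round(seconds / requests, 2),
    }


# The three test runs reported above: (records, records per response, hours).
for total, size, hours in [(5_000_000, 100, 10), (5_000_000, 200, 9), (8_000_000, 200, 15)]:
    print(total, size, harvest_stats(total, size, hours))
```

For the first run this works out to 50,000 requests at roughly 0.7 seconds each; with 200 records per response the request count halves while each request takes roughly 1.3 seconds.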
Monitoring harvest
Starting with Lotus HF1, OAI-PMH provides APIs for monitoring the harvest. In prior releases, this information was only available in the module logs. The request GET /oai/request-metadata returns information about currently running harvests, including the status of retrieving the record identifiers (UUIDs), the number of successfully saved records, the number of records skipped because the underlying SRS record is missing, and other information.
Sending the request with a request identifier, for example GET /oai/request-metadata/{requestId}/failed-to-save-instances, returns the list of UUIDs of the records that failed to save. The list can be used for troubleshooting existing data issues in FOLIO inventory or SRS. For more information, see the mod-oai-pmh README.md file or the FOLIO API documentation.
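The two monitoring requests can be prepared as shown below. This is a sketch under assumptions that should be checked against the API documentation: that the calls go through Okapi with the X-Okapi-Tenant and X-Okapi-Token headers, and that the responses are JSON. The function names are hypothetical, and the response fields are deliberately not interpreted here.

```python
import json
import urllib.request


def metadata_request(okapi_url, tenant, token, request_id=None):
    """Prepare GET /oai/request-metadata, or the failed-to-save list for one request."""
    path = "/oai/request-metadata"
    if request_id is not None:
        path += f"/{request_id}/failed-to-save-instances"
    headers = {
        "X-Okapi-Tenant": tenant,  # assumption: standard Okapi headers apply
        "X-Okapi-Token": token,
        "Accept": "application/json",
    }
    return urllib.request.Request(okapi_url.rstrip("/") + path, headers=headers, method="GET")


def fetch_json(request, timeout: float = 30.0):
    """Send a prepared request and parse the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_json(metadata_request(okapi_url, tenant, token)) lists the harvests; passing a request_id switches to the list of UUIDs that failed to save.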
Suppressed records
When the Suppressed records processing setting (Settings → OAI-PMH → Behavior) is set to "Transfer suppressed records with discovery flag value", the records marked as suppressed are included in the response with an added subfield t.
- Suppressed instance records will have the subfield t in the field 999 set to 1. If the instance is not suppressed the value will be 0
- Suppressed holdings and item records will have the subfield t in the fields 856 and 952 set to 1.
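On the harvesting side, the flag can be read from the MARCXML of each record. The following is a minimal sketch with hypothetical function names; the sample record is invented for illustration, and the only structural assumption is the standard MARC21 slim namespace.

```python
import xml.etree.ElementTree as ET

MARC = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield_t_values(record_xml: str, tag: str) -> list:
    """Collect the subfield t values from every occurrence of the given field."""
    root = ET.fromstring(record_xml)
    path = f".//marc:datafield[@tag='{tag}']/marc:subfield[@code='t']"
    return [node.text for node in root.findall(path, MARC)]


def instance_is_suppressed(record_xml: str) -> bool:
    """True when field 999 carries subfield t set to 1."""
    return "1" in subfield_t_values(record_xml, "999")


# Invented sample: a suppressed instance with one unsuppressed item.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="952" ind1=" " ind2=" "><subfield code="t">0</subfield></datafield>
  <datafield tag="999" ind1="f" ind2="f"><subfield code="t">1</subfield></datafield>
</record>"""

print(instance_is_suppressed(SAMPLE))
print(subfield_t_values(SAMPLE, "952"))
```

The same helper covers holdings and items by passing "856" or "952" as the tag.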
Deleted records
When the Deleted record support setting (Settings → OAI-PMH → Behavior) is set to "Persistent" and the record is marked as deleted (MARC LDR 05 is set to "d"), the record is included in the response with the header status set to deleted.
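A harvesting client can pick such records out of a response by looking at the status attribute of each header element, which is how the OAI-PMH 2.0 protocol marks deletions. The function name below is hypothetical and the snippet is a sketch, not part of FOLIO.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def deleted_identifiers(response_xml: str) -> list:
    """Return the identifiers of all records whose header status is 'deleted'."""
    root = ET.fromstring(response_xml)
    headers = root.findall(".//oai:header[@status='deleted']", OAI_NS)
    return [h.findtext("oai:identifier", default="", namespaces=OAI_NS) for h in headers]
```

The discovery system can then remove or hide every identifier this function returns.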
If you use API calls to delete instances, the following steps are required to ensure that the discovery system is updated as well:
- Set LDR 05 to "d" in the underlying SRS record. This can be done by editing the record in QuickMarc.
- Let the incremental harvest get the information about the deleted record.
- Delete the instance record and the corresponding SRS record through an API call
Slow Performance
The harvesting of 5 million records should take no more than 11 hours in Juniper and less than 10 hours in Kiwi. It is highly recommended to run REINDEX, VACUUM and ANALYZE after major updates to the inventory tables in the PostgreSQL database, and to run ANALYZE on a regular basis.
Forcing record updates
Staff actions on Instances, Holdings, and Items through the UI automatically trigger updates. However, API-based processes that modify the discovery flag, locations, or other fields through the storage endpoints do not trigger an update. To trigger an update for such records, issue a GET to one of the following /inventory endpoints and then PUT the retrieved record back. Although this process is slow, a maximum of 4 threads is recommended:
- /inventory/instances
- /inventory/holdings
- /inventory/items
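The GET-then-PUT cycle with the recommended thread limit can be sketched as follows. This is an illustration under assumptions: a single record is addressed as <endpoint>/<record id>, the record is sent back unmodified, and the caller supplies whatever authentication and Content-Type headers the system requires. All function names are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 4  # upper limit recommended above


def http_json(method, url, headers, body=None):
    """Send one request and return the parsed JSON body, if there is one."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else None


def touch_record(okapi_url, endpoint, record_id, headers, send=http_json):
    """GET one record and PUT it back unchanged to trigger an update."""
    url = f"{okapi_url.rstrip('/')}{endpoint}/{record_id}"
    record = send("GET", url, headers)
    send("PUT", url, headers, record)
    return record_id


def touch_all(okapi_url, endpoint, record_ids, headers, send=http_json):
    """Touch many records using at most MAX_THREADS worker threads."""
    def work(record_id):
        return touch_record(okapi_url, endpoint, record_id, headers, send)

    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        return list(pool.map(work, record_ids))
```

For example, touch_all(okapi_url, "/inventory/items", item_ids, headers) processes the listed items four at a time; the send parameter allows a dry run with a stub before any record is written.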