
OAI-PMH Best Practices


Verifying the connection to OAI-PMH service

After the OAI-PMH module has been set up, sending an Identify request to FOLIO's OAI-PMH service is the best way to confirm the connection between the harvesting client and the service.

https://<edge-url>/oai/<api-key>?verb=Identify

If the connection is successful, the following response should be returned:

Identify response Example
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"
xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:marc="http://www.loc.gov/MARC21/slim"
xmlns:oai-identifier="http://www.openarchives.org/OAI/2.0/oai-identifier"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<responseDate>2021-10-21T20:31:38Z</responseDate>
<request verb="Identify">http://folio.org/oai</request>
<Identify>
<repositoryName>FOLIO_OAI_Repository</repositoryName>
<baseURL>http://folio.org/oai</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>oai-pmh@folio.org</adminEmail>
<earliestDatestamp>1970-01-01T00:00:00Z</earliestDatestamp>
<deletedRecord>persistent</deletedRecord>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>gzip</compression>
<compression>deflate</compression>
<description>
<oai-identifier:oai-identifier>
<oai-identifier:scheme>oai</oai-identifier:scheme>
<oai-identifier:repositoryIdentifier>folio.org</oai-identifier:repositoryIdentifier>
<oai-identifier:delimiter>:</oai-identifier:delimiter>
<oai-identifier:sampleIdentifier>oai:folio.org:diku/3c4ae3f3-b460-4a89-a2f9-78ce3145e4fc</oai-identifier:sampleIdentifier>
</oai-identifier:oai-identifier>
</description>
</Identify>
</OAI-PMH>
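
The check can also be scripted. The following is a minimal sketch, not part of FOLIO: the function names are illustrative, and it assumes that <edge-url> is a host name without the scheme and that the response has the shape shown above.

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def identify_url(edge_url, api_key):
    """Build the Identify URL (edge_url is assumed to be a bare host name)."""
    return f"https://{edge_url}/oai/{api_key}?verb=Identify"


def parse_identify(xml_data):
    """Return key fields of an Identify response, or raise if it is not one."""
    root = ET.fromstring(xml_data)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("response does not contain an <Identify> element")
    fields = ("repositoryName", "baseURL", "protocolVersion",
              "deletedRecord", "granularity")
    # Fields missing from the response are reported as None.
    return {name: identify.findtext(f"oai:{name}", namespaces=OAI_NS)
            for name in fields}


def check_connection(edge_url, api_key):
    """Send the Identify request and parse the reply (performs network I/O)."""
    with urllib.request.urlopen(identify_url(edge_url, api_key), timeout=30) as resp:
        return parse_identify(resp.read())
```

A successful call to check_connection returns a dictionary such as {"repositoryName": "FOLIO_OAI_Repository", ...}; an HTTP error or a missing Identify element indicates that the connection or the API key needs attention.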

Starting harvest

When a harvest starts, the system first checks the inventory for updates (new or modified records) and then retrieves the underlying MARC bibliographic records from SRS. If the harvest was triggered with the metadataPrefix set to marc21_withholdings, the holdings and items data are appended as described in MODOAIPMH-102.

Max number of records per response

The number of records returned in the ListRecords response is determined by the Max number of records per response setting. Its value is configured under Settings → OAI-PMH → Technical and can be set between 1 and 500. The default value is 100, which is also the recommended value for libraries that have instances with thousands of associated items (even if the items are spread across multiple holdings).

URL

The URL for harvesting consists of the following elements:

  • First request:

https://<edge-url>/oai/<api-token>?verb=ListRecords&metadataPrefix=<metadataPrefix>&from=<yyyy-mm-dd>&until=<yyyy-mm-dd>

The "from" and "until" parameters can also include a timestamp, in which case the format is yyyy-mm-ddThh:mm:ssZ. This is especially helpful during troubleshooting, when the harvest needs to be limited to the specific hours in which the records were updated. The specified time is in UTC (Coordinated Universal Time).

  • Consecutive requests (with resumption token):

https://<edge-url>/oai/<api-token>?verb=ListRecords&resumptionToken=<resumption-token>
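
Both URL forms can be assembled programmatically. The sketch below is illustrative only: the function names are made up for this example, and it assumes that <edge-url> is a bare host name. urlencode percent-encodes the colons in timestamps and any special characters in the resumption token.

```python
from urllib.parse import urlencode


def first_request_url(edge_url, api_token, metadata_prefix, from_=None, until=None):
    """Build the URL of the first ListRecords request."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_:
        params["from"] = from_      # yyyy-mm-dd or yyyy-mm-ddThh:mm:ssZ (UTC)
    if until:
        params["until"] = until
    return f"https://{edge_url}/oai/{api_token}?{urlencode(params)}"


def next_request_url(edge_url, api_token, resumption_token):
    """Build the URL of a consecutive request that carries a resumption token."""
    params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    return f"https://{edge_url}/oai/{api_token}?{urlencode(params)}"
```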

Resumption token

The resumption token is included in the response if the number of harvested records is larger than the configured Max number of records per response value.

Example of the resumption token:

resumptionToken Example
<resumptionToken cursor="0">bWV0YWRhdGFQcmVmaXg9bWFyYzIxX3dpdGhob2xkaW5ncyZmcm9tPTIwMjEtMTAtMTUmbmV4dEluc3RhbmNlUGtWYWx1ZT01NDM1MTI2NiZvZmZzZXQ9MTAwJnJlcXVlc3RJZD02ODQyMjBiNy1kMjhhLTRhZWItYmEwNS02NDhjMmM2ODg0NGEmbmV4dFJlY29yZElkPTBmYzg5ODE4LTZhOTEtNDVkYS05NzRkLTliMTY3NzNhN2U0ZSZ1bnRpbD0yMDIxLTEwLTIx</resumptionToken>
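
A harvesting client normally repeats the request with each returned token until no token comes back; in OAI-PMH the last batch carries an empty resumptionToken element or none at all. The loop can be sketched as follows. The fetch_first and fetch_next callables are hypothetical placeholders for whatever HTTP client the harvester uses, and each is assumed to return the response XML.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def extract_resumption_token(xml_data):
    """Return the token text, or None when the response is the last batch."""
    root = ET.fromstring(xml_data)
    element = root.find(".//oai:resumptionToken", OAI_NS)
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


def harvest(fetch_first, fetch_next):
    """Yield every response body until no resumption token is returned.

    fetch_first() and fetch_next(token) are caller-supplied callables that
    perform the HTTP requests and return the response XML.
    """
    body = fetch_first()
    while True:
        yield body
        token = extract_resumption_token(body)
        if token is None:
            break
        body = fetch_next(token)
```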

The resumption token can be decoded with any Base64 decoder. Together with the cursor attribute of the resumptionToken element, it provides, among others, the following information:

  • starting position of the harvest (the cursor attribute)

Example: cursor="0" means that it is the first request

  • metadataPrefix used in the request 

Example: metadataPrefix=marc21_withholdings

  • number of records in the harvested batch

Example: offset=100

  • unique identifier of the instance that will be harvested next:

Example: nextRecordId=0fc89818-6a91-45da-974d-9b16773a7e4e

  • how long the token is valid

Example: until=2021-10-21
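
Instead of an online tool, the token can be decoded locally. The sketch below assumes, based on the examples above, that the token is standard Base64 wrapping a URL-style query string; the function name is illustrative.

```python
import base64
from urllib.parse import parse_qs


def decode_resumption_token(token):
    """Decode a resumption token into a dict of its key=value elements."""
    # Restore any Base64 padding that was stripped from the token.
    padded = token + "=" * (-len(token) % 4)
    decoded = base64.b64decode(padded).decode("utf-8")
    # parse_qs returns a list per key; each key is assumed to occur once.
    return {key: values[0] for key, values in parse_qs(decoded).items()}
```

Applied to a token like the one shown earlier, the result contains keys such as metadataPrefix, offset, nextRecordId and until.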

Initial full harvest

The full harvest consumes a significant amount of FOLIO's resources (inventory and SRS records) and should be carefully planned. Usually, libraries run their full harvest over weekends. The following scenarios are not supported; they lead to extremely poor performance and might eventually crash the system:

  • Running multiple full harvests concurrently. Note that sharing the harvesting link with multiple users will most likely result in multiple harvests starting at the same time
  • Running a full harvest during large updates to inventory and SRS records (importing data, reloading data)

Time required to complete

How long it will take for the harvest to complete depends on:

  • Configured number of records per response.  The number can be set up to 500 but the recommended value is 100 or 200 records
  • Number of holdings and items associated with instance records when harvesting with metadataPrefix set to marc21_withholdings

In tests conducted on collections with 5 million records the harvest took ~10 hours with 100 records per response and ~9 hours with 200 records per response.  The harvest of 8 million records took ~15 hours with 200 records per response.
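
These figures can be turned into a rough per-response cost. The arithmetic below uses only the numbers quoted above and is meant as an estimate, not a benchmark: the three runs work out to roughly 0.7, 1.3 and 1.35 seconds per response.

```python
def seconds_per_response(total_records, records_per_response, hours):
    """Average time spent on each ListRecords response during a harvest."""
    responses = total_records / records_per_response
    return hours * 3600 / responses


# (total records, records per response, approximate hours) from the tests above
runs = [(5_000_000, 100, 10), (5_000_000, 200, 9), (8_000_000, 200, 15)]

for total, batch, hours in runs:
    cost = seconds_per_response(total, batch, hours)
    print(f"{total} records, {batch} per response: {cost:.2f} s per response")
```

Doubling the batch size halves the number of requests, but each response takes almost twice as long, which is why the total time improves only modestly.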

Monitoring harvest

Starting with Lotus HF1, OAI-PMH provides APIs for monitoring the harvest. In prior releases this information was available only in the module logs. The request GET /oai/request-metadata returns information about currently running harvests, including the status of retrieving record identifiers (UUIDs), the number of successfully saved records, the number of records skipped because the underlying SRS record is missing, and other details.

Sending the request with a request identifier, for example GET /oai/request-metadata/{requestId}/failed-to-save-instances, returns the list of UUIDs of the records that failed to save. The list can be used for troubleshooting existing data issues in FOLIO inventory or SRS. For more information, see the mod-oai-pmh README.md file or the FOLIO API documentation.
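
The two monitoring requests can be prepared as shown in the following sketch. It assumes that the calls go through the Okapi gateway with the usual X-Okapi-Tenant and X-Okapi-Token headers; the okapi_url, tenant and token values and the function name are placeholders, so check the API documentation for the exact access path in your environment.

```python
import urllib.request


def request_metadata_request(okapi_url, tenant, token, request_id=None):
    """Prepare (but do not send) a GET request to the monitoring API.

    Without request_id the request lists the harvests; with request_id it
    asks for the UUIDs of the instances that failed to save.
    """
    path = "/oai/request-metadata"
    if request_id:
        path += f"/{request_id}/failed-to-save-instances"
    return urllib.request.Request(
        okapi_url.rstrip("/") + path,
        headers={
            "X-Okapi-Tenant": tenant,   # assumed Okapi tenant header
            "X-Okapi-Token": token,     # assumed Okapi auth token header
            "Accept": "application/json",
        },
        method="GET",
    )
```

The prepared request can then be sent with urllib.request.urlopen and the JSON body inspected for the counters described above.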

Suppressed records

When the Suppressed records processing setting (Settings → OAI-PMH → Behavior) is set to "Transfer suppressed records with discovery flag value", records marked as suppressed are included in the response, and the subfield t is added to indicate the suppression status.

  • Suppressed instance records will have the subfield t in the field 999 set to 1.  If the instance is not suppressed the value will be 0
Suppressed instance
<marc:datafield tag="999" ind1="f" ind2="f">
<marc:subfield code="s">d2c1534d-41da-49c5-858b-850dbe23d1fa</marc:subfield>
<marc:subfield code="i">d2c1534d-41da-49c5-858b-850dbe23d1fa</marc:subfield>
<marc:subfield code="t">1</marc:subfield>
</marc:datafield>
  • Suppressed holdings and item records will have the subfield t in the fields 856 and 952 set to 1. 
Suppressed holdings and items - field 856
<marc:datafield tag="856" ind1="4" ind2="0">
<marc:subfield code="u">http://www.cairn.info/revue-dix-septieme-siecle.htm</marc:subfield>
<marc:subfield code="t">0</marc:subfield>
</marc:datafield>

Suppressed holdings and item records - 952 field
<marc:datafield tag="952" ind1="f" ind2="f">
<marc:subfield code="t">1</marc:subfield>
<marc:subfield code="e">BS2535.E7W63</marc:subfield>
<marc:subfield code="h">Library of Congress classification</marc:subfield>
</marc:datafield>
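
A harvesting client can read the discovery flag from these fields. The following is a minimal sketch using only the field and subfield layout shown in the examples above; the function name is illustrative and the record is assumed to be MARCXML in the marc namespace.

```python
import xml.etree.ElementTree as ET

MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def discovery_suppress_flags(record_xml, tag):
    """Return the subfield t values of every datafield with the given tag."""
    root = ET.fromstring(record_xml)
    flags = []
    for field in root.iter(f"{{{MARC_URI}}}datafield"):
        if field.get("tag") != tag:
            continue
        for subfield in field.findall("marc:subfield[@code='t']", MARC_NS):
            flags.append(subfield.text)
    return flags
```

For an instance, discovery_suppress_flags(record, "999") returning ["1"] means the instance is suppressed; tags "856" and "952" give the flags of the holdings and item data.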

Deleted records

When the Deleted record support setting (Settings → OAI-PMH → Behavior) is set to "Persistent" and the record is marked as deleted (MARC Leader position 05 is set to "d"), the record is included in the response with the header status set to deleted.

Deleted record example
<record>
<header status="deleted">
<identifier>oai:edge-bugfest-iris.folio.ebsco.com:fs09000000/ce064ce6-3d9c-4765-a3cf-564289f59b58</identifier>
<datestamp>2021-10-22T18:50:22Z</datestamp>
<setSpec>all</setSpec>
</header>
</record>
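
On the client side, deleted records can be picked out of a ListRecords response by the status attribute of the header. A minimal sketch (the function name is illustrative):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def deleted_identifiers(list_records_xml):
    """Return the identifiers of all records whose header status is deleted."""
    root = ET.fromstring(list_records_xml)
    deleted = []
    for header in root.iter(f"{OAI}header"):
        if header.get("status") == "deleted":
            deleted.append(header.findtext(f"{OAI}identifier"))
    return deleted
```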


If you use an API process to delete Instances, be sure to delete the associated SRS record to trigger deleted record support. If you delete an Instance without knowing the associated SRS id, the only way to find it is by retrieving all Instances and SRS and then finding which SRS record points to an Instance that doesn't exist.

Slow Performance 

Harvesting 5 million records should take no more than 11 hours in Juniper and less than 10 hours in Kiwi. It is highly recommended to run REINDEX, VACUUM and ANALYZE after major updates to the inventory tables in the PostgreSQL database, and to run ANALYZE on a regular basis.

REINDEX and ANALYZE commands
xxxx=> REINDEX index <tenant>_mod_inventory_storage.audit_item_pmh_createddate_idx;
REINDEX
xxxx=> REINDEX index <tenant>_mod_inventory_storage.audit_holdings_record_pmh_createddate_idx;
REINDEX
xxxx=> REINDEX index <tenant>_mod_inventory_storage.holdings_record_pmh_metadata_updateddate_idx;
REINDEX
xxxx=> REINDEX index <tenant>_mod_inventory_storage.item_pmh_metadata_updateddate_idx;
REINDEX
xxxx=> REINDEX index <tenant>_mod_inventory_storage.instance_pmh_metadata_updateddate_idx;
REINDEX
xxxx=> analyze verbose <tenant>_mod_inventory_storage.instance;
ANALYZE
xxxx=> analyze verbose <tenant>_mod_inventory_storage.item;
ANALYZE
xxxx=> analyze verbose <tenant>_mod_inventory_storage.holdings_record;
ANALYZE

Forcing record updates

Staff actions on Instances, Holdings, and Items through the UI automatically trigger updates. However, API-based processes that modify the discovery flag, locations, or other fields through the storage endpoints do not trigger an update. To trigger an update for a record, issue a GET to one of the following /inventory endpoints and then PUT the retrieved record back unchanged. Although this process is slow, using more than 4 threads is not recommended:

  • /inventory/instances
  • /inventory/holdings
  • /inventory/items
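
The GET-then-PUT cycle with the recommended thread limit can be sketched as follows. The get_record and put_record callables are hypothetical placeholders for the HTTP calls to one of the /inventory endpoints above; they are passed in so that the sketch does not assume any particular client library or authentication setup.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 4  # upper limit recommended above


def touch_record(record_id, get_record, put_record):
    """GET a record and PUT it back unchanged to trigger an update."""
    record = get_record(record_id)
    put_record(record_id, record)
    return record_id


def force_updates(record_ids, get_record, put_record, threads=MAX_THREADS):
    """Touch every record using at most MAX_THREADS worker threads."""
    workers = max(1, min(threads, MAX_THREADS))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda record_id: touch_record(record_id, get_record, put_record),
            record_ids,
        ))
```

Any exception raised by get_record or put_record propagates out of force_updates, so a failed record stops the run instead of being skipped silently.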



