OAI-PMH Support (UXPROD-993)

[MODOAIPMH-10] OAI-PMH: IncompleteResponse & resumableTokens Created: 28/Sep/18  Updated: 01/Apr/20  Resolved: 09/Nov/18

Status: Closed
Project: mod-oai-pmh
Components: None
Affects versions: None
Fix versions: 1.0.0
Parent: OAI-PMH Support

Type: Story Priority: P3
Reporter: Hkaplanian Assignee: Pavel Korolenok
Resolution: Done Votes: 0
Labels: epam-thunderjet
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Relates
relates to UXPROD-350 OAI-PMH Support Closed
relates to MODOAIPMH-112 Can't complete requests chain without... Closed
Sprint: oai-pmh - sprint 50
Story Points: 5
Development Team: Thunderjet
Epic Link: OAI-PMH Support

 Description   

Add incompleteResponse/resumptionTokens to all applicable verbs. See https://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl and http://www.openarchives.org/OAI/2.0/guidelines-repository.htm#resumptionToken for details.

Applicable Verbs:

  • ListIdentifiers
  • ListRecords
  • ListSets

Things to note:

  • the offset/limit query arguments will come in handy here. See the inventory-storage API docs for details.
  • resumptionToken is an exclusive argument

Acceptance Criteria:

  • Incomplete responses/resumption Tokens are implemented for all applicable verbs
  • expirationDate should be calculated using the date-time of the original response and a TTL value configurable via System Property, e.g. resumption_token_ttl_ms
    • A reasonable default should be selected in the event the System property is not specified... 10 minutes
  • expirationDate is provided in the ISO8601 format (YYYY-MM-DDThh:mm:ssZ)
  • The maximum number of records returned should be controlled via a System Property, e.g. max_records_per_response
    • A reasonable default should be selected in the event the System property is not specified.... 100?
  • CompleteListSize and cursor should be returned on all incomplete requests (these are optional in the spec)
  • Caching of result sets is not required at this time.


 Comments   
Comment by Pavel Korolenok [ 30/Oct/18 ]

Hi Craig McNally, Hkaplanian,

It seems there is an issue with resumptionToken implementation based on inventory-instance-storage's offset/limit query arguments.

According the the specification:

When there are changes in the repository. There may be changes to the complete list returned by the list request sequence. These changes occur when the records disseminated in the list move in or out of the datestamp range of the request because of changes, modifications, or deletions in the repository. In this case, strict idempotency of the incomplete-list requests using resumptionToken values is not required. Instead, the incomplete list returned in response to a re-issued request must include all records with unchanged datestamps within the range of the initial list request. The incomplete list returned in response to a re-issued request may contain records with datestamps that either moved into or out of the range of the initial request. In cases where there are substantial changes to the repository, it may be appropriate for a repository to return a badResumptionToken error, signaling that the harvester should restart the list request sequence.

inventory-instance-storage's offset/limit query arguments does not allow to satisfy this requirement (specifically, in case of re-issued request include all records with unchanged datestamps within the range of the initial list request). I will explain it by an example:
The complete list contains 5 instances: A, B, C, D, E. And we use limit of 2 (return 2 instances at a time). Here is a flow:

  • Harvester issues ListRecords request
  • The repo responds with an incomplete list of 2 records and resumptionToken (internally offset of 0 is used)
    A, B
  • Harvester issues subsequent request with resumptionToken
  • In the meantime, record B is updated and moved to a new place in the result set (or even out of the result set if it goes beyond upper bound specified)
  • The repo calculates another set and responds with an incomplete list of 2 records and resumptionToken (internally offset of 2 is used this time)
    D, E
    (record C is lost, because it is moved to the 'left' and turned out to be out of subset with offset of 2. The result set looks like - A, C, D, E)

In other words, offset argument does not provide required guarantees (no lost records) when initially requested result set changes, because inventory-instance-storage regenerates the result set when the changes occur.
I tested this scenario on folio/testing-backed vagrant box and it works as described above:
There were the following instances after 2 queries:
Offset 0
7fbd5d84-62d1-44c6-9c45-6cb173998bbd
549fad9e-7f8e-4d8e-9a71-00d251817866
Offset 2
a89eccf0-57a6-495e-898d-32b9b2210f2f
a317b304-528c-424f-961c-39174933b454

Then I changed 549fad9e... instance and repeated the queries:
Offset 0
7fbd5d84-62d1-44c6-9c45-6cb173998bbd
a89eccf0-57a6-495e-898d-32b9b2210f2f (this item moved from the second subset and could be lost)
Offset 2
a317b304-528c-424f-961c-39174933b454
c1d3be12-ecec-4fab-9237-baf728575185

For now, I don't see ways to overcome it using approach with offset/limit query arguments.
What do you think?

Comment by Pavel Korolenok [ 30/Oct/18 ]

There are 2 ways to overcome the issue described above:
First approach

  • Limit the upper bound to the current datetime if it's not provided or if it's later than current datetime (this is even recommended according the specification)
    This should guarantee that number of records in the result set will not grow.
  • Add initial total number of records to the resumptionToken and compare it during each sequential call
    If it decreases, it means some records has changed and moved out of result set and we may run into the problem above. To avoid this issue, we can generate badResumptionToken error in this case which is acceptable according to the specification

    In cases where there are substantial changes to the repository, it may be appropriate for a repository to return a badResumptionToken error, signaling that the harvester should restart the list request sequence.

    Cons:

  • If the repository changes rapidly compared with the rate at which a harvester issues follow-on requests, it may result in big number of failed request sequences.

Second approach

  • Limit the upper bound to the current datetime as in the first approach
  • Sort the result set by createdDate/updatedDate (is it possible via CQL query?)
  • Instead of using 'offset' argument, shift the lower bound to the created/updatedDate of the last record in the incomplete result set
    It would guarantee that we will pick up all the records even if the result set changes
    Cons:
  • This approach requires sorting of the result set by created/updatedDate which may affect overall request performance (in particular, if the records are not sorted in this way in the storage).

In terms of implementation, both approaches require more or less the same effort.

Craig McNally,
could you please review and share your opinion?

Comment by Craig McNally [ 30/Oct/18 ]

> Limit the upper bound to the current datetime if it's not provided or if it's later than current datetime
Yes, definitely

I think option #1 is the preferred approach here. We might consider a slight variation or extension to this approach.

  • include the length of the original result set in the resumption token
  • include the UUID of the next result in the resumption token
    • requires that internally we ask for desired length+1
  • on subsequent requests, compare the result set length with that provided in the resumption token.
    • if the same, continue as usual
    • if smaller, a record has left the result set.
      • compare the first record in the results with the UUID from the resumptionToken.
      • if the same, update the record set length in the resumption token and continue as usual (the record that left the result set has not been reached yet).
      • if different, respond with badResumptionToken (continuing will result in missing records)
Comment by Pavel Korolenok [ 06/Nov/18 ]

The story is completed and ready for review.

Notes:

  • Flow Control (resumptionToken) is implemented for ListIdentifiers and ListRecords verbs. ListSets verb supports only one set for now, so do not require Flow Control.
  • Inventory-storage's offset/limit query params are used as a basis for Flow Control support. The approach is described above.
  • "Encoding State in the resumptionToken" strategy is implemented to support resumptionToken
  • expirationDate is not used as resumptionToken values do not expire (this is a result of using "encoding state" approach). This is acceptable according to the OAI-PMH spec.
  • The maximum number of records returned is controlled via a System Property
  • CompleteListSize and Cursor are returned on all incomplete requests
  • The resumptionToken value is Base-64 encoded (in URL safe manner) so that it can be used as GET param as is. It is done in order to not rely on Harvester to encode resumptionToken before adding it to the subsequent request (e.g. MarcEdit Harvester do not encode it and adds to the request as is)
Generated at Fri Feb 09 00:13:40 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.