|
Hi Craig McNally, Hkaplanian,
It seems there is an issue with the resumptionToken implementation based on inventory-instance-storage's offset/limit query arguments.
According to the specification:
> When there are changes in the repository, there may be changes to the complete list returned by the list request sequence. These changes occur when the records disseminated in the list move in or out of the datestamp range of the request because of changes, modifications, or deletions in the repository. In this case, strict idempotency of the incomplete-list requests using resumptionToken values is not required. Instead, the incomplete list returned in response to a re-issued request must include all records with unchanged datestamps within the range of the initial list request. The incomplete list returned in response to a re-issued request may contain records with datestamps that either moved into or out of the range of the initial request. In cases where there are substantial changes to the repository, it may be appropriate for a repository to return a badResumptionToken error, signaling that the harvester should restart the list request sequence.
inventory-instance-storage's offset/limit query arguments do not allow us to satisfy this requirement (specifically, that the response to a re-issued request must include all records with unchanged datestamps within the range of the initial list request). I will illustrate this with an example:
The complete list contains 5 instances: A, B, C, D, E, and we use a limit of 2 (returning 2 instances at a time). Here is the flow:
- Harvester issues ListRecords request
- The repo responds with an incomplete list of 2 records and a resumptionToken (internally, an offset of 0 is used):
A, B
- Harvester issues subsequent request with resumptionToken
- In the meantime, record B is updated and moves to a new position in the result set (or even out of the result set, if its datestamp goes beyond the specified upper bound)
- The repo recalculates the result set and responds with an incomplete list of 2 records and a resumptionToken (internally, an offset of 2 is used this time):
D, E
(record C is lost because it moved to the 'left' and fell outside the subset at offset 2; the result set now looks like A, C, D, E)
In other words, the offset argument does not provide the required guarantee (no lost records) when the initially requested result set changes, because inventory-instance-storage regenerates the result set whenever changes occur.
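The flow above can be reproduced with a small simulation (illustrative Python, not FOLIO code; the sort key merely stands in for whatever ordering the storage applies):

```python
# Simulates offset/limit paging over a result set ordered by updated date,
# with a record modified between the two page requests.

def page(records, offset, limit):
    # the storage regenerates and re-orders the result set on every request
    ordered = sorted(records, key=lambda r: (r["updated"], r["id"]))
    return [r["id"] for r in ordered[offset:offset + limit]]

# Complete list: A..E with increasing update timestamps.
repo = [{"id": i, "updated": t} for t, i in enumerate("ABCDE")]

first = page(repo, offset=0, limit=2)        # ['A', 'B']

# Record B is updated, which moves it to the end of the ordering.
next(r for r in repo if r["id"] == "B")["updated"] = 10

second = page(repo, offset=2, limit=2)       # ['D', 'E'] -- 'C' is skipped

harvested = first + second                   # 'C' never reaches the harvester
```

Because the offset is applied to the freshly regenerated set (A, C, D, E, B), record C silently shifts into the already-consumed window and is lost.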
I tested this scenario on the folio/testing-backend vagrant box, and it behaves as described above.
After 2 queries, there were the following instances:
Offset 0
7fbd5d84-62d1-44c6-9c45-6cb173998bbd
549fad9e-7f8e-4d8e-9a71-00d251817866
Offset 2
a89eccf0-57a6-495e-898d-32b9b2210f2f
a317b304-528c-424f-961c-39174933b454
Then I updated the 549fad9e... instance and repeated the queries:
Offset 0
7fbd5d84-62d1-44c6-9c45-6cb173998bbd
a89eccf0-57a6-495e-898d-32b9b2210f2f (this item moved from the second subset and could be lost)
Offset 2
a317b304-528c-424f-961c-39174933b454
c1d3be12-ecec-4fab-9237-baf728575185
For now, I don't see a way to overcome this using the offset/limit query argument approach.
What do you think?
|
|
There are 2 ways to overcome the issue described above:
First approach
- Limit the upper bound to the current datetime if it is not provided or is later than the current datetime (this is even recommended by the specification).
This guarantees that the number of records in the result set cannot grow.
- Add the initial total number of records to the resumptionToken and compare it on each sequential call.
If it decreases, some records have changed and moved out of the result set, so we may run into the problem above. To avoid this, we can return a badResumptionToken error in that case, which is acceptable according to the specification:
> In cases where there are substantial changes to the repository, it may be appropriate for a repository to return a badResumptionToken error, signaling that the harvester should restart the list request sequence.
Cons:
- If the repository changes rapidly compared with the rate at which a harvester issues follow-on requests, this may result in a large number of failed request sequences.
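A minimal sketch of this count check (names like BadResumptionToken and next_page are illustrative; the real token format and storage calls would differ):

```python
# First approach, sketched: the token carries the total count observed when
# paging started, and a shrinking result set aborts the sequence.

class BadResumptionToken(Exception):
    """Signals the harvester to restart the list request sequence."""

def next_page(record_ids, token):
    # record_ids: the current complete, ordered result set
    total = len(record_ids)
    if total < token["total"]:
        # some records left the result set; continuing could skip records
        raise BadResumptionToken()
    page = record_ids[token["offset"]:token["offset"] + token["limit"]]
    new_token = {**token, "offset": token["offset"] + token["limit"]}
    return page, new_token

page, token = next_page(list("ABCDE"), {"offset": 2, "limit": 2, "total": 5})
# page == ['C', 'D']; if a deletion later shrinks the set to 4 records,
# the next call raises BadResumptionToken instead of silently losing data.
```

The check is deliberately coarse: any shrink aborts, even if the departed record had already been harvested, which is the "failed request sequences" cost noted above.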
Second approach
- Limit the upper bound to the current datetime as in the first approach
- Sort the result set by createdDate/updatedDate (is this possible via a CQL query?)
- Instead of using the 'offset' argument, shift the lower bound to the createdDate/updatedDate of the last record in the incomplete result set.
This would guarantee that we pick up all the records even if the result set changes.
Cons:
- This approach requires sorting the result set by createdDate/updatedDate, which may affect overall request performance (in particular if the records are not stored in that order).
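A sketch of the datestamp-cursor idea, assuming the result set can be sorted by updatedDate (illustrative Python, not the actual implementation):

```python
# Second approach, sketched: the token stores the last datestamp seen
# instead of an offset, so updated records cannot shift unseen ones
# into the already-consumed window.

def page_by_datestamp(records, lower, upper, limit):
    ordered = sorted(
        (r for r in records if lower < r["updated"] <= upper),
        key=lambda r: (r["updated"], r["id"]),
    )
    page = ordered[:limit]
    # the next request shifts the lower bound to the last record's datestamp;
    # note: multiple records sharing that exact datestamp need a tiebreaker
    next_lower = page[-1]["updated"] if page else None
    return [r["id"] for r in page], next_lower

repo = [{"id": i, "updated": t} for t, i in enumerate("ABCDE")]
first, cursor = page_by_datestamp(repo, lower=-1, upper=10, limit=2)  # A, B
repo[1]["updated"] = 10   # B is updated and moves to the end of the range
second, cursor = page_by_datestamp(repo, cursor, upper=10, limit=2)   # C, D
# 'C' is no longer lost, unlike in the offset-based flow above
```

The updated record B is simply re-harvested later with its new datestamp, which the specification explicitly permits.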
In terms of implementation, both approaches require roughly the same effort.
Craig McNally,
could you please review and share your opinion?
|
|
> Limit the upper bound to the current datetime if it's not provided or if it's later than current datetime
Yes, definitely
I think option #1 is the preferred approach here. We might consider a slight variation or extension to this approach.
- include the length of the original result set in the resumption token
- include the UUID of the next result in the resumption token
- this requires that internally we ask for the desired length + 1
- on subsequent requests, compare the result set length with that provided in the resumption token.
- if the same, continue as usual
- if smaller, a record has left the result set.
- compare the first record in the results with the UUID from the resumptionToken.
- if the same, update the record set length in the resumption token and continue as usual (the record that left the result set has not been reached yet).
- if different, respond with badResumptionToken (continuing will result in missing records)
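The checks above can be sketched as follows (illustrative names and token layout; the ids list stands for the current complete, ordered result set):

```python
# Sketch of the length + next-UUID variation: the token stores the original
# result-set length and the UUID of the next record (obtained by internally
# fetching limit + 1 records on the previous request).

class BadResumptionToken(Exception):
    """Continuing would result in missing records."""

def next_page(ids, token):
    remaining = ids[token["offset"]:]
    if len(ids) < token["total"]:
        # a record has left the result set; safe to continue only if the
        # next expected record is still in place (the departed record lies
        # beyond our current position and has not been reached yet)
        if remaining and remaining[0] == token["next_id"]:
            token = {**token, "total": len(ids)}
        else:
            raise BadResumptionToken()
    page = remaining[:token["limit"]]
    peek = remaining[token["limit"]:token["limit"] + 1]  # the limit + 1 fetch
    new_token = {
        "offset": token["offset"] + token["limit"],
        "limit": token["limit"],
        "total": len(ids),
        "next_id": peek[0] if peek else None,
    }
    return page, new_token

# With token {offset: 2, limit: 2, total: 5, next_id: 'C'}: deleting a record
# beyond 'C' just updates the count; deleting one before it raises
# BadResumptionToken, since the offset would now skip an unharvested record.
page, token = next_page(list("ABCDE"), {"offset": 2, "limit": 2,
                                        "total": 5, "next_id": "C"})
```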
|