Multiple MODOAIPMH instances compatible approach

Problem statement

  • Multiple MODOAIPMH instances should be deployed at the same time
  • With the current infrastructure there is no way to route all requests from a user to the same MODOAIPMH instance
  • The requesting of the instances ids, which are updated holding updated 
  • OAI-PMH needs to find all instances whose holdings or items were updated or deleted within the date range, keeping in mind that there is no direct relation between instances and items
  • For each instance, a representation should be built that includes item and holdings fields enriched with locations and data from other tables
  • There is no way to do this for every batch query (with an offset parameter)

Initial approach (only one instance can be deployed)

The initial approach was to get everything in a single query and scroll over the result with a cursor (Vert.x database streaming and Vert.x HTTP streaming).

That one query consisted of:

  1. A query that creates a temporary table with the ids of all instances modified within the date range (an instance counts as modified if the instance itself, or one of its holdings or items, is modified or deleted)
  2. Iteration over this table to create the appropriate view, which consists of some instance, holdings and item fields, enriched with location names and other dictionary values.

This approach was taken because it is more performant than the approach from the next section: there are no intermediate reads from or writes to the database. In addition, MODOAIPMH itself had no database connection.
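
The two steps of the initial approach can be illustrated with a minimal sketch. It uses an in-memory SQLite database in place of the real storage, and every table and column name (instance, holding, item, location, updated) is an assumption made for illustration, not the actual schema. Deletions are not modelled here. Note how items reach their instance only through a holding, since there is no direct relation between instances and items.

```python
import sqlite3

# Hypothetical, simplified schema; the real inventory tables differ.
SCHEMA = """
CREATE TABLE location (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE instance (id TEXT PRIMARY KEY, title TEXT, updated TEXT);
CREATE TABLE holding  (id TEXT PRIMARY KEY, instance_id TEXT,
                       location_id TEXT, updated TEXT);
CREATE TABLE item     (id TEXT PRIMARY KEY, holding_id TEXT,
                       barcode TEXT, updated TEXT);
"""


def demo_db():
    """Build a tiny in-memory database with two instances."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO location VALUES ('loc1', 'Main Library')")
    conn.executemany("INSERT INTO instance VALUES (?, ?, ?)", [
        ("i1", "Old title", "2020-01-01"),
        ("i2", "New title", "2020-06-01"),
    ])
    conn.execute("INSERT INTO holding VALUES ('h1', 'i1', 'loc1', '2020-01-01')")
    conn.execute("INSERT INTO item VALUES ('it1', 'h1', '12345', '2020-06-15')")
    return conn


def modified_instance_ids(conn, start, end):
    """Step 1: ids of instances that changed themselves, or whose
    holdings or items changed, within [start, end] (ISO date strings)."""
    rows = conn.execute(
        """
        SELECT id FROM instance WHERE updated BETWEEN :s AND :e
        UNION
        SELECT instance_id FROM holding WHERE updated BETWEEN :s AND :e
        UNION
        SELECT h.instance_id
        FROM item it JOIN holding h ON h.id = it.holding_id
        WHERE it.updated BETWEEN :s AND :e
        ORDER BY 1
        """,
        {"s": start, "e": end},
    ).fetchall()
    return [r[0] for r in rows]


def instance_view(conn, instance_id):
    """Step 2: instance fields joined with holdings and item fields,
    enriched with the location name."""
    rows = conn.execute(
        """
        SELECT i.title, l.name, it.barcode
        FROM instance i
        LEFT JOIN holding h   ON h.instance_id = i.id
        LEFT JOIN location l  ON l.id = h.location_id
        LEFT JOIN item it     ON it.holding_id = h.id
        WHERE i.id = ?
        """,
        (instance_id,),
    ).fetchall()
    return [{"title": t, "location": n, "barcode": b} for t, n, b in rows]
```

In this sketch instance i1 is picked up only because one of its items changed, which is the case the problem statement describes.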

Multiple MODOAIPMH instances compatible approach

To implement the multiple MODOAIPMH instances compatible approach, the algorithm of the initial approach is divided into 2 phases (REST calls):

  1. For the first request only: execute step 1 of the initial approach separately and save the instance ids in a table in MODOAIPMH.
    1. The table contains: instance_id, request_id, json (with deleted and suppressed flags info), timestamp (of creation).
  2. For each request from EDS, OAI-PMH selects the first N instance ids by resumption token.
    1. Then, with one REST request, it gets the info for them, as in step 2 of the initial approach. When the response has been sent to the Harvester, they are deleted from the table.
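
The lifecycle of the ids table across the two phases might look like the sketch below. It is only an illustration: SQLite stands in for the MODOAIPMH database, the table name instance_ids and the function names are hypothetical, and the columns follow the list above (instance_id, request_id, json, creation timestamp).

```python
import json
import sqlite3
import time


def create_table(conn):
    # Columns follow the description above; names are illustrative.
    conn.execute(
        """
        CREATE TABLE instance_ids (
            instance_id TEXT,
            request_id  TEXT,
            "json"      TEXT,   -- deleted / suppressed flags
            created     REAL,   -- creation timestamp (epoch seconds)
            PRIMARY KEY (request_id, instance_id)
        )
        """
    )


def save_ids(conn, request_id, ids_with_flags, now=None):
    """Phase 1 (first request only): store all modified instance ids."""
    now = time.time() if now is None else now
    conn.executemany(
        "INSERT INTO instance_ids VALUES (?, ?, ?, ?)",
        [(iid, request_id, json.dumps(flags), now)
         for iid, flags in ids_with_flags],
    )


def next_batch(conn, request_id, n):
    """Phase 2: select the first N remaining ids for this request."""
    rows = conn.execute(
        "SELECT instance_id FROM instance_ids "
        "WHERE request_id = ? ORDER BY instance_id LIMIT ?",
        (request_id, n),
    ).fetchall()
    return [r[0] for r in rows]


def delete_batch(conn, request_id, ids):
    """Called once the response for this batch has been sent."""
    conn.executemany(
        "DELETE FROM instance_ids WHERE request_id = ? AND instance_id = ?",
        [(request_id, iid) for iid in ids],
    )
```

Because the remaining ids live in the database and are keyed by request_id, any MODOAIPMH instance can serve the next batch, which is what makes the approach compatible with multiple deployed instances.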

To avoid returning the same data twice (if the same resumption token is used twice), the "nextInstanceId" is stored in the token if the first request was successful, and it is verified on each request.
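
One way such a check could work is sketched below. The source does not specify the token format or the exact verification rule, so both are assumptions here: the token is a base64-encoded JSON object with hypothetical fields requestId and nextInstanceId, and a token is accepted only if its nextInstanceId equals the first id still pending for that request. A plain dictionary stands in for the ids table.

```python
import base64
import json


def encode_token(request_id, next_instance_id):
    """Pack the request id and the next expected instance id."""
    payload = json.dumps(
        {"requestId": request_id, "nextInstanceId": next_instance_id}
    )
    return base64.urlsafe_b64encode(payload.encode()).decode()


def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))


def serve(pending, token, n):
    """Return (batch, next_token) for a resumption token.

    pending maps request_id -> sorted list of ids not yet delivered
    (a stand-in for the ids table).
    """
    t = decode_token(token)
    remaining = pending.get(t["requestId"], [])
    # A replayed token points at an id that has already been deleted.
    if not remaining or remaining[0] != t["nextInstanceId"]:
        raise ValueError("resumption token already used or expired")
    batch, rest = remaining[:n], remaining[n:]
    pending[t["requestId"]] = rest
    next_token = encode_token(t["requestId"], rest[0]) if rest else None
    return batch, next_token
```

With this rule, replaying an old token fails because the id it names is no longer the first pending one, so the same batch is never served twice.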

If no subsequent request (with a resumption token) is received within 1 day of the last one, the corresponding ids are deleted from the table by a background job.
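
A minimal sketch of what the cleanup could do is shown below. How the time of the last request is tracked is not stated in the source, so the request_activity table (request_id, last_request) is a hypothetical addition, as are the table name instance_ids and the function names. Scheduling of the job is left out.

```python
import sqlite3
import time

ONE_DAY = 24 * 60 * 60  # seconds


def create_tables(conn):
    # instance_ids: pending ids per request (illustrative subset of columns).
    # request_activity: hypothetical table recording the last request time.
    conn.executescript(
        """
        CREATE TABLE instance_ids (instance_id TEXT, request_id TEXT);
        CREATE TABLE request_activity (request_id TEXT PRIMARY KEY,
                                       last_request REAL);
        """
    )


def purge_stale(conn, now=None, max_age=ONE_DAY):
    """Delete pending ids of requests idle for longer than max_age.

    Returns the number of deleted id rows.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age
    cur = conn.execute(
        "DELETE FROM instance_ids WHERE request_id IN "
        "(SELECT request_id FROM request_activity WHERE last_request < ?)",
        (cutoff,),
    )
    deleted = cur.rowcount
    # Forget the stale requests themselves as well.
    conn.execute(
        "DELETE FROM request_activity WHERE last_request < ?", (cutoff,)
    )
    return deleted
```

The job would call purge_stale periodically; passing now explicitly keeps the function easy to test.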