SPIKE - improve performance of get_updated_instance_ids_view
As per the results of investigation in the previous spike, one single view to retrieve data from SRS and Inventory, as well as audit tables is not efficient in terms of performance. Better way is to break the view into parts depending on what is required in each specific case. In particular, if records source is Inventory, it does not need to have a condition checking every time whether the record comes from SRS, or Inventory that slows down the whole query. Instead, it is possible to invoke a view that retrieves data only from Inventory.
All needed views can be stored in the oai-pmh schema and make queries through them to mod-inventory-storage and mod-source-record-storage schemas. Below is the list of such views.
1) View "get_instances_from_inventory" retrieves instances from Inventory:
create or replace view {tenant id}_mod_oai_pmh.get_instances_from_inventory as select * from {tenant id}_mod_inventory_storage.instance;
2) View "get_instances_from_srs" retrieves instances from SRS:
create or replace view {tenant id}_mod_oai_pmh.get_instances_from_srs as select * from {tenant id}_mod_source_record_storage.records_lb;
3) View "get_marc_records" retrieves marc records from SRS:
create or replace view {tenant id}_mod_oai_pmh.get_marc_records as select * from {tenant id}_mod_source_record_storage.marc_records_lb;
4) View "get_holdings" retrieves holding records:
create or replace view {tenant id}_mod_oai_pmh.get_holdings as select * from {tenant id}_mod_inventory_storage.holdings_record;
5) View "get_items" retrieves item records:
create or replace view {tenant id}_mod_oai_pmh.get_items as select * from {tenant id}_mod_inventory_storage.item;
6) View "get_deleted_instances" retrieves deleted instances:
create or replace view {tenant id}_mod_oai_pmh.get_deleted_instances as select * from {tenant id}_mod_inventory_storage.audit_instance;
7) View "get_deleted_holdings" retrieves deleted holding records:
create or replace view {tenant id}_mod_oai_pmh.get_deleted_holdings as select * from {tenant id}_mod_inventory_storage.audit_holdings_record;
8) View "get_deleted_items" retrieves deleted item records:
create or replace view {tenant id}_mod_oai_pmh.get_deleted_items as select * from {tenant id}_mod_inventory_storage.audit_item;
9) Final view "get_instances_with_marc_records" collects instances with marc records and returns all requied information for harvest:
create or replace view {tenant id}_mod_oai_pmh.get_instances_with_marc_records as SELECT instance_record.id instance_id, marc_record.content marc_record, instance_record.jsonb instance_record, instance_record.jsonb ->> 'source' source, {tenant id}_mod_inventory_storage.strToTimestamp(instance_record.jsonb -> 'metadata' ->> 'updatedDate') instance_updated_date, COALESCE(record_lb.suppress_discovery, false) suppress_from_discovery_srs, COALESCE((instance_record.jsonb ->> 'discoverySuppress')::bool, false) suppress_from_discovery_inventory, false deleted FROM {tenant id}_mod_oai_pmh.get_instances_from_inventory LEFT JOIN {tenant id}_mod_oai_pmh.get_instances_from_srs record_lb ON record_lb.external_id = instance_record.id LEFT JOIN {tenant id}_mod_oai_pmh.get_marc_records marc_record ON marc_record.id = record_lb.id;
The following is an example how to use the view above to retrieve the next portion of records when records source is "Source records storage", from="2023-01-01", until="2023-01-01", and discoverySuppressed=false:
select * from {tenant id}_mod_oai_pmh.get_instances_with_marc_records inst WHERE inst.instance_id > '{some_uuid}'::uuid AND inst.instance_updated_date BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') and inst.source = 'MARC' and coalesce(inst.suppress_from_discovery_srs, inst.suppress_from_discovery_inventory) = false order by instance_id limit 500;
All deleted records (with all previous parameters) from audit tables (excluding audit_instance) can be appended separately through the following query:
select * from {tenant id}_mod_oai_pmh.get_instances_with_marc_records x WHERE x.instance_id > '{some_uuid}'::uuid AND x.instance_updated_date BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') and x.source = 'MARC' and coalesce(x.suppress_from_discovery_srs, x.suppress_from_discovery_inventory) = false AND exists (select 1 from get_holdings holdings_record LEFT JOIN get_items item_record ON holdings_record.id = item_record.holdingsrecordid LEFT JOIN get_deleted_holdings audit_holdings_record ON (audit_holdings_record.jsonb #>> '{record,instanceId}')::uuid = instance_id LEFT JOIN get_deleted_items audit_item_record ON (audit_item_record.jsonb #>> '{record,holdingsRecordId}')::uuid = audit_holdings_record.id LEFT JOIN get_deleted_items audit_item_record_deleted ON (audit_item_record_deleted.jsonb #>> '{record,holdingsRecordId}')::uuid = holdings_record.id where instance_id = holdings_record.instanceid and ({tenant id}_mod_inventory_storage.strToTimestamp(holdings_record.jsonb -> 'metadata' ->> 'updatedDate') BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') OR {tenant id}_mod_inventory_storage.strToTimestamp(item_record.jsonb -> 'metadata' ->> 'updatedDate') BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') OR {tenant id}_mod_inventory_storage.strToTimestamp(audit_holdings_record.jsonb ->> 'createdDate') BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') OR {tenant id}_mod_inventory_storage.strToTimestamp(audit_item_record.jsonb ->> 'createdDate') BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') OR {tenant id}_mod_inventory_storage.strToTimestamp(audit_item_record_deleted.jsonb ->> 'createdDate') BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01') )) order by instance_id limit 500;
10) Final view "get_instances_with_marc_records_deleted" collects deleted instances along with marc records:
The view above is supposed to be invoked when all non-deleted instances have already been exhausted. Such case along with the transitional moment when a response contains a number of records less than max_records_per_response and the rest should be retrieved from deleted instances is shown in the following flowchart:
All the queries above can be collected into one single dynamic query using QueryBuilder in Java:
In the UML class diagram above, there are query templates (DELETED, DATE_UNTIL, DATE_FROM, DISCOVERY_SUPPRESS, SOURCE and LAST_INSTANCE_ID) that are used along with the final QUERY template depending on what is needed for the particlular case. For example, first query in the full harvest does not require LAST_INSTANCE_ID because there is no previous response yet. Also QueryBuilder contains one public method build(...) with the corresponding parameters and private helper methods to build sub-templates.
The following QueryBuilder can be used to build the final query and use it in the postgres client:
As an example, it is possible to build a simple query for the first request of the full harvest (SRS + Inventory):
QueryBuilder.build("fs09000000", null, null, null, null, true, false, 500);
And it produces the following result:
SELECT * FROM fs09000000_mod_oai_pmh.get_instances_with_marc_records inst ORDER BY instance_id LIMIT 500;
Query plan with expected execution time for the query above:
"QUERY PLAN" "Limit (cost=1.68..4126.70 rows=500 width=1855) (actual time=0.101..20.346 rows=500 loops=1)" " -> Merge Left Join (cost=1.68..81641211.99 rows=9895867 width=1855) (actual time=0.100..20.287 rows=500 loops=1)" " Merge Cond: (instance.id = records_lb.external_id)" " -> Index Scan using instance_pkey on instance (cost=0.56..20732654.49 rows=9895867 width=658) (actual time=0.018..0.472 rows=483 loops=1)" " -> Materialize (cost=1.12..61684767.10 rows=10284082 width=1171) (actual time=0.027..3.798 rows=487 loops=1)" " -> Nested Loop Left Join (cost=1.12..61659056.89 rows=10284082 width=1171) (actual time=0.025..3.479 rows=487 loops=1)" " -> Index Scan using idx_records_external_id on records_lb (cost=0.56..17183135.30 rows=10284082 width=33) (actual time=0.014..0.569 rows=487 loops=1)" " -> Index Scan using marc_records_lb_pkey on marc_records_lb (cost=0.56..4.32 rows=1 width=1170) (actual time=0.005..0.005 rows=1 loops=487)" " Index Cond: (id = records_lb.id)" "Planning Time: 0.784 ms" "Execution Time: 20.413 ms"
And actual execution time when running the query against folio-perf-folijet environment with more than 8 millions records:
After that, it is possible to check a second query with using last instance id from the previous one (00031ff5-4bd7-44fa-a7bb-da30266affae):
In this case, QueryBuilder can be used in the following way:
QueryBuilder.build("fs09000000", UUID.fromString("00031ff5-4bd7-44fa-a7bb-da30266affae"), null, null, null, true, false, 500)
And it produces the next output:
SELECT * FROM fs09000000_mod_oai_pmh.get_instances_with_marc_records inst WHERE inst.instance_id > '00031ff5-4bd7-44fa-a7bb-da30266affae'::uuid ORDER BY instance_id LIMIT 500;
And query plan for the request above is the following:
"QUERY PLAN" "Limit (cost=1.68..4143.30 rows=500 width=1855) (actual time=3.342..20.897 rows=500 loops=1)" " -> Merge Left Join (cost=1.68..81560067.65 rows=9846387 width=1855) (actual time=3.340..20.843 rows=500 loops=1)" " Merge Cond: (instance.id = records_lb.external_id)" " -> Index Scan using instance_pkey on instance (cost=0.56..20665233.15 rows=9846387 width=658) (actual time=0.017..0.445 rows=480 loops=1)" " Index Cond: (id > '00031ff5-4bd7-44fa-a7bb-da30266affae'::uuid)" " -> Materialize (cost=1.12..61684767.10 rows=10284082 width=1171) (actual time=0.029..6.470 rows=983 loops=1)" " -> Nested Loop Left Join (cost=1.12..61659056.89 rows=10284082 width=1171) (actual time=0.026..5.951 rows=983 loops=1)" " -> Index Scan using idx_records_external_id on records_lb (cost=0.56..17183135.30 rows=10284082 width=33) (actual time=0.015..0.962 rows=983 loops=1)" " -> Index Scan using marc_records_lb_pkey on marc_records_lb (cost=0.56..4.32 rows=1 width=1170) (actual time=0.004..0.004 rows=1 loops=983)" " Index Cond: (id = records_lb.id)" "Planning Time: 0.677 ms" "Execution Time: 20.961 ms"
And actual execution time when running the query against folio-perf-folijet environment with more than 8 millions records:
As you can see above, there is no significant differenct between the performance of the first (without last instance id) and second (with last instance id) queries and both of them are produced result during the 2-3 seconds for 500 max records per response. Such result can be considered as appropriate.