SPIKE - improve performance of get_updated_instance_ids_view

As per the results of investigation in the previous spike, one single view to retrieve data from SRS and Inventory, as well as audit tables is not efficient in terms of performance. Better way is to break the view into parts depending on what is required in each specific case. In particular, if records source is Inventory, it does not need to have a condition checking every time whether the record comes from SRS, or Inventory that slows down the whole query. Instead, it is possible to invoke a view that retrieves data only from Inventory. 

All needed views can be stored in the oai-pmh schema and make queries through them to mod-inventory-storage and mod-source-record-storage schemas. Below is the list of such views.

1) View "get_instances_from_inventory" retrieves instances from Inventory:

create or replace view {tenant id}_mod_oai_pmh.get_instances_from_inventory
as
select *
from {tenant id}_mod_inventory_storage.instance;

2) View "get_instances_from_srs" retrieves instances from SRS:

create or replace view {tenant id}_mod_oai_pmh.get_instances_from_srs
as
select *
from {tenant id}_mod_source_record_storage.records_lb;

3) View "get_marc_records" retrieves marc records from SRS:

create or replace view {tenant id}_mod_oai_pmh.get_marc_records
as
select *
from {tenant id}_mod_source_record_storage.marc_records_lb;

4) View "get_holdings" retrieves holding records:

create or replace view {tenant id}_mod_oai_pmh.get_holdings
as
select *
from {tenant id}_mod_inventory_storage.holdings_record;

5) View "get_items" retrieves item records:

create or replace view {tenant id}_mod_oai_pmh.get_items
as
select *
from {tenant id}_mod_inventory_storage.item;

6) View "get_deleted_instances" retrieves deleted instances:

create or replace view {tenant id}_mod_oai_pmh.get_deleted_instances
as
select *
from {tenant id}_mod_inventory_storage.audit_instance;

7) View "get_deleted_holdings" retrieves deleted holding records:

create or replace view {tenant id}_mod_oai_pmh.get_deleted_holdings
as
select *
from {tenant id}_mod_inventory_storage.audit_holdings_record;

8) View "get_deleted_items" retrieves deleted item records:

create or replace view {tenant id}_mod_oai_pmh.get_deleted_items
as
select *
from {tenant id}_mod_inventory_storage.audit_item;

9) Final view "get_instances_with_marc_records" collects instances with marc records and returns all requied information for harvest:

create or replace view {tenant id}_mod_oai_pmh.get_instances_with_marc_records
as
SELECT instance_record.id                                                                                      instance_id,
       marc_record.content                                                                                     marc_record,
       instance_record.jsonb                                                                                   instance_record,
       instance_record.jsonb ->> 'source'                                                                      source,
       {tenant id}_mod_inventory_storage.strToTimestamp(instance_record.jsonb -> 'metadata' ->> 'updatedDate') instance_updated_date,
       COALESCE(record_lb.suppress_discovery, false)                                                           suppress_from_discovery_srs,
       COALESCE((instance_record.jsonb ->> 'discoverySuppress')::bool, false)                                  suppress_from_discovery_inventory,
       false                                                                                                   deleted
FROM {tenant id}_mod_oai_pmh.get_instances_from_inventory
       LEFT JOIN {tenant id}_mod_oai_pmh.get_instances_from_srs record_lb
                ON record_lb.external_id = instance_record.id
       LEFT JOIN {tenant id}_mod_oai_pmh.get_marc_records marc_record ON marc_record.id = record_lb.id;

The following is an example how to use the view above to retrieve the next portion of records when records source is "Source records storage", from="2023-01-01", until="2023-01-01", and discoverySuppressed=false:

select *
from {tenant id}_mod_oai_pmh.get_instances_with_marc_records inst
WHERE inst.instance_id > '{some_uuid}'::uuid
   AND inst.instance_updated_date BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
   and inst.source = 'MARC'
   and coalesce(inst.suppress_from_discovery_srs, inst.suppress_from_discovery_inventory) = false
order by instance_id
limit 500;

All deleted records (with all previous parameters) from audit tables (excluding audit_instance) can be appended separately through the following query:

select *
from {tenant id}_mod_oai_pmh.get_instances_with_marc_records x
WHERE x.instance_id > '{some_uuid}'::uuid
AND x.instance_updated_date BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01') AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
   and x.source = 'MARC'
   and coalesce(x.suppress_from_discovery_srs, x.suppress_from_discovery_inventory) = false
  AND exists (select 1
              from get_holdings holdings_record
                       LEFT JOIN get_items item_record
                                 ON holdings_record.id = item_record.holdingsrecordid
                       LEFT JOIN get_deleted_holdings audit_holdings_record
                                 ON (audit_holdings_record.jsonb #>> '{record,instanceId}')::uuid = instance_id
                       LEFT JOIN get_deleted_items audit_item_record
                                 ON (audit_item_record.jsonb #>> '{record,holdingsRecordId}')::uuid =
                                    audit_holdings_record.id
                       LEFT JOIN get_deleted_items audit_item_record_deleted
                                 ON (audit_item_record_deleted.jsonb #>> '{record,holdingsRecordId}')::uuid =
                                    holdings_record.id
              where instance_id = holdings_record.instanceid
                and ({tenant id}_mod_inventory_storage.strToTimestamp(holdings_record.jsonb -> 'metadata' ->> 'updatedDate')
                         BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01')
                         AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
                  OR {tenant id}_mod_inventory_storage.strToTimestamp(item_record.jsonb -> 'metadata' ->> 'updatedDate')
                         BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01')
                         AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
                  OR {tenant id}_mod_inventory_storage.strToTimestamp(audit_holdings_record.jsonb ->> 'createdDate')
                         BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01')
                         AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
                  OR {tenant id}_mod_inventory_storage.strToTimestamp(audit_item_record.jsonb ->> 'createdDate')
                         BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01')
                         AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
                  OR {tenant id}_mod_inventory_storage.strToTimestamp(audit_item_record_deleted.jsonb ->> 'createdDate')
                         BETWEEN {tenant id}_mod_inventory_storage.dateOrMin(timestamptz '2023-01-01')
                         AND {tenant id}_mod_inventory_storage.dateOrMax(timestamptz '2023-01-01')
                  ))
order by instance_id
limit 500;

10) Final view "get_instances_with_marc_records_deleted" collects deleted instances along with marc records:

create or replace view {tenant id}_mod_oai_pmh.get_instances_with_marc_records_deleted
as
SELECT instance_record.id                                                                                      instance_id,
       marc_record.content                                                                                     marc_record,
       instance_record.jsonb                                                                                   instance_record,
       instance_record.jsonb ->> 'source'                                                                      source,
      {tenant id}_mod_inventory_storage.strToTimestamp(instance_record.jsonb -> 'metadata' ->> 'updatedDate')  instance_updated_date,
       COALESCE(record_lb.suppress_discovery, false)                                                           suppress_from_discovery_srs,
       COALESCE((instance_record.jsonb ->> 'discoverySuppress')::bool, false)                                  suppress_from_discovery_inventory,
       true                                                                                                    deleted
FROM {tenant id}_mod_oai_pmh.get_deleted_instances
       LEFT JOIN {tenant id}_mod_oai_pmh.get_instances_from_srs record_lb
                ON record_lb.external_id = instance_record.id
       LEFT JOIN {tenant id}_mod_oai_pmh.get_marc_records marc_record ON marc_record.id = record_lb.id;

The view above is supposed to be invoked when all non-deleted instances have already been exhausted. Such case along with the transitional moment when a response contains a number of records less than max_records_per_response and the rest should be retrieved from deleted instances is shown in the following flowchart:

All the queries above can be collected into one single dynamic query using QueryBuilder in Java:

In the UML class diagram above, there are query templates (DELETED, DATE_UNTIL, DATE_FROM, DISCOVERY_SUPPRESS, SOURCE and LAST_INSTANCE_ID) that are used along with the final QUERY template depending on what is needed for the particlular case. For example, first query in the full harvest does not require LAST_INSTANCE_ID because there is no previous response yet. Also QueryBuilder contains one public method build(...) with the corresponding parameters and private helper methods to build sub-templates.

The following QueryBuilder can be used to build the final query and use it in the postgres client:

QueryBuilder
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

import static java.lang.String.format;
import static java.util.Objects.isNull;
import static java.util.Objects.nonNull;
import static org.apache.commons.lang3.StringUtils.EMPTY;
import static org.folio.oaipmh.Constants.ISO_DATE_TIME_PATTERN;

public class QueryBuilder {

  public static final DateFormat DATE_FORMAT = new SimpleDateFormat(ISO_DATE_TIME_PATTERN);

  private static final String QUERY = "SELECT * FROM %s_mod_oai_pmh.%s inst\n" +
    "%s" + // last instance id
    "%s" + // date from
    "%s" + // date until
    "%s" + // source
    "%s" + // discovery suppress
    "%s" + // deleted
    "ORDER BY instance_id\n" +
    "LIMIT %d;";

  private static final String DELETED = "   %s EXISTS (SELECT 1\n" +
    "              FROM %s_mod_oai_pmh.get_holdings holdings_record\n" +
    "                       LEFT JOIN %s_mod_oai_pmh.get_items item_record\n" +
    "                                 ON holdings_record.id = item_record.holdingsrecordid\n" +
    "                       LEFT JOIN %s_mod_oai_pmh.get_deleted_holdings audit_holdings_record\n" +
    "                                 ON (audit_holdings_record.jsonb #>> '{record,instanceId}')::uuid = instance_id\n" +
    "                       LEFT JOIN %s_mod_oai_pmh.get_deleted_items audit_item_record\n" +
    "                                 ON (audit_item_record.jsonb #>> '{record,holdingsRecordId}')::uuid =\n" +
    "                                    audit_holdings_record.id\n" +
    "                       LEFT JOIN %s_mod_oai_pmh.get_deleted_items audit_item_record_deleted\n" +
    "                                 ON (audit_item_record_deleted.jsonb #>> '{record,holdingsRecordId}')::uuid =\n" +
    "                                    holdings_record.id\n" +
    "              WHERE instance_id = holdings_record.instanceid\n" +
    "                AND (%s_mod_inventory_storage.strToTimestamp(holdings_record.jsonb -> 'metadata' ->> 'updatedDate')\n" +
    "                         BETWEEN %s_mod_inventory_storage.dateOrMin(%s)\n" +
    "                         AND %s_mod_inventory_storage.dateOrMax(%s)\n" +
    "                  OR %s_mod_inventory_storage.strToTimestamp(item_record.jsonb -> 'metadata' ->> 'updatedDate')\n" +
    "                         BETWEEN %s_mod_inventory_storage.dateOrMin(%s)\n" +
    "                         AND %s_mod_inventory_storage.dateOrMax(%s)\n" +
    "                  OR %s_mod_inventory_storage.strToTimestamp(audit_holdings_record.jsonb ->> 'createdDate')\n" +
    "                         BETWEEN %s_mod_inventory_storage.dateOrMin(%s)\n" +
    "                         AND %s_mod_inventory_storage.dateOrMax(%s)\n" +
    "                  OR %s_mod_inventory_storage.strToTimestamp(audit_item_record.jsonb ->> 'createdDate')\n" +
    "                         BETWEEN %s_mod_inventory_storage.dateOrMin(%s)\n" +
    "                         AND %s_mod_inventory_storage.dateOrMax(%s)\n" +
    "                  OR %s_mod_inventory_storage.strToTimestamp(audit_item_record_deleted.jsonb ->> 'createdDate')\n" +
    "                         BETWEEN %s_mod_inventory_storage.dateOrMin(%s)\n" +
    "                         AND %s_mod_inventory_storage.dateOrMax(%s)\n" +
    "                  ))\n";
  private static final String BASE_QUERY_NON_DELETED_TEMPLATE = "get_instances_with_marc_records";
  private static final String BASE_QUERY_DELETED_TEMPLATE = "get_instances_with_marc_records_deleted";
  private static final String DATE_UNTIL = "   %s inst.instance_updated_date <= %s_mod_inventory_storage.dateOrMax(timestamptz '%s')\n";
  private static final String DATE_FROM = "   %s inst.instance_updated_date >= %s_mod_inventory_storage.dateOrMin(timestamptz '%s')\n";
  private static final String DISCOVERY_SUPPRESS = "   %s coalesce(inst.suppress_from_discovery_srs, inst.suppress_from_discovery_inventory) = false\n";
  private static final String SOURCE = "   %s inst.source = '%s'\n";
  private static final String LAST_INSTANCE_ID = "%s inst.instance_id > '%s'::uuid\n";

  public static String build(String tenant, UUID lastInstanceId, Date from, Date until, RecordsSource source,
                             boolean discoverySuppress, boolean deletedRecords, int limit) {
    return format(QUERY, tenant,
      !deletedRecords ? BASE_QUERY_NON_DELETED_TEMPLATE : BASE_QUERY_DELETED_TEMPLATE,
      buildLastInstanceId(lastInstanceId),
      buildDateFrom(tenant, from, isNull(lastInstanceId)),
      buildDateUntil(tenant, until, isNull(lastInstanceId) && isNull(from)),
      buildSource(tenant, source, isNull(lastInstanceId) && isNull(from) && isNull(until)),
      buildSuppressFromDiscovery(discoverySuppress, isNull(lastInstanceId) && isNull(from) && isNull(until) && isNull(source)),
      buildDeleted(tenant, from, until, isNull(lastInstanceId) && isNull(from) && isNull(until) && isNull(source) && !discoverySuppress),
      limit);
  }

  private static String buildLastInstanceId(UUID lastInstanceId) {
    return nonNull(lastInstanceId) ? format(LAST_INSTANCE_ID, " WHERE", lastInstanceId) : EMPTY;
  }

  private static String buildDateFrom(String tenant, Date from, boolean where) {
    return nonNull(from) ? format(DATE_FROM, where ? " WHERE" : " AND", tenant, DATE_FORMAT.format(from)) : EMPTY;
  }

  private static String buildDateUntil(String tenant, Date until, boolean where) {
    return nonNull(until) ? format(DATE_UNTIL, where ? " WHERE" : " AND", tenant, DATE_FORMAT.format(until)) : EMPTY;
  }

  private static String buildSource(String tenant, RecordsSource source, boolean where) {
    return nonNull(source) ? format(SOURCE, where ? " WHERE" : " AND", tenant, source) : EMPTY;
  }

  private static String buildSuppressFromDiscovery(boolean discoverySuppress, boolean where) {
    return !discoverySuppress ? format(DISCOVERY_SUPPRESS, where ? " WHERE" : " AND") : EMPTY;
  }

  private static String buildDeleted(String tenant, Date from, Date until, boolean where) {
    return nonNull(from) || nonNull(until) ? format(DELETED, where ? " WHERE" : " AND",
      tenant, tenant, tenant, tenant, tenant,
      tenant, tenant, buildDate(from), tenant, buildDate(until),
      tenant, tenant, buildDate(from), tenant, buildDate(until),
      tenant, tenant, buildDate(from), tenant, buildDate(until),
      tenant, tenant, buildDate(from), tenant, buildDate(until),
      tenant, tenant, buildDate(from), tenant, buildDate(until)) : EMPTY;
  }

  private static String buildDate(Date date) {
    return isNull(date) ? null : format("timestamptz '%s'", DATE_FORMAT.format(date));
  }
}

As an example, it is possible to build a simple query for the first request of the full harvest (SRS + Inventory):

QueryBuilder.build("fs09000000", null, null, null, null, true, false, 500);

And it produces the following result:

SELECT * FROM fs09000000_mod_oai_pmh.get_instances_with_marc_records inst
ORDER BY instance_id
LIMIT 500;

Query plan with expected execution time for the query above:

"QUERY PLAN"
"Limit  (cost=1.68..4126.70 rows=500 width=1855) (actual time=0.101..20.346 rows=500 loops=1)"
"  ->  Merge Left Join  (cost=1.68..81641211.99 rows=9895867 width=1855) (actual time=0.100..20.287 rows=500 loops=1)"
"        Merge Cond: (instance.id = records_lb.external_id)"
"        ->  Index Scan using instance_pkey on instance  (cost=0.56..20732654.49 rows=9895867 width=658) (actual time=0.018..0.472 rows=483 loops=1)"
"        ->  Materialize  (cost=1.12..61684767.10 rows=10284082 width=1171) (actual time=0.027..3.798 rows=487 loops=1)"
"              ->  Nested Loop Left Join  (cost=1.12..61659056.89 rows=10284082 width=1171) (actual time=0.025..3.479 rows=487 loops=1)"
"                    ->  Index Scan using idx_records_external_id on records_lb  (cost=0.56..17183135.30 rows=10284082 width=33) (actual time=0.014..0.569 rows=487 loops=1)"
"                    ->  Index Scan using marc_records_lb_pkey on marc_records_lb  (cost=0.56..4.32 rows=1 width=1170) (actual time=0.005..0.005 rows=1 loops=487)"
"                          Index Cond: (id = records_lb.id)"
"Planning Time: 0.784 ms"
"Execution Time: 20.413 ms"

And actual execution time when running the query against folio-perf-folijet environment with more than 8 millions records:

After that, it is possible to check a second query with using last instance id from the previous one (00031ff5-4bd7-44fa-a7bb-da30266affae):

In this case, QueryBuilder can be used in the following way:

QueryBuilder.build("fs09000000", UUID.fromString("00031ff5-4bd7-44fa-a7bb-da30266affae"), null, null, null, true, false, 500)

And it produces the next output:

SELECT * FROM fs09000000_mod_oai_pmh.get_instances_with_marc_records inst
 WHERE inst.instance_id > '00031ff5-4bd7-44fa-a7bb-da30266affae'::uuid
ORDER BY instance_id
LIMIT 500;

And query plan for the request above is the following:

"QUERY PLAN"
"Limit  (cost=1.68..4143.30 rows=500 width=1855) (actual time=3.342..20.897 rows=500 loops=1)"
"  ->  Merge Left Join  (cost=1.68..81560067.65 rows=9846387 width=1855) (actual time=3.340..20.843 rows=500 loops=1)"
"        Merge Cond: (instance.id = records_lb.external_id)"
"        ->  Index Scan using instance_pkey on instance  (cost=0.56..20665233.15 rows=9846387 width=658) (actual time=0.017..0.445 rows=480 loops=1)"
"              Index Cond: (id > '00031ff5-4bd7-44fa-a7bb-da30266affae'::uuid)"
"        ->  Materialize  (cost=1.12..61684767.10 rows=10284082 width=1171) (actual time=0.029..6.470 rows=983 loops=1)"
"              ->  Nested Loop Left Join  (cost=1.12..61659056.89 rows=10284082 width=1171) (actual time=0.026..5.951 rows=983 loops=1)"
"                    ->  Index Scan using idx_records_external_id on records_lb  (cost=0.56..17183135.30 rows=10284082 width=33) (actual time=0.015..0.962 rows=983 loops=1)"
"                    ->  Index Scan using marc_records_lb_pkey on marc_records_lb  (cost=0.56..4.32 rows=1 width=1170) (actual time=0.004..0.004 rows=1 loops=983)"
"                          Index Cond: (id = records_lb.id)"
"Planning Time: 0.677 ms"
"Execution Time: 20.961 ms"

And actual execution time when running the query against folio-perf-folijet environment with more than 8 millions records:

As you can see above, there is no significant differenct between the performance of the first (without last instance id) and second (with last instance id) queries and both of them are produced result during the 2-3 seconds for 500 max records per response. Such result can be considered as appropriate.