/
SPIKE- investigate options for improved OAI-PMH harvesting

SPIKE- investigate options for improved OAI-PMH harvesting

One of the possible improvements for OAI-PMH harvesting is to get rid of downloading instance ids following by retrieving instances by them. More promising approach is to use a direct view to retrieve instances from Inventory/SRS. In this case, views should return records ordered by instance id to allow multiple requests in a sequence for full harvest starting from the id following by the last returned one in every next request.

The next two parameters can be added to the function to track last instance id and number of records to be returned:

CREATE OR REPLACE FUNCTION get_updated_instance_ids_view(
startdate timestamp with time zone,
...
lastinstanceid uuid    DEFAULT NULL::uuid,
numrecords     INTEGER DEFAULT 50)

Also a returned table may contain a jsonb of instance in the first column instead of id:

RETURNS TABLE(
        instance jsonb,
        ...

In the last two SELECT queries separated by UNION ALL it is possible to introduce an additional condition to check whether lastinstanceid ($7) parameter is not null and if so, select only specific number of instances ($8) with id after $7. In this case, all instances should be ordered by id:

SELECT jsonb                                         AS instance,
       ...
FROM instanceIdsInRange, instance
WHERE instanceIdsInRange.maxDate BETWEEN dateOrMin($1) AND dateOrMax($2)
      ...
      AND ($7 IS NULL OR instance.id > $7)
GROUP BY 1, 2, 4, instance.id

UNION ALL
SELECT (jsonb #>> '{record}')::jsonb                 AS instance,
        ...
FROM audit_instance
WHERE $3
      ...
      AND ($7 IS NULL OR (jsonb #>> '{record,id}')::uuid > $7)

ORDER BY instance.id, (jsonb #>> '{record,id}')::uuid
LIMIT $8

Since the instances are retrieved from two different tables (second table contains deleted ones), LIMIT value is used only once after two selections.

The next function can be helpful for full harvest since it does not require dates:

CREATE OR REPLACE FUNCTION get_instance_view(
	deletedrecordssupport boolean DEFAULT TRUE,
	skipsuppressedfromdiscoveryrecords boolean DEFAULT TRUE,
    source varchar DEFAULT NULL,
	lastinstanceid UUID DEFAULT NULL::UUID,
    numrecords INTEGER DEFAULT 50) 
    
    RETURNS TABLE(
        instance jsonb, 
        source CHARACTER varying, 
        "updatedDate" TIMESTAMP WITH TIME ZONE, 
        "suppressFromDiscovery" boolean, 
        deleted boolean) 
    LANGUAGE 'sql' COST 100 VOLATILE PARALLEL UNSAFE ROWS 1000 
    AS $BODY$ 
SELECT instance.jsonb                                                     AS instance,
        instance.jsonb ->> 'source'                                        AS source,
        strToTimestamp(instance.jsonb -> 'metadata' ->> 'createdDate')     AS maxDate,
        (instance.jsonb ->> 'discoverySuppress')::bool                     AS suppressFromDiscovery,
        false                                                              AS deleted
FROM instance
WHERE NOT ($2 AND COALESCE((instance.jsonb ->> 'discoverySuppress')::bool, false))
      AND ($4 IS NULL OR instance.id > $4)
      AND ($3 IS NULL OR (instance.jsonb ->> 'source') = $3)
GROUP BY instance.id

UNION ALL
SELECT (jsonb #>> '{record}')::jsonb               AS instance,
        jsonb #>> '{record,source}'                 AS source,
        strToTimestamp(jsonb ->> 'createdDate')     AS maxDate,
        false                                       AS suppressFromDiscovery,
        true                                        AS deleted
FROM audit_instance
WHERE $1
 AND ($4 IS NULL OR (jsonb #>> '{record,id}')::uuid > $4)
 AND ($3 IS NULL OR (jsonb #>> '{record,source}') = $3)  
ORDER BY instance.id, (jsonb #>> '{record,id}')::uuid 
LIMIT $8 
$BODY$;

The function above returns only instances in json format even if the source is MARC. In other words, there is no MARC in the response. Probably, more appropriate solution here is to remove source as parameter and return only records with source FOLIO:

CREATE OR REPLACE FUNCTION get_instances_with_folio_view(
	deletedrecordssupport boolean DEFAULT TRUE,
	skipsuppressedfromdiscoveryrecords boolean DEFAULT TRUE,
    lastinstanceid uuid DEFAULT NULL::UUID,
    numrecords integer DEFAULT 50) 
    
    RETURNS TABLE(
        instance jsonb, 
        source CHARACTER varying, 
        "updatedDate" TIMESTAMP WITH TIME ZONE, 
        "suppressFromDiscovery" boolean, 
        deleted boolean) 
    LANGUAGE 'sql' COST 100 VOLATILE PARALLEL UNSAFE ROWS 1000 
    AS $BODY$ 
SELECT instance.jsonb                                                     AS instance,
        'FOLIO'                                                            AS source,
        strToTimestamp(instance.jsonb -> 'metadata' ->> 'createdDate')     AS maxDate,
        (instance.jsonb ->> 'discoverySuppress')::bool                     AS suppressFromDiscovery,
        false                                                              AS deleted
FROM instance
WHERE NOT ($2 AND COALESCE((instance.jsonb ->> 'discoverySuppress')::bool, false))
      AND ($3 IS NULL OR instance.id > $3)
      AND ((instance.jsonb ->> 'source') = 'FOLIO')

UNION ALL
SELECT (jsonb #>> '{record}')::jsonb               AS instance,
        'FOLIO'                                     AS source,
        strToTimestamp(jsonb ->> 'createdDate')     AS maxDate,
        false                                       AS suppressFromDiscovery,
        true                                        AS deleted
FROM audit_instance
WHERE $1
 AND ($3 IS NULL OR (jsonb #>> '{record,id}')::uuid > $3)
 AND ((jsonb #>> '{record,source}') = 'FOLIO')  
ORDER BY instance.id, (jsonb #>> '{record,id}')::uuid 
LIMIT $4
$BODY$;

To harvest only MARC records, the following function can be applied:

CREATE OR REPLACE FUNCTION get_instances_with_marc_view(
	deletedrecordssupport boolean DEFAULT TRUE,
	skipsuppressedfromdiscoveryrecords boolean DEFAULT TRUE,
	lastinstanceid uuid DEFAULT NULL::UUID,
    numrecords INTEGER DEFAULT 50) 
    
    RETURNS TABLE(
        instanceId uuid,
        marc jsonb, 
        source CHARACTER varying, 
        "updatedDate" TIMESTAMP WITH TIME ZONE, 
        "suppressFromDiscovery" boolean, 
        deleted boolean) 
    LANGUAGE 'sql' COST 100 VOLATILE PARALLEL UNSAFE ROWS 1000 
    AS $BODY$ 
SELECT lb.external_id                                 AS instanceId,
        marc.content                                   AS marc,
        'MARC'                                         AS source,
        lb.created_date                                AS maxDate,
        lb.suppress_discovery::bool                    AS suppressFromDiscovery,
        false                                          AS deleted
FROM marc_records_lb marc
 JOIN records_lb lb
 ON lb.id = marc.id
WHERE NOT ($2 AND COALESCE(lb.suppress_discovery::bool, false))
      AND ($3 IS NULL OR lb.external_id > $3)

UNION ALL
SELECT (jsonb #>> '{record,id}')::uuid             AS instanceId,
        (jsonb #>> '{record}')::jsonb               AS instance,
        'MARC'                                      AS source,
        strToTimestamp(jsonb ->> 'createdDate')     AS maxDate,
        false                                       AS suppressFromDiscovery,
        true                                        AS deleted
FROM audit_instance
WHERE $1
 AND ($3 IS NULL OR (jsonb #>> '{record,id}')::uuid > $3)
 AND (jsonb #>> '{record,source}') = 'MARC'  
ORDER BY instance.id, (jsonb #>> '{record,id}')::uuid 
LIMIT $4
$BODY$;

However, in case if there are deleted records from audit_instance, marc will contain only json for instance, not marc json record and it should be generated on the fly.

The following function combines MARC and FOLIO sources and returns all records:

CREATE OR REPLACE FUNCTION get_instances_with_folio_and_marc_view(
	deletedrecordssupport boolean DEFAULT TRUE,
	skipsuppressedfromdiscoveryrecords boolean DEFAULT TRUE,
    lastinstanceid uuid DEFAULT NULL::uuid,
    numrecords integer DEFAULT 50) 
    
    RETURNS TABLE(
        instance jsonb, 
        source CHARACTER varying, 
        "updatedDate" TIMESTAMP WITH TIME ZONE, 
        "suppressFromDiscovery" boolean, 
        deleted boolean) 
    LANGUAGE 'sql' COST 100 VOLATILE PARALLEL UNSAFE ROWS 1000 
    AS $BODY$ 
SELECT instance.jsonb              AS instance,
        'FOLIO'                     AS source,
        strToTimestamp(instance.jsonb -> 'metadata' ->> 'createdDate')     AS maxDate,
        (instance.jsonb ->> 'discoverySuppress')::bool AS suppressFromDiscovery,
        false                       AS deleted
FROM instance
WHERE NOT ($2 AND COALESCE((instance.jsonb ->> 'discoverySuppress')::bool, false))
      AND ($3 IS NULL OR instance.id > $3)
      AND ((instance.jsonb ->> 'source') = 'FOLIO')

UNION ALL

SELECT marc.content                                   AS marc,
        'MARC'                                         AS source,
        lb.created_date                                AS maxDate,
        lb.suppress_discovery::bool                    AS suppressFromDiscovery,
        false                                          AS deleted
FROM marc_records_lb marc
 JOIN records_lb lb
 ON lb.id = marc.id
WHERE NOT ($2 AND COALESCE(lb.suppress_discovery::bool, false))
      AND ($3 IS NULL OR lb.external_id > $3)
ORDER BY lb.external_id


UNION ALL
SELECT (jsonb #>> '{record}')::jsonb               AS instance,
        jsonb #>> '{record,source}'                 AS source,
        strToTimestamp(jsonb ->> 'createdDate')     AS maxDate,
        false                                       AS suppressFromDiscovery,
        true                                        AS deleted
FROM audit_instance
WHERE $1
 AND ($3 IS NULL OR (jsonb #>> '{record,id}')::uuid > $3)
 AND ((jsonb #>> '{record,source}') = 'FOLIO')  
ORDER BY instance.id, lb.external_id, (jsonb #>> '{record,id}')::uuid 
LIMIT $4
$BODY$; 

All the approaches above require ordering all returned records by id for multiple requests in a row in the full harvest, but combining the tables by UNION ALL prevents us from utilizing this ordering effectively.

More efficient solution is to unify a view to return MARC data (in case of SRS or SRS + Inventory records source) instead of jsonb of instance. This approach allows us not to request SRS for MARC records.

To achieve that, it needs to join also SRS tables with MARC content:

Based on the ERD above, we can omit UNION ALL and use only LEFT JOIN to select all instances from INSTANCE table and join deleted instances from audit table using FULL JOIN ON false:

CREATE OR REPLACE FUNCTION get_instances_view(
	startdate                          timestamp with time zone,
	enddate                            timestamp with time zone,
	deletedrecordsupport               boolean DEFAULT true,
	skipsuppressedfromdiscoveryrecords boolean DEFAULT true,
	onlyinstanceupdatedate             boolean DEFAULT true,
	"source"                           character varying DEFAULT NULL::character varying,
    nextinstanceid                     uuid DEFAULT NULL, -- specifies id of instance following by the result
                                                          -- is returned (last instance id from the previous
                                                          -- result, or null if the first request)
    "limit"                            integer DEFAULT 100)
    RETURNS TABLE("instanceId"            uuid, 
                  "instanceRecord"        jsonb, -- can be either jsonb of instance, or marc content depending on the source
                  "source"                character varying, 
                  "updatedDate"           timestamp with time zone, 
                  "suppressFromDiscovery" boolean, 
                  "deleted"               boolean) 
    LANGUAGE 'sql'
    VOLATILE PARALLEL UNSAFE
    ROWS 1000

AS 
$$    
SELECT 
 DISTINCT CASE WHEN instance_record.id IS NOT NULL -- if instance deleted, then from audit, otherwise from instances
                     THEN instance_record.id ELSE (audit_instance_record.jsonb #>> '{record,id}')::uuid END,
 
          CASE WHEN marc_record.content IS NOT NULL -- return jsonb of marc content if present
                     THEN marc_record.content ELSE (
               CASE WHEN instance_record.jsonb IS NOT NULL -- return jsonb of instance if present, otherwise from audit
                     THEN instance_record.jsonb ELSE (audit_instance_record.jsonb -> 'record')::jsonb END) END, 
               
          CASE WHEN instance_record.jsonb IS NOT NULL -- return source from instances if not deleted
                     THEN instance_record.jsonb ->> 'source' 
                     ELSE audit_instance_record.jsonb #>> '{record,source}' END, -- from audit if deleted
                     
          CASE WHEN instance_record.jsonb IS NOT NULL -- if instance deleted, use audit table to find created date
                                                      -- (created date in audit table means last updated date for instance
                                                      -- that was deleted)
                     THEN strToTimestamp(instance_record.jsonb -> 'metadata' ->> 'updatedDate')
                     ELSE strToTimestamp(audit_instance_record.jsonb ->> 'createdDate') END,
                     
          CASE WHEN marc_record.content IS NOT NULL -- first, check SRS for discovery suppress
                     THEN record_lb.suppress_discovery -- if not in SRS, try to find in inventory and if not either, that means 
                                                       -- instance is not suppressed from discovery and false
                     ELSE COALESCE((instance_record.jsonb ->> 'discoverySuppress')::bool, false) END,
          audit_instance_record.id IS NOT NULL -- true if current id is present in the audit table (means it was deleted)
          
FROM      instance instance_record
LEFT JOIN holdings_record holdings_record
       ON instance_record.id = holdings_record.instanceid
LEFT JOIN item item_record
       ON holdings_record.id = item_record.holdingsrecordid

-- add marc records
LEFT JOIN records_lb record_lb
       ON record_lb.external_id = instance_record.id
LEFT JOIN marc_records_lb marc_record
       ON marc_record.id = record_lb.id
       
-- add audit tables
FULL JOIN audit_instance audit_instance_record
       ON false -- just add all records from audit_instance table cause instance can be either in
                -- audit table, or instance table, but not in both of them       
LEFT JOIN audit_holdings_record audit_holdings_record
       -- case when instance not deleted, but holding deleted
       ON (audit_holdings_record.jsonb #>> '{record,instanceId}')::uuid = instance_record.id
LEFT JOIN audit_holdings_record audit_holdings_record_deleted       
       -- case when instance deleted and holding deleted
       ON (audit_holdings_record_deleted.jsonb #>> '{record,instanceId}')::uuid = audit_instance_record.id
LEFT JOIN audit_item audit_item_record
       -- case when holding deleted and item deleted
       ON (audit_item_record.jsonb #>> '{record,holdingsRecordId}')::uuid = audit_holdings_record.id
LEFT JOIN audit_item audit_item_record_deleted       
       -- case when only item deleted (holding is still in inventory)
       ON (audit_item_record_deleted.jsonb #>> '{record,holdingsRecordId}')::uuid = holdings_record.id

WHERE     (strToTimestamp(instance_record.jsonb -> 'metadata' ->> 'updatedDate') 
           -- apply date range to instances in inventory
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)

      OR ( 
          -- case when there are deleted instances and needs to check deletedrecordsupport and date range
          $3 AND strToTimestamp(audit_instance_record.jsonb ->> 'createdDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)
          
          -- add related items and holdings
       OR NOT $5 -- check only instance updated date and if false, proceed with items and holdings
      AND (strToTimestamp(holdings_record.jsonb -> 'metadata' ->> 'updatedDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)
       OR strToTimestamp(item_record.jsonb -> 'metadata' ->> 'updatedDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)        
         -- add related deleted records
       OR strToTimestamp(audit_holdings_record.jsonb ->> 'createdDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)
       OR strToTimestamp(audit_holdings_record_deleted.jsonb ->> 'createdDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)
       OR strToTimestamp(audit_item_record.jsonb ->> 'createdDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2)
       OR strToTimestamp(audit_item_record_deleted.jsonb ->> 'createdDate') 
  BETWEEN dateOrMin($1) 
      AND dateOrMax($2))
          )) AND ($7 IS NULL OR (CASE WHEN instance_record.id IS NOT NULL -- mandatory condition to check next instance id (can be null)
                     THEN instance_record.id ELSE (audit_instance_record.jsonb #>> '{record,id}')::uuid END) > $7)
             AND ($6 IS NULL OR (CASE WHEN instance_record.jsonb IS NOT NULL -- mandatory condition to check source (can be null)
                     THEN instance_record.jsonb ->> 'source' 
                     ELSE audit_instance_record.jsonb #>> '{record,source}' END) = $6)
             AND NOT ($4 AND CASE WHEN marc_record.content IS NOT NULL -- mandatory condition to check discovery suppress
                     THEN record_lb.suppress_discovery
                     ELSE COALESCE((instance_record.jsonb ->> 'discoverySuppress')::bool, false) END) -- can be null in inventory
      
ORDER BY (CASE WHEN instance_record.id IS NOT NULL THEN instance_record.id ELSE (audit_instance_record.jsonb #>> '{record,id}')::uuid END)
LIMIT     $8
$$     

As per the view above, if instance with source MARC is deleted via API from inventory, MARC content in the marc_records_lb table is still present, so it is possible to retrieve it using record id from audit_instance, but this case is not supported yet and can be considered as extension of current view.

Query plan for full harvest without date range and limit 500 with 500k instances:

"QUERY PLAN"
"Limit  (cost=24187557.85..24187566.60 rows=500 width=90) (actual time=77055.632..77057.034 rows=500 loops=1)"
"  ->  Unique  (cost=24187557.85..24270317.46 rows=4729121 width=90) (actual time=77055.630..77056.939 rows=500 loops=1)"
"        ->  Sort  (cost=24187557.85..24199380.65 rows=4729121 width=90) (actual time=77055.628..77056.324 rows=500 loops=1)"
"              Sort Key: (CASE WHEN (instance_record.id IS NOT NULL) THEN instance_record.id ELSE ((audit_instance_record.jsonb #>> '{record,id}'::text[]))::uuid END), (CASE WHEN (marc_record.content IS NOT NULL) THEN marc_record.content ELSE CASE WHEN (instance_record.jsonb IS NOT NULL) THEN instance_record.jsonb ELSE (audit_instance_record.jsonb -> 'record'::text) END END), (CASE WHEN (instance_record.jsonb IS NOT NULL) THEN (instance_record.jsonb ->> 'source'::text) ELSE (audit_instance_record.jsonb #>> '{record,source}'::text[]) END), (CASE WHEN (instance_record.jsonb IS NOT NULL) THEN diku_mod_inventory_storage.strtotimestamp(((instance_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) ELSE diku_mod_inventory_storage.strtotimestamp((audit_instance_record.jsonb ->> 'createdDate'::text)) END), (CASE WHEN (marc_record.content IS NOT NULL) THEN record_lb.suppress_discovery ELSE COALESCE(((instance_record.jsonb ->> 'discoverySuppress'::text))::boolean, false) END), ((audit_instance_record.id IS NOT NULL))"
"              Sort Method: external merge  Disk: 367488kB"
"              ->  Hash Left Join  (cost=118536.32..23095205.52 rows=4729121 width=90) (actual time=2584.668..73885.868 rows=282512 loops=1)"
"                    Hash Cond: (audit_instance_record.id = ((audit_holdings_record_deleted.jsonb #>> '{record,instanceId}'::text[]))::uuid)"
"                    Filter: (((diku_mod_inventory_storage.strtotimestamp(((instance_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp(((instance_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp((audit_instance_record.jsonb ->> 'createdDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp((audit_instance_record.jsonb ->> 'createdDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp(((holdings_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp(((holdings_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp(((item_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp(((item_record.jsonb -> 'metadata'::text) ->> 'updatedDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp((audit_holdings_record.jsonb ->> 'createdDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp((audit_holdings_record.jsonb ->> 'createdDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp((audit_holdings_record_deleted.jsonb ->> 'createdDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp((audit_holdings_record_deleted.jsonb ->> 'createdDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp((audit_item_record.jsonb ->> 'createdDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp((audit_item_record.jsonb ->> 'createdDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)) OR ((diku_mod_inventory_storage.strtotimestamp((audit_item_record_deleted.jsonb ->> 'createdDate'::text)) >= '1970-01-01 00:00:00+00'::timestamp with time zone) AND (diku_mod_inventory_storage.strtotimestamp((audit_item_record_deleted.jsonb ->> 'createdDate'::text)) <= '2050-01-01 00:00:00+00'::timestamp with time zone)))"
"                    ->  Hash Left Join  (cost=118502.25..700569.46 rows=883948 width=3783) (actual time=2583.902..6956.745 rows=282512 loops=1)"
"                          Hash Cond: (holdings_record.id = ((audit_item_record_deleted.jsonb #>> '{record,holdingsRecordId}'::text[]))::uuid)"
"                          ->  Hash Left Join  (cost=118468.17..528289.45 rows=304012 width=3767) (actual time=2583.873..6758.927 rows=282512 loops=1)"
"                                Hash Cond: (audit_holdings_record.id = ((audit_item_record.jsonb #>> '{record,holdingsRecordId}'::text[]))::uuid)"
"                                ->  Hash Left Join  (cost=118434.10..361808.81 rows=304012 width=3751) (actual time=2583.835..6552.516 rows=282512 loops=1)"
"                                      Hash Cond: (instance_record.id = ((audit_holdings_record.jsonb #>> '{record,instanceId}'::text[]))::uuid)"
"                                      ->  Merge Full Join  (cost=118400.02..198357.43 rows=304012 width=3703) (actual time=2583.817..6297.398 rows=282512 loops=1)"
"                                            Join Filter: false"
"                                            ->  Hash Left Join  (cost=118400.02..194556.27 rows=304012 width=3655) (actual time=2583.797..6086.891 rows=282511 loops=1)"
"                                                  Hash Cond: (instance_record.id = holdings_record.instanceid)"
"                                                  ->  Merge Right Join  (cost=118261.36..192539.27 rows=299937 width=1511) (actual time=2581.813..5684.808 rows=282471 loops=1)"
"                                                        Merge Cond: (marc_record.id = record_lb.id)"
"                                                        ->  Index Scan using marc_records_lb_pkey on marc_records_lb marc_record  (cost=0.42..68665.84 rows=284128 width=1341) (actual time=0.020..1309.478 rows=283700 loops=1)"
"                                                        ->  Materialize  (cost=117914.21..119413.90 rows=299937 width=202) (actual time=2581.783..3406.530 rows=282471 loops=1)"
"                                                              ->  Sort  (cost=117914.21..118664.06 rows=299937 width=202) (actual time=2581.779..3111.661 rows=282471 loops=1)"
"                                                                    Sort Key: record_lb.id"
"                                                                    Sort Method: external merge  Disk: 59912kB"
"                                                                    ->  Merge Left Join  (cost=0.84..38384.25 rows=299937 width=202) (actual time=0.030..1729.037 rows=282471 loops=1)"
"                                                                          Merge Cond: (instance_record.id = record_lb.external_id)"
"                                                                          ->  Index Scan using instance_pkey on instance instance_record  (cost=0.42..20872.97 rows=299937 width=185) (actual time=0.013..624.993 rows=282445 loops=1)"
"                                                                          ->  Index Scan using idx_records_external_id on records_lb record_lb  (cost=0.42..14269.29 rows=283705 width=33) (actual time=0.013..594.909 rows=274418 loops=1)"
"                                                  ->  Hash  (cost=134.00..134.00 rows=373 width=2160) (actual time=1.965..1.968 rows=378 loops=1)"
"                                                        Buckets: 1024  Batches: 1  Memory Usage: 805kB"
"                                                        ->  Hash Right Join  (cost=68.28..134.00 rows=373 width=2160) (actual time=0.732..1.460 rows=378 loops=1)"
"                                                              Hash Cond: (item_record.holdingsrecordid = holdings_record.id)"
"                                                              ->  Seq Scan on item item_record  (cost=0.00..64.73 rows=373 width=1115) (actual time=0.006..0.276 rows=364 loops=1)"
"                                                              ->  Hash  (cost=63.68..63.68 rows=368 width=1061) (actual time=0.714..0.715 rows=369 loops=1)"
"                                                                    Buckets: 1024  Batches: 1  Memory Usage: 402kB"
"                                                                    ->  Seq Scan on holdings_record  (cost=0.00..63.68 rows=368 width=1061) (actual time=0.014..0.343 rows=369 loops=1)"
"                                            ->  Materialize  (cost=0.00..1.01 rows=1 width=48) (actual time=0.015..0.018 rows=1 loops=1)"
"                                                  ->  Seq Scan on audit_instance audit_instance_record  (cost=0.00..1.01 rows=1 width=48) (actual time=0.010..0.012 rows=1 loops=1)"
"                                      ->  Hash  (cost=20.70..20.70 rows=1070 width=48) (actual time=0.003..0.004 rows=0 loops=1)"
"                                            Buckets: 2048  Batches: 1  Memory Usage: 16kB"
"                                            ->  Seq Scan on audit_holdings_record  (cost=0.00..20.70 rows=1070 width=48) (actual time=0.003..0.003 rows=0 loops=1)"
"                                ->  Hash  (cost=20.70..20.70 rows=1070 width=32) (actual time=0.030..0.031 rows=3 loops=1)"
"                                      Buckets: 2048  Batches: 1  Memory Usage: 20kB"
"                                      ->  Seq Scan on audit_item audit_item_record  (cost=0.00..20.70 rows=1070 width=32) (actual time=0.007..0.009 rows=3 loops=1)"
"                          ->  Hash  (cost=20.70..20.70 rows=1070 width=32) (actual time=0.020..0.021 rows=3 loops=1)"
"                                Buckets: 2048  Batches: 1  Memory Usage: 20kB"
"                                ->  Seq Scan on audit_item audit_item_record_deleted  (cost=0.00..20.70 rows=1070 width=32) (actual time=0.005..0.007 rows=3 loops=1)"
"                    ->  Hash  (cost=20.70..20.70 rows=1070 width=32) (actual time=0.003..0.003 rows=0 loops=1)"
"                          Buckets: 2048  Batches: 1  Memory Usage: 16kB"
"                          ->  Seq Scan on audit_holdings_record audit_holdings_record_deleted  (cost=0.00..20.70 rows=1070 width=32) (actual time=0.002..0.002 rows=0 loops=1)"
"Planning Time: 6.777 ms"
"Execution Time: 77154.775 ms"

However, such approach takes too much time for 10 millions records and needs to be revised.

The investigation will be continued in the follow up spike.