MODSOURCE-276: Add existing records to the SRS Query API table

MODSOURCE-276 - Getting issue details... STATUS

Purpose

For today the Search API in SRS works only with newly imported Marc records.

This page is created with attendance to describe the approach to enable SRS Search for the existing Marc records.

Approach

The approach is to provide the filler script to the customer. The DB administrator/staff are able to run it once SRS is deployed and ready for use.

The script copies records from marc_records_lb table (only records that are missing in marc_indexers) and inserts records into marc_indexers table.

The content of script:

Define search_path to make the script run in a context of tenant/institution

SET search_path TO <tenant_mod_source_record_storage>;

Note: The DB administrator should replace the <tenant_mod_source_record_storage> on a proper value. For example, on the bugfest environment <tenant_mod_source_record_storage> = fs09000000_mod_source_record_storage, on the reference environment <tenant_mod_source_record_storage> = diku_mod_source_record_storage.

Remove existing indexes to speed up migration that happens after

 RAISE INFO 'Step1: Removing indexes %', NOW();
 drop index if exists idx_marc_indexers_leader_marc_id;
 for index in 0 .. 999 loop
 	suffix = lpad(index::text, 3, '0');
 	execute 'drop index if exists idx_marc_indexers_marc_id_' || suffix || ';';
 end loop;

Migrate records from marc_records_lb table to marc_indexers & marc_indexers_leader tables

The sql copies Marc fields from marc_records_lb table to marc_indexers table, and Marc leaders from marc_records_lb table to marc_indexers_leader. Both marc_indexers & marc_indexers_leader are involved into search mechanism, so here a Marc records become searchable.

Note: If some Marc field of the Marc record, that is involved into migration, is symbolic (cat, own, lkr and so on), this field is skipped from the migration being stored only in marc_records_lb

 RAISE INFO 'Step2: Migration: %', NOW();
 for rec in (select id, content from marc_records_lb x where id not in (select marc_id from marc_indexers)) loop
   IF (recordIndex % logFrequency = 0) THEN
     RAISE INFO 'RecordIndex: %', recordIndex;
   END IF;
   recordIndex := recordIndex + 1;

   insert into marc_indexers (field_no, ind1, ind2, subfield_no, value, marc_id)
     (with vals as (select value from jsonb_array_elements((select value from jsonb_each(rec.content) x where key = 'fields')) y),
             fields as (select x.key as field_no, x.value as field_value from vals, jsonb_each(vals.value) x),
             fields_subfields as (select field_no,
                                         trim(field_value ->> 'ind1'::text) ind1,
                                         trim(field_value ->> 'ind2') ind2,
                                         field_value -> 'subfields' subfields,
                                         field_value from fields
                                  where field_no between '000' and '999'), 
             marc_raw as (select fs.field_no, fs.ind1, fs.ind2, fs.field_value, null::text subfield_no, null::text subfield_value 
                          from fields_subfields fs where subfields is null
                          union all 
                          select fs.field_no, fs.ind1, fs.ind2, fs.field_value, subfs.key::text subfield_no, subfs.value::text subfield_value
                          from fields_subfields fs, jsonb_array_elements(fs.subfields) sx, jsonb_each(sx.value) subfs where subfields is not null),
             marc as (select m.field_no,
                        CASE WHEN ind1 IS NULL or ind1 = '' THEN '#' ELSE ind1 END as ind1,
                        CASE WHEN ind2 IS NULL or ind2 = '' THEN '#' ELSE ind2 END as ind2,
                        CASE WHEN subfield_no IS NULL or trim(subfield_no) = '' THEN '0' 
                        ELSE subfield_no END as subfield_no,
                        trim(both '"' from coalesce(subfield_value, field_value::text)) as value
                      from marc_raw m)
       select distinct lower(field_no) field_no, ind1, ind2, subfield_no, value, rec.id marc_id from marc);

   insert into marc_indexers_leader(p_00_04, p_05, p_06, p_07, p_08, p_09, p_10, p_11, p_12_16, p_17, p_18, p_19, p_20, p_21, p_22, marc_id)
          (select substring(value from 1 for 5) p_00_04,
                  substring(value from 6 for 1) p_05,
                  substring(value from 7 for 1) p_06,
                  substring(value from 8 for 1) p_07,
                  substring(value from 9 for 1) p_08,
                  substring(value from 10 for 1) p_09,
                  substring(value from 11 for 1) p_10,
                  substring(value from 12 for 1) p_11,
                  substring(value from 13 for 5) p_12_16,
                  substring(value from 18 for 1) p_17,
                  substring(value from 19 for 1) p_18,
                  substring(value from 20 for 1) p_19,
                  substring(value from 21 for 1) p_20,
                  substring(value from 22 for 1) p_21,
                  substring(value from 23 for 1) p_22,
                  marc_id
          from (select replace(lower(trim(both '"' from value::text)), ' ', '#') as value, rec.id marc_id
                from jsonb_each(rec.content) x
                where key = 'leader') y);
 end loop;

Restore/Create indexes

 RAISE INFO 'Step3: Creating indexes %', NOW();
 create index idx_marc_indexers_leader_marc_id on marc_indexers_leader (marc_id);
 for index in 0 .. 999 loop
    suffix = lpad(index::text, 3, '0');
    execute 'create index idx_marc_indexers_marc_id_' || suffix || ' on marc_indexers_' || suffix || '(marc_id);';
 end loop;

Running

Source on gist: https://gist.github.com/Igor-Gorchakov/4d71e19620f47af9da45aa5cf22f578b

Source on github repo on SRS: <TODO put a link later>

Running from psql:

psql -h hostname -U folio -f fill-marc-indexes.sql

Execution process:

Time needed for complete execution: ±5 hours

Tested number of records: ±8 million records in marc_records_lb:

As a result of execution the marc_indexers & marc_indexers_leader are updated:

Restrictions & limitations & edge cases

If there is a need to import (using data-import) new Marc records that contain symbolic fields (cat, own and so on) on Iris SRS, and make such records enabled for Search API, then there should be created hotfix. Need to add partitions for symbolic fields to search tables to make such fields searchable.

2. Now we do not know how exact adding/posting(using REST API) new Marc records with symbolic fields will affect the Search API and data-import.
For the Search API this records will not be available, because the database trigger for indexing the record for search will fail.
For the data-import this can probably fail the import job, because DB trigger will try to put a Marc field into none existing partition throwing runtime exception, this case needs to be double checked.