[FOLIO-1019] Inventory performance optimization for folio-alpha Created: 09/Jan/18  Updated: 12/Nov/18  Resolved: 14/Feb/18

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P3
Reporter: Wayne Schneider Assignee: Wayne Schneider
Resolution: Done Votes: 0
Labels: alpha, ci, core, sprint30, sprint31, sprint32
Remaining Estimate: Not Specified
Time Spent: 3 days, 6 hours
Original estimate: Not Specified

Attachments: inventory.jmx, inventory_csv_with_fulldisplay.jmx, inventory_jmeter_output.csv, inventory_only_csv.jmx, inventory_queries.csv
Issue links:
Relates
relates to FOLIO-1066 Create AWS environment for performanc... Closed
relates to FOLIO-986 set up FOLIO installation with import... Closed
relates to UIIN-69 Typeahead search should start with 3r... Closed
relates to FOLIO-917 Optimize postgresl for FOLIO deployment Open
relates to MODINVSTOR-48 folio alpha installation is missing i... Closed
Sprint:

 Description   

Create new indexes, database configuration tweaks, update queries, etc.



 Comments   
Comment by Wayne Schneider [ 10/Jan/18 ]

After much experimenting with different indexes, we have come to a few conclusions:

  • we need to change the UI to search on "publication", not "publishers" (and "publisher" needs to be the first property in the serialised JSON)
  • we need name to be the first property in contributors in the serialised JSON
  • we believe that sorting needs to be restricted to a single property, in order to limit the combinations (and hence indexes we need)
  • assuming single property sorting, we need b-tree indexes for title, contributors and publication
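
Assuming single-property sorting, the btree indexes would presumably be expression indexes over the serialised JSON, along these lines (a sketch only - the exact expressions and names would come from the RMB schema config, though the lower/f_unaccent normalisation matches what the generated queries use):

```sql
-- Sketch: one btree expression index per sortable property, using the
-- same lower(f_unaccent(...)) normalisation as the generated WHERE/ORDER BY.
CREATE INDEX instance_title_idx
    ON diku_mod_inventory_storage.instance
       (lower(f_unaccent(jsonb->>'title')));
CREATE INDEX instance_contributors_idx
    ON diku_mod_inventory_storage.instance
       (lower(f_unaccent(jsonb->>'contributors')));
CREATE INDEX instance_publication_idx
    ON diku_mod_inventory_storage.instance
       (lower(f_unaccent(jsonb->>'publication')));
```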

Open questions:

  • Can the "like" indexes be dropped?
  • Anything else needed for search performance?
Comment by shale99 [ 11/Jan/18 ]

hi Wayne Schneider - I was going to try this myself, but I didn't want to apply new rules to the db because I didn't know what was in the queue (I created a security group to open my IP to the db - hope that was OK - but I deleted the security group because I did not want to apply it immediately). Having said that, maybe try running the following - it will tell us which indexes are being used and which have not been (look at the idx_scan column):

-- For every index in the schema, report usage stats plus the size of
-- the instance table (note: the same table size is repeated on every
-- row; unused indexes show idx_scan = 0).
SELECT *, pg_size_pretty(pg_relation_size('diku_mod_inventory_storage.instance'))
  FROM pg_stat_all_indexes
 WHERE schemaname = 'diku_mod_inventory_storage'
 ORDER BY pg_relation_size('diku_mod_inventory_storage.instance') DESC, idx_scan ASC;
 

getting that output would be interesting

Comment by Niels Erik Nielsen [ 11/Jan/18 ]

I think it would be appropriate to at least try to performance test a regular relational structure (response time, CPU/memory usage, index disk-space usage, etc.) with and without multi-column sorting and whatever other features we're contemplating.

If we take searches/sorting by title, contributors, and identifiers, for instance - we could create contributor and identifier tables in place of the contributor/identifier JSON object arrays - and index them appropriately, something like this:

DROP TABLE IF EXISTS diku_mod_inventory_storage.instance_identifier;

CREATE TABLE diku_mod_inventory_storage.instance_identifier (
  _id uuid PRIMARY KEY,
  instance_id uuid REFERENCES diku_mod_inventory_storage.instance (_id), 
  identifier_type_id uuid REFERENCES diku_mod_inventory_storage.identifier_type (_id), 
  identifier text,
  CONSTRAINT unique_identifiers UNIQUE (identifier, identifier_type_id, instance_id)
);

DROP TABLE IF EXISTS diku_mod_inventory_storage.instance_contributor;

CREATE TABLE diku_mod_inventory_storage.instance_contributor (
  _id uuid PRIMARY KEY,
  instance_id uuid REFERENCES diku_mod_inventory_storage.instance (_id),
  contributor_type_id uuid REFERENCES diku_mod_inventory_storage.contributor_type (_id), 
  contributor_name_type_id uuid REFERENCES diku_mod_inventory_storage.contributor_name_type (_id), 
  primary_contributor boolean, 
  name text
);

CREATE UNIQUE INDEX instance_contributor_name_type_name_idx
                 ON diku_mod_inventory_storage.instance_contributor (name, contributor_name_type_id, instance_id) 
              WHERE contributor_type_id IS NULL;          

CREATE UNIQUE INDEX instance_contributor_type_name_type_name_idx 
                 ON diku_mod_inventory_storage.instance_contributor (name, contributor_name_type_id, contributor_type_id, instance_id) 
              WHERE contributor_type_id IS NOT NULL;          

The tables might be populated directly from the ingested JSON for testing (don't know how this SQL would work for millions of records though):

-- jsonb_array_elements in an (implicit lateral) join yields one row per
-- array element; ->> extracts values as unquoted text, so no REPLACE of
-- quote characters is needed.
INSERT INTO diku_mod_inventory_storage.instance_identifier (_id, instance_id, identifier_type_id, identifier)
SELECT gen_random_uuid(),
       inst._id,
       (ident->>'identifierTypeId')::uuid,
       ident->>'value'
  FROM diku_mod_inventory_storage.instance inst,
       jsonb_array_elements(inst.jsonb->'identifiers') AS ident;

INSERT INTO diku_mod_inventory_storage.instance_contributor (_id, instance_id, contributor_type_id, contributor_name_type_id, name)
SELECT gen_random_uuid(),
       inst._id,
       (contrib->>'contributorTypeId')::uuid,
       (contrib->>'contributorNameTypeId')::uuid,
       contrib->>'name'
  FROM diku_mod_inventory_storage.instance inst,
       jsonb_array_elements(inst.jsonb->'contributors') AS contrib;

(Outstanding: the primary_contributor column is not yet populated; we could also add a flag indicating which contributor to sort by if there are multiple primary contributors. A table for instance publishers should also be created.)
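
For completeness, a publishers table following the same pattern might look like this (hypothetical - the column set is a guess mirroring instance_identifier, not a settled design):

```sql
-- Hypothetical companion table for publishers, mirroring the
-- instance_identifier / instance_contributor pattern above.
CREATE TABLE diku_mod_inventory_storage.instance_publisher (
  _id uuid PRIMARY KEY,
  instance_id uuid REFERENCES diku_mod_inventory_storage.instance (_id),
  publisher text,
  date_of_publication text
);

CREATE INDEX instance_publisher_name_idx
    ON diku_mod_inventory_storage.instance_publisher (publisher, instance_id);
```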

In my small setup with only the sample data from mod-inventory-storage, I have these records, indexed by 'identifier' and 'name' respectively (if I understand PostgreSQL indexes correctly), which I'd expect to be a super fast entry point for searching (not sure about the sorting):

okapi_modules=# SELECT * 
okapi_modules-#   FROM diku_mod_inventory_storage.instance_identifier;
                 _id                  |             instance_id              |          identifier_type_id          |      identifier      
--------------------------------------+--------------------------------------+--------------------------------------+----------------------
 0933a4e9-7efd-4aca-967a-146920f9272f | 69640328-788e-43fc-9c3c-af39e243f3b7 | 913300b2-03ed-469a-8179-c1092c991227 | 0747-0088
 deb6a606-1220-40a8-9583-52b945af5a55 | 69640328-788e-43fc-9c3c-af39e243f3b7 | c858e4f2-2b6b-4385-842b-60732ee14abb | 84641839
 0e108f63-2da8-4ebf-97e2-defb4e30e896 | 30fcc8e7-a019-43f4-b642-2edc389f4501 | 593b78cb-32f3-44d1-ba8c-63fd5e6989e6 | AJMEAZ
 6fddd008-1919-4fc2-afa9-95a9e0314c1b | 30fcc8e7-a019-43f4-b642-2edc389f4501 | 913300b2-03ed-469a-8179-c1092c991227 | 0002-9343
 d3a4c0de-dff8-470d-a235-ed560be18600 | 30fcc8e7-a019-43f4-b642-2edc389f4501 | c858e4f2-2b6b-4385-842b-60732ee14abb | med49002270
 4b4c4687-febb-47b5-86e4-5a0733453f4c | 7fbd5d84-62d1-44c6-9c45-6cb173998bbd | 5d164f4b-0b15-4e42-ae75-cfcf85318ad9 | ocn956625961
 bdc976e9-627a-40af-889f-47c7e52bf823 | f31a36de-fcf8-44f9-87ef-a55d06ad21ae | 7f907515-a1bf-4513-8a38-92e1a07c539d | B01LO7PJOE
 70e1cae8-30cd-4043-9514-523607d2e7d4 | a89eccf0-57a6-495e-898d-32b9b2210f2f | 8261054f-be78-422d-bd51-4ed9f33c3422 | 0552142352
 50416f0c-7942-4fe9-a53e-26288a3af716 | a89eccf0-57a6-495e-898d-32b9b2210f2f | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9780552142352
 89bb755d-146b-4674-a923-642d0c71dabf | 6506b79b-7702-48b2-9774-a1c538fdd34e | 8261054f-be78-422d-bd51-4ed9f33c3422 | 0956687695
 81f6f065-717d-496d-bdc5-a4d5b412ebf1 | 6506b79b-7702-48b2-9774-a1c538fdd34e | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9780956687695
 287256da-bc2d-48c3-89e2-67d4077c2e6f | cf23adf0-61ba-4887-bf82-956c4aae2260 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 1447294130
 c364068f-c166-470a-b573-c6ab76d311a6 | cf23adf0-61ba-4887-bf82-956c4aae2260 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9781447294130
 9e1e1ac2-a506-4a46-8953-c8b0b25646dd | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 5d164f4b-0b15-4e42-ae75-cfcf85318ad9 | ocn968777846
 24410134-f5fa-4f5d-afa4-50daf00e8439 | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9786316800312
 fdf3dce3-2757-4c24-a71c-5a416d0f748a | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 6316800312
 781fd59b-085f-4ddb-8029-683b5e53f5f2 | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | b5d8cdc4-9441-487c-90cf-0c7ec97728eb | OTA-1031 Otá Records
 67686153-c54e-4fd6-be5c-91b3948b183a | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 7e591197-f335-4afb-bc6d-a6d76ca3bace | (OCoLC)968777846
(18 rows)

okapi_modules=# 
okapi_modules=# SELECT * 
okapi_modules-#   FROM diku_mod_inventory_storage.instance_contributor;
                 _id                  |             instance_id              |         contributor_type_id          |       contributor_name_type_id       | primary_contributor |         name         
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+---------------------+----------------------
 6ec264a7-c909-4ebd-8ed9-c4ed1ec76381 | 7fbd5d84-62d1-44c6-9c45-6cb173998bbd |                                      | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | Fielding, Helen
 b11ab3cf-622d-483f-badc-6e4034162575 | f31a36de-fcf8-44f9-87ef-a55d06ad21ae |                                      | 2e48e713-17f3-7c13-09f8-23845bb210aa |                     | Creator A
 17c81250-044b-46da-bd87-c2a6b4aa6ad5 | f31a36de-fcf8-44f9-87ef-a55d06ad21ae |                                      | e8b311a6-3b21-03f2-2269-dd9310cb2d0a |                     | Creator B
 9e136fae-9d41-4153-9501-3c86d1fdd00d | a89eccf0-57a6-495e-898d-32b9b2210f2f |                                      | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | Pratchett, Terry
 d3a491ab-9261-4b13-874f-72fd4e971bc1 | 6506b79b-7702-48b2-9774-a1c538fdd34e |                                      | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | Barnes, Adrian
 31745a13-1058-4ec6-a057-15ab73c3c098 | cf23adf0-61ba-4887-bf82-956c4aae2260 |                                      | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | Novik, Naomi
 ea443a06-b6a6-4985-a4d5-cf9bbe82b166 | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 2b94c631-fca9-a892-c730-03ee529ffe2b | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | Sosa, Omar
 68fe6a2a-7c9f-4a7f-b024-19ce27c00bff | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 2b94c631-fca9-a892-c730-03ee529ffe2b | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | Keita, Seckou, 1977-
 63515316-37fd-4953-90d1-e16d0b6c550e | bbd4a5e1-c9f3-44b9-bfdf-d184e04f0ba0 |                                      | 2b94c631-fca9-a892-c730-03ee529ffe2a |                     | White, Vincent E.
(9 rows)

okapi_modules=# 
Comment by shale99 [ 11/Jan/18 ]

hey Niels Erik Nielsen - if I understood correctly, I am not sure it makes a difference to Postgres whether the btree comes from a separate column or from a field in the jsonb. At the end of the day the same index is created no matter what the source (column or field in jsonb), and response time should be identical. The issue with multiple columns (or multiple fields, it's the same) is just that Postgres is very picky - you need to create an index for title+contrib and also for contrib+title; yes, the ordering matters for the index to be used. You would also need the same for title+pub and pub+title, and then asc vs. desc matters here too, so that creates an awful lot of indexes to support the multi-column sort - and at the end of the day this stuff really hurts scalability. If the added value of this is critical, it can be supported by adding the indexes and it will perform well - I am just not sure it is....
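
To make the combinatorics concrete, here is a sketch of the composite indexes a two-property sort on title and contributors alone would call for (index names and expressions are illustrative; a btree can be scanned backwards to cover the all-DESC variants, but mixed asc/desc orderings each need their own index):

```sql
-- Illustrative only: each column order, and each mixed asc/desc
-- combination, needs its own btree for the planner to use it.
CREATE INDEX instance_title_contrib_idx
    ON diku_mod_inventory_storage.instance
       ((jsonb->>'title'), (jsonb->>'contributors'));
CREATE INDEX instance_contrib_title_idx
    ON diku_mod_inventory_storage.instance
       ((jsonb->>'contributors'), (jsonb->>'title'));
CREATE INDEX instance_title_asc_contrib_desc_idx
    ON diku_mod_inventory_storage.instance
       ((jsonb->>'title') ASC, (jsonb->>'contributors') DESC);
CREATE INDEX instance_contrib_asc_title_desc_idx
    ON diku_mod_inventory_storage.instance
       ((jsonb->>'contributors') ASC, (jsonb->>'title') DESC);
```

Adding publication as a third sortable property grows this set combinatorially, which is the scalability concern.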

Comment by Niels Erik Nielsen [ 11/Jan/18 ]

But - this is not just from a separate column, no? Rather, from two new columns in two new separate tables?

The entry into 'instance' would be through the '_id' index only.

Certainly not saying this means we would be home safe, just that it seemed sufficiently different from the current approach to potentially have other performance characteristics.

Comment by Niels Erik Nielsen [ 11/Jan/18 ]

And, btw, this is a distraction, given this issue's title, ".. for folio-alpha", sorry.

Comment by Wayne Schneider [ 12/Jan/18 ]

Conclusions from discussions on 1/11:
Output from pg_stat_all_indexes:

 relid  | indexrelid |         schemaname         | relname  |          indexrelname           | idx_scan | idx_tup_read | idx_tup_fetch | pg_size_pretty 
--------+------------+----------------------------+----------+---------------------------------+----------+--------------+---------------+----------------
 204521 |     204533 | diku_mod_inventory_storage | instance | instance_contributors_idx_like  |        0 |            0 |             0 | 3516 MB
 204521 |     347310 | diku_mod_inventory_storage | instance | instance_publication_idx        |        0 |            0 |             0 | 3516 MB
 204521 |     204531 | diku_mod_inventory_storage | instance | instance_title_idx_like         |        0 |            0 |             0 | 3516 MB
 204521 |     204527 | diku_mod_inventory_storage | instance | instance_pkey                   |        1 |            1 |             1 | 3516 MB
 204521 |     204532 | diku_mod_inventory_storage | instance | instance_identifiers_idx_like   |        2 |            0 |             0 | 3516 MB
 204521 |     347309 | diku_mod_inventory_storage | instance | instance_contributors_idx       |       19 |          750 |           750 | 3516 MB
 204521 |     347326 | diku_mod_inventory_storage | instance | instance_contrib_title_idx      |       38 |         2170 |          2170 | 3516 MB
 204521 |     204538 | diku_mod_inventory_storage | instance | instance_languages_idx_gin      |       88 |     49622320 |             0 | 3516 MB
 204521 |     204530 | diku_mod_inventory_storage | instance | instance_id_idx                 |      332 |          326 |           326 | 3516 MB
 204521 |     204537 | diku_mod_inventory_storage | instance | instance_instancetypeid_idx_gin |      444 |     50205382 |             0 | 3516 MB
 204521 |     204534 | diku_mod_inventory_storage | instance | instance_title_idx_gin          |      607 |      3924339 |             0 | 3516 MB
 204521 |     204535 | diku_mod_inventory_storage | instance | instance_contributors_idx_gin   |      607 |      3398622 |             0 | 3516 MB
 204521 |     204536 | diku_mod_inventory_storage | instance | instance_identifiers_idx_gin    |      619 |      2595910 |             0 | 3516 MB
 204521 |     204529 | diku_mod_inventory_storage | instance | instance_title_idx              |     1808 |     46971579 |      46971578 | 3516 MB

Stats on the "like" indexes seem to indicate that they are not being used.
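
If they stay unused after the rebuild, dropping them would be one statement each - a sketch, to be verified against the rebuilt environment first:

```sql
-- Candidate drops, based on the idx_scan stats above. Note that
-- instance_identifiers_idx_like shows idx_scan = 2 but zero tuples
-- read/fetched, so it also appears to be dead weight.
DROP INDEX IF EXISTS diku_mod_inventory_storage.instance_title_idx_like;
DROP INDEX IF EXISTS diku_mod_inventory_storage.instance_contributors_idx_like;
DROP INDEX IF EXISTS diku_mod_inventory_storage.instance_identifiers_idx_like;
```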

Codex app provides search using the indexes ID, title, identifier, ISBN, ISSN, contributor, subject, classification, and publisher.

These are the indexes we decided on for mod_inventory_storage.instance:

  {
      "tableName": "instance",
      "withMetadata": true,
      "generateId": false,
      "pkColumnName": "_id",
      "ginIndex": [
        {
          "fieldName": "title",  <------------- for title search
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "contributors", <------------- for contributors search
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "identifiers", <------------- for identifiers search
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "instanceTypeId", <------------- for type search (as a filter)
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "languages", <------------- for languages search (as a filter)
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "classifications", <------------- for CODEX classifications search
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "subjects", <------------- for CODEX subjects search
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        }
      ],
      "index": [
        {
          "fieldName": "title", <------------- for title sort
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "id", <------------- for join with holdings for location filter
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "contributors", <------------- for contributors sort
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        },
        {
          "fieldName": "publication", <------------- for publication sort
          "tOps": "ADD",
          "caseSensitive": false,
          "removeAccents": true
        }
      ]
    }

Next steps:
Rebuild folio-alpha with latest code and new indexes. Reload records and test performance within the inventory app and with the Codex app.

Does this sound right? shale99 – if you have any jmeter scripts put together, feel free to send them my way.

Comment by Wayne Schneider [ 12/Jan/18 ]

Niels Erik Nielsen – I think the discussion of db architecture is not a distraction. I also was thinking of supplemental tables that might be useful and more performant for sorting, e.g. a table of titles or of contributor names that you could use in joins with the UUID of the instance as the relational key. Of course, post-alpha, but we need to capture the ideas somewhere.

Comment by shale99 [ 14/Jan/18 ]

Wayne Schneider - sounds good to me. I have started working on the jmeter script - hope to have something to pass on to you tomorrow.
just one more comment about the index usage - we can also see
idx_tup_read = 0
idx_tup_fetch = 0
which may strengthen the assumption that the like indexes are not needed
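
A query along these lines (a sketch, not from the original discussion) would list the candidate-unused indexes directly, using all three stats columns:

```sql
-- List indexes in the schema that have never been scanned and whose
-- entries have never been read or fetched - candidates for dropping.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_all_indexes
 WHERE schemaname = 'diku_mod_inventory_storage'
   AND idx_scan = 0
   AND idx_tup_read = 0
   AND idx_tup_fetch = 0
 ORDER BY pg_relation_size(indexrelid) DESC;
```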

Comment by shale99 [ 14/Jan/18 ]

hey Wayne Schneider

I am attaching two files.
1. a jmeter script that can be run by jenkins - it runs a login request to folio-alpha (the host is configurable) and then about 40 GET requests. Most of them are queries (filters, sorting), but there are some requests to the configuration module as well. It generates two outputs:

  • to the console - a summary with: (1) # of requests (2) average request time (3) min request time (4) max request time (5) error % - we should set up an alert on the max and avg values
  • a csv file with the actual requests and the amount of time each request took, the status of the request, latency, connection time, etc. - I would send this file via mail at some point if the average or max is higher than a threshold (sample csv attached - you need to change the path to the file in the .jmx file - I may be able to extract that into a variable but have not)

right now, I see that the slowest request is actually not a double sort but only a single sort on title - we need to see the EXPLAIN for that:

query=((title="water*" or contributors adj "\"name\": \"water*\"" or identifiers adj "\"value\": \"water*\"")) and instanceTypeId="2b94c631-fca9-a892-c730-03ee529ffe2c" and languages="eng" sortby title/sort.descending

I will update the jmeter script to receive a file with queries and push those queries into a template, so that we can run many queries via a list in an external file.

also, note that there is an email plugin that can send emails if we want to push out notifications on poor performance test results (<- recommended)

Comment by shale99 [ 14/Jan/18 ]

two new files attached. a new .jmx file - inventory_only_csv.jmx - will look for a csv file (you need to change the path to it in the .jmx file) with one query per line, loop through all of them, and return the same type of outputs as described above. Also attached is a sample csv file (inventory_queries.csv).

tomorrow I will add the ability to dynamically do a full display of records returned from queries

Comment by shale99 [ 15/Jan/18 ]

one more jmx script - this one is pretty cool, as you can use it to load the codex as well as the inventory

inventory_csv_with_fulldisplay.jmx

first - you have the following set of variables

  • base url to the host
  • okapi tenant
  • origin header (same as host with http prefix)
  • query duration max
  • full display duration max
  • url path (this is the path to query, can be instance-storage/instances, or the codex path, etc...)

second - pass in the csv file as attached in the previous script

the script will:

  • run each query from the csv
  • parse the json returned
  • pick a random id from the json
  • call a full display on that id

if a request takes longer than the duration max limit, or a non-200 response is returned, an assertion failure will occur

this means there are three current jmeter scripts:

1. simulates the ui by running exactly what the ui runs (including config requests)
2. simulates just running queries from a csv file
3. simulates running queries and then a full display, with the queries coming from the csv file

I will tweak (1) so that it can also receive a file with terms

Comment by shale99 [ 15/Jan/18 ]

quick update: the csv has been updated to be comma delimited with offset and limit - see the example attached. note that these are mandatory (at least on the first query; after that, the second entry in the csv, for example, will use the previous offset and limit values)

Comment by Wayne Schneider [ 16/Jan/18 ]

We have identified a performance issue when you search for multiple terms in the typeahead search box on the UI, e.g. searching for "golda m". This results in this CQL query:

(title="golda m*" or contributors adj "\"name\": \"golda m*\"" or identifiers adj "\"value\": \"golda m*\"") sortby title

This CQL generates this SQL WHERE clause:

WHERE (((lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')) AND lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))) OR (lower(f_unaccent(instance.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))) OR (lower(f_unaccent(instance.jsonb->>'identifiers')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))) ORDER BY lower(f_unaccent(instance.jsonb->>'title')) LIMIT 30 OFFSET 0

...which performs very badly on folio-alpha.

The troublesome part of the query is this:

(lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')) AND lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))

If this bit is changed to:

(lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda([[:punct:]]|[[:space:]])+m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))

Performance is very good.

shale99 says:

the query is generated incorrectly with the space - it creates a separate AND clause, which is incorrect and very costly: the single letter causes the planner to use the btree index to first sort and then filter on the results - but since it's an AND there are only a few results and it's very slow

In addition, we found a possible issue with db configuration resulting in slower-than-desired IO (adding a second or more to query result sorting). We're changing the storage type of the RDS instance to see if that will improve things – but the query issue above has much greater impact.

Comment by Wayne Schneider [ 16/Jan/18 ]

Update: testing with Marc Johnson, we found that the change in RMB 17 that makes adj the default CQL relation for = results in a better query:

WHERE ((lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda([[:punct:]]|[[:space:]])+m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))) OR (lower(f_unaccent(instance.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))) OR (lower(f_unaccent(instance.jsonb->>'identifiers')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))) ORDER BY lower(f_unaccent(instance.jsonb->>'title')) LIMIT 30 OFFSET 0

Which returns very fast (< 500 ms in the UI). So once all the code is in the mainline, we should be in pretty good shape, right?

Comment by Wayne Schneider [ 17/Jan/18 ]

Open issues:

  1. Search for a single non-western character is cripplingly slow (see UIIN-69 Closed )
  2. Search for multiple short terms (e.g. "new yo") is slower than it should be. Query analysis shows that the bitmap heap scan on folio-alpha, even with cached data, is slower than on a local PostgreSQL install, and CPU utilization as well as read IOPS spikes on the operation. We may try with a PostgreSQL install on an EC2 instance to see if we see the same behavior (current instance is RDS)
  3. Search with multiple filters applied also appears slower than it should be. Haven't yet done query analysis on that.
Comment by Wayne Schneider [ 01/Feb/18 ]

With mod-inventory-storage-7.2.1-SNAPSHOT.73 (https://github.com/folio-org/mod-inventory-storage/commit/6bc1b9c32cf23a559d62c99b8b25b1a94855d864) and ui-inventory-1.0.100080 (https://github.com/folio-org/ui-inventory/commit/5e634a0202664d9c303fed4239c48d7f48a1d83f), which adds searching barcode to the typeahead search, search performance is really bad.

Here is the CQL generated by the UI for the typeahead search:

(title="golda*" or contributors adj "\"name\": \"golda*\"" or identifiers adj "\"value\": \"golda*\"" or item.barcode="golda*") sortby title

This translates into this WHERE clause, according to the log:

WHERE (((lower(f_unaccent(instance_holding_item_view.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))) OR (lower(f_unaccent(instance_holding_item_view.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))) OR (lower(f_unaccent(instance_holding_item_view.jsonb->>'identifiers')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))) OR (lower(f_unaccent(instance_holding_item_view.it_jsonb->>'barcode')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))) ORDER BY lower(f_unaccent(instance_holding_item_view.jsonb->>'title')) LIMIT 30 OFFSET 0

EXPLAIN (ANALYZE, BUFFERS) for the query shows:

 Limit  (cost=1.49..67711.68 rows=30 width=1469) (actual time=9828.727..101728.613 rows=30 loops=1)
   Buffers: shared hit=1255688 read=195908
   ->  Nested Loop Left Join  (cost=1.49..8854237.51 rows=3923 width=1469) (actual time=9828.725..101728.584 rows=30 loops=1)
         Filter: ((lower(f_unaccent((instance.jsonb ->> 'title'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((instance.jsonb ->> 'contributors'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((instance.jsonb ->> 'identifiers'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((item.jsonb ->> 'barcode'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text))
         Rows Removed by Filter: 625319
         Buffers: shared hit=1255688 read=195908
         ->  Nested Loop Left Join  (cost=1.08..2328365.55 rows=1796473 width=1373) (actual time=0.097..14632.255 rows=625348 loops=1)
               Buffers: shared hit=1108970 read=195908
               ->  Index Scan using instance_title_idx on instance  (cost=0.68..1403133.96 rows=1796473 width=1341) (actual time=0.021..9422.059 rows=625347 loops=1)
                     Buffers: shared hit=434698 read=195907
               ->  Index Scan using holdings_record_instanceid_idx on holdings_record  (cost=0.41..0.48 rows=4 width=32) (actual time=0.002..0.002 rows=0 loops=625347)
                     Index Cond: (lower(f_unaccent((instance.jsonb ->> 'id'::text))) = lower(f_unaccent((jsonb ->> 'instanceId'::text))))
                     Buffers: shared hit=625349
         ->  Index Scan using item_holdingsrecordid_idx on item  (cost=0.41..0.51 rows=3 width=32) (actual time=0.000..0.000 rows=0 loops=625348)
               Index Cond: (lower(f_unaccent((holdings_record.jsonb ->> 'id'::text))) = lower(f_unaccent((jsonb ->> 'holdingsRecordId'::text))))
               Buffers: shared hit=6
 Planning time: 72.537 ms
 Execution time: 101728.737 ms

Interestingly, in the UI this took about 7 minutes (granted, I have a bad network connection right now, but not that bad), vs. 102 sec. using psql. Not sure how to explain that.
Applying a filter causes the query to be executed again, with extremely slow results.

If I remove the barcode from the query and run it against just the instance table, I get:

 Limit  (cost=3040.72..3040.79 rows=30 width=1413) (actual time=244.438..244.468 rows=30 loops=1)
   Buffers: shared hit=1259 read=1172
   ->  Sort  (cost=3040.72..3042.04 rows=530 width=1413) (actual time=244.436..244.460 rows=30 loops=1)
         Sort Key: (lower(f_unaccent((jsonb ->> 'title'::text))))
         Sort Method: top-N heapsort  Memory: 79kB
         Buffers: shared hit=1259 read=1172
         ->  Bitmap Heap Scan on instance  (cost=420.37..3025.07 rows=530 width=1413) (actual time=231.874..242.597 rows=75 loops=1)
               Recheck Cond: ((lower(f_unaccent((jsonb ->> 'title'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((jsonb ->> 'contributors'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((jsonb ->> 'identifiers'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text))
               Rows Removed by Index Recheck: 53
               Heap Blocks: exact=128
               Buffers: shared hit=1259 read=1172
               ->  BitmapOr  (cost=420.37..420.37 rows=530 width=0) (actual time=231.447..231.447 rows=0 loops=1)
                     Buffers: shared hit=1112 read=1143
                     ->  Bitmap Index Scan on instance_title_idx_gin  (cost=0.00..61.34 rows=179 width=0) (actual time=29.525..29.525 rows=110 loops=1)
                           Index Cond: (lower(f_unaccent((jsonb ->> 'title'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text)
                           Buffers: shared hit=19 read=28
                     ->  Bitmap Index Scan on instance_contributors_idx_gin  (cost=0.00..157.28 rows=171 width=0) (actual time=182.956..182.956 rows=64 loops=1)
                           Index Cond: (lower(f_unaccent((jsonb ->> 'contributors'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text)
                           Buffers: shared hit=1079 read=1094
                     ->  Bitmap Index Scan on instance_identifiers_idx_gin  (cost=0.00..201.35 rows=180 width=0) (actual time=18.961..18.961 rows=0 loops=1)
                           Index Cond: (lower(f_unaccent((jsonb ->> 'identifiers'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text)
                           Buffers: shared hit=14 read=21
 Planning time: 15.046 ms
 Execution time: 246.780 ms

So that's plenty fast.
Similar results if I run against the view but remove the barcode clause.

Comment by Wayne Schneider [ 14/Feb/18 ]

Closing this issue as the basic work is done...we can raise individual issues for creation of a performance testing environment, further development of a set of tests, setting up a baseline for measuring performance, and CI integration. At this point if there are particular performance problems we are working on, it makes sense to create issues for them.

Generated at Thu Feb 08 23:10:14 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.