[FOLIO-1019] Inventory performance optimization for folio-alpha Created: 09/Jan/18 Updated: 12/Nov/18 Resolved: 14/Feb/18 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Task | Priority: | P3 |
| Reporter: | Wayne Schneider | Assignee: | Wayne Schneider |
| Resolution: | Done | Votes: | 0 |
| Labels: | alpha, ci, core, sprint30, sprint31, sprint32 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | 3 days, 6 hours | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
| Issue links: |
|
| Sprint: |
| Description |
|
Create new indexes, database configuration tweaks, update queries, etc. |
| Comments |
| Comment by Wayne Schneider [ 10/Jan/18 ] |
|
After much experimenting with different indexes, we have come to a few conclusions:
Open questions:
|
| Comment by shale99 [ 11/Jan/18 ] |
|
hi Wayne Schneider - I was going to try this myself, but I didn't want to apply new rules to the DB because I didn't know what was in the queue (I created a security group to open my IP to the DB - hope that was OK - but I deleted the security group because I did not want it to apply immediately). Having said that, if you can, maybe try running the following - it will tell us which indexes are being used and which have not (look at the idx_scan column):
SELECT *, pg_size_pretty(pg_relation_size('diku_mod_inventory_storage.instance'))
FROM pg_stat_all_indexes
WHERE schemaname = 'diku_mod_inventory_storage'
ORDER BY pg_relation_size('diku_mod_inventory_storage.instance') DESC, idx_scan ASC
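One caveat: idx_scan accumulates since the last statistics reset, so to measure only a specific workload the counters can be cleared first. A sketch - pg_stat_reset() wipes all statistics for the current database, so only run it on a test instance:

```sql
-- Clear cumulative statistics for the current database (test instances only!)
SELECT pg_stat_reset();

-- ...run the workload under test, then re-check which indexes were touched:
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_all_indexes
WHERE schemaname = 'diku_mod_inventory_storage'
ORDER BY idx_scan ASC;
```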
getting that output would be interesting |
| Comment by Niels Erik Nielsen [ 11/Jan/18 ] |
|
I think it would be appropriate to at least try to performance-test a regular relational structure (response time, CPU/memory usage, index disk space usage, etc.) with and without multi-column sorting and whatever other features we're contemplating. Taking searches/sorting by title, contributors, and identifiers as an example: we could create contributor and identifier tables in place of the contributor/identifier JSON object arrays, and index them appropriately, something like this:
DROP TABLE IF EXISTS diku_mod_inventory_storage.instance_identifier;
CREATE TABLE diku_mod_inventory_storage.instance_identifier (
_id uuid PRIMARY KEY,
instance_id uuid REFERENCES diku_mod_inventory_storage.instance (_id),
identifier_type_id uuid REFERENCES diku_mod_inventory_storage.identifier_type (_id),
identifier text,
CONSTRAINT unique_identifiers UNIQUE (identifier, identifier_type_id, instance_id)
);
DROP TABLE IF EXISTS diku_mod_inventory_storage.instance_contributor;
CREATE TABLE diku_mod_inventory_storage.instance_contributor (
_id uuid PRIMARY KEY,
instance_id uuid REFERENCES diku_mod_inventory_storage.instance (_id),
contributor_type_id uuid REFERENCES diku_mod_inventory_storage.contributor_type (_id),
contributor_name_type_id uuid REFERENCES diku_mod_inventory_storage.contributor_name_type (_id),
primary_contributor boolean,
name text
);
CREATE UNIQUE INDEX instance_contributor_name_type_name_idx
ON diku_mod_inventory_storage.instance_contributor (name, contributor_name_type_id, instance_id)
WHERE contributor_type_id IS NULL;
CREATE UNIQUE INDEX instance_contributor_type_name_type_name_idx
ON diku_mod_inventory_storage.instance_contributor (name, contributor_name_type_id, contributor_type_id, instance_id)
WHERE contributor_type_id IS NOT NULL;
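For illustration, a prefix search entering through the proposed contributor table might look like the following. This is only a sketch, not from the original comment, and whether the btree index can actually serve the LIKE prefix depends on the collation/operator class (e.g. text_pattern_ops):

```sql
-- Prefix search on contributor name, joining back to instance for display.
SELECT i._id, i.jsonb->>'title' AS title, c.name AS contributor
FROM diku_mod_inventory_storage.instance_contributor c
JOIN diku_mod_inventory_storage.instance i ON i._id = c.instance_id
WHERE c.name LIKE 'Pratchett%'
ORDER BY c.name
LIMIT 30;
```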
The tables might be populated directly from the ingested JSON for testing (don't know how this SQL would work for millions of records though):
INSERT INTO diku_mod_inventory_storage.instance_identifier (_id, instance_id, identifier_type_id, identifier)
SELECT gen_random_uuid(),
instance_id,
REPLACE ( (jsonb_array_elements(ids)::json->'identifierTypeId')::text, '"', '')::UUID identifier_type_id,
REPLACE ( (jsonb_array_elements(ids)::json->'value')::text, '"', '') identifier
FROM (SELECT _id instance_id,
jsonb->'identifiers' AS ids
FROM diku_mod_inventory_storage.instance) AS identifiers;
INSERT INTO diku_mod_inventory_storage.instance_contributor (_id, instance_id, contributor_type_id, contributor_name_type_id, name)
SELECT gen_random_uuid(),
instance_id,
REPLACE ( (jsonb_array_elements(contribs)::json->'contributorTypeId')::text, '"', '')::UUID contributor_type_id,
REPLACE ( (jsonb_array_elements(contribs)::json->'contributorNameTypeId')::text, '"', '')::UUID contributor_name_type_id,
REPLACE ( (jsonb_array_elements(contribs)::json->'name')::text, '"', '') contributor_name
FROM (SELECT _id instance_id,
jsonb->'contributors' AS contribs
FROM diku_mod_inventory_storage.instance) AS contributors;
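A sketch of an equivalent formulation that avoids the REPLACE() quote-stripping: jsonb_array_elements() in a LATERAL join plus the ->> operator, which returns the field as unquoted text directly (offered only as an untested alternative, shown for the identifier table):

```sql
INSERT INTO diku_mod_inventory_storage.instance_identifier
            (_id, instance_id, identifier_type_id, identifier)
SELECT gen_random_uuid(),
       i._id,
       (ident->>'identifierTypeId')::uuid,  -- ->> yields text without quotes
       ident->>'value'
FROM diku_mod_inventory_storage.instance i
CROSS JOIN LATERAL jsonb_array_elements(i.jsonb->'identifiers') AS ident;
```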
(Outstanding: the primary_contributor column is not yet populated; we could also have a flag indicating which contributor to sort by if there are multiple primary contributors; a table for instance publishers should also be created.) In my small setup with only the sample data from mod-inventory-storage, I have these records, indexed by 'identifier' and 'name' respectively (if I understand PostgreSQL indexes correctly), which I'd expect would be a super fast entry point for searching (not sure about the sorting):
okapi_modules=# SELECT *
okapi_modules-# FROM diku_mod_inventory_storage.instance_identifier;
_id | instance_id | identifier_type_id | identifier
--------------------------------------+--------------------------------------+--------------------------------------+----------------------
0933a4e9-7efd-4aca-967a-146920f9272f | 69640328-788e-43fc-9c3c-af39e243f3b7 | 913300b2-03ed-469a-8179-c1092c991227 | 0747-0088
deb6a606-1220-40a8-9583-52b945af5a55 | 69640328-788e-43fc-9c3c-af39e243f3b7 | c858e4f2-2b6b-4385-842b-60732ee14abb | 84641839
0e108f63-2da8-4ebf-97e2-defb4e30e896 | 30fcc8e7-a019-43f4-b642-2edc389f4501 | 593b78cb-32f3-44d1-ba8c-63fd5e6989e6 | AJMEAZ
6fddd008-1919-4fc2-afa9-95a9e0314c1b | 30fcc8e7-a019-43f4-b642-2edc389f4501 | 913300b2-03ed-469a-8179-c1092c991227 | 0002-9343
d3a4c0de-dff8-470d-a235-ed560be18600 | 30fcc8e7-a019-43f4-b642-2edc389f4501 | c858e4f2-2b6b-4385-842b-60732ee14abb | med49002270
4b4c4687-febb-47b5-86e4-5a0733453f4c | 7fbd5d84-62d1-44c6-9c45-6cb173998bbd | 5d164f4b-0b15-4e42-ae75-cfcf85318ad9 | ocn956625961
bdc976e9-627a-40af-889f-47c7e52bf823 | f31a36de-fcf8-44f9-87ef-a55d06ad21ae | 7f907515-a1bf-4513-8a38-92e1a07c539d | B01LO7PJOE
70e1cae8-30cd-4043-9514-523607d2e7d4 | a89eccf0-57a6-495e-898d-32b9b2210f2f | 8261054f-be78-422d-bd51-4ed9f33c3422 | 0552142352
50416f0c-7942-4fe9-a53e-26288a3af716 | a89eccf0-57a6-495e-898d-32b9b2210f2f | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9780552142352
89bb755d-146b-4674-a923-642d0c71dabf | 6506b79b-7702-48b2-9774-a1c538fdd34e | 8261054f-be78-422d-bd51-4ed9f33c3422 | 0956687695
81f6f065-717d-496d-bdc5-a4d5b412ebf1 | 6506b79b-7702-48b2-9774-a1c538fdd34e | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9780956687695
287256da-bc2d-48c3-89e2-67d4077c2e6f | cf23adf0-61ba-4887-bf82-956c4aae2260 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 1447294130
c364068f-c166-470a-b573-c6ab76d311a6 | cf23adf0-61ba-4887-bf82-956c4aae2260 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9781447294130
9e1e1ac2-a506-4a46-8953-c8b0b25646dd | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 5d164f4b-0b15-4e42-ae75-cfcf85318ad9 | ocn968777846
24410134-f5fa-4f5d-afa4-50daf00e8439 | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 9786316800312
fdf3dce3-2757-4c24-a71c-5a416d0f748a | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 8261054f-be78-422d-bd51-4ed9f33c3422 | 6316800312
781fd59b-085f-4ddb-8029-683b5e53f5f2 | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | b5d8cdc4-9441-487c-90cf-0c7ec97728eb | OTA-1031 Otá Records
67686153-c54e-4fd6-be5c-91b3948b183a | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 7e591197-f335-4afb-bc6d-a6d76ca3bace | (OCoLC)968777846
(18 rows)
okapi_modules=#
okapi_modules=# SELECT *
okapi_modules-# FROM diku_mod_inventory_storage.instance_contributor;
_id | instance_id | contributor_type_id | contributor_name_type_id | primary_contributor | name
--------------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+---------------------+----------------------
6ec264a7-c909-4ebd-8ed9-c4ed1ec76381 | 7fbd5d84-62d1-44c6-9c45-6cb173998bbd | | 2b94c631-fca9-a892-c730-03ee529ffe2a | | Fielding, Helen
b11ab3cf-622d-483f-badc-6e4034162575 | f31a36de-fcf8-44f9-87ef-a55d06ad21ae | | 2e48e713-17f3-7c13-09f8-23845bb210aa | | Creator A
17c81250-044b-46da-bd87-c2a6b4aa6ad5 | f31a36de-fcf8-44f9-87ef-a55d06ad21ae | | e8b311a6-3b21-03f2-2269-dd9310cb2d0a | | Creator B
9e136fae-9d41-4153-9501-3c86d1fdd00d | a89eccf0-57a6-495e-898d-32b9b2210f2f | | 2b94c631-fca9-a892-c730-03ee529ffe2a | | Pratchett, Terry
d3a491ab-9261-4b13-874f-72fd4e971bc1 | 6506b79b-7702-48b2-9774-a1c538fdd34e | | 2b94c631-fca9-a892-c730-03ee529ffe2a | | Barnes, Adrian
31745a13-1058-4ec6-a057-15ab73c3c098 | cf23adf0-61ba-4887-bf82-956c4aae2260 | | 2b94c631-fca9-a892-c730-03ee529ffe2a | | Novik, Naomi
ea443a06-b6a6-4985-a4d5-cf9bbe82b166 | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 2b94c631-fca9-a892-c730-03ee529ffe2b | 2b94c631-fca9-a892-c730-03ee529ffe2a | | Sosa, Omar
68fe6a2a-7c9f-4a7f-b024-19ce27c00bff | e54b1f4d-7d05-4b1a-9368-3c36b75d8ac6 | 2b94c631-fca9-a892-c730-03ee529ffe2b | 2b94c631-fca9-a892-c730-03ee529ffe2a | | Keita, Seckou, 1977-
63515316-37fd-4953-90d1-e16d0b6c550e | bbd4a5e1-c9f3-44b9-bfdf-d184e04f0ba0 | | 2b94c631-fca9-a892-c730-03ee529ffe2a | | White, Vincent E.
(9 rows)
okapi_modules=#
|
| Comment by shale99 [ 11/Jan/18 ] |
|
hey Niels Erik Nielsen - if I understood correctly, I am not sure it makes a difference to Postgres whether the btree comes from a separate column or from a field in the jsonb. At the end of the day the same index is created no matter what the source (column or field in jsonb), and response time should be identical. The issue with multiple columns (or multiple fields - it's the same) is just that Postgres is very picky: you need to create an index for title+contrib and also for contrib+title - yes, the ordering matters for the index to be used. You would also need the same for title+pub and pub+title, and then ASC vs DESC matters too, so that creates an awful lot of indexes to support the multi-column sort - and at the end of the day this stuff really hurts scalability. If the added value of this is critical it can be supported by adding the indexes and it will perform well - I am just not sure it is... |
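To make the combinatorics above concrete, a sketch with hypothetical index names: supporting sorts on just two fields in both column orders, before even counting ASC/DESC variants, already means something like the following, and each additional sortable field multiplies the count:

```sql
-- One composite expression index per column order the planner must satisfy:
CREATE INDEX title_contrib_sort_idx ON diku_mod_inventory_storage.instance
  ((jsonb->>'title'), (jsonb->>'contributors'));
CREATE INDEX contrib_title_sort_idx ON diku_mod_inventory_storage.instance
  ((jsonb->>'contributors'), (jsonb->>'title'));
-- ...plus DESC variants if descending sorts must also be index-assisted.
```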
| Comment by Niels Erik Nielsen [ 11/Jan/18 ] |
|
But - this is not just from a separate column, no? Rather from two new columns in two new separate tables? The entry into 'instance' would be through the '_id' index only. Certainly not saying this means we would be home safe, just that it seemed sufficiently different from the current approach to potentially have other performance characteristics. |
| Comment by Niels Erik Nielsen [ 11/Jan/18 ] |
|
And, btw, this is a distraction, given this issue's title, ".. for folio-alpha", sorry. |
| Comment by Wayne Schneider [ 12/Jan/18 ] |
|
Conclusions from discussions on 1/11:

 relid  | indexrelid | schemaname                 | relname  | indexrelname                    | idx_scan | idx_tup_read | idx_tup_fetch | pg_size_pretty
--------+------------+----------------------------+----------+---------------------------------+----------+--------------+---------------+----------------
 204521 | 204533     | diku_mod_inventory_storage | instance | instance_contributors_idx_like  |        0 |            0 |             0 | 3516 MB
 204521 | 347310     | diku_mod_inventory_storage | instance | instance_publication_idx        |        0 |            0 |             0 | 3516 MB
 204521 | 204531     | diku_mod_inventory_storage | instance | instance_title_idx_like         |        0 |            0 |             0 | 3516 MB
 204521 | 204527     | diku_mod_inventory_storage | instance | instance_pkey                   |        1 |            1 |             1 | 3516 MB
 204521 | 204532     | diku_mod_inventory_storage | instance | instance_identifiers_idx_like   |        2 |            0 |             0 | 3516 MB
 204521 | 347309     | diku_mod_inventory_storage | instance | instance_contributors_idx       |       19 |          750 |           750 | 3516 MB
 204521 | 347326     | diku_mod_inventory_storage | instance | instance_contrib_title_idx      |       38 |         2170 |          2170 | 3516 MB
 204521 | 204538     | diku_mod_inventory_storage | instance | instance_languages_idx_gin      |       88 |     49622320 |             0 | 3516 MB
 204521 | 204530     | diku_mod_inventory_storage | instance | instance_id_idx                 |      332 |          326 |           326 | 3516 MB
 204521 | 204537     | diku_mod_inventory_storage | instance | instance_instancetypeid_idx_gin |      444 |     50205382 |             0 | 3516 MB
 204521 | 204534     | diku_mod_inventory_storage | instance | instance_title_idx_gin          |      607 |      3924339 |             0 | 3516 MB
 204521 | 204535     | diku_mod_inventory_storage | instance | instance_contributors_idx_gin   |      607 |      3398622 |             0 | 3516 MB
 204521 | 204536     | diku_mod_inventory_storage | instance | instance_identifiers_idx_gin    |      619 |      2595910 |             0 | 3516 MB
 204521 | 204529     | diku_mod_inventory_storage | instance | instance_title_idx              |     1808 |     46971579 |      46971578 | 3516 MB

Stats on the "like" indexes seem to indicate that they are not being used.
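If the "like" indexes really are dead weight, dropping them would reclaim disk space and write-time maintenance. A sketch only - confirm with the team before dropping anything in a shared environment:

```sql
-- idx_scan is ~0 for these, per the pg_stat_all_indexes output above:
DROP INDEX IF EXISTS diku_mod_inventory_storage.instance_title_idx_like;
DROP INDEX IF EXISTS diku_mod_inventory_storage.instance_contributors_idx_like;
DROP INDEX IF EXISTS diku_mod_inventory_storage.instance_identifiers_idx_like;
```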
Codex app provides search using the indexes ID, title, identifier, ISBN, ISSN, contributor, subject, classification, and publisher. These are the indexes we decided on for mod_inventory_storage.instance:
{
"tableName": "instance",
"withMetadata": true,
"generateId": false,
"pkColumnName": "_id",
"ginIndex": [
{
"fieldName": "title", <------------- for title search
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "contributors", <------------- for contributors search
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "identifiers", <------------- for identifiers search
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "instanceTypeId", <------------- for type search (as a filter)
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "languages", <------------- for languages search (as a filter)
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "classifications", <------------- for CODEX classifications search
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "subjects", <------------- for CODEX subjects search
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
}
],
"index": [
{
"fieldName": "title", <------------- for title sort
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "id", <------------- for join with holdings for location filter
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "contributors", <------------- for contributors sort
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
},
{
"fieldName": "publication", <------------- for publication sort
"tOps": "ADD",
"caseSensitive": false,
"removeAccents": true
}
]
}
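For reference, a "ginIndex" entry such as the title one corresponds roughly to DDL of this shape - an assumption inferred from the index conditions visible in the EXPLAIN output elsewhere in this ticket, not the exact statement RMB emits:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index over the normalized title expression, so that
-- case-insensitive, accent-insensitive regex/LIKE searches can use it.
CREATE INDEX instance_title_idx_gin
  ON diku_mod_inventory_storage.instance
  USING gin ((lower(f_unaccent(jsonb->>'title'))) gin_trgm_ops);
```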
Next steps: Does this sound right? shale99 – if you have any jmeter scripts put together, feel free to send them my way. |
| Comment by Wayne Schneider [ 12/Jan/18 ] |
|
Niels Erik Nielsen – I think the discussion of db architecture is not a distraction. I also was thinking of supplemental tables that might be useful and more performant for sorting, e.g. a table of titles or of contributor names that you could use in joins with the UUID of the instance as the relational key. Of course, post-alpha, but we need to capture the ideas somewhere. |
| Comment by shale99 [ 14/Jan/18 ] |
|
Wayne Schneider - sounds good to me. I have started working on the JMeter scripts - hope to have something to pass on to you tomorrow
| Comment by shale99 [ 14/Jan/18 ] |
|
hey Wayne Schneider - I am attaching two files.

Right now, I see that the slowest request is actually not a double sort but a single sort on title - we need to see the EXPLAIN for that:

query=((title="water*" or contributors adj "\"name\": \"water*\"" or identifiers adj "\"value\": \"water*\"")) and instanceTypeId="2b94c631-fca9-a892-c730-03ee529ffe2c" and languages="eng" sortby title/sort.descending

I will update the jmeter script to receive a file with queries and push those queries into a template, so that we can run many queries via a list in an external file. Also, note that there is an email plugin that can send emails if we want to push out notifications on poor performance test results (<- recommended) |
| Comment by shale99 [ 14/Jan/18 ] |
|
two new files attached - a new .jmx file, inventory_only_csv.jmx, which will look for a CSV file (you will need to change this path in the .jmx file). Tomorrow I will add the ability to do a full display dynamically of records returned from queries |
| Comment by shale99 [ 15/Jan/18 ] |
|
one more jmx script - this one is pretty cool, as you can use it to load the codex as well as the inventory: inventory_csv_with_fulldisplay.jmx

First, you have the following set of variables:

Second, pass in the CSV file as attached with the previous script. If a request takes longer than the duration max limit, or a non-200 response is returned, an assertion failure will occur. This means there are three current jmeter scripts: 1. simulates the UI by running exactly what the UI runs (including config requests). I will tweak (1) so that it can also receive a file with terms |
| Comment by shale99 [ 15/Jan/18 ] |
|
quick update: the csv has been updated to be comma-delimited with offset and limit - see the example attached. Note that these are mandatory (at least on the first query; after that, the second entry in the csv, for example, will use the previous offset and limit values) |
| Comment by Wayne Schneider [ 16/Jan/18 ] |
|
We have identified a performance issue when you search for multiple terms in the typeahead search box on the UI, e.g. searching for "golda m". This results in this CQL query:

(title="golda m*" or contributors adj "\"name\": \"golda m*\"" or identifiers adj "\"value\": \"golda m*\"") sortby title

This CQL generates this SQL WHERE clause:

WHERE (((lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')) AND lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))))
OR (lower(f_unaccent(instance.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))))
OR (lower(f_unaccent(instance.jsonb->>'identifiers')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))
ORDER BY lower(f_unaccent(instance.jsonb->>'title'))
LIMIT 30 OFFSET 0

...which performs very badly on folio-alpha.
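The semantic difference between the two regex shapes can be checked directly in psql with simplified expressions (a sketch; lookarounds and punctuation classes dropped, and the sample title is made up):

```sql
-- Two independent regexes: 'golda' and 'm...' may each match anywhere
-- in the string, so the terms need not be adjacent.
SELECT 'golda meir: a biography' ~ '(^|[[:space:]])golda($|[[:space:]])'
   AND 'golda meir: a biography' ~ '(^|[[:space:]])m' AS two_independent_terms,
-- One combined regex: the terms must be adjacent - far more selective.
       'golda meir: a biography' ~ '(^|[[:space:]])golda[[:space:]]+m'
       AS adjacent_terms;
```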
The troublesome part of the query is this:

(lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')) AND lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))

If this bit is changed to:

(lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda([[:punct:]]|[[:space:]])+m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))

performance is very good. shale99 says:
In addition, we found a possible issue with the DB configuration resulting in slower-than-desired IO (adding a second or more to query-result sorting). We're changing the storage type of the RDS instance to see if that will improve things – but the query issue above has much greater impact. |
| Comment by Wayne Schneider [ 16/Jan/18 ] |
|
Update: testing with Marc Johnson, we found that the change in RMB 17 that makes adj the default CQL relation for = results in a better query:

WHERE ((lower(f_unaccent(instance.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda([[:punct:]]|[[:space:]])+m.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))
OR (lower(f_unaccent(instance.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))))
OR (lower(f_unaccent(instance.jsonb->>'identifiers')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda([[:punct:]]|[[:space:]])+m.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))
ORDER BY lower(f_unaccent(instance.jsonb->>'title'))
LIMIT 30 OFFSET 0

which returns very fast (< 500 ms in the UI). So once all the code is in the mainline, we should be in pretty good shape, right? |
| Comment by Wayne Schneider [ 17/Jan/18 ] |
|
Open issues:
|
| Comment by Wayne Schneider [ 01/Feb/18 ] |
|
With mod-inventory-storage-7.2.1-SNAPSHOT.73 (https://github.com/folio-org/mod-inventory-storage/commit/6bc1b9c32cf23a559d62c99b8b25b1a94855d864) and ui-inventory-1.0.100080 (https://github.com/folio-org/ui-inventory/commit/5e634a0202664d9c303fed4239c48d7f48a1d83f), which adds barcode searching to the typeahead search, search performance is really bad. Here is the CQL generated by the UI for the typeahead search:

(title="golda*" or contributors adj "\"name\": \"golda*\"" or identifiers adj "\"value\": \"golda*\"" or item.barcode="golda*") sortby title

This translates into this WHERE clause, according to the log:

WHERE (((lower(f_unaccent(instance_holding_item_view.jsonb->>'title')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))
OR (lower(f_unaccent(instance_holding_item_view.jsonb->>'contributors')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))))
OR (lower(f_unaccent(instance_holding_item_view.jsonb->>'identifiers')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'))))
OR (lower(f_unaccent(instance_holding_item_view.it_jsonb->>'barcode')) ~ lower(f_unaccent('(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))')))
ORDER BY lower(f_unaccent(instance_holding_item_view.jsonb->>'title'))
LIMIT 30 OFFSET 0

EXPLAIN (ANALYZE, BUFFERS) for the query shows:
Limit (cost=1.49..67711.68 rows=30 width=1469) (actual time=9828.727..101728.613 rows=30 loops=1)
Buffers: shared hit=1255688 read=195908
-> Nested Loop Left Join (cost=1.49..8854237.51 rows=3923 width=1469) (actual time=9828.725..101728.584 rows=30 loops=1)
Filter: ((lower(f_unaccent((instance.jsonb ->> 'title'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((instance.jsonb ->> 'contributors'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((instance.jsonb ->> 'identifiers'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((item.jsonb ->> 'barcode'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text))
Rows Removed by Filter: 625319
Buffers: shared hit=1255688 read=195908
-> Nested Loop Left Join (cost=1.08..2328365.55 rows=1796473 width=1373) (actual time=0.097..14632.255 rows=625348 loops=1)
Buffers: shared hit=1108970 read=195908
-> Index Scan using instance_title_idx on instance (cost=0.68..1403133.96 rows=1796473 width=1341) (actual time=0.021..9422.059 rows=625347 loops=1)
Buffers: shared hit=434698 read=195907
-> Index Scan using holdings_record_instanceid_idx on holdings_record (cost=0.41..0.48 rows=4 width=32) (actual time=0.002..0.002 rows=0 loops=625347)
Index Cond: (lower(f_unaccent((instance.jsonb ->> 'id'::text))) = lower(f_unaccent((jsonb ->> 'instanceId'::text))))
Buffers: shared hit=625349
-> Index Scan using item_holdingsrecordid_idx on item (cost=0.41..0.51 rows=3 width=32) (actual time=0.000..0.000 rows=0 loops=625348)
Index Cond: (lower(f_unaccent((holdings_record.jsonb ->> 'id'::text))) = lower(f_unaccent((jsonb ->> 'holdingsRecordId'::text))))
Buffers: shared hit=6
Planning time: 72.537 ms
Execution time: 101728.737 ms
Interestingly, in the UI this took about 7 minutes (granted, I have a bad network connection right now, but not that bad), vs. 102 sec. using psql. Not sure how to explain that. If I remove the barcode from the query and run it against just the instance table, I get:
Limit (cost=3040.72..3040.79 rows=30 width=1413) (actual time=244.438..244.468 rows=30 loops=1)
Buffers: shared hit=1259 read=1172
-> Sort (cost=3040.72..3042.04 rows=530 width=1413) (actual time=244.436..244.460 rows=30 loops=1)
Sort Key: (lower(f_unaccent((jsonb ->> 'title'::text))))
Sort Method: top-N heapsort Memory: 79kB
Buffers: shared hit=1259 read=1172
-> Bitmap Heap Scan on instance (cost=420.37..3025.07 rows=530 width=1413) (actual time=231.874..242.597 rows=75 loops=1)
Recheck Cond: ((lower(f_unaccent((jsonb ->> 'title'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((jsonb ->> 'contributors'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text) OR (lower(f_unaccent((jsonb ->> 'identifiers'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text))
Rows Removed by Index Recheck: 53
Heap Blocks: exact=128
Buffers: shared hit=1259 read=1172
-> BitmapOr (cost=420.37..420.37 rows=530 width=0) (actual time=231.447..231.447 rows=0 loops=1)
Buffers: shared hit=1112 read=1143
-> Bitmap Index Scan on instance_title_idx_gin (cost=0.00..61.34 rows=179 width=0) (actual time=29.525..29.525 rows=110 loops=1)
Index Cond: (lower(f_unaccent((jsonb ->> 'title'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))golda.*($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text)
Buffers: shared hit=19 read=28
-> Bitmap Index Scan on instance_contributors_idx_gin (cost=0.00..157.28 rows=171 width=0) (actual time=182.956..182.956 rows=64 loops=1)
Index Cond: (lower(f_unaccent((jsonb ->> 'contributors'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"name":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text)
Buffers: shared hit=1079 read=1094
-> Bitmap Index Scan on instance_identifiers_idx_gin (cost=0.00..201.35 rows=180 width=0) (actual time=18.961..18.961 rows=0 loops=1)
Index Cond: (lower(f_unaccent((jsonb ->> 'identifiers'::text))) ~ '(^|[[:punct:]]|[[:space:]]|(?=[[:punct:]]|[[:space:]]))"value":([[:punct:]]|[[:space:]])+"golda.*"($|[[:punct:]]|[[:space:]]|(?<=[[:punct:]]|[[:space:]]))'::text)
Buffers: shared hit=14 read=21
Planning time: 15.046 ms
Execution time: 246.780 ms
So that's plenty fast. |
| Comment by Wayne Schneider [ 14/Feb/18 ] |
|
Closing this issue as the basic work is done...we can raise individual issues for creation of a performance testing environment, further development of a set of tests, setting up a baseline for measuring performance, and CI integration. At this point if there are particular performance problems we are working on, it makes sense to create issues for them. |