[RMB-441] Consider to use = over LIKE when search term has no wildcard Created: 31/Jul/19  Updated: 31/May/21  Resolved: 31/May/21

Status: Closed
Project: RAML Module Builder
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P3
Reporter: Hongwei Ji Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
is blocked by MODINVSTOR-256 Improve performance of /instance-stor... Closed
Duplicate
duplicates CQLPG-80 == relation creates a LIKE statement Closed
Relates
relates to FOLIO-2197 Research "get inventory items by stat... Closed
Sprint: CP: Roadmap backlog
Development Team: Core: Platform

 Description   

While working on FOLIO-2197 Closed , it was observed that there is a difference between = and LIKE search. See below (copied from FOLIO-2197 Closed comments section) for details.

1) if there is only BTREE index: All searches use index except LIKE search for "Available".
Note: there are two types of item: "Available" vs "Checked out". The number of "Available" item (2837576) is more than 97% of total item records (2908646). In general, the btree index is much faster than gin. The results show the difference and it seems we should use = over LIKE when there is no wildcard.
2) if there is only GIN index: LIKE search for "Checked out" can use the index; others do not.

b-tree LIKE Available 7326 ms
b-tree LIKE Checked out 470 ms
b-tree = Available 1465 ms
b-tree = Checked out 292 ms
gin LIKE Available 7150 ms
gin LIKE Checked out 1305 ms
gin = Available 7151 ms
gin = Checked out 6624 ms
1) with just b-tree index: "item_status_name_idx" btree (lower(f_unaccent((jsonb -> 'status'::text) ->> 'name'::text)))
folio=# vacuum analyze supertenant_mod_inventory_storage.item;
VACUUM
folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) LIKE lower(f_unaccent('Available'));
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1056834.36..1056834.38 rows=1 width=8) (actual time=7326.753..7326.753 rows=1 loops=1)
   ->  Seq Scan on item  (cost=0.00..1049759.31 rows=2830022 width=0) (actual time=0.054..7060.456 rows=2837576 loops=1)
         Filter: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) ~~ 'available'::text)
         Rows Removed by Filter: 71070
 Planning time: 0.215 ms
 Execution time: 7326.774 ms
(6 rows)

folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) LIKE lower(f_unaccent('Checked out'));
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=58888.39..58888.40 rows=1 width=8) (actual time=469.995..469.996 rows=1 loops=1)
   ->  Index Scan using item_status_name_idx on item  (cost=0.43..58691.81 rows=78631 width=0) (actual time=0.445..461.876 rows=71070 loops=1)
         Index Cond: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) = 'checked out'::text)
         Filter: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) ~~ 'checked out'::text)
 Planning time: 0.158 ms
 Execution time: 470.019 ms
(6 rows)

folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) = lower(f_unaccent('Available'));
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=443045.82..443045.83 rows=1 width=8) (actual time=1465.292..1465.292 rows=1 loops=1)
   ->  Index Scan using item_status_name_idx on item  (cost=0.43..435970.76 rows=2830022 width=0) (actual time=0.063..1263.410 rows=2837576 loops=1)
         Index Cond: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) = 'available'::text)
 Planning time: 0.154 ms
 Execution time: 1465.317 ms
(5 rows)

folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) = lower(f_unaccent('Checked out'));
                                                                  QUERY PLAN                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=38444.33..38444.34 rows=1 width=8) (actual time=291.678..291.679 rows=1 loops=1)
   ->  Index Scan using item_status_name_idx on item  (cost=0.43..38247.75 rows=78631 width=0) (actual time=0.094..285.701 rows=71070 loops=1)
         Index Cond: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) = 'checked out'::text)
 Planning time: 0.154 ms
 Execution time: 291.704 ms
(5 rows)

2) With just gin index: "item_status_name_idx_gin" gin (lower(f_unaccent((jsonb -> 'status'::text) ->> 'name'::text)) gin_trgm_ops)
folio=# vacuum analyze supertenant_mod_inventory_storage.item;
VACUUM
folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) LIKE lower(f_unaccent('Available'));
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1056851.82..1056851.83 rows=1 width=8) (actual time=7149.606..7149.606 rows=1 loops=1)
   ->  Seq Scan on item  (cost=0.00..1049759.31 rows=2837003 width=0) (actual time=0.053..6895.869 rows=2837576 loops=1)
         Filter: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) ~~ 'available'::text)
         Rows Removed by Filter: 71070
 Planning time: 0.202 ms
 Execution time: 7149.625 ms
(6 rows)

folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) LIKE lower(f_unaccent('Checked out'));
                                                                     QUERY PLAN                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=189376.56..189376.57 rows=1 width=8) (actual time=1305.208..1305.208 rows=1 loops=1)
   ->  Bitmap Heap Scan on item  (cost=9923.29..189197.43 rows=71650 width=0) (actual time=43.358..1296.635 rows=71070 loops=1)
         Recheck Cond: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) ~~ 'checked out'::text)
         Rows Removed by Index Recheck: 337220
         Heap Blocks: exact=37348 lossy=33722
         ->  Bitmap Index Scan on item_status_name_idx_gin  (cost=0.00..9905.37 rows=71650 width=0) (actual time=37.155..37.155 rows=71070 loops=1)
               Index Cond: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) ~~ 'checked out'::text)
 Planning time: 0.161 ms
 Execution time: 1305.239 ms
(9 rows)

folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) = lower(f_unaccent('Available'));
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1056851.82..1056851.83 rows=1 width=8) (actual time=7151.198..7151.198 rows=1 loops=1)
   ->  Seq Scan on item  (cost=0.00..1049759.31 rows=2837003 width=0) (actual time=0.063..6896.387 rows=2837576 loops=1)
         Filter: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) = 'available'::text)
         Rows Removed by Filter: 71070
 Planning time: 0.146 ms
 Execution time: 7151.217 ms
(6 rows)

folio=# explain analyze SELECT COUNT(1) FROM supertenant_mod_inventory_storage.item  WHERE lower(f_unaccent(item.jsonb->'status'->>'name')) = lower(f_unaccent('Checked out'));
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1049938.44..1049938.45 rows=1 width=8) (actual time=6624.144..6624.144 rows=1 loops=1)
   ->  Seq Scan on item  (cost=0.00..1049759.31 rows=71650 width=0) (actual time=0.431..6615.394 rows=71070 loops=1)
         Filter: (lower(f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) = 'checked out'::text)
         Rows Removed by Filter: 2837576
 Planning time: 0.142 ms
 Execution time: 6624.164 ms
(6 rows)


 Comments   
Comment by Julian Ladisch [ 03/Aug/19 ]

https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-LIKE says:
"If pattern does not contain percent signs or underscores, then the pattern only represents the string itself; in that case LIKE acts like the equals operator."

Reproducable example:

create table t(jsonb jsonb);
create unique index on t ((jsonb->>'barcode'));
explain analyse select * from t where jsonb->>'barcode' LIKE 'abc';
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Index Scan using t_expr_idx on t  (cost=0.15..8.18 rows=1 width=32) (actual time=0.028..0.028 rows=0 loops=1)
   Index Cond: ((jsonb ->> 'barcode'::text) = 'abc'::text)
   Filter: ((jsonb ->> 'barcode'::text) ~~ 'abc'::text)
 Planning Time: 0.571 ms
 Execution Time: 0.075 ms

Note that the query uses the index and uses = to lookup 'abc' in the index.

PostgreSQL has already implemented the optimization to use = when the LIKE term has no wildcard. We should not duplicate such code.

Comment by Julian Ladisch [ 03/Aug/19 ]

Changing from LIKE to = doesn't make the query fast enough, 1465 ms is still to slow for other reasons. These other reasons are discussed in FOLIO-2197 Closed . If these other reasons have been resolved we most probably no longer need any LIKE/= change.
Therefore I suggest to close this issue as won't do and reopen it if needed.

Comment by Hongwei Ji [ 29/Aug/19 ]

Could it be that in simple cases, the optimizer can do it right but not in more complex cases like what we observed above? I am not sure it is a wasted effort to try to use the most suitable syntax where you can. I guess I do not have as much faith on the optimizer as Julian does.

Comment by Julian Ladisch [ 29/Aug/19 ]

This should be closed as "Won't do".
See MODINVSTOR-336 Closed why the performance tests should use the /item-storage/items endpoint and not the /inventory/items endpoint.
Please show a real use case that this issue solves.
I cannot reproduce the numbers shown in the description. Please provide complete information how to reproduce, for example a complete .sql file that I can feed into psql.

Comment by Hongwei Ji [ 29/Aug/19 ]

The numbers were from platform perf test env. The db was populated from https://s3.amazonaws.com/folio-public-sample-data/perf.tar.gz. It contains a bash file to copy the data to db.

Comment by Adam Dickmeiss [ 03/Sep/19 ]

If anything this shows that in many cases it does not make sense to provide make a separate gin-index .. (It modules can simply choose not to define it). Fewer indexes – faster updates/insertion.

By the way: if a module does not define a gin-index will our CQL2PgJSON fail a search using == ? Or just be terribly slow?

Comment by Jakub Skoczen [ 03/Sep/19 ]

Guys, I think it might make sense to try removing the GIN index from the status field in mod-inventory-storage and verifying the performance. But I don't think we will see general gains. it might make more sense to look at limiting the size of the results in the subquery (see RMB-467 Closed )

Comment by Hanna Hulevich [ 31/May/21 ]

Hongwei Ji Jakub Skoczen we are closing this as we think this is not needed. Please reopen if you think in different way. Thanks!

Generated at Thu Feb 08 23:18:58 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.