[RMB-438] contributor search returns no results when name contains "-" Created: 11/Jul/19 Updated: 03/Jan/20 Resolved: 12/Aug/19 |
|
| Status: | Closed |
| Project: | RAML Module Builder |
| Components: | None |
| Affects versions: | None |
| Fix versions: | 26.4.0 |
| Type: | Bug | Priority: | P2 |
| Reporter: | Hongwei Ji | Assignee: | Julian Ladisch |
| Resolution: | Done | Votes: | 0 |
| Labels: | back-end, platform-backlog, q3.2-2019 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||
| Issue links: |
|
||||||||||||||||||||||||||||||||
| Sprint: | CP: sprint 69 | ||||||||||||||||||||||||||||||||
| Story Points: | 3 | ||||||||||||||||||||||||||||||||
| Development Team: | Core: Platform | ||||||||||||||||||||||||||||||||
| Description |
|
Following should return result but it does not.
/inventory/instances?limit=30&query=(contributors =/@name "Vigneras, Louis-André,") sortby title
Reason: folio=# select to_tsvector('simple', f_unaccent('Vigneras, Louis-André')) @@ to_tsquery('simple', f_unaccent('Vigneras,<->Louis-André')); ?column? ---------- f (1 row) folio=# select to_tsvector('simple', f_unaccent('Vigneras, Louis-André')); to_tsvector -------------------------------------------------- 'andre':4 'louis':3 'louis-andre':2 'vigneras':1 (1 row) folio=# select to_tsquery('simple', f_unaccent('Vigneras,<->Louis-André')); to_tsquery ------------------------------------------------------ 'vigneras' <-> ( 'louis-andre' & 'louis' & 'andre' ) (1 row) |
| Comments |
| Comment by Jakub Skoczen [ 23/Jul/19 ] |
|
Hongwei Ji Can you provide an explanation of why do you think this doesn't work? I can't tell from the snippet you attach. Thanks. |
| Comment by Hongwei Ji [ 23/Jul/19 ] |
|
The first SQL in the snippet is essentially generated by the AP call. As you can see, the query result is "false". |
| Comment by Jakub Skoczen [ 24/Jul/19 ] |
|
Ok, here's a good explanation on why it doesn't match: https://dba.stackexchange.com/questions/204588/how-to-search-hyphenated-words-in-postgresql-full-text-search In short: terms likes "a-b" are tokenized into three tokens: "a-b", "a", "b" during the call to to_tsvector. Calling "to_tsquery" does the same but when combined with FOLLOWED BY (<->) the query no longer matches. |
| Comment by Jakub Skoczen [ 24/Jul/19 ] |
|
Hongwei Ji Julian Ladisch Guys, I am attaching an incomplete converter that handles hyphenated words similarly to how "phraseto_tsquery" does it but the output can be fed into to_tsquery direclty. It will skip all hyphens and replace them with '<->' but if there is a single hyphen it will also prepend it with <2> |
| Comment by Jakub Skoczen [ 30/Jul/19 ] |
|
Julian Ladisch what I managed to discover is that calling to_tsquery('a<->b-c') gives a<->(b-c & b & c), which will never match. As you can see to_tsquery expands each hyphenated word into AND group of three tokens. One way to fix it (and I provided some rudimentary code above to do that) is to generate <2>c<->d for each hyphenated token and used that instead as input to to_tsquery. The code attached to this issue could/should be probably extended to a proper “tokenizer” that can handle any punctuation or whitespace and generate TS queries that in a similar way to built-in function like plainto_tsquery and phraseto_tsquery (which we can't use because they remove wildcards. This would also address
|
| Comment by Julian Ladisch [ 31/Jul/19 ] |
|
My implementation idea is this: select replace(to_tsquery('simple', 'Vigneras,<->Louis-André')::text, '&', '<->')::tsquery; replace ---------------------------------------------------------- 'vigneras' <-> ( 'louis-andré' <-> 'louis' <-> 'andré' ) Replace space by <-> and replace special characters (& <-> :) by other punctuation. |
| Comment by Julian Ladisch [ 06/Aug/19 ] |
|
This also works for UUID matching: postgres=# select to_tsvector('0cc175b9-c0f1-b6a8-31c3-99e269772661') @@ replace(to_tsquery('simple', '0cc175b9-c0f1-b6a8-31c3-99e269772661')::text, '&', '<->')::tsquery; ?column? ---------- t postgres=# select to_tsvector('0cc175b9-c0f1-b6a8-31c3-99e269772661') , replace(to_tsquery('simple', '0cc175b9-c0f1-b6a8-31c3-99e269772661')::text, '&', '<->')::tsquery; to_tsvector | replace ---------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------- '0cc175b9':2 '0cc175b9-c0f1-b6a8-31c3-99e269772661':1 '31c3':5 '99e269772661':6 'b6a8':4 'c0f1':3 | '0cc175b9-c0f1-b6a8-31c3-99e269772661' <-> '0cc175b9' <-> 'c0f1' <-> 'b6a8' <-> '31c3' <-> '99e269772661' |
| Comment by Julian Ladisch [ 08/Aug/19 ] |
|
Postponed to q3.2-2019 as requested by Jakub. |
| Comment by Jakub Skoczen [ 09/Aug/19 ] |
|
Asked Adam Dickmeiss to review so we can merge before the end of the sprint. |
| Comment by Julian Ladisch [ 10/Aug/19 ] |
|
Adam asked on https://github.com/folio-org/raml-module-builder/pull/489:
create table t (jsonb jsonb);
create index on t using gin (to_tsvector('simple', jsonb->>'title'));
explain select * FROM t where to_tsvector('simple', jsonb->>'title') @@ replace(to_tsquery('simple', 'abc=xyz')::text, '&', '<->')::tsquery;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=12.00..16.27 rows=1 width=32)
Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''abc'' <-> ''xyz'''::tsquery)
-> Bitmap Index Scan on t_to_tsvector_idx (cost=0.00..12.00 rows=1 width=0)
Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''abc'' <-> ''xyz'''::tsquery)
This shows that the index t_to_tsvector_idx is used because the replace is on the right hand side of @@ only. This is similar to jsonb->>'title' LIKE f_unaccent('abc xyz'). |