[RMB-438] contributor search returns no results when name contains "-" Created: 11/Jul/19  Updated: 03/Jan/20  Resolved: 12/Aug/19

Status: Closed
Project: RAML Module Builder
Components: None
Affects versions: None
Fix versions: 26.4.0

Type: Bug Priority: P2
Reporter: Hongwei Ji Assignee: Julian Ladisch
Resolution: Done Votes: 0
Labels: back-end, platform-backlog, q3.2-2019
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: Java Source File HyphenWordToPhrase.java    
Issue links:
Blocks
is blocked by RMB-432 fulltext: word splitting and punctuat... Closed
Relates
relates to UIREQ-295 Requests on item link in requests not... Closed
relates to CIRCSTORE-138 Filtering Requests by Request Status ... Closed
relates to FOLIO-2154 add jMeter tests for searching contri... Closed
relates to UIIN-623 When ISSN in record has hyphen, searc... Closed
relates to RMB-455 do not apply lower for fulltext Closed
Sprint: CP: sprint 69
Story Points: 3
Development Team: Core: Platform

 Description   

Following should return result but it does not.

/inventory/instances?limit=30&query=(contributors =/@name "Vigneras, Louis-André,") sortby title

Reason:

folio=# select to_tsvector('simple', f_unaccent('Vigneras, Louis-André')) @@ to_tsquery('simple', f_unaccent('Vigneras,<->Louis-André'));
 ?column? 
----------
 f
(1 row)
folio=# select to_tsvector('simple', f_unaccent('Vigneras, Louis-André'));
                   to_tsvector                    
--------------------------------------------------
 'andre':4 'louis':3 'louis-andre':2 'vigneras':1
(1 row)

folio=# select to_tsquery('simple', f_unaccent('Vigneras,<->Louis-André'));
                      to_tsquery                      
------------------------------------------------------
 'vigneras' <-> ( 'louis-andre' & 'louis' & 'andre' )
(1 row)


 Comments   
Comment by Jakub Skoczen [ 23/Jul/19 ]

Hongwei Ji Can you provide an explanation of why do you think this doesn't work? I can't tell from the snippet you attach. Thanks.

Comment by Hongwei Ji [ 23/Jul/19 ]

The first SQL in the snippet is essentially generated by the AP call. As you can see, the query result is "false".

Comment by Jakub Skoczen [ 24/Jul/19 ]

Ok, here's a good explanation on why it doesn't match: https://dba.stackexchange.com/questions/204588/how-to-search-hyphenated-words-in-postgresql-full-text-search

In short: terms likes "a-b" are tokenized into three tokens: "a-b", "a", "b" during the call to to_tsvector. Calling "to_tsquery" does the same but when combined with FOLLOWED BY (<->) the query no longer matches.

Comment by Jakub Skoczen [ 24/Jul/19 ]

HyphenWordToPhrase.java

Hongwei Ji Julian Ladisch Guys, I am attaching an incomplete converter that handles hyphenated words similarly to how "phraseto_tsquery" does it but the output can be fed into to_tsquery direclty. It will skip all hyphens and replace them with '<->' but if there is a single hyphen it will also prepend it with <2>

Comment by Jakub Skoczen [ 30/Jul/19 ]

Julian Ladisch what I managed to discover is that calling to_tsquery('a<->b-c') gives a<->(b-c & b & c), which will never match. As you can see to_tsquery expands each hyphenated word into AND group of three tokens. One way to fix it (and I provided some rudimentary code above to do that) is to generate <2>c<->d for each hyphenated token and used that instead as input to to_tsquery.

The code attached to this issue could/should be probably extended to a proper “tokenizer” that can handle any punctuation or whitespace and generate TS queries that in a similar way to built-in function like plainto_tsquery and phraseto_tsquery (which we can't use because they remove wildcards. This would also address RMB-432 Closed .

Comment by Julian Ladisch [ 31/Jul/19 ]

My implementation idea is this:

select replace(to_tsquery('simple', 'Vigneras,<->Louis-André')::text, '&', '<->')::tsquery;
                         replace
----------------------------------------------------------
 'vigneras' <-> ( 'louis-andré' <-> 'louis' <-> 'andré' )

Replace space by <-> and replace special characters (& <-> :) by other punctuation.
Pass it through to_tsquery.
Replace any & by <->.

Comment by Julian Ladisch [ 06/Aug/19 ]

This also works for UUID matching:

postgres=# select to_tsvector('0cc175b9-c0f1-b6a8-31c3-99e269772661') @@ replace(to_tsquery('simple', '0cc175b9-c0f1-b6a8-31c3-99e269772661')::text, '&', '<->')::tsquery;
 ?column?
----------
 t

postgres=# select to_tsvector('0cc175b9-c0f1-b6a8-31c3-99e269772661') , replace(to_tsquery('simple', '0cc175b9-c0f1-b6a8-31c3-99e269772661')::text, '&', '<->')::tsquery;
                                            to_tsvector                                            |                                                  replace 
---------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------
 '0cc175b9':2 '0cc175b9-c0f1-b6a8-31c3-99e269772661':1 '31c3':5 '99e269772661':6 'b6a8':4 'c0f1':3 | '0cc175b9-c0f1-b6a8-31c3-99e269772661' <-> '0cc175b9' <-> 'c0f1' <-> 'b6a8' <-> '31c3' <-> '99e269772661'
Comment by Julian Ladisch [ 08/Aug/19 ]

Postponed to q3.2-2019 as requested by Jakub.

Comment by Jakub Skoczen [ 09/Aug/19 ]

Asked Adam Dickmeiss to review so we can merge before the end of the sprint.

Comment by Julian Ladisch [ 10/Aug/19 ]

Adam asked on https://github.com/folio-org/raml-module-builder/pull/489:

I hope that the replace-stuff will not hinder indexes from being used.

create table t (jsonb jsonb);
create index on t using gin (to_tsvector('simple', jsonb->>'title'));
explain select * FROM t where to_tsvector('simple', jsonb->>'title') @@ replace(to_tsquery('simple', 'abc=xyz')::text, '&', '<->')::tsquery;
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=12.00..16.27 rows=1 width=32)
   Recheck Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''abc'' <-> ''xyz'''::tsquery)
   ->  Bitmap Index Scan on t_to_tsvector_idx  (cost=0.00..12.00 rows=1 width=0)
         Index Cond: (to_tsvector('simple'::regconfig, (jsonb ->> 'title'::text)) @@ '''abc'' <-> ''xyz'''::tsquery)

This shows that the index t_to_tsvector_idx is used because the replace is on the right hand side of @@ only. This is similar to jsonb->>'title' LIKE f_unaccent('abc xyz').

Generated at Thu Feb 08 23:18:40 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.