[FOLIO-1246] Implement Postgres Full Text Search functionality Created: 15/May/18  Updated: 27/Jun/19  Resolved: 27/Jun/19

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Umbrella Priority: P3
Reporter: Jakub Skoczen Assignee: Jakub Skoczen
Resolution: Done Votes: 0
Labels: inventory, performance, search_enhancements, sprint40, sprint41, sprint42, sprint43, sprint44, sprint45
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
blocks UXPROD-745 Tenant Sort Order Setting Open
blocks UXPROD-1045 Fulltext Search Closed
blocks UXPROD-1048 Fulltext indexing for textual fields Closed
blocks UXPROD-1049 Search by automatically deriving rela... Closed
blocks UISE-68 Codex search treats Swedish diacritic... Closed
blocks UISE-69 Codex search results treats Swedish d... Closed
Relates
relates to MODINVSTOR-163 queries involving contributors or sub... In Progress
relates to CQLPG-37 Schema driven CQL generation and PGFT... Closed
relates to CQLPG-40 Find out what mod-inventory-storage n... Closed
relates to UISE-80 Search results are not sorted by title Closed
Sprint:
Development Team: Core: Platform

 Description   

In Durham we have discussed that the current approach to searching (which, essentially, is based on translating CQL queries to complex PSQL queries with LIKE/ILIKE/regex and creating BTREE and GIN indices on particular down-cased/unaccented columns, with support from RMB and CQL2PG) may not be flexible enough to provide the level of search functionality and performance expected. (Note: we do not have concrete requirements here, we are assuming that what is expected matches functionality in modern OPAC systems).

One of the approaches discussed was to use the Postgres built-in "Full Text Search" capability, documented here:

https://www.postgresql.org/docs/current/static/textsearch-intro.html

In short, the essential pieces of full text search in PG are two functions/data structures:

  • tsvector
  • tsquery

to_tsvector takes a “document”, which can be a single column, multiple columns from a single table or from multiple tables, a view, or an entire JSONB document, and returns a tsvector: a list (vector) of lexemes, which are normalised tokens (case/accent normalised, with stopwords filtered and roots extracted).

to_tsquery does the same with the user input, e.g. a list of words to match. tsquery supports the same boolean operators as CQL, so the translation is rather trivial: & for AND, | for OR, ! for NOT, <-> for PROX.
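For illustration, a hedged sketch of those operators in the tsquery text syntax, matched against a tsvector with the @@ operator:

```sql
-- Boolean operators inside a tsquery:
SELECT to_tsvector('english', 'the quick brown fox') @@
       to_tsquery('english', 'quick & (fox | dog) & !cat');
-- <-> requires the operands to be adjacent lexemes:
SELECT to_tsvector('english', 'quick brown fox') @@
       to_tsquery('english', 'quick <-> brown');
-- both queries return true
```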

One can run both to_tsvector and to_tsquery at query time, but this obviously does not scale, so normally to_tsvector is applied at index time to build a GIN index on the selected data.
While tsvector/tsquery are the most fundamental data structures, PG includes more advanced functionality expected of a search engine: ranking, highlighting/snippets, etc. It does not include a specialized facet data structure; this needs to be implemented through other means.
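The index-time approach above can be sketched as follows (table and column names are hypothetical):

```sql
-- Precompute lexemes in a GIN expression index; queries that repeat the
-- same to_tsvector() expression can then use the index.
CREATE INDEX idx_instance_title_fts
  ON instance
  USING gin (to_tsvector('english', title));

SELECT id, title
  FROM instance
 WHERE to_tsvector('english', title) @@ to_tsquery('english', 'boy');
```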

Since PG 10, tsvector can take the whole JSONB contents in one go:
https://wiki.postgresql.org/wiki/New_in_postgres_10#Full_Text_Search_support_for_JSON_and_JSONB

This, however, drops the keys and only indexes the “contents” of the JSONB, so we would not be able to “scope” the searches to a given field. One proposed approach here is to use triggers to do advanced indexing:
https://stackoverflow.com/questions/45680936/how-to-implement-full-text-search-on-complex-nested-jsonb-in-postgresql
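The trigger-based approach from the link above could look roughly like this (all names hypothetical; a dedicated tsvector column per field keeps searches scopable to that field):

```sql
ALTER TABLE instance ADD COLUMN title_tsv tsvector;

CREATE FUNCTION instance_title_tsv_update() RETURNS trigger AS $$
BEGIN
  -- index only the 'title' key of the JSONB, preserving field scope
  NEW.title_tsv := to_tsvector('english', coalesce(NEW.jsonb->>'title', ''));
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER instance_title_tsv_trg
  BEFORE INSERT OR UPDATE ON instance
  FOR EACH ROW EXECUTE PROCEDURE instance_title_tsv_update();

CREATE INDEX idx_instance_title_tsv ON instance USING gin (title_tsv);
```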



 Comments   
Comment by Jakub Skoczen [ 15/May/18 ]

Guys, one thing that may or may not be a deal breaker for this approach is facets. Here is one approach:

http://akorotkov.github.io/blog/2016/06/17/faceted-search/

Comment by shale99 [ 15/May/18 ]

that is how facets are implemented currently in rmb

Comment by shale99 [ 17/May/18 ]

i think this boils down to requirements.
it seems like Postgres full text covers most of the search functionality; i think we need to understand what needed functionality we are currently not supporting.

is there someone who can list the OPAC search features needed in v1 so we can analyze how they work in postgres full text?

1. would our query require stop words indexed in the title for exact title matches, while of course also allowing matches on titles without stopwords, but potentially with a lower rank (hence ORing two query phrases)?
ex. a query for: to be or not to be

2. matching words but also synonyms, but ranking down synonym matches in the result set as they may introduce a lot of noise; same for thesaurus and spelling corrections (these really pollute a result set many times) <- is this even a requirement?

3. stemming. once again, we may want to rank an exact word match higher than a stemmed form of the word, especially in title matching. it seems like Postgres uses the Snowball stemmer, which is a very aggressive stemmer and will occasionally stem words into non-words:
revival -> reviv
happy -> happi
etc...
in the past i have migrated away from this stemmer to much better ones available in Solr (this may be an option in Postgres as well)

4. as mentioned - matching words across multiple fields in the json but still maintaining structure so that for example, title matches are ranked before description or something of the sort.

5. facets - will most probably not perform as well as in solr

6. do we just want the ranking?

i think maybe all the above are solvable / partially solvable

i guess, the important thing is to get some sort of requirements of what we need for v1.

Comment by shale99 [ 22/May/18 ]

looked a bit more into this , and some additional thoughts:

1. i would recommend indexing the content as is with the full text feature (may need to be per field), e.g. using the built-in 'simple' configuration, which applies no stemming or stop-word removal:
CREATE INDEX idx_fts ON harvard_mod_inventory_storage.instance USING gin ( to_tsvector('simple', jsonb) );

meaning, create a word vector with positions without using any lexical functionality (do not remove stop words, do not stem, do not add synonyms, etc...)

during query time i would take the user query and do the following:
a. analyze it using the lexical tools (remove stop words, stemming, ...)
b. take the query as is
c. run an OR between the two and boost up (b)

motivation:
stop word removal, stemming, etc. add a lot of noise, and if our index is built with that information stripped it will be very difficult to return exact matches, since friendly will be stemmed to friend, stop words removed, etc. we will not be able to boost up the actual word that was queried in many cases. i think the above is a good middle ground. we can then add additional query clauses with lower and lower boosts:
1. exact words entered
2. without stopwords
3. stemmed
4. synonyms / thesaurus (maybe)

highlighting of the matching terms is also supported so that this can be displayed in the UI
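the highlighting mentioned above is exposed through the ts_headline function; a minimal sketch (the markup options are illustrative):

```sql
-- Wrap matching terms in markup for display in the UI:
SELECT ts_headline('english',
                   'About a boy and his dog',
                   to_tsquery('english', 'boy'),
                   'StartSel=<b>, StopSel=</b>');
```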

note that there is a perf penalty for ranking; i don't have numbers on that

Comment by Jakub Skoczen [ 30/May/18 ]

shale99 I am not exactly sure what you propose above. If you build the tsvector GIN index without stemming or synonyms you will not be able to apply stemming/synonyms at query time (tsquery), since what you search for would not be in the index. Stemming actually reduces the size of the index rather than increasing it: with stemming you only index the stemmed lexeme, not all possible variations. Synonyms are obviously the inverse. It is true that there doesn't seem to be a way to "boost" non-stemmed, exact matches with PG, but that may or may not be a real problem (see below).

As far as stop-words are concerned, enabling stop-word removal should help precision and performance because stop-words rarely add anything but noise. When it comes to exact matches (which, as you note, may be problematic with stop-word removal enabled) PG has the phraseto_tsquery function, which generates an expression with proximity operators in place of stop-words. We would need to test the stop-word functionality with vile cases like "to be or not to be".
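A quick sketch of that vile case, worth verifying on a real instance; with the default english configuration every token of the phrase is a stop-word:

```sql
-- phraseto_tsquery folds dropped stop-words into distance operators:
SELECT phraseto_tsquery('english', 'the fat rats');   -- 'fat' <-> 'rat'
-- but when every token is a stop-word, nothing is left to match:
SELECT phraseto_tsquery('english', 'to be or not to be');
```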

Comment by shale99 [ 30/May/18 ]

i guess what i am saying is: once you index without stop words, and stem, the originals are lost forever; a case like "to be or not to be" will never be found. i would also argue that phrase matching with proximity would not be enough to overcome this.

for example, two titles: "about a boy" and "about the boys on the team" would both match the query "about a boy", as the stop words are dropped and words are stemmed, and there is no way to tell them apart since the indexes don't have the unstemmed forms + stopwords.

my suggestion was to index as is , and move the stemming + stopwords to query time - so that if someone queries "about a boy"
you create a query "about a boy" with a high boost OR "about boy" <- this would return the correct result
you can also query "about the boys" with a high boost OR "about boy" lower boost <- this would return the correct results first

the index does become smaller with stop word removal and stemming, and gets larger with synonyms. i agree, and i guess that wasn't clear from my comments.

Comment by Jakub Skoczen [ 30/May/18 ]

> i guess what i am saying is, once you index without stop words, and stem, the originals are lost forever, a case like "to be or not to be" will never be found. i would also argue that phrase matching with proximity would not be enough to over come this.

That assumes all the words in "to be or not to be" are considered stop-words; something that should be checked.

> for example, two titles: "about a boy" and "about the boys on the team" would both match the query "about a boy" as the stop words are dropped and words are stemmed and there is no way to solve these issues since the indexes dont have the unstemmed + stopwords.

Yes, it would match both. But that is as it should be, and here ts_rank kicks in: the shorter and better match would have a higher word frequency and thus a higher rank. So "about a boy" will still be the top hit.
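For concreteness, a hedged sketch of that ranking claim. Note that the default ts_rank normalization ignores document length, so a length-penalizing normalization flag is needed for the shorter title to win:

```sql
SELECT title,
       ts_rank(to_tsvector('english', title),
               to_tsquery('english', 'boy'),
               1)   -- normalization 1: divide rank by 1 + log(doc length)
         AS rank
  FROM (VALUES ('about a boy'),
               ('about the boys on the team')) AS t(title)
 ORDER BY rank DESC;
```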

> my suggestion was to index as is , and move the stemming + stopwords to query time - so that if someone queries "about a boy"
> you create a query "about a boy" with a high boost OR "about boy" <- this would return the correct result
> you can also query "about the boys" with a high boost OR "about boy" lower boost <- this would return the correct results first

Yes, you could do that, but it would defeat the original purpose: to remove stop-words as a source of noise for ranking, and to stem so that the user can find documents where query keywords appear in different variations or spellings. To do that you would need to stem and remove stop-words at index time but also index the original tokens with higher (non-stemmed) and lower (stop-word) boost factors. I am not sure if PG supports that.

Comment by shale99 [ 30/May/18 ]

i don't know how ts_rank works, but i am not sure it ends there: the words may appear in additional fields, so it usually doesn't just come down to a field-vs-field match rank. for example, if one document has the title being queried in a few fields (say, the doc is discussing the title in question) and another has the incorrect title in the title field, and there is a boost for the title field, will that be enough to push the wrong title up although it is clearly not what the user is looking for? i don't know; ranking is quite complex, but i have not researched ts_rank enough to comment.

I don't know how EDS looks now, but a few years back, when i was doing a lot of analysis on their search quality, it was evident that they were doing something like stop-word (and maybe stemming) removal from the index (i think, i don't know for sure), and their known-item search was just flat out bad because they were missing this info in the index. maybe there is an EBSCO rep who can comment on that

Comment by shale99 [ 30/May/18 ]

we can start with stop-word removal and stemming and see where that goes; if it's good enough then it's the simplest solution

Comment by Jakub Skoczen [ 30/May/18 ]

ts_rank, at least the basic one, looks at the word frequency in the tsvector. You would need a separate tsvector for each field to differentiate ranking/searching across fields (note: there is one way to work around this through setweight, but I don't think it is flexible enough for a schema-less system). But assuming you have a separate title index, your "about a boy" example would work as expected, with stop-words and stemming enabled.

Now, this is where we are getting to the "core of the problem". Postgres offers a lot of building blocks for full text search but I don't think it can be considered an out-of-the-box solution (at least as compared to ES).

Going back to your stemming and stop-words "problem". One way to attack it with PG is to use the setweight function. Behind this cryptic name is a function which allows you to "label" each lexeme (normalised token, in PG nomenclature) with a tag, e.g. "about boy" labeled A is [about:A, boy:A]. So let's assume that we index our records as follows:

  • we index the un-stemmed content, with stop-words removed, as label A
  • we index the stemmed content, with stop-words removed, as label B
  • we index the stop-words with label C

This would require a fairly complex setup of Postgres dictionaries (see https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html) and INDEX or TRIGGERS, but let's assume it can be implemented (of course this needs to be verified).

At query time we do exactly the same: we run the query through those three dictionaries, give the appropriate label to each subquery, and then OR (||) the results into a single expression.

Now, here comes ts_rank again: it accepts an array of weights, one per label:

{D-weight, C-weight, B-weight, A-weight}

This would allow you to boost exact matches while still being able to match on stemmed words and stop-words. But I agree the SQL code around it will be quite complex.
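A hedged sketch of the labeling-plus-weights idea, substituting the built-in 'simple' and 'english' configurations for the custom dictionaries described above:

```sql
-- Label the un-processed text A and the stemmed/stop-worded text B,
-- then weight A above B at ranking time; the array is {D, C, B, A}.
SELECT ts_rank('{0.1, 0.1, 0.4, 1.0}',
               setweight(to_tsvector('simple',  'about a boy'), 'A') ||
               setweight(to_tsvector('english', 'about a boy'), 'B'),
               to_tsquery('simple', 'about & boy'));
```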

Comment by shale99 [ 30/May/18 ]

> But assuming you have a separate title index your "about a boy" example would work as expected, with stop-words and stemming enabled.

i might be missing something: if i have multiple tsvectors, meaning i have matched the query against multiple fields, and i matched "about the boys" in the title of one record and "about a boy" in the description of another, which record wins?

why get into the complexity though? let's just index one way, the data as is, and query it in a smart way, letting the scoring push down the bad results so people need to actually scroll to find them. maybe we can also set a limit on the ranking so that really low ranks do not get returned

Comment by shale99 [ 30/May/18 ]

also, for keyword queries, for example "us foreign affairs", finding titles matching "foreign affairs in the us" would seem to make sense. so this means we would want to send multiple queries in any case

the queries i would send would be

  • a phrase query (if stop words were included in the query, keep them as part of the phrase query)
  • a phrase query without the stopwords
  • a keyword query without the stopwords
    this gets trickier if someone queries "rowling harry potter", where we would need to allow keyword matching across multiple fields

what i am seeing is that when indexing without stopwords the positions are still maintained. for example, for "arms and dollars", while the lexemes are arm, dollar, the positions seem to be arm {0}, dollar {2}. so if i query arms <-> dollars, i will not find it.
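This can be reproduced directly: the dropped stop-word still occupies a position, so the adjacency operator fails while an explicit distance succeeds:

```sql
SELECT to_tsvector('english', 'arms and dollars') @@
       to_tsquery('english', 'arms <-> dollars');   -- false: 'and' left a gap
SELECT to_tsvector('english', 'arms and dollars') @@
       to_tsquery('english', 'arms <2> dollars');   -- true
```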

Comment by shale99 [ 30/May/18 ]

on the technical side - the full text searching seems to be very fast

on 4.6 million records, the index size on the title:

okapi_modules=# SELECT pg_size_pretty(pg_relation_size('diku_mod_inventory_storage.idx_fts_title'));
pg_size_pretty
----------------
337 MB
(1 row)

Comment by Jakub Skoczen [ 30/May/18 ]

shale99 My writeup about using setweight was addressing the problem you mentioned about searching for phrases where stop-words are critical and about boosting exact (non-stemmed) matches. I think this is a minority of cases and certainly not something we should be spending time optimising now. I propose a simple approach to evaluate PGFT: use the default "english" dictionary to index (with default stemming and stop-words) and query, and test both normal and phrase search (phraseto_tsquery).

Comment by shale99 [ 31/May/18 ]

the thing with searching, especially here, is the details; these requests will come and we will need to see if the Postgres engine can support them in a sensible way.

what i have seen so far is that it gives a nice amount of functionality, but doing some things will take bending and may become inefficient

for example:
1. librarians would probably want titles that start with their terms boosted up; may be doable, didn't yet see how
2. running searches on the perf env (4.6M records), i see cases where i enter a query and the stemmed version is ranked before the original query i entered, which may or may not be ok; input welcomed:
for example:
to_tsquery('english', 'constitutional & amendment')
searches for:
Index Cond: -> '''constitut'' & ''amend'''
3. i don't see facet support; if we can facet on the top N returned records this may be ok, but if we need to facet on the entire result set this would be a deal breaker
4. i don't see out-of-the-box CJK support, meaning tokenization; if we want to normalize to simplified Chinese, we need to handle this ourselves

my preference would be to go with Postgres just because it would be easier to maintain a single data store, and it gives a lot more functionality and is faster than what we are doing now.
my hesitation is that once we get into the details, it is still not as flexible as Solr or ES, and that may become a problem. as a v1 solution, it is better than our current one

Generated at Thu Feb 08 23:11:55 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.