[FOLIO-1246] Implement Postgres Full Text Search functionality Created: 15/May/18 Updated: 27/Jun/19 Resolved: 27/Jun/19 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Umbrella | Priority: | P3 |
| Reporter: | Jakub Skoczen | Assignee: | Jakub Skoczen |
| Resolution: | Done | Votes: | 0 |
| Labels: | inventory, performance, search_enhancements, sprint40, sprint41, sprint42, sprint43, sprint44, sprint45 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Issue links: |
|
| Sprint: | |
| Development Team: | Core: Platform | ||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
In Durham we discussed that the current approach to searching (which is essentially based on translating CQL queries into complex Postgres queries with LIKE/ILIKE/regex, and on creating BTREE and GIN indices on particular down-cased/unaccented columns, with support from RMB and CQL2PG) may not be flexible enough to provide the level of search functionality and performance expected. (Note: we do not have concrete requirements here; we are assuming that what is expected matches the functionality of modern OPAC systems.) One of the approaches discussed was to use the Postgres built-in "Full Text Search" capability, documented here: https://www.postgresql.org/docs/current/static/textsearch-intro.html In short, the essential pieces of full text search in PG are two functions/data structures:
tsvector takes a “document” (which can be a single column, multiple columns from one or more tables, a view, or the entire JSONB document) and returns a list (vector) of lexemes, which are normalised tokens: case/accent normalised, with stop words filtered and roots extracted. tsquery does the same with the user input, e.g. a list of words to match. tsquery supports the same booleans as CQL, so the translation is rather trivial: & for AND, | for OR, ! for NOT, <-> for PROX. One can run both tsvector and tsquery at query time, but this obviously does not scale, so normally tsvector is used at indexing time to build a GIN index on the selected data. Since PG 10, tsvector can take the whole JSONB contents in one go. This, however, drops the keys and only indexes the “contents” of the JSONB, so we would not be able to “scope” searches to a given field. One proposed approach here is to use triggers to do advanced indexing. |
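The pieces described above can be sketched in a few statements. This is a minimal illustration, not a concrete proposal; the table name `instance` and JSONB column `jsonb_doc` are hypothetical stand-ins:

```sql
-- PG 10+: to_tsvector over JSONB indexes only the values; keys are dropped.
SELECT to_tsvector('english', jsonb '{"title": "The Old Man and the Sea"}');
-- 'man':3 'old':2 'sea':6

-- Expression index so the tsvector is computed at indexing time, not per query
-- (hypothetical table/column names):
CREATE INDEX idx_instance_fts ON instance
  USING GIN (to_tsvector('english', jsonb_doc));

-- tsquery with & (AND), | (OR), ! (NOT), <-> (followed by):
SELECT jsonb_doc
  FROM instance
 WHERE to_tsvector('english', jsonb_doc) @@ to_tsquery('english', 'old & sea');
```

Note the first result demonstrates the scoping problem mentioned above: nothing in the tsvector records that the lexemes came from the "title" key.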
| Comments |
| Comment by Jakub Skoczen [ 15/May/18 ] |
|
Guys, one thing that may or may not be a deal breaker for this approach is facets. Here is one approach: |
| Comment by shale99 [ 15/May/18 ] |
|
That is how facets are currently implemented in RMB.
| Comment by shale99 [ 17/May/18 ] |
|
I think this boils down to requirements: is there someone who can list the OPAC search features needed in v1 so we can analyze how they work in Postgres full text?
1. Would our query require stop words indexed in the title for exact title matches, while of course also allowing matching on titles without stop words but potentially with a lower rank (hence ORing two query phrases)?
2. Matching words but also synonyms, while ranking down synonym matches in the result set, as they may introduce a lot of noise; same for thesaurus and spelling corrections (these really pollute a result set many times). Is this even a requirement?
3. Stemming. Once again, we may want to rank the exact word match higher than a stemmed form of the word, especially in title matching. It seems Postgres uses the Snowball stemmer, which is a very hard stemmer and will occasionally stem words into non-words.
4. As mentioned: matching words across multiple fields in the JSON while still maintaining structure, so that, for example, title matches are ranked before description matches or something of the sort.
5. Facets: will most probably not perform as well as in Solr.
6. Do we just want the ranking?
I think all of the above are solvable, or at least partially solvable; the important thing is to get some sort of requirements for what we need in v1. |
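The "stemming into non-words" concern in point 3 is easy to reproduce; the Snowball english stemmer routinely emits lexemes that are not dictionary words:

```sql
-- Snowball's english stemmer produces non-word lexemes:
SELECT to_tsvector('english', 'ponies universities');
-- 'poni':1 'univers':2
```

This is harmless for matching (the query side is stemmed the same way) but it does mean the index cannot distinguish an exact word match from a stemmed one.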
| Comment by shale99 [ 22/May/18 ] |
|
Looked a bit more into this, and some additional thoughts:
1. I would recommend indexing the content as is with the full text feature (may need to be per field), meaning: create a word vector with positions without using any lexical functionality (do not remove stop words, do not stem, do not add synonyms, etc.). At query time I would take the user query and do two things.
Motivation: highlighting of the matching terms is also supported, so that this can be displayed in the UI. Note that there is a perf penalty for ranking; I don't have numbers on that. |
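The "index as is" idea roughly corresponds to Postgres's built-in `simple` configuration, which only case-folds and keeps every token. Comparing it with `english` shows the difference:

```sql
-- 'simple' keeps stop words and skips stemming; only case-folding is applied:
SELECT to_tsvector('simple', 'To be or not to be');
-- 'be':2,6 'not':4 'or':3 'to':1,5

-- 'english' drops every token here, since all of them are stop words:
SELECT to_tsvector('english', 'To be or not to be');
-- (empty tsvector)
```

Whether `simple` alone can then support stemming/synonyms at query time is exactly the point debated in the following comments.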
| Comment by Jakub Skoczen [ 30/May/18 ] |
|
shale99 I am not exactly sure what you propose above. If you build the tsvector GIN index without stemming or synonyms, you will not be able to apply stemming/synonyms at query time (tsquery), since what you search for would not be in the index. Stemming actually lowers the size of the index rather than increasing it: with stemming you only index the stemmed lexeme, not all possible variations. Synonyms are obviously the inverse. It is true that there doesn't seem to be a way to "boost" non-stemmed, exact matches with PG, but it may or may not be a real problem (see below). As far as stop words are concerned, enabling stop-word removal should help precision and performance, because stop words rarely add anything but noise. When it comes to exact matches (which, as you note, may be problematic with stop words enabled), PG has the phraseto_tsquery function, which generates an expression with proximity operators in place of stop words. We would need to test the stop-word functionality with vile cases like "to be or not to be". |
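The phraseto_tsquery behavior mentioned above looks like this; note how an internal stop word becomes a distance operator rather than disappearing, and how the "vile case" degenerates:

```sql
-- "some" is a stop word; its position survives as a <2> distance operator:
SELECT phraseto_tsquery('english', 'cats ate some rats');
-- 'cat' <-> 'ate' <2> 'rat'

-- An all-stop-word phrase yields an empty query, so it can match nothing:
SELECT phraseto_tsquery('english', 'to be or not to be');
```

So phraseto_tsquery preserves phrase shape around stop words, but it cannot rescue a query made entirely of them.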
| Comment by shale99 [ 30/May/18 ] |
|
I guess what I am saying is: once you index without stop words, and stem, the originals are lost forever; a case like "to be or not to be" will never be found. I would also argue that phrase matching with proximity would not be enough to overcome this. For example, the two titles "about a boy" and "about the boys on the team" would both match the query "about a boy", as the stop words are dropped and the words are stemmed, and there is no way to solve these issues since the indexes don't have the unstemmed words plus the stop words. My suggestion was to index as is and move the stemming and stop words to query time, so that a query like "about a boy" can still be matched exactly. As for index size: it does become smaller with stop-word removal and stemming, and larger with synonyms; I agree, and I guess that wasn't clear from my comments. |
| Comment by Jakub Skoczen [ 30/May/18 ] |
Assuming all the words in "to be or not to be" are considered stop words; something that should be checked.
Yes, it would match both. But that is as it should be, and here ts_rank kicks in: the shorter and better match would have higher word frequency and thus a higher rank, so "about a boy" will still be the top hit.
Yes, you could do that, but it would defeat the original purpose: to remove stop words as a source of noise for ranking, and to stem as a way to let the user find documents where query keywords have different variations or spellings. To do that you would need to stem and remove stop words at index time but also index the original tokens, with a higher boost factor for non-stemmed forms and a lower one for stop words. I am not sure if PG supports that. |
| Comment by shale99 [ 30/May/18 ] |
|
I don't know how ts_rank works, but I am not sure it ends there: the words may appear in additional fields, so it usually doesn't come down to just a field-vs-field match rank. For example, if one document has the title being queried in a few fields (say, the doc is discussing the title in question) and another has the incorrect title in the title field, and there is a boost for the title field, will that be enough to push the wrong title up even though it is clearly not what the user is looking for? I don't know; ranking is quite complex, and I have not researched ts_rank enough to comment. I don't know how EDS looks now, but a few years back, when I was doing a lot of analysis on their search quality, it was evident that they were doing something like stop-word (and maybe stemming) removal from the index (I think; I don't know for sure), and their known-item search was just flat out bad because they were missing this info in the index. Maybe there is an EBSCO rep who can comment on that. |
| Comment by shale99 [ 30/May/18 ] |
|
We can start with the stop-word removal and stemming and see where that goes; if it's good enough, then it's the simplest solution. |
| Comment by Jakub Skoczen [ 30/May/18 ] |
|
ts_rank, at least the basic one, looks at the word frequency in the tsvector. You would need a separate tsvector for each field to differentiate ranking/searching across fields (note: there is one way to "work around" it through setweight, but I don't think that is flexible enough for a schema-less system). But assuming you have a separate title index, your "about a boy" example would work as expected, with stop words and stemming enabled. Now, this is where we get to the core of the problem: Postgres offers a lot of building blocks for full text search, but I don't think it can be considered an out-of-the-box solution (at least compared to ES). Going back to your stemming and stop-words "problem": one way to attack it with PG is to use the setweight function. Behind this cryptic name is a function which allows you to "label" each lexeme (normalised token, in PG nomenclature) with a tag, e.g. "about boy" labeled A is [about:A, boy:A]. So let's assume that the way we index our records is as follows:
This would require a fairly complex setup of Postgres dictionaries (see https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html) and INDEXes or TRIGGERs, but let's assume it can be implemented (of course this needs to be verified). At query time, we do exactly the same: we run the query through those dictionaries, give the appropriate label to each subquery, and then OR (||) the results into a single expression. Now, here comes ts_rank again: it accepts an array of weights for each label, i.e. {D-weight, C-weight, B-weight, A-weight}. This would allow you to boost matches where an exact match occurs while still being able to match on stemmed words and stop words. But I agree the SQL code around it will be quite complex. |
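A reduced sketch of the labeling scheme described above, using two configurations instead of three for brevity (the `instance` table, `fts` tsvector column, and `title` field are hypothetical):

```sql
-- Index each field twice: exact tokens labeled A, stemmed/stop-worded labeled B.
UPDATE instance
   SET fts = setweight(to_tsvector('simple',  title), 'A')   -- exact tokens
          || setweight(to_tsvector('english', title), 'B');  -- stemmed forms

-- At query time, OR the exact and stemmed subqueries, then weight the labels.
-- ts_rank's weight array is ordered {D, C, B, A}: exact (A) hits count
-- four times as much as stemmed (B) hits here.
SELECT id,
       ts_rank('{0.1, 0.2, 0.25, 1.0}',
               fts,
               to_tsquery('simple', 'boys') || to_tsquery('english', 'boys')) AS rank
  FROM instance
 WHERE fts @@ (to_tsquery('simple', 'boys') || to_tsquery('english', 'boys'))
 ORDER BY rank DESC;
```

A record containing the literal word "boys" matches both subqueries (A and B labels) and outranks one that only matches via the stemmed lexeme 'boy'.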
| Comment by shale99 [ 30/May/18 ] |
I might be missing something: if I have multiple tsvectors, meaning I have matched the query against multiple fields, and I matched "about the boys" in the title of one record and "about a boy" in the description of another, which record wins? Why get into that complexity, though? Let's just index one way (the data as is), query it in a smart way, and let the scoring push down the bad results so people need to actually scroll to find them. Maybe we can also set a limit on the rank so that really low-ranked results do not get returned.
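The rank cut-off idea could look like the following sketch; the `instance` table, `fts` column, and the 0.01 threshold are all illustrative assumptions, not measured values:

```sql
-- Drop very low-ranked hits entirely (0.01 is an arbitrary example cut-off):
SELECT id, ts_rank(fts, q) AS rank
  FROM instance,
       plainto_tsquery('english', 'about a boy') AS q
 WHERE fts @@ q
   AND ts_rank(fts, q) > 0.01
 ORDER BY rank DESC;
```

A sensible threshold would have to be tuned empirically, since ts_rank values are not normalized to any fixed scale.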
| Comment by shale99 [ 30/May/18 ] |
|
Also, for keyword queries, for example "us foreign affairs", finding titles matching "foreign affairs in the us" would seem to make sense, so this would mean we would want to send multiple queries in any case. The queries I would send would be:
What I am seeing is that when indexing without stop words, the positions are still maintained. For example, "arms and dollars": the lexemes are arm and dollar, and the positions are arm {1}, dollar {3}. So if I query arms <-> dollars, I will not find it.
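This position behavior is easy to verify, along with the phraseto_tsquery workaround mentioned earlier in the thread:

```sql
SELECT to_tsvector('english', 'arms and dollars');
-- 'arm':1 'dollar':3   (the removed stop word still occupies position 2)

-- Strict adjacency fails, because the lexemes are two positions apart:
SELECT to_tsvector('english', 'arms and dollars')
       @@ to_tsquery('english', 'arms <-> dollars');        -- false

-- phraseto_tsquery turns the stop word into a distance operator, so this matches:
SELECT to_tsvector('english', 'arms and dollars')
       @@ phraseto_tsquery('english', 'arms and dollars');  -- true
```

So phrase queries built with phraseto_tsquery account for the stop-word gaps, while hand-written <-> operators do not.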
| Comment by shale99 [ 30/May/18 ] |
|
On the technical side: full text searching seems to be very fast on 4.6 million records. The index size on the title:
okapi_modules=# SELECT pg_size_pretty(pg_relation_size('diku_mod_inventory_storage.idx_fts_title'));
| Comment by Jakub Skoczen [ 30/May/18 ] |
|
shale99 My writeup about using setweight was addressing the problem you mentioned about searching for phrases where stop words are critical, and about boosting exact (non-stemmed) matches. I think this is a minority of cases and certainly not something we should be spending time optimising now. I propose a simple approach to evaluate PG FTS: use the default "english" dictionary to index (with default stemming and stop words) and query, and test both normal and phrase search (phraseto_tsquery). |
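The proposed evaluation reduces to two query shapes against the same index; this is a sketch with the hypothetical `instance` table and `jsonb_doc` column used above:

```sql
-- Normal keyword search (terms ANDed, stop words dropped, stemming applied):
SELECT id
  FROM instance
 WHERE to_tsvector('english', jsonb_doc)
       @@ plainto_tsquery('english', 'foreign affairs');

-- Phrase search (word order and stop-word gaps enforced):
SELECT id
  FROM instance
 WHERE to_tsvector('english', jsonb_doc)
       @@ phraseto_tsquery('english', 'foreign affairs');
```

Both forms use the same GIN expression index, so evaluating them requires no extra indexing work beyond the default setup.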
| Comment by shale99 [ 31/May/18 ] |
|
The thing with searching, especially here, is the details; these requests will come, and we will need to see if the Postgres engine can support them in a sensible way. What I have seen so far is that it gives a nice amount of functionality, but doing some things will take some bending and may become inefficient, for example:
My preference would be to go with Postgres just because it would be easier to maintain a single data store, and it gives a lot more functionality and is faster than what we are doing. |