Codex search results treat Swedish diacritics as ASCII equivalents when sorting results
Description
Activity

Magda Zacharska January 14, 2022 at 8:17 PM
The Kiwi bugfest environment is hardly US-centric: in addition to the default English analyzer, it also has Russian, Hebrew and Arabic analyzers, but it does not have a Swedish one. You might want to create a request for devops to add this analyzer and rebuild the index afterwards, so you can verify Swedish-specific behavior.

Theodor Tolstoy (One-Group.se) January 14, 2022 at 8:12 AM
Has this behavior been addressed in the Elasticsearch implementation? Bugfest-Kiwi is not a good example since it is US-centric, so it is a bit hard to verify. Do you want me to create a similar ticket for Inventory?
Holly Mistlebauer December 21, 2021 at 6:50 PM
This ticket has been closed because it is over 3 years old and has a very low priority.

Mike Taylor August 21, 2018 at 3:13 PM
Your two-phase approach to resolving this is perfect.
Of course, once we introduce relevance-based sorting, we have a completely different task in the Codex multiplexer. We'll need to add a relevanceScore field to the instance schema at https://github.com/folio-org/raml/blob/master/schemas/codex/instance.json and require all Codex sources to include this field in records returned as the result of a relevance search; then the multiplexer can merge the streams by maintaining decreasing scores.
But of course the result of that merge may be unsatisfactory, depending on the different scoring algorithms used by the different back-ends. For example, if the EBSCO KB source issues relevance scores between 1 and 100, and the FOLIO Inventory source issues scores between 0.0 and 1.0, all EBSCO KB records will appear to be more relevant than all FOLIO Inventory records.
So my proposal at this point is just that we move relevance-ranking into a completely separate issue, and try to avoid letting its unique complexities confuse matters in this one.
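A minimal sketch of the merge described above, to illustrate the scale problem. It assumes each source returns its records already sorted by decreasing relevanceScore; the record contents, source names and the function name merge_by_relevance are hypothetical, and this is not the multiplexer's actual code.

```python
import heapq


def merge_by_relevance(*streams):
    """Merge streams that are each already sorted by decreasing relevanceScore."""
    return list(
        heapq.merge(*streams, key=lambda record: record["relevanceScore"], reverse=True)
    )


# Hypothetical records: one source scores between 1 and 100, the other between 0.0 and 1.0.
ebsco_kb = [
    {"source": "kb", "title": "Title A", "relevanceScore": 87},
    {"source": "kb", "title": "Title B", "relevanceScore": 12},
]
inventory = [
    {"source": "inventory", "title": "Title C", "relevanceScore": 0.95},
    {"source": "inventory", "title": "Title D", "relevanceScore": 0.40},
]

merged = merge_by_relevance(ebsco_kb, inventory)
for record in merged:
    print(record["source"], record["title"], record["relevanceScore"])
```

With these made-up scores, even the weakest KB hit (12) is merged ahead of the strongest Inventory hit (0.95), which is the unsatisfactory outcome described above.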

Jakub Skoczen August 21, 2018 at 9:46 AM (edited)
You said:
"decent search engine might still surface a few hits containing "aland", but with significantly reduced relevance scores"
This assumes that the results would be sorted by relevance and not directly by the locale-specific collation rules. I'd like to understand the expectations a bit better, because the two approaches can be contradictory, e.g.:
1. Assuming matching (search) also considers the unaccented version (stripped diacritics), the only sensible sort seems to be "relevancy ranking" that would boost result positions depending on how close the match is to the original query: e.g. in Polish, a search for paczki ("packages") would find both paczki and pączki ("doughnuts") but boost paczki results in the relevancy score. A search for pączki ("doughnuts") would do the inverse. Note that in Polish ą sorts after a in alphabetical order.
2. Sorting by the strict collation rules may only make sense for exact matching: e.g. a search for pączki would not yield results for paczki. If it did, those results would get sorted higher up, which I assume would not be the expectation?
I am sure you can find Swedish equivalents for the above.
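To make approach 1 concrete, here is a rough sketch of diacritic-insensitive matching with a boost for exact matches. The titles, the two-level scoring and the names fold and search are made up for illustration; a real search engine would compute relevance quite differently.

```python
import unicodedata


def fold(text):
    """Strip combining marks, e.g. 'pączki' becomes 'paczki'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query, titles):
    """Return matching titles, exact matches ahead of diacritic-folded matches."""
    query = query.lower()
    hits = []
    for title in titles:
        words = title.lower().split()
        if query in words:
            score = 2  # exact match, diacritics included
        elif fold(query) in [fold(word) for word in words]:
            score = 1  # matches only after stripping diacritics
        else:
            continue
        hits.append((score, title))
    # sort() is stable, so equally scored hits keep their input order.
    hits.sort(key=lambda hit: -hit[0])
    return [title for _, title in hits]


# Hypothetical titles
titles = ["Świeże pączki", "Paczki pocztowe"]
print(search("paczki", titles))
print(search("pączki", titles))
```

Both queries find both titles, but each ranks its own exact spelling first. Sorting the same hits by strict Polish collation would ignore that distinction, which is the conflict between the two approaches.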
Overview: When conducting title level searches in Codex for titles containing Swedish diacritics (å,ä,ö) the sort functionality behaves as if those characters are reduced to their ASCII equivalents (a,o).
Steps to Reproduce:
Create a couple of records in Inventory with titles starting with a, å, ä or similar
For example:
"Den aktansvärda"
"Den äkta varan"
"Den åländska skärgården"
Go to Codex and conduct a title search for Den
Sort by title ascending order (arrow pointing up)
Expected Results:
Results are returned alphabetically (Swedish):
"Den aktansvärda"
"Den åländska skärgården"
"Den äkta varan"
Actual Results:
Results are sorted according to the attached image:
"Den aktansvärda"
"Den äkta varan"
"Den åländska skärgården"
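The difference between the two orderings can be reproduced with a small sketch. It assumes the actual behavior resembles a collation that ignores diacritics and spaces when comparing titles; that is a guess about the cause, not a confirmed diagnosis. The key functions are hypothetical and only approximate real collation rules.

```python
import unicodedata

# Digits first, then the Swedish alphabet, where å, ä and ö follow z.
SWEDISH_ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def fold_diacritics(text):
    """Strip combining marks, so 'å' and 'ä' both become 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def folded_key(title):
    """Assumed approximation of the actual behavior: ignore diacritics and spaces."""
    return [ch for ch in fold_diacritics(title.lower()) if ch.isalnum()]


def swedish_key(title):
    """Simplified Swedish ordering: å, ä, ö are distinct letters sorted after z."""
    key = []
    for ch in unicodedata.normalize("NFC", title.lower()):
        if ch not in RANK:
            ch = fold_diacritics(ch)  # other accents, e.g. é, fall back to the base letter
        if ch in RANK:
            key.append(RANK[ch])
    return key


titles = ["Den aktansvärda", "Den äkta varan", "Den åländska skärgården"]
print(sorted(titles, key=folded_key))
print(sorted(titles, key=swedish_key))
```

The first call reproduces the order listed under Actual Results, and the second the order under Expected Results.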
Additional Information: Will add these in separate issues.
Note:
This particular issue might be solved by changing the collation on the relevant tables in Postgres to Swedish (see https://www.postgresql.org/docs/9.1/static/collation.html), but I believe that this issue is related to a bigger discussion on search technology.