Codex search results treat Swedish diacritics as ASCII equivalents when sorting results

Description

Overview: When conducting title-level searches in Codex for titles containing Swedish diacritics (å, ä, ö), the sort functionality behaves as if those characters were reduced to their ASCII equivalents (a, o).

Steps to Reproduce:

  • Create a few records in Inventory with titles containing words that start with a, å, ä or similar
    For example:
    "Den aktansvärda"
    "Den äkta varan"
    "Den åländska skärgården"

  • Go to Codex and conduct a title search for Den

  • Sort by title in ascending order (arrow pointing up)

Expected Results:
Results are returned alphabetically (Swedish):
"Den aktansvärda"
"Den åländska skärgården"
"Den äkta varan"

Actual Results:
Results are sorted according to the attached image:
"Den aktansvärda"
"Den äkta varan"
"Den åländska skärgården"

Additional Information: Will add these in separate issues.

Note:
This particular issue might be solved by changing the collation on the relevant tables in Postgres to Swedish (see https://www.postgresql.org/docs/9.1/static/collation.html), but I believe that it is related to a bigger discussion on search technology.
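
To illustrate how the choice of collation produces the two orderings listed under Expected Results and Actual Results, here is a minimal Python sketch. It is not how Codex or Postgres sorts. The helper names (swedish_key, folded_key) are hypothetical, the Swedish key only knows the 29-letter alphabet with å, ä, ö placed after z, and the folded key strips diacritics and ignores spaces purely as an assumption that happens to reproduce the reported order.

```python
import unicodedata

# Swedish alphabet order: å, ä, ö are separate letters that follow z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(title):
    """Hypothetical sort key that ranks letters by Swedish alphabet order."""
    text = unicodedata.normalize("NFC", title).lower()
    # Characters outside the alphabet (e.g. spaces) rank before all letters.
    return [RANK.get(ch, -1) for ch in text]


def folded_key(title):
    """Hypothetical sort key that reduces å/ä/ö to a/a/o and ignores spaces.

    Assumption: this only approximates a non-Swedish collation; it is not
    taken from the Codex or Postgres implementation.
    """
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isspace()
    )


titles = ["Den äkta varan", "Den åländska skärgården", "Den aktansvärda"]

swedish_order = sorted(titles, key=swedish_key)
folded_order = sorted(titles, key=folded_key)

print(swedish_order)  # a < å < ä, matching the expected results
print(folded_order)   # å and ä compared as plain "a", matching the actual results
```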

CSP Request Details

None

CSP Rejection Details

None

Potential Workaround

None

Attachments

1

Activity

Magda Zacharska January 14, 2022 at 8:17 PM

The Kiwi bugfest environment is hardly US-centric: in addition to the default English analyzer, it also has Russian, Hebrew and Arabic analyzers, but it does not have a Swedish one. You might want to create a request for DevOps to add this analyzer and rebuild the index afterwards, so that you can verify Swedish-specific behavior.

Theodor Tolstoy (One-Group.se) January 14, 2022 at 8:12 AM

Has this behavior been addressed in the Elasticsearch implementation? Bugfest-Kiwi is not a good example since it is US-centric, so it is a bit hard to verify. Do you want me to create a similar ticket for Inventory?

Holly Mistlebauer December 21, 2021 at 6:50 PM

This ticket has been closed because it is over 3 years old and has a very low priority.

Mike Taylor August 21, 2018 at 3:13 PM

Your two-phase approach to resolving this is perfect.

Of course, once we introduce relevance-based sorting, we have a completely different task in the Codex multiplexer. We'll need to add a relevanceScore field to the instance schema at https://github.com/folio-org/raml/blob/master/schemas/codex/instance.json and require all Codex sources to include this field in records returned as the result of a relevance search: then the multiplexer can merge the streams by maintaining decreasing scores.

But of course the result of that merge may be unsatisfactory, depending on the different scoring algorithms used by the different back-ends. For example, if the EBSCO KB source issues relevance scores between 1 and 100, and the FOLIO Inventory source issues scores between 0.0 and 1.0, all EBSCO KB records will appear to be more relevant than all FOLIO Inventory records.
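
A small Python sketch of that merge problem, under stated assumptions: the record tuples, titles and scores below are invented for illustration, and merge_by_score is a hypothetical helper, not the multiplexer's actual code. Each stream is assumed to arrive already sorted by decreasing score.

```python
import heapq

# Hypothetical (title, relevanceScore) records, each stream sorted by
# decreasing score. The EBSCO KB scores use a 1-100 scale, the Inventory
# scores a 0.0-1.0 scale, as in the example above.
ebsco_kb = [("EBSCO A", 90), ("EBSCO B", 40), ("EBSCO C", 2)]
inventory = [("Inventory A", 0.99), ("Inventory B", 0.5), ("Inventory C", 0.1)]


def merge_by_score(*streams):
    """Merge pre-sorted streams while maintaining decreasing scores."""
    return list(heapq.merge(*streams, key=lambda rec: rec[1], reverse=True))


merged = merge_by_score(ebsco_kb, inventory)
merged_titles = [title for title, _score in merged]

# Even the weakest EBSCO KB hit (score 2) outranks the strongest
# Inventory hit (score 0.99), because the scales are not comparable.
print(merged_titles)
```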

So my proposal at this point is just that we move relevance-ranking into a completely separate issue, and try to avoid letting its unique complexities confuse matters in this one.

Jakub Skoczen August 21, 2018 at 9:46 AM
Edited

on you said:

"decent search engine might still surface a few hits containing "aland", but with significantly reduced relevance scores"

This assumes that the results would be sorted by relevance and not directly according to the locale-specific collation rules. I'd like to understand the expectations a bit better, because the two approaches can be contradictory, e.g.:

1. Assuming matching (search) also considers the unaccented version (stripped diacritics), the only sensible sort seems to be "relevancy ranking", which would boost result positions depending on how close the match is to the original query. For example, in Polish a search for paczki ("packages") would find both paczki and pączki ("doughnuts") but boost paczki results in the relevancy score. A search for pączki ("doughnuts") would do the inverse. Note that in Polish ą sorts after a in alphabetical order.

2. Sorting according to the strict collation rules may only make sense for exact matching. For example, a search for pączki would not yield results for paczki. If it did, those results would be sorted higher up, which I assume would not be the expectation?

I am sure you can find Swedish equivalents for the above.
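
The first approach above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of any FOLIO search back-end: fold and search are hypothetical helpers, matching is a plain substring test on diacritic-stripped text, and "relevance" is reduced to a single boost for hits that contain the query exactly as typed.

```python
import unicodedata


def normalize(text):
    """Lowercase and compose the text so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def fold(text):
    """Strip combining marks, so that e.g. Polish ą is matched as a."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query, titles):
    """Match on folded text, then rank accent-exact hits first."""
    hits = [t for t in titles if fold(query) in fold(t)]
    # False sorts before True, so titles containing the query exactly as
    # typed come first; the sort is stable for the remaining hits.
    return sorted(hits, key=lambda t: normalize(query) not in normalize(t))


# Hypothetical titles for illustration.
titles = ["pączki domowe", "paczki pocztowe", "listy"]

packages_first = search("paczki", titles)
doughnuts_first = search("pączki", titles)

print(packages_first)
print(doughnuts_first)
```

Both queries return the same two hits, but in opposite order, which is the behavior a strict alphabetical sort could not provide.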

Won't Do

Details

Assignee

Reporter

Priority

Development Team

Prokopovych

Created February 20, 2018 at 4:13 PM
Updated January 14, 2022 at 8:17 PM
Resolved December 21, 2021 at 6:50 PM