[UISE-69] Codex search results treats Swedish diacritics as ascii equivalents when sorting results Created: 20/Feb/18  Updated: 14/Jan/22  Resolved: 21/Dec/21

Status: Closed
Project: ui-search
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P4
Reporter: Theodor Tolstoy (One-Group.se) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: chalmers, front-end, keep-bug, triaged, ui-only
Remaining Estimate: Not Specified
Time Spent: 1 hour
Original estimate: Not Specified

Attachments: PNG File Capture2.PNG    
Issue links:
Blocks
is blocked by MODCXMUX-25 sort according to tenant's locale Open
is blocked by MODINVSTOR-148 sort according to tenant's locale Open
is blocked by FOLIO-1246 Implement Postgres Full Text Search f... Closed
Relates
relates to UISE-70 Codex search results are taking Nonfi... Closed
relates to UXPROD-745 Tenant Sort Order Setting Open
relates to UISE-68 Codex search treats Swedish diacritic... Closed
Sprint:
Development Team: Prokopovych

 Description   

Overview: When conducting title level searches in Codex for titles containing Swedish diacritics (å,ä,ö) the sort functionality behaves as if those characters are reduced to their ASCII equivalents (a,o).

Steps to Reproduce:

  • Create a couple of records in Inventory with titles starting with a, å, ä or similar
    For example:
    "Den aktansvärda"
    "Den äkta varan"
    "Den åländska skärgården"
  • Go to Codex and conduct a title search for Den
  • Sort by title ascending order (arrow pointing up)

Expected Results:
Results are returned alphabetically (Swedish):
"Den aktansvärda"
"Den åländska skärgården"
"Den äkta varan"

Actual Results:
Results are sorted according to the attached image:
"Den aktansvärda"
"Den äkta varan"
"Den åländska skärgården"

Additional Information: Will add these in separate issues.

Note:
This particular issue might get solved by changing the collation on the relevant tables in Postgres to Swedish (see https://www.postgresql.org/docs/9.1/static/collation.html), but I believe that this issue is related to a bigger discussion on search technology.
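
A minimal sketch of the two orderings described above, not the Codex or Postgres implementation. It assumes the observed order comes from a sort key that reduces letters to their base form and ignores spaces (an assumption chosen because it reproduces the reported order), and it uses a hand-written Swedish alphabet as a stand-in for a real Swedish collation. The names folded_key and swedish_key are hypothetical.

```python
import unicodedata

TITLES = ["Den äkta varan", "Den åländska skärgården", "Den aktansvärda"]


def folded_key(title):
    """Sort key that strips diacritics and ignores spaces (assumed behaviour)."""
    decomposed = unicodedata.normalize("NFD", title.lower())
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isspace()
    )


# Assumption: a hand-written alphabet stands in for a real Swedish collation.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {letter: rank for rank, letter in enumerate(SWEDISH_ALPHABET)}


def swedish_key(title):
    """Sort key that treats å, ä, ö as separate letters after z."""
    composed = unicodedata.normalize("NFC", title.lower())
    return [SWEDISH_RANK[ch] for ch in composed if ch in SWEDISH_RANK]


actual_order = sorted(TITLES, key=folded_key)
expected_order = sorted(TITLES, key=swedish_key)

print(actual_order)    # order reported under "Actual Results"
print(expected_order)  # order requested under "Expected Results"
```

With the folded key, "Den äkta varan" is compared as if it began with "akta", which is why it lands before "Den åländska skärgården".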



 Comments   
Comment by Cate Boerema (Inactive) [ 21/Feb/18 ]

Tagging Charlotte Whitt for awareness

Comment by Mike Taylor [ 21/Feb/18 ]

This one unfortunately raises more difficulties, along the lines of UISE-68 Closed but worse.

The first question, of course, is what the desired behaviour is. One can imagine that in Swedish, even if it's desirable for Åland and aland to act as the same query-term, it might also be desirable to sort in the order aktansvärda, åländska, äkta. But we should take a moment to verify that before putting too much work into this.

Related to this: I have a horrible feeling that the answer is going to be different in different locales. I don't remember details, but it's going to be something like in Swedish you want aktansvärda, åländska, äkta but in Danish you want aktansvärda, äkta, åländska.

And if I'm right, then we can't resolve this by just using the appropriate collation locale, because what we have is in general a mix of titles in different languages. We could go some way towards getting a Right Answer by collating according to the locale that is configured for the tenant (i.e. also the one used by the UI for deciding on things like how to format dates) – so a tenant based in Sweden that has predominantly Swedish titles would get Swedish collation.
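
A simplified sketch of how the same titles could sort differently depending on the tenant's configured locale. It assumes the Swedish alphabet ends å, ä, ö and the Danish one ends æ, ø, å, with ä filed alongside æ and ö alongside ø in Danish. The tables and the function names collation_key and sort_for_tenant are hypothetical; real collation rules have more levels than this.

```python
import unicodedata

BASE = "abcdefghijklmnopqrstuvwxyz"

# Assumption: simplified per-locale alphabets, not full collation tables.
ALPHABETS = {
    "sv": BASE + "åäö",
    "da": BASE + "æøå",
}

# Assumption: in Danish, ä is filed with æ and ö with ø.
EQUIVALENTS = {"da": {"ä": "æ", "ö": "ø"}}


def collation_key(title, locale):
    """Rank each letter by its position in the locale's alphabet."""
    rank = {letter: i for i, letter in enumerate(ALPHABETS[locale])}
    equivalents = EQUIVALENTS.get(locale, {})
    text = unicodedata.normalize("NFC", title.lower())
    letters = (equivalents.get(ch, ch) for ch in text)
    return [rank[ch] for ch in letters if ch in rank]


def sort_for_tenant(titles, tenant_locale):
    """Sort titles with the collation of the tenant's configured locale."""
    return sorted(titles, key=lambda title: collation_key(title, tenant_locale))


WORDS = ["äkta", "åländska", "aktansvärda"]
swedish = sort_for_tenant(WORDS, "sv")
danish = sort_for_tenant(WORDS, "da")

print(swedish)
print(danish)
```

Under these assumptions the Swedish tenant sees åländska before äkta, and the Danish tenant sees the reverse.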

But even there, we are dependent on several implementation aspects.

First, as with searching, the Codex app can only offer those facilities that the corresponding back-end modules support. It's likely that mod-codex-inventory could fetch the prevailing locale from mod-configuration and use that to instruct PostgreSQL (via some RMB-provided endpoint) to use the appropriate collation locale. I think it's much less likely that mod-codex-ekb can honour the locale – at least in its present form, though the forthcoming rewrite might help, since IIRC it's based on Solr or Elasticsearch.

But there is one further issue here, to do with merging sorted lists. The Codex multiplexer never does sorting of its own: it just passes the sort-specifications through the various Codex-source modules (as part of the query) and zips together the resulting sorted lists that they return. If the lists returned from the Codex-source modules are not sorted in the same order, the results will be incorrect – often in subtle, hard-to-understand ways. To avoid this, I see only three solutions:

1. Simplest: every Codex source uses simple ASCII collation. Obviously not optimal, but will definitely work in a predictable way.
2. Every Codex source does locale-aware collation exactly right, using the same locale. This would give us the best results, but is hard to control and imposes an implementation burden that not all Codex sources may be able to handle.
3. (Adam's idea) Extend the Codex contract so that each record returned has a sortKey field, which the multiplexer can then use for merging the sorted lists in a reliable way. Probably the most practical compromise.
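
A rough sketch of option 3, assuming each source returns records already sorted by the sortKey it reports and that the multiplexer compares those keys as plain strings. The record shape, the key values and the function name merge_sorted_sources are hypothetical; only the sortKey field name comes from the proposal.

```python
import heapq


def merge_sorted_sources(*sources):
    """Zip already-sorted record lists into one list ordered by sortKey."""
    return list(heapq.merge(*sources, key=lambda record: record["sortKey"]))


# Hypothetical records: each list is sorted by the sortKey its source reports.
inventory_records = [
    {"title": "Den aktansvärda", "sortKey": "denaktansvarda"},
    {"title": "Den åländska skärgården", "sortKey": "denalandskaskargarden"},
]
kb_records = [
    {"title": "Den äkta varan", "sortKey": "denaktavaran"},
]

merged = merge_sorted_sources(inventory_records, kb_records)
merged_titles = [record["title"] for record in merged]
print(merged_titles)
```

The merge never inspects the titles, so it stays predictable whatever collation a source applied internally, and the exposed keys show how each source actually sorted.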

(I've added Adam Dickmeiss to this issue, to get his perspective on the technical aspects.)

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

I am glad I was able to surface this now then, because this is quite a "light" localization issue. Å, ä and ö are Swedish letters. They are treated like all the other letters of the alphabet in the Swedish language.

I can certainly confirm that "horrible" feeling that there are different filing rules for different locales.

The thing with collation and other localization features of RDBMSs, to my understanding, is that collation does just that. It reduces the right diacritics for the intended user's locale. So the French é:s and è:s would be treated the way a Swedish audience would like them to be treated, like e:s.

With regards to your solutions:
1. This is true for every collation that you choose. Every collation is predictable. Choosing ASCII is perhaps more predictable to you. Not me.
2. Given there is a decent search engine in front of the Codex sources, those could be configured to return the right "locale" based on the search engine's language analysis. Perhaps it does not have to be that hard.
3. MARC21 has the concept of nonfiling characters that might be of use in this context

Comment by Mike Taylor [ 21/Feb/18 ]

Yes, this was good to surface!

But it's a bit weird that it's come up in the context of the Codex Search, which has many more implementation issues than other parts of FOLIO. I wonder if we'd do better to sort out our internationalisation issues in (say) the Inventory app first, before attempting this more difficult feat. On the other hand, if we do it this way round, we can be confident that our solution will work in the simpler cases!

On the solutions:
1. The thing about simple ASCII collation is not that it's more intuitive, but that it's easier to get right. As I said, I am not confident that it's necessarily even going to be possible to have the EBSCO KB Codex source do anything else.
2. I don't understand what you're proposing here. Can you please lay it out in a bit more detail, perhaps with an example request-response pair?
3. The idea here is not to change how Codex sources sort – that can be dealt with separately – but just to have them make an explicit statement of how they sorted (i.e. "this record sorted after the aktansvärda one and before the åländska one because I sorted by the key 'akta'").

The value of #3 is twofold. First, and most pragmatically, it allows the multiplexer to zip together the multiple streams of records in a predictable way; second, and more philosophically important, it lays bare what each individual source is doing, and makes it possible for us to chase down the ones that are not behaving as we wish.

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

Mike Taylor, my comments may reveal my limited knowledge of the inner workings of Codex. I would welcome some investigation into what the actual needs are.

Yes, certainly this should also be surfaced in Inventory. I chose Codex since I thought you have directed field searches (title) there, and since it is, to my knowledge, the intended first point of search in FOLIO.

Comment by Mike Taylor [ 29/May/18 ]

Yep. From a functional perspective, the Codex is a perfectly sensible point of entry to have chosen. It just so happens that it involves a more elaborate technology stack than Inventory, so from an engineering perspective it makes sense to solve the more tractable problem first. But it certainly doesn't hurt to have in mind, as we do so, the more difficult problem!

Comment by Jakub Skoczen [ 21/Aug/18 ]

Mike Taylor Theodor Tolstoy (One-Group.se). We will take this in two phases:

1. Support locale-driven collation/sort in mod-inventory-storage ( MODINVSTOR-148 Open )
2. Address sorting in the Codex mux ( MODCXMUX-25 Open )

Comment by Jakub Skoczen [ 21/Aug/18 ]

Theodor Tolstoy (One-Group.se) on UISE-68 Closed you said:

"decent search engine might still surface a few hits containing "aland", but with significantly reduced relevance scores"

Which assumes that the result would be sorted according to relevancy and not directly according to the locale-specific collation rules. I'd like to understand the expectations a bit better, because the two approaches can be contradictory, e.g.

1. Assuming matching (search) also considers the unaccented version (stripped diacritics), the only sensible sort seems to be "relevancy ranking" that would boost the result positions depending on how close the match is to the original query: e.g. in Polish a search for paczki ("packages") would find both paczki and pączki ("doughnuts") but boost paczki results in the relevancy score. A search for pączki ("doughnuts") would do the inverse. Note that in Polish ą sorts after a in alphabetical order.

2. Sorting to the strict collation rules may only make sense for exact matching: e.g. a search for pączki would not yield results for paczki. If it did, those results would get sorted higher up, which I assume would not be the expectation?
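
A small sketch of approach 1 above, assuming matching compares diacritic-stripped forms and that an exact match simply receives a higher score than a folded-only match. The score values 1.0 and 0.5 and the function names fold and search are hypothetical; a real search engine would compute relevance quite differently.

```python
import unicodedata


def fold(text):
    """Lower-case the text and strip combining marks (ą becomes a)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(query, titles):
    """Match on folded forms, then rank exact matches above folded-only ones."""
    exact_query = unicodedata.normalize("NFC", query.lower())
    hits = []
    for title in titles:
        if fold(title) == fold(query):
            exact_title = unicodedata.normalize("NFC", title.lower())
            # Hypothetical scores: 1.0 for an exact match, 0.5 otherwise.
            score = 1.0 if exact_title == exact_query else 0.5
            hits.append((score, title))
    hits.sort(key=lambda hit: -hit[0])
    return [title for _, title in hits]


TITLES = ["pączki", "paczki"]
packages_first = search("paczki", TITLES)
doughnuts_first = search("pączki", TITLES)
print(packages_first)
print(doughnuts_first)
```

Note that the ranking here depends on the query, not on the Polish alphabet, which is why it cannot be combined with a strict collation sort.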

I am sure you can find Swedish equivalents for the above.

Comment by Mike Taylor [ 21/Aug/18 ]

Jakub Skoczen Your two-phase approach to resolving this is perfect.

Of course, once we introduce relevance-based sorting, we have a completely different task in the Codex multiplexer. We'll need to add a relevanceScore field to the instance schema at https://github.com/folio-org/raml/blob/master/schemas/codex/instance.json and require all Codex sources to include this field in records returned as the result of a relevance search: then the multiplexer can merge the streams by maintaining decreasing scores.

But of course the result of that merge may be unsatisfactory, depending on the different scoring algorithms used by the different back-ends. For example, if the EBSCO KB source issues relevance scores between 1 and 100, and the FOLIO Inventory source issues scores between 0.0 and 1.0, all EBSCO KB records will appear to be more relevant than all FOLIO Inventory records.

So my proposal at this point is just that we move relevance-ranking into a completely separate issue, and try to avoid letting its unique complexities confuse matters in this one.
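
To illustrate the scale mismatch described above, here is a sketch that merges two streams by decreasing relevanceScore. The records, the function names merge_by_score and rescale, and the rescaling step are all hypothetical. Rescaling is not something proposed in this ticket, and it only aligns the ranges: it does not make different scoring algorithms comparable.

```python
import heapq


def merge_by_score(*sources):
    """Merge streams that are each sorted by decreasing relevanceScore."""
    return list(heapq.merge(*sources, key=lambda r: -r["relevanceScore"]))


def rescale(records, low, high):
    """Map scores from the range [low, high] onto [0.0, 1.0]."""
    return [
        {**r, "relevanceScore": (r["relevanceScore"] - low) / (high - low)}
        for r in records
    ]


# Hypothetical records: one source scores 1..100, the other 0.0..1.0.
ebsco = [
    {"source": "ebsco-kb", "relevanceScore": 90},
    {"source": "ebsco-kb", "relevanceScore": 2},
]
inventory = [
    {"source": "inventory", "relevanceScore": 0.95},
    {"source": "inventory", "relevanceScore": 0.4},
]

raw = [r["source"] for r in merge_by_score(ebsco, inventory)]
aligned = [r["source"] for r in merge_by_score(rescale(ebsco, 1, 100), inventory)]
print(raw)      # every ebsco-kb record precedes every inventory record
print(aligned)  # the two sources interleave once the ranges match
```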

Comment by Holly Mistlebauer [ 21/Dec/21 ]

This ticket has been closed because it is over 3 years old and has a very low priority.

Comment by Theodor Tolstoy (One-Group.se) [ 14/Jan/22 ]

Magda Zacharska has this behavior been addressed in the Elastic search implementation? Bugfest-Kiwi is not a good example since that is US-centric, so it is a bit hard to verify. Do you want me to create a similar ticket for Inventory?

Comment by Magda Zacharska [ 14/Jan/22 ]

Theodor Tolstoy (One-Group.se) Kiwi bugfest environment is hardly US-centric: in addition to the default English analyzer, it also has Russian, Hebrew and Arabic, but it does not have Swedish. You might want to create a request for devops to add this analyzer and rebuild the index after that, so you can verify Swedish-specific behavior.

Generated at Thu Feb 08 23:10:37 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.