[UISE-69] Codex search results treats Swedish diacritics as ascii equivalents when sorting results Created: 20/Feb/18 Updated: 14/Jan/22 Resolved: 21/Dec/21 |
|
| Status: | Closed |
| Project: | ui-search |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Bug | Priority: | P4 |
| Reporter: | Theodor Tolstoy (One-Group.se) | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | chalmers, front-end, keep-bug, triaged, ui-only |
| Remaining Estimate: | Not Specified |
| Time Spent: | 1 hour |
| Original estimate: | Not Specified |
| Attachments: |
| Issue links: |
| Sprint: | |
| Development Team: | Prokopovych |
| Description |
|
Overview: When conducting title level searches in Codex for titles containing Swedish diacritics (å,ä,ö) the sort functionality behaves as if those characters are reduced to their ASCII equivalents (a,o). Steps to Reproduce:
Expected Results: Actual Results: Additional Information: Will add these in separate issues. Note: |
| Comments |
| Comment by Cate Boerema (Inactive) [ 21/Feb/18 ] |
|
Tagging Charlotte Whitt for awareness |
| Comment by Mike Taylor [ 21/Feb/18 ] |
|
This one unfortunately raises more difficulties, along the lines of
The first question of course, is what is the desired behaviour. One can imagine that in Swedish, even if it's desirable for Åland and aland to act as the same query-term, it might also be desirable to sort in the order aktansvärda, åländska, äkta. But we should take a moment to verify that before putting too much work into this.

Related to this: I have a horrible feeling that the answer is going to be different in different locales. I don't remember details, but it's going to be something like in Swedish you want aktansvärda, åländska, äkta but in Danish you want aktansvärda, äkta, åländska. And if I'm right, then we can't resolve this by just using the appropriate collation locale, because what we have is in general a mix of titles in different languages.

We could go some way towards getting a Right Answer by collating according to the locale that is configured for the tenant (i.e. also the one used by the UI for deciding on things like how to format dates) – so a tenant based in Sweden that has predominantly Swedish titles would get Swedish collation.

But even there, we are dependent on several implementation aspects. First, as with searching, the Codex app can only offer those facilities that the corresponding back-end modules support. It's likely that mod-codex-inventory could fetch the prevailing locale from mod-configuration and use that to instruct PostgreSQL (via some RMB-provided endpoint) to use the appropriate collation locale. I think it's much less likely that mod-codex-ekb can honour the locale – at least in its present form, though the forthcoming rewrite might help, since IIRC it's based in Solr or ElasticSearch.

But there is one further issue here, to do with merging sorted lists. The Codex multiplexer never does sorting of its own: it just passes the sort-specifications through the various Codex-source modules (as part of the query) and zips together the resulting sorted lists that they return. If the lists returned from the Codex-source modules are not sorted in the same order, the results will be incorrect – often in subtle, hard-to-understand ways.

To avoid this, I see only three solutions:
1. Simplest: every Codex source uses simple ASCII collation. Obviously not optimal, but will definitely work in a predictable way.

(I've added Adam Dickmeiss to this issue, to get his perspective on the technical aspects.) |
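The merging problem described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the simplified alphabets, the key functions, and the sample titles are hypothetical, and `heapq.merge` merely stands in for the multiplexer's zip step. Real collation (for example in PostgreSQL or ICU) involves many more rules.

```python
import heapq

# Hypothetical, heavily simplified alphabets (real collation has many more rules).
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
DANISH = "abcdefghijklmnopqrstuvwxyzæøå"
DANISH_FOLD = str.maketrans("äö", "æø")   # assumption: file ä with æ and ö with ø
ASCII_FOLD = str.maketrans("åäö", "aao")  # diacritics reduced to ASCII letters

def swedish_key(title):
    """Sort key placing å, ä, ö after z, in that order."""
    return [SWEDISH.index(c) for c in title.lower() if c in SWEDISH]

def danish_key(title):
    """Sort key placing æ/ä before ø/ö before å, all after z."""
    folded = title.lower().translate(DANISH_FOLD)
    return [DANISH.index(c) for c in folded if c in DANISH]

def ascii_key(title):
    """Sort key that treats å, ä, ö as plain a, a, o."""
    return title.lower().translate(ASCII_FOLD)

titles = ["äkta", "aktansvärda", "åländska"]
swedish_order = sorted(titles, key=swedish_key)
danish_order = sorted(titles, key=danish_key)

# Two hypothetical Codex sources that sort with different collations.
source_a = sorted(["åländska", "zebra"], key=swedish_key)
source_b = sorted(["äkta", "bok"], key=ascii_key)

# The merge step assumes both inputs are already in Swedish order.
mixed = list(heapq.merge(source_a, source_b, key=swedish_key))

# Same data, but with both sources honouring the same collation.
source_b_fixed = sorted(["äkta", "bok"], key=swedish_key)
consistent = list(heapq.merge(source_a, source_b_fixed, key=swedish_key))

print(swedish_order, danish_order)
print(mixed)
print(consistent)
```

In this sketch the mixed merge puts "bok" last, because the merge only ever compares the heads of the two lists and trusts each list to be in the agreed order. When both sources use the same key, the merged list matches a full sort.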
| Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ] |
|
I am glad I was able to surface this now then, because this is quite a "light" localization issue. Å, ä, ö are Swedish letters. They are treated like all the other letters of the alphabet in the Swedish language. I can certainly confirm that "horrible" feeling that there are different filing rules for different locales. The thing with collation and other localization features of RDBMSs, to my understanding, is that collation does just that. It reduces the right diacritics for the intended user's locale. So the French é:s and è:s would be treated the way a Swedish audience would like them to be treated, like e:s. With regards to your solutions: |
| Comment by Mike Taylor [ 21/Feb/18 ] |
|
Yes, this was good to surface! But it's a bit weird that it's come up in the context of the Codex Search, which has many more implementation issues than other parts of FOLIO. I wonder if we'd do better to sort out our internationalisation issues in (say) the Inventory app first, before attempting this more difficult feat. On the other hand, if we do it this way round, we can be confident that our solution will work in the simpler cases! On the solutions: The value of #3 is twofold. First, and most pragmatically, it allows the multiplexer to zip together the multiple streams of records in a predictable way; second, and more philosophically important, it lays bare what each individual source is doing, and makes it possible for us to chase down the ones that are not behaving as we wish. |
| Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ] |
|
Mike Taylor, my comments may make my limited knowledge of the inner workings of Codex show. I would welcome some investigation into what the actual needs are. Yes, certainly this should also be surfaced in Inventory. I chose Codex since I thought you have directed field searches (title) there, and since, to my knowledge, it is the intended first point of search in FOLIO. |
| Comment by Mike Taylor [ 29/May/18 ] |
|
Yep. From a functional perspective, the Codex is a perfectly sensible point of entry to have chosen. It just so happens that it involves a more elaborate technology stack than Inventory, so from an engineering perspective it makes sense to solve the more tractable problem first. But it certainly doesn't hurt to have in mind, as we do so, the more difficult problem! |
| Comment by Jakub Skoczen [ 21/Aug/18 ] |
|
Mike Taylor Theodor Tolstoy (One-Group.se). We will take this in two phases: 1. Support locale-driven collation/sort in mod-inventory-storage (
|
| Comment by Jakub Skoczen [ 21/Aug/18 ] |
|
Theodor Tolstoy (One-Group.se) on
"decent search engine might still surface a few hits containing "aland", but with significantly reduced relevance scores" Which assumes that the results would be sorted by relevance and not directly according to the locale-specific collation rules. I'd like to understand the expectations a bit better, because the two approaches can be contradictory, e.g.
1. assuming matching (search) also considers the un-accented version (stripped diacritics), the only sensible sort seems to be "relevancy ranking" that would boost result positions depending on how close the match is to the original query: e.g. in Polish, a search for paczki ("packages") would find both paczki and pączki ("doughnuts") but boost paczki results in the relevancy score. A search for pączki ("doughnuts") would do the inverse. Note that in Polish ą sorts after a in alphabetical order.
2. sorting to the strict collation rules may only make sense for exact matching: e.g. a search for pączki would not yield results for paczki. If it did, those results would get sorted higher up, which I assume would not be the expectation?
I am sure you can find Swedish equivalents for the above. |
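The first approach above (match on stripped diacritics, then rank by closeness to the query) could look roughly like the following sketch. The function names, the whole-title matching, and the scores 1.0 and 0.5 are hypothetical choices for illustration only, not how any FOLIO module or search engine actually scores hits.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (ą -> a, å -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def canonical(text):
    """Lower-case and NFC-normalise so exact comparison is reliable."""
    return unicodedata.normalize("NFC", text.lower())

def search(query, titles):
    """Return titles matching the query without diacritics, exact matches first."""
    folded_query = strip_diacritics(canonical(query))
    hits = []
    for title in titles:
        if strip_diacritics(canonical(title)) == folded_query:
            # Hypothetical scoring: an exact match outranks a diacritic-folded one.
            score = 1.0 if canonical(title) == canonical(query) else 0.5
            hits.append((score, title))
    hits.sort(key=lambda hit: -hit[0])  # stable sort, highest score first
    return [title for _, title in hits]

catalogue = ["pączki", "paczki", "piwo"]
packages_first = search("paczki", catalogue)
doughnuts_first = search("pączki", catalogue)
print(packages_first, doughnuts_first)
```

Both queries find both spellings, but each puts its own exact spelling first, which is the "inverse" behaviour described for the Polish example.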
| Comment by Mike Taylor [ 21/Aug/18 ] |
|
Jakub Skoczen Your two-phase approach to resolving this is perfect. Of course, once we introduce relevance-based sorting, we have a completely different task in the Codex multiplexer. We'll need to add a relevanceScore field to the instance schema at https://github.com/folio-org/raml/blob/master/schemas/codex/instance.json and require all Codex sources to include this field in records returned as the result of a relevance search: then the multiplexer can merge the streams by maintaining decreasing scores. But of course the result of that merge may be unsatisfactory, depending on the different scoring algorithms used by the different back-ends. For example, if the EBSCO KB source issues relevance scores between 1 and 100, and the FOLIO Inventory source scores between 0.0 and 1.0, all EBSCO KB records will appear to be more relevant than all FOLIO Inventory records. So my proposal at this point is just that we move relevance-ranking into a completely separate issue, and try to avoid letting its unique complexities confuse matters in this one. |
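The score-scale problem in the comment above can be shown with a small sketch. The record titles and score values are made up, and `merge_by_score` is a hypothetical stand-in for the multiplexer's merge step; only the `relevanceScore` field name and the two score ranges come from the comment.

```python
import heapq

# Hypothetical records, each stream already sorted by decreasing score.
ebsco_kb = [
    {"title": "EBSCO strong match", "relevanceScore": 90},
    {"title": "EBSCO weak match", "relevanceScore": 15},
]
inventory = [
    {"title": "Inventory strong match", "relevanceScore": 0.95},
    {"title": "Inventory weak match", "relevanceScore": 0.40},
]

def merge_by_score(*streams):
    """Merge pre-sorted streams so that scores stay in decreasing order."""
    return list(
        heapq.merge(*streams, key=lambda record: record["relevanceScore"], reverse=True)
    )

merged_titles = [record["title"] for record in merge_by_score(ebsco_kb, inventory)]
print(merged_titles)
```

With these made-up numbers, even the weak EBSCO KB hit (15) outranks the strongest Inventory hit (0.95), simply because the two sources score on different scales.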
| Comment by Holly Mistlebauer [ 21/Dec/21 ] |
|
This ticket has been closed because it is over 3 years old and has a very low priority. |
| Comment by Theodor Tolstoy (One-Group.se) [ 14/Jan/22 ] |
|
Magda Zacharska has this behavior been addressed in the Elastic search implementation? Bugfest-Kiwi is not a good example since that is US-centric, so it is a bit hard to verify. Do you want me to create a similar ticket for Inventory? |
| Comment by Magda Zacharska [ 14/Jan/22 ] |
|
Theodor Tolstoy (One-Group.se) The Kiwi bugfest environment is hardly US-centric: in addition to the default English analyzer, it also has Russian, Hebrew and Arabic, but it does not have Swedish. You might want to create a request for devops to add this analyzer and to rebuild the index after that, so you can verify Swedish-specific behavior. |