Codex (UXPROD-833)

[UISE-68] Codex search treats Swedish diacritics as ascii equivalents Created: 20/Feb/18  Updated: 14/Jan/22  Resolved: 21/Dec/21

Status: Closed
Project: ui-search
Components: None
Affects versions: None
Fix versions: None
Parent: Codex

Type: Bug Priority: P4
Reporter: Theodor Tolstoy (One-Group.se) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: bug-search, chalmers, front-end, keep-bug, triaged, ui-only
Remaining Estimate: Not Specified
Time Spent: 1 hour
Original estimate: Not Specified

Attachments: PNG File Capture.PNG    
Issue links:
Blocks
is blocked by FOLIO-1246 Implement Postgres Full Text Search f... Closed
is blocked by FOLIO-1281 define sorting semantics for titles a... Closed
Cloners
is cloned by UXPROD-745 Tenant Sort Order Setting Open
Relates
relates to UISE-69 Codex search results treats Swedish d... Closed
relates to UISE-70 Codex search results are taking Nonfi... Closed
Sprint: malconia Sprint 1
Development Team: Prokopovych
Epic Link: Codex

 Description   

Overview: When conducting title level searches in Codex for titles containing Swedish diacritics (å,ä,ö) the search behaves as if those characters are reduced to their ASCII equivalents (a,o).

Steps to Reproduce:

  • Create a couple of records in Inventory with titles starting with a, å, ä or similar
    For example:
    "Den aktansvärda"
    "Den äkta varan"
    "Den åländska skärgården"
    "The Åland archipelago"
    "Ålöndska skärgården"
    "The Aland archipelago"
  • Go to Codex and conduct a title search for åland

Expected Results:
The title "The Åland archipelago" is showing.

(Another form of expected result is that also "Den åländska skärgården" is showing since "åländska" is a form of "åland" that Swedish stemming algorithms might be able to catch.)

Actual Results:
"The Aland archipelago" is returned together with the above and a few other items containing the string "aland". See attached image.

Additional Information: Will add these in separate issues.

This particular issue might be solved by changing the collation on the relevant tables in Postgres to Swedish (see https://www.postgresql.org/docs/9.1/static/collation.html), but I believe it is related to a bigger discussion on search technology.
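
The reported behaviour is consistent with diacritics being folded away before matching. The following is a minimal sketch of that effect, not the actual Codex or Postgres implementation: the helper names are hypothetical, and plain substring matching is assumed purely for illustration.

```python
import unicodedata

# Titles from the steps to reproduce.
TITLES = [
    "Den aktansvärda",
    "Den äkta varan",
    "Den åländska skärgården",
    "The Åland archipelago",
    "Ålöndska skärgården",
    "The Aland archipelago",
]


def fold_diacritics(text: str) -> str:
    """Lower-case the text and drop combining marks (å -> a, ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def folded_search(query: str, titles: list[str]) -> list[str]:
    """Substring search after folding both query and title (assumed behaviour)."""
    needle = fold_diacritics(query)
    return [t for t in titles if needle in fold_diacritics(t)]


def exact_search(query: str, titles: list[str]) -> list[str]:
    """Case-insensitive substring search that keeps diacritics intact."""
    needle = unicodedata.normalize("NFC", query.lower())
    return [t for t in titles if needle in unicodedata.normalize("NFC", t.lower())]


print(folded_search("åland", TITLES))
print(exact_search("åland", TITLES))
```

With folding, the query åland also matches "The Aland archipelago" and "Den åländska skärgården"; without it, only "The Åland archipelago" is returned.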



 Comments   
Comment by Cate Boerema (Inactive) [ 21/Feb/18 ]

Tagging Charlotte Whitt for awareness.

Comment by Mike Taylor [ 21/Feb/18 ]

Two things.

First, the behaviour you describe here – treating letters with and without diacritics as equivalent for the purpose of searching – is near-universally considered desirable: many users will type aland when they want to find the Åland archipelago. Of course it's possible that librarians are different, and really do want to make a distinction – if the SIGs have come to that conclusion, then fine.

Second, this is nothing to do with ui-search. All it does is submit the user's query to the back-end module mod-codex-mux, which in turn passes it on to all the back-end modules that provide Codex sources: mod-codex-inventory, mod-codex-ekb, and others in future. In all cases, the records that get displayed to the user are those that the individual Codex-source modules considered correct.

So depending on whether you're getting local-inventory records or EBSCO KB records with "aland" when you searched for åland, or both, this will be an issue to fix in mod-codex-inventory (where the Postgres rules you alluded to may well pertain), in mod-codex-ekb, or both.

(And I have no idea whether mod-codex-ekb has the flexibility to control this kind of detail in how search works, but based on its inability to deal with many other aspects of searching, my guess would be not. But we can ask.)

So the way forward is:

1. Determine whether, for searching, we really want to distinguish accented characters and their unaccented equivalents; and
2. Determine whether it is local inventory, the EBSCO KB, or both, that is giving us the behaviour we don't want (whichever that turns out to be).

Then we'll be able to go ahead and file issues on the relevant back-end module or modules.

Comment by Cate Boerema (Inactive) [ 21/Feb/18 ]

Thanks Mike Taylor. I actually moved this to UISE. In retrospect, I could have left it in FOLIO until Charlotte Whitt returned.

Anyway, you raise some good questions. Charlotte can weigh in on whether we want to do this or not when she's back from vacation. I'll assign this to her and mark it DRAFT.

Comment by Mike Taylor [ 21/Feb/18 ]

No problem – I'm glad you did move it into UISE, otherwise I would probably never have seen it!

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

Regarding your first point Mike Taylor, I might have expressed myself in a way that leads to misinterpretation.
Your first point is indeed true for some cases and to some languages, but these letters are actual Swedish letters. They are not some form of a and o, they are letters of their own. Period.
A decent search engine might still surface a few hits containing "aland", but with significantly reduced relevance scores. As you can see, this is not the case.

On a side note, Swedes would not care for diacritics in other languages (like the french é's and so on), so this is something i18n efforts must take into consideration going forward.
These issues are not new, and they're handled very well in most search engines and RDBMS's (Collation etc) all over. I actually think that this is not really a large technical problem, it's just not yet handled.

Comment by Mike Taylor [ 21/Feb/18 ]

Your first point is indeed true for some cases and to some languages, but these letters are actual Swedish letters. They are not some form of a and o, they are letters of their own.

Thanks for this clarification.

I fear it makes the problem even more intractable, then: we will need to behave differently for Swedish (where "å" is a different letter from "a") and, say, French (where "é" is a modified form of "e"). So we will run into a similar set of locale-related issues to those discussed in UISE-69 Closed .
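
The locale dependence described here can be sketched as follows. This is only an illustration under stated assumptions: the `DISTINCT_LETTERS` table and the `fold_for_locale` helper are hypothetical, and the per-locale letter sets are simplified.

```python
import unicodedata

# Assumption: letters that each locale treats as base letters in their own
# right, and which therefore must not be folded. Simplified for illustration.
DISTINCT_LETTERS = {
    "sv": set("åäö"),  # Swedish: å, ä and ö are separate letters
    "fr": set(),       # French: accents modify a base letter
}


def fold_for_locale(text: str, locale: str) -> str:
    """Fold diacritics except on letters the locale considers distinct."""
    keep = DISTINCT_LETTERS.get(locale, set())
    out = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append(
                "".join(c for c in decomposed if not unicodedata.combining(c))
            )
    return "".join(out)


print(fold_for_locale("Åland café", "sv"))
print(fold_for_locale("Åland café", "fr"))
```

Under a Swedish locale, "å" survives while "é" is folded to "e"; under a French locale both are folded.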

I actually think that this is not really a large technical problem, it's just not yet handled.

Haha, I wish I shared your confidence.

If we controlled the whole stack, I would agree with you: for example, it should not be too difficult to make the Inventory UI module work correctly along these lines. The problem is that the Codex is by design compounded from components contributed by multiple vendors, on multiple technical substrates. Please calibrate your optimism accordingly

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

Mike Taylor, my comments may reveal my limited knowledge of the inner workings of Codex. I would welcome some investigation into what the actual needs are.

Comment by Mike Taylor [ 21/Feb/18 ]

I'd say you've done a really helpful job of clarifying the needs – especially the explanation that Swedish "å" is its own letter in a way that French "é" is not. I don't think we'd have been able to arrive at such a good understanding of what's required without that.

Anyway, let's see what Adam has to say on the sorting issue ( UISE-69 Closed ): he probably understands the technical details better.

BTW., sorry if I've come across as patronising at any stage of this – I know I can fall into that; it's not my intention.

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

Thank you.

No problem, I've enjoyed this discussion

Comment by Jakub Skoczen [ 21/Aug/18 ]

Theodor Tolstoy (One-Group.se), on FOLIO-1246 Closed we are discussing the ability to boost ranks for "exact" (e.g. including diacritics) matches. Mind you, this is only going to address the quality of results for Inventory; CodexSearch also depends on the quality of results from the KB.
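
The idea of boosting exact matches can be sketched like this. It is not the approach taken in FOLIO-1246, only a minimal illustration: the `score` and `ranked` helpers and the `exact_boost` value are hypothetical, and substring matching stands in for real full-text search.

```python
import unicodedata


def fold(text: str) -> str:
    """Lower-case the text and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def score(query: str, title: str, exact_boost: float = 2.0) -> float:
    """0.0 for no match, 1.0 for a folded-only match,
    1.0 + exact_boost when the diacritics match as well."""
    q_exact = unicodedata.normalize("NFC", query.lower())
    t_exact = unicodedata.normalize("NFC", title.lower())
    if fold(q_exact) not in fold(t_exact):
        return 0.0
    return 1.0 + (exact_boost if q_exact in t_exact else 0.0)


def ranked(query: str, titles: list[str]) -> list[str]:
    """Return matching titles, best score first (ties keep input order)."""
    hits = [(score(query, t), t) for t in titles]
    hits = [h for h in hits if h[0] > 0]
    return [t for _, t in sorted(hits, key=lambda h: -h[0])]


print(ranked("åland", [
    "The Aland archipelago",
    "Den åländska skärgården",
    "The Åland archipelago",
]))
```

Titles that only match after folding are still returned, but they rank below the title that matches with its diacritics intact.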

Comment by Mike Taylor [ 21/Aug/18 ]

(Side-issue: I worry that a lot of issues are cropping up here in the UISE project that really pertain to much lower level or more general aspects of FOLIO – such as the feature Jakub just mentioned where exact matching including accents would boost a hit's relevance score. In general, when tempted to file an issue in UISE, would POs please check first whether the same issue pertains in the Inventory app? If so, then better to file it in UIIN, so we can fix it more simply without having to think about so many different layers of software at once.)

Comment by Holly Mistlebauer [ 21/Dec/21 ]

This ticket has been closed because it is over 3 years old and has a very low priority.

Comment by Theodor Tolstoy (One-Group.se) [ 14/Jan/22 ]

Magda Zacharska Same here. How is this handled within the ES implementation? Do you want me to file a ticket?

Comment by Magda Zacharska [ 14/Jan/22 ]

Theodor Tolstoy (One-Group.se) The Kiwi bugfest environment has, in addition to the default English analyzer, Russian, Hebrew and Arabic analyzers installed, but it does not have a Swedish one. You might want to create a request for devops to add this analyzer and rebuild the index afterwards, so you can verify the Swedish-specific behavior.
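
For reference, Elasticsearch ships a built-in language analyzer named `swedish`. The sketch below builds an index body that applies it to a text field. The field name `title` and the helper function are hypothetical, and this is not necessarily how the FOLIO search module configures its language analyzers.

```python
import json


def swedish_title_index_body(field: str = "title") -> dict:
    """Build an index-creation body that analyzes one text field
    with Elasticsearch's built-in "swedish" analyzer.

    The field name is a placeholder, not taken from the FOLIO schema.
    """
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "swedish",
                }
            }
        }
    }


# Print the JSON body that would be sent when creating the index.
print(json.dumps(swedish_title_index_body(), indent=2))
```

Because the analyzer is applied at index time, existing documents need to be reindexed before the new behaviour shows up in search results.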

Generated at Thu Feb 08 23:10:36 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.