[UISE-70] Codex search results are taking Nonfiling characters into account when sorting Created: 20/Feb/18  Updated: 21/Dec/21  Resolved: 21/Dec/21

Status: Closed
Project: ui-search
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: P4
Reporter: Theodor Tolstoy (One-Group.se) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: chalmers, front-end, keep-bug, triaged, ui-only
Remaining Estimate: Not Specified
Time Spent: 1 hour
Original estimate: Not Specified

Attachments: PNG File Capture.PNG     PNG File Capture.PNG    
Issue links:
Blocks
is blocked by FOLIO-1281 define sorting semantics for titles a... Closed
Relates
relates to UXPROD-745 Tenant Sort Order Setting Open
relates to UISE-68 Codex search treats Swedish diacritic... Closed
relates to UISE-69 Codex search results treats Swedish d... Closed
Sprint:
Development Team: Prokopovych

 Description   

Overview: When conducting title-level searches in Codex, the sort algorithm seems to take definite articles and other nonfiling characters into consideration. This seems to be true for both Swedish and English.

Steps to Reproduce:

  • Create a couple of records in Inventory with titles starting with a, å, ä or similar
    For example:
    "Den aktansvärda"
    "Den äkta varan"
    "Den åländska skärgården"
    "The Åland archipelago"
    "Ålöndska skärgården"
    "The Aland archipelago"
  • Go to Codex and conduct a title search for åland
  • Sort the results on title in ascending order (arrow pointing up)

Expected Results:
Search results sorted in the following order:

  • The Aland archipelago ("The " should be disregarded)
  • Northern Territories, Asia-Pacific Regional Conflicts and the Åland...
  • A User’s Guide to the Nestle-Aland 28 Greek New Testament ("A " should be disregarded)
  • Ware Conterfeyhung eines abscheulichen Aland Fisches...
  • The Åland archipelago ("The " should be disregarded)
  • Den åländska skärgården ("Den " should be disregarded since it is a Swedish definite article)

Note: Not all of these results (the result items themselves) are expected to appear. Setting that aside, the point is that the sort should disregard the nonfiling characters.

Actual Results:
See attached image



 Comments   
Comment by Cate Boerema (Inactive) [ 21/Feb/18 ]

Tagging Charlotte Whitt for awareness.

Comment by Mike Taylor [ 21/Feb/18 ]

I can't come up with a rationale for why we might expect the "Expected" sort order. Surely "Northern Territories, Asia-Pacific Regional Conflicts and the" should not all be discarded so that the record sorts by "Åland"?

What am I missing?

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

Mike Taylor, I added an explanation to the results in the expected example.

I am not sure that this is the way we want FOLIO to deal with nonfiling characters, but I am pretty sure we must have a discussion on it, since it emerged from the initial impressions I received when visiting Chalmers, and since it is a thing in current systems.

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

I added a screenshot from inside Sierra on a list of search results sorted alphabetically showcasing how it does not take the "den" definite article into account.

Comment by Mike Taylor [ 21/Feb/18 ]

Thanks for that explanation – of course, it makes perfect sense.

So just to clarify: the term "nonfiling characters" actually refers to words, such as "the" and "den", rather than to characters? I guess this is just one of those things where the standard term for the concept is wrong, but we're stuck with it.

Here's another multilingual problem. We can't just strip "den" from the start of titles for sorting purposes, because then English titles like "Den of Thieves" will be sorted wrongly. So what is the desired functionality? (Once we figure that out, we can start to think about whether it can actually be implemented.)

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

For MARC21, this is being handled (as far as I know).
Search for "nonfiling characters" in the middle of this page: https://www.oclc.org/bibformats/en/2xx/245.html

Someone with more up-to-date knowledge should have a look into this.

Comment by Mike Taylor [ 21/Feb/18 ]

Yes, that's a good approach – the 2nd indicator on the 245 field explicitly states how many leading characters to skip. But the Codex sources will not in general have that information.
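
A minimal sketch of that indicator-based approach, assuming each record carries its title together with the indicator value. The function name and the sample records are hypothetical and do not come from any FOLIO module:

```python
def filing_key(title: str, nonfiling_count: int) -> str:
    """Build a sort key by skipping the leading nonfiling characters.

    nonfiling_count mirrors the 2nd indicator of MARC 245 (0-9): the number
    of leading characters, including the trailing space, to ignore.
    """
    if not 0 <= nonfiling_count <= 9:
        raise ValueError("MARC 245 2nd indicator must be in the range 0-9")
    return title[nonfiling_count:].casefold()


# Hypothetical records: (title, 2nd indicator of 245)
records = [
    ("The Åland archipelago", 4),
    ("Den of Thieves", 0),
    ("Den äkta varan", 4),
    ("A User's Guide", 2),
]

# Sort by the filing key, then keep only the display titles.
sorted_titles = [t for t, n in sorted(records, key=lambda r: filing_key(*r))]
```

Note that this compares keys by plain code point, so it only illustrates the skipping step; alphabetising å, ä and ö the Swedish way would still need a locale-aware collation on top.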

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

That is true, but I think there are more automatic approaches that could be used today that are more efficient. I think, for example, Solr and Elasticsearch could be taught to handle this.

Comment by Mike Taylor [ 21/Feb/18 ]

I hope you're right – but (A) mod-codex-ekb is not using either of these; (B) neither is mod-codex-inventory, it's using the RMB-mediated access to PostgreSQL; and (C) in any case, this can't be done correctly without knowing the language of each record – otherwise we get the "Den of Thieves" problem I mentioned above.
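
To illustrate point (C), here is a rough sketch of language-aware article stripping. The article lists, the language codes and the function name are all assumptions made for the example, and real lists would need to be longer:

```python
# Hypothetical per-language lists of leading articles (ISO 639-2 style codes).
LEADING_ARTICLES = {
    "eng": ("the", "a", "an"),
    "swe": ("den", "det", "de", "en", "ett"),
}


def strip_leading_article(title: str, language: str) -> str:
    """Drop a leading article only if it is an article in the record's language."""
    first, sep, rest = title.partition(" ")
    # Require a following word so a one-word title is never emptied.
    if sep and rest and first.casefold() in LEADING_ARTICLES.get(language, ()):
        return rest
    return title


swedish = strip_leading_article("Den åländska skärgården", "swe")
english = strip_leading_article("Den of Thieves", "eng")
```

With the language known, "Den" is dropped from the Swedish title but kept in "Den of Thieves". Without a language on the record, the function leaves the title untouched, which is the situation most Codex sources would be in.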

Comment by Theodor Tolstoy (One-Group.se) [ 21/Feb/18 ]

That is true, but I think you can get a long way using automated approaches.

Maybe this is not the best place to ask this question, but why is there not a search engine in Codex?

Comment by Mike Taylor [ 21/Feb/18 ]

There are basically two approaches to searching multiple sources at once.
1. Harvest everything into one big database and search that.
2. Search in real time and merge the results.

There are advantages and disadvantages to each approach. #1 needs more up-front effort and more sysadmin, but yields faster and more consistent results. This is what Summon does. #2 is more lightweight, but slower and dependent on the capability of the sources.

The Codex is a type-2 solution.
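
As a rough sketch of what a type-2 merge looks like, assuming every source returns its hits already sorted by the same key (the function name and the sample hits are hypothetical):

```python
import heapq


def merge_sorted_results(*sources, key=str.casefold):
    """Merge result lists that are each already sorted by `key`.

    The output is only correctly ordered if every source really sorted its
    own hits the same way, which is why a type-2 solution depends on the
    capability of the sources.
    """
    return list(heapq.merge(*sources, key=key))


# Hypothetical hits from two sources, each pre-sorted by title.
inventory_hits = ["Aland archipelago", "Den of Thieves"]
kb_hits = ["Baltic studies", "Zoology today"]

merged = merge_sorted_results(inventory_hits, kb_hits)
```

If one source sorts with nonfiling characters skipped and another does not, the merged list inherits the inconsistency.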

We would perhaps like to do a type-1 solution, but the fundamental problem is that we can't in general harvest all the things we want. For example, the EBSCO KB is proprietary and not available for harvesting. So for now at least, this is a non-starter.

Comment by Jakub Skoczen [ 21/Aug/18 ]

Theodor Tolstoy (One-Group.se) Mike Taylor guys, I'd like to make sure we are clear about the scope of what can (and will) be done vs what is outside of Core Team control. I suggest addressing this particular issue in two stages:

1. Stage 1: address sort and search issues in Inventory (and other modules that index data locally in FOLIO). The relevant issues here are FOLIO-1246 Closed (an umbrella for more powerful search functionality, including ranking, stopwords, etc.) and MODINVSTOR-148 Open (which is about ensuring that the tenant locale drives the DB collation setting, and which will address locale-specific sorting issues).

2. Stage 2: address sort and search issues in the Codex Search app. Here we are generally limited by the quality of results from the upstream sources, one of which we control directly (Inventory), while for the other (EBSCO KB) we can request certain tuning.

Comment by Mike Taylor [ 21/Aug/18 ]

Strongly agree. These conversations got into a lot of unnecessary complexity by trying to solve the difficult case of the Codex before having solved the (relatively!) easy case of the local inventory.

Comment by Holly Mistlebauer [ 21/Dec/21 ]

This ticket has been closed because it is over 3 years old and has a very low priority.

Generated at Thu Feb 08 23:10:38 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.