Internationalization and Localization (UXPROD-779)

[UXPROD-745] Tenant Sort Order Setting Created: 28/May/18  Updated: 16/Sep/23

Status: Open
Project: UX Product
Components: None
Affects versions: None
Fix versions: None
Parent: Internationalization and Localization

Type: New Feature Priority: P3
Reporter: Theodor Tolstoy (One-Group.se) Assignee: Jakub Skoczen
Resolution: Unresolved Votes: 0
Labels: chalmers, early_implementers, elastic-search, i18n, unicode
Remaining Estimate: Not Specified
Time Spent: 1 day, 15 minutes
Original estimate: Not Specified

Attachments: Capture.PNG, Skärmavbild 2018-05-28 kl. 15.09.06.png, Skärmavbild 2018-05-28 kl. 15.09.24.png, Skärmavbild 2018-05-29 kl. 11.41.37.png, Skärmavbild 2018-05-29 kl. 11.48.34.png, screenshot-1.png
Issue links:
Blocks
is blocked by MODCXMUX-25 sort according to tenant's locale Open
is blocked by MODINVSTOR-148 sort according to tenant's locale Open
is blocked by STCOM-78 Use new CQL sort-modifiers to specify... Open
is blocked by FOLIO-1246 Implement Postgres Full Text Search f... Closed
Cloners
clones UISE-68 Codex search treats Swedish diacritic... Closed
Duplicate
is duplicated by FOLIO-850 Locale-specific sorting Closed
Relates
relates to ERM-3011 Agreements with Korean characters do ... Closed
relates to UISE-69 Codex search results treats Swedish d... Closed
relates to UISE-70 Codex search results are taking Nonfi... Closed
relates to RMB-37 SQL sorting/comparing must use the te... Draft
relates to FOLIO-1955 Create databases using und-x-icu coll... Open
relates to UXPROD-1135 Locale-driven search Open
relates to BF-264 Sorting of contributor types ignores ... Closed
relates to UIIN-264 In edit mode of the Instance record. ... Closed
relates to UXPROD-1045 Fulltext Search Closed
relates to UISE-80 Search results are not sorted by title Closed
relates to UIU-1726 User app search: observe and interfil... Draft
Epic Link: Internationalization and Localization
Front End Estimate: Very Small (VS) < 1 day
Back End Estimate: Medium < 5 days
Back End Estimator: Jakub Skoczen
Kiwi Planning Points (DO NOT CHANGE): 1
Rank: Chalmers (Impl Aut 2019): R2
Rank: Chicago (MVP Sum 2020): R5
Rank: Cornell (Full Sum 2021): R5
Rank: Duke (Full Sum 2021): R2
Rank: 5Colleges (Full Jul 2021): R4
Rank: FLO (MVP Sum 2020): R5
Rank: GBV (MVP Sum 2020): R2
Rank: hbz (TBD): R1
Rank: Hungary (MVP End 2020): R1
Rank: Lehigh (MVP Summer 2020): R5
Rank: Leipzig (Full TBD): R1
Rank: Leipzig (ERM Aut 2019): R2
Rank: TAMU (MVP Jan 2021): R5
Rank: U of AL (MVP Oct 2020): R4

 Description   

Purpose: The purpose of this UXPROD is to capture the need to deal with sorting and diacritics. This isn't a Swedish problem, but a more general problem surfaced by Theodor in the context of Swedish (Chalmers). In addressing this UXPROD, we need to look at the problem holistically. Should also look into the other, related linked issues (see links).

Below are the details from the original bug ( UISE-68 Closed ). Lots of good discussion can also be found in that bug's comments:

Original Issue Summary: Codex search treats Swedish diacritics as ascii equivalents

Overview: When conducting title level searches in Codex for titles containing Swedish diacritics (å,ä,ö) the search behaves as if those characters are reduced to their ASCII equivalents (a,o).

Steps to Reproduce:

  • Create a couple of records in Inventory with titles starting with a, å, ä or similar
    For example:
    "Den aktansvärda"
    "Den äkta varan"
    "Den åländska skärgården"
    "The Åland archipelago"
    "Ålöndska skärgården"
    "The Aland archipelago"
  • Go to Codex and conduct a title search for åland

Expected Results:
The title "The Åland archipelago" is showing.

(Another form of expected result is that also "Den åländska skärgården" is showing since "åländska" is a form of "åland" that Swedish stemming algorithms might be able to catch.)

Actual Results:
"The Aland archipelago" is returned together with the above and a few other items containing the string "aland". See attached image.

Additional Information: Will add these in separate issues.

This particular issue might be solved by changing the collation on the relevant tables in Postgres to Swedish (see https://www.postgresql.org/docs/9.1/static/collation.html), but I believe that this issue is related to a bigger discussion about search technology.



 Comments   
Comment by Charlotte Whitt [ 28/May/18 ]

Thanks Theodor Tolstoy (One-Group.se) for catching this. This 'bug' is then not just about Codex Search handling Swedish diacritics - but also about handling Danish and Norwegian letters - æ, ø, å, Æ, Ø, Å, and probably German diacritics too, and many more.
We must also investigate how this is handled in Inventory, and eHoldings - probably FOLIO in general. Not sure how this is solved in the Users app, etc.

Comment by Charlotte Whitt [ 28/May/18 ]

Just tested the User app, and the same problem appears here - created a new user: Thomas Øberg:

Comment by Theodor Tolstoy (One-Group.se) [ 28/May/18 ]

Yes, it is my understanding that this is a common issue all over FOLIO. I think there are both short-term and longer-term ways to solve them.

Comment by Cate Boerema (Inactive) [ 28/May/18 ]

Theodor Tolstoy (One-Group.se), do you want to summarize the short-term and long-term solutions here? I think that would be helpful for when we are looking at this.

Comment by Theodor Tolstoy (One-Group.se) [ 29/May/18 ]

Sure Cate Boerema. Short-term, I think it should be investigated whether the database collation settings could be used to help solve some of the most pressing needs here, at least for Inventory. When it comes to nonfiling characters, localized stopwords and so on, I think there should be a more full-featured search engine behind the search experience. The issue of having the search experience unified across Codex when there are multiple search technologies involved (both KB and Inventory, and in the future, even more) has to be solved of course, and that might not be the easiest task.

Comment by Niels Erik Nielsen [ 29/May/18 ]

A few weeks back I experimented with PostgreSQL Collations on my own box.

I found that the OS must have the given locale installed and PostgreSQL in turn must be configured with a reference to that locale for the sorting to be available.

English is the default sorting on my PostgreSQL but the OS (Ubuntu) also had the Swedish locale installed and PostgreSQL referenced that, so I could ask for sorting in Swedish. However, to sort in Danish, I had to install the Danish locale in the OS and create a reference to that locale in my PostgreSQL.

With that in place:
Default sorting:

select * from tmp order by columna;
 columna 
---------
 A
 Å
 Ä
 Æ
 B
 O
 Ö
 Ø
 Z
(9 rows)

English sorting:

select * from tmp order by columna COLLATE "en_US.utf8";
 columna 
---------
 A
 Å
 Ä
 Æ
 B
 O
 Ö
 Ø
 Z
(9 rows)

Swedish sorting:

select * from tmp order by columna COLLATE "sv_SE.utf8";
 columna 
---------
 A
 B
 O
 Z
 Å
 Ä
 Æ
 Ö
 Ø
(9 rows)

Danish sorting:

select * from tmp order by columna COLLATE "danish";
 columna 
---------
 A
 B
 O
 Z
 Æ
 Ä
 Ø
 Ö
 Å
(9 rows)
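
The Swedish and Danish outputs above differ only in where each locale places its extra letters. A minimal Python sketch of that effect follows. It assumes two hand-written alphabets rather than real OS or ICU collation data, and the names `SWEDISH`, `DANISH` and `sort_by_alphabet` are made up for illustration. Real collations treat Æ/Ä and Ø/Ö as variants of one letter rather than as separate letters, so this only reproduces the single-letter case.

```python
# Hand-written alphabets: an illustrative assumption, not real collation data.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäæöø"
DANISH = "abcdefghijklmnopqrstuvwxyzæäøöå"


def sort_by_alphabet(values, alphabet):
    """Sort strings letter by letter according to the given alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Characters outside the alphabet sort after all known letters.
    return sorted(values, key=lambda s: [rank.get(ch, len(alphabet)) for ch in s.lower()])


letters = ["A", "Å", "Ä", "Æ", "B", "O", "Ö", "Ø", "Z"]
swedish_order = sort_by_alphabet(letters, SWEDISH)
danish_order = sort_by_alphabet(letters, DANISH)
```

The same nine letters come back in two different orders, which is the behaviour the COLLATE clause selects in the queries above.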

Comment by Cate Boerema (Inactive) [ 29/May/18 ]

That's awesome, Niels Erik Nielsen! What do you think is the best way to solve this, then? I assume that we want PostgreSQL to reference the locale set in FOLIO Settings.

Comment by Niels Erik Nielsen [ 29/May/18 ]

Hmm, there are some issues to clarify with input from more people, I think:

We could conceivably set the default sorting to Swedish in Chalmers database but a general FOLIO solution would have to consider multi-tenancy, I reckon.

Also, would there be an expectation that the admin, or even individual users, could change the locale for sorting at run time? That would probably have performance implications, i.e. the database indexing required to support per-request collation could be prohibitive, I don't know. I think we need shale99's evaluation of that.

Comment by Mike Taylor [ 29/May/18 ]

Whenever the server does sorting on behalf of a user, it should do that sorting under the rules of that user's locale. (At present, that means the locale of the user's tenant, because we store configuration only at the per-tenant level, but that will change.)

There are two ways to achieve this. One is to have all the back-end modules (or, more likely, raml-module-builder on their behalf) look up the user's locale in mod-configuration, and honour that. The other is to have the front-end explicitly specify the desired collation order in the sort-specification.

I think that the former approach is cumbersome, error-prone and potentially inefficient. The front-end already has to know the user's locale so it can render the correct i18n strings, so we should use that information.

CQL has support for specifying sort-collation order: see https://www.loc.gov/standards/sru/cql/contextSets/sort-context-set.html. It's essentially just using the /locale=VALUE index-modifier on each sort-key: so, for example:

(title="åland*" and ext.selected="true") sortby title/locale=sv-se

To make this work with a module like inventory, we'd need to do three things:
1. Have the front-end send the modified version of the query
2. Have cql2pgjson-java recognise the locale index modifier and emit suitably modified PostgreSQL queries.
3. Have PostgreSQL suitably configured to support the desired locales.
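
Step 1 could be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual makeQueryFunction code; `add_sort_locale` is a hypothetical helper. It assumes that sort keys are separated by whitespace, that modifiers are attached to their key without spaces, and that the string " sortby " does not occur inside a quoted search term.

```python
def add_sort_locale(cql: str, locale: str) -> str:
    """Append /locale=<locale> to every sort key after the 'sortby' keyword."""
    head, sep, tail = cql.partition(" sortby ")
    if not sep:
        return cql  # no sort clause, nothing to modify
    keys = []
    for key in tail.split():
        # Leave keys that already carry a locale modifier untouched.
        if "/locale=" not in key:
            key = f"{key}/locale={locale}"
        keys.append(key)
    return f"{head} sortby {' '.join(keys)}"


query = add_sort_locale('(title="åland*" and ext.selected="true") sortby title', "sv-se")
```

With the inputs shown, `query` is the example query given above. A query without a sortby clause is returned unchanged.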

#1 is a matter for the front-end team. It should be a fairly straightforward extension to makeQueryFunction – right, Zak Burke? – although, irritatingly, it will probably require that we add yet another parameter to carry the locale.

#2 may already be done: Julian Ladisch will know; and if not, he'll have a sense of how big a task it is.

#3 is essentially a system administration issue, which will be handled on a customer-by-customer basis. But we should have John Malconian or Wayne Schneider be aware of it, so they can come up with suitable procedures.

Comment by Charlotte Whitt [ 29/May/18 ]

Hi Cate Boerema, Niels Erik Nielsen, Mike Taylor - please note that most libraries will have a multi lingual collection, e.g. here the list from Cornell's catalogue (https://newcatalog.library.cornell.edu/)

List of titles in Icelandic:

Comment by Niels Erik Nielsen [ 29/May/18 ]

True, but you can only apply one sort order at a time.

Like, you can have a mix of Swedish and Danish characters as in the results above, but they would be sorted either in Swedish or in Danish (or in English, etc) at any given time.

Comment by Niels Erik Nielsen [ 29/May/18 ]

Side note: The sort order of Codex search is probably only partially/indirectly depending on the PostgreSQL collation.

Adam Dickmeiss can correct me but I think it works like this:

Codex sends its search request to one or more repositories, like KBs and/or FOLIO Inventory, and specifies the desired sort order. The individual repositories will then control what records are actually returned in what order in the first batch of, say, 30, or 100, or 1000 records from each resource. For example, is the desired sort field even supported by that target? Depending on whether a given repository is based on PostgreSQL or not, the PostgreSQL collation then comes into play for that individual source, determining what records will be returned in the first batch.

What happens next, however, is that Codex will merge the incoming results, for example the 30 or 100 or 1000 records that each resource returns in the first batch, and this will happen according to Codex' own sort order.

Thus, I suspect, to sort Inventory records in Swedish in Codex search, both Inventory's PostgreSQL Collation and Codex search would have to be configured for Swedish sort order.

Comment by Mike Taylor [ 29/May/18 ]

Charlotte Whitt wrote: "Please note that most libraries will have a multi lingual collection."

Niels Erik Nielsen replied "True, but you can only apply one sort order at a time."

This is true, and really really important. We need to make sure that SMEs have realistic expectations here: the issue isn't that our software has difficulty in sorting multilingual lists, it's that there literally is no correct ordering. What is right in one of the representative languages is wrong in another. So sorting is only ever in one specific locale: there is no other way it can be.

Niels Erik Nielsen is also correct that collation within the Codex is a separate problem again. I think we should solve it for core modules first before worrying too much about this.

The way the Codex multiplexer should handle this – the only way it can, really – is to propagate the sort-collation specification it gets sent in its requests to the various back-end modules, and to trust them to honour it. When it gets back a group of result-sets, it then zips them together on the assumption that they are all sorted in the same order. For that reason, a back-end module that cannot sort in the requested way must reject the request, rather than waving its hands and giving the records back in some other order. (We should add this observation to the Codex Contract document.)
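
The "zip together" step can be sketched with a standard k-way merge. This is only an illustration in Python, not the multiplexer's code. `swedish_key` stands in for a real collation key and uses a simplified hand-written alphabet, `merge_result_sets` is a hypothetical name, and the two result lists are made-up sample data.

```python
import heapq

# Simplified Swedish alphabet: an illustrative assumption, not ICU data.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH)}


def swedish_key(title):
    """Collation key; characters outside the alphabet (e.g. spaces) sort first."""
    return [RANK.get(ch, -1) for ch in title.lower()]


def merge_result_sets(result_sets, key):
    """Merge result sets that are each already sorted by `key`."""
    return list(heapq.merge(*result_sets, key=key))


# Made-up sample results, each already in Swedish order.
inventory = ["Den aktansvärda", "Den åländska skärgården", "Den äkta varan"]
kb = ["Den zoologiska trädgården", "Den öppna dörren"]
merged = merge_result_sets([inventory, kb], swedish_key)
```

The merge is only correct if every source really used the same order. If one source had sorted with English rules (å next to a), the merged list would silently come out wrong, which is why a source that cannot honour the requested collation has to reject the request.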

Comment by Theodor Tolstoy (One-Group.se) [ 29/May/18 ]

I really like the way this discussion is going!

Charlotte Whitt wrote: "Please note that most libraries will have a multi lingual collection."
Mike Taylor wrote: "What is right in one of the representative languages is wrong in another. So sorting is only ever in one specific locale: there is no other way it can be."

As a Swedish librarian, I honor my own characters and expect the system to do the same. But I do not necessarily honor other languages' "special" characters as highly. That is why sorting and related tasks in a database consisting of documents in multiple languages usually work pretty well, and this is why I think collation is a good short term solution.

However, when it comes to more complicated scenarios, such as ignoring the "The" in titles like "The tale of two cities" when sorting, but needing to keep it in Swedish titles like "The som dryck och kulturbärare" ("The" in Swedish is the tea that you drink), I think we have to have something more advanced in place.

Comment by Mike Taylor [ 29/May/18 ]

I think (though I am not an expert on this) that things like ignoring leading articles for sorting purposes are usually handled separately from locale-honouring collation. What we'd probably do is have code that, when a record is added or modified, looks at the title, chops off any leading article, and puts the result in a separate "sortable title" field. Then title-sorting is done using that field.

But of course that raises a bunch more issues. What is an article? If I'm English, then "the" should be discarded from the front of a title, but if I'm French maybe it's a book about tea and it should be retained. And this decision can't be made based on the prevailing locale of the user (if there even is a user – there won't be for some bulk-ingest operations) – it needs to be based on the locale of the record, which I assume is in the lang field or similar.

Anyway, I think we can and should relegate article-ignoring to a separate issue for another day.
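
For reference when that separate issue is picked up, the "sortable title" idea could look roughly like this. It is a sketch under assumptions: `LEADING_ARTICLES` is a tiny illustrative table rather than a complete list, the language codes "eng" and "swe" are assumed values of the record's language field, and `sortable_title` is a hypothetical name.

```python
# Leading articles per record language: a tiny illustrative table, not a complete list.
LEADING_ARTICLES = {
    "eng": ("the", "a", "an"),
    "swe": ("en", "ett", "den", "det", "de"),
}


def sortable_title(title, lang):
    """Return the title without a leading article, based on the record's language."""
    words = title.split(maxsplit=1)
    # Only strip when something remains after the article.
    if len(words) == 2 and words[0].lower() in LEADING_ARTICLES.get(lang, ()):
        return words[1]
    return title


english = sortable_title("The tale of two cities", "eng")
swedish = sortable_title("The som dryck och kulturbärare", "swe")
```

The decision depends on the record's language, not the user's locale, so the Swedish title about tea keeps its first word.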

Comment by Julian Ladisch [ 29/May/18 ]

For discussing how to stem non-filing characters like "the " use this separate issue: https://folio-org.atlassian.net/browse/UISE-70

Comment by shale99 [ 29/May/18 ]

this is a really big topic and i started commenting but it is too much to write i think, so i will just explain how this currently works,

today, when cql is converted to a postgres query - it indicates that the query should be lowercased and the accents removed (this is done across the board). RMB modules (meaning all storage modules that query the database) - when creating indexes on fields in the data - do the same (lowercasing and unaccenting of the data before indexing) - these two must be in sync (the query generated and the indexes created), otherwise when sorting / querying the index will not get used, as it will not be defined in the same manner as the query, and the system will just not perform.
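
The effect of lowercasing and unaccenting on both sides can be illustrated with a small Python sketch. This is not RMB's f_unaccent function. `normalize_for_index` is a hypothetical name, and it only strips combining marks after Unicode NFD decomposition, so letters without a decomposition (such as ø) are left unchanged here.

```python
import unicodedata


def normalize_for_index(text):
    """Lowercase the text and strip combining accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


indexed = normalize_for_index("The Åland archipelago")
query_term = normalize_for_index("åland")
```

Because the stored value and the query term go through the same function, "åland" and "aland" become indistinguishable, which matches the behaviour reported in the description.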

As NE mentioned there is of course support (sometimes it will require a language install) for many languages - but i have not done enough testing of declaring a collation at runtime - but it will probably need indexes to be created differently ahead of time:

for example
for an index created in the following manner:
CREATE INDEX IF NOT EXISTS abc_idx ON harvard_mod_inventory_storage.instance ((jsonb->>'title') COLLATE "sv-x-icu" ) ;

will be used for a query like this:
explain analyze select * from harvard_mod_inventory_storage.instance order by (jsonb->>'title') COLLATE "sv-x-icu" limit 10;

but will not be used for a query like this:
explain analyze select * from harvard_mod_inventory_storage.instance order by (jsonb->>'title') limit 10;

Comment by Zak Burke [ 29/May/18 ]

1. Have the front-end send the modified version of the query
...
#1 is a matter for the front-end team. It should be a fairly straightforward extension to makeQueryFunction – right, Zak Burke? – although, irritatingly, it will probably require that we add yet another parameter to carry the locale.

Adding a sort clause to makeQueryFunction is a fairly straightforward extension, but we have plenty of simple queries, e.g. to retrieve lookup tables for contributor types, material types, etc, that probably also need to handle this. This is out of the context of codex search, but I mention it because I wonder if setting the locale of the query should be the purview of something at the level of stripes-connect rather than by individual queries.

Comment by Theodor Tolstoy (One-Group.se) [ 29/May/18 ]

shale99 I think that is why Elasticsearch makes it harder to have separate pre-query and pre-indexing setups compared to Solr, if I remember correctly.

Comment by Mike Taylor [ 29/May/18 ]

shale99 That all makes sense.

Zak Burke This does, too. Of course, we'll need to add similar logic to the GraphQL support down the line. I think it's pretty straightforward, in either case, to rewrite sortby clauses as necessary.

Comment by Cate Boerema (Inactive) [ 29/May/18 ]

Great discussion! Do we know enough that I could ask you guys to put an estimate on this feature? Per the new process, all new features should be estimated as they are added.

Comment by shale99 [ 29/May/18 ]

Theodor Tolstoy (One-Group.se) it's been a while so i can not comment, i would assume that if we go with a solr type solution, we would have a solr core per tenant and then we may be able to configure that to the correct locale. however, we can actually do that with postgres now.

There are a few ways to tackle this, the easiest, i think, would be for storage modules based on rmb (i believe they all are) to add a locale header to the post tenant API (when a tenant registration occurs for a module). I could then easily create indexes based on that locale. during query time the locale header , or the cql with the locale support would need to be passed in and that would be added to the cql2pgjson , which would add the collation to the query. we could then even add multiple locales per tenant - but this would have an impact on sizing
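
One way to keep the index and the query in sync is to derive both from the tenant's locale. The Python sketch below only builds SQL strings and is not RMB code. All function names are hypothetical, the "<locale>-x-icu" naming follows the collation used in the earlier index example, and whether a given collation exists depends on the PostgreSQL and ICU installation. The schema name is assumed to come from trusted configuration, not user input.

```python
import re


def icu_collation(locale):
    """Map a locale tag such as 'sv' to a collation name such as 'sv-x-icu'."""
    if not re.fullmatch(r"[A-Za-z]{2,3}(-[A-Za-z0-9]+)*", locale):
        raise ValueError(f"unexpected locale tag: {locale!r}")
    return f"{locale}-x-icu"


def title_sort_expression(collation):
    """Expression shared by the index definition and the ORDER BY clause."""
    return f"(jsonb->>'title') COLLATE \"{collation}\""


def title_index_ddl(schema, collation):
    """Index definition, as it might be emitted when the tenant is registered."""
    index_name = "title_" + collation.replace("-", "_") + "_idx"
    return (f"CREATE INDEX IF NOT EXISTS {index_name} "
            f"ON {schema}.instance ({title_sort_expression(collation)})")


def title_query(schema, collation, limit=10):
    """Sorted query that uses the same expression, so the index can be used."""
    return (f"SELECT * FROM {schema}.instance "
            f"ORDER BY {title_sort_expression(collation)} LIMIT {limit}")


collation = icu_collation("sv")
ddl = title_index_ddl("harvard_mod_inventory_storage", collation)
sql = title_query("harvard_mod_inventory_storage", collation)
```

Supporting a second locale for the same tenant would mean a second index built from a second collation name, which is where the sizing impact comes from.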

Comment by Cate Boerema (Inactive) [ 30/May/18 ]

we could then even add multiple locales per tenant - but this would have an impact on sizing

We know we will definitely want multiple locales per tenant, but it was agreed that we could consider that out of scope for v1. Let's pick a solution that will allow for that later but estimate this feature with the assumption that we have only one locale per tenant.

Comment by Niels Erik Nielsen [ 30/May/18 ]

With a PostgreSQL setup with only one locale per tenant for now, it seems that it will serve no immediate purpose for the UI to be able to pass collation parameters per request just yet?

Even if PostgreSQL was configured to offer other collations, they might perform badly without the supporting indices.

Could all sorting for a tenant rely on the default collation in PostgreSQL then, shale99? Or would this require an RMB-generated collation parameter added to all "order by" directives coming from clients, like I think you described?

Comment by Niels Erik Nielsen [ 30/May/18 ]

What would the V1 plans for the locale setting be in this scenario:

Will it still be available and thus allow users to change page translations and date formats but not sorting?

Comment by shale99 [ 30/May/18 ]

Niels Erik Nielsen - by default , you mean the tenant's default ?

Comment by Niels Erik Nielsen [ 30/May/18 ]

@shale99, guess I'm unclear if the tenant's default would be implemented as a PostgreSQL default on some level.

But guess I also don't need to know (just curiosity). The salient point is, there's no purpose yet in clients supporting localized sorting run-time if it's statically defined per tenant in the back-end anyway, by whatever means.

Comment by Mike Taylor [ 30/May/18 ]

Do we know enough that I could ask you guys to put an estimate on this feature?

I feel like this issue has metastasized into half a dozen related but distinct issues. Maybe we should enumerate what they all are, give them each their own Jiras (with different priorities) and then make time estimates accordingly?

We know we will definitely want multiple locales per tenant, but it was agreed that we could consider that out of scope for v1. Let's pick a solution that will allow for that later but estimate this feature with the assumption that we have only one locale per tenant.

Sound strategy. Down the line, we're going to want locales to resemble UI modules in this sense: each FOLIO installation will have a "hard" configuration of a set of installed locales (just as it has a set of installed modules); and each tenant in that installation will have a "soft" configuration of which locale or locales it supports, chosen from the installed set (just as it has a set of configured modules chosen from the installed set).

With a PostgreSQL setup with only one locale per tenant for now, it seems that it will serve no immediate purpose for the UI to be able to pass collation parameters per request just yet?

I guess that's true; but the work will need to be done sooner or later. (This comes back to my observation at the top of the comment that we have a bunch of conflated issues here, and we need to disentangle which ones we want to address in the short term.)

What would the V1 plans for the locale setting be in this scenario? Will it still be available and thus allow users to change page translations and date formats but not sorting?

Down the line, we'll make it so that the set of locales offered on the front-end is the set that's installed on the back-end; and when the user selects a given locale, one of the consequences is that the queries he sends will include the specification to sort according to that locale. I guess that whether or not the underlying database has a locale-specific index built will be a dev-ops decision: we could conceivably have some setups that have "core support" (i.e. PostgreSQL indexes) for two or three locales, but also have other locales installed but not indexed. Then if I understand correctly, sorting by those locales will work, but will be much less efficient.

Comment by Cate Boerema (Inactive) [ 31/May/18 ]

I feel like this issue has metastasized into half a dozen related but distinct issues. Maybe we should enumerate what they all are, give them each their own Jiras (with different priorities) and then make time estimates accordingly?

Seems sensible. Do you want to take a stab at creating JIRAs for the issues? We can then review them to determine if they should be features or stories and estimate accordingly.

Comment by Mike Taylor [ 31/May/18 ]

I certainly walked into that one!

Comment by Mike Taylor [ 31/May/18 ]

I'll take a stab.

Comment by Julian Ladisch [ 31/May/18 ]

For GBV libraries a single collation per tenant is sufficient. See also this report about GBV Zentral discovery service: https://discuss.folio.org/t/gbv-zentral-uses-solrcloud-for-vzg-discovery-service/326
Unicode's root collation provides a good fall-back with "a reasonable language-agnostic sort order" for all characters outside the primary collation's range: http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation https://www.postgresql.org/docs/10/static/collation.html

Comment by Mike Taylor [ 31/May/18 ]

Well, it took longer to work through this than I expected, but here is a breakdown of what I think all the issues are. I won't file them in Jira until we've reached consensus, because I think there are ten new issues to be filed (plus the existing UISE-70 Closed comes into it).

The top-level issue is something like "Support locale-specific sorting". It will have four issues blocking it, which correspond to the four "parts" that I will enumerate below. Part 4 (locale-aware non-indexing prefixes) already has an issue, UISE-70 Closed . The other three will each have one or more lower-level issues blocking them, corresponding to the tasks that I have enumerated.

Here's the hierarchy as I see it:

  • Part 1. Support Swedish collation on a FOLIO installation
    • Task 1a. Set PostgreSQL default locale to Swedish
      • – Install the Swedish locale
      • – Configure PostgreSQL
  • Part 2. Support collation-locale selection per-tenant. (Happily, the steps required to resolve this also give us user-specific collation-locales.)
    • Task 2a. Have UI code specify the required collation locale in the CQL of each search that involves sorting
    • Task 2b. Ensure cql2pgjson converts collation information for PostgreSQL
    • Task 2c. Configure PostgreSQL to support all locales in nominated pool
      • – Install locales
      • – Create relevant PostgreSQL indexes
  • Part 3. Support collation-locales in Codex Search
    • (The Codex Search application will already forward collation information in CQL queries)
    • Task 3a. The Codex multiplexer must honour the specified collation locale in merging sources' result sets
    • (The core-modules' PostgreSQL database needs no changes)
    • Task 3b. Enhance each codex-source module to honour collation order
      • – At present, this is just mod-codex-ekb
  • Part 4. Non-indexing prefixes (e.g. "the")
    • (Not further analysed here: see UISE-70 Closed .)

We could, if we chose, prioritise a half-assed solution for Part 3 before doing Part 2. That would entail hard-wiring mod-codex-ekb to use the Swedish locale. I don't know how practical that would be: much will depend on operational concerns.

Comment by Mike Taylor [ 31/May/18 ]

Let's have comments on this, please, so we can arrive at an agreed set of issues. Once we have them straight, I'll file the individual issues in their appropriate Jira projects, and link them all together.

(Of course, your comments can just be "that looks perfect" if you wish.)

Comment by Julian Ladisch [ 31/May/18 ]

CQLPG-26 Blocked "ICU collations locale (independent of operating system)"
If CQLPG-26 Blocked is done we don't need to install locales/collations.

Comment by Julian Ladisch [ 31/May/18 ]

If we support a single collation per tenant there is an easier solution:
When creating the database table one can specify the default collation for the fields of that table. No need to specify the collation in CQL. I think this is sufficient for most libraries.
It is for GBV's libraries: https://discuss.folio.org/t/gbv-zentral-uses-solrcloud-for-vzg-discovery-service/326

If there really is a requirement to support several collations per tenant this solution can be extended. There still remains the default collation, but one can request a different collation in CQL (that needs to be supported by separate indexes in postgres).

Comment by Mike Taylor [ 31/May/18 ]

For what it's worth, my feeling is that support for multiple locales within a tenant is going to become a live issue really soon. Think of all the Texan universities where some students are primarily English-speaking and others Spanish-speaking.

Comment by Cate Boerema (Inactive) [ 01/Jun/18 ]

I changed the summary of this issue to the more generic "Tenant Sort Order Setting". There are other linked features now for User sort order setting (UXPROD-513) and Cross-tenant sort order setting (UXPROD-514). I actually can't remember what need the latter was supposed to serve, but it was listed in the v1 roadmap (as "system sort setting") so I am bringing it over.

Comment by Jakub Skoczen [ 20/Jun/18 ]

Cate Boerema, I am looking at estimating this, I just want to confirm that the sort locale would be a single, tenant-level setting independent from document language.

Comment by Cate Boerema (Inactive) [ 20/Jun/18 ]

That seems right to me, Jakub Skoczen. I hope others on this ticket will chime in if they disagree.

Comment by Cate Boerema (Inactive) [ 21/Jun/18 ]

I guess there should be a frontend estimate here, as well, right? At the very least for adding a menu in Settings for selecting your sort order option...

Comment by Cate Boerema (Inactive) [ 02/Jul/18 ]

In discussion with Jakub Skoczen and Tod Olson on UXPROD-513 (User sort order setting), sort and collation order should be locale-driven. Therefore, we can just use this issue to represent making a locale-driven sort and collation order (hence the change of summary for this issue) and then rely on UXPROD-510 Open (User Level Locale) to enable user-level sort order setting. I will be deleting UXPROD-513 as it's now redundant.

Comment by Cate Boerema (Inactive) [ 02/Jul/18 ]

Since I am deleting UXPROD-513 (User sort order setting), I wanted to carry over Tod Olson's useful comment from that issue:

Locale-driven sort and collation makes sense to me. Let me talk about this in the context of sorting and collating bibliographic records, as I think that's the most complicated case of what FOLIO will need to sort and collate.

When a library is dealing with a set of records, as in a result set or an alphabetical listing, we may have English, German, Swiss, Swedish, Russian, CJK, etc. records all in one list. They will not be segregated by language, but all collated together. So the sort/collate order of the tenant makes sense. For example, a user in an EN-US locale will expect diacritics to be ignored, where a user in France or Sweden would have their own locale-specific expectations for how sorting/collation treats diacritics.

Languages in different scripts could become a more complicated case. Speaking for EN-US libraries, today when we deal with non-Roman script materials, the metadata for title, author, etc. is transliterated into Roman script. More recent records will typically also have the non-Roman script, but older records may not. Take this record from our public catalog:

https://catalog.lib.uchicago.edu/vufind/Record/11419602

Here's the author and title data behind that record:

MARC Field                                    Value
100 (Author / Creator)                        Nabokov, Vladimir Vladimirovich, 1899-1977, author.
245 (Title and statement of responsibility)   I︠U︡nostʹ / Vladimir Nabokov.
880 (linked to 100)                           Набоков, Владимир Владимирович, 1899-1977, author.
880 (linked to 245)                           Юность / Владимир Набоков.

In a search result in either the ILS or the public catalog, an EN-US library would expect a title sort to collate this as "iunost".

Comment by Cate Boerema (Inactive) [ 13/Jul/18 ]

Per discussion with Jakub Skoczen, we can get started on this without a PO. He suggested I assign it to him and he may find a developer he can assign it to from there.

Comment by Jakub Skoczen [ 21/Aug/18 ]

Theodor Tolstoy (One-Group.se), Cate Boerema: We have discussed that the collation setting could be independent from the locale settings visible in the UI and set only once during the tenant initialization. This would simplify the implementation in the backend modules. Would that be acceptable?

Comment by Cate Boerema (Inactive) [ 21/Aug/18 ]

set only once during the tenant initialization

Does this mean it will be read-only in the UI?

Comment by Zak Burke [ 21/Aug/18 ]

My understanding of this is that changing the locale in the UI would change labels in the UI but NOT change the locale used to sort and translate lists of items. That is, changing the UI locale from English to French will change a label from "Title" to "Titre" but will not change the ordering of results in a list of items retrieved from the backend because the backend collation will remain English.

Comment by Mike Taylor [ 21/Aug/18 ]

Now that you lay it out, Zak Burke, that sounds
(a) obviously correct, and
(b) not too bad.

Comment by Lisa Sjögren [ 02/Nov/20 ]

Hi! Does this UXPROD only concern sort order in Codex (as the description seems to suggest), or is it an umbrella for various apps where localized sort order might be relevant? E.g. Users, Inventory, Orders...

And would localized sort order be addressed in planned broader search enhancements (which I understand to be leaning towards ElasticSearch)?

Comment by Charlotte Whitt [ 02/Nov/20 ]

Hi Lisa Sjögren - to the best of my knowledge, then you're right:

  1. this feature is a general issue (FOLIO wide)
  2. this should be addressed by the Elastic Search implementation - I'll give it the label 'elastic search'

CC: Magda Zacharska (new PO for Implementation of Elastic Search)

Comment by Lisa Sjögren [ 02/Nov/20 ]

Great! Thanks, Charlotte Whitt!

I think it would also be good if we could clarify in the description that this is a general issue. What does the reporter think, Theodor Tolstoy (One-Group.se)?

Comment by Massoud Alshareef [ 02/Nov/20 ]

Indeed. This is a general FOLIO issue with languages that need certain characters representing the same letter to be normalized to one single character between them. The Arabic language has several letters of this sort. For example, characters like ا أ إ آ should be normalized to one of them (ا) during indexing if any appears leading a word. Characters like ه , ة should be normalized to one of them (ه) during indexing if any appears trailing a word. 
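
The two rules described above can be sketched in a few lines of Python. This is only an illustration of those two rules, not a complete Arabic normalizer and not what Solr or Elasticsearch do internally. `normalize_arabic_word` is a hypothetical name, and the letters are written as Unicode escapes to avoid right-to-left rendering problems in the code.

```python
ALEF = "\u0627"                        # bare alef
ALEF_VARIANTS = "\u0623\u0625\u0622"   # alef with hamza above, hamza below, madda
TEH_MARBUTA = "\u0629"
HEH = "\u0647"


def normalize_arabic_word(word):
    """Apply the leading-alef and trailing-teh-marbuta rules to one word."""
    if word and word[0] in ALEF_VARIANTS:
        word = ALEF + word[1:]
    if word.endswith(TEH_MARBUTA):
        word = word[:-1] + HEH
    return word


# "Ahmad" written with alef-hamza, and "madrasa" ending in teh marbuta.
name = normalize_arabic_word("\u0623\u062d\u0645\u062f")
school = normalize_arabic_word("\u0645\u062f\u0631\u0633\u0629")
```

As with unaccenting, the same function would have to be applied both when indexing and when building the query.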

As you mentioned Charlotte Whitt, Elastic Search, like Solr, supports normalization as part of word stemming, and that should take care of the sorting issue described by Lisa Sjögren. We are using this technique for Arabic text searching/sorting using Solr FTR driving DSpace and VuFind.
attia.alshareef

Comment by Julian Ladisch [ 02/Nov/20 ]

PostgreSQL has the normalize function to support normalization: https://www.postgresql.org/docs/current/functions-string.html
RMB has already implemented the f_unaccent normalization, more normalisations can be added if needed.

Comment by Khalilah Gambrell [ 03/Dec/21 ]

Jakub Skoczen and Julian Ladisch, is this feature still needed?

Comment by Julian Ladisch [ 09/Dec/21 ]

Yes.

Generated at Fri Feb 09 00:10:05 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.