[FOLIO-1955] Create databases using und-x-icu collation Created: 09/Apr/19  Updated: 24/Sep/23

Status: Open
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: Julian Ladisch Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
blocks RMB-369 ICU collation: postgresql-embedded an... Open
blocks BF-264 Sorting of contributor types ignores ... Closed
blocks MODDICONV-34 Handle sorting issue for name field f... Closed
blocks UXPROD-1821 3rd party dependency upgrades (Q4 2019) Closed
blocks UXPROD-2214 3rd party dependency upgrades (Q1 2020) Closed
blocks CQLPG-26 ICU collations locale (independent of... Blocked
is blocked by FOLIO-1438 Upgrade to PostgreSQL 10 Closed
Relates
relates to UXPROD-745 Tenant Sort Order Setting Open
relates to FOLIO-1845 Replace postgresql-embedded by testco... Closed
relates to FOLIO-3901 invalid locale name: "und-x-icu" in b... Closed
relates to RMB-37 SQL sorting/comparing must use the te... Draft
relates to UIU-1726 User app search: observe and interfil... Draft
Sprint:
Story Points: 2
Development Team: Core: Platform

 Description   

These lines need to be added to the postgresql_db database creation task to enable und-x-icu collation (sorting):

- postgresql_db:
    encoding: UTF-8
    lc_collate: und-x-icu
    lc_ctype: und-x-icu
    template: template0

ICU collation is operating system independent.

und-x-icu (for “undefined”) is the ICU “root” collation. Use this to get a reasonable language-agnostic sort order: https://www.postgresql.org/docs/current/static/collation.html
"The CLDR root collation […] is used by all other locales by default, or as the base for their tailorings." http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation

This fixes the problem that spaces are ignored when sorting. See BF-264 Closed for an example.



 Comments   
Comment by John Malconian [ 10/Apr/19 ]

Updating the ansible task using the example above results in the following error:

fatal: [10.36.1.200]: FAILED! =>

{"changed": false, "msg": "Database query failed: invalid locale name: \"und-x-icu\"\n"}
Comment by John Malconian [ 11/Apr/19 ]

Julian Ladisch - Do you have any ideas about what might be missing?

Package: postgresql-10
Version: 10.7-1.pgdg16.04+1
Architecture: amd64
Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org>
Installed-Size: 22972
Depends: locales | locales-all, postgresql-client-10, postgresql-common (>= 182~), ssl-cert, tzdata, debconf (>= 0.5) | debconf-2.0, libc6 (>= 2.17), libgssapi-krb5-2 (>= 1.8+dfsg), libicu55 (>= 55.1-1~), libldap-2.4-2 (>= 2.4.7), libpam0g (>= 0.99.7.1), libpq5 (>= 9.3~), libselinux1 (>= 2.1.12), libssl1.0.0 (>= 1.0.0), libsystemd0, libuuid1 (>= 2.16), libxml2 (>= 2.7.4), libxslt1.1 (>= 1.1.25), zlib1g (>= 1:1.1.4)
Recommends: sysstat
Provides: postgresql-contrib-10
Homepage: http://www.postgresql.org/
Priority: optional
Section: database
Filename: pool/main/p/postgresql-10/postgresql-10_10.7-1.pgdg16.04+1_amd64.deb
Size: 4933854
SHA256: b9964ac664d150d553e33cb58f45d642165aeafd3e5e2772c380e15fad107e4a
SHA1: 8c84fc507093fd51264f9dac8924f4271fa90a90
MD5sum: edeb6ea7a8d25eac477a4c30d6d8c593
Description: object-relational SQL database, version 10 server
PostgreSQL is a powerful, open source object-relational database
system. It is fully ACID compliant, has full support for foreign
keys, joins, views, triggers, and stored procedures (in multiple
languages). It includes most SQL:2008 data types, including INTEGER,
NUMERIC, BOOLEAN, CHAR, VARCHAR, DATE, INTERVAL, and TIMESTAMP. It
also supports storage of binary large objects, including pictures,
sounds, or video. It has native programming interfaces for C/C++,
Java, .Net, Perl, Python, Ruby, Tcl, ODBC, among others, and
exceptional documentation.
.
This package provides the database server for PostgreSQL 10.
Description-md5: 99222d39773bfcad9899d42876edecfb
Postgresql-Catversion: 201707211

Comment by John Malconian [ 11/Apr/19 ]

From the postgresql server log:

2019-04-10 17:48:54.693 UTC [2714] folio_admin@postgres ERROR: invalid locale name: "und-x-icu"
2019-04-10 17:48:54.693 UTC [2714] folio_admin@postgres STATEMENT: CREATE DATABASE "okapi_modules" O
WNER "folio_admin" TEMPLATE "template0" ENCODING 'UTF-8' LC_COLLATE 'und-x-icu' LC_CTYPE 'und-x-icu'

Comment by Marc Johnson [ 11/Apr/19 ]

John Malconian Could be a red-herring, this stack overflow answer suggests the locale cannot be set on a database this way

Comment by John Malconian [ 11/Apr/19 ]

I think you're right, Marc Johnson. Might be a limitation of pg10. I've seen a couple of stack overflow answers allude to this.

Comment by John Malconian [ 11/Apr/19 ]

tried a couple of variations based on Stack Overflow and still no luck.

CREATE DATABASE "okapi_modules" WITH OWNER = "folio_admin" TEMPLATE = "template0" ENCODING = 'UTF-8' LC_COLLATE = 'und-x-icu' LC_CTYPE = 'und-x-icu' TABLESPACE = pg_default CONNECTION LIMIT = -1;

ERROR: invalid locale name: "und-x-icu"

Should I remove the LC_COLLATE and LC_TYPE config and include 'ENCODING = 'UTF-8' or should I mark this issue as 'won't fix'.

Comment by Julian Ladisch [ 23/Apr/19 ]

I've tried und-x-icu with postgres:10-alpine docker container where that CREATE DATABASE works.

That fixes the sorting problem, but this restriction how postgres must be run (Alpine) can be a temporary solution only.

A fix for PostgreSQL to make ICU work in Debian/Ubuntu has been proposed but needs some fundamental design work:
https://www.postgresql.org/message-id/flat/0951afb4-a634-e047-4cd8-a9e98f7feed9%402ndquadrant.com#d1bd94338d12d55441690651d2d27b4e

Comment by Ann-Marie Breaux (Inactive) [ 09/Aug/19 ]

Hi Julian Ladisch and John Malconian and Marc Johnson Looks like this story is blocking several others, and the blocker for this one has been resolved. Might you be able to work this into an upcoming Core-platform sprint? Thank you for considering.

Comment by Julian Ladisch [ 12/Aug/19 ]

This requires to replace the Debian/Ubuntu APT PostgreSQL package by the Docker Alpine PostgreSQL package: https://github.com/folio-org/folio-ansible/blob/master/roles/postgresql/tasks/main.yml
Do the DevOps agree with this?

Comment by Wayne Schneider [ 12/Aug/19 ]

Julian Ladisch are you proposing running PostgreSQL in a container, using the postgres:10-alpine image?

Comment by Wayne Schneider [ 12/Aug/19 ]

John Malconian Ian Hardy have we tested this in RDS?

Comment by John Malconian [ 12/Aug/19 ]

I don't understand why we would even entertain the idea of adopting something that is only supported in a small subset of postgresql distributions.

Comment by Julian Ladisch [ 13/Aug/19 ]

Yes, I propose to run PostgreSQL in a Docker container, using the postgres:10-alpine image.

PostgreSQL sorts using collations provided by the operating system. The collation that Ubuntu/Debian provides ignores spaces and therefore has been reported as a bug, see the linked issues MODDICONV-34 Closed and BF-264 Closed that are blocked by this issue.

I also don't like to restrict the operating system PostgreSQL runs under but we should try this as a temporary solution until a fix for PostgreSQL to make ICU work in Debian/Ubuntu is available: https://www.postgresql.org/message-id/flat/0951afb4-a634-e047-4cd8-a9e98f7feed9%402ndquadrant.com#d1bd94338d12d55441690651d2d27b4e

Side note: FOLIO-1845 Closed proposes to use a PostgreSQL Docker container for testing.

Comment by Marc Johnson [ 13/Aug/19 ]

Jakub Skoczen Julian Ladisch

My understanding is that there is a need to be able to search / sort using different collation for individual properties, e.g. the English title vs. the Swedish title of an instance, or where some instances have German titles and others have Dutch titles?

Given that, does this approach satisfy those needs, or would it only offer searching / sorting by a single collation across all records?

Comment by Charlotte Whitt [ 14/Aug/19 ]

Hi Marc Johnson - correct we need to be able to handle e.g. Swedish characters with diacritics, German characters with diacritics, Hungarian letters with diacritics etc, etc, special Danish and Norwegian letters like : æøå, ÆØÅ, and many more.

Theodor Tolstoy (One-Group.se) added for a long time ago this bug: UISE-69 Closed re the issue for Swedish diacritics.

Comment by Marc Johnson [ 14/Aug/19 ]

@charlotte stated

correct we need to be able to handle e.g. Swedish characters with diacritics, German characters with diacritics, Hungarian letters with diacritics etc, etc, special Danish and Norwegian letters like : æøå, ÆØÅ, and many more.

I think if we need to support records for the same tenant that can contain a mixture of multiple collations, then whilst PostgreSQL ICU support might be useful or necessary, it is unlikely to be sufficient to fulfil all of those needs.

Comment by Julian Ladisch [ 14/Aug/19 ]

This issue is to fix the space problem only. und-x-icu (for “undefined”) is the ICU “root” collation to get a reasonable language-agnostic sort order: https://www.postgresql.org/docs/current/static/collation.html

For the other topic see UXPROD-745 Open "Tenant Sort Order Setting" and RMB-37 Draft "SQL sorting/comparing must use the tenant's locale/collation".

Generated at Thu Feb 08 23:17:08 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.