[FOLIO-1002] Use an IDENTITY columns instead of UUIDs for primary keys Created: 31/Dec/17 Updated: 13/Jul/20 Resolved: 13/Jul/20 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | New Feature | Priority: | P3 |
| Reporter: | Jon Miller | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Issue links: |
|
||||||||||||||||||||||||||||
| Sprint: | |||||||||||||||||||||||||||||
| Development Team: | Core: Platform | ||||||||||||||||||||||||||||
| Description |
|
I think it would better if the tables in FOLIO used IDENTITY columns instead of UUIDs. I noticed this in the users module, but, it is more of a general issue. In the case of the user module it might not make that much of a difference performance-wise, since the number of users in the system probably won't be that great. However, as a general rule, I think it is not a good idea to use random values as primary keys. The primary key column has a clustered index. Normally, you want the ids in sequential order, or the DBMS may need to shift the rows around upon INSERT. This is probably more of an issue for something like a bibs table where you have millions of rows. But, it would be good if the system used identity columns across the board. An added advantage is that the ids take less space on disk. Also, they are easier to work with in terms of searching. An INT value is easier to type in that a long hexadecimal string. It would also be good if FOLIO upgraded to PostgreSQL 10 which supports identity columns using the standard SQL syntax. |
| Comments |
| Comment by Julian Ladisch [ 02/Jan/18 ] |
|
FOLIO will use Postgresql 10 because raml-module-builder (version 16.0.4 and later) has been updated to use Postgresql 10. |
| Comment by Jakub Skoczen [ 23/Feb/18 ] |
|
Jon Miller Conveniece aside, do you have any Postgres benchmark results that you can point that supports your claim that native UUID PK has worse performance compared to SEQUENCE/IDENTITY? |
| Comment by Jon Miller [ 23/Feb/18 ] |
|
Not at the moment, but, I could do a test. |
| Comment by Jakub Skoczen [ 23/Feb/18 ] |
|
It might me interesting to see what the overhead is (if any). The randomness is not really a problem since UUIDs are timestamp based and effectively sequential (that's only relevant to the scenario when they are autogen). Short comparison: https://blog.codinghorror.com/primary-keys-ids-versus-guids/ |
| Comment by Jon Miller [ 24/Feb/18 ] |
|
Wayne Schneider told me that PostgreSQL doesn't cluster indexes like some other databases do, such as SQL Server. So, I think it probably isn't an issue. It looks like if you want to cluster the data, it is done manually https://www.postgresql.org/docs/current/static/sql-cluster.html. It says it doesn't do it automatically. It appears that there are different UUID algorithms that can be use in PostgreSQL https://www.postgresql.org/docs/9.5/static/uuid-ossp.html. I guess some are random and some aren't. mod-users appears to be using a random one. CREATE TABLE IF NOT EXISTS myuniversity_mymodule.users (id UUID PRIMARY KEY DEFAULT gen_random_uuid(), jsonb JSONB NOT NULL); However, since it doesn't do clustering, I don't think it matters anyway. But, that is just a default anyway. A client could provide it's own value and I guess could use whatever algorithm it wants. Mainly, I was concerned about the tables that have millions of rows in them. I just noticed that the mod-inventory-storage module is already using identity values. CREATE TABLE schema_name.instance (_id SERIAL PRIMARY KEY, jsonb JSONB NOT NULL); Long story short, I doubt it's a problem. |
| Comment by Jon Miller [ 24/Feb/18 ] |
|
The fact that it's using a 128-bit value when you may only need something smaller like a 32-bit value may slow things down some. It was mainly the clustering that I was wondering about. Maybe it would still be best to set the default to one of the sequential UUID generators. However, it looked like maybe you needed to compile or add an additional module for the other ones. Maybe that's why the random one was chosen. I'm amazed that as far as I know unsigned integers aren't part of the SQL standard. |
| Comment by Jon Miller [ 19/Apr/18 ] |
|
There are a couple other issues with regard to the UUIDs. One of them is that we would like to preserve the id values for a number of tables in our existing system. Also, I've talked to a number of people and everyone I talked to said that they don't like the UUIDs in terms of being able to communicate them to other people. Smaller integer values are a lot easier to work with. The issues isn't just with speed. It has to do with ease of use as well. The values are also wasteful of screen real estate. For example, a lot of times in reports you will include the id value as a column in the output. A 36 character string wastes a lot of space. It is also difficult at a glance to to compare one UUID from another. Also, performance-wise, you have to consider that the 128-bit values can appear in foreign keys, not just primary keys which wastes additional space. |
| Comment by Julian Ladisch [ 14/Feb/20 ] |
Please create a separate issue for each table where a field to store the former identifier is missing. bq .everyone I talked to said that they don't like the UUIDs in terms of being able to communicate them to other people.
The UUID id primary key is not intended for humans. We have human readable ids (HRIDs), for example UXPROD-884. Please create a separate issue for each table where no HRID exists.
Please create a separate issue for each screen where space is wasted by a UUID.
I cannot imagine that this tiny performance loss has any significant impact in real life scenarios. Can you post the numbers of your foreign key performance comparison 128-bit UUID vs. 32-bit INT? |
| Comment by Ann-Marie Breaux (Inactive) [ 13/Jul/20 ] |
|
Hi Jon Miller Looks like Ian has indicated that you should create separate issues for the various tables. If you do, please create them in the correct Jira projects and assign to the appropriate Dev Teams/POs. Here's the Dev team assignments: https://folio-org.atlassian.net/wiki/display/REL/Team+vs+module+responsibility+matrix, and here's the PO assignments: https://folio-org.atlassian.net/wiki/display/COMMUNITY/Directory+of+Product+Owners+by+Area+of+Focus |
| Comment by Jon Miller [ 13/Jul/20 ] |
|
This one is a really old one. It can probably just be closed out. I don't see this changing at this point. |