[FOLIO-399] primary keys / resource creation Created: 01/Dec/16  Updated: 12/Nov/18  Resolved: 15/Dec/16

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: Adam Dickmeiss Assignee: Adam Dickmeiss
Resolution: Done Votes: 0
Labels: sprint4
Remaining Estimate: 1 day, 2 hours
Time Spent: 1 day, 1 hour, 30 minutes
Original estimate: Not Specified

Issue links:
Relates
relates to OKAPI-225 Allow server-side UUID for POST. Allo... Closed
relates to FOLIO-972 The regex pattern for UUID in schemas... Closed
relates to FOLIO-698 document RESTful API guidelines for m... Draft
Sprint:

 Description   

Okapi and raml-module-builder (RMB) uses different schemes WRT resource creation and primary keys. We should decide on a common scheme. Whatever resolve will most likely affect OKAPI and DMOD things.



 Comments   
Comment by Adam Dickmeiss [ 01/Dec/16 ]

Here's my take on the matter.. Which, clearly, will mean modifications to Okapi.

1. POST /path . Creates a resource. The key is generated by the service and returned as part of result and location. Even in the case where a ID is given as part of JSON content, it will be IGNORED .. A new copy is returned with a unique identifier as part of response. The server will return 201 or error.

2. PUT /path/id .. Will put a resource. The pk is given as id and the server will "store" it. Suppose a database is used, then either the resource is added or updated.

Note that PUT is idempotent, while POST isn't.

Because objects may be shared between installations it is wise to use UUIDs in 1:POST. This allows "safe" use of it in 2:PUT. However, we should not rule out that other schemes COULD be used.. Ie that a service actually returns NOT UUIDs but something else and that is what is known to be the identifier for a particular set of objects.

Comment by Jakub Skoczen [ 01/Dec/16 ]

It is aligned with the HTTP spec: https://tools.ietf.org/html/rfc2616#section-9.6.

1:POST will give us server-side generated IDs and let's recommend UUIDs for the APIs and use them all over the modules' APIs we build ourselves.

2:PUT will give us a way to client-side generate ID in a special circumstance (by controlling the path). Let's just make sure to follow the HTTP spec and differentiate resource update vs resource creation with a proper response code (200 OK vs 201 Created).

It also means that 2:PUT may be optional for the API – if a module does not want to allow client-side control IDs it may choose to implement PATCH only (which is a partial update), PATCH requires an existing entity (URL).

As discussed with Adam, this approach will not allow us to immediately client-side control IDs for batch imports, which is a problem when moving data around, if the batch import is implemented by POST of a record bundle/collection. An alternative approach (e.g as in json-api.org) is to allow specifying the ID within the body of resource during POST and letting the server decide whether or not the proposed ID is used. This may results in a more complicated API.

Comment by shale99 [ 01/Dec/16 ]

IDs = a column in the table with a unique constraint or the primary key of the table?

Comment by Jakub Skoczen [ 01/Dec/16 ]

shale99 why is this relevant? In SQL `FOREIGN KEY` is pretty much a short-hand for `UNIQUE NOT NULL` with an implicit INDEX on the column.

Comment by shale99 [ 01/Dec/16 ]

if it is a unique column and not a pk - the pk can be db generated as a permanent uuid and the foriegn keys to other tables with db generated uuids as well. every records is recognizable via the "client id" or the db generated uui which is permanent - but we dont have a url to an item in amazon as the pk of the record - no?
moving data from one db to another will maintain the pk uuid and the fk as they are permenant UUIDs so no relations are lost

Comment by Jakub Skoczen [ 01/Dec/16 ]

The short answer is: none of that matters for this discussion. Again, PK is a SQL-specific short-hand for unique and non-empty column with an index that makes it simpler to create references. It does not mean it's auto-generated, you do this separately with auto-generation strategy which is not specific to the PK column but can be used for anything. E.g in Postgres you can enforce referential integrity but not using the PK keyword even once and naming columns explicitly in your REFERENCES statement. No sane person would do it, but you could. It simply makes no difference for the discussion here.

Comment by Jakub Skoczen [ 01/Dec/16 ]

Oh, and if you are asking about what strategy we should use in our modules: whatever makes sense as long as it can support the API requirements we listed above.

Comment by shale99 [ 01/Dec/16 ]

guys, my final thoughts on this (nothing new, just a summary) you have worn me out completely with this (so i am taking a step back)

1. there has to be clear guidelines to store data so that all those non developers supporting the system understand what is going on. hence,
a. a pk / fk which is propagated across the entire index needs to be maintained by the database - not in any developers code logic - it should be permanent and unique (for example uuid) and all internal workflows should use this id to reference the data.
2. there is no problem having a column with a unique index on a field that a module feels can be used to recognize a record coming from an external system - what happens when a record comes in with that data - do your put / post / patch thing, whatever... that column should NOT be used as a reference, it is just another column to be used at the discretion of the biz logic of the module.
3. the 100+ table in this system - most of them being used for biz logic within a module should not need the module to indicate its keys for it but use the standard conventions of the database (for example permanent uuids)

by the way: the heavy importing of data into a system like this (putting oai aside) - will usually be tar.gz that are uploaded and then a module will process and save. moving a tenant between folio installations would be silly to do via anything other then an import / export db mechanisms - hence the importance of separating tenants so this can be done.

Comment by Jakub Skoczen [ 01/Dec/16 ]

1. yes, guideline are good but again this is out of scope of this discussion. And no, if we choose to use UUIDs we have the following options available to us

  • generate them in the DB directly (e.g Postgres supports this)
  • generate them in the storage modules that guards all access to the data
  • generate them in the client and validate in either the system module and/or DB
Comment by shale99 [ 04/Dec/16 ]

a little more info on this:
postgres supports uuid v1 - v5 (all versions) v2 is not explicitly defined in the UUID specification and thus is not implemented by all UUID generators

the chance of a v4 collision:
for 68,719,476,736 entries (uuids generated) --> chance of 1 duplicate: 0.0000000000000004 (4 × 10−16)

wiki: "To put these numbers into perspective, the annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. However, these probabilities only hold when the UUIDs are generated using sufficient entropy. Otherwise, the probability of duplicates could be *significantly *higher,"

Note: this is a lower probability then not detecting a corrupt TCP/IP packet - which no one handles in the code - (checksum and content both corrupt in a way that the content seems correct) meaning for example (there are a million problematic example of course) - uploading a file and receiving part of it corrupted and not knowing about it).

so if this is done right - the system should be safe - if not, then code must be added for each api endpoint to verify this is a proper uuid / or not.....

about those generating the uuid:

uuid v1 * uses the mac address and the time in nano seconds - can this be a problem for customers as it *does not protect patron anonymity ? - there is no way for any backend service to validate this and protect against it.

uuid v3 uses md5 and it is now recommended to use v5 instead of v3

so we would want anyone implementing this to use v4 or v5

there is a case to be made about uuids being guessable from a security standpoint (for example v1) - for example, to not allow users to iterate through data (some people find this problematic - i dont know if this is critical in our case)

can implementations mess this up? - google it and find out

currently looking into what postgres offers....

Comment by Jakub Skoczen [ 08/Dec/16 ]

Guys, I'll try to capture the outcome of various discussion as succinctly as I can:

a) we want to use UUID all over the place and restrict this in the JSON schema (regex pattern)
b) POST will now do both: consider the “id” field send by the client in the document body and if validation is OK, potentially use it as the UUID OR choose it’s own UUID if it prefers that or if nothing is supplied (server choice)
b) PUT to explicit path with UUID will create the record with the specified ID, or fail (client-choice)

Generated at Thu Feb 08 23:05:29 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.