[FOLIO-535] agree on the approach for joining/filtering (single module) associations Created: 03/Apr/17  Updated: 15/Jul/20

Status: Open
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Umbrella Priority: P3
Reporter: Jakub Skoczen Assignee: Jakub Skoczen
Resolution: Unresolved Votes: 0
Labels: for-dublin
Remaining Estimate: Not Specified
Time Spent: 4 hours
Original estimate: Not Specified

Attachments: PNG File Screenshot from 2017-04-05 20-42-00.png     PNG File Screenshot from 2017-04-05 20-48-31.png     PNG File Screenshot from 2017-04-05 21-03-10.png     PNG File Screenshot from 2017-04-06 08-28-05.png     PNG File Screenshot from 2017-04-06 08-30-00.png    
Issue links:
Blocks
blocks STRIPES-319 Newly created item records have no st... Closed
is blocked by FOLIO-546 design i18n for services and data Closed
Relates
relates to METADATA-61 build an endpoint for maintaining loa... Closed
relates to STRIPES-287 "item types" should be "material type... Closed
relates to UIU-32 patron group filter should use the va... Closed
relates to CIRC-3 Depend upon item-storage v2.0 interface Closed
relates to FOLIO-556 align PG and mattype endpoints by usi... Closed
relates to METADATA-57 Use UUID for Refering to Material Types Closed
Sprint:
Development Team: Core: Platform

 Description   

We have discussed that the current approach to mattypes and PGs is not consistent and confusing: mattypes use a client-controlled "code", while PGs use a server-side enforced UUIDs to establish an association with Item and User records respectively.

The client-controlled "code" especially is considered among the front-enders as problematic and not friendly for users/UX. We would also want a consistent approach for "controlled vocabulary"-like (CV) entities (mattypes, pgroups, and soon loan_types, locations, etc)

The two proposed solution (it actually could be combined into a single-solution):

a) always generate a UUID for the CV item and use that to link the item to the entity. Do not attempt to "lookup" the display name when retrieving the entity or when searching/filtering. Use the raw UUID and perform a display-time look-up in the UI by retrieving the ID-to-displayname map (we can ensure that this map will be fairly short for the considered item types)
b) join the entity against the display name and return it during entity retrieval and searching/filtering, this will make the API more user(developer)-friendly and will require less client looks ups. It may need to support things like i18n though, unless each item has-an designate (e.g english) label.



 Comments   
Comment by shale99 [ 03/Apr/17 ]

i think that using the UUID may be abit confusing

will mention a few things that i thought about - comments welcomed...

(the below isnt a design suggestion, just how it is done in other systems, there is logic in the process below)

lets think about how 99% of the material types are populated,
1. a tenant imports their 1-2 million records (depending on the library size - could be 4-5M and in large consortium much higher) - we now have raw native records in folio

lets assume an import of marc records as this will be a large chunk of many tenants' data
2. the material type table will contain a code (either the UUID as suggested or a descriptive code 'book' assigned by a library admin) to be mapped when specific values appear in marc / dc / other records, etc...)

example: book |

{'marcfield':'a', 'marcvalue':'b'}

| etc... <--- when value b appears in field a in the marc then this is a book code (by the way this isnt far from what it looks like)

3. transforming the marc records into folio items will include the step that maps the material type to each folio item

4. at this point we have a folio record which has either 'book' in the MT or 1234567890

so at this point there are a few advantages to having book

1. as mentioned, there is no need for any lookups , i18n can be applied immediately
2. when looking at the raw record - which happens a ton - it is clear what mt the item is
3. i18n editing by library admin - an interface for editing labels is of course offered by all competing products and will be provided at one point as well - so we can either display
"Material Types"
code | lang | Label
book | en | Book
audio | en | Audio
etc...
or
"Material Types"
123456789 | en | Book
232123333 | en | Audio
etc...

i think for an admin needing to manage this - mapping UUIDs to values will be something that will be just way too complex - also looking at these files will be very hard to understand.
what we will need to do in such a case is add a mandatory description to each field which will contain something along the lines of the code discussed otherwise it will be impossible to allow anyone to translate as they wont know what those numbers mean. so this info will be entered by a library admin and it will be very important - without it will be impossible to edit i18n labels - so having this as mandatory constaints in the DB to be not null and unique etc... seems to make sense as well

Comment by shale99 [ 04/Apr/17 ]

I also believe that folio will come out of the box with probably 10-15 material types out of the box which will probably cover the bulk of all items (as part of the tenant provisioning process) - and each tenant will probably only need to add some more unique MTs that are needed by their institution. So i think most of these codes wont be created by end users, i think most will be packaged into folio and library admins will add / customize that. This will also allow us to package with folio a basic i18n in the decided upon languages, which contains default translations for those codes out of the box.

Comment by Jakub Skoczen [ 04/Apr/17 ]

shale99 You are throwing a lot of things into one basket, let me comment on some:

  • batch loading: we need a cross-walk from the input format "material type" designator (e.g various fields in MARC) to FOLIO mattype specific to a given tenant. Whether we use UUIDs or "codes" doesn't matter for how complicated this process is. In both cases the mattypes and the cross-walk need to be defined first.
  • i18n: maybe I am missing something but I don't see how mName = getLabeli18n("de", "materialTypes", "123456") is different from mName = getLabeli18n("de", "materialTypes", "book").?
Comment by shale99 [ 04/Apr/17 ]

first a general comment:
i believe this isnt only an MT issue, this will exist for many other properties in the item,
are we going to go with an item with many IDs (locations id, MT id, etc...), where you really must run multiple joins to get the needed info (remember, our UI isnt the only consumer here) - an api user getting a record with ids will find it useless in many cases without material types to display / without a location to indicate where the book can be found / etc... , so these joins will run N times to retrieve the needed info for a single record - when discussing other data types (which have for example 1000's of entries this may be ok) - but when your bread and butter records (which will be in the millions for most average libraries) has to run multiple joins to populate itself, i dont know...

i18n - if i tell you - what does this mean

getLabeli18n("de", "materialTypes", "123456")

and then ask you - what does this mean:

getLabeli18n("de", "materialTypes", "book")

without looking at anything else - which one can you give me an answer to and which one do you have to start looking into code / db to find out?

A UI interface to i18n - will be a UI screen where you assign labels in different languages to codes. this will exist in folio at one point we should accept this as a given. the days where a librarian opens a file and starts editing, well, we dont want to go back there. so how would you create this interface if all you have are numbers to map to?

Comment by Niels Erik Nielsen [ 04/Apr/17 ]

If there's a unique constraint on the material type label, you can have UUID for primary key and still implement the desired interface - using descriptive identifiers to retrieve one and only one row – if you want/need.

Comment by shale99 [ 04/Apr/17 ]

Hi NE, do you mean to maintain a uuid and a unique (index) descriptive identifier - instead of just the descriptive identifier? just want to make sure i understood?

Comment by Niels Erik Nielsen [ 04/Apr/17 ]

Yes.

Comment by shale99 [ 04/Apr/17 ]

right, I agree, there should be a primary key with a uuid and that can be used when needed. and i agree there should be a unique descriptive code as well. my view is that the descriptive code should be put in the item and not the uuid

Comment by shale99 [ 05/Apr/17 ]

guys, whichever way we chose , lets decide - i have a big change that i am either scrapping or checking in which changed the material type api to work with the descriptive codes instead of the uuid

Comment by Niels Erik Nielsen [ 05/Apr/17 ]

Right. If I understand you correctly, that means using the descriptive code as foreign key in Item.

I get the convenience of doing so, but this is where you lock the label in. It can no longer be modified just like that and then we have people discussing the options:

a) Yes, the label/descriptive code can be modified, but only if it's not referenced by any other modules yet. We need to know which modules might reference it and inquire if they have any actual references to it and reject a modification if any of them do

b) Sure, the label/descriptive code can be modified; but we need to know which modules reference it and ask that they all update their references to the new label as we modify it in the base table.

c) No, we cannot know which modules might reference a given label and we cannot safely ask them all to update their references in one big transaction. Modifying labels after the fact is not allowed.

While c) is certainly the simplest and safest option, it also provides for a unforgiving, "gotcha" user experience, where any typo or inconsistency between users in applying labels gets frozen in time.

If we use that approach for foreign keys, we also make it legitimate for every module writer to do the same and I'm pretty sure we'll regret that.

Comment by shale99 [ 05/Apr/17 ]

i dont believe that indicates a problem with the inventory module itself putting the code into the items as the inventory module controls its own data, so i dont believe there is a cross module issue here, correct?
as for the point mentioned, i agree with what you are saying, i think however, in the current context of material types and their lifecycle and usage (from a business logic perspective) - i dont really see this as a problem. Maybe if we consult with some librarians they will indicate that what i am saying is incorrect - and i may be wrong, but the lifecycle i see is as follows:
1. folio comes with 10-15 codes out of the box
2. librarian works in a sandbox env and loads data , sets up the system, etc...
3. librarian adds a few missing codes that are relevant to their specific content
4. librarian checks this in the sandbox
5. librarian is happy
6. deployed to their production environment

if this is the workflow - i know that this is done this way in other systems, then it seems that these codes will really be immutable - if this isnt the case then your point for changing the api hold water - but they dont necessarily indicate that placing the code in the items is problematic.

Comment by shale99 [ 05/Apr/17 ]

moreover, the cross module issue will actually always be an issue - lets take the DELETE scenario. if i want to delete a material type - can i do it? i can never really know what other modules are using - will i break a module if i do? isnt this a micro service issue in general?

Comment by Jakub Skoczen [ 05/Apr/17 ]

Guys, I am still catching up with the previous comments but for the sake of expediency, if the decision is needed now let's go with UUIDs. So any kind of "controlled vocabulary" endpoint is similar and would look like:

GET|POST /items/:uuid
GET|PUT|DELETE /items/:uuid

In the "parent" record (like User) we would include raw UUIDs of the associated record like so:

{
  "username": "shale99",
  "patron_groups": [""123-456-789-123", "345-678-910-111"]
}

Searching and filtering would be done using UUIDs and the UI would need to lookup the Item objects for labels etc.

Now, the open question is whether we include a seperate endpoint for controlling associations, directly on the vocabulary item like so

POST|GET /items/:uuid/users/
DELETE /items/:uuid/users/:uuid

It probably makes sense. In some case we may even skip the step of "joining" items to the parent object, especially if we don't need searching/filtering functionality on the item.

Comment by shale99 [ 05/Apr/17 ]

if we want this for material types as well, please let me know and i will make the change

Comment by Jakub Skoczen [ 05/Apr/17 ]

Yes, I am aware of that and I am proposing we align mattype with this. I know that you are coming from the perspective where mat types are shared among different tenants/folio installation and then codes might make sense but so far there's no information to indicate that this would actually happen. So they are just pain in the ass.

Comment by shale99 [ 05/Apr/17 ]

that isnt at all where i am coming from - i am coming from understanding the biz module and usage of this in a real system - but i will make the change

Comment by Mike Taylor [ 05/Apr/17 ]

Please hold off on a decision for another hour. I need to read what's been said before I comment.

Comment by Jakub Skoczen [ 05/Apr/17 ]

Mike Taylor Ok, but I would like to avoid prolonging discussion about this much more. I've taken both shale's and NE's arguments into account and UUIDs are the most flexible option.

Comment by Mike Taylor [ 05/Apr/17 ]

Well, there are a couple of things going on here.

We all agree that for each of our "controlled vocab" record types – patronGroups, materialTypes, permissionSets, etc. – we need a permanent, unchanging unique ID; and that we also need a displayName. I think we all also agree that
1. We don't intend to show the IDs to end-users, but
2. It's inevitable that they will see them anyway – in raw records, in CQL queries, in URLs, etc.

So there is a real argument for having IDs that are somewhat readable ("book", "dvd" rather than "0bfb4036-69f6-4b98-b087-7a8776a1218e", "4cc1d7ea-e5f1-471c-80c5-fa7a552ea31b"). I would be sad if the CQL queries that the UI sends to Okapi changed from

materialType="Book"

to

materialType="0bfb4036-69f6-4b98-b087-7a8776a1218e"

(I also lament that individual user views in the Items app have URLs like

http://localhost:3000/users/view/B076F7CB-9043-466A-853D-8F463B0E8C31/jacklyn

rather than

http://localhost:3000/users/view/jacklyn

but let that pass for now.)

So can we use nicer unique-IDs like "book" and "dvd"? The arguments for not doing this seem to be:
1. Once they are assigned, they can't be changed.
2. They won't be perfectly clear to end-users, especially in light of i18n concerns.
3. In some circumstances, they have to be generated.

Did I miss any? If not, then these all seem pretty weak to me:
1. Of course, exactly the same is true of UUIDs. It may be not ideal to be stuck with an id of "book" when we would have preferred "books", but it's better than being stuck with an id of "0bfb4036-69f6-4b98-b087-7a8776a1218e" when we would have preferred "books".
2. Again, clarity to end-users is not perfect when you see "book" and, because you are German you wanted to see "buch"; but it's better than seeing a UUID. It gives you a chance. (As Ze'ev rightly points out, it's unavoidable that in some cases, as in setting up mapping tables, end-users are going to have to deal with the ID.)
3. Generating unique human-readable IDs can be irritating; but it's hardly an insoluble problem. One could easily imagine, for example, generating a material-type unique-ID by downcasing the displayName, camelcasing the result, and appending an integer if that's not already enough to make it unique.

Putting it all together, I am not convinced the arguments for the nuclear option of UUIDs are strong enough to merit the nastiness they entail. I think we can be more ambitious.

Comment by Matt Connolly [ 05/Apr/17 ]

Mike has expressed much more eloquently what I was contemplating myself. I don't see a problem with auto-generating unique identifiers that are at least close to the initial desired display label.

Comment by Jakub Skoczen [ 05/Apr/17 ]

Mike Taylor I think auto-generating IDs that are close to the original labels gives you a false sense that the values are something more than opaque identifiers. When you do this, they will eventually become an integral part of data model (and interfaces) as people will consider them to be representative of the objects they identify.

And then we face all the issues NE mentioned – what it the data they represent change? Do we re-create the identifiers and scan the entire system for references to update them? Do we leave them untouched and risk that people relying on their apparent meaning keep on using them unaware that the real, underlying meaning has changed?

No, I don't think this is an acceptable solution. I think it's much more honest to say – we are using UUIDs precisely because we want to convey that the ID values have no meaning. The meaning is within objects, and the meaning might change. It's the duty of whoever consumes the data to keep the meaning in sync.

Now, to specifically address the query/display problem: as I said in the description for pure convenience we can add the ability to query/display using the values behind the ID by performing a join on ther server during retrieval, especially if the data resides within the same module. In the PG example specifically the returned object would include e.g:

 { 
   ...
  "patron_groups": [{"id": "123-456-789", "text":"Students" }]
}

and CQL may include a special index or modifier:

patron_groups/text = "Students"

But the "maintenance" of the values (edits, associations with other object) should be done using opaque identifiers. And that's the crucial piece.

Comment by shale99 [ 05/Apr/17 ]

guys,
the material types do not change. and i dont necessasrily think they should be generated. they need to come out of the box and a librarian needs to add a few as needed. no librarian will add a book code and then after loading in the entire system, say oops, it doesnt happen. these codes are carfully added by librarians there are only a handful, and they dont change.

from a technical standpoint - remember that this will require a join, unless we say that api and ui users alike need to make multiple calls to retrieve all item info.

but if we are going with ids it is not a problem i will tweak the implementation, it is already mostly id based.

Comment by Mike Taylor [ 05/Apr/17 ]

I think auto-generating IDs that are close to the original labels gives you a false sense that the values are something more than opaque identifiers.

I agree 100% that this is a real danger. The solution is in the developer culture and documentation. We absolutely must take steps to prevent people from interpreting them wrongly.

And then we face all the issues NE mentioned – what it the data they represent change? Do we re-create the identifiers and scan the entire system for references to update them? Do we leave them untouched and risk that people relying on their apparent meaning keep on using them unaware that the real, underlying meaning has changed?

The second one, for sure. The first is horribly error-prone.

No, I don't think this is an acceptable solution. I think it's much more honest to say – we are using UUIDs precisely because we want to convey that the ID values have no meaning.

I am sympathetic to that (though even then I would prefer small integers. Librarians remember that the 245 field means title-statement; they would never remember that 0bfb4036-69f6-4b98-b087-7a8776a1218e means title-statement.)

Still, we need to give some thought to what the queries, the URLs and the internal objects look like. They don't matter to the UI, but they matter enormously to the developer experience – and for several years yet, that is going to be much more important than the user experience.

Comment by Jakub Skoczen [ 05/Apr/17 ]

Ok, I think I presented above how can we increase developer experience when it comes to querying/displaying data. For now I would like shale99 to go ahead and perform changes to no longer rely on hand-code "codes" and but UUIDs. Thank you.

Comment by Marc Johnson [ 05/Apr/17 ]

shale99 Jakub Skoczen

Will the change to make material types behave this way be a separate issue, or associated with this?

Comment by Jakub Skoczen [ 05/Apr/17 ]

I think a seperate one and one that is aligned with any required UI change that potentially needs to be performed by Matt.

Comment by Niels Erik Nielsen [ 05/Apr/17 ]

It seems to be a concern that UUID's are visible in URL's, back-end requests and log files.

If we find this all that important I'd expect that we can avoid it a lot of the time when we have alternate keys, like the material type label.

This can be demonstrated with the UI for Users, which also has an alternate key, besides the UUID, the "username".

We currently have the patrons UUID in the URL:

With these changes to Users and UserView :

we can have this URL instead:

(don't mind the double name in the URL, it's due a previous identifier confusion requiring us to have both ID and name on the route; something that can obviously be just one unique identifier)

With a bit careful design I'm pretty sure we can use alternate keys (if available) in many instances, instead of the UUID, if visible UUIDs is a major concern.

Comment by Jeremy Huff [ 05/Apr/17 ]

Also I think that users are used to ugly URLs:

https://mail.google.com/mail/n/1/#inbox/15b392703b96a4bd

Comment by Mike Taylor [ 05/Apr/17 ]

Niels Erik Nielsen, we should definitely make that change in Users – it's bothered me for a long time that our URLs include both the ugly-ass ID and the human-readable one.

(By the way, usernames are a nice example of an ID that is created by humans to be human-readable, but which is nevertheless unique and usable as a key.)

Comment by Niels Erik Nielsen [ 06/Apr/17 ]

Right. As long as we realize that the UUID is still the more robust choice for linking.

When we create links on data items subject to change, like the username is, then those links would obviously break if a modification happened mid-air. Bookmarks would break too of course.

Comment by shale99 [ 06/Apr/17 ]

guy, i am moving forward with the uuid implementation
my last last two cents...
i think this is really a case by case basis. for example, can a user name ever be changed? i know passwords change , but a user name?
material type is the same, /material-type/books will always take you to books - that will never change. in those situations, where there is no change - i am not sure a uuid is more robust. api consumers will either require joins to get an all inclusive record or need to make multiple calls to the backend to get the needed info

Comment by Niels Erik Nielsen [ 06/Apr/17 ]

Change of username is what's currently implemented at least, according to LIBAPP-12 Closed and LIBAPP-114 Closed .

Which made me wonder ... so I went to the public demo and changed the username of roma to roma2.

If you look up roma2 on folio-uidemo.aws.indexdata.com now you should get:

Click the alert away, you should see that Pamela Price still has her loans, but apparently no permissions anymore.

This is what using UUID's everywhere is supposed to safeguard against.

Comment by shale99 [ 06/Apr/17 ]

if a user name is update-able , i don't think it is a candidate for the use case we are discussing.
if i have a username and it is immutable (for example , go to an internet site and create an account, your username is set in stone) - then we can use it, otherwise no,
if we go back to the material type comment i made - /material-type/books will always lead you to books, agree? /material-type/journal will always lead you to journals. that will never change

Comment by Mike Taylor [ 06/Apr/17 ]

I certainly agree that fields which can change are not suitable to be used as IDs. (This should hardly need saying.)

But it does seem bizarre to me that we want to support changing username!

Comment by Niels Erik Nielsen [ 06/Apr/17 ]

Well, we want to be able to merge two libraries' user bases, or a library might want to use email address for username.

It's a feature anyway, and you run a risk by making such assumptions, constraining yourself upfront with poor data design (saying "poor" while always still acknowledging the challenges of doing proper data design in a micro-services context).

Comment by Mike Taylor [ 06/Apr/17 ]

Merging two libraries sets of users is a horrifying scenario. But, yes, an important one.

I don't know what the right answer is when we merge MIT and Yale, and both of them have a user called mike. Ze'ev, do you have any experience of this?

At any rate, one thing is clear: if we do go with somewhat-human-readable identifiers, we need to be really clear about which such fields are identifiers, and thus immutable; and which are not (and so can be changed).

... All of which I admit is an argument for using UUIDs, which no-one would ever mistake for readable.

Comment by shale99 [ 06/Apr/17 ]

i havent heard of a scenario where two tenants are merged into one - there is always a tenant namespace (in our case db schema) , unless i missed the merging scenario

Comment by Niels Erik Nielsen [ 06/Apr/17 ]

If modules only ever referenced mike by UUID, then the respective diku_admins could go and rename their patrons mikea and mikeb in mod-users/mod-auth and that would be it.

If any modules referenced those users by "mike" (like permissions does?), then they would have to go find all those references and update them at the same time as the base user table is updated.

Comment by Mike Taylor [ 06/Apr/17 ]

I would go further: if username is used as a foreign key anywhere in any schema, then it has to be immutable. (And except in the merge-two-libraries scenario, that seems to me like an eminently reasonable constraint.)

Generated at Thu Feb 08 23:06:31 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.