[FOLIO-704] Investigate a central compound object index Created: 29/Jun/17 Updated: 14/Feb/22 |
|
| Status: | Draft |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Task | Priority: | P3 |
| Reporter: | Kurt Nordstrom | Assignee: | shale99 |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | core, for-next-sprint, sprint17, sprint18, sprint19 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | 1 hour | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
| Issue links: |
|
| Sprint: | CP: Roadmap backlog |
| Development Team: | Core: Platform |
| Description |
|
As the front end currently needs a way to search for users based on related objects (e.g., get a list of all users who hold permissions X, Y, and Z), it makes sense to evaluate the benefits of maintaining an index of composite objects to facilitate this.

To execute the request above today, we have to query the permissions module first to get a list of all permission-association objects that contain permissions X, Y, and Z. Then, from this list of objects, we have to compile the list of usernames those objects hold. Then we have to query the users module to get all user records that match our list of usernames. Then, depending on the return format requested by the client, we may have to join the user objects together with the permissions objects to return the compound result. The logic gets more complicated once we add features like sorting the result set on a field owned by a module other than the users module, or pagination. This "manual searching" approach also degrades in performance as the number of records that must be searched grows.

A possible solution is to use something built for indexing large amounts of data (e.g. Solr) and insert composite records into that index. For the schema, we could adopt a dot notation to indicate the object and subfield. For example, to search by user id, you would use the field "user.id"; for a permission name, something like "permissionsUser.permission_name". The good news is that this makes record retrieval very easy: it is a straightforward query, and translating between CQL and Solr queries is not hard. We would be free to introduce as much complex logic as we wanted, and to choose fields to sort on and suchlike. The downside, of course, is building and maintaining the index.
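As a sketch of what a composite record might look like under the dot-notation scheme described above (all field names here are illustrative assumptions, not a settled schema):

```json
{
  "id": "composite-123",
  "user.id": "9a1b-uuid",
  "user.username": "jdoe",
  "permissionsUser.permission_name": ["perm.x", "perm.y", "perm.z"],
  "group.name": "staff"
}
```

A query for all users holding permissions X, Y, and Z then becomes a single filter on `permissionsUser.permission_name`, with sorting and pagination handled entirely by the index.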
How can this be done reliably, with the least burden on the maintainers of the individual modules that contribute to the composite record? One option is to maintain a message queue and allow modules to push messages to it whenever a record is created, updated, or deleted; these messages would then be consumed in order and used to update the composite-record index. It is worth noting that Solr supports partial document updates, which seems particularly relevant here, since no single module would be sending enough information to update an entire document: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

Another option is to implement this with no changes to the storage modules at all, and instead monitor externally for changes and write them back to the index. One way to do this might be some kind of low-level database trigger, though I worry that this might violate the KISS principle. Another possibility is a filter-level Okapi module that listens for mutating request types to the various modules (PUT, POST, DELETE) and puts a message onto a queue, after which some process queries the module for changes and writes them back to the index.

Since this potentially overlaps several different issues, I'd like to determine fairly quickly whether this is a road worth going down, or whether we should implement the "manual searching" solution for the short term, at least. |
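For reference, a Solr partial (atomic) update pushed by a single module could look roughly like the following, reusing the illustrative dot-notation field names from the description; each module would only `set` the fields it owns:

```json
[
  {
    "id": "composite-123",
    "permissionsUser.permission_name": { "set": ["perm.x", "perm.y", "perm.z"] }
  }
]
```

This would be POSTed to the core's /update handler; fields not mentioned are left untouched. One constraint worth weighing: Solr atomic updates require all fields to be stored (or have docValues) so the document can be reconstructed during the update.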
| Comments |
| Comment by shale99 [ 30/Jun/17 ] |
|
I think this is a good summary. I am 100% in favor of the solution using a messaging queue to push partial records into Solr. I do not think this should be at the Okapi level; it should be kept at the storage / application layer. I see this as the kind of missing link of a distributed architecture that must combine data as part of the querying process, and I don't see another way to do this efficiently. The partial record update may need some tweaking, but it still allows us to keep the modules independent while allowing business modules to query across multiple modules efficiently. |
| Comment by Kurt Nordstrom [ 30/Jun/17 ] |
|
I was hoping to avoid adding any complexity to the existing storage modules, but that may be an unrealistic goal. Of course, if there were a standard way to do this "update queue push" operation, it would simplify things. Would we see both the Solr |
| Comment by shale99 [ 30/Jun/17 ] |
|
I think we can template what's needed by a storage module so it's simple enough to integrate. There are a bunch of details to cover here, but the first stage is to reach a decision on whether we are going with this type of solution. |
| Comment by Kurt Nordstrom [ 30/Jun/17 ] |
|
Agreed. Jakub Skoczen, do you have thoughts on adding this component to the system? Would it be worth prototyping? |
| Comment by shale99 [ 03/Jul/17 ] |
|
Two issues I thought about:
1. What fields do we store in the index? Remember, keeping the index small is key. Do we store any fields in the index at all, or just index the fields for searching, then return the id of the matched composite record to the business-logic module and have it request the records from the storage modules for display (based on what needs to be displayed)? I lean towards the latter, with maybe a handful of must-have fields stored in the index itself.
2. Let's take the user composite record, composed of: a user record + perms + creds + groups. |
| Comment by shale99 [ 05/Jul/17 ] |
|
Uploaded a screenshot. I think if we start down this path we can begin with a trigger in Postgres that calls pg_notify. This is a wrapper for the NOTIFY / LISTEN mechanism in Postgres, which allows external processes to receive notifications from Postgres on things like data inserts / updates / deletes. There will be a need for an async driver that supports this (there are some), and then all changes are streamed asynchronously to the client (which can be Solr, or something light above Solr running in the same web server; don't know yet). The notify mechanism has channels, so each tenant can have its own (a mod-users.nyu channel, for example). Solr has near-real-time searching, so indexed data becomes searchable within a few seconds. This means the data has not yet been persisted to disk, so these entries may need to be replayed on an index crash. Open issues: |
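A minimal sketch of such a trigger, assuming a hypothetical users table and a per-tenant channel name; the table, channel, and payload shape are illustrative, not a settled design:

```sql
-- Fire a notification with the changed row's id and the operation type.
-- The channel name encodes module and tenant (e.g. 'mod_users_nyu').
CREATE OR REPLACE FUNCTION notify_record_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify(
    'mod_users_nyu',
    json_build_object('op', TG_OP, 'id', COALESCE(NEW.id, OLD.id))::text
  );
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_change_notify
AFTER INSERT OR UPDATE OR DELETE ON myuniversity_mymodule.users
FOR EACH ROW EXECUTE PROCEDURE notify_record_change();
```

An external process then issues LISTEN mod_users_nyu and forwards the payloads to the indexer. Note that NOTIFY payloads are limited to 8000 bytes by default, so sending ids rather than full documents is the safer pattern.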
| Comment by shale99 [ 06/Jul/17 ] |
|
Phase 1: simple implementation, more error prone. I have a simple Java app prototype that listens for changes to a table in Postgres (the NOTIFY / LISTEN mechanism). There is an async JDBC driver option, but that doesn't work on Windows, so basically this would be better served by a replayable / highly available / persistent message queue, where we can protect against crashes between soft and hard commits, etc. The one BIG problem with this: other options: |
| Comment by shale99 [ 09/Jul/17 ] |
|
Some additional info about the NOTIFY / LISTEN mechanism in Postgres:
- Inserting 10,000 new records with the same JSON content, with the notification data containing the JSON itself, creates a single notification (250 milliseconds of DB time).
- Inserting 10,000 new records with the same JSON content, with the notification data containing the JSON itself plus a dynamic id: performance spikes sharply at 100,000 inserts, as it seems Postgres performs O(n^2) duplicate checks over the queued notifications, or something else causes a major slowdown.
- Inserting 100,000 new records with the same JSON content, with the notification sending the JSON itself with a dynamic id: |
| Comment by shale99 [ 10/Jul/17 ] |
|
OK, so of course Solr has an answer to what we need; should have started there! In Solr there is a DataImportHandler. An example of how to use this for the mod-users dev: 1. Yes, this can also be done via HTTP and not only via direct database access. My current open issue: getting fancy: |
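A rough sketch of a DataImportHandler data-config for pulling user rows out of Postgres; the connection details, table, and field names are assumptions for illustration only:

```xml
<dataConfig>
  <!-- JDBC source; in practice credentials would come from configuration -->
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/folio"
              user="folio" password="changeme"/>
  <document>
    <entity name="user"
            query="SELECT id, jsonb-&gt;&gt;'username' AS username
                   FROM myuniversity_mymodule.users"
            deltaQuery="SELECT id FROM myuniversity_mymodule.users
                        WHERE last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, jsonb-&gt;&gt;'username' AS username
                              FROM myuniversity_mymodule.users
                              WHERE id = '${dataimporter.delta.id}'">
      <field column="id" name="id"/>
      <field column="username" name="user.username"/>
    </entity>
  </document>
</dataConfig>
```

Delta imports (deltaQuery/deltaImportQuery) give incremental indexing without the NOTIFY plumbing, at the cost of polling and of each table needing a last-modified column.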
| Comment by shale99 [ 10/Jul/17 ] |
|
Another issue is handling deletes.

    DELETE FROM queue
    WHERE userId = (
        SELECT userId FROM queue
        ORDER BY date
        LIMIT 1
        FOR UPDATE SKIP LOCKED  -- not necessarily needed
    )
    RETURNING *;

This will pull the record from the queue table (without locking other records, so this can happen concurrently if needed) and will delete on commit. Need to see if this is supported in the DataImportHandler (sending a query and then a commit), or if a custom handler is needed. |
| Comment by shale99 [ 16/Jul/17 ] |
|
SUMMARY: I think we should take a step back from the central index, for the following reasons. If we have a requirement that a tenant's data live in a single DB (by a single DB I mean this can be a DB cluster of multiple Postgres servers which holds the data for, let's say, 100 tenants), then: create a read-only view of the tables a module would like to expose. This read-only view can serve as an interface to the module's data. Note that a view in this case is really just a prepared statement (a query), so when there are upgrades the module can decide whether the query should be updated. If we take mod-users for example, this view can be a join query between the users and groups tables, forming the read-only data interface exposed by the module. What is the upside: |
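A sketch of such a read-only interface view, assuming hypothetical users and groups tables (the column names are illustrative):

```sql
-- The module exposes this view as its stable, read-only data interface.
-- The underlying tables can change shape across upgrades as long as the
-- view's columns are preserved.
CREATE VIEW myuniversity_mymodule.users_view AS
SELECT u.id,
       u.username,
       g.group_name
FROM myuniversity_mymodule.users u
LEFT JOIN myuniversity_mymodule.groups g ON g.id = u.group_id;
```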
| Comment by shale99 [ 17/Jul/17 ] |
|
A bit more detail (comments welcome, as I may just be going off on a tangent). A storage module exposes the interface for its entities (i.e., a read-only view it creates, which does not have to have the same content as the actual data tables). Perms:
{
  "entities": [
    {
      "entity": "perms",
      "view": "myuniversity_mymodule.perms",
      "references": [
        {"entity": "users", "key": "userName", "foreignKey": "userName"}
      ]
    }
  ]
}
Users:
{
  "entities": [
    {
      "entity": "users",
      "view": "myuniversity_mymodule.users",
      "references": [
        {"entity": "groups", "key": "groupId"}
      ]
    },
    {"entity": "groups", "view": "myuniversity_mymodule.groups"}
  ]
}
For the simplicity of the example, let's say mod-users-bl got the query param include=perms, so it wants composite user objects with permissions added. 1. mod-users-bl makes a request to mod-search (this can even be mod-users [as it is the core module] that supplies the core entity without the need for another module). Thoughts??? |
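Following the entity declarations above, the query that mod-search (or mod-users) would generate for include=perms could be a join over the two exposed views; this SQL is a sketch assuming the declared key/foreignKey pair:

```sql
-- Build the composite user record by joining the users view to the
-- perms view on the key declared in the perms entity (userName).
SELECT u.*, p.*
FROM myuniversity_mymodule.users u
JOIN myuniversity_mymodule.perms p
  ON p."userName" = u."userName";
```

The entity metadata is what lets the search module generate this join mechanically, without hard-coding knowledge of either module's schema.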
| Comment by shale99 [ 17/Jul/17 ] |
|
Last zigzag, I hope (it's a tough subject). The above will work well for retrieving composite records, sorting, etc. To summarize: the simplest solution that would allow case-insensitive search and stop-word removal (good enough for matching partial / full fields), with sorting etc., can be done via the DB search solution suggested. If we need more advanced search features and ranking, then we need an alternative. |
| Comment by shale99 [ 24/Jul/17 ] |
|
A few more thoughts (hopefully the last until we discuss): 1. If we do not go with a dedicated index (Solr, ES), we would probably need a dedicated mod-search. Business modules would query by indicating the composite object and a CQL |