Investigate a central compound object index

Description

As the front-end currently needs a way to search for users based on related objects (e.g., get a list of all the users with permissions X, Y and Z), it makes sense to evaluate the benefits of maintaining an index of composite objects to facilitate this.

For example, to execute the request above, we first have to query the permissions module for a list of all permission-association objects that contain permissions X, Y and Z. From this list we then compile the set of usernames those objects reference, and query the user module for all user records matching those usernames. Finally, depending on the return format requested by the client, we may have to join the user objects with the permissions objects to return the compound result.
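The multi-step flow above can be sketched as follows. This is a minimal illustration, not the actual module APIs: the record shapes (`username`, `permissions` keys) are assumptions, and the real system would make HTTP calls to the permissions and user modules rather than filter in-memory lists.

```python
def find_users_with_permissions(permission_users, users, wanted):
    # permission_users: permission-association objects from the permissions module
    # users: user records from the user module
    wanted = set(wanted)
    # 1. find association objects containing all of the wanted permissions
    matching = [p for p in permission_users if wanted <= set(p["permissions"])]
    # 2. compile the usernames those objects reference
    perms_by_name = {p["username"]: p["permissions"] for p in matching}
    # 3./4. pick the matching user records and join in their permissions
    return [dict(u, permissions=perms_by_name[u["username"]])
            for u in users if u["username"] in perms_by_name]
```

Even in this toy form, the cost is visible: every step scans a full result set from one module before the next module can be queried, which is what degrades as record counts grow.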

The logic gets more complicated when we add features like sorting the result set on a field from a module outside the users module, or pagination. The "manual searching" approach also suffers degraded performance as the number of records to be searched grows.

A possible solution would be to use a system built for indexing large amounts of data (e.g. Solr) and insert composite records into that index.

For the schema, we could adopt a dot notation to indicate the object and subfield. For example, to search by user id you'd use the field user.id; for a permission name, something like permissionsUser.permission_name.
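Producing those dot-notation field names from a nested composite object is a simple recursive flattening. A sketch (the composite object shape here is illustrative):

```python
def flatten(obj, prefix=""):
    """Flatten a nested composite object into dot-notation index fields."""
    fields = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields.update(flatten(value, name))  # recurse into sub-objects
        else:
            fields[name] = value
    return fields
```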

The good news is that this makes record retrieval very easy: it's a straightforward query, and translating between CQL and Solr queries is not hard. We'd be free to introduce as much complex logic as we wanted, as well as choose fields to sort on, and so forth.

The downside, of course, is building and maintaining the index. How can this be done in a reliable fashion with the least amount of burden on the maintainers of the individual modules that contribute to the composite record?

For example, we could maintain a message queue and allow modules to push messages to it whenever a record is created, updated or deleted. These messages would then be consumed in order and used to update the composite record index.
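A sketch of the consumer side of that queue, with the index modeled as an in-memory dict. The message shape (`op`, `id`, `fields` keys) is an assumption; the key point is that each module's message carries only the (dot-notation) fields that module owns, merged into the composite record:

```python
def apply_message(index, msg):
    # index: composite-record id -> dict of dot-notation fields
    # msg: {"op": "CREATE"|"UPDATE"|"DELETE", "id": ..., "fields": {...}}
    op, rec_id = msg["op"], msg["id"]
    if op == "DELETE":
        index.pop(rec_id, None)
    else:
        # CREATE/UPDATE: merge only the fields the sending module owns
        index.setdefault(rec_id, {}).update(msg["fields"])

def consume(index, queue):
    for msg in queue:  # messages must be consumed in order
        apply_message(index, msg)
```

Ordering matters here: if a DELETE is consumed before the UPDATE it supersedes, the index resurrects a dead record, which is why the description insists the messages be consumed in order.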

It is worth noting that Solr supports partial document updates, which seems particularly relevant here, since no one module would be sending enough information to update an entire document. https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
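Solr's atomic updates work by wrapping each field value in a modifier object such as {"set": ...}, so a module can overwrite just the fields it contributes. A sketch of building such a payload (the field names are illustrative):

```python
import json

def atomic_update_payload(doc_id, fields):
    # Solr atomic update: wrap each value in {"set": ...} so only the
    # fields this module contributes are overwritten; the rest of the
    # indexed document is left intact.
    doc = {"id": doc_id}
    doc.update({name: {"set": value} for name, value in fields.items()})
    return json.dumps([doc])
```

The resulting JSON would be POSTed to the collection's update handler.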

Another option would be to implement this with no changes to the storage modules, and instead monitor for changes externally and write them back to the index. One way to do this might be a low-level database trigger, though I worry that this might violate the KISS principle. Another possibility is a filter-level Okapi module that would listen for relevant request types to the various modules (e.g. PUT, POST, DELETE) and push a message onto a queue, so that some process could query the module for changes and write them back to the index.

Since this potentially overlaps several different issues, I'd like to determine fairly quickly whether this is a road worth going down, or whether we should implement the "manual searching" solution, at least for the short term.

Environment

None

Potential Workaround

None

Attachments

3

Checklist


TestRail: Results

Activity


shale99 July 24, 2017 at 10:00 AM

a few more thoughts (hopefully the last until we discuss)

1. if we do not go with a dedicated index (Solr, ES) we would probably need a dedicated mod-search
a. the reason for this, I think, is that storage modules will not be composite-object aware, so they won't have a concept of returning data that belongs to other modules
b. and the biz logic module does not have DB access
c. so there needs to be a module that can return composite objects and does have DB access

biz modules would query it by indicating the composite object and a CQL query

shale99 July 17, 2017 at 12:09 PM

last zig zag, I hope (it's a tough subject)

The above will work well for retrieving composite records, sorting, etc.
It will, however, be limited in the search functionality it can provide.
For example, we can find solutions for case-insensitive searching as well as stop-word removal, as these can be handled within the query. It starts getting more complicated with things like stemming support and the like. There is some of this we can do in the query, to an extent, but with limitations. For example, if someone queries "all the kings men", we can run a stemmer on a per-language basis and fire off 4 queries (we'd need to understand the performance implications):
all the king man
all the king men
all the kings man
all the kings men (the original query)
but ranking is not something we can really do, so we cannot order the results correctly.
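The variant expansion described above is a cross-product of each token's original and stemmed forms. A sketch, using a toy lookup-table stemmer for the example (a real implementation would use a per-language stemmer such as Snowball):

```python
from itertools import product

def expand_query(tokens, stem):
    # for each token, offer its stemmed form alongside the original;
    # tokens the stemmer leaves unchanged contribute no extra variants
    options = [[stem(t), t] if stem(t) != t else [t] for t in tokens]
    return sorted(" ".join(combo) for combo in product(*options))

# toy stemmer for the example only (a suffix stemmer cannot do men->man)
toy_stems = {"kings": "king", "men": "man"}
variants = expand_query("all the kings men".split(), lambda t: toy_stems.get(t, t))
```

Note the variant count is exponential in the number of stemmable tokens, which is part of the performance concern mentioned above.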

To summarize:

the simplest solution, which would allow case-insensitive search and stop-word removal (good enough for matching partial / full fields) with sorting, etc., can be done via the DB search solution suggested.

if we need more advanced search features and ranking, then we need an alternative:
either Solr / Elasticsearch, or something like Postgres full-text search (I do not know enough about it to give an opinion)

shale99 July 17, 2017 at 9:44 AM

a bit more detail (comments welcome, as I may just be going off on a tangent)

a storage module exposes an interface for its entities (i.e. a read-only view it creates, which does not have to have the same content as the actual data tables)
for example:
Permissions:

Users:

for the simplicity of the example, let's say mod-users-bl got a query param include=perms, meaning it wants composite user objects with permissions added.

1. mod-users-bl makes a request to mod-search (this could even be mod-users [as it is the core module] supplying the core entity, without the need for another module)
The request contains the request for users + perms + CQL
2. mod-users calls the perms API to get the data interface (above); this API can be handled by RMB and can look for a JSON file created by the developer for the particular module's version
3. mod-users then joins its interface with the perms interface using the reference keys (the keys are based on the view) and appends the CQL query translated to an SQL clause
4. mod-users returns the correct composite object / objects

thoughts???

shale99 July 16, 2017 at 11:00 AM
Edited

SUMMARY:

I think we should take a step back from the central index for the following reasons:

if we have a requirement that a tenant's data should be in a single DB (by a single DB I mean this can be a DB cluster of multiple Postgres servers which holds data for, let's say, 100 tenants)

then:
since we are using the schema-per-tenant concept, we can actually join across schemas by just including the schema as a namespace (prefix)
we still want to abstract the master data tables within the schemas (modules) from this joining / querying (searching) functionality, so each module can:

create a read-only view of the tables it would like to expose. This read-only view can serve as an interface to the module's data. Note that a view in this case is really just a prepared statement (a query), so when there are upgrades the module can decide whether the query should be updated. If we take mod-users for example, this view could be a join between the users and groups tables, creating the read-only data interface exposed by the module.
We can then have a dedicated search module (preferable, since this can be a storage module with access to the DB), or the biz modules themselves (less clean, since they would need to send the query to a storage module), discover the views exposed by the storage modules for the data the biz module needs; a cross-view join can then be executed to retrieve / sort / etc. the results
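The view-as-interface idea can be sketched with in-memory SQLite (the real system would be Postgres with one schema per module; here a table-name prefix stands in for the schema prefix, and the table/column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mod_users_users(id TEXT, username TEXT, group_id TEXT);
    CREATE TABLE mod_perms_permission_users(username TEXT, permission_name TEXT);

    -- each module exposes a read-only view as its data interface;
    -- the view need not mirror the underlying table layout
    CREATE VIEW mod_users_interface AS
        SELECT id AS "user.id", username AS "user.username"
        FROM mod_users_users;
    CREATE VIEW mod_perms_interface AS
        SELECT username AS "user.username",
               permission_name AS "permissionsUser.permission_name"
        FROM mod_perms_permission_users;
""")
conn.execute("INSERT INTO mod_users_users VALUES ('u1', 'alice', 'g1')")
conn.execute("INSERT INTO mod_perms_permission_users VALUES ('alice', 'X')")

# the search module joins the two interfaces on the reference key and
# appends the filter clause (translated from CQL in the real system)
rows = conn.execute("""
    SELECT u."user.id", p."permissionsUser.permission_name"
    FROM mod_users_interface u
    JOIN mod_perms_interface p ON p."user.username" = u."user.username"
    WHERE p."permissionsUser.permission_name" = 'X'
""").fetchall()
```

Because the views are just stored queries, a module upgrade can change its table layout and only has to keep the view's output columns stable for consumers.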

what is the upside:
1. no data replication
2. no sync issues between persisted data and indexed data
3. fewer modules to manage

shale99 July 10, 2017 at 8:07 AM
Edited

Another issue is handling deletes.
Deletes remove records from the table, which means querying the table won't return those records, so we won't know to delete them from the index. So we either need to mark deleted records as deleted (and not physically delete them immediately), or have a trigger pushing all inserts / updates / deletes into a queue table (with a column indicating the operation). Solr would then need to issue a query similar to:

this will pull the record from the queue table (without locking other records, so that this can happen concurrently if needed) and will delete on commit; need to see if this is supported in the data import handler (sending a query and then a commit), or if a custom handler is needed
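A sketch of draining such a trigger-fed queue table (the table and column names are hypothetical). In Postgres the SELECT would add FOR UPDATE SKIP LOCKED so concurrent consumers skip, rather than block on, rows another consumer is already handling; SQLite is used here only so the sketch runs:

```python
import sqlite3

def drain_queue(conn, batch_size=10):
    # Postgres variant would be:
    #   SELECT ... FROM index_queue ORDER BY id
    #   LIMIT %s FOR UPDATE SKIP LOCKED
    batch = conn.execute(
        "SELECT rowid, record_id, op FROM index_queue ORDER BY rowid LIMIT ?",
        (batch_size,)).fetchall()
    for rowid, record_id, op in batch:
        # ... push the insert / update / delete to the index here ...
        conn.execute("DELETE FROM index_queue WHERE rowid = ?", (rowid,))
    conn.commit()  # queue rows disappear only once the batch is indexed
    return [(record_id, op) for _, record_id, op in batch]
```

Deleting only on commit gives the "delete on commit" behavior described above: if the indexing step fails, the rollback leaves the queue rows in place to be retried.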

Details

Assignee

Reporter

Priority

Development Team

Core: Platform

TestRail: Cases


TestRail: Runs

Created June 29, 2017 at 8:58 PM
Updated February 14, 2022 at 10:22 AM