[FOLIO-2573] Paused due to POC of Elastic Search. All Search Options Should be Supported by a Database Index Created: 24/Apr/20  Updated: 31/Jan/22  Resolved: 31/Jan/22

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P3
Reporter: Marc Johnson Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: PTF-Review, elastic-search, potential-decision
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
is blocked by UXPROD-2806 Create Elasticsearch indexes for Inve... Closed
Defines
defines UXPROD-3513 Inventory Elastic Search (Morning Glo... Closed
Gantt End to Start
has to be done before UIIN-790 Filter Instance records by Date updated Closed
Relates
relates to PERF-76 Run performance tests following addit... Open
relates to RMB-534 Reject CQL queries that match no inde... Open
relates to RMB-742 Create indexes on metadata.createdByU... Open
relates to CIRCSTORE-215 Missing Indexes Warnings Closed
relates to MODINVSTOR-496 cancel long running instance searches Closed
relates to MODINVSTOR-499 Add Index for Instances Full-text Sub... Closed
relates to MODINVSTOR-510 Inventory filtering by Item status is... Closed
relates to MODINVSTOR-515 Holdings source index for filtering Closed
relates to PERF-48 Inventory. Slow query when filtering ... Closed
relates to RMB-615 cancel queries that take longer than ... Closed
relates to MODINVSTOR-472 Search queries without database index Closed
relates to MODINVSTOR-500 Instance source index for = search Closed
relates to MODINVSTOR-530 Create a database supported search in... Closed
Sprint: Core: F - Sprint 87, Core: F - Sprint 88
Development Team: Prokopovych

 Description   

In order to (hopefully) pre-empt inadequate performance when searching inventory using the defined options in the reference UI, database indexes should be created for each CQL criteria that is generated by the reference UI

Considerations

  • Should improve the baseline performance of inventory search (to be verified during the end of quarter release process)
  • Back end work required for every new search option e.g. a new filter
  • Will need to be maintained every time the UI changes it's query generation
  • May reduce write performance (especially when upgrading the system or migrating from existing systems)
  • May not improve performance in all cases


 Comments   
Comment by Marc Johnson [ 24/Apr/20 ]

Cate Boerema Charlotte Whitt Jakub Skoczen Zak Burke Michal Kuklis

There have been a number of ad-hoc conversations about this, I want to confirm that we are all in agreement that this is what we are going to do going forward?

Bohdan Suprun Jeremy Huff

Please raise any concerns you might have

Comment by Charlotte Whitt [ 24/Apr/20 ]

Hi Marc Johnson - this looks really good

CC: lew235

Comment by Bohdan Suprun (Inactive) [ 24/Apr/20 ]

Hi Marc Johnson,

I think it is important to make an agreement what CQL query is going to be used for a particular story, because required indexes may depend on operators that are used in the CQL.

Comment by Marc Johnson [ 24/Apr/20 ]

Bohdan Suprun

I think it is important to make an agreement what CQL query is going to be used for a particular story, because required indexes may depend on operators that are used in the CQL.

I totally agree, we are going to need to think about how we sequence the work to do that

Comment by Cate Boerema (Inactive) [ 27/Apr/20 ]

May reduce write performance (especially when upgrading the system or migrating from existing systems)
May not improve performance in all cases

These considerations concern me. Rather than making this a blanket policy, is there any way to only add indexes where they would improve search performance and not negatively impact write performance?

I wonder if the folks on the performance testing task force have thoughts on this. Tagging Martin Tran for his thoughts.

+ Stephanie Buck

Comment by Marc Johnson [ 27/Apr/20 ]

Cate Boerema

is there any way to only add indexes where they would improve search performance and not negatively impact write performance?

As I understand it, the addition of an index is likely to have a negative impact on write performance, as the index has to be updated when the write happens (in order for them to be consistent). To what degree that impact is, I don't know, it would require further investigation. It might be that after a while, the additional impact of each new index becomes close to zero.

It's difficult to know for certain whether an index will increase performance for all searches. It is likely that each index will help significantly with a search that only contains the term it is designed to help.

Once the search includes more than one term, it might even reduce the performance of a query because the database needs to decide which index to use. It might not always make the right decision. I believe this could be especially the case with the JSONB structures that FOLIO uses, as there aren't detailed statistics to help it make that decision, unlike regular columns.

I think we would likely need performance tests for a wide variety of searches (and create / update scenarios) in order to learn more about the impact of these indexes.

Does that make sense?

Jakub Skoczen Mikhail Fokanov Julian Ladisch Adam Dickmeiss Please correct my understanding of any of the above.

Comment by Julian Ladisch [ 06/May/20 ]

Indexes reduce write performance, but this is very small and users cannot notice this write performance difference in the front-end.
You may notice the difference when doing bulk inserts or bulk updates, but this is small.
Some (not all) module upgrades require an index recreation, each index may take a few minutes depending on the number of records and the index type. Module upgrades are needed for each release, FOLIO publishes a new release every 3 months.
If an index speeds up the searches it should be used because the write performance penalty is very small.

Comment by Marc Johnson [ 06/May/20 ]

Julian Ladisch Thank you for your thoughts.

Indexes reduce write performance, but this is very small and users cannot notice this write performance difference in the front-end.
You may notice the difference when doing bulk inserts or bulk updates, but this is small.

Do we know what impact each index has?

Some (not all) module upgrades require an index recreation, each index may take a few minutes depending on the number of records and the index type.

I'm thinking as much about upgrades that change many records. As I understand it, any change to a JSONB column involves a re-evaluation of all indexes based upon parts of that column. Is my understanding incorrect?

If an index speeds up the searches it should be used because the write performance penalty is very small.

Does that mean you are advocating for indexes that support every search in the reference inventory UI?

Would you go further, and index all properties in inventory, as they can be used in the CQL search in the UI?

Comment by Julian Ladisch [ 06/May/20 ]

Do we know what impact each index has?

It depends on average size of the field to be indexed and the number of records of the table.

any change to a JSONB column involves a re-evaluation of all indexes based upon parts of that column.

That is correct. Some initial PostgreSQL code for a recheck_on_update optimization exists, but it is not production ready. It calculates the index expression before and after the update and skips index update when they are the same.

I advocate for creating indexes that give a noticeable speedup for all searches where we know that they will likely be executed.

There will always be slow searches, for example when joining tables (instance, holding, item) and using indexed fields from instance and item.

Comment by Marc Johnson [ 08/May/20 ]

Julian Ladisch

It depends on average size of the field to be indexed and the number of records of the table.

Thanks, I wondered if we had a sense of the impact for a table with indexes from properties in a JSONB column with approximately 10 million records. I'm not asking you to investigate this, I was only curious.

I advocate for creating indexes that give a noticeable speedup for all searches where we know that they will likely be executed.

How would you suggest we discover these indexes?

Do you think it is reasonable to do this for all of the explicit search options in the reference UI (compound searches might still be slow)?

A naive approximation, based upon 1 index per search option would mean the current reference UI would need a minimum of 15 indexes (some of which are custom like keyword). More will be coming soon, including 4-9 (I think) for expansions to call number number searching.

Comment by Julian Ladisch [ 08/May/20 ]

Yes, it is reasonable to index all explicit search options in the UI.

We should ask implementers to report slow queries. We may even create a slow queries page on wiki.folio.org where people can easily report them.

Comment by Charlotte Whitt [ 08/May/20 ]

Hi Marc Johnson - it's pretty common to maintain search indexes for large bibliographic data bases with - e.g. I found this 57 p. document with spec for search codes and indexies in in DanBib. A Danish National Union Catalogue with all records formated as danMARC2 (a Danish variant of MARC21) - https://slks.dk/fileadmin/user_upload/0_SLKS/Dokumenter/Biblioteker/Standarder/Praksisregler_for_soegeveje.pdf.

Have we an idea about, how we plan to list/document a list of implemented indexed searches?

Comment by Marc Johnson [ 08/May/20 ]

Julian Ladisch

Yes, it is reasonable to index all explicit search options in the UI.

Does that mean that that you are an advocate of creating (at least one, unless already present) index for each search option we add to the reference UI?

We should ask implementers to report slow queries. We may even create a slow queries page on wiki.folio.org where people can easily report them.

I think that is a good idea. I don't know if the FOLIO community has a process for accepting feedback from the implementers.

Comment by Marc Johnson [ 08/May/20 ]

Charlotte Whitt

it's pretty common to maintain search indexes for large bibliographic data bases with - e.g. I found this 57 p. document with spec for search codes and indexies in in DanBib. A Danish National Union Catalogue with all records formated as danMARC2

Ok. Does that mean you are advocating for FOLIO to create database indexes for each search option we add to the reference UI, based upon the fact that other systems do this?

Unfortunately, that document does not help me understand the needs or the technical choices that organisation made because it is in a language I cannot read.

Have we an idea about, how we plan to list/document a list of implemented indexed searches?

Not that I am aware of, to begin to answer that I'd need to understand the need.

What need / expectation would that documentation serve?

Comment by Charlotte Whitt [ 08/May/20 ]

Sorry Marc Johnson - I didn't intend that you should read that report in Danish. Just get the sense of how other systems do documentation of the search codes and indexes.

Yes, I'm advocating for that we in FOLIO do indexes for each of the search options and defined filters in the UI in Search and filter (first pane). And when adding a new search option or a filter, then we automatically should create the equivalent search index.

These search indexes should be documented, e.g. as a table.

This documentation could be transformed to a tips & trick document, for the staff user to know, what search indexes to use, when wanting to perform a given search. The MM-SIG has started following wiki page - https://folio-org.atlassian.net/wiki/display/FOLIOtips/Searching

Comment by Marc Johnson [ 08/May/20 ]

Charlotte Whitt

I didn't intend that you should read that report in Danish. Just get the sense of how other systems do documentation of the search codes and indexes.

Thanks. It is challenging for me to get a sense of how other systems do this when I am unable to read the documentation itself, sorry :-/

Yes, I'm advocating for that we in FOLIO do indexes for each of the search options and defined filters in the UI in the first page. And when adding a new search option or a filter, then we automatically should create the equivalent search index.

Ok. Alas, it isn't automatic and requires diligence and development work each time.

These search indexes should be documented, e.g. as a table.

This might sound that an impertinent question. Why do they need to be documented? What will this documentation be used for?

This documentation could be transformed to a tips & trick document, for the staff user to know, what search indexes to use, when wanting to perform a given search.

Does that mean this would be user facing documentation for them constructing their own queries outside of the ones provided by the reference UI?

If so, I don't know what the process is for the creation of user facing documentation for FOLIO.

Comment by Charlotte Whitt [ 12/May/20 ]

Hi Theodor Tolstoy (One-Group.se) and Lisa Sjögren - I have added you as watcher to this ticket, while I think the conversation in the #Support channel are addressing from a user perspective some of the challenges we're dealing with re. searches.

Link to the original slack chat, thread 4/16/2020: in the #Support channel - https://folio-project.slack.com/archives/C2BHMHCFJ/p1587026417102300
And to avoid risk loosing the thread, I also captured the conversation in a google doc: https://docs.google.com/document/d/1AaAVpBiEtsK7cnaN3C57fudgmVodqNiIFpJraNncXEI/edit

Comment by Charlotte Whitt [ 12/May/20 ]

Comments/findings from Wayne Schneider [5/12/2020, 10:45 PM]

OK, I have completed the db maintenance and brought the 'large FOLIO Fameflower' system back up.
Here's what I'm finding in inventory search:

  1. SRS performance is only marginally better. It takes 8+ seconds to respond to a query like /source-storage/formattedRecords/3fedbd63-3e85-408e-a1b2-806cd8668802?identifier=INSTANCE
  2. Filters that are backed by indexes (e.g. language, resource type, suppress from discovery) are performing well, except for mode of issuance.
  3. Filters that do not have indexes (format, nature of content) create a large load on the database and do not perform, either timing out or taking a very long time.
  4. Query search is Russian roulette. If there is an index for the property you are searching, it will work well. If not, performance will be unacceptable and DB CPU will spike
  5. Holdings call number search is quite slow, which surprises me, as there should be an index for it.
Comment by Theodor Tolstoy (One-Group.se) [ 15/May/20 ]

Charlotte Whitt Thank you for pinging!
Re: Wayne's findings, I believe SRS has a different kind of problem than what we experience in Inventory. Hopefully, that will get fixed this quarter.

.

Comment by Theodor Tolstoy (One-Group.se) [ 15/May/20 ]

Considerations

  • Should improve the baseline performance of inventory search (to be verified during the end of quarter release process)
  • Back end work required for every new search option e.g. a new filter
  • Will need to be maintained every time the UI changes it's query generation
  • May reduce write performance (especially when upgrading the system or migrating from existing systems)
  • May not improve performance in all cases

I cannot help to think about how this list would have looked, had we had a proper search engine in place.

Comment by Lisa Sjögren [ 15/May/20 ]

Thanks Charlotte Whitt! I am following these discussions around optimizing search performance (speed, usability, precision, internationalization), using either further indexing of fields or by adding a search engine, with great interest.

And thank you Theodor Tolstoy (One-Group.se) for bringing up the search engine angle. I think it's important that the discussions around different approaches – while they might both have their place, short and long term – converge and inform each other.

Comment by Charlotte Whitt [ 15/May/20 ]

Just a FYI: Link to a proposal (work in progress) published on the FOLIO wiki - https://folio-org.atlassian.net/wiki/pages/viewpage.action?pageId=1776170

Comment by Marc Johnson [ 22/Jun/20 ]

I think that the presence of issues reported (mostly by the EBSCO FSE or Performance Task Force teams) for searches that do not use a database index (e.g. CIRCSTORE-215 Closed or MODINVSTOR-472 Closed ) suggests to me that FOLIO is tending towards a policy of requiring a database index for every search query.

Given that, I've expanded the scope of this decision to be broader than inventory, and will add it to the topics for the Technical Leads to discuss.

I think we need need to decide:

  • whether a query that does not use a database index is considered invalid? (the corollary being, all CQL queries must execute using a database index)
  • what should happen when a client request includes an invalid query?

What do folks think? Do we need more information (a proposal) before this can be discussed further?

Comment by Julian Ladisch [ 23/Jun/20 ]

Small tables don't need database indexes because all queries are always fast, even when using a full table scan.

Comment by Charlotte Whitt [ 25/Jun/20 ]

Hi Julian Ladisch - could you give an example on, what you define as small tables?

Comment by Julian Ladisch [ 25/Jun/20 ]

10000 records. Most of these tables probably are small:
alternative_title_type
classification_type
contributor_name_type
contributor_type
identifier_type
instance_format
instance_relationship_type
instance_status
instance_type
loan_type
material_type
mode_of_issuance
nature_of_content_term
statistical_code_type

Comment by Marc Johnson [ 03/Jul/20 ]

Jakub Skoczen Craig McNally Julian Ladisch Cate Boerema Charlotte Whitt Martin Tran Svitlana Zmiivska Jeremy Huff Bohdan Suprun Zak Burke

Here is my current understanding of the situation

  • The Performance Task Force are observing the warning errors in the logs and raising issues to resolve them (see CIRCSTORE-215 Closed )
  • Long running queries are to be cancelled ( see MODINVSTOR-496 Closed and RMB-615 Closed )
  • It has been proposed that all CQL queries that are not supported by a database index be disallowed ( see RMB-534 Open )
  • Folks want to be able to search on any property in inventory using CQL

The top two of these effectively make a database index for many properties required.

If we are to accommodate the fourth option then we either cannot allow the third to go ahead or all possible queries have to be catered for by an index. This would, in effect mean multiple indexes on every property in instances, holdings and items.

Even if the impact of each index is relatively small, the cumulative effect of either of these could be significant.

Have I understood the situation correctly?

Is this the intended outcome of this combination of decisions?

Should there be exceptions? As Julian Ladisch correctly states, some record collections are small and so may not require indexes be sufficiently performant (I'm deliberately not defining what that is in this post)

Does a topic need raising with the Technical Leads or the Technical Council to turn this emerging defacto policy into actual policy?

Comment by Marc Johnson [ 03/Feb/21 ]

Blocking as is dependent upon the decisions made following the elastic search based search PoC

Comment by Charlotte Whitt [ 31/Jan/22 ]

Marc Johnson - can I close this ticket as 'Won't do', or do you want to keep it as an 'umbrealla' ticket and keeping track of the mod-search work?

Comment by Marc Johnson [ 31/Jan/22 ]

Charlotte Whitt

can I close this ticket as 'Won't do', or do you want to keep it as an 'umbrealla' ticket and keeping track of the mod-search work?

I have no interest in this being kept as an umbrella for ES based search work.

Comment by Charlotte Whitt [ 31/Jan/22 ]

This work is now tracked in MSEARCH

Generated at Thu Feb 08 23:21:39 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.