Make mod-lists stream the data when retrieving list contents
Description
Activity

Mikhail Fokanov November 30, 2023 at 8:18 AM (edited)
There are two options we can consider to resolve the performance problem:
1. HTTP streaming of data (as described in the Jira).
Pros:
The items listed in the benefits section of the Jira description.
For simple requests and a small number of records, the approach will not cause problems.
Cons:
Long-running HTTP connections can be killed by the proxy (nginx) request timeout.
Insertion into the database should be done in batches, so there needs to be a mechanism on the mod-lists side to accumulate records and insert them.
If the connection is interrupted, the process has to be restarted from the beginning.
We have experience with this in mod-oai-pmh, and it was extremely tricky. Note, however, that the mod-oai-pmh functionality is not exactly the same (it was a more complex case).
Back-pressure problem: the producer may produce the stream faster than the consumer can consume it.
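The batching concern above can be sketched roughly as follows. This is only an illustration, not the mod-lists implementation: `batched`, `insert_stream`, and the `insert_batch` callback are hypothetical names, and the callback stands in for whatever database write would actually be used.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a lazily consumed stream into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls at most batch_size items from the stream, no more.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_stream(
    records: Iterable[T],
    insert_batch: Callable[[List[T]], None],
    batch_size: int = 1000,
) -> int:
    """Pull records from the stream and pass them to insert_batch one batch at a time.

    insert_batch is a hypothetical callback standing in for the real database write.
    Returns the total number of records handed to the callback.
    """
    total = 0
    for batch in batched(records, batch_size):
        insert_batch(batch)
        total += len(batch)
    return total
```

With this shape, at most one batch is held in memory at a time, and the consumer only pulls the next records after the previous batch has been written. It does not address the restart problem: if the stream fails midway, the batches already inserted would still need to be cleaned up or resumed from.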
2. Pagination by "id greater than the last returned id" (keyset pagination).
Every batch request is executed with the SQL WHERE clause "some_sorted_column > last_id" and a LIMIT (but without OFFSET).
Pros:
Data is retrieved in batches, the same way it should be inserted into the database.
No problems with proxy timeouts
Straightforward implementation
Cons:
If there is no sorted index on the column, the data cannot be sorted at query time. In that case, the data should be retrieved sorted by the primary key, along with the value of the sorting column, and then sorted on the mod-lists side using an "INSERT FROM (SELECT id FROM ....)" query. This is tricky.
It is less performant than doing everything in one query.
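A minimal sketch of option 2, assuming a table with an integer primary key. The table name `list_contents`, the `title` column, and the `fetch_in_batches` helper are made up for illustration, and an in-memory SQLite database stands in for the real one so the example can run on its own.

```python
import sqlite3
from typing import Iterator, List, Tuple


def fetch_in_batches(
    conn: sqlite3.Connection, batch_size: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches using "id > last returned id" instead of OFFSET."""
    last_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, title FROM list_contents "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        # Remember where this batch ended so the next query starts after it.
        last_id = rows[-1][0]
        yield rows


# Hypothetical demo data: seven rows with ids 1..7.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE list_contents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO list_contents (id, title) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 8)],
)

batches = list(fetch_in_batches(conn, batch_size=3))
```

Each query is short-lived, so proxy timeouts are less of a concern, and each batch can be inserted as soon as it arrives. The sketch sorts by the primary key only; it does not cover the case where the sort column has no index.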

Matt Weaver November 27, 2023 at 2:27 PM
Things to look out for: connection timeouts from nginx (default is 30 seconds) and Okapi

Matt Weaver November 27, 2023 at 2:25 PM
We should look at the implementation in OAI-PMH
Details
Assignee: Unassigned
Reporter: Matt Weaver
Labels: none listed
Priority: TBD
Development Team: Corsair

Right now, mod-lists has to buffer the response from mod-fqm-manager when it retrieves the contents of a list before it can send them to the UI. Once the prerequisite work is done, mod-lists should be able to take those results and pass them straight through to the UI without maintaining a large buffer. It should be able to receive a subset of the data from mod-fqm-manager, do minimal processing on it (e.g., restructuring each query result object representing a single row into the DTO representing a list row), then send it out through the controller without waiting for the whole dataset.
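The row-by-row pass-through described above could look roughly like the sketch below. It is an illustration under assumptions, not the actual mod-lists code: the DTO shape produced by `to_list_row` and the function names are hypothetical, and a plain iterable stands in for the response from mod-fqm-manager.

```python
import json
from typing import Dict, Iterable, Iterator


def to_list_row(query_result: Dict[str, object]) -> Dict[str, object]:
    """Restructure one query result into a list-row DTO (shape is hypothetical)."""
    fields = {key: value for key, value in query_result.items() if key != "id"}
    return {"id": query_result["id"], "fields": fields}


def stream_list_rows(query_results: Iterable[Dict[str, object]]) -> Iterator[str]:
    """Yield a JSON array piece by piece, transforming one row at a time."""
    yield "["
    for index, result in enumerate(query_results):
        # Separate elements with a comma, except before the first one.
        prefix = "," if index else ""
        yield prefix + json.dumps(to_list_row(result))
    yield "]"
```

Because both the input and the output are consumed lazily, only one row needs to be held in memory at any moment, and the first chunk can be sent before the upstream service has finished producing the rest.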
Benefits of this
This should reduce the memory footprint of mod-lists considerably, helping to prevent it from getting OOM-killed under heavy load.
It should also improve performance slightly since we won't need to wait for each step in the DB -> FQM -> mod-lists pipeline to finish before the next step can start.
It also opens up the possibility of retrieving the full list contents in a single request, rather than doing a ton of separate HTTP requests, simplifying the overall interaction between the services and reducing the overhead considerably.
Once we figure out how to do this with mod-lists, we can easily use the same technique in edge-fqm, with similar benefits. The hard part is figuring out how to do it the first time.