...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Jira Legacy
Table of Contents
Overview
In this workflow, we are checking the performance of renewing loan items in the Goldenrod release.
We tested it with 1 and 2 virtual users for 30 minutes.
Backend:
- mod-users-17.1.0
- mod-users-bl-6.0.0
- mod-authtoken-2.5.1
- mod-permissions-5.11.2
- mod-circulation-19.0.9
- mod-circulation-storage-12.0.1
- okapi-3.1.1
Environment:
- 55 back-end modules deployed in 110 ECS services
- 3 okapi ECS services
- 8 m5.large EC2 instances
- 2 db.r5.xlarge AWS RDS instances (1 reader, 1 writer)
- INFO logging level
High Level Summary
- On average, the POST_/circulation/renew-by-barcode API call takes 671.12 ms as we gradually increase the number of loans per user and the number of users to 200 and 2, respectively.
- POST_/circulation/renew-by-barcode makes 26 different API requests to gather userBarcode and itemBarcode information; 34% of those calls are to mod-authtoken. See Recommended Improvements for the JIRAs that were created/covered by this testing effort.
- No missing indexes and no suspicious slow queries were found, except for a large number of SELECT count_estimate() calls.
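For reference, the workload drives mod-circulation's renew-by-barcode endpoint. Below is a minimal sketch of that call, assuming the standard Okapi tenant/token headers and the Python requests library; the gateway URL, token, and barcodes are placeholders rather than values from this test environment (the tenant id matches the schema prefix seen in the database queries section).

```python
import requests

OKAPI_URL = "https://okapi.example.org"   # hypothetical Okapi gateway URL
HEADERS = {
    "X-Okapi-Tenant": "fs09000000",       # tenant, per the fs09000000_* schema names below
    "X-Okapi-Token": "<auth token>",      # each sub-request is re-validated by mod-authtoken
    "Content-Type": "application/json",
}

def renew_by_barcode(item_barcode: str, user_barcode: str) -> dict:
    """Renew one loan; mod-circulation fans out ~26 follow-up API calls per request."""
    resp = requests.post(
        f"{OKAPI_URL}/circulation/renew-by-barcode",
        json={"itemBarcode": item_barcode, "userBarcode": user_barcode},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example: a virtual user renews each of its loans in turn, as in the 10/50/100/200-loan runs.
# for barcode in item_barcodes:
#     renew_by_barcode(barcode, user_barcode)
```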
Test Runs
Test | Virtual Users | Duration | OKAPI log level | Profiled | Loans per User |
---|---|---|---|---|---|
1. | 1 | 30 min | WARNING | No | 10, 50, 100, 200 |
2. | 2 | 30 min | WARNING | No | 10, 50, 100, 200 |
Results
Renew all loans by barcode
Total Users for 30 minutes | Loans | Renew all loans (Average) | POST_/circulation/renew-by-barcode (Average) |
---|---|---|---|
1 user | 10 | 6.52 seconds | 585.11 ms |
| 50 | 31.81 seconds | 637.48 ms |
| 100 | 1.06 minutes | 650.15 ms |
| 200 | 2.09 minutes | 673.40 ms |
2 users | 10 | 7.42 seconds | 683.76 ms |
| 50 | 32.61 seconds | 686.65 ms |
| 100 | 1.10 minutes | 705.37 ms |
| 200 | 2.27 minutes | 747.04 ms |
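A rough read of the table: the "Renew all loans" time scales close to linearly with the number of loans (for 1 user it grows from 6.52 seconds at 10 loans to 2.09 minutes at 200 loans, roughly a 19x increase for 20x the loans), while the per-call renew-by-barcode average grows only modestly (585.11 ms to 673.40 ms), so the total is dominated by issuing one renewal call per loan.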
- Below is the dependency graph generated by Giraffe for a single POST_/circulation/renew-by-barcode API call. It shows the individual calls made by renew-by-barcode to gather prerequisite data from different APIs such as item-storage, location-units, loan-types, request-storage, notice-policy, locations, and mod-authtoken. In all, this call took 990 ms; 6% of that was consumed by mod-authtoken and the remaining 94% by the other APIs gathering prerequisite data.
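Put differently, 6% of 990 ms is roughly 60 ms spent in mod-authtoken for this request; with about 34% of the 26 sub-requests (approximately 9 calls) going to mod-authtoken, that works out to on the order of 6-7 ms per token check.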
Database CPU Utilization
The database CPU usage increases gradually across the 1- and 2-user runs, peaking at 9% for 1 user. For 2 users, the maximum CPU utilization increases and then stabilizes as the number of loans grows. Overall, database CPU utilization remains relatively low.
For 30 minutes | Loans | RDS CPU Utilization |
---|---|---|
1 user | 10 | 7% |
| 50 | 8% |
| 100 | 8% |
| 200 | 9% |
2 users | 10 | 14% |
| 50 | 15% |
| 100 | 15% |
| 200 | 15% |
Database queries
For one 30-minute test run, the following queries were made. Notably, the total number of calls and the total time spent on these queries are significant.
Query | Percent of total time | Average time | Total calls made |
---|---|---|---|
SELECT count_estimate('SELECT jsonb,id FROM fs09000000_mod_inventory_storage.item WHERE lower(f_unaccent(item.jsonb->>''barcode'')) LIKE lower(f_unaccent(''54357881''))') | 71% | 14 ms | 9027 |
SELECT count_estimate('SELECT jsonb FROM fs09000000_mod_patron_blocks.user_summary WHERE (jsonb->>''userId'') = ''4deba809-4fb4-432a-b3df-c54b6ad51d22''') | 10% | 2 ms | 8457 |
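The first query above appears to come from the item-by-barcode lookup performed for each renewal. A minimal sketch of the equivalent CQL request is shown below, assuming the mod-inventory-storage item-storage endpoint; RMB-based storage modules typically run a SELECT count_estimate(...) alongside the data query to populate totalRecords, which would explain why these count_estimate calls are so frequent. The URL, headers, and barcode are illustrative only.

```python
import requests

OKAPI_URL = "https://okapi.example.org"
HEADERS = {"X-Okapi-Tenant": "fs09000000", "X-Okapi-Token": "<auth token>"}

# CQL lookup of an item by barcode, roughly what each renewal triggers.
resp = requests.get(
    f"{OKAPI_URL}/item-storage/items",
    params={"query": "barcode==54357881", "limit": 1},
    headers=HEADERS,
    timeout=30,
)
items = resp.json().get("items", [])
# RMB translates the CQL into the LIKE query shown above and additionally issues
# SELECT count_estimate(...) so the response can include totalRecords.
```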
Service CPU Utilization
Service CPU utilization is relatively low for most modules except mod-circulation. The service with the highest CPU utilization is shown as a range to give a better picture.
For 30 minutes | Modules | Service CPU Utilization |
---|---|---|
1 user | mod-inventory | 3% |
| mod-inventory-storage | 10% |
| mod-circulation | 70%-95% |
| mod-circulation-storage | 6% |
2 users | mod-inventory | 3.25% |
| mod-inventory-storage | 17.75% |
| mod-circulation | 70%-122.75% |
| mod-circulation-storage | 11.75% |
Service Memory Utilization
Service memory was stable throughout the runs; apart from an occasional spike, usage was consistent across each 30-minute run.
...
Memory utilization was relatively low for most modules except mod-circulation, which was on the higher side. The service with the highest memory utilization is shown as a range to give a better picture.
For 30 minutes | Modules | Service Memory Utilization |
---|---|---|
1 user | mod-inventory | 63% |
| mod-inventory-storage | 56% |
| mod-circulation | 113%-118% |
| mod-circulation-storage | 42% |
2 users | mod-inventory | 65% |
| mod-inventory-storage | 56% |
| mod-circulation | 113%-118% |
| mod-circulation-storage | 49% |
Graph of service memory utilization for mod-circulation: memory usage started at ~113%, rose to ~118%, and fell back to ~101%.
Recommended Improvements
- The following JIRAs have been created for mod-authtoken optimization:
Jira Legacy (server: System JIRA) — see the Appendix for the issue keys.
- Consider taking advantage of the CQL "OR" operator to chain IDs together at each level of getting items, holdings, and loans, instead of making individual API calls. This approach could be faster, although it needs to be proven (see the sketch after this list).
- Consider doing SQL joins between instances/holdings/items where possible to reduce the number of API and database calls.
- In mod-circulation, mod-circulation-storage, and mod-inventory-storage, consider using a more efficient JSON package, or invoking the existing Jackson serialization in a different way, since JVM profiling shows JSON de/serialization to be among the slowest operations.
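To illustrate the CQL "OR" recommendation above, here is a hedged sketch of batching many record IDs into a single query instead of issuing one GET per ID. The endpoint, IDs, and page size are illustrative only, and the actual performance benefit would need to be measured.

```python
import requests

OKAPI_URL = "https://okapi.example.org"
HEADERS = {"X-Okapi-Tenant": "fs09000000", "X-Okapi-Token": "<auth token>"}

def get_items_for_holdings(holdings_ids):
    """Fetch items for many holdings in one CQL query instead of N separate GETs."""
    # Builds: holdingsRecordId=="id1" or holdingsRecordId=="id2" or ...
    cql = " or ".join(f'holdingsRecordId=="{hid}"' for hid in holdings_ids)
    resp = requests.get(
        f"{OKAPI_URL}/inventory/items",
        params={"query": cql, "limit": 1000},  # illustrative page size
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])
```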
Appendix
...
MODAT-80
MODAT-83
See the attached user_loanRenewals-test-report.xlsx for details.