...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Jira Legacy
Table of Contents
Overview
In this workflow, we are checking the performance of renewing loan items in the Goldenrod release.
We tested it with 1 and 2 virtual users for 30 minutes.
Backend:
- mod-users-17.1.0
- mod-users-bl-6.0.0
- mod-authtoken-2.5.1
- mod-permissions-5.11.2
- mod-circulation-19.0.9
- mod-circulation-storage-12.0.1
- okapi-3.1.1
Environment:
- 55 back-end modules deployed in 110 ECS services
- 3 okapi ECS services
- 8 m5.large EC2 instances
- 2 db.r5.xlarge AWS RDS instances (1 reader, 1 writer)
- INFO logging level
High Level Summary
- On average, the POST_/circulation/renew-by-barcode API call takes 671.12 ms as we gradually increase the number of loans per user and the number of users to 200 and 2, respectively.
- POST_/circulation/renew-by-barcode makes 26 different API requests to gather userBarcode and itemBarcode information; 34% of those calls are to mod-authtoken. See Recommended Improvements for the JIRAs that were created/covered by this testing effort.
- No missing indexes and no suspicious slow queries were found, except for a large number of SELECT count_estimate() calls.
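For reference, the workload drives mod-circulation's renew-by-barcode endpoint. Below is a minimal sketch of that call, assuming the standard Okapi tenant/token headers and the Python requests library; the gateway URL, token, and barcodes are placeholders rather than values from this test environment (the tenant id matches the schema prefix seen in the database queries section).

```python
import requests

OKAPI_URL = "https://okapi.example.org"   # hypothetical Okapi gateway URL
HEADERS = {
    "X-Okapi-Tenant": "fs09000000",       # tenant, per the fs09000000_* schema names below
    "X-Okapi-Token": "<auth token>",      # each sub-request is re-validated by mod-authtoken
    "Content-Type": "application/json",
}

def renew_by_barcode(item_barcode: str, user_barcode: str) -> dict:
    """Renew one loan; mod-circulation fans out ~26 follow-up API calls per request."""
    resp = requests.post(
        f"{OKAPI_URL}/circulation/renew-by-barcode",
        json={"itemBarcode": item_barcode, "userBarcode": user_barcode},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Example: a virtual user renews each of its loans in turn, as in the 10/50/100/200-loan runs.
# for barcode in item_barcodes:
#     renew_by_barcode(barcode, user_barcode)
```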
Test Runs
Test | Virtual Users | Duration | OKAPI log level | Profiled | Loans per User |
---|---|---|---|---|---|
1. | 1 | 30 min | WARNING | No | 10, 50, 100, 200 |
2. | 2 | 30 min | WARNING | No | 10, 50, 100, 200 |
Results
Renew all loans by barcode
Total Users for 30 minutes | Loans | Renew all loans (Average) | POST_/circulation/renew-by-barcode (Average) |
---|---|---|---|
1 user | 10 | 6.52 seconds | 585.11 ms |
| 50 | 31.81 seconds | 637.48 ms |
| 100 | 1.06 minutes | 650.15 ms |
| 200 | 2.09 minutes | 673.40 ms |
2 users | 10 | 7.42 seconds | 683.76 ms |
| 50 | 32.61 seconds | 686.65 ms |
| 100 | 1.10 minutes | 705.37 ms |
| 200 | 2.27 minutes | 747.04 ms |
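A rough read of the table: the "Renew all loans" time scales close to linearly with the number of loans (for 1 user it grows from 6.52 seconds at 10 loans to 2.09 minutes at 200 loans, roughly a 19x increase for 20x the loans), while the per-call renew-by-barcode average grows only modestly (585.11 ms to 673.40 ms), so the total is dominated by issuing one renewal call per loan.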
- Below is the dependency graph generated by Giraffe for a single POST_/circulation/renew-by-barcode API call. It shows the individual calls made by renew-by-barcode to gather prerequisite data from different APIs such as item-storage, location-units, loan-types, request-storage, notice-policy, locations, and mod-authtoken. In all, this call took 990 ms; 6% of that was consumed by mod-authtoken and the remaining 94% by the other APIs gathering prerequisite data.
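Put differently, 6% of 990 ms is roughly 60 ms spent in mod-authtoken for this request; with about 34% of the 26 sub-requests (approximately 9 calls) going to mod-authtoken, that works out to on the order of 6-7 ms per token check.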
Database CPU Utilization
The database CPU usage increases gradually across the 1- and 2-user runs, peaking at 9% for 1 user. For 2 users, the maximum CPU utilization increases and then stabilizes as the number of loans grows. Overall, database CPU utilization remains relatively low.
For 30 minutes | Loans | RDS CPU Utilization |
---|---|---|
1 user | 10 | 7% |
| 50 | 8% |
| 100 | 8% |
| 200 | 9% |
2 users | 10 | 14% |
| 50 | 15% |
| 100 | 15% |
| 200 | 15% |
Database queries
For one 30-minute test run, the following queries were made. Notably, the total number of calls and the total time spent on these queries are significant.
Query | Percent of total time | Average time | Total calls made |
---|---|---|---|
SELECT count_estimate('SELECT jsonb,id FROM fs09000000_mod_inventory_storage.item WHERE lower(f_unaccent(item.jsonb->>''barcode'')) LIKE lower(f_unaccent(''54357881''))') | 71% | 14 ms | 9027 |
SELECT count_estimate('SELECT jsonb FROM fs09000000_mod_patron_blocks.user_summary WHERE (jsonb->>''userId'') = ''4deba809-4fb4-432a-b3df-c54b6ad51d22''') | 10% | 2 ms | 8457 |
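The first query above appears to come from the item-by-barcode lookup performed for each renewal. A minimal sketch of the equivalent CQL request is shown below, assuming the mod-inventory-storage item-storage endpoint; RMB-based storage modules typically run a SELECT count_estimate(...) alongside the data query to populate totalRecords, which would explain why these count_estimate calls are so frequent. The URL, headers, and barcode are illustrative only.

```python
import requests

OKAPI_URL = "https://okapi.example.org"
HEADERS = {"X-Okapi-Tenant": "fs09000000", "X-Okapi-Token": "<auth token>"}

# CQL lookup of an item by barcode, roughly what each renewal triggers.
resp = requests.get(
    f"{OKAPI_URL}/item-storage/items",
    params={"query": "barcode==54357881", "limit": 1},
    headers=HEADERS,
    timeout=30,
)
items = resp.json().get("items", [])
# RMB translates the CQL into the LIKE query shown above and additionally issues
# SELECT count_estimate(...) so the response can include totalRecords.
```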
Service CPU Utilization
Service CPU utilization is relatively low for most modules except mod-circulation. The service with the highest CPU utilization is shown as a range to give a better picture.
For 30 minutes | Modules | Service CPU Utilization |
---|---|---|
1 user | mod-inventory | 3% |
| mod-inventory-storage | 10% |
| mod-circulation | 70%-95% |
| mod-circulation-storage | 6% |
2 users | mod-inventory | 3.25% |
| mod-inventory-storage | 17.75% |
| mod-circulation | 70%-122.75% |
| mod-circulation-storage | 11.75% |
Service Memory Utilization
Service memory was stable throughout the runs; apart from an occasional spike, usage was consistent across each 30-minute run.
...
Memory utilization was relatively low for most modules except mod-circulation, which was on the higher side. The service with the highest memory utilization is shown as a range to give a better picture.
For 30 minutes | Modules | Service Memory Utilization |
---|---|---|
1 user | mod-inventory | 63% |
| mod-inventory-storage | 56% |
| mod-circulation | 113%-118% |
| mod-circulation-storage | 42% |
2 users | mod-inventory | 65% |
| mod-inventory-storage | 56% |
| mod-circulation | 113%-118% |
| mod-circulation-storage | 49% |
Graph of service memory utilization for mod-circulation: memory usage started at ~113%, rose to ~118%, and fell back to ~101%.
Recommended Improvements
- The following JIRAs have been created for mod-authtoken optimization:
Jira Legacy (server: System JIRA) — see the Appendix for the issue keys.
- Consider taking advantage of the CQL "OR" operator to chain IDs together at each level of getting items, holdings, and loans, instead of making individual API calls. This approach could be faster, although it needs to be proven (see the sketch after this list).
- Consider doing SQL joins between instances/holdings/items where possible to reduce the number of API and database calls.
- In mod-circulation, mod-circulation-storage, and mod-inventory-storage, consider using a more efficient JSON package, or invoking the existing Jackson serialization in a different way, since JVM profiling shows JSON de/serialization to be among the slowest operations.
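To illustrate the CQL "OR" recommendation above, here is a hedged sketch of batching many record IDs into a single query instead of issuing one GET per ID. The endpoint, IDs, and page size are illustrative only, and the actual performance benefit would need to be measured.

```python
import requests

OKAPI_URL = "https://okapi.example.org"
HEADERS = {"X-Okapi-Tenant": "fs09000000", "X-Okapi-Token": "<auth token>"}

def get_items_for_holdings(holdings_ids):
    """Fetch items for many holdings in one CQL query instead of N separate GETs."""
    # Builds: holdingsRecordId=="id1" or holdingsRecordId=="id2" or ...
    cql = " or ".join(f'holdingsRecordId=="{hid}"' for hid in holdings_ids)
    resp = requests.get(
        f"{OKAPI_URL}/inventory/items",
        params={"query": cql, "limit": 1000},  # illustrative page size
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])
```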
Appendix
...
MODAT-80
MODAT-83
See the attached user_loanRenewals-test-report.xlsx for details.