PTF -RTAC Workflow Test Report (Fameflower)

PERF-61 - Getting issue details... STATUS

Overview

In this api, we are checking the performance of RTAC workflow running in the Fameflower release - 

 We tested it with 1, 5 virtual users for 30 minutes.  

Backend:

  • mod-circulation-18.0.9
  • mod-circulation-storage-11.0.0
  • mod-inventory-storage-19.1.2
  • mod-inventory-14.1.3
  • okapi-2.38.0
  • mod-authtoken-2.4.0
  • mod-permissions-5.9.0

Environment:

  • 55 back-end modules deployed in 110 ECS services
  • 3 okapi ECS services
  • 8 m5.large  EC2 instances
  •  db.r5.xlarge AWS RDS instance
  • INFO logging level

High Level Summary

  1. On average rtac/${instanceId} API call takes ~13 seconds for an instance which has greater than 300 holdings
  2. The main issue is with https://github.com/folio-org/mod-rtac/blob/d8e51a8a82b69a6f1717b9c7abf60342ded99246/src/main/java/org/folio/rest/impl/RtacResourceImpl.java#L48.  Most of the wait time is due to lots and lots of API calls to get holdings, items, and loans.  Even after leveraging vertx's asynchronous behavior and considering the fact that 1 instance -> many holdings -> each holding -> many items -> each item -> many loans, this call chain triggers O(n^3) algorithmic complexity. 
    see Recommended Improvements for the JIRAs that were created/covered by this testing effort.

Test Runs

Test

Virtual Users

Duration

OKAPI log level

Profiled

Holdings per instance

API call

1. 

1

30 min

WARNING

No

greater than 300

GET rtac/${instanceId}

2. 

5

30 min

WARNING

No

greater than 300

GET rtac/${instanceId}
3.130 minWARNINGOffgreater than 300

Mimic above call:

GET inventory/instances/${instanceId} → GET /holdings-storage/holdings → GET /inventory/items → GET /circulation/loans

4.530 minWARNINGOffgreater than 300

Mimic above call:

GET inventory/instances/${instanceId} → GET /holdings-storage/holdings → GET /inventory/items → GET /circulation/loans


Results

  1. A rtac-level request to get all open loans for greater than 300 holdings per instance

    GET rtac/${instanceId}

    For 30 minutesProfilerAverage 75th %tile
    1 userON17.62 seconds25.47 seconds
    1 userOFF15.44 seconds20.99 seconds
    5 usersON47.53 seconds55.01 seconds
    5 usersOFF42.06 seconds49.61 seconds
  2. Because there are no custom metrics inside mod-rtac to expose the wait time of each API call, we needed to mimic the rtac API logic by making subsequent calls to mod-inventory, mod-circulation to get open loans for greater than 300 holdings per instance in the JMeter script.  One caveat is that these calls are synchronous and are end-to-end (over the internet from JMeter to FOLIO) so naturally these API calls will be slower compared to the actual /rtac API calls from mod-rtac which were implemented asynchronously.  Therefore, the response times while are not likely what is seen by the code but should still give us a good idea of the cost of making n^3 API calls even if we take off some percentage for end-to-end over-the-wire latencies.

Calculations:

Total response time to get all holdings, items, open loans per instance(seconds) = time to get current instance + time to get all holdings 

time to get all holdings = get all holdings per instance + time to get all items + time to get all loans

time to get all items = get current_item * total items

time to get all loans = get current_loan * total loans

Get  an instance 

GET inventory/instances/${instanceId}

Profilercurrent_instance + avg(get all holdings) = Total averagecurrent_instance + avg(get all holdings) = Total average 75th %tile
1 userON143 ms + 54.5 sec = 54.7 sec131 ms + 54.7 sec = 54.9 sec
1 userOFF118 ms + 48.4 sec = 48.6 sec108 ms + 52.6 sec = 52.7 sec
5 usersON129 ms + 67.2 sec = 67.3 sec137 ms + 70.1 sec = 70.2 sec
5 usersOFF102 ms + 57.8 sec = 57.9 sec112 ms + 58.2 sec = 58.3 sec

For each instance, get all holdings

GET /holdings-storage/holdings?limit=2147483647&query=instanceId==${instanceId}

Profileravg(get all holdings per instance) + avg(get all items) + avg(get all loans) = Total average avg(get all holdings per instance) + avg(get all items) + avg(get all loans) = Total average 75th %tile
1 userON174 ms + 39 sec + 15.3 sec = 54.5 sec213 ms + 40.2 sec + 14.3 sec = 54.7 sec
1 userOFF73 ms + 34 sec + 14.3 sec = 48.4 sec86 ms + 40.2 sec + 12.3 sec = 52.6 sec
5 usersON179 ms + 49.3 sec + 17.7 sec = 67.2 sec223 ms + 51.6 sec + 18 sec = 70.1 sec
5 usersOFF71 ms + 42.9 sec + 14.9 sec = 57.840 ms + 42.6 sec + 15.6 sec = 58.2 sec

For each holding, get all items

GET /inventory/items?limit=2147483647&query=holdingsRecordId==${current_holding}

Profileravg(get current_item) * total items = Total average avg(get current_item) * total items = Total average 75th %tile
1 userON119 ms * 333 = 39 sec121 ms * 333 = 40.2 sec
1 userOFF104 ms * 333 = 34 sec106 ms * 333 = 35.2 sec
5 usersON148 ms * 333 = 49.3 sec155 ms * 333 = 51.6 sec
5 usersOFF129 ms * 333 = 42.9 sec128 ms * 333 = 42.6 sec

For each item, get all open loans

GET /circulation/loans?limit=2147483647&query=(itemId==${current_item} and status.name==Open)

Profileravg(get current_loan) * total loans = Total average avg(get current_loan) * total loans = Total average 75th %tile
1 userON46 ms * 333 = 15.3 sec43 ms * 333 = 14.3 sec
1 userOFF43 ms * 333 = 14.3 sec37 ms * 333 = 12.3 sec
5 usersON53 ms * 333 = 17.7 sec54 ms * 333 = 18 sec
5 usersOFF45 ms * 333 = 14.9 sec47 ms * 333 = 15.6 sec

JVM Profiling

  • Overall slow methods (between the modules profiled: mod-rtac, mod-circulation, mod-circulation-storage, mod-inventory, mod-inventory-storage)

To see more drill down these methods at: http://ec2-54-165-58-65.compute-1.amazonaws.com/grafana/d/b3iP-8hiz/java-profiling-folio-new?orgId=1&from=1592598484852&to=1592600441623&var-service_tag=mod-rtac&var-aggregation=10s


Database

The database CPU usage decreases slightly as the number of users increases 1, 5  users run. At maximum 28% CPU usage for 1 user. We are seeing a slight decrease in CPU utilization for 5 Users because RDS Postgresql is caching the most frequent query results.


Database Slow queries 

The slowest query which took the most of execution time was initiated by the mod_inventory_storage service presented in the following table:


Percent of total time

Average Time, ms

Calls

Module

Query

77%14,632 ms21mod_inventory_storage

SELECT COUNT(*) FROM fs09000000_mod_inventory_storage.holdings_record


API response time vs Postgresql query response time

This difference below between API and Postgresql query response time is the sum of network time to/back from calling database, deserialization of JSON string into JSON objects, processing time, and network time to/from API. Below are a single API request and SQL query triggered manually for comparison.

ActionType of RequestActual requestTime, ms
get InstanceSQL queryselect * from fs09000000_mod_inventory_storage.instance where id = '00009a0b-e973-443b-9796-c3373a352f86';305

API requesthttps:/{okapi-url}/inventory/instances/00009a0b-e973-443b-9796-c3373a352f86368
get holdingsSQL queryselect * from fs09000000_mod_inventory_storage.holdings_record hr1 where hr1.instanceid = '121063ba-43cc-48f4-a49f-aaf5488127e0';350

API requesthttps:/{okapi-url}/holdings-storage/holdings?limit=2147483647&query=instanceId==121063ba-43cc-48f4-a49f-aaf5488127e503
get ItemSQL queryselect * from fs09000000_mod_inventory_storage.item where holdingsrecordid='95a897e5-2505-4803-a7f1-7da39c9f67cb';353

API requesthttps://{okapi-url}/inventory/items?limit=2147483647&query=holdingsRecordId%3D%3D95a897e5-2505-4803-a7f1-7da39c9f67cb662
get Open loanSQL queryselect * from fs09000000_mod_circulation_storage.loan where jsonb->>'itemId'='3faad467-fca1-41c8-ad7d-af1dcd3fdc3c' AND jsonb->'status'->>'name'='Open'; 372

API requesthttps://{okapi-url}/circulation/loans?limit=2147483647&query=(itemId==3faad467-fca1-41c8-ad7d-af1dcd3fdc3c and status.name==Open)400


CPU Utilization

CPU utilization is relatively on a lower side for most modules except for mod-inventory-storage


1 user5 users

Average %Average %
mod-circulation3.143.95
mod-circulation-storage4.846.38
mod-inventory5.386.54
mod-inventory-storage2624
mod-rtac2.453

Memory

Memory was stable throughout the runs, only a spike here or there, but in a 30 minutes run they were consistent. 


1 user5 users

Average %Average %
mod-circulation72.172.50
mod-circulation-storage39.4639.55
mod-inventory4446.5
mod-inventory-storage44.3045
mod-rtac55.255


Recommended Improvements 

  • The following JIRA has been created for mod-rtac:

MODRTAC-38 - Getting issue details... STATUS

  • Consider taking advantage of CQL "OR" operator to chain IDs together at each level of getting items, holdings, loans,  instead of making API calls.  This approach could be faster, although need to be proven.
  • Consider doing SQL joins between instance/holdings/items where possible to reduce making API and database calls.
  • In mod-circulation, mod-circulation-storage, mod-inventory-storage, consider using a more efficient JSON package or call using the existing Jackson serialization calls in a different way to address the item: JVM profiling shows JSON de/serialization operations one of the slowest operations.

Appendix

See Attached rtac-test-report.xlsx for details