Workflow Tests with RMB-348 Implementation

Overview

RMB-348 implements directing read and write database calls to the read and write nodes of the database, respectively (a conceptual sketch of this routing follows the module list below). This is a report on a series of Check-in/Check-out and Data Import tests against the pre-Morning Glory release that has RMB-348 implemented in

  • mod-inventory-storage
  • mod-circulation-storage
  • mod-permissions
  • mod-users
  • mod-source-record-storage
  • mod-source-record-manager
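
For context, the core idea of RMB-348 can be sketched as follows. This is a minimal, hypothetical illustration only (the class, fields, and pool wiring are assumptions and do not reflect RMB's actual internals): read-only methods draw connections from a pool pointed at the reader endpoint, while anything that might write stays on the writer pool.

    import io.vertx.core.Future;
    import io.vertx.core.Vertx;
    import io.vertx.pgclient.PgConnectOptions;
    import io.vertx.pgclient.PgPool;
    import io.vertx.sqlclient.PoolOptions;
    import io.vertx.sqlclient.Row;
    import io.vertx.sqlclient.RowSet;
    import io.vertx.sqlclient.Tuple;

    // Hypothetical illustration of the read/write split, not RMB source code.
    class ReadWriteSplitClient {
      private final PgPool writePool;  // writer endpoint (e.g. lcp2-db-02)
      private final PgPool readPool;   // reader endpoint (e.g. lcp2-db-01)

      ReadWriteSplitClient(Vertx vertx, PgConnectOptions writer, PgConnectOptions reader) {
        writePool = PgPool.pool(vertx, writer, new PoolOptions());
        // If no reader is configured, fall back to the writer pool.
        readPool = reader == null ? writePool : PgPool.pool(vertx, reader, new PoolOptions());
      }

      // Custom SQL may contain UPDATE or SELECT nextval(), so it must go to the writer.
      Future<RowSet<Row>> select(String sql, Tuple params) {
        return writePool.preparedQuery(sql).execute(params);
      }

      // The caller guarantees the SQL is read-only, so it may go to the reader.
      Future<RowSet<Row>> selectRead(String sql, Tuple params) {
        return readPool.preparedQuery(sql).execute(params);
      }
    }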

Software Versions

Snapshot versions of modules on 06.08.2022

  • RMB v35.0.0-SNAPSHOT
  • mod-inventory-storage (MG-SNAPSHOT)
  • mod-circulation-storage (MG-SNAPSHOT)
  • mod-permissions (MG-SNAPSHOT)
  • mod-users (MG-SNAPSHOT)
  • mod-source-record-storage (MG-SNAPSHOT)
  • mod-source-record-manager  (MG-SNAPSHOT)

Summary

Limited testing of a few workflows exercising several modules that have the RMB-348 change shows promising signs that both performance and resource utilization could improve (the latter if read-only DB instances are deployed that currently sit idle).

  • Check-in check-out (CICO) tests show that a healthy mix of DB activities was directed properly to the read nodes. Performance seemed to improve by at least 15%. No database or storage module memory leaks were observed.
  • This RMB change currently does not have any positive impact on Data Import, as the read-only DB node is used very little compared to the write node. An action item is to examine the RMB calls that DI uses (throughout its modules: mod-inventory-storage, mod-srs, mod-srm) to see if these calls are read-only and therefore should use one of the new methods created in RMB.
  • RTAC shows some sign of DB usage on the read node, but the majority of the traffic is still on the write node because an RMB stream method was not converted to call the DB read instance. There is a new call implemented in RMB that RTAC in mod-inventory-storage could use to take advantage of the read DB instance.
  • Further work is necessary to understand why the RMB changes upset the failover that is managed by AWS.

Test Results

CHECK-IN/CHECK-OUT

Response Times of 30-Minute Tests


            Average (seconds)      50th %tile (seconds)   75th %tile (seconds)   95th %tile (seconds)
            Check-in  Check-out    Check-in  Check-out    Check-in  Check-out    Check-in  Check-out
1 user      0.614     0.959        0.589     0.925        0.639     1.014        0.797     1.178
5 users     0.512     0.820        0.499     0.795        0.537     0.852        0.619     0.979
8 users     0.502     0.798        0.487     0.782        0.529     0.834        0.606     0.939
20 users    0.509     0.796        0.492     0.773        0.536     0.829        0.633     0.939

Comparisons to Last Release (Lotus)

As the table below shows, all of the response times improved by 15%-25%.


Average (seconds)

            Check-in LT   Check-in MG-Snapshot   Delta     Check-out LT   Check-out MG-Snapshot   Delta
1 user      0.73          0.614                  15.89%    1.188          0.959                   19.27%
5 users     0.62          0.512                  17.41%    0.976          0.820                   15.98%
8 users     0.615         0.502                  19.03%    0.971          0.798                   17.81%
20 users    0.661         0.509                  22.99%    1.052          0.796                   24.33%

Note: LT = Lotus, MG-Snapshot = Morning Glory snapshot.

Resource Usage

CPU usage shows a linear dependency on the number of users, with no sudden spikes or anomalies.

There are no signs of a memory leak at the service level. The only notable change is an increase in memory usage on mod-inventory-storage; however, memory usage came back down after the test set ended.

RDS CPU utilization shows that the read/write distribution between the database nodes was achieved.

Note: lcp2-db-02 is the writer, lcp2-db-01 is the reader.

No sign of a memory leak on the RDS side.

Longevity CICO Test

A longevity test of this workflow was performed, and there was no sign of DB memory leaks either. In this test lcp2-db-01 was the reader and lcp2-db-02 was the writer.

Memory of the writer (lcp2-db-02) promptly bounced back after the test. There were no visible signs of leaking memory over time on the reader (lcp2-db-01).

DATA IMPORT

When a DI job to create 25K MARC BIB records was first tested, the holdings were not created. This was because mod-inventory-storage calls postgresClient

    postgresClient.selectSingle(sql.toString(), Tuple.of(type.getSequenceName()))

passing in a custom query that calls the nextval() function, which has to be executed on the DB write node. In fact, the following RMB methods and their variants (selectSingle(), selectStream(), and select()) all accept custom SQL statements, so a client could pass in an UPDATE or a SELECT nextval() call. Consequently, a new set of read-only methods (selectSingleRead(), selectStreamRead(), selectRead()) was created for clients to take advantage of querying the read-only DB node.
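
To illustrate the distinction, here is a minimal sketch. It assumes the Future-returning signatures implied by the snippet above; the SQL, table name, and helper methods are illustrative, not actual mod-inventory-storage code. Only the selectSingle()/selectSingleRead() method names come from this report.

    import io.vertx.core.Future;
    import io.vertx.sqlclient.Tuple;
    import org.folio.rest.persist.PostgresClient;

    class ReadVsWriteExample {
      // nextval() advances the sequence, i.e. it writes state even though it
      // is issued via a SELECT, so it must keep going through selectSingle(),
      // which targets the write node.
      static Future<Long> nextHrid(PostgresClient postgresClient, String sequenceName) {
        return postgresClient
            .selectSingle("SELECT nextval($1)", Tuple.of(sequenceName))
            .map(row -> row.getLong(0));
      }

      // A genuinely read-only query can opt into the read node via the new
      // selectSingleRead() variant described in this report.
      static Future<Long> instanceCount(PostgresClient postgresClient) {
        return postgresClient
            .selectSingleRead("SELECT count(*) FROM instance", Tuple.tuple())
            .map(row -> row.getLong(0));
      }
    }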

After rolling back the selectSingle() call so that it queries the write node again, the DI jobs were rerun; here are the results:

  • 1K Create:  16.5 minutes (vs. 1.5 mins in regular Lotus (w/o RMB changes))
  • 25K Create: 37 minutes (vs. 16 mins in regular Lotus (w/o RMB changes)) 

(Lotus results here: Data Import Test report (Lotus))


The DB CPU utilization graph below shows the CPU spikes of the two DI jobs. In this graph the DB read node is lcp2-db-02 and the write node is lcp2-db-01.

A couple of things to observe:

  1. The read/write split in RMB does have some impact on DI, but few, if any, positive impacts.
  2. The durations of the 1K and 25K imports are drawn out. The 1K import's CPU % has 3 spikes whereas the 25K import has 2. When the CPU was not spiking, there was a lull in the import and the import's completion percentage did not increase. Perhaps some DI code is waiting for the reader to catch up?
  3. There is little CPU activity on the DB read node, and most of the spikes are on the DB write node. This means that the current implementation of Data Import does not use the RMB methods that have already been converted to query the DB read node. In the future, it would be great if DI could use the new read-only or existing "read" methods in RMB that point to the DB read node for efficiency and performance gains; a sketch of such a conversion follows this list.
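
A minimal sketch of what such a conversion might look like. The query, table, and method name below are illustrative assumptions, not actual Data Import code; only the selectRead() method name comes from this report.

    import io.vertx.core.Future;
    import io.vertx.sqlclient.Row;
    import io.vertx.sqlclient.RowSet;
    import io.vertx.sqlclient.Tuple;
    import org.folio.rest.persist.PostgresClient;

    class DataImportReadExample {
      // Illustrative only: a read-only match lookup during Data Import
      // (e.g. finding an existing instance by HRID) routed to the otherwise
      // idle read node via the new selectRead() variant.
      static Future<RowSet<Row>> findInstanceByHrid(PostgresClient postgresClient, String hrid) {
        return postgresClient.selectRead(
            "SELECT id, jsonb FROM instance WHERE jsonb->>'hrid' = $1",
            Tuple.of(hrid));
      }
    }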



RTAC 

RTAC is a read-only operation that retrieves a library item's circulation status. The RTAC calls also use the streaming API in RMB. Here is how RTAC did for 1-user and 5-user tests retrieving 50 records.

In the graph below lcp2-db-01 is the reader and lcp2-db-02 is the writer.

Why did the writer DB instance have more resource usage?

  • Not all modules in the RTAC workflow have this RMB change.
  • One crucial RMB call that RTAC/mod-inventory-storage makes, selectStream(), was not using the read-only DB node. It is not safe to make selectStream() itself use the read-only DB node because, as in the DI case, a custom query (one that may require the DB write node) may be passed in. Therefore a new method, selectStreamRead(), was created for it to use, as sketched below.
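
A minimal sketch of how such a streaming call might look after the conversion. The handler-style signature, SQL, and table name are assumptions for illustration; only the selectStreamRead() method name comes from this report.

    import io.vertx.core.AsyncResult;
    import io.vertx.core.Handler;
    import io.vertx.sqlclient.Row;
    import io.vertx.sqlclient.RowStream;
    import io.vertx.sqlclient.Tuple;
    import org.folio.rest.persist.PostgresClient;

    class RtacStreamExample {
      // Illustrative only: stream item records from the read node using the
      // new selectStreamRead() variant instead of selectStream().
      static void streamItems(PostgresClient postgresClient, String holdingsRecordId,
          Handler<Row> onRow, Handler<Void> onEnd) {
        postgresClient.selectStreamRead(
            "SELECT jsonb FROM item WHERE holdingsRecordId = $1",
            Tuple.of(holdingsRecordId),
            (AsyncResult<RowStream<Row>> ar) -> {
              if (ar.succeeded()) {
                RowStream<Row> stream = ar.result();
                stream.handler(onRow);    // called once per row
                stream.endHandler(onEnd); // called when the result set is exhausted
              }
            });
      }
    }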

Failover Testing

In a High Availability environment with at least one write DB node and one read DB node, when a failover occurs, the read DB node becomes the write node and a new read node is spun up. This is all managed by AWS (when hosted on AWS Cloud). The PTF environment is hosted on AWS Cloud and has one DB write node and one read node. For this testing, a CICO test was executed and a failover was triggered. In a normal situation, here is what the outcome of the failover looks like:

There were 18 errors at around 21:32 (highlighted in yellow), and after the DB writer was re-established the performance test went on successfully and did not generate any more errors after the failover.

The DB CPU utilization graph shows where the failover happened and the stability afterward; the read node became the writer, and vice versa.

Some of the 18 errors were:


With these RMB changes, the situation is different:

The DB CPU utilization graph shows the test starting at 6:30 and ending at 7:00, but more importantly the switchover at 6:40 where the reader became the writer and vice versa. Here lcp2-db-01 was the writer (recognizable by its higher CPU utilization), but at 6:40 it failed and became the reader, while lcp2-db-02 took its place as the writer and continued to use resources at the typical level of a writer. As far as the DB is concerned everything looked normal, except that lots of 500 errors were thrown.

Errors were found on certain APIs:

Further work is necessary to understand why the RMB changes upset the failover that is managed by AWS.