Check-in-check-out Test Report (Kiwi)

Check-in-check-out Test Report (Kiwi)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Overview

This is a report for a series of Check-in-check-out test runs against the Kiwi release. 

Back End:

  • mod-circulation-22.1.0

  • mod-circulation-storage-13.1.0

  • mod-inventory-18.0.0

  • mod-inventory-storage-22.0.0

  • mod-authtoken-2.9.0

  • mod-pubsub-2.4.0

  • mod-patron-blocks-1.4.0

  • mod-feesfines-17.0.0

  • okapi-4.9.0

Front End:

  • folio_circulation-6.0.0

  • Item Check-in (folio_checkin-6.0.0)

  • Item Check-out (folio_checkout-7.0.0)

Infrastructure:

  • 71 back-end modules deployed in 141 ECS tasks

  • 3 okapi ECS services

  • 6 m5.xlarge  EC2 instances

  • 2 db.r6g.xlarge AWS RDS instance

  • INFO logging level

High Level Summary

  • Kiwi performs much better than Juniper GA, response times for check-ins are <900ms, checkout <1500ms, with not much variation between 1 and 20 users

  • Database performance is better and uses much less CPU compared to Juniper GA

  • Worst-performing APIs are still POST /checkin-by-barcode and POST /checkout-by-barcode.  The response time are still about 500ms. GET /circulation/loans also takes more than 200ms. GET /inventory/item (by barcode) takes less than 100ms now.

  • Longevity test shows response times worsen over time, probably due to the growing DB CPU utilization.  https://folio-org.atlassian.net/browse/CIRCSTORE-304 Potentially could address this situation

Test Runs

Test

Virtual Users

Duration

1.

1

30 mins

2.

5

30 mins

3.

8

30 mins

4.

20

30 mins

5.

1 (repeat)

30 mins

Results

Response Times (Average of all tests listed above, in seconds)

 

Average (seconds)

50th %tile (seconds)

75th %tile (seconds)

95th %tile  (seconds)

 

Check-in

Check-out

Check-in

Check-out

Check-in

Check-out

Check-in

Check-out

1 user

0.838

1.582

0.767

1.464

0.915

1.716

0.915

1.716

5 users

0.73

1.376

0.676

1.272

0.769

1.448

1.052

2.061

8 users

0.758

1.392

0.674

1.228

0.839

1.44

1.202

2.343

20 users

0.899

1.506

0.731

1.303

0.931

1.535

1.531

2.504

Response times are consistently good, sub-second for check-ins and around 1.5 seconds for check-outs even in the 75th percentile. Not much variations between 1 and 20 users. 

Comparisons to Last Release

The following tables compare Juniper HF3 test results against Kiwi General release (deployed at Bugfest).

Response Time Comparison

In general there is no regression in performance.  The response times between Kiwi and Juniper are very close to each other for 1-8 users load unless they were in the 95th percentile group or the 20 users load where Kiwi clearly out-perform Juniper.  In the tables below, the Delta columns express the differences between Juniper and Kiwi releases in percentage. Any percentage +/-5% is not statistically is within the margin of error.  It is also noteworthy that Kiwi seems to invoke the GET /automated-patron-blocks 3 times instead of once https://folio-org.atlassian.net/browse/UICHKOUT-755.  This call averages 25ms under all loads, so if 2 of these 3 calls were not needed (why would the UI calls it thrice?) then the Kiwi's Checkout average response times could improve by another 50ms. 

Note: JP = Juniper build, KW = Kiwi build

 

Average

50th percentile 

Check-in JP

Check-in KW

Delta

Check-out JP

Check-out KW

Delta

Check-in JP

Check-in KW

Delta

Check-out JP

Check-out KW

Delta

1 user

0.944

0.838

11.23%

1.579

1.582

-0.19%

0.835

0.767

8.14%

1.411

1.464

-3.76%

5 users

0.811

0.73

9.99%

1.359

1.376

-1.25%

0.750

0.676

9.87%

1.23

1.272

-3.41%

8 users

0.889

0.758

14.74%

1.425

1.392

2.32%

0.785

0.674

14.14%

1.262

1.228

2.69%

20 users

1.386

0.899

35.14%

2.21

1.506

31.86%

1.172

0.731

37.63%

1.887

1.303

30.95%

 

75th percentile

95th percentile 

Check-in JP

Check-in KW

Delta

Check-out JP

Check-out KW

Delta

Check-in JP

Check-in KW

Delta

Check-out JP

Check-out KW

Delta

1 user

0.987

0.915

7.29%

1.604

1.716

-6.98%

1.533

1.214

20.81%

2.63

2.486

5.48%

5 users

0.867

0.769

11.30%

1.438

1.448

-0.70%

1.179

1.052

10.77%

2.111

1.448

31.41%

8 users

0.958

0.839

12.42%

1.481

1.44

2.77%

1.434

1.202

16.18%

2.343

2.343

0.00%

20 users

1.589

0.931

41.41%

2.5

1.535

38.60%

2.755

1.531

44.43%

4.369

2.504

42.69%

If compared Kiwi response times against the Juniper results documented in Check-in-check-out Test Report (Juniper), then Kiwi's times clearly are better than Juniper's. This is because the numbers in those tests were from the Juniper General release, when the GET /automated-patron-blocks were not optimized. HF3's modules are closer to Kiwi and therefore performs a lot better than Juniper GA. 

"Worst" API Comparisons

The APIs identified in the table below were the ones that took over 100ms to execute in Juniper.  In Kiwi these APIs perform much better, especially under greater load (8 and 20 users) where, for example, POST /checkout-by-barcode and POST /check-in-by-barcode's time are almost 50% better. GET /inventory/items also improved greatly due to implementation of the unique barcode index in the mod_inventory_storage.item table. 

Average Response Time in milliseconds. 

API

1 user  JP (75th %tile) 

1 user  KW (75th %tile)

5 users JP (75th %tile)

5 users KW (75th %tile)

8 users JP (75th %tile)

8 users KW (75th %tile)

20 users JP (75th %tile)

20 users KW (75th %tile)

API

1 user  JP (75th %tile) 

1 user  KW (75th %tile)

5 users JP (75th %tile)

5 users KW (75th %tile)

8 users JP (75th %tile)

8 users KW (75th %tile)

20 users JP (75th %tile)

20 users KW (75th %tile)

POST checkout-by-barcode

526

550

476

436

479

428

861

461

POST checkin-by-barcode 

457

444

402

368

489

438

852

504

GET circulation/loans

246

305

245

262

272

267

516

303

GET inventory/items

171

124

166

96

170

92

250

99

Longevity Test

Longevity test shows that the response time increased as time went on. 

 

Check In

Check Out

 

Check In

Check Out

1st Hour

0.821s

1.476s

16th Hour

1.129s

3.343s

25th Hour

0.999s

3.758s

In the response time graph below the Checkout Controller time, which gathers all chekc-out API response times), increased over a 16-hours window, from <2s to over 5s.

 

The DB CPU utilization percentage also increased over time

It's unclear what (DB connections) caused the DB to use more CPU resources as the test progressed, but looking at the logs there is one query that stood out as being slow:

SELECT jsonb,id FROM fs09000000_mod_circulation_storage.loan WHERE (lower(fs09000000_mod_circulation_storage.f_unaccent(loan.jsonb->>'userId')) LIKE lower(fs09000000_mod_circulation_storage.f_unaccent('<user UUID>'))) AND (lower(fs09000000_mod_circulation_storage.f_unaccent(loan.jsonb->'status'->>'name')) NOT LIKE lower(fs09000000_mod_circulation_storage.f_unaccent('Closed')))

This query at first took a little more than 100ms to run, but toward the end of the test it got to over 500ms on average and spiked to over a second. 

Doing an EXPLAIN ANALYZE on the query shows that it was not using indexes in the query plan despite having the indexes to look up loans by userId and loan status name. 

Gather (cost=1000.00..60093.80 rows=780 width=966) (actual time=192.948..200.814 rows=1 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on loan (cost=0.00..59015.80 rows=325 width=966) (actual time=145.247..191.854 rows=0 loops=3) Filter: ((lower(fs09000000_mod_circulation_storage.f_unaccent((jsonb ->> 'userId'::text))) ~~ '664e4002-9eb2-4c9e-addd-04f308a8062c'::text) AND (lower(fs09000000_mod_circulation_storage.f_unaccent(((jsonb -> 'status'::text) ->> 'name'::text))) !~~ 'closed'::text)) Rows Removed by Filter: 53759 Planning Time: 0.212 ms Execution Time: 200.834 ms

 

Due to the left() function (more details in  https://folio-org.atlassian.net/browse/CIRCSTORE-304), the indexes were not applied. After removing the left() function, the indexes were applied and when running another longevity test, this time the DB CPU utilization was constant and under 20%, as seen in the diagram below. 

 

Before and after fixing the left() function, there were spikes at half an hour interval, but these were due to background jobs processing the loans (age-to-lost, patron notices, etc..). As more loans got created during the test, the spikes got higher. 

 

Comments