[FOLIO-1744] SPIKE: establish expected response time for performance benchmarks in Jenkins Created: 25/Jan/19  Updated: 03/Jun/20  Resolved: 11/Feb/19

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P3
Reporter: Jakub Skoczen Assignee: Eric Valuk
Resolution: Done Votes: 0
Labels: platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: PNG File Screenshot (164).png, PNG File Screenshot (165).png
Issue links:
Blocks
blocks MODINVSTOR-237 Performance: GET 200 - Get updated mo... Closed
Relates
relates to FOLIO-1771 clean up and isolate tests Closed
Sprint: Core: Platform - Sprint 56
Story Points: 2
Development Team: Core: Platform

 Description   

We need to establish the expected (acceptable) response time values for the API benchmarks under Jenkins:

https://jenkins-aws.indexdata.com/job/Automation/job/folio-perf-test/performance/

This will likely depend on a number of variables:

  • the size of the machine (cpu/mem)
  • the traffic (number of "virtual" users, depends on the test configuration)

It would be preferable if the "expected" response times were somewhere around 2s, which is the acceptance criterion in the DoD.



 Comments   
Comment by Jakub Skoczen [ 25/Jan/19 ]

Hongwei Ji Eric Valuk One way to try to establish a baseline is to choose a very lightweight API call that translates almost directly into a simple SQL select, e.g. /instance-storage/id, and see how many concurrent clients the machine can sustain, for this particular machine configuration, while producing average responses around 2s. If we can't get response times of 2s even for such a simple query, we should limit the number of virtual users and/or the amount of data. We should also measure the HTTP/code overhead (executing the raw SQL in Postgres vs. an API call). Does that make sense, or did you have something different in mind?

Comment by Julian Ladisch [ 28/Jan/19 ]

In our Definition of Done https://folio-org.atlassian.net/wiki/display/FOLIJET/Core+Platform+-+Definition+of+Done we have "All end user interactions < 2 seconds for 95 percentile". To achieve end user interaction < 2 seconds the back-end must be somewhat faster to allow for browser processing.
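The DoD target is a percentile, not an average, so benchmark results need to be evaluated accordingly. A minimal sketch of checking a run against the 2-second 95th-percentile criterion (the sample latencies below are made up for illustration):

```python
import math

def p95(samples_ms):
    """Return the 95th-percentile latency using the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical response times (ms) collected from one benchmark run
latencies = [120, 250, 300, 90, 1800, 400, 2100, 150, 600, 95,
             310, 220, 480, 130, 175, 260, 340, 110, 95, 205]

threshold_ms = 2000
print(p95(latencies), p95(latencies) < threshold_ms)  # 1800 True
```

Note that a run can pass on average yet fail the percentile check, which is why the average alone is not a sufficient benchmark figure.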

To check in a batch of books, the back-end response time of checking in a single book must be within 250 ms: https://folio-org.atlassian.net/browse/CIRC-144
A check in of a single book requires several other API calls.
Therefore the expected response time for a primary key lookup like /instance-storage/{id} cannot be around 2000 ms but should be around 50 ms.
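The budget argument above amounts to splitting an end-to-end allowance across sequential back-end calls. A small sketch of that arithmetic (the number of calls per check-in is an assumption for illustration, not taken from CIRC-144):

```python
def per_call_budget_ms(interaction_budget_ms, calls_per_interaction):
    """Evenly split an end-to-end latency budget across sequential back-end calls."""
    return interaction_budget_ms / calls_per_interaction

# A single check-in must finish within 250 ms and, hypothetically,
# fans out into 5 sequential storage calls:
print(per_call_budget_ms(250, 5))  # 50.0 ms per primary-key lookup
```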

Comment by Eric Valuk [ 30/Jan/19 ]

Right now we do performance testing nightly that gives us a very large number for instance-storage/instances/{id}. I looked into how this test works. It essentially runs 125 threads within 1 second. This seems not that useful, as it probably pushes the system into severe overload. I suppose eventually we need to be able to handle 125 simultaneous requests, but we need to prepare for that with the correct hardware and balancing.

I propose a stepped approach where threads are brought on gradually and then continue issuing requests indefinitely; that way you can easily see when the number of users pushes the response time above the approved threshold. This will give us the benchmark we need to say x users on y machine gives z response time.

Let me know what you think.
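The stepped approach can be sketched as: measure the average response time at each thread-count step and report the first step that breaches the threshold. A minimal illustration (the thread counts and timings below are hypothetical, not measured results):

```python
THRESHOLD_MS = 2000  # the 2 s limit from the DoD

def first_breaching_step(steps):
    """steps: list of (thread_count, avg_response_ms), ordered by thread count.
    Return the first thread count whose average exceeds the threshold,
    or None if every step stays under it."""
    for threads, avg_ms in steps:
        if avg_ms > THRESHOLD_MS:
            return threads
    return None

# Hypothetical measurements from a stepped run, 25 threads added per step
measured = [(25, 180), (50, 310), (75, 640), (100, 1400), (125, 2600)]
print(first_breaching_step(measured))  # 125
```

With data in this shape, "x users on y machine gives z response time" falls out directly: the last step before the breach is the sustainable load for that machine.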

Comment by Hongwei Ji [ 30/Jan/19 ]

Eric Valuk, are you referring to the values "125, 5, 1" in this file https://github.com/folio-org/folio-perf-test/blob/master/Folio-Test-Plans/mod-inventory-storage/instance-storage/instance-storage.csv? My understanding is that it ramps up 125 users within 5 seconds. The 1 means each user runs the test only once, with no repeat.

Comment by Jakub Skoczen [ 31/Jan/19 ]

Eric Valuk Hongwei Ji I can't comment on how the tests actually ramp up the load; you will need to investigate that or ask Varun Javalkar for details. But I agree with the general direction Eric Valuk proposed: tweaking the ramp-up value until we get a reasonably loaded system and response times that stay below 2s on average. I think we should pay attention to the error rate and let this particular test run longer, to see whether requests are queuing up and whether there are any resource leaks. It would be great if we understood what test parameters give us the expected behavior.

Of course, for more sophisticated APIs the baseline will be different – let's see by how much. I wouldn't be surprised if the throughput were even 10-20x less for BL calls.

Comment by Hongwei Ji [ 31/Jan/19 ]

BTW Jakub Skoczen and Eric Valuk, I did a test that just queries for an instance by id; I could bump the users up to 1000 with a 5-second ramp-up time and the response times still seemed OK, under 1 second. The env is the same perf test env. So my take is that if the env is under stress from other slow requests, it impacts the response times of supposedly quick requests as well. So it seems more important to identify those slow requests and improve them, so the overall response time will be better.

Comment by Varun Javalkar [ 31/Jan/19 ]

Eric Valuk Hongwei Ji As per https://github.com/folio-org/folio-perf-test/blob/master/Folio-Test-Plans/mod-inventory-storage/instance-storage/instance-storage.csv, 125 threads ramp up in 5 seconds, which means 125/5 = 25 threads start each second; the remaining threads ramp up over the remaining 4 seconds.
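Reading the three CSV values the way Hongwei and Varun describe — users, ramp-up seconds, loop count — the arithmetic can be written out as a small sketch (the column meanings are their interpretation of the file, assumed here):

```python
def parse_thread_config(csv_line):
    """Parse a 'users, ramp-up seconds, loops' line from the test-plan CSV."""
    users, ramp_up_s, loops = (int(v.strip()) for v in csv_line.split(","))
    return {
        "users": users,
        "ramp_up_s": ramp_up_s,
        "loops": loops,
        "threads_per_second": users / ramp_up_s,
    }

cfg = parse_thread_config("125, 5, 1")
print(cfg["threads_per_second"])  # 25.0 threads started per second
```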

Comment by Eric Valuk [ 31/Jan/19 ]

After discussion with Hongwei and Jakub, and then some research, I think I found a solution for this specific user story.

JMeter normally isn't sophisticated enough to do complex step functions for threads, but I found a plugin that can:
https://jmeter-plugins.org/wiki/UltimateThreadGroup/#Ultimate-Thread-Group

I set up something like this with the expected users (first screenshot), and it generated the graph in the second screenshot. Note that the graph is just one I whipped up quickly; we can make it more useful when needed.

(I ended the test somewhat prematurely because I was worried I was starting to overly stress the server I was working on.)

I can do whatever I want with the threads: start time, ramp-up, hold, and ramp-down times, as needed.

The downside is that the plugin needs to be installed on whatever JMeter instance is running the test. This may be annoying for the headless install in the CI environment.

Comment by Ann-Marie Breaux (Inactive) [ 04/Feb/19 ]

Hi Jakub Skoczen This ended up in our manual testing queue. I don't think we can actually test this! The best way to have us ignore it is to either mark the status as In Code Review, or add someone specific as the tester assignee. I'm not sure who that should be, so I didn't update this issue. Thank you!

Comment by Hongwei Ji [ 04/Feb/19 ]

Ann-Marie Breaux, I changed it to "In Code Review".

Comment by Ann-Marie Breaux (Inactive) [ 04/Feb/19 ]

Many thanks Hongwei Ji

Comment by Eric Valuk [ 11/Feb/19 ]

I made progress after the demo and was able to get some good results. The end result after running the benchmark against an actual performance environment is:

400 users for 10 minutes
932 ms avg, min 5 ms, max 6389 ms

I have committed both the 10 minute test as well as the first trial test for future use. I will close this issue as complete.

Generated at Thu Feb 08 23:15:35 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.