Stress Test Strategy

Overview

Stress testing checks the application's stability under high load and reveals its weaknesses. For FOLIO, it can show which workflows (and/or modules) perform poorly and how one workflow affects another. Typically, during stress testing we increase the load on a number of workflows so that the modules' CPU and/or memory utilization is stretched beyond 100% capacity. This document analyzes approaches for introducing stress into the system. The PTF team created a JMeter master script that contains more than 40 workflows. Should certain workflows be chosen to receive higher loads, or should the load on all workflows be increased incrementally? These are the questions that the PTF team needs to explore and answer before starting the stress test.

Criteria

Some questions to think about when designing the stress test:

  • What is the breaking point or breaking points?
    • Does one workflow fail, or do multiple workflows fail?
    • Does one workflow start and continue to perform poorly (e.g., CICO response times of more than 10s), or do multiple workflows perform poorly?
  • Is it enough to keep adding load to the workflow(s) (TBD) until some "breaking point" is reached, or is it enough to push the services' average CPU (and/or memory) utilization above 100% and watch how the system behaves? Likewise for the database's CPU utilization.
  • If choosing some particular workflows to have more stress than others, why are they chosen? Are there any criteria for choosing one over another? Or should we work backward by choosing the modules to be stressed and then see which workflows use these modules?
    • Which workflows?
  • What should be observed during and after the stress period? Should we pay attention to how the system recovers once the stress period is over? If so, the test needs to continue past the high-stress period.
  • Should we perform some service failover? Which one?
  • How long should the test run? Will 2-3 hours be enough?
  • How many tests with increasing load need to be performed and what load should be reached to assume that the system is stable enough?
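
To make the "breaking point" questions above more concrete, here is a minimal sketch of how per-step results could be scanned for the first load step at which each workflow degrades. The 10-second threshold comes from the CICO example above; the 5% error-rate limit, the `StepResult` structure, the "Search" workflow name, and the sample numbers are all illustrative assumptions, not PTF decisions or measured data.

```python
from dataclasses import dataclass

# Threshold taken from the "CICO response times of more than 10s" example.
SLOW_THRESHOLD_S = 10.0


@dataclass
class StepResult:
    """Summary of one workflow at one load step (hypothetical structure)."""
    step: int              # 1-based index of the load step
    workflow: str          # workflow name, e.g. "CICO"
    avg_response_s: float  # average response time in seconds
    error_rate: float      # fraction of failed requests, 0.0 to 1.0


def first_breaking_steps(results, slow_threshold_s=SLOW_THRESHOLD_S,
                         max_error_rate=0.05):
    """Return {workflow: first step at which it was too slow or failing}.

    max_error_rate=0.05 is an assumed limit, not an agreed criterion.
    """
    breaking = {}
    for r in sorted(results, key=lambda r: r.step):
        degraded = (r.avg_response_s > slow_threshold_s
                    or r.error_rate > max_error_rate)
        if degraded and r.workflow not in breaking:
            breaking[r.workflow] = r.step
    return breaking


# Made-up numbers for illustration only.
results = [
    StepResult(1, "CICO", 0.9, 0.0),
    StepResult(1, "Search", 1.2, 0.0),
    StepResult(2, "CICO", 4.5, 0.01),
    StepResult(2, "Search", 1.4, 0.0),
    StepResult(3, "CICO", 12.3, 0.02),
    StepResult(3, "Search", 2.0, 0.0),
]
print(first_breaking_steps(results))  # {'CICO': 3}
```

A report like this would show whether one workflow breaks alone or several break at the same step, which is the distinction the first question asks about.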

Stress-Loading Approaches

  1. "Capacity Style"
    1. Gradually and uniformly increase the load on every workflow, as we do in capacity testing, until some workflows break or the system becomes non-responsive.
      1. Pros
      2. Cons
        1. Do we miss the opportunity to specifically stress certain workflows, since all are being stressed at the same rate?
  2. Predetermined Workflows:
    1. Focus on several workflows that we know perform poorly and stress test them. We would start with a set of loads, then iteratively add more load to these workflows until some of them break or the system becomes non-responsive.
      1. Pros
        1. Straightforward: we know which ones to target.
      2. Cons
        1. Workflows that we do not choose because they are working fine now may rear their ugly heads under high stress. We won't know about them because we never stress-tested them.
  3. Hybrid of Capacity and Predetermined Workflows:
    1. Predetermine certain workflows that are deemed to perform poorly under stress, then ramp up until the target load is reached. Repeat this step with more load until the system becomes non-responsive or certain workflows break down.
      1. Pros
      2. Cons
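
The three approaches above differ mainly in which workflows receive extra load at each step. The sketch below expresses each one as a function that produces a per-step load plan (virtual users per workflow). The workflow names, base loads, and increments are hypothetical, and the `hybrid` function is only one possible reading of the hybrid approach (all workflows grow, the predetermined ones grow faster).

```python
def capacity_style(base_load, increment, steps):
    """Every workflow receives the same additional load at every step."""
    return [
        {wf: users + increment * i for wf, users in base_load.items()}
        for i in range(steps)
    ]


def predetermined(base_load, targets, increment, steps):
    """Only the chosen workflows grow; the rest stay at their base load."""
    return [
        {wf: users + (increment * i if wf in targets else 0)
         for wf, users in base_load.items()}
        for i in range(steps)
    ]


def hybrid(base_load, targets, increment, target_increment, steps):
    """Assumed reading: all workflows grow, chosen ones grow faster."""
    return [
        {wf: users + (target_increment if wf in targets else increment) * i
         for wf, users in base_load.items()}
        for i in range(steps)
    ]


# Hypothetical starting point: virtual users per workflow.
base = {"CICO": 10, "Search": 5, "DataImport": 1}

for name, plan in [
    ("capacity", capacity_style(base, increment=2, steps=3)),
    ("predetermined", predetermined(base, {"CICO"}, increment=5, steps=3)),
    ("hybrid", hybrid(base, {"CICO"}, increment=2, target_increment=5, steps=3)),
]:
    print(name, plan)
```

Writing the plans out this way makes the trade-off visible: the capacity-style plan never singles out a workflow, while the predetermined plan never touches the workflows outside `targets`.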

Prep Work

  1. In the JMeter script, do we need to remove the throughput controller and use virtual users instead?
    1. Let's try both ways and decide which one is better. (We already have data from normal load tests, so we can compare proportions with and without the throughput controller.)
  2. Is any special data preparation needed for certain workflows?
    1. We can try using the data from the longevity test; it should be enough.
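
If the throughput controller is replaced with per-workflow virtual users, the existing workflow proportions need to be translated into thread counts. The sketch below assumes each workflow currently has a percentage weight and splits a total number of virtual users proportionally, using largest-remainder rounding so the counts add up exactly. The percentages, workflow names, and total are hypothetical, and equal thread shares do not guarantee equal request shares, since workflows differ in duration.

```python
def threads_from_percentages(percentages, total_threads):
    """Split total_threads across workflows in proportion to their weights."""
    total_pct = sum(percentages.values())
    exact = {wf: total_threads * pct / total_pct
             for wf, pct in percentages.items()}
    # Start from the rounded-down share of each workflow.
    alloc = {wf: int(share) for wf, share in exact.items()}
    leftover = total_threads - sum(alloc.values())
    # Give the remaining threads to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda wf: exact[wf] - alloc[wf],
                          reverse=True)
    for wf in by_remainder[:leftover]:
        alloc[wf] += 1
    return alloc


# Hypothetical weights and total virtual users.
weights = {"CICO": 50, "Search": 30, "DataImport": 20}
print(threads_from_percentages(weights, 12))
# {'CICO': 6, 'Search': 4, 'DataImport': 2}
```

Note that a workflow with a very small weight can end up with zero threads at low totals, which is one thing to check when comparing runs with and without the throughput controller.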

Stress-Loading Approach Final Decision

Taking into account all of the considerations above, this section will explain the final stress-loading approach chosen and the rationale for choosing it.