The load-generator script is finalized, debugged, and verified to work correctly and meet expectations
The system under test is ready for performance testing
The environment's configuration (each module's CPU/memory allocation, task count, and version) is as expected
No other activity is running in the test environment - verify via AWS CloudWatch
No other activity is running in the test database - verify via the AWS RDS console and AWS CloudWatch
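A minimal sketch of how the two "no other activity" checks above could be automated with boto3 instead of eyeballing the CloudWatch console; the region, cluster, service, and database identifiers (perf-cluster, mod-example, perf-db) are placeholders, not values from this checklist:

```python
# Sketch: confirm the test environment and database are idle before a run.
# Cluster, service, and DB identifiers below are placeholders - adjust to your environment.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def recent_average(namespace, metric, dimensions, minutes=30):
    """Return the average of a CloudWatch metric over the last `minutes` minutes."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

# Both the ECS service CPU and the RDS instance CPU should be near idle before the test.
ecs_cpu = recent_average(
    "AWS/ECS", "CPUUtilization",
    [{"Name": "ClusterName", "Value": "perf-cluster"},
     {"Name": "ServiceName", "Value": "mod-example"}],
)
rds_cpu = recent_average(
    "AWS/RDS", "CPUUtilization",
    [{"Name": "DBInstanceIdentifier", "Value": "perf-db"}],
)
print(f"ECS CPU avg (30 min): {ecs_cpu:.1f}%  RDS CPU avg (30 min): {rds_cpu:.1f}%")
if ecs_cpu > 5 or rds_cpu > 5:
    print("WARNING: the environment does not look idle - investigate before starting the test.")
```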
The environment is restarted - only the relevant modules need to be restarted, and only if their memory or CPU utilization is much higher than when they were first started up, or if an endurance test will be executed
If any module in the environment has been restarted, make sure that all ECS services are stable for at least 15 minutes
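A minimal sketch, assuming boto3 access and a placeholder cluster name, of how the ECS stability check (and the task-count/version verification from the configuration item above) could be scripted; run it repeatedly over the 15-minute window rather than once:

```python
# Sketch: verify that all ECS services in the cluster are stable (running == desired,
# single steady-state deployment). The cluster name is a placeholder.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
cluster = "perf-cluster"  # placeholder

service_arns = []
paginator = ecs.get_paginator("list_services")
for page in paginator.paginate(cluster=cluster):
    service_arns.extend(page["serviceArns"])

unstable = []
# describe_services accepts at most 10 services per call
for i in range(0, len(service_arns), 10):
    resp = ecs.describe_services(cluster=cluster, services=service_arns[i:i + 10])
    for svc in resp["services"]:
        stable = (
            svc["runningCount"] == svc["desiredCount"]
            and len(svc["deployments"]) == 1
        )
        if not stable:
            unstable.append(svc["serviceName"])
        # Printing the task definition revision also helps cross-check module versions.
        print(f'{svc["serviceName"]}: {svc["runningCount"]}/{svc["desiredCount"]} tasks, '
              f'{svc["taskDefinition"].split("/")[-1]}')

print("Unstable services:" if unstable else "All services look stable.", unstable or "")
```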
The performance testing framework is ready - check connections, permissions, and the tool set itself
The profiling tools are ready (only if required by the task):
Set up AWS CodeGuru
Enable profiling by redeploying the modules with the CodeGuru agent jar embedded
Ensure the task definition's profiling parameter is in an enabled state
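If CodeGuru Profiler is used, a quick way to double-check that the profiling group is enabled and actually receiving agent data is sketched below; the profiling group name is a placeholder, and the response fields are read defensively in case the account's setup differs:

```python
# Sketch: confirm the CodeGuru profiling group exists, is enabled, and has recently
# received agent data. The profiling group name is a placeholder.
import boto3

codeguru = boto3.client("codeguruprofiler", region_name="us-east-1")
group_name = "folio-perf-profiling-group"  # placeholder

group = codeguru.describe_profiling_group(profilingGroupName=group_name)["profilingGroup"]
enabled = group.get("agentOrchestrationConfig", {}).get("profilingEnabled")
last_report = group.get("profilingStatus", {}).get("latestAgentProfileReportedAt")

print(f"Profiling group: {group['name']}")
print(f"Profiling enabled: {enabled}")
print(f"Last agent report: {last_report}")
```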
The log analysis tool is ready (only if required by the task). If a module's log level needs to be adjusted (for example, from INFO to DEBUG), the module has to be redeployed with the correct setting
Jenkins job is ready - check job parameters
Run a smoke test to verify that there are no functional errors and that the environment has been set up successfully
The test start point is detected and recorded
During test
Keep an eye on environment metrics such as CPU and memory utilization; proactive action may be needed to restart a module or the whole environment if the metrics reach critical levels (service memory utilization > 120%, database freeable memory < 2000 MB, database CPU utilization constantly above 30%)
Keep an eye on database metrics such as CPU and memory utilization, deadlocks, top waits, etc.
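A sketch of how the critical thresholds above could be polled from CloudWatch during the run; the cluster, service, and database identifiers are placeholders, and FreeableMemory is assumed to be reported in bytes (as CloudWatch does for RDS):

```python
# Sketch: poll CloudWatch during the test and flag the critical thresholds listed above.
# Cluster, service, and DB identifiers are placeholders; stop the loop with Ctrl+C.
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def latest_value(namespace, metric, dimensions, stat="Average"):
    """Return the most recent datapoint of a metric over the last 10 minutes (or None)."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace, MetricName=metric, Dimensions=dimensions,
        StartTime=now - timedelta(minutes=10), EndTime=now,
        Period=60, Statistics=[stat],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1][stat] if points else None

ecs_dims = [{"Name": "ClusterName", "Value": "perf-cluster"},
            {"Name": "ServiceName", "Value": "mod-example"}]
rds_dims = [{"Name": "DBInstanceIdentifier", "Value": "perf-db"}]

while True:
    service_mem = latest_value("AWS/ECS", "MemoryUtilization", ecs_dims, stat="Maximum")
    db_free = latest_value("AWS/RDS", "FreeableMemory", rds_dims)  # reported in bytes
    db_cpu = latest_value("AWS/RDS", "CPUUtilization", rds_dims)

    if service_mem is not None and service_mem > 120:
        print(f"CRITICAL: service memory {service_mem:.0f}% > 120% - consider a restart")
    if db_free is not None and db_free / 1024 / 1024 < 2000:
        print(f"CRITICAL: DB freeable memory {db_free / 1024 / 1024:.0f} MB < 2000 MB")
    if db_cpu is not None and db_cpu > 30:
        print(f"WARNING: DB CPU {db_cpu:.0f}% constantly above 30% is a critical level")

    time.sleep(60)
```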
Capture any observations
Capture observations and test data in, for example, a Google Spreadsheet file
Preserve the raw data and record important details such as timestamps, in case we need to go back to the graphs and look them up later
Capture heap dumps (if necessary)
Capture profiling data (if necessary)
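If CodeGuru profiling data needs to be captured, one possible way to pull the aggregated profile for the test timeframe with boto3 is sketched below; the profiling group name and the time window are placeholders and should be replaced with the detected start and end points:

```python
# Sketch: pull the aggregated CodeGuru profile for the test timeframe and save it locally.
# Profiling group name and the time window are placeholders.
from datetime import datetime, timezone

import boto3

codeguru = boto3.client("codeguruprofiler", region_name="us-east-1")

resp = codeguru.get_profile(
    profilingGroupName="folio-perf-profiling-group",              # placeholder
    startTime=datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc),  # test start point
    endTime=datetime(2024, 1, 10, 13, 0, tzinfo=timezone.utc),    # test end point
    accept="application/json",
)

with open("codeguru-profile.json", "wb") as out:
    out.write(resp["profile"].read())  # 'profile' is a streaming body
print("Saved profile:", resp.get("contentType"))
```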
Post-test actions
The test end point is detected and recorded
Data collection - use the test timeframe (start and end points) to collect the following important pieces of data:
Response time: mean, median, 95th percentile - obtained from Grafana
Error count (based on the thresholds for failing an API call) - obtained from Grafana
Failure rate, %
TPS - transactions per second
Check CPU/Memory usage per service
Check module logs for error entries with CloudWatch Logs Insights (see the sketch after this list)
Check RDS Performance Insights for slow queries
Check the RDS error log for any errors or slow queries
If slow queries were detected, run EXPLAIN ANALYZE on them to see why they are slow (missing index or sequential scan?) - see the sketch after this list
CPU utilization for a particular module (if any abnormal behaviour is observed for any module)
Memory usage for a particular module (if any abnormal behaviour is observed for any module)
Update the timestamps in the Grafana URL so that we can go back and look at the graphs later
Collect data about the FOLIO version and/or specific module versions - this can be collected by the related Jenkins job
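The two diagnostic steps referenced in the list above (module log errors via Logs Insights, and EXPLAIN ANALYZE on slow queries) can be sketched roughly as follows; the log group name, database connection details, and the sample query are placeholders, and psycopg2 is just one possible client for the PostgreSQL part:

```python
# Sketch: (1) pull ERROR entries from a module's CloudWatch log group for the test
# timeframe, and (2) run EXPLAIN ANALYZE on a slow query found in RDS Performance Insights.
# Log group name, DB connection details, and the sample query are placeholders.
import time
from datetime import datetime, timezone

import boto3
import psycopg2

START = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # test start point
END = datetime(2024, 1, 10, 13, 0, tzinfo=timezone.utc)    # test end point

# --- 1. CloudWatch Logs Insights: error entries for a module ---
logs = boto3.client("logs", region_name="us-east-1")
query_id = logs.start_query(
    logGroupNames=["/ecs/mod-example"],  # placeholder log group
    startTime=int(START.timestamp()),
    endTime=int(END.timestamp()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100",
)["queryId"]

results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Scheduled", "Running"):
    time.sleep(2)
    results = logs.get_query_results(queryId=query_id)
print(f"Error log entries found: {len(results['results'])}")

# --- 2. EXPLAIN ANALYZE a slow query (missing index? sequential scan?) ---
slow_query = "SELECT * FROM some_schema.instance WHERE title LIKE '%history%'"  # placeholder
conn = psycopg2.connect(host="perf-db.example.com", dbname="folio",
                        user="folio_admin", password="...")                     # placeholders
with conn, conn.cursor() as cur:
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + slow_query)
    for (line,) in cur.fetchall():
        print(line)
conn.close()
```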
Fill in the report template with all required and helpful observations
Analyze the collected observations, graphs, and trends using statistical rules and/or performance-testing experience
The analysis includes understanding why the workflow is slow:
Is it an API call or an SQL query?
Then dig deeper into these areas to understand what makes them slow
If necessary, rerun a few tests to make sure the data or behaviour is consistent
Provide recommendations for improving the performance of the FOLIO-developed services