The load-generator script is finalized, debugged, and verified to work correctly and meet expectations
The system under test is ready for performance testing
The environment's configuration (each module's CPU/memory allocation, task count, and version) is as expected
No other activity is running in the test environment - verify via AWS CloudWatch
No other activity is running in the test database - verify via the AWS RDS console and AWS CloudWatch
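A minimal sketch of how the two "no other activity" checks above could be automated with boto3 instead of eyeballing the CloudWatch console; the region, cluster, service, and database identifiers (perf-cluster, mod-example, perf-db) are placeholders, not values from this checklist:

```python
# Sketch: confirm the test environment and database are idle before a run.
# Cluster, service, and DB identifiers below are placeholders - adjust to your environment.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def recent_average(namespace, metric, dimensions, minutes=30):
    """Return the average of a CloudWatch metric over the last `minutes` minutes."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0

# Both the ECS service CPU and the RDS instance CPU should be near idle before the test.
ecs_cpu = recent_average(
    "AWS/ECS", "CPUUtilization",
    [{"Name": "ClusterName", "Value": "perf-cluster"},
     {"Name": "ServiceName", "Value": "mod-example"}],
)
rds_cpu = recent_average(
    "AWS/RDS", "CPUUtilization",
    [{"Name": "DBInstanceIdentifier", "Value": "perf-db"}],
)
print(f"ECS CPU avg (30 min): {ecs_cpu:.1f}%  RDS CPU avg (30 min): {rds_cpu:.1f}%")
if ecs_cpu > 5 or rds_cpu > 5:
    print("WARNING: the environment does not look idle - investigate before starting the test.")
```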
The environment is restarted - only the relevant modules need to be restarted, and only if their memory or CPU utilization is much higher than when they were first started up, or if an endurance test will be executed
If any module in the environment has been restarted, make sure that all ECS services are stable for at least 15 minutes
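A minimal sketch, assuming boto3 access and a placeholder cluster name, of how the ECS stability check (and the task-count/version verification from the configuration item above) could be scripted; run it repeatedly over the 15-minute window rather than once:

```python
# Sketch: verify that all ECS services in the cluster are stable (running == desired,
# single steady-state deployment). The cluster name is a placeholder.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
cluster = "perf-cluster"  # placeholder

service_arns = []
paginator = ecs.get_paginator("list_services")
for page in paginator.paginate(cluster=cluster):
    service_arns.extend(page["serviceArns"])

unstable = []
# describe_services accepts at most 10 services per call
for i in range(0, len(service_arns), 10):
    resp = ecs.describe_services(cluster=cluster, services=service_arns[i:i + 10])
    for svc in resp["services"]:
        stable = (
            svc["runningCount"] == svc["desiredCount"]
            and len(svc["deployments"]) == 1
        )
        if not stable:
            unstable.append(svc["serviceName"])
        # Printing the task definition revision also helps cross-check module versions.
        print(f'{svc["serviceName"]}: {svc["runningCount"]}/{svc["desiredCount"]} tasks, '
              f'{svc["taskDefinition"].split("/")[-1]}')

print("Unstable services:" if unstable else "All services look stable.", unstable or "")
```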
The performance testing framework is ready - check connections, permissions, and the tool set itself
The profiling tools are ready (only if required by the task):
Set up AWS CodeGuru
Enable profiling by redeploying the modules with the CodeGuru agent jar embedded
Ensure the task definition's profiling parameter is in an enabled state
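If CodeGuru Profiler is used, a quick way to double-check that the profiling group is enabled and actually receiving agent data is sketched below; the profiling group name is a placeholder, and the response fields are read defensively in case the account's setup differs:

```python
# Sketch: confirm the CodeGuru profiling group exists, is enabled, and has recently
# received agent data. The profiling group name is a placeholder.
import boto3

codeguru = boto3.client("codeguruprofiler", region_name="us-east-1")
group_name = "folio-perf-profiling-group"  # placeholder

group = codeguru.describe_profiling_group(profilingGroupName=group_name)["profilingGroup"]
enabled = group.get("agentOrchestrationConfig", {}).get("profilingEnabled")
last_report = group.get("profilingStatus", {}).get("latestAgentProfileReportedAt")

print(f"Profiling group: {group['name']}")
print(f"Profiling enabled: {enabled}")
print(f"Last agent report: {last_report}")
```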
The log analysis tool is ready (only if required by the task). If a module's log level needs to be adjusted (for example, from INFO to DEBUG), the module has to be redeployed with the correct setting
Jenkins job is ready - check job parameters
Run a smoke test to verify that there are no functional errors and that the environment has been set up successfully
The test start point is detected and recorded
During test
Keep an eye on environment metrics such as CPU and memory utilization; proactive action may be needed to restart a module or the whole environment if the metrics reach critical levels (service memory utilization > 120%, database freeable memory < 2000 MB, database CPU utilization constantly above 30%)
Keep an eye on database metrics such as CPU and memory utilization, deadlocks, top waits, etc.
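A sketch of how the critical thresholds above could be polled from CloudWatch during the run; the cluster, service, and database identifiers are placeholders, and FreeableMemory is assumed to be reported in bytes (as CloudWatch does for RDS):

```python
# Sketch: poll CloudWatch during the test and flag the critical thresholds listed above.
# Cluster, service, and DB identifiers are placeholders; stop the loop with Ctrl+C.
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def latest_value(namespace, metric, dimensions, stat="Average"):
    """Return the most recent datapoint of a metric over the last 10 minutes (or None)."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace, MetricName=metric, Dimensions=dimensions,
        StartTime=now - timedelta(minutes=10), EndTime=now,
        Period=60, Statistics=[stat],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1][stat] if points else None

ecs_dims = [{"Name": "ClusterName", "Value": "perf-cluster"},
            {"Name": "ServiceName", "Value": "mod-example"}]
rds_dims = [{"Name": "DBInstanceIdentifier", "Value": "perf-db"}]

while True:
    service_mem = latest_value("AWS/ECS", "MemoryUtilization", ecs_dims, stat="Maximum")
    db_free = latest_value("AWS/RDS", "FreeableMemory", rds_dims)  # reported in bytes
    db_cpu = latest_value("AWS/RDS", "CPUUtilization", rds_dims)

    if service_mem is not None and service_mem > 120:
        print(f"CRITICAL: service memory {service_mem:.0f}% > 120% - consider a restart")
    if db_free is not None and db_free / 1024 / 1024 < 2000:
        print(f"CRITICAL: DB freeable memory {db_free / 1024 / 1024:.0f} MB < 2000 MB")
    if db_cpu is not None and db_cpu > 30:
        print(f"WARNING: DB CPU {db_cpu:.0f}% constantly above 30% is a critical level")

    time.sleep(60)
```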
Capture any observations
Capture observations and test data in, for example, a Google Spreadsheet file
Preserve the raw data and record important details such as timestamps, in case we need to go back to the graphs and look them up later
Capture heap dumps (if necessary)
Capture profiling data (if necessary)
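If CodeGuru profiling data needs to be captured, one possible way to pull the aggregated profile for the test timeframe with boto3 is sketched below; the profiling group name and the time window are placeholders and should be replaced with the detected start and end points:

```python
# Sketch: pull the aggregated CodeGuru profile for the test timeframe and save it locally.
# Profiling group name and the time window are placeholders.
from datetime import datetime, timezone

import boto3

codeguru = boto3.client("codeguruprofiler", region_name="us-east-1")

resp = codeguru.get_profile(
    profilingGroupName="folio-perf-profiling-group",              # placeholder
    startTime=datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc),  # test start point
    endTime=datetime(2024, 1, 10, 13, 0, tzinfo=timezone.utc),    # test end point
    accept="application/json",
)

with open("codeguru-profile.json", "wb") as out:
    out.write(resp["profile"].read())  # 'profile' is a streaming body
print("Saved profile:", resp.get("contentType"))
```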
Post-test actions
The test end point is detected and recorded
Data collection - use the test timeframe (start and end points) to collect the following important pieces of data:
Response time: mean, median, 95th percentile - obtained from Grafana
Error count (based on the thresholds for failing an API call) - obtained from Grafana
Failure rate, %
TPS - transactions per second
Check CPU/Memory usage per service
Check module logs for error entries with CloudWatch Logs Insights (see the sketch after this list)
Check RDS Performance Insights for slow queries
Check the RDS error log for any errors or slow queries
If slow queries were detected, run EXPLAIN ANALYZE on them to see why they are slow (missing index or sequential scan?) - see the sketch after this list
CPU utilization for a particular module (if any abnormal behaviour is observed for any module)
Memory usage for a particular module (if any abnormal behaviour is observed for any module)
Update the timestamps in the Grafana URL so that we can go back and look at the graphs later
Collect data about the FOLIO version and/or specific module versions - this can be collected by the related Jenkins job
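The two diagnostic steps referenced in the list above (module log errors via Logs Insights, and EXPLAIN ANALYZE on slow queries) can be sketched roughly as follows; the log group name, database connection details, and the sample query are placeholders, and psycopg2 is just one possible client for the PostgreSQL part:

```python
# Sketch: (1) pull ERROR entries from a module's CloudWatch log group for the test
# timeframe, and (2) run EXPLAIN ANALYZE on a slow query found in RDS Performance Insights.
# Log group name, DB connection details, and the sample query are placeholders.
import time
from datetime import datetime, timezone

import boto3
import psycopg2

START = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # test start point
END = datetime(2024, 1, 10, 13, 0, tzinfo=timezone.utc)    # test end point

# --- 1. CloudWatch Logs Insights: error entries for a module ---
logs = boto3.client("logs", region_name="us-east-1")
query_id = logs.start_query(
    logGroupNames=["/ecs/mod-example"],  # placeholder log group
    startTime=int(START.timestamp()),
    endTime=int(END.timestamp()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100",
)["queryId"]

results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Scheduled", "Running"):
    time.sleep(2)
    results = logs.get_query_results(queryId=query_id)
print(f"Error log entries found: {len(results['results'])}")

# --- 2. EXPLAIN ANALYZE a slow query (missing index? sequential scan?) ---
slow_query = "SELECT * FROM some_schema.instance WHERE title LIKE '%history%'"  # placeholder
conn = psycopg2.connect(host="perf-db.example.com", dbname="folio",
                        user="folio_admin", password="...")                     # placeholders
with conn, conn.cursor() as cur:
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + slow_query)
    for (line,) in cur.fetchall():
        print(line)
conn.close()
```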
Fill in the report template with all required and helpful observations
Analyze the collected observations, graphs, and trends using statistical rules and/or performance-testing experience
The analysis includes understanding why the workflow is slow:
Is it an API call or an SQL query?
Then dig deeper into these areas to understand what makes them slow
If necessary, rerun a few tests to make sure the data or behaviour is consistent
Provide recommendations for improving the performance of the FOLIO-developed services