Performance Test Results Analysis and Automation Roadmap
Overview
Analyzing and reporting performance test results has been one of the slower parts of performance testing. The PTF has analyzed the process and brainstormed ideas for automating it to aid human interpretation of results. This document captures those ideas and describes a roadmap for automating the analysis.
Stages of Performance Analysis
1. Gathering Test Results
Gathering test results is the first stage of analyzing test results. While the current CloudWatch Performance Dashboard gives a good at-a-glance view of many of the systems' metrics, it does not offer a complete picture. Performance analysts still need to go to the various AWS service consoles to get more detailed metrics or insights. For example, RDS Performance Insights offers slow-query and AAS graphs that are not readily available in the Performance Dashboard, and neither are OpenSearch indexing rates and latencies or MSK brokers' CPU and disk usage metrics.
Automate:
Create a report portal such that, after entering the test's timestamps, all relevant information from AWS services is brought back on an HTML page. Alternatively, enhance the Performance Dashboard to include the following items (a data-gathering sketch follows the list):
MSK: Brokers' CPU utilization and disk usage stats
OpenSearch: Data nodes' and master nodes' CPU utilization stats, indexing rates, indexing latencies, search rates, and search latencies.
RDS: AAS and SQL graphs; top 10 slow queries with execution times and rates.
MSK: Kafka message lags
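As a first automation step, most of these series can be pulled with a single CloudWatch GetMetricData call once the test window is known. The sketch below is a minimal Python/boto3 example; the cluster, domain, account, and instance identifiers are placeholders, and the metric and dimension names should be verified against each service's CloudWatch namespace before use.

```python
# Minimal sketch: pull the dashboard-gap metrics for one test window.
# Assumptions: boto3 credentials/region are configured; "perf-msk-cluster",
# "perf-domain", "123456789012", and "perf-db" are placeholders.
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch")

def metric_query(qid, namespace, metric, dimensions, stat="Average", period=60):
    """Build one GetMetricData query entry."""
    return {
        "Id": qid,
        "MetricStat": {
            "Metric": {"Namespace": namespace, "MetricName": metric,
                       "Dimensions": dimensions},
            "Period": period,
            "Stat": stat,
        },
    }

def gather_test_metrics(start, end):
    """Fetch MSK, OpenSearch, and RDS series for the test window in one call."""
    queries = [
        # MSK: broker CPU and data-log disk usage
        metric_query("msk_cpu", "AWS/Kafka", "CpuUser",
                     [{"Name": "Cluster Name", "Value": "perf-msk-cluster"}]),
        metric_query("msk_disk", "AWS/Kafka", "KafkaDataLogsDiskUsed",
                     [{"Name": "Cluster Name", "Value": "perf-msk-cluster"}]),
        # OpenSearch: indexing rate (add latency/search metrics the same way)
        metric_query("os_idx_rate", "AWS/ES", "IndexingRate",
                     [{"Name": "DomainName", "Value": "perf-domain"},
                      {"Name": "ClientId", "Value": "123456789012"}]),
        # RDS: average active sessions (AAS) as published to CloudWatch
        metric_query("rds_aas", "AWS/RDS", "DBLoad",
                     [{"Name": "DBInstanceIdentifier", "Value": "perf-db"}]),
    ]
    resp = cloudwatch.get_metric_data(MetricDataQueries=queries,
                                      StartTime=start, EndTime=end)
    return {r["Id"]: list(zip(r["Timestamps"], r["Values"]))
            for r in resp["MetricDataResults"]}

if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # example test start
    end = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)    # example test end
    for series, points in gather_test_metrics(start, end).items():
        print(series, points[:3])
```

The same per-query pattern extends to OpenSearch latencies, search rates, master-node CPU, and MSK message lag; the results can be rendered on the portal page or added back to the Performance Dashboard.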
2. Automated Detection of Running Modules
Currently, it is cumbersome to detect and select the series for the modules that ran during a test. There are nearly 100 modules/series to select, and when multiple workflows run at the same time, a large number of modules can participate. AWS does not make it easy or fast to select and unselect series in a graph.
Remedy:
Find a way to automatically detect the modules that ran during the test, or within a test's timeframe. Optionally, use machine learning to learn the behavior of a series (or a group of series) and identify the series/modules that were spiking within the test's timeframe. Then pre-select the identified series in the dashboard so that this does not have to be done manually.
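One low-effort stand-in for the ML idea is a purely statistical check: compare each module's CPU series during the test window against an equal-length baseline window just before it, and keep the modules whose load clearly rose. The sketch below assumes module CPU is published per ECS service under the AWS/ECS namespace with a ServiceName dimension; adjust the namespace and dimension to however module CPU is actually emitted.

```python
# Minimal sketch of spike-based module detection (a statistical stand-in for ML).
# Assumptions: module CPU lives in the AWS/ECS namespace keyed by ServiceName;
# a module "participated" if its mean CPU during the test is well above its
# mean CPU in the equal-length window immediately before the test.
import statistics
import boto3

cloudwatch = boto3.client("cloudwatch")

def modules_active_during(start, end, namespace="AWS/ECS",
                          metric="CPUUtilization", dim_name="ServiceName",
                          ratio_threshold=2.0):
    """Return module names whose CPU rose during [start, end)."""
    baseline_start = start - (end - start)  # equal-length window before the test
    active = []
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace=namespace, MetricName=metric):
        for m in page["Metrics"]:
            dims = {d["Name"]: d["Value"] for d in m["Dimensions"]}
            if dim_name not in dims:
                continue

            def mean_cpu(t0, t1):
                resp = cloudwatch.get_metric_data(
                    MetricDataQueries=[{"Id": "cpu", "MetricStat": {
                        "Metric": m, "Period": 60, "Stat": "Average"}}],
                    StartTime=t0, EndTime=t1)
                vals = resp["MetricDataResults"][0]["Values"]
                return statistics.mean(vals) if vals else 0.0

            baseline = mean_cpu(baseline_start, start)
            during = mean_cpu(start, end)
            # Treat near-idle baselines as 1% so idle noise is not flagged.
            if during > ratio_threshold * max(baseline, 1.0):
                active.append(dims[dim_name])
    return sorted(set(active))
```

The returned list can drive both the dashboard re-selection and the log-query form in stage 3, and it can later be replaced by a trained model without changing the callers.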
3. Log Analysis Optimization
A. Errors and Warnings
Querying CloudWatch Logs for various modules' errors and warnings can be tedious, and each query can take minutes to complete. In a complicated workflow like Data Import there are several modules to query and analyze.
Remedy:
Create a one-stop portal where, after entering a test's timestamps, a summary of errors and warnings for the relevant modules is displayed on a single page. This requires knowing which modules' logs to query. There are two ways to approach this:
Manually enter the relevant module names into a form
Use ML to interpret spikes in the services' CPU utilization graphs during the test timeframe, identify the modules that participated in the test, and enter those module names into the form automatically.
Note that rather than getting the desired log entries back line by line, it is better to use the CloudWatch Logs Insights API to get aggregate results. These aggregates return the number or percentage of matching log entries within the search timeframe, which makes the logs much faster to interpret.
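A minimal sketch of the aggregate approach using Logs Insights is below; the log group names and the ERROR/WARN patterns are assumptions and should be adapted to the modules' actual logging conventions.

```python
# Minimal sketch: one Logs Insights query that counts ERROR/WARN lines per
# log group instead of returning every matching line.
# Assumption: the relevant module log group names are supplied by the caller.
import time
import boto3

logs = boto3.client("logs")

SUMMARY_QUERY = """
filter @message like /ERROR/ or @message like /WARN/
| stats count(*) as hits by @log
| sort hits desc
"""

def error_warning_summary(log_groups, start_epoch, end_epoch):
    """Return (log group, hit count) rows for the test window."""
    qid = logs.start_query(logGroupNames=log_groups, startTime=start_epoch,
                           endTime=end_epoch, queryString=SUMMARY_QUERY)["queryId"]
    while True:
        result = logs.get_query_results(queryId=qid)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(2)  # poll until the query finishes
    return [{f["field"]: f["value"] for f in row}
            for row in result.get("results", [])]
```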
B. Level 2: Logs for API
At times we need to know what a module is doing. To find this out, we query either the module's nginx logs or the module's own log to see which APIs it was servicing. The next step in log analysis is to create a page that queries such logs and prints out the top 5 API calls that took place during the test timeframe.
Nginx logs are the cleanest source for API calls because that is all they log. However, nginx logs may go away in the future, and many modules may not have them, so relying on nginx logs alone is not dependable.
A more reliable way is to find POST and GET messages in the module logs themselves; a query sketch follows.
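The sketch below assumes request lines appear in the nginx or module logs as "<METHOD> <path> ..."; the regex should be adjusted to the actual log format, and the run-and-poll plumbing is the same as in the errors/warnings sketch above.

```python
# Minimal sketch: top 5 API calls in the test window.
# Assumption: request lines look like "<METHOD> <path> ..." in the logs.
import time
import boto3

logs = boto3.client("logs")

TOP_API_QUERY = """
parse @message /(?<method>GET|POST|PUT|DELETE|PATCH) (?<path>\\/[^ ?"]*)/
| filter ispresent(path)
| stats count(*) as calls by method, path
| sort calls desc
| limit 5
"""

def top_api_calls(log_groups, start_epoch, end_epoch):
    """Return the 5 most frequent method/path pairs in the test window."""
    qid = logs.start_query(logGroupNames=log_groups, startTime=start_epoch,
                           endTime=end_epoch, queryString=TOP_API_QUERY)["queryId"]
    while True:
        result = logs.get_query_results(queryId=qid)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(2)  # poll until Logs Insights finishes
    return [{f["field"]: f["value"] for f in row}
            for row in result.get("results", [])]
```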
C. Level 2: Logs for Rates
At times we need to know how many times certain API calls were invoked, or the sidecars' ingress and egress rates. Create a page that calculates the rate of a particular API log entry, or of a log pattern, so that we can quickly determine how often an API is being invoked.
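The rate page can reuse the same query runner; only the query changes. A minimal sketch, assuming the pattern of interest is a literal API path or log fragment supplied by the analyst:

```python
# Minimal sketch: bucket matches of a log pattern per minute, which doubles as
# the invocation rate. Assumption: `pattern` is a literal string, not a regex
# containing unescaped slashes.
def rate_query(pattern: str) -> str:
    """Build a Logs Insights query that counts matches of `pattern` per minute."""
    return f"""
filter @message like /{pattern}/
| stats count(*) as calls_per_minute by bin(1m)
"""
```

The same start_query/get_query_results polling as in the sketches above can execute it; dividing calls_per_minute by 60 gives calls per second where that unit is preferred.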
4. Reporting Optimization
Creating a performance test report currently takes time because of gathering the data and presenting it. Even with the data already gathered, it still takes time to assemble a report. Here are ways the report creation step may be optimized.
A. Automatically create a report page with charts and graphs already filled in.
This time-saving step may be done by automatically capturing images of the graphs on the Performance Dashboard page and POSTing them to a Confluence page using the Confluence APIs. All that is left for the performance analyst is to fill in the text that interprets the data.
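One way to sketch this, assuming a service account with write access to the target Confluence page: render each graph server-side with CloudWatch's GetMetricWidgetImage and upload the resulting PNG as a page attachment through the Confluence REST API. The base URL, page ID, credentials, and widget definition below are placeholders.

```python
# Minimal sketch: render one dashboard graph and attach it to the report page.
# Assumptions: CONFLUENCE_BASE, PAGE_ID, and AUTH are placeholders; the widget
# JSON mirrors one Performance Dashboard graph and would be built per graph.
import json
import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

CONFLUENCE_BASE = "https://confluence.example.com"   # placeholder
PAGE_ID = "123456"                                   # placeholder report page
AUTH = ("svc-perf-report", "api-token-placeholder")  # placeholder credentials

def render_widget_png(widget: dict) -> bytes:
    """Render one graph server-side via GetMetricWidgetImage."""
    return cloudwatch.get_metric_widget_image(
        MetricWidget=json.dumps(widget))["MetricWidgetImage"]

def attach_to_confluence(filename: str, png: bytes) -> None:
    """Upload the rendered graph as an attachment on the report page."""
    resp = requests.post(
        f"{CONFLUENCE_BASE}/rest/api/content/{PAGE_ID}/child/attachment",
        headers={"X-Atlassian-Token": "no-check"},
        files={"file": (filename, png, "image/png")},
        auth=AUTH,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    widget = {   # example: RDS AAS over the test window
        "metrics": [["AWS/RDS", "DBLoad", "DBInstanceIdentifier", "perf-db"]],
        "start": "2024-01-01T12:00:00Z",
        "end": "2024-01-01T14:00:00Z",
        "period": 60,
        "title": "RDS Average Active Sessions",
    }
    attach_to_confluence("rds_aas.png", render_widget_png(widget))
```

The attached images still need to be referenced from the page body (for example via the image macro in the page's storage format), which the same script can do with a follow-up content-update call.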