Transferring Performance Testing to FOLIO Teams
Overview
One of the objectives of the Performance Task Force (PTF) is to enable teams in the FOLIO community to do performance testing by themselves, early and often during the release cycle, so that performance issues are caught and addressed before the release. This document outlines the challenges of handing performance testing over to the FOLIO teams and the strategy for doing so. It does not prescribe how teams will integrate performance testing into their day-to-day activities. That discussion will be held after teams have gained a couple of quarters of experience with performance testing.
Current Environment
- 1 carrier-io instance
- 1 HA large-scale FOLIO deployment
- UChicago dataset: ~27M records
- GitHub repo: https://github.com/folio-org/perf-testing
Challenges
- Knowledge transfer is the key because carrier-io has a steep learning curve
- Using and administering carrier-io
- Creating JMeter test scripts
- Interpreting test results and troubleshooting issues
- Environment: enabling teams to share and reuse testing environments
- Each carrier-io instance has its own InfluxDB that stores all test data: JMeter test data, JVM profiling data, custom metrics data, etc. A single InfluxDB shared by all 8 teams would fill up very quickly and overwhelm the database.
- Currently we are developing a large-scale FOLIO deployment in the community's AWS account for PTF to use. Ideally we would have multiple deployments.
- Creating and maintaining a large database.
- Upgrading carrier-io
Proposals
- Teams need to learn how to work with carrier-io ASAP. The best way is to embed team members within PTF so that they can learn and be trained.
- Each team chooses an engineer who loves tackling performance problems. This person will start out working with PTF to create JMeter test scripts for their team. PTF will spend 20% of its time working with these team members, training them to write carrier-io compatible JMeter scripts, deploy the test scripts and their artifacts to carrier-io, execute the Jenkins job, and interpret test data with carrier-io.
- PTF team members will spend a maximum of 90 minutes each day giving hands-on training.
- These team members will go back to their teams with this knowledge and lead and teach their teams in performance testing tasks.
- Because each carrier-io instance has its own InfluxDB that stores all test data, all 8 teams can't use one carrier-io. Each team should have its own carrier-io instance.
- Because teams will need to performance-test their work before releases, one large-scale FOLIO environment is not enough. There should be 2 to 3 (+1 for PTF?) large-scale FOLIO environments that can be spun up and torn down on demand to save costs. All of them should be identical and run the same software versions. The "+1" is a dedicated environment for PTF for as long as PTF exists.
- The 2-3 large scale FOLIO environments are to be shared among the teams
- Initially PTF will be responsible for upgrading the environments at the beginning of every sprint with the latest snapshots (usually of the commits whose stories were approved at the end of the previous sprints) based on FOLIO-SNAPSHOT software versions. Later on teams should take over this responsibility.
- Upgrading the environments once per sprint lets teams test with recent module versions, while balancing the instability of frequent commits against the effort of upgrading the chain of dependencies required by the modules.
- This upgrade includes running any database migration script to update the database. These scripts are run automatically when the module is enabled.
- Teams spin up an environment to run tests, then drop it after testing to restore the state for the next team. The restored state is what was deployed at the beginning of the sprint.
- When teams spin up an environment, they will be able to customize the version of any module to be loaded if desired. This can be any released version, master, or any branch.
- Example: a team wants to performance-test the mod-circulation-storage code on a branch. They set the version of mod-circulation-storage to that branch so that it is loaded on startup.
- There should be a Wiki page for teams to schedule a timeslot or timeslots to run their tests
- Teams will have two hours after running a test to collect data and examine results, after which the environment would automatically get shut down. (Since the data will be stored in a persistent EBS volume, the test data won't be dropped after the environment is shut down.)
- If a team needs to run a database migration script for its performance testing, the run could in some cases take more than 2 hours. The team running the migration script will then need to tell the teams in the following timeslots about the overtime use.
- Note that once testing is finished and the environment is dropped, the data will be restored to the beginning-of-sprint state, so the results of the migration will be gone.
- Teams should continue to follow the principles and guidance described in the JMeter Scripts Contribution Guidelines when working with shared performance environments. This includes creating scripts to add test data and to restore the database after each test run.
- Environment Costs
- The following assumptions are made to determine costs
- Environments will be used 1/3 of the time, or 10 days a month
- On each of those days, an environment will be used for about 12 hours, or 1/2 day
- That totals five (5) 24-hour days a month, which equals 1/6 of the full month
- The cost per environment is therefore 1/6 of the cost of continuous use, thanks to the ability to spin environments up and tear them down.
| Item | Price/hour | Hours/month | Instances | Monthly Cost |
|---|---|---|---|---|
| 1 Community FOLIO deployment (1-year Reserved) | | | | |
| EKS Cluster | 0.1 | 120 | 1 | 12.00 |
| Database (t3.xlarge) | 0.104 | 120 | 1 | 12.48 |
| EC2 (t3.xlarge) | 0.104 | 120 | 6 | 74.88 |
| Load Balancers (Classic) | 0.025 | 120 | 4 | 12.00 |
| EBS (general purpose, gp2) | 0.1/GB | 142 GB | 8 | 113.6 |
| Total FOLIO | | | | $125.53 |
| 1 Carrier-io (1-year Reserved) | | | | |
| EC2 (m5.xlarge - Reserved) | 0.121 | 120 | 1 | 14.52 |
| Spot instance (t3.medium) | 0.05 | 15 | 1 | 0.75 |
| EBS (general purpose, gp2) | 0.1/GB | 200 GB | 8 | 20.00 |
| Total Carrier-io | | | | 35.27 |
| Monthly Grand Total | | | | $160.80 |
- Each environment costs about $160/month, so three environments cost about $482/month.
- Using 3-year reserved instances brings the cost down to $112/month for one environment, or $336/month for three environments (see the attached spreadsheet CommunityPerfEnvironemtCosts.xlsx for more details).
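The usage assumptions above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the PTF tooling: the helper `line_cost` and the 30-day month are assumptions made for illustration, and the prices and the $160.80 grand total are copied from the cost table as given.

```python
# Usage assumptions from the list above.
DAYS_USED_PER_MONTH = 10   # environments are used about 1/3 of the month
HOURS_USED_PER_DAY = 12    # about half of each day
HOURS_IN_MONTH = 30 * 24   # assumption: a 30-day month

usage_hours = DAYS_USED_PER_MONTH * HOURS_USED_PER_DAY
usage_fraction = usage_hours / HOURS_IN_MONTH


def line_cost(price_per_hour, hours, instances):
    """Monthly cost of one hourly-priced row of the cost table (hypothetical helper)."""
    return round(price_per_hour * hours * instances, 2)


# Two hourly-priced rows, using the prices listed in the table.
ec2_cost = line_cost(0.104, usage_hours, 6)      # FOLIO EC2 (t3.xlarge), 6 instances
carrier_ec2 = line_cost(0.121, usage_hours, 1)   # carrier-io EC2 (m5.xlarge)

# Scale the table's monthly grand total to three shared environments.
GRAND_TOTAL_PER_ENV = 160.80
three_envs = round(GRAND_TOTAL_PER_ENV * 3, 2)

print(f"{usage_hours} hours/month, {usage_fraction:.3f} of the month")
print(f"FOLIO EC2: {ec2_cost}, carrier-io EC2: {carrier_ec2}")
print(f"Three environments: {three_envs}")
```

Changing `DAYS_USED_PER_MONTH` or `HOURS_USED_PER_DAY` shows how sensitive the estimate is to how often teams actually spin environments up.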
How to Get There
- Create sandbox environments (carrier-io and FOLIO) for teams to play with during the transition.
- PERF-102: Automatically run tests checked into GitHub. A step to spin up FOLIO will need to be added when running tests. This makes it possible to run tests and compare results against previous test runs.
- Documentation:
- PERF-110
- Upgrading carrier-io going forward
- Create how-to documentation for administering carrier-io
- Create a diagram or a set of diagrams showing pieces of carrier-io and of FOLIO to communicate the architecture and responsibilities
- Performance Analysis documentation: what to look for, log analysis (missing indexes, database logs for slow queries), pgHero, pgAdmin, Performance Insight, metrics, trouble signs (such as slowness, runaway CPU/memory, 500 errors, database memory, etc.), Giraffe analysis.
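Comparing a test run against the previous one, as mentioned for the automated runs above, could look roughly like the sketch below. It is illustrative only. The function `find_regressions`, the 10% threshold, the transaction names, and the data shape (average response time in milliseconds per transaction) are all assumptions, not the format carrier-io or JMeter actually reports.

```python
def find_regressions(previous, current, threshold=0.10):
    """Return transactions whose response time grew by more than `threshold`.

    `previous` and `current` map a transaction name to its average
    response time in milliseconds (hypothetical data shape).
    """
    regressions = {}
    for name, prev_ms in previous.items():
        curr_ms = current.get(name)
        # Skip transactions missing from the current run or without a usable baseline.
        if curr_ms is None or prev_ms <= 0:
            continue
        if (curr_ms - prev_ms) / prev_ms > threshold:
            regressions[name] = (prev_ms, curr_ms)
    return regressions


# Made-up numbers for two hypothetical transactions.
previous_run = {"check-out": 800.0, "check-in": 600.0}
current_run = {"check-out": 1000.0, "check-in": 610.0}

regressions = find_regressions(previous_run, current_run)
for name, (prev_ms, curr_ms) in regressions.items():
    print(f"{name}: {prev_ms} ms -> {curr_ms} ms")
```

A relative threshold is used so that the same rule applies to fast and slow transactions. Teams would likely tune it per workflow.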