[FOLIO-2029] monitor bugfest environment for scalability issues Created: 20/May/19  Updated: 03/Jun/20  Resolved: 30/May/19

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: Jakub Skoczen Assignee: Hongwei Ji
Resolution: Done Votes: 0
Labels: platform-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: okapi.config, okapi.metrics
Sprint: CP: sprint 64
Story Points: 5
Development Team: Core: Platform

 Description   

We will need the following things to be monitored:

• memory utilization (on each node)
• cpu utilization (on each node)
• number of sockets used (on each node, per process, especially for Okapi processes)
• DB connection utilization (on each node)
• full okapi.log
• any other module logs

Please provide details about the AWS config (number and size of instances, RDS, etc), if possible.

This task includes setting up appropriate monitoring and providing the collected data (or access to the monitoring tools) after the bugfest has been executed.



 Comments   
Comment by Hongwei Ji [ 21/May/19 ]

Here is the AWS config:

  • 1 database machine (db.r4.large, 2 vCPU and 16GB memory)
  • 4 EC2 machines (m5.large, 2 vCPU and 8GB memory)
All FOLIO apps, including Okapi, run as Docker containers.

By default, AWS collects CPU and memory metrics. The database has monitoring as well.
All logs are written to AWS CloudWatch, and we can export them all when BugFest is over.
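One option for dumping the logs is a CloudWatch Logs export task to S3. The sketch below only builds the `aws logs create-export-task` command line and does not call AWS. The log group `/folio/okapi`, the bucket `folio-bugfest-logs`, and the helper names are hypothetical placeholders, because the real group names are not listed in this ticket. `--from` and `--to` take epoch milliseconds.

```python
from __future__ import annotations

import shlex
from datetime import datetime, timezone

def epoch_ms(iso: str) -> int:
    """Convert a UTC timestamp like '2019-05-13T00:00:00Z' to epoch milliseconds."""
    dt = datetime.strptime(iso, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000

def export_task_argv(log_group: str, bucket: str, start: str, end: str) -> list[str]:
    """Build an `aws logs create-export-task` invocation for one log group."""
    return [
        "aws", "logs", "create-export-task",
        # Hypothetical naming scheme: derive the task name from the group name.
        "--task-name", "bugfest-" + log_group.strip("/").replace("/", "-"),
        "--log-group-name", log_group,
        "--from", str(epoch_ms(start)),
        "--to", str(epoch_ms(end)),
        "--destination", bucket,
        "--destination-prefix", "bugfest-logs",
    ]

if __name__ == "__main__":
    # Hypothetical log group and S3 bucket; substitute the real ones.
    argv = export_task_argv("/folio/okapi", "folio-bugfest-logs",
                            "2019-05-13T00:00:00Z", "2019-05-29T00:00:00Z")
    print(shlex.join(argv))
```

The destination bucket needs a bucket policy that allows CloudWatch Logs to write to it. AWS also allows only one active export task per account at a time, so several log groups would have to be exported one after another.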

The only obvious missing piece is the sockets metric. Is this a must?

Comment by Hongwei Ji [ 21/May/19 ]

Found out that the AWS CloudWatch agent can capture netstat_tcp_established (the number of established TCP connections), so I installed the CloudWatch agent on all four EC2 machines. That should satisfy the socket metrics requirement.
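The CloudWatch agent reads a JSON configuration file. Below is a minimal sketch of one that collects the requested host metrics. It is generated from Python so the structure can be checked before deployment. The selected measurements and the 60 s interval are assumptions, not the configuration actually deployed on the BugFest machines. Also note that `netstat_tcp_established` is a host-level counter and does not break connections down per process.

```python
import json

def agent_config(interval_s: int = 60) -> dict:
    """Build a CloudWatch agent config covering CPU, memory and TCP connections."""
    return {
        "agent": {"metrics_collection_interval": interval_s},
        "metrics": {
            # Tag every data point with the EC2 instance it came from.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                # Published as netstat_tcp_established / netstat_tcp_time_wait.
                "netstat": {"measurement": ["tcp_established", "tcp_time_wait"]},
                "mem": {"measurement": ["mem_used_percent"]},
                "cpu": {"measurement": ["cpu_usage_user", "cpu_usage_system"],
                        "totalcpu": True},
            },
        },
    }

if __name__ == "__main__":
    print(json.dumps(agent_config(), indent=2))
```

If saved as, for example, `config.json`, the agent could load it with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s`.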

Comment by Hongwei Ji [ 30/May/19 ]

Extended AWS log/metrics retention period in case we need it.
AWS CloudWatch logs: changed from 2 weeks to 1 month.
AWS CloudWatch metrics: default is 15 days for 60s resolution. After that it changes to 300s resolution and is kept for 63 days.
AWS RDS Performance Insights: changed from the default 7 days (free) to 2 years (paid; the only other option).
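The retention tiers affect later exports. When a query's start time is more than 15 days old, `GetMetricData` only returns data if `Period` is a multiple of 300 s. The sketch below encodes that rule. The 455-day/1-hour tier comes from the AWS documentation rather than from the comment above, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# (max age in days, resolution in seconds) per CloudWatch retention tier.
# The first two tiers are the ones quoted above; the last one is from AWS docs.
RETENTION_TIERS = [(15, 60), (63, 300), (455, 3600)]

def min_period_seconds(age_days: float) -> int:
    """Return the resolution (s) at which data points `age_days` old are still kept.

    A GetMetricData Period should be a multiple of this value; otherwise
    no data points are returned for that time range.
    """
    for max_age, period in RETENTION_TIERS:
        if age_days <= max_age:
            return period
    raise ValueError("CloudWatch no longer retains metric data this old")

def min_period_for_start(start: datetime, now: datetime) -> int:
    """Resolution needed for a query whose StartTime is `start`, issued at `now`."""
    age_days = (now - start).total_seconds() / 86400
    return min_period_seconds(age_days)

if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical: querying from the BugFest start date four weeks later.
    print(min_period_for_start(datetime(2019, 5, 13, tzinfo=utc),
                               datetime(2019, 6, 10, tzinfo=utc)))
```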

Comment by Hongwei Ji [ 30/May/19 ]

Ideally, we should only need AWS tools to analyze the metrics. If data export is needed, we can use the AWS CLI. Here is an example command that exports Okapi CPU and memory utilization metrics; the config and results are attached as well.

aws cloudwatch get-metric-data --metric-data-queries file://./okapi.config --start-time 2019-05-13T00:00:00Z --end-time 2019-05-29T00:00:00Z
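The attached okapi.config is not reproduced here. The sketch below shows what a `--metric-data-queries` file for this command could contain: a JSON list of MetricDataQuery objects. It assumes Okapi runs as an ECS service, so it uses the `AWS/ECS` namespace with `ClusterName`/`ServiceName` dimensions. The cluster name `folio-bugfest` and service name `okapi` are hypothetical. `Period` is 300 because the query starts more than 15 days before the 30 May export, so only 5-minute data is available for the full range.

```python
import json

def metric_query(query_id: str, namespace: str, metric_name: str,
                 dimensions: dict, period: int = 300, stat: str = "Average") -> dict:
    """One MetricDataQuery entry as accepted by `--metric-data-queries`."""
    return {
        "Id": query_id,  # must start with a lowercase letter
        "MetricStat": {
            "Metric": {
                "Namespace": namespace,
                "MetricName": metric_name,
                "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }

def okapi_queries(cluster: str, service: str) -> list:
    """CPU and memory utilization queries for a (hypothetical) ECS-hosted Okapi service."""
    dims = {"ClusterName": cluster, "ServiceName": service}
    return [
        metric_query("okapi_cpu", "AWS/ECS", "CPUUtilization", dims),
        metric_query("okapi_mem", "AWS/ECS", "MemoryUtilization", dims),
    ]

if __name__ == "__main__":
    # Redirect stdout to a file to produce the config, e.g. `... > okapi.config`.
    print(json.dumps(okapi_queries("folio-bugfest", "okapi"), indent=2))
```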
Generated at Thu Feb 08 23:17:40 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.