Texas A&M University Libraries
Current production-esque environment
Kubernetes node specs:
- 4-core CPU
- 16 GB memory
- 40 GB drive per node
- Provisioned on vSphere infrastructure
Database configuration:
- Crunchy-Postgres Kubernetes StatefulSet
- Deployed via Helm chart in Rancher 2.1
- Postgres volumes provisioned with vSphere storage class via vSphere cloud config in Kubernetes/Rancher 2.1
- XFS file system
- One primary and one replica for each Folio instance (running three instances of Folio)
- max_connections set to 250
- Max pool size of 10 for Postgres connections
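A rough sketch of where those two settings live, assuming max_connections is set on the Postgres server (for example through the Crunchy chart's values, whose exact keys vary by chart version) and the pool size refers to the modules' RMB connection pools (the DB_MAXPOOLSIZE environment variable):

    # Hypothetical Helm values excerpt for the Crunchy Postgres chart;
    # key names differ between chart versions.
    postgres:
      maxConnections: 250        # server-side max_connections

    # Hypothetical FOLIO module container spec; RMB-based modules read
    # their connection pool size from DB_MAXPOOLSIZE.
    containers:
      - name: mod-users
        env:
          - name: DB_MAXPOOLSIZE
            value: "10"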
Test cluster:
- 7 pre-provisioned Oracle Linux VMs on VMware infrastructure (4 Worker and 3 etcd/Control Plane nodes)
RancherOS cluster:
- 8 RancherOS VMs provisioned from a node template on VMware infrastructure (5 Worker and 3 etcd/Control Plane nodes)
Folio environment:
(Three instances of Folio hosted across two clusters)
- RancherOS cluster hosting Folio Q4 2018
- Test Cluster hosting Folio Q3 2018, Folio Q4 2018
- 12k users in one instance, over 120k users in another
- Inventory records, loan types, address types, patron groups, etc. loaded
- One instance hosting two tenants sharing a single DB (diku and tamu)
Preliminary Findings
- Pod monitoring on our clusters via Prometheus and Grafana - deployed via Helm chart in both clusters
- Gathered a list of the worst-offending modules for cluster resources used:
- mod-agreements
- mod-licenses
- mod-permissions
- Set resource reservations and limits for module Workloads, to prevent runaway pods or failed clusters when upgrading, rescheduling, or during node downtime (see the sketch after this list)
- Set a batch size of 1 in the Workload when performing rolling upgrades or deployments
- These limits have slowed my Folio response times somewhat; I don't yet have a clear picture of which modules should be given higher resource limits
- Java is very memory-hungry
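A minimal sketch of what those reservations, limits, and batch-of-one rolling upgrades look like on the underlying Kubernetes Deployment that Rancher manages. The module name, resource values, and JAVA_OPTIONS setting are illustrative placeholders, not the values actually in use:

    # Illustrative Deployment excerpt; names and numbers are placeholders.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mod-permissions
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: mod-permissions
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1          # one way to express a batch size of 1: one new pod at a time
          maxUnavailable: 0    # keep the old pod running until the new one is ready
      template:
        metadata:
          labels:
            app: mod-permissions
        spec:
          containers:
            - name: mod-permissions
              image: folioorg/mod-permissions:latest
              resources:
                requests:             # the "reservation" used for scheduling
                  cpu: 250m
                  memory: 512Mi
                limits:               # hard caps to contain a memory-hungry JVM
                  cpu: "1"
                  memory: 1Gi
              env:
                - name: JAVA_OPTIONS  # assumes the image honors JAVA_OPTIONS
                  value: "-Xmx768m"   # keep the heap below the container memory limit

Keeping the JVM heap below the container memory limit helps avoid the out-of-memory kills that otherwise show up as failed pods during upgrades and rescheduling.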
UI user queries and resource utilization
Doing some user look-up queries in the Folio UI:
A query for all of the users in the UI put a 6 GHz load on the Kubernetes cluster node hosting the DB:
htop on the node hosting the DB, after it has calmed down a little; the query is still taking a huge amount of resources:
Grafana's history of the event shows the pgset-0 pod using 2.371 cores, even with a limit I set in the Rancher UI to consume a max of only 2 cores!
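For reference, a 2-core cap set through the Rancher UI ends up as a CPU limit on the container spec of the Crunchy Postgres StatefulSet, roughly like the sketch below (the container name and request value are placeholders). Pod-level Grafana panels typically sum usage across all containers in a pod, while the limit applies per container, which may account for part of the overshoot:

    # Illustrative container resources excerpt for the pgset StatefulSet.
    containers:
      - name: pgset
        resources:
          requests:
            cpu: "1"      # placeholder reservation
          limits:
            cpu: "2"      # the 2-core cap set via the Rancher UI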
Data loading resource utilization
User load of 12k users, along with the needed reference data:
- Loaded on an instance of Folio with the Okapi DB split from the rest of the modules' DB
- The UI is MUCH more responsive during the load than before
12k users in all took about 20 minutes to load
Not dropping the index before doing the load; there's no way to really do that using mod-user-import, as far as I'm aware?