Texas A&M University Libraries
Current production-esque environment
Kubernetes node specs:
4 core CPU
16GB memory
40GB drive per node
Provisioned on vSphere infrastructure
Database configuration:
Crunchy-Postgres Kubernetes StatefulSet
Deployed via Helm chart in Rancher 2.1
Postgres volumes provisioned with a vSphere storage class via the vSphere cloud provider config in Kubernetes/Rancher 2.1 (see the StorageClass sketch after this list)
XFS file system
One primary and one replica for each Folio instance (running three instances of Folio)
max_connections set to 250
Max connection pool size of 10 for Postgres
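For illustration, a minimal sketch of a vSphere-backed StorageClass along these lines; the class name, datastore, and disk format are placeholders/assumptions, not our actual values:
    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: vsphere-xfs                  # placeholder class name
    provisioner: kubernetes.io/vsphere-volume
    parameters:
      diskformat: thin                   # assumed thin-provisioned VMDKs
      fstype: xfs                        # XFS file system on the Postgres volumes
      datastore: ExampleDatastore        # placeholder datastore name
A PersistentVolumeClaim that names this class then gets its volume carved out of vSphere storage for the primary and replica pods.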
Test cluster:
7 pre-provisioned Oracle Linux VMs on VMware infrastructure (4 Worker and 3 etcd/Control Plane nodes)
RancherOS cluster:
8 node-template-provisioned RancherOS VMs on VMware infrastructure (5 Worker and 3 etcd/Control Plane nodes)
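For illustration, the worker vs. etcd/control plane split maps to node roles like the following RKE-style cluster.yml sketch; the addresses and user are hypothetical, and in practice the roles were assigned through Rancher node registration and node templates rather than a hand-written file:
    nodes:
      - address: 10.0.0.11
        user: rancher
        role: [controlplane, etcd]
      - address: 10.0.0.12
        user: rancher
        role: [controlplane, etcd]
      - address: 10.0.0.13
        user: rancher
        role: [controlplane, etcd]
      - address: 10.0.0.21
        user: rancher
        role: [worker]
      # ...remaining worker nodes follow the same pattern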
Folio environment:
(Three instances of Folio hosted across two clusters)
RancherOS cluster hosting Folio Q4 2018
Test cluster hosting Folio Q3 2018 and Folio Q4 2018
12k users in one instance, over 120k users in another
Inventory records, loan types, address types, patron groups, etc. loaded
One instance hosting two tenants sharing a single DB (diku and tamu)
Preliminary Findings
Pod monitoring via Prometheus and Grafana, deployed via Helm chart in both clusters
Gathered a list of the worst-offender modules for cluster resource usage:
mod-agreements
mod-licenses
mod-permissions
Set resource reservations and limits for module Workloads to prevent runaway or failed clusters when upgrading, rescheduling, or during node downtime (see the sketch after this list)
Set batch sizes of 1 in the Workload when performing rolling upgrades or deployments
These limits have slowed Folio response times somewhat; I don't yet have a clear view of which modules should be given higher resource limits
Java is very memory-hungry
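A hedged sketch of what those Workload settings look like on a module Deployment; the module, image tag, and the CPU/memory numbers are illustrative assumptions, not the values we settled on:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mod-permissions                          # example module only
    spec:
      replicas: 1
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1            # upgrade one pod at a time (batch size of 1)
          maxUnavailable: 0      # keep the old pod until the new one is ready
      selector:
        matchLabels:
          app: mod-permissions
      template:
        metadata:
          labels:
            app: mod-permissions
        spec:
          containers:
            - name: mod-permissions
              image: folioorg/mod-permissions:latest   # placeholder tag
              resources:
                requests:        # the "reservation" side in the Rancher UI
                  cpu: 250m
                  memory: 512Mi
                limits:          # hard cap, keeps one module from starving a node
                  cpu: "1"
                  memory: 1Gi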
UI user queries and resource utilization
Doing some user look-up queries in the Folio UI:
A query of all of the users in the UI put a 6 GHz load on the K8s cluster node hosting the DB:
htop on the node hosting the DB, after it has calmed down a little; the query is still taking a huge amount of resources:
Grafana's history of the event: you can see the pgset-0 pod in the list using 2.371 cores, even with a limit I set in the Rancher UI to consume a max of only 2 cores!
Data loading resource utilization
User load of 12k users, along with the needed reference data:
Loaded on an instance of Folio with the Okapi DB split from the rest of the modules' DB
The UI is MUCH more responsive during the load than before
12k users in all took about 20 minutes to load
Not dropping the index before doing the load; there's no way to really do that using mod-user-import, as far as I'm aware (a sketch of the import request is at the end of this section)
The total load event on my cluster. Take note of the network pressure, pod CPU usage, and pod memory usage graphs:
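For context, the user load goes through mod-user-import's POST /user-import endpoint; a minimal sketch of the request body, shown here in YAML form (the actual body is JSON, sent through Okapi with X-Okapi-Tenant/X-Okapi-Token headers, and the names and values below are made up):
    users:
      - username: jdoe                       # hypothetical user
        externalSystemId: jdoe@example.edu
        active: true
        patronGroup: staff
        personal:
          firstName: Jane
          lastName: Doe
          email: jdoe@example.edu
    totalRecords: 1
    deactivateMissingUsers: false
    updateOnlyPresentFields: false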