Texas A&M University Libraries

Current production-esque environment

Kubernetes node specs:

  • 4-core CPU

  • 16 GB memory

  • 40 GB drive per node

  • Provisioned on vSphere infrastructure

Database configuration:

  • Crunchy-Postgres Kubernetes stateful set

  • Deployed via Helm chart in Rancher 2.1

  • Postgres volumes provisioned with a vSphere storage class, via the vSphere cloud config in Kubernetes/Rancher 2.1 (see the sketch after this list)

  • XFS file system

  • One primary and one replica for each Folio instance (running three instances of Folio)

  • max_connections set to 250

  • Max pool size of 10 for Postgres connections
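
For reference, here is a minimal sketch of how the vSphere-backed, XFS-formatted storage described above might be expressed; the class name, disk format, and volume size are illustrative assumptions, not our actual manifests, and the max_connections=250 setting lives in the Crunchy container's Postgres configuration rather than here:

```yaml
# Hypothetical StorageClass matching the vSphere + XFS setup described above.
# The in-tree vSphere provisioner formats new volumes with the given fstype.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-xfs              # illustrative name
provisioner: kubernetes.io/vsphere-volume
parameters:
  diskformat: thin               # thin-provisioned VMDK
  fstype: xfs                    # the XFS file system noted above
---
# Sketch of a claim the Crunchy stateful set's pods would bind to.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pgset-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: vsphere-xfs
  resources:
    requests:
      storage: 20Gi              # illustrative size
```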

Test cluster:

  • 7 pre-provisioned Oracle Linux VMs on VMware infrastructure (4 Worker and 3 etcd/Control Plane nodes)

RancherOS cluster:

  • 8 node-template-provisioned RancherOS VMs on VMware infrastructure (5 Worker and 3 etcd/Control Plane nodes; node roles sketched below)
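
Rancher's node templates and the underlying RKE engine build both clusters from the same kind of node/role list. A hedged sketch of the RancherOS cluster's layout in RKE cluster.yml terms (addresses and user are placeholders):

```yaml
# Hypothetical RKE-style cluster.yml fragment: 3 combined etcd/control-plane
# nodes plus 5 workers, mirroring the RancherOS cluster above.
nodes:
  - {address: 10.0.0.11, user: rancher, role: [controlplane, etcd]}
  - {address: 10.0.0.12, user: rancher, role: [controlplane, etcd]}
  - {address: 10.0.0.13, user: rancher, role: [controlplane, etcd]}
  - {address: 10.0.0.21, user: rancher, role: [worker]}
  - {address: 10.0.0.22, user: rancher, role: [worker]}
  # ...three more worker entries in the same form
```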

Folio environment:

(Three instances of Folio hosted across two clusters)

  • RancherOS cluster hosting Folio Q4 2018

  • Test cluster hosting Folio Q3 2018 and Folio Q4 2018

  • 12k users in one instance, over 120k users in another

  • Inventory records, loan types, address types, patron groups, etc. loaded

  • One instance hosting two tenants (diku and tamu) sharing a single DB; see the env sketch after this list
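
Two tenants can share one DB because RMB-based FOLIO modules keep each tenant's data in its own schema (e.g. diku_mod_users vs. tamu_mod_users) inside the same database. A hedged sketch of the container env a module Workload might carry to point at that shared DB; the host, database, and credential values are illustrative assumptions:

```yaml
# Hypothetical container fragment for an RMB-based module (e.g. mod-users).
# Both tenants' data lands in this one database, in per-tenant schemas.
containers:
  - name: mod-users
    image: folioorg/mod-users:latest     # illustrative tag
    env:
      - name: DB_HOST
        value: pgset                     # Crunchy primary's service name (assumed)
      - name: DB_PORT
        value: "5432"
      - name: DB_DATABASE
        value: folio_modules             # illustrative; shared by diku and tamu
      - name: DB_USERNAME
        value: folio_admin               # illustrative
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef: {name: pg-credentials, key: password}
      - name: DB_MAXPOOLSIZE
        value: "10"                      # the max pool size noted earlier
```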



Preliminary Findings

  1. Set up pod monitoring on our clusters via Prometheus and Grafana, deployed via Helm chart in both clusters

  2. Gathered a list of the worst-offending modules by cluster resource usage:

    1. mod-agreements

    2. mod-licenses

    3. mod-permissions

  3. Set resource reservations and limits for module Workloads, to prevent runaway or failed clusters during upgrades, rescheduling, or node downtime (see the Deployment sketch after this list)

  4. Set batch sizes of 1 in the Workloads when performing rolling upgrades or deployments

  5. These limits have slowed my Folio response times somewhat; I don't yet have a clear view of which modules should be given higher resource limits

  6. Java is very memory-hungry
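
To make findings 3, 4, and 6 concrete, here is a hedged sketch of what one module's Deployment could look like with those settings applied. The replica count, image tag, and all the numbers are illustrative rather than the values we actually settled on, and JAVA_OPTIONS is assumed to be the heap knob the module's image exposes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mod-agreements
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # "batch size of 1": roll one pod at a time
      maxUnavailable: 0
  selector:
    matchLabels: {app: mod-agreements}
  template:
    metadata:
      labels: {app: mod-agreements}
    spec:
      containers:
        - name: mod-agreements
          image: folioci/mod-agreements:latest   # illustrative tag
          env:
            - name: JAVA_OPTIONS       # cap the memory-hungry JVM below the pod limit
              value: "-Xmx512m"
          resources:
            requests:                  # the reservation the scheduler guarantees
              cpu: 250m
              memory: 512Mi
            limits:                    # hard ceiling against runaway pods
              cpu: "1"
              memory: 1Gi
```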

 

UI users queries and resource utilization

Doing some user look-up queries in the Folio UI:

 

A query of all of the users in the UI put a 6 GHz load on the K8s cluster node hosting the DB:

 

htop on the node hosting the DB, after it has calmed down a little; the query is taking a huge amount of resources:

 

Grafana’s history of the event: you can see the pgset-0 pod in the list using 2.371 cores, even with the limit I set in the Rancher UI to consume a max of only 2 cores!
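
My working assumption for the overshoot (not something I've confirmed): Kubernetes enforces CPU limits as CFS quota over short periods (100 ms by default), while the Grafana panel charts a usage rate averaged over a much longer sampling window, so a pod that is being throttled can still graph slightly above its limit. The 2-core cap from the Rancher UI ends up on the container as nothing more than:

```yaml
# What Rancher's 2-core cap becomes on the pgset-0 container: a CFS quota
# of 200ms of CPU time per 100ms period, not an instantaneous ceiling.
resources:
  limits:
    cpu: "2"
```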

 

Data loading resource utilization

User load of 12k users, along with the needed reference data:

  • Loaded on an instance of Folio with the Okapi DB and the modules’ DB split

  • The UI is MUCH more responsive during the load than it was before

  • All 12k users took about 20 minutes to load

  • Not dropping the indexes before doing the load; as far as I’m aware, there’s no real way to do that using mod-user-import

 

The total load event on my cluster. Take note of the network pressure graph, pod CPU usage graph and pod memory usage graph:

 

Here you can see the pod network I/O during the load. Keep in mind this user-import load is initiated from an external client, going through the cluster Nginx ingress: