Texas A&M University Libraries

Current production-esque environment

Kubernetes node specs:

  • 4-core CPU
  • 16 GB memory
  • 40 GB drive per node
  • Provisioned on vSphere infrastructure

Database configuration:

  • Crunchy-Postgres Kubernetes stateful set
  • Deployed via Helm chart in Rancher 2.1
  • Postgres volumes provisioned with vSphere storage class via vSphere cloud config in Kubernetes/Rancher 2.1
  • XFS file system
  • One primary and one replica for each Folio instance (running three instances of Folio)
  • Max connections set to 250
  • Max pool size of 10 for Postgres
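
The max pool size of 10 is a client-side setting (the DB_MAXPOOLSIZE environment variable in RMB-based modules), so only the 250-connection server cap is visible on the Postgres side. Below is a minimal sketch for double-checking that cap and the current usage against the primary; the host, database name, and credentials are placeholders, and it assumes psycopg2 is available on the client:

    import psycopg2

    # Placeholder connection details for the Crunchy-Postgres primary;
    # adjust the service name, database, and credentials to the real deployment.
    conn = psycopg2.connect(
        host="pgset-primary.folio.svc.cluster.local",
        dbname="okapi",
        user="postgres",
        password="CHANGE_ME",
    )

    with conn.cursor() as cur:
        # Server-side cap (set to 250 in this environment).
        cur.execute("SHOW max_connections;")
        print("max_connections:", cur.fetchone()[0])

        # How many of those connections the module pools currently hold.
        cur.execute("SELECT count(*) FROM pg_stat_activity;")
        print("open connections:", cur.fetchone()[0])

    conn.close()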

Test cluster:

  • 7 pre-provisioned Oracle Linux VMs on VMware infrastructure (4 Worker and 3 etcd/Control Plane nodes)

RancherOS cluster:

  • 8 RancherOS VMs provisioned via node templates on VMware infrastructure (5 Worker and 3 etcd/Control Plane nodes)

Folio environment:

(Three instances of Folio being hosted by two clusters)

  • RancherOS cluster hosting Folio Q4 2018
  • Test Cluster hosting Folio Q3 2018, Folio Q4 2018
  • 12k users in one instance, over 120k users in another
  • Inventory records, loan types, address types, patron groups, etc. loaded
  • One instance hosting two tenants sharing a single DB (diku and tamu)
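
Because diku and tamu share one database, each RMB-based backend module keeps its data in a per-tenant schema (conventionally named tenant_module, e.g. diku_mod_users). A minimal sketch for confirming both tenants are present in the shared DB; connection details are placeholders:

    import psycopg2

    # Placeholder connection to the shared modules database.
    conn = psycopg2.connect(
        host="pgset-primary.folio.svc.cluster.local",
        dbname="folio_modules",
        user="postgres",
        password="CHANGE_ME",
    )

    with conn.cursor() as cur:
        # List the per-tenant schemas the modules have created for each tenant.
        cur.execute(
            """
            SELECT schema_name
            FROM information_schema.schemata
            WHERE schema_name LIKE 'diku%' OR schema_name LIKE 'tamu%'
            ORDER BY schema_name;
            """
        )
        for (schema,) in cur.fetchall():
            print(schema)

    conn.close()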


Preliminary Findings

  1. Pod monitoring on our clusters via Prometheus and Grafana - deployed via Helm chart in both clusters
  2. Gathered a list of the worst-offending modules in terms of cluster resources used:
    1. mod-agreements
    2. mod-licenses
    3. mod-permissions
  3. Set resource reserves and limits for module Workloads - to prevent runaway or failed clusters when upgrading, rescheduling, or during node downtime (see the sketch after this list)
  4. Set batch sizes of 1 in the Workloads when performing rolling upgrades or deployments
  5. These limits have slowed my Folio response times somewhat - I don’t yet have a clear view of which modules should be given higher resource limits
  6. Java is very memory-hungry
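
A minimal sketch of how items 3 and 4 can be applied outside the Rancher UI, using the official Kubernetes Python client. The module name, namespace, and resource numbers below are placeholders rather than our exact settings:

    from kubernetes import client, config

    # Assumes a kubeconfig with access to the cluster (e.g. exported from Rancher).
    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            # Item 4: batch size of 1 - replace pods one at a time on upgrades.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
            },
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Must match the container name inside the Workload.
                            "name": "mod-permissions",
                            # Item 3: reserve a baseline and cap runaway usage.
                            "resources": {
                                "requests": {"cpu": "250m", "memory": "512Mi"},
                                "limits": {"cpu": "1", "memory": "1Gi"},
                            },
                        }
                    ]
                }
            },
        }
    }

    apps.patch_namespaced_deployment(
        name="mod-permissions", namespace="folio", body=patch
    )

One related caveat for item 6: the container memory limit needs headroom above the module’s JVM heap (-Xmx), otherwise the pod gets OOM-killed before the limit does anything useful.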


UI user queries and resource utilization

Doing some user look-up queries in the Folio UI:


A query of all of the users in the UI put a 6 GHz load on the K8s cluster node hosting the DB:


htop on that node hosting the DB, after it has calmed down a little. The query is taking a huge amount of resources:


Grafana’s history of the event; you can see the pgset-0 pod in the list using 2.371 cores - even with a limit I set in the Rancher UI to consume a max of only 2 cores!
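
For a spot check outside Grafana, current per-pod CPU usage can also be pulled from the Kubernetes metrics API. This assumes metrics-server (or another metrics.k8s.io provider) is running in the cluster; the namespace is a placeholder:

    from kubernetes import client, config

    config.load_kube_config()
    metrics = client.CustomObjectsApi()

    # Live pod metrics for the namespace hosting the Crunchy-Postgres stateful set.
    pod_metrics = metrics.list_namespaced_custom_object(
        group="metrics.k8s.io",
        version="v1beta1",
        namespace="folio",
        plural="pods",
    )

    for pod in pod_metrics["items"]:
        name = pod["metadata"]["name"]
        # CPU comes back per container in millicores, e.g. "2371m" for 2.371 cores.
        usage = [c["usage"]["cpu"] for c in pod["containers"]]
        print(name, usage)

If the 2-core limit really is being exceeded, it is worth confirming the limit landed on the container actually running Postgres, since the per-pod figure in Grafana sums every container in the pod.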


Data loading resource utilization

User load of 12k users, along with the needed reference data:

  • Loaded on an instance of Folio with the Okapi DB split from the modules’ DB
  • The UI is MUCH more responsive than before, during the load
  • 12k users in all took about 20 minutes to load

  • Not dropping the indexes before doing the load - there’s no real way to do that using mod-user-import, as far as I’m aware (the import call itself is sketched below)
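
For reference, the load was driven by an external client posting batches to mod-user-import through the cluster ingress and Okapi. A minimal sketch of one such call; the URL, tenant, token, and user fields are placeholders, and the request body should be checked against the mod-user-import documentation for the release in use:

    import requests

    OKAPI_URL = "https://folio-okapi.example.edu"  # external ingress URL, placeholder
    HEADERS = {
        "X-Okapi-Tenant": "tamu",
        "X-Okapi-Token": "CHANGE_ME",              # token obtained from /authn/login
        "Content-Type": "application/json",
    }

    # One small batch; the real load sent ~12k users in batches like this.
    payload = {
        "users": [
            {
                "username": "jdoe",
                "externalSystemId": "jdoe@example.edu",
                "active": True,
                "patronGroup": "undergrad",
                "personal": {"lastName": "Doe", "firstName": "Jane"},
            }
        ],
        "totalRecords": 1,
        "deactivateMissingUsers": False,
        "updateOnlyPresentFields": True,
        "sourceType": "test",
    }

    resp = requests.post(OKAPI_URL + "/user-import", json=payload, headers=HEADERS)
    resp.raise_for_status()
    print(resp.json())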


The total load event on my cluster. Take note of the network pressure graph, the pod CPU usage graph, and the pod memory usage graph:


Here you can see the pod network I/O during the load. Keep in mind this user-import load is initiated from an external client, going through the cluster Nginx ingress: