Overview
Typical FOLIO environments are deployed with one database node that hosts all the schemas of the storage modules and is responsible for both reading from and writing to the database. While the modules' services can be scaled out horizontally, the database cannot (or at least not easily). Even when the modules are scaled out, all traffic eventually goes to the single database, creating a bottleneck. When this happens, a workflow like Data Import that consumes nearly all of the database's CPU resources can overwhelm other operations, such as circulation, severely degrading their performance. Circulation check-out and check-in response times can more than double under these circumstances.
...
NB 2: The connections and the pools inside the modules (maintained by RMB) are in-memory objects that create a channel to communicate with a database node. The actual connections are held on the database nodes: the read node maintains its read-only DB connections from the modules, and the write node maintains its write DB connections. Doubling the number of connections adds no significant memory footprint to the existing nodes, although the modules themselves need twice the number of in-memory representations of the DB connections.
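To make the "twice the in-memory representations" point concrete, here is a minimal sketch of a module holding two sets of lightweight channel objects, one per database node. The class and record names (`ReaderWriterPools`, `Channel`) are illustrative and not RMB's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class ReaderWriterPools {
    // Lightweight in-memory representation of a channel to a database node.
    public record Channel(String host, int port, boolean readOnly) {}

    private final List<Channel> writeChannels = new ArrayList<>();
    private final List<Channel> readChannels = new ArrayList<>();

    public ReaderWriterPools(String writerHost, String readerHost, int port, int sizePerPool) {
        // The module keeps 2x the in-memory channel objects, but each actual
        // connection is held by exactly one database node (reader or writer).
        for (int i = 0; i < sizePerPool; i++) {
            writeChannels.add(new Channel(writerHost, port, false));
            readChannels.add(new Channel(readerHost, port, true));
        }
    }

    public int totalChannels() {
        return writeChannels.size() + readChannels.size();
    }

    public List<Channel> readChannels() { return readChannels; }

    public List<Channel> writeChannels() { return writeChannels; }
}
```

The doubling is entirely on the module side: each node still holds only the real connections destined for it.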
Implications
Broadly speaking, going from one logical server (read/write) to two logical servers (read-only and read/write) introduces more moving parts into the system, which naturally makes it more complicated to understand.
Development Complexity
Assuming we want to check that a module works with both configurations, we need to be able to test this capability in the module's automated tests. That means we need to be able to run two Postgres instances in the tests and have a way to synchronise them (see https://folio-org.atlassian.net/browse/RMB-938). Developers need to be aware that requests could be going to different servers. This adds cognitive load and makes it harder to figure out why something has gone wrong.
There is also the possibility of stale data. Developers (and product folks) need to take this into account when implementing features and consider how the system could behave under these conditions.
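One common way to tolerate replica staleness is a read-your-writes fallback: try the read replica first and fall back to the primary when the row is missing. The sketch below simulates this with two in-memory maps; the store and its methods are hypothetical, not FOLIO APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class StaleReadFallback {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();

    public void write(String key, String value) {
        primary.put(key, value);
        // In a real deployment the replica catches up asynchronously;
        // here replication only happens when replicate() is called.
    }

    public void replicate() {
        replica.putAll(primary);
    }

    public Optional<String> read(String key) {
        // A freshly written row may not have replicated yet, so fall back
        // to the primary on a miss instead of reporting "not found".
        String value = replica.get(key);
        if (value == null) {
            value = primary.get(key);
        }
        return Optional.ofNullable(value);
    }
}
```

Note the trade-off: the fallback rescues "row not found" cases, but a stale-but-present row would still be served from the replica, which is exactly the class of behaviour developers need to reason about.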
The configuration possibilities of the system have doubled, which may increase the amount of manual testing needed. There are also implementation-specific factors that could affect this, e.g. needing to know how to access and manage multiple connection pools, or whether features of the tooling use the read or write server (in RMB, an example would be the methods that are ambiguous and what to do about them).
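The ambiguity problem can be sketched as a routing decision: dispatch each statement to the read or write pool, and default anything ambiguous to the write pool so correctness is never sacrificed for load distribution. This is an illustrative sketch of the decision, not RMB's actual implementation.

```java
import java.util.Locale;

public class PoolRouter {
    public enum Pool { READ, WRITE }

    public static Pool route(String sql) {
        String s = sql.strip().toLowerCase(Locale.ROOT);
        // Plain SELECTs are safe on the read-only server...
        if (s.startsWith("select")) {
            // ...unless they take row locks (e.g. SELECT ... FOR UPDATE),
            // which only work on a server that accepts writes.
            if (s.contains("for update") || s.contains("for share")) {
                return Pool.WRITE;
            }
            return Pool.READ;
        }
        // Everything else (INSERT/UPDATE/DELETE, DDL, anything we cannot
        // classify) conservatively goes to the write pool.
        return Pool.WRITE;
    }
}
```

A real implementation would likely route at the API level (e.g. an explicit read-only query method) rather than by parsing SQL, but the conservative default for ambiguous cases is the key idea.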
Operational Complexity
Operators need to be aware of many of the development aspects too, e.g. knowing that there will be more module -> database server connections overall, which might impact system resources such as ports, firewall rules, etc. Operators need to understand the trade-offs and impacts of these two approaches in order to make a choice for their environment. If they choose to have separate read and write servers, they need to provision multiple servers and choose and run a mechanism for synchronising the data. The mechanism chosen can affect resource usage, the potential for staleness, and so on, so operators need to be aware of that when choosing. They might also choose an option that the system isn't routinely tested with.
Operators need the ability to monitor the synchronisation process and react when it fails (or, possibly more insidiously, slows down). Operators might also need to explain some of this behaviour to users of the system, e.g. when odd things happen because of stale data.
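A lag check an operator might wire into monitoring could compare the replica's last-replayed timestamp against the primary's clock and alert past a threshold. On Postgres, the replica-side value could come from `pg_last_xact_replay_timestamp()`; the sketch below takes the timestamps as parameters, and the class name and threshold are assumptions for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class ReplicationLagMonitor {
    private final Duration threshold;

    public ReplicationLagMonitor(Duration threshold) {
        this.threshold = threshold;
    }

    // Lag is how far the replica's replay position trails the primary's clock.
    // Clamp to zero in case of minor clock skew between the two servers.
    public Duration lag(Instant primaryNow, Instant lastReplayed) {
        Duration d = Duration.between(lastReplayed, primaryNow);
        return d.isNegative() ? Duration.ZERO : d;
    }

    public boolean shouldAlert(Instant primaryNow, Instant lastReplayed) {
        return lag(primaryNow, lastReplayed).compareTo(threshold) > 0;
    }
}
```

A slowly growing lag never trips a "sync failed" check, which is why alerting on a duration threshold matters alongside liveness checks.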
Performance
Performance tests were carried out on the following workflows, and the results show that the R/W split solution significantly improves performance when the database is under high stress:
- Data Import
- Data Import MARC BIB + Check-in-check-out with DB R/W Split
- Check-in-check-out Test Report (Nolana)
Links
- Technical Council Decision Record: DR-000032 - Splitting Database Read/Write Traffics
- Presentation
- RMB v35.0.0 README