This Page is a Work In Progress
Overview
The current auth flow appears to consume more time than we originally hoped or expected. This investigation aims to better understand where the bottlenecks are and to evaluate potential optimizations.
Baseline Measurements
Before evaluating possible performance improvements, I wanted to get a better understanding of where the pain points are. These measurements will also serve as a baseline for comparison purposes as various enhancements are trialed.
Test Environment
- AWS ECS
- Private: nobody else is using this environment except me
- NGINX reverse proxy running in each container
- Single instance of each module
- Goldenrod platform-complete
- Standard sample/reference data loaded
Methodology
- Make a check-out-by-barcode request, adding a testId query argument to easily identify the request in the logs
- Make a check-in-by-barcode request (for the same loan that was just created), also with a testId query argument to identify it in the logs
- Analyze logs
- Find the check-out/check-in request using the testId query argument; from there, get the requestId and find all related internal requests
- From this log entry we obtain the total request time from Okapi's perspective, including all required internal calls to auth, other modules, etc.
- To isolate the time spent on the network from the time spent in mod-authtoken, we dig into the mod-authtoken reverse proxy log.
- Correlate each auth request with a log entry in the mod-authtoken nginx transaction log.
- requestId isn't explicitly logged, but the okapi token is, which contains the requestId
- The transaction time logged here constitutes the time spent in the module
- Derive the network time for each auth call by subtracting the auth module time from the auth request time (a sketch of this correlation and derivation follows this list).
- Aggregate numbers and report a summary for each check-in/out call.
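The sketch below illustrates the correlation and derivation steps. It is illustrative only: the fields available in the mod-authtoken nginx log and the exact JWT claim carrying the requestId are assumptions about the test setup, not confirmed details.

// Illustrative sketch of the log-correlation step. Assumptions (not verified):
//  - the mod-authtoken nginx log format includes the X-Okapi-Token header and
//    the transaction time, and
//  - the Okapi token is a JWT whose payload carries a "request_id" claim
//    (claim name assumed here).
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthLogCorrelator {

  // Extract the requestId by decoding the JWT payload segment of the Okapi token.
  static String requestIdFromToken(String jwt) {
    String payload = new String(Base64.getUrlDecoder().decode(jwt.split("\\.")[1]));
    Matcher m = Pattern.compile("\"request_id\"\\s*:\\s*\"([^\"]+)\"").matcher(payload);
    return m.find() ? m.group(1) : null;
  }

  // Network time for a single auth call = time Okapi spent on the call
  // minus the transaction time reported by mod-authtoken's nginx.
  static double authNetworkMs(double okapiAuthCallMs, double nginxTransactionMs) {
    return okapiAuthCallMs - nginxTransactionMs;
  }
}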
Results
test01
- 10 interleaved calls to check-out-by-barcode and check-in-by-barcode
check-out-by-barcode:
check-in-by-barcode:
Observations
- Some caching is clearly happening in the check-out-by-barcode flow, as evidenced by the number of auth calls: the first request makes 47, while subsequent requests make far fewer (33 or 34).
- I'm not yet sure why the count is sometimes 33 and sometimes 34.
- A similar pattern can be seen in the check-in-by-barcode flow, though it is much less pronounced, presumably because the requests are interleaved: the check-out call happens first and populates the cache, so when the check-in call is made, the cached values are used.
- Network traffic consistently takes up a fair amount of the time spent on auth (~15-20% on average)
- A large portion of the request time is spent on auth (~40-50% on average)
- TBD
Experiments
Description of various experiments and PoCs conducted.
Caching module-to-module tokens
In theory these tokens don't change very frequently, so caching them makes sense.
Considerations
- Currently the requestId is included in these tokens, so pursuing this optimization comes at the cost of that (a cached token would carry a stale requestId). I think it's only there for convenience; it's unclear whether anyone is actually using this information.
- The cache should be shared / synchronized between nodes in the OKAPI cluster
- Is it necessary to provide a mechanism for manually purging and/or pruning the cache?
- Must these tokens be tenant-specific, or can the same token be used by multiple tenants?
Challenges
- Figuring out where to put this in Okapi is a little tricky, but doing it in Okapi clearly gives the maximum benefit: it means skipping the call to mod-authtoken altogether, saving network time as well.
- For my PoC I settled on checking for a cached token when determining the set of API calls needed, and populating the cache when handling the auth response (a rough sketch of the cache shape follows this list). ##TODO: provide reference to PoC branch
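A rough sketch of the shape such a cache could take is below. This is not the PoC code; the class name, key structure, and TTL handling are assumptions. A per-node in-memory map is shown for simplicity, whereas the shared/synchronized cluster-wide cache noted under Considerations would need a distributed store instead.

// Minimal sketch of a module-to-module token cache inside Okapi.
// Names here are hypothetical; the real PoC hooks into Okapi's existing
// request-routing code rather than a standalone class.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ModuleTokenCache {

  private static class Entry {
    final String token;
    final long expiresAtMs;
    Entry(String token, long expiresAtMs) {
      this.token = token;
      this.expiresAtMs = expiresAtMs;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();
  private final long ttlMs;

  public ModuleTokenCache(long ttlMs) {
    this.ttlMs = ttlMs;
  }

  // Key on tenant + target module; whether tokens must be tenant-specific is
  // one of the open questions under Considerations.
  private static String key(String tenant, String moduleId) {
    return tenant + "/" + moduleId;
  }

  public String get(String tenant, String moduleId) {
    Entry e = cache.get(key(tenant, moduleId));
    if (e == null || System.currentTimeMillis() > e.expiresAtMs) {
      return null; // miss or expired: caller falls back to calling mod-authtoken
    }
    return e.token;
  }

  public void put(String tenant, String moduleId, String token) {
    cache.put(key(tenant, moduleId), new Entry(token, System.currentTimeMillis() + ttlMs));
  }

  public void purge() {
    cache.clear(); // manual purge hook, per the consideration above
  }
}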
Theoretical Benefits
- Performance gains would not be seen immediately, as the cache would need to warm up before we could benefit from saving on calls to mod-authtoken
- Once the cache is warmed, the time savings here would potentially be enormous.
- Example:
Current (from baseline measurements):
POST circulation/check-out-by-barcode?testId=co_test02_7
# of auth calls: 33
Total request time: 384.93 ms
Total time on calls to mod-authtoken: 198.59 ms (51.59%)
Total auth module time: 159.00 ms (80.06%)
Total auth network time: 39.59 ms (19.94%)
# Stats for initial auth call (used to calculate theoretical below)
Auth Request Time: 5.92 ms
Auth Module Time: 5.00 ms (84.46%)
Auth Network Time: 0.92 ms (15.54%)
Theoretical:
POST circulation/check-out-by-barcode?testId=co_test02_7
# of auth calls: 1
Total request time: 192.26 ms
Total time on calls to mod-authtoken: 5.92 ms (3.08%)
Total auth module time: 5.00 ms (84.46%)
Total auth network time: 0.92 ms (15.54%)
The single auth call here is the original user's call to check-out-by-barcode. The remaining auth calls are module-to-module; those tokens could be obtained from the cache, removing the need to call mod-authtoken at all.
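For reference, the theoretical total above follows directly from the baseline numbers: subtract the time spent on the module-to-module auth calls (everything except the initial auth call) from the current total.

Theoretical total = current total - module-to-module auth time
                  = 384.93 ms - (198.59 ms - 5.92 ms)
                  = 192.26 ms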
Actual Benefits
A PoC for this is still a work in progress. No actual results have been obtained yet.