This Page is a Work In Progress
Overview
The current auth flow appears to consume more time than we had originally expected. This investigation aims to pin down where the bottlenecks are and to evaluate potential optimizations.
Baseline Measurements
Before evaluating possible performance improvements, I wanted to get a better understanding of where the pain points are. These measurements will also serve as a baseline for comparison purposes as various enhancements are trialed.
Test Environment
- AWS ECS
- Private - nobody else using this except me
- NGINX reverse proxy running in each container
- Single instance of each module
- Goldenrod platform-complete
- Standard sample/reference data loaded
Methodology
- Make a check-out-by-barcode request, adding a testId query argument to make the request easy to identify in the logs
- Make a check-in-by-barcode request (for the same loan that was just created), also with a testId query argument to identify it in the logs
- Analyze logs
- Find the check-out/check-in request using the testId query argument; from there we get the requestId and can find all related/internal requests
- From this log entry we obtain the total request time from okapi's perspective, including all required internal calls to auth, other modules, etc.
- In order to isolate the amount of time spent on network vs time spent in mod-authtoken, we dig into the mod-authtoken reverse proxy log.
- Correlate each auth request with a log entry in the mod-authtoken nginx transaction log.
- requestId isn't explicitly logged, but the okapi token is, which contains the requestId
- The transaction time logged here constitutes the time spent in the module
- Derive the network time for each auth call as the difference between the total time of the auth call and the auth module time (see the sketch after this list)
- Aggregate numbers and report a summary for each check-in/out call.
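As a rough illustration of the correlation and derivation steps above, here is a minimal Java sketch. It assumes the Okapi token is a standard JWT and that the requestId is carried in a request_id claim (the claim name is an assumption of this sketch, not something verified against mod-authtoken); the network time is simply the total time of the auth call as seen by Okapi minus the transaction time reported by the mod-authtoken nginx log.

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AuthLogCorrelation {

    // The Okapi token is a JWT; its payload (the second dot-separated segment)
    // is base64url-encoded JSON. Pulling the requestId out of it lets us match
    // an nginx transaction log entry to the Okapi-side request.
    // NOTE: the "request_id" claim name is an assumption for this sketch.
    static String requestIdFromToken(String okapiToken) {
        String[] parts = okapiToken.split("\\.");
        String payload = new String(Base64.getUrlDecoder().decode(parts[1]), StandardCharsets.UTF_8);
        String marker = "\"request_id\":\"";
        int idx = payload.indexOf(marker);
        if (idx < 0) {
            return null; // claim not present (or named differently)
        }
        int start = idx + marker.length();
        return payload.substring(start, payload.indexOf('"', start));
    }

    // Network time for a single auth call: the total time of the call from
    // Okapi's perspective minus the time spent inside mod-authtoken (the nginx
    // transaction time).
    static double authNetworkTimeMs(double totalAuthCallMs, double authModuleMs) {
        return totalAuthCallMs - authModuleMs;
    }
}

Aggregating these per-call numbers over a whole check-out or check-in request gives the summary figures reported below.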
Results
test01
- 10 interleaved calls to check-out-by-barcode and check-in-by-barcode
check-out-by-barcode:
check-in-by-barcode:
Observations
- You can see that some caching is being done in the check-out-by-barcode flow. This is evident from the number of auth calls: the first request makes 47, while subsequent requests make far fewer (33 or 34).
- I'm not yet sure why the count alternates between 33 and 34.
- A similar pattern can be seen in the check-in-by-barcode flow, though it is much less pronounced, presumably because the requests are interleaved: the checkout call happens first and populates the cache, so the cached values are already in place by the time the check-in call is made.
- Network traffic consistently takes up a fair amount of the time spent on auth (~15-20% on average)
- A large portion of the request time is spent on auth (~40-50% on average)
- TBD
Experiments
Description of various experiments and PoCs conducted.
Caching module-to-module tokens
In theory these tokens change infrequently, so caching them makes sense.
Considerations
- Currently the requestId is included in these tokens, so pursuing this optimization would come at the cost of losing it there. It appears to be included only for convenience; it's unclear whether anyone actually uses this information.
- Should the cache be shared / synchronized between nodes in the OKAPI cluster?
- Is it necessary to provide a mechanism for manually purging and/or pruning the cache?
- Must these tokens be tenant-specific, or can the same token be used by multiple tenants?
Challenges
- Figuring out where to put this in OKAPI is a little tricky, but doing it in OKAPI clearly gives the maximum benefit, since it means skipping the call to mod-authtoken altogether and saving the network time as well.
- For my PoC I settled on checking for a cached token when determining the set of API calls needed, and populating the cache when handling the auth response (sketched below). ##TODO: provide reference to PoC branch
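As a very rough sketch of those two hook points (all class and method names here are hypothetical and only illustrate the idea; this is not actual Okapi code), the PoC logic amounts to something like the following. The key/value shape that was eventually settled on is described under the implementation notes below.

import java.util.Optional;

// Hypothetical illustration of the two PoC hook points; not actual Okapi code.
class AuthCallPlanner {

    // Abstract view of the module token cache; a concrete sketch of the
    // key/value structure appears under the PoC implementation notes below.
    interface ModuleTokenCache {
        Optional<String> lookup(String method, String pathPattern, String userId, String userToken);
        void store(String method, String pathPattern, String userId, String userToken, String moduleToken);
    }

    private final ModuleTokenCache cache;

    AuthCallPlanner(ModuleTokenCache cache) {
        this.cache = cache;
    }

    // Hook 1: while determining the set of API calls needed for a request, a
    // cached module token means the call to mod-authtoken can be dropped from
    // the pipeline, saving both the module time and the network time.
    boolean needsAuthCall(String method, String pathPattern, String userId, String userToken) {
        return cache.lookup(method, pathPattern, userId, userToken).isEmpty();
    }

    // Hook 2: when handling the response from mod-authtoken, remember the
    // returned module token for subsequent identical calls.
    void onAuthResponse(String method, String pathPattern, String userId, String userToken,
                        String returnedModuleToken) {
        cache.store(method, pathPattern, userId, userToken, returnedModuleToken);
    }
}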
Theoretical Benefits
- Performance gains would not be seen immediately, as the cache would need to warm up before we could benefit from saving on calls to mod-authtoken
- Once the cache is warmed, the time savings here would potentially be enormous.
- Example:
Current (from baseline measurements):
POST circulation/check-out-by-barcode?testId=co_test02_7
# of auth calls: 33
Total request time: 384.93 ms
Total time on calls to mod-authtoken: 198.59 ms (51.59%)
Total auth module time: 159.00 ms (80.06%)
Total auth network time: 39.59 ms (19.94%)

# Stats for initial auth call (used to calculate theoretical below)
Auth Request Time: 5.92 ms
Auth Module Time: 5.00 ms (84.46%)
Auth Network Time: 0.92 ms (15.54%)

Theoretical:
POST circulation/check-out-by-barcode?testId=co_test02_7
# of auth calls: 1
Total request time: 192.26 ms
Total time on calls to mod-authtoken: 5.92 ms (3.08%)
Total auth module time: 5.00 ms (84.46%)
Total auth network time: 0.92 ms (15.54%)
The 1 auth call here is the original user's call to check-out-by-barcode. The rest of the auth calls are module-to-module; those tokens could be obtained from the cache, removing the need to call mod-authtoken for them at all. The theoretical total request time is simply the baseline total minus the time spent on calls to mod-authtoken, plus the single remaining auth call (384.93 - 198.59 + 5.92 = 192.26 ms).
PoC Implementation Notes
During implementation of this PoC, I realized that mod-authtoken plays a bigger role than initially expected. More specifically, it sets the X-Okapi-User-Id and X-Okapi-Permissions headers, which are then relayed on the proxy calls OKAPI makes to the actual module. This means these values also need to be cached along with the token, and when a cached token is used they need to be set by OKAPI, since the call to mod-authtoken is skipped. It also means that cache entries are now user-specific. To keep the system responsive to changes in user status (active/inactive, permissions granted/revoked, etc.), the cache entries must have a relatively short TTL.
After a few iterations we settled on the following:
Cache keys are compound, composed of 4 parts: <Method>|<PathPattern (or 'Path' if not defined)>|<X-Okapi-User-Id>|<X-Okapi-Token>
Cache values include: the X-Okapi-Token mod-authtoken would have returned, as well as the values of X-Okapi-User-Id and X-Okapi-Permissions which mod-authtoken would have set.
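A minimal sketch of that key/value structure, assuming a plain in-memory map with lazy TTL expiry (a per-node, non-synchronized cache; whether it should be shared across OKAPI nodes is one of the open considerations above). The names and the TTL value here are illustrative only, not taken from the actual PoC.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the compound cache key and cached value described
// above; names and the TTL value are illustrative, not taken from the PoC.
class CachedAuthEntries {

    // <Method>|<PathPattern (or Path)>|<X-Okapi-User-Id>|<X-Okapi-Token>
    record Key(String method, String pathPattern, String userId, String userToken) { }

    // What mod-authtoken would have returned (the token) and set on the request
    // (user id and permissions), plus the time the entry was stored.
    record Entry(String okapiToken, String okapiUserId, String okapiPermissions, Instant storedAt) { }

    // A short TTL keeps the system responsive to user status and permission
    // changes; 30 seconds is an assumption, not a measured or agreed value.
    private static final Duration TTL = Duration.ofSeconds(30);

    private final Map<Key, Entry> cache = new ConcurrentHashMap<>();

    Entry get(Key key) {
        Entry entry = cache.get(key);
        if (entry == null || entry.storedAt().plus(TTL).isBefore(Instant.now())) {
            cache.remove(key); // drop missing/expired entries lazily
            return null;
        }
        return entry;
    }

    void put(Key key, String okapiToken, String okapiUserId, String okapiPermissions) {
        cache.put(key, new Entry(okapiToken, okapiUserId, okapiPermissions, Instant.now()));
    }
}

On a cache hit, OKAPI itself would set X-Okapi-Token, X-Okapi-User-Id and X-Okapi-Permissions on the proxied module calls, since the call to mod-authtoken that would normally set them is skipped.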
Actual Benefits
A PoC was evaluated by the Performance Task Force (PTF). That effort was tracked in PERF-113. In a test simulating 8 concurrent users, the benefits seen during check-in/check-out were as follows:
Existing Okapi (v3.1.2) and mod-authtoken (v2.5.1)
Average Check in time: 1.024 sec
Average Check out time: 1.812 sec
POC Okapi and mod-authtoken
Average Check in time: 0.729 sec
Average Check out time: 1.364 sec
% Differences:
Check in: 41% better in POC version
Check out: 33% better in POC version
mod-authtoken CPU utilization is 4 times better with caching. See the attached screenshot and Excel file for more details.
Next Steps
- Port the token cache PoC to Okapi 4 -
- Add metrics for tracking cache events (hits/misses/etc.) -
- Re-test with this version of the PoC
- Create stories (OKAPI/MODAT) for the implementation of this optimization -
- Sort out other implementation details - See Considerations above.