EUREKA-755 - Design/PoC solution to the "too many roles" problem
Overview
https://folio-org.atlassian.net/browse/EUREKA-755
Objective: Find a solution to the “too many roles” problem, and flesh out stories for its implementation.
Background
We’ve discovered that the size of access tokens increases with the number of roles assigned to a user. At some point, the x-okapi-token header becomes so large that requests are rejected (e.g. at the load balancer or a reverse proxy).
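To illustrate the growth, the sketch below builds a JWT-style payload and measures its base64url-encoded size as roles are added. The claim layout (`realm_access.roles`), the placeholder `sub` value, and the generated role names are assumptions for illustration only. Real tokens carry more claims plus a header and signature, so actual sizes will be larger.

```python
import base64
import json


def encoded_payload_size(role_names):
    """Return the base64url-encoded size (in bytes) of a JWT-style payload."""
    payload = {
        "sub": "00000000-0000-0000-0000-000000000000",  # placeholder user id
        "realm_access": {"roles": list(role_names)},     # assumed claim layout
    }
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    # JWT segments are base64url encoded without padding.
    return len(base64.urlsafe_b64encode(raw).rstrip(b"="))


def sample_roles(count, name_length=30):
    """Generate hypothetical role names, padded to a fixed length."""
    return [f"role-{i:04d}".ljust(name_length, "x") for i in range(count)]


if __name__ == "__main__":
    for count in (0, 10, 50, 100):
        size = encoded_payload_size(sample_roles(count))
        print(count, "roles ->", size, "bytes")
```

Each additional role adds its name plus a few bytes of JSON punctuation, and base64 encoding inflates that by roughly a third, so both the number of roles and the length of their names matter.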
Problem Statement
We need to find an approach to reduce the size of access tokens in the case where numerous roles are assigned to a user.
Scope
Formally enumerate the various approaches, identifying pros/cons and relative effort of each.
Creating one or more proofs of concept to help evaluate feasibility and identify pitfalls is optional, but encouraged.
The following should be considered when evaluating each approach:
feasibility
amount of effort
performance
maintainability
impact on system operators
architectural fit (i.e. is the solution consistent with existing folio architecture?)
resource consumption (e.g. will the solution require significantly more/less resources than other solutions?)
migration path
Proposed Solutions
The following options came out of brainstorming and have only been discussed at a high level. As such, confidence in some of them is lower than in others. Additional investigation, prototyping, etc. is required before implementation stories can be created.
Provide Guidance
Document best practices for role creation, including the use of shorter names, and tips to help find the appropriate/ideal granularity for roles.
Candidate for short term mitigation
Lightweight Access Tokens
Configure the necessary clients in keycloak to use lightweight access tokens, which do not include a list of roles assigned to the user.
Requires changes to keycloak configuration
Requires changes to the sidecar - call a different authz/RPT endpoint
Candidate for long term solution
Questions:
What adjustments are required on the keycloak side to support this (lightweight/opaque access tokens)?
Is this only applicable to the <tenant>_applicationclient, or are other clients involved here?
Performance concerns - keycloak has to do more work
Caching can likely help - both on the keycloak and sidecar sides.
Increased load on keycloak -> higher operating cost?
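As a rough illustration of the caching idea on the sidecar side, the sketch below keeps authorization decisions for a short time so that repeated requests do not each trigger a call to keycloak. This is not the sidecar's actual implementation: the class name, the `fetch_decision` callable, and the TTL value are all hypothetical.

```python
import time


class TtlDecisionCache:
    """Caches authorization decisions briefly to spare calls to keycloak."""

    def __init__(self, fetch_decision, ttl_seconds=60.0, clock=time.monotonic):
        # fetch_decision is a hypothetical callable:
        # (user_id, permission) -> bool, standing in for the authz/RPT call.
        self._fetch = fetch_decision
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (user_id, permission) -> (expires_at, decision)

    def is_allowed(self, user_id, permission):
        key = (user_id, permission)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # still fresh, no remote call needed
        decision = self._fetch(user_id, permission)
        self._entries[key] = (now + self._ttl, decision)
        return decision
```

The trade-off is staleness: after a role change, a cached decision may remain in effect for up to the TTL, so the value would need to be chosen with that in mind.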
Synthetic Role Per User
Behind the scenes, create a single (keycloak) role for each user which contains all capabilities assigned to the user via their (folio) roles. Here a given user might be assigned numerous roles in folio, but from keycloak’s perspective they are only assigned to one role. Therefore, the access tokens will only have this one synthetic role listed, avoiding the token scaling problem.
Candidate for long term solution
Scalability concerns? The number of roles in keycloak grows linearly with the number of users. If you have thousands of users in a given realm, you will also have thousands of roles for that realm.
Updating a role in folio may result in many roles needing to be updated on the keycloak side
If a Folio role is assigned to many users, adjusting that role requires adjusting many roles on the keycloak side. What happens if one of the many updates fails? This could get messy.
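The fan-out concern can be made concrete with a small sketch. It models the synthetic role as the union of capabilities from a user's Folio roles, and shows which synthetic roles would need rewriting when one Folio role changes. The data structures, function names, and sample data are hypothetical and do not reflect how Folio or keycloak store roles.

```python
def build_synthetic_roles(user_roles, role_capabilities):
    """Map each user to the union of capabilities granted by their Folio roles."""
    synthetic = {}
    for user, roles in user_roles.items():
        capabilities = set()
        for role in roles:
            capabilities |= set(role_capabilities.get(role, ()))
        synthetic[user] = capabilities
    return synthetic


def users_affected_by(role, user_roles):
    """Users whose synthetic role must be rewritten when `role` changes."""
    return sorted(user for user, roles in user_roles.items() if role in roles)


if __name__ == "__main__":
    # Hypothetical sample data.
    role_capabilities = {
        "circulation": {"loans.view", "loans.edit"},
        "cataloging": {"instances.view", "instances.edit"},
    }
    user_roles = {
        "alice": ["circulation", "cataloging"],
        "bob": ["circulation"],
    }
    print(build_synthetic_roles(user_roles, role_capabilities))
    print(users_affected_by("circulation", user_roles))
```

Every user returned by `users_affected_by` represents one keycloak update, and any of those updates can fail independently, which is where the need for retry or reconciliation logic would come from.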
Map Folio Roles to Keycloak Roles w/ Short Auto-generated names
Here, roles will continue to be named however users/admins see fit on the Folio side, but when creating the corresponding role in keycloak a short, system-generated name would be used. Since the role name length factors into the access token size, using short names should help significantly.
Doesn’t actually solve the underlying issue, just mitigates it. If we can support perhaps 40-50 roles per user today, short names might raise that limit to something like 90-100.
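One possible way to generate such names is a counter encoded in base 36, which keeps names to a few characters even for many roles. The sketch below is an assumption about how this could work, not a proposed design: the `r` prefix, the class name, and the in-memory mapping are hypothetical, and a real implementation would need to persist the mapping.

```python
import string

_ALPHABET = string.digits + string.ascii_lowercase  # 36 symbols


def to_base36(number):
    """Encode a non-negative integer using digits and lowercase letters."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(_ALPHABET[remainder])
    return "".join(reversed(digits))


class ShortRoleNames:
    """Assigns a stable, short keycloak-side name to each Folio role name."""

    def __init__(self, prefix="r"):
        self._prefix = prefix
        self._by_folio_name = {}  # Folio role name -> short name

    def short_name_for(self, folio_role_name):
        short = self._by_folio_name.get(folio_role_name)
        if short is None:
            # The next counter value is the number of names issued so far.
            short = self._prefix + to_base36(len(self._by_folio_name))
            self._by_folio_name[folio_role_name] = short
        return short
```

With this scheme, the first 1,296 roles in a realm get names of at most three characters regardless of how long the Folio-side names are.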
Enforce Limits on Role Name Length
Presently, there are no restrictions on the length of role names. Since the role name length factors into the access token size, using short names should help significantly. Enforcing a maximum would allow us to mitigate the problem as well as make it easier to perform some validation (see next option).
Doesn’t actually solve the underlying issue, just mitigates it.
Candidate for short term mitigation
Problem Detection During Role Assignment
Perform some validation during assignment, ensuring that a user isn't assigned too many roles. Calculating the size of the access token can be tricky: the number of roles and the length of their names are only part of the equation. Having a maximum role name length (see previous option) would probably help make this calculation/estimation easier, but there will still be some guesswork.
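A worst-case estimate along these lines might look like the sketch below. It assumes ASCII role names, a JSON array of quoted names, and base64 encoding of the payload. The function names, the fixed overhead figure, and the size budget are hypothetical placeholders that would have to be measured against real tokens and the actual header limits in a deployment.

```python
def worst_case_token_size(role_count, max_role_name_length, fixed_overhead_bytes):
    """Estimate an upper bound on the encoded token size, in bytes.

    fixed_overhead_bytes stands for everything other than the role list
    (other claims, header, signature) and is an assumed, measured value.
    """
    # In JSON, each role costs its name plus two quotes and a comma.
    raw = fixed_overhead_bytes + role_count * (max_role_name_length + 3)
    # base64 turns every 3 raw bytes into 4 characters (rounded up).
    return (raw * 4 + 2) // 3


def can_assign(role_count, max_role_name_length, fixed_overhead_bytes, budget_bytes):
    """True if the worst-case estimate stays within the size budget."""
    estimate = worst_case_token_size(
        role_count, max_role_name_length, fixed_overhead_bytes
    )
    return estimate <= budget_bytes


if __name__ == "__main__":
    # Assumed figures: 50-character name limit, 300 bytes of other content,
    # and an 8192-byte budget for the header.
    for count in (50, 100, 120):
        print(count, can_assign(count, 50, 300, 8192))
```

Because the estimate uses the maximum name length for every role, it errs on the side of rejecting assignments that would in fact fit, which is the safer direction for this kind of check.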
Some combination of the options above
Candidate for short term mitigation
Conclusion
The “Lightweight Access Tokens” approach is the leading contender so far. A POC is needed to determine feasibility, identify potential pitfalls, determine amount of work, etc.
Open Questions
What does the migration path look like?