[FOLIO-2563] SPIKE: propose prevention of DoS via CQL query Created: 14/Apr/20  Updated: 13/Jul/21  Resolved: 04/Jun/20

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: Jakub Skoczen Assignee: Mikhail Fokanov
Resolution: Done Votes: 0
Labels: platform-backlog, security
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
blocks FOLIO-2524 Security Audit raised issues Open
Relates
relates to RMB-505 allow to specify batch upload (PUT/PO... Open
relates to RMB-615 cancel queries that take longer than ... Closed
relates to RMB-534 Reject CQL queries that match no inde... Open
relates to MODINV-283 Inventory search hangs Closed
relates to MODINVSTOR-510 Inventory filtering by Item status is... Closed
relates to RMB-677 Close PostgreSQL connection after inv... Closed
relates to RMB-638 Revert PgUtil.streamGet parameter reo... Closed
Sprint: CP: sprint 87, CP: sprint 90, CP: sprint 88, CP: sprint 89
Story Points: 5
Development Team: Core: Platform

 Description   

Description

Several application endpoints accessible by normal users perform CQL (Common Query Language) queries leveraging CQL-Java. CQL is a formal language for representing queries to information retrieval systems such as web indexes, bibliographic catalogs and museum collection information.1 The application can be forced to perform an OR expression over objects that match a certain patterns. As a result, the CQL engine is forced to do very expensive queries to search and parse the result of the operations. Although it appears that the full result of the operation is filtered downstream when returned to the user, It was noticed that each request to the target endpoint would take around 29 seconds to respond back. An attacker can exploit this by performing a denial of service attack by requesting several GET requests simultaneously or in short intervals, resulting in the exhaustion of the server’s memory resources.

Steps to reproduce

Perform the following request with the thread count set to 20 and throttle time set to 1 milliseconds. Substitute the value of X-Okapi-Token header with a token belonging to a user that has inventory.instances.collection.get permission.

GET /inventory/instances?limit=1000&query=%28keyword%20all%20%22ncc%22%29%20or%201=1
urldecoded query: (keyword all "ncc") or 1=1
Users can use the Inventory front-end "Query search" slot to enter any CQL: https://folio-snapshot.aws.indexdata.com/inventory?qindex=querySearch&sort=Title

Observe that all inventory related endpoints will become unresponsive

Acceptance criteria

Propose an approach that will prevent heavy queries to cause a module to become unresponsive. Provide a PoC implementation for review.

Potential solutions:

Query validation based:

  • RMB: validate query length
  • RMB: validate query structure (max number of OR expression)
  • RMB: validate if query field and relation is backed up by an index (in schema.json, reject if not)
  • RMB: validate known problematic searches (e.g cql.allRecords)
  • RMB: wide truncation queries

Query execution based:

  • Can Postgres analyze query runtime?
  • Can Postgres limit the runtime for a SELECT
  • If not, can we timeout a connection for long running queries


 Comments   
Comment by Craig McNally [ 16/Apr/20 ]

Which ever approach we take, we need to be very careful not to break existing queries throughout the system. For instance, if you start limiting query length or the number of OR'd terms, queries that ask for a bunch of specific records via UUID might break

Comment by Julian Ladisch [ 16/Apr/20 ]

We may use pg_cancel_backend to cancel a long-running query. (Don't use pg_terminate_backend as it may result in an inconsistent state.)
https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-SIGNAL

Comment by Mikhail Fokanov [ 07/May/20 ]

queryTimeout parameter can be specified (https://vertx.io/docs/vertx-mysql-postgresql-client/java/) for postgres connection in getPostgreSQLClientConfig. In such case we can have 2 pools of connections: without execution time restriction and without it.
If the client breaks the connection PostgreSQL server will set the interrupt flag. Next time the query checks for interrupts as it executes it will see the flag and abort. Sometimes a query might be doing CPU-heavy work within code that doesn't check for interrupts.

Comment by Marc Johnson [ 07/May/20 ]

Mikhail Fokanov

In such case we can have 2 pools of connections: without execution time restriction and without it.

What would the two different pools be used for? I'm guessing the time-limited pool would be used for queries. What would the non-time-limited pool be used for?

Comment by Mikhail Fokanov [ 07/May/20 ]

There could be still some requests from another modules, that should take more than 5 sec by design. For them it could be used. I'm not aware is it the case for now.

Comment by Julian Ladisch [ 07/May/20 ]

The vertx-mysql-postgresql-client has the queryTimeout connection parameter: https://vertx.io/docs/vertx-mysql-postgresql-client/java/#_configuration
RMB-246 Closed switches to the vertx-pg-client: https://vertx.io/docs/vertx-pg-client/java/#_configuration
Has vertx-pg-client a query timeout parameter?

A vertx-pg-client connection can cancel the query using the cancelRequest method ( https://vertx.io/docs/vertx-pg-client/java/#_cancelling_request ):

connection
  .query("SELECT pg_sleep(20)")
  .execute(ar -> { ... });
connection
  .cancelRequest(ar -> { ... });
Comment by Mikhail Fokanov [ 11/May/20 ]

Julian Ladisch, I see. In such case, would it be efficient to implement thread in PostgresClient, which will once per 5 seconds verify whether there are requests that live for more than 5 seconds and kill them? In such case any request will be terminated within range of [5,10] seconds. Long running requests (by design) would be added to that list.

There also could be separate url prefix for internal REST calls, which is not visible outside of okapi. Requests for such urls can have long-running queries. Also commands, that use not standard RMB CQL query processing would be not added to the list.

Comment by Julian Ladisch [ 11/May/20 ]

https://github.com/vert-x3/vertx-mysql-postgresql-client says: "This client is deprecated - use instead https://github.com/eclipse-vertx/vertx-sql-client", vertx-sql-client contains https://github.com/eclipse-vertx/vertx-sql-client/tree/master/vertx-pg-client

Comment by Marc Johnson [ 03/Jul/20 ]

This issue is closed, could someone please summarise what decision was made.

Comment by Julian Ladisch [ 11/Aug/20 ]

Solution: RMB-615 Closed gives RMB based modules the option to set a time limit, RMB cancels a query that exceeds the time limit.
For each RMB based module that should cancel long running queries a new issue should be created. Example: https://folio-org.atlassian.net/browse/MODINVSTOR-496

This CQL based denial of service attack has low severity: It requires a valid staff login.
We also know that there will always be some legitimate queries that take long. Any member of staff can cause a denial of service by running multiple long running queries, however, this is unlikely. If we get reports that this actually happens we may consider other possibilities, for example implementing a DoS detection that automatically disables the login that causes the DoS.

Generated at Thu Feb 08 23:21:34 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.