POC: Make data-export horizontally scalable



PURPOSE:

Jira issue: MDEXP-90

Explore approaches for horizontal scaling of the mod-data-export module.


There are three options for making the module horizontally scalable:

  • Store file with UUIDs in the AWS S3 bucket;
  • Store file with UUIDs in the DB;
  • Solve this issue at the infrastructure layer.

OPTIONS:

Store file with UUIDs in the AWS S3 bucket.

 Advantages:          

  1. This approach requires only a few code changes:

             - save the file with UUIDs in the S3 bucket instead of the file system;

             - add the logic for removing files from the S3 bucket to the existing cleaning mechanism;

     There are 2 ways of cleaning files from S3:

         1. Create a lifecycle configuration for the folder in the bucket on the AWS side. The disadvantages of this method: we get no confirmation that the files were removed successfully, and when I tried to create these configs in my S3 bucket, it looked like it may cost money.

            Apart from that, the expiration can only be set in whole days, so a file can be deleted only after it has been in S3 for at least one day.

         2. Add a call that removes the file from S3 to our cleaning mechanism. We can log the response to see whether the file was removed successfully. We can use the DELETE method that is described in this documentation.
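The two cleaning approaches above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `lifecycle_rule`, `InMemoryStore` and `clean_expired` are hypothetical names, the in-memory store only stands in for a real S3 client wrapper, and the rule dictionary follows the S3 lifecycle configuration shape as I understand it (prefix filter plus `Expiration.Days`).

```python
import logging
from datetime import datetime, timedelta, timezone

log = logging.getLogger("export-cleanup")


def lifecycle_rule(prefix: str, days: int) -> dict:
    """Way 1: a lifecycle rule that expires objects under `prefix`.

    The expiration is a whole number of days, so a file cannot be
    expired sooner than one day after it was uploaded.
    """
    if not isinstance(days, int) or days < 1:
        raise ValueError("expiration must be a whole number of days >= 1")
    return {
        "Rules": [{
            "ID": f"expire-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        }]
    }


class InMemoryStore:
    """Hypothetical stand-in for an S3 wrapper: key -> upload time."""

    def __init__(self, objects: dict):
        self.objects = dict(objects)

    def list_objects(self):
        return list(self.objects.items())

    def delete(self, key: str) -> bool:
        # A real wrapper would send the S3 DELETE request here
        # and derive the result from the response.
        return self.objects.pop(key, None) is not None


def clean_expired(store, now: datetime, max_age: timedelta):
    """Way 2: delete files older than `max_age` and log each outcome."""
    removed, failed = [], []
    for key, uploaded_at in store.list_objects():
        if now - uploaded_at < max_age:
            continue  # still fresh, keep it
        if store.delete(key):
            log.info("Removed %s from storage", key)
            removed.append(key)
        else:
            log.warning("Could not remove %s from storage", key)
            failed.append(key)
    return removed, failed


now = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
store = InMemoryStore({
    "uuids/old.csv": now - timedelta(hours=3),
    "uuids/new.csv": now - timedelta(minutes=10),
})
removed, failed = clean_expired(store, now, timedelta(hours=1))
```

With the second way the cleaning job controls the retention period itself (one hour in this example), and every removal leaves a log entry, which the lifecycle rule cannot provide.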

 Disadvantages:

  1. When I saved a file with 110 thousand UUIDs to S3 by calling the /data-export/file-definitions/{UUID}/upload endpoint, it took 7 min 30 sec on my local machine.
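To put this measurement in perspective, here is a quick back-of-the-envelope calculation. It assumes the file holds one UUID per line in the canonical 36-character text form; the actual file layout is not stated above, so the size is only an estimate.

```python
UUID_COUNT = 110_000
BYTES_PER_LINE = 36 + 1          # assumed: canonical UUID text plus a newline
ELAPSED_SECONDS = 7 * 60 + 30    # 7 min 30 sec

file_size_bytes = UUID_COUNT * BYTES_PER_LINE
uuids_per_second = UUID_COUNT / ELAPSED_SECONDS
bytes_per_second = file_size_bytes / ELAPSED_SECONDS

print(f"{file_size_bytes / 1_000_000:.2f} MB")    # 4.07 MB
print(f"{uuids_per_second:.0f} UUIDs/s")          # 244 UUIDs/s
print(f"{bytes_per_second / 1000:.1f} kB/s")      # 9.0 kB/s
```

Under that assumption the file is only about 4 MB, so the measured upload corresponds to roughly 9 kB/s.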

Store file with UUIDs in the DB.

 Advantages:

  1. This approach is much faster than the approach with the S3 bucket.

 Disadvantages:

  1. This approach requires a lot of changes, for example: creating a table in the DB and changing the code to save the file there;
  2. The periodic cleaning job needs to be upgraded to remove old objects from the DB. In the future, we are planning to provide a rerun mechanism for jobs, so this step will be removed;
  3. We will possibly save large files to the DB, which can slightly degrade the overall performance of the database instance.
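The changes listed above (a new table, saving the file there, and cleaning old rows) could look roughly like the sketch below. It uses Python's built-in `sqlite3` purely as a stand-in for the module's real database, and the table name, column names and function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical table: one row per uploaded file with UUIDs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_definition_content (
    file_definition_id TEXT PRIMARY KEY,
    content            TEXT NOT NULL,
    created_at         REAL NOT NULL   -- seconds since the epoch
)
"""


def save_file(conn, file_definition_id, uuids, now):
    """Store the uploaded UUIDs in the DB instead of the local file system."""
    conn.execute(
        "INSERT INTO file_definition_content VALUES (?, ?, ?)",
        (file_definition_id, "\n".join(uuids), now.timestamp()),
    )
    conn.commit()


def load_uuids(conn, file_definition_id):
    """Read the UUIDs back; any module instance can do this."""
    row = conn.execute(
        "SELECT content FROM file_definition_content WHERE file_definition_id = ?",
        (file_definition_id,),
    ).fetchone()
    return row[0].split("\n") if row else []


def remove_older_than(conn, cutoff):
    """Periodic cleaning step: drop files saved before `cutoff`."""
    cur = conn.execute(
        "DELETE FROM file_definition_content WHERE created_at < ?",
        (cutoff.timestamp(),),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

now = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
save_file(conn, "fd-old", ["uuid-1", "uuid-2"], now - timedelta(days=2))
save_file(conn, "fd-new", ["uuid-3"], now)

deleted = remove_older_than(conn, now - timedelta(days=1))
```

Because every instance reads from the same database, it does not matter which instance received the upload; the cost is that file contents now live in the DB and the cleaning job has to remove them.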

Solve the issue at the infrastructure layer.

This means that the hoster creates a configuration that makes some folders sharable between modules. For example, if the hoster deploys our FOLIO application in a Kubernetes environment, sharable folders can be configured with Persistent Volumes.

This approach is still under discussion; we can use it only if the FOLIO community approves it. You can find the description of this approach in this draft document.

Advantages:

  1. We do not need to make any code changes, we just indicate which folder should be sharable, and the performance is better than in the first two options.

Disadvantages:

  1. The hoster should make the folder sharable between modules.
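The effect of a sharable folder can be illustrated with a small simulation. It is only a sketch: `ExportInstance` and its methods are hypothetical, and a temporary directory plays the role of the folder that the hoster would mount into every instance.

```python
import tempfile
from pathlib import Path


class ExportInstance:
    """One running copy of the module; all copies get the same `shared_dir`."""

    def __init__(self, name, shared_dir):
        self.name = name
        self.shared_dir = Path(shared_dir)

    def _path(self, file_definition_id):
        return self.shared_dir / f"{file_definition_id}.csv"

    def upload(self, file_definition_id, uuids):
        # Same code as with a local folder: just write the file.
        path = self._path(file_definition_id)
        path.write_text("\n".join(uuids) + "\n", encoding="utf-8")
        return path

    def read_uuids(self, file_definition_id):
        return self._path(file_definition_id).read_text(encoding="utf-8").split()


with tempfile.TemporaryDirectory() as shared:
    instance_a = ExportInstance("instance-a", shared)
    instance_b = ExportInstance("instance-b", shared)

    # The upload lands on one instance...
    instance_a.upload("fd-1", ["uuid-1", "uuid-2"])
    # ...and the export job may run on another one.
    result = instance_b.read_uuids("fd-1")
```

The file-handling code stays the same as for a local folder; only the location of the folder changes, which is why this option needs no code changes in the module.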

CONCLUSION:

If the FOLIO community agrees with the third option, we will use it; otherwise, we will use the approach of saving the files in the DB.