SPIKE:[MDEXP-46]: Investigate File Export Storage options

Story: MDEXP-46

Options that were considered:


Local File Storage:

Generated files could be stored in the local file system. This approach is already planned for file uploading, where Instance UUIDs are uploaded.

Pros:

  • Easy to implement; we already have examples from data-import

Cons:

  • Ephemeral - the generated files will be lost if the hosting environment is restarted
  • Space constraints - we can only generate files up to a fixed size
  • Stateless - if more than one instance of the module is running, it is difficult to keep track of where the files were generated
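
As a minimal sketch of this option (the directory and file names here are hypothetical, not the module's actual paths), writing a generated export file with java.nio looks like:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalExportSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical export directory; in practice this would come from module configuration
        Path exportDir = Files.createTempDirectory("mod-data-export");
        Path exportFile = exportDir.resolve("instance-uuids.csv");

        // Write the generated export content
        Files.write(exportFile, "uuid1\nuuid2\n".getBytes());

        System.out.println(Files.exists(exportFile)); // true
        // Note: the file is gone if the host's storage is wiped on restart (the
        // "ephemeral" con above), and nothing records which instance wrote it
    }
}
```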


Postgres DB

PostgreSQL provides two distinct ways to store binary data: in a table column of type bytea, or via the Large Object facility.

Bytea:

  • not well suited for storing very large amounts of binary data
  • can hold up to 1 GB of binary data
  • processing such a large value would require a huge amount of memory

Large Object:

  • better suited to storing very large values
  • can store up to 4 TB
  • has some security issues, since anyone connected to the database can view and/or modify any Large Object, even without permission to view/update the row containing the Large Object reference

Processing (bytea):

CREATE TABLE files (filename text, filedata bytea);

File file = new File("myTxt.txt");
// try-with-resources closes the stream and statement even if the insert fails
try (FileInputStream fis = new FileInputStream(file);
     PreparedStatement ps = conn.prepareStatement("INSERT INTO files VALUES (?, ?)")) {
    ps.setString(1, file.getName());
    ps.setBinaryStream(2, fis, file.length());
    ps.executeUpdate();
}

Processing (Large Object):

CREATE TABLE fileslo (filename text, filesoid oid);

// All Large Object API calls must be within a transaction block
conn.setAutoCommit(false);

// Get the Large Object Manager to perform operations with
LargeObjectManager lobj = ((org.postgresql.PGConnection) conn).getLargeObjectAPI();
int oid = lobj.create(LargeObjectManager.READ | LargeObjectManager.WRITE);
LargeObject obj = lobj.open(oid, LargeObjectManager.WRITE);

// Now open the file
File file = new File("myTxt.txt");
FileInputStream fis = new FileInputStream(file);

// Copy the data from the file to the large object
byte buf[] = new byte[2048];
int s, tl = 0;
while ((s = fis.read(buf, 0, 2048)) > 0) {
    obj.write(buf, 0, s);
    tl += s;
}

// Close the large object
obj.close();

// Now insert the row into fileslo
PreparedStatement ps = conn.prepareStatement("INSERT INTO fileslo VALUES (?, ?)");
ps.setString(1, file.getName());
ps.setInt(2, oid);
ps.executeUpdate();
ps.close();
fis.close();
conn.commit();


Pros:

  • Postgres is already the DB used by FOLIO, so we could reuse it to store the files
  • Since the DB is already in use, no additional infrastructure or maintenance is necessary

Cons:

  • bytea is limited to 1 GB per value and requires a large amount of memory to process big files
  • Large Objects can store up to 4 TB but carry the security issue described above


AWS S3:

S3 can be used to store the files, and links to the files can be stored in the logs. There is also precedent: the edge modules already use AWS as one of several implementations (https://github.com/folio-org/edge-common#secure-stores), and there have been no major complaints.

Pros:

  • Easy to use, and an SDK is available
  • No size limit
  • Easy to download data from

Cons:

  • Specific to the hosting implementation


Generic Implementation

This can be explored in another spike; we could look into other document-store DBs or object stores.

Also, to support multiple clouds, we could look at Apache jclouds (https://jclouds.apache.org/).


Conclusion:

For the immediate future, we can create a generic layer that writes to local file storage if no configuration is specified and uses AWS S3 as one of the implementations. A configuration file can be used to pass the credentials required for S3.
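
A minimal sketch of what that generic layer could look like, assuming a hypothetical FileStorage interface and a "storage.type" configuration key (all names here are illustrative, not an agreed design):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// One interface, pluggable backends chosen by configuration
interface FileStorage {
    void save(String key, byte[] content) throws IOException;
}

// Default backend when no configuration is supplied: local file storage
class LocalFileStorage implements FileStorage {
    private final Path baseDir;
    LocalFileStorage(Path baseDir) { this.baseDir = baseDir; }
    public void save(String key, byte[] content) throws IOException {
        Files.write(baseDir.resolve(key), content);
    }
}

class StorageFactory {
    // An S3FileStorage implementing the same interface (backed by the AWS SDK,
    // with credentials read from the configuration file) would be registered here
    static FileStorage fromConfig(Map<String, String> config, Path localDir) {
        String type = config.getOrDefault("storage.type", "local");
        if ("local".equals(type)) {
            return new LocalFileStorage(localDir);
        }
        throw new IllegalArgumentException("Unsupported storage type: " + type);
    }
}

public class GenericLayerSketch {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("export");
        // No configuration given, so the factory falls back to local storage
        FileStorage storage = StorageFactory.fromConfig(Map.of(), dir);
        storage.save("export.csv", "uuid1\n".getBytes());
        System.out.println(Files.exists(dir.resolve("export.csv"))); // true
    }
}
```

The point of the factory is that callers never reference a concrete backend, so switching from local storage to S3 is purely a configuration change.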

In the future, we can either explore more options based on hosting needs to arrive at a generic implementation, or, if RMB is upgraded to use the Vert.x PG client, use that to stream data to the Postgres DB.



References:

https://www.postgresql.org/docs/7.4/jdbc-binary-data.html