See slack discussion: https://folio-project.slack.com/archives/CA39M62BZ/p1706283473779509
Automating the new Data Import with Splitting of Large Files
Frances Webb @ Cornell University
This process starts with a filename, the file contents, and the exact name of the appropriate job profile.
1. Create Upload Definition
POST to /data-import/uploadDefinitions
json of {"fileDefinitions": [{"name":(filename),"size":(size of file in KB)}]}
returns:
201 and a hash of a newly created upload definition. Keep the id, and the file definition id.
The size should be in KB, not bytes! It is also passed in as an int, not a string.
I did not attempt to automate the ability to include multiple files in one upload
batch. With the new capability of splitting larger import jobs, this is less likely to
be useful. Some later steps will need to be iterated if sending multiple files.
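As a rough illustration, here is a minimal Python sketch of this step using the requests library.
The okapi_url, tenant/token headers, and file_path are placeholder assumptions, not values from this
writeup; the later sketches reuse these variables.
  import os
  import requests

  okapi_url = "https://okapi.example.edu"        # placeholder Okapi base URL
  headers = {
      "x-okapi-tenant": "mytenant",              # placeholder tenant
      "x-okapi-token": "secret-token",           # placeholder auth token
      "Content-Type": "application/json",
  }

  file_path = "records.mrc"                      # placeholder file to import
  filename = os.path.basename(file_path)
  size_kb = os.path.getsize(file_path) // 1024   # size in KB (not bytes), as an int

  resp = requests.post(
      f"{okapi_url}/data-import/uploadDefinitions",
      headers=headers,
      json={"fileDefinitions": [{"name": filename, "size": size_kb}]},
  )
  resp.raise_for_status()                        # expect 201
  upload_def = resp.json()
  ud_id = upload_def["id"]                       # upload definition id
  file_id = upload_def["fileDefinitions"][0]["id"]   # file definition id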
2. Request upload URL
GET /data-import/uploadUrl?filename=(filename)
returns:
200 and a hash containing a url, key, and uploadId. All three will be needed.
The URL points to S3, but contains all the necessary credentials so it can be treated as
a generic upload link.
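Continuing the same sketch (okapi_url, headers, and filename as assumed in the step 1 sketch):
  resp = requests.get(
      f"{okapi_url}/data-import/uploadUrl",
      headers=headers,
      params={"filename": filename},
  )
  resp.raise_for_status()              # expect 200
  url_info = resp.json()
  s3_url = url_info["url"]             # pre-signed upload URL
  s3_key = url_info["key"]
  s3_upload_id = url_info["uploadId"]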
3. Upload the file
PUT to URL from step 2
with header: Content-Length: (size of file contents in bytes)
with body: (file contents)
returns:
200; be sure to grab the contents of the ETag header from the response.
No need for any multipart encoding in the body as per a file upload from a web form. Just
the raw byte data of the file.
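A sketch of the upload itself; the body is just the raw bytes, and the pre-signed URL needs none of
the Okapi headers:
  with open(file_path, "rb") as f:
      contents = f.read()

  resp = requests.put(
      s3_url,
      data=contents,                                  # raw file bytes, no multipart encoding
      headers={"Content-Length": str(len(contents))},
  )
  resp.raise_for_status()                             # expect 200
  etag = resp.headers["ETag"]                         # arrives already wrapped in quotes; keep as-is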
4. Request the file be assembled for import
POST to /data-import/uploadDefinitions/(ud_id)/files/(file_id)/assembleStorageFile
json of {"uploadId": (uploadId), "key": (key), "tags": [(etag)]}
returns:
204
In the URL, the ud_id and the file_id are the upload definition id and the file definition id
from the upload definition.
The uploadId and key are from step 2, and the tag is from step 3. The tag will already be
quoted in the data, and will end up quoted again in the compiled json. Looks weird but works.
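In the running sketch, the assemble request might look like this (ud_id and file_id from step 1,
s3_upload_id and s3_key from step 2, etag from step 3):
  resp = requests.post(
      f"{okapi_url}/data-import/uploadDefinitions/{ud_id}/files/{file_id}/assembleStorageFile",
      headers=headers,
      json={"uploadId": s3_upload_id, "key": s3_key, "tags": [etag]},
  )
  resp.raise_for_status()    # expect 204; the etag keeps its embedded quotes inside the json string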
5. Get fresh copy of upload definition
GET /data-import/uploadDefinitions/(ud_id)
returns:
200 and a hash of the upload definition
The new copy of the upload definition reflects the status of the upload as ready to import.
Keep the whole thing to send back with a later request.
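Refreshing the upload definition in the sketch; I am assuming here that the assembled file's
sourcePath (used for monitoring in steps 8 and 9) sits on the file definition:
  resp = requests.get(
      f"{okapi_url}/data-import/uploadDefinitions/{ud_id}",
      headers=headers,
  )
  resp.raise_for_status()                    # expect 200
  upload_def = resp.json()                   # keep the whole hash for step 7
  # assumption: sourcePath appears on the file definition after assembly
  source_path = upload_def["fileDefinitions"][0]["sourcePath"]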
6. Get the job profile information
GET /data-import-profiles/jobProfiles?query=name=="(url-encoded profile name)"
returns:
200 and a list of job profile hashes containing (hopefully) just the one record.
The double equals is an exact-match query, so it shouldn't retrieve another profile whose name merely
contains the name you're looking for.
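A sketch of the profile lookup; profile_name is a placeholder, requests handles the url-encoding of
the query, and the response is assumed to come back wrapped in a jobProfiles array:
  profile_name = "My MARC Bib import profile"          # placeholder for the exact profile name
  resp = requests.get(
      f"{okapi_url}/data-import-profiles/jobProfiles",
      headers=headers,
      params={"query": f'name=="{profile_name}"'},     # requests url-encodes this for us
  )
  resp.raise_for_status()                              # expect 200
  job_profile = resp.json()["jobProfiles"][0]          # hopefully exactly one match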
7. Launch import processing
POST to /data-import/uploadDefinitions/(ud_id)/processFiles
json of {"uploadDefinition": (upload definition hash),
"jobProfileInfo": {'id': (profile id), "name": (profile name), "dataType": (profile dataType)}}
returns:
204
The ud_id is again the id from the upload definition. The whole upload definition should be included in the
submitted json, but only the three needed fields from the job profile. Attaching the whole profile, though
it does contain these three fields, will result in an error.
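In the sketch, launching the job with the whole upload definition but only the three job profile
fields:
  resp = requests.post(
      f"{okapi_url}/data-import/uploadDefinitions/{ud_id}/processFiles",
      headers=headers,
      json={
          "uploadDefinition": upload_def,              # the entire hash from step 5
          "jobProfileInfo": {                          # only these three fields, not the whole profile
              "id": job_profile["id"],
              "name": job_profile["name"],
              "dataType": job_profile["dataType"],
          },
      },
  )
  resp.raise_for_status()                              # expect 204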
Optional
8. Monitor job until complete
GET /metadata-provider/jobExecutions?statusNot=DISCARDED&uiStatusAny=PREPARING_FOR_PREVIEW&
uiStatusAny=READY_FOR_PREVIEW&uiStatusAny=RUNNING&limit=50&sortBy=completed_date,desc&
subordinationTypeNotAny=COMPOSITE_CHILD
returns:
list of hashes of all running data import jobs
I lifted this query directly from the DI UI; it's the query the UI makes to monitor any running
jobs. There may be room for improvement. I iterate through the list of running jobs to find any
with a sourcePath that matches the sourcePath from my upload definition hash. If none is found,
the job is not running which probably indicates it is done. (Unless you have more than 50 running
jobs, but that seems implausible.)
Having found the correct running job, you can log progress by pulling the job execution's "progress"
hash. From the progress hash, "current" of "total" records have been imported.
When the job has just started, there may not be a progress hash yet. In that case the
"totalRecordsInFile" gives you the record count, so 0 of "totalRecordsInFile" have been imported.
When a job is very, very new, the "totalRecordsInFile" may also be absent. In that case, the job
is still launching.
I had good success checking for the job every five seconds and reporting progress according to
the above description.
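A polling loop along those lines, as a sketch: the query string is copied verbatim from above, the
response is assumed to be wrapped in a jobExecutions array, and source_path comes from the step 5
sketch.
  import time

  running_query = (
      "statusNot=DISCARDED&uiStatusAny=PREPARING_FOR_PREVIEW"
      "&uiStatusAny=READY_FOR_PREVIEW&uiStatusAny=RUNNING&limit=50"
      "&sortBy=completed_date,desc&subordinationTypeNotAny=COMPOSITE_CHILD"
  )

  while True:
      resp = requests.get(
          f"{okapi_url}/metadata-provider/jobExecutions?{running_query}",
          headers=headers,
      )
      resp.raise_for_status()
      running = resp.json().get("jobExecutions", [])
      job = next((j for j in running if j.get("sourcePath") == source_path), None)
      if job is None:
          break                                        # no matching running job; assume it finished
      progress = job.get("progress")
      if progress:
          print(f"{progress['current']} of {progress['total']} records imported")
      elif "totalRecordsInFile" in job:
          print(f"0 of {job['totalRecordsInFile']} records imported")
      else:
          print("job is still launching")
      time.sleep(5)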
9. Confirm completed job
GET /metadata-provider/jobExecutions?statusAny=COMMITTED&statusAny=ERROR&statusAny=CANCELLED&
profileIdNotAny=d0ebb7b0-2f0f-11eb-adc1-0242ac120002&
profileIdNotAny=91f9b8d6-d80e-4727-9783-73fb53e3c786&
fileNameNotAny=No%20file%20name&limit=40&sortBy=completed_date,desc&
subordinationTypeNotAny=COMPOSITE_PARENT
returns:
list of recent but no longer running job executions
Again, I lifted this query from what the UI was doing, and it includes arguments I don't understand.
Because import jobs are divided into 1k record chunks by the new import system, I increased the
limit arg from 25 to 40 to make sure I'd get all the parts reported on particularly large jobs.
Even in smaller jobs, the sourcePath field will have been modified to include a job part number
(e.g. *_1). I was able to find all of the execution records relevant to my job by comparing the
first 40 characters of the execution record's sourcePath with the first 40 characters of the
upload definition's sourcePath.
Having found part(s) of the job, the total number of parts can be read from totalJobParts in any one
of them, to confirm that all parts are represented. The status value COMMITTED is a success status,
though I don't know a lot about what to look for if there are errors. For each successful part
identified, we can pull the progress->total count and sum these across parts to report back a total
count of successful record imports. If that number doesn't match the number of records submitted,
the automated job can know to raise flags.
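Finally, a sketch of the confirmation step under the same assumptions (query copied verbatim,
jobExecutions wrapper, source_path from step 5); the check at the end is only a guess at what a
script might flag.
  finished_query = (
      "statusAny=COMMITTED&statusAny=ERROR&statusAny=CANCELLED"
      "&profileIdNotAny=d0ebb7b0-2f0f-11eb-adc1-0242ac120002"
      "&profileIdNotAny=91f9b8d6-d80e-4727-9783-73fb53e3c786"
      "&fileNameNotAny=No%20file%20name&limit=40"
      "&sortBy=completed_date,desc&subordinationTypeNotAny=COMPOSITE_PARENT"
  )

  resp = requests.get(
      f"{okapi_url}/metadata-provider/jobExecutions?{finished_query}",
      headers=headers,
  )
  resp.raise_for_status()
  executions = resp.json().get("jobExecutions", [])

  # match the parts of our job on the first 40 characters of sourcePath
  parts = [e for e in executions if e.get("sourcePath", "")[:40] == source_path[:40]]

  if parts:
      total_parts = parts[0].get("totalJobParts", 1)
      committed = [p for p in parts if p.get("status") == "COMMITTED"]
      imported = sum(p["progress"]["total"] for p in committed if p.get("progress"))
      if len(parts) < total_parts or len(committed) < len(parts):
          print("warning: not all job parts are present and committed")
      print(f"{imported} records imported successfully")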