Creating data import jobs with the API when splitting is enabled

See Slack discussion: https://folio-project.slack.com/archives/CA39M62BZ/p1706283473779509

Automating the new Data Import with Splitting of Large Files
Frances Webb @ Cornell University

These steps assume you are starting with a filename, the file contents, and the exact name of the appropriate job profile.

1. Create Upload Definition

   POST to /data-import/uploadDefinitions
     json of {"fileDefinitions": [{"name":(filename),"size":(size of file in KB)}]}
   returns:
     201 and a hash of the newly created upload definition. Keep the upload definition id
     and the file definition id.

   The size should be in KB, not bytes! It is also passed in as an int, not a string.

   I did not attempt to automate the ability to include multiple files in one upload
   batch. With the new capability of splitting larger import jobs, this is less likely to
   be useful. Some later steps will need to be iterated if sending multiple files.
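
   A minimal Python sketch of this call, using the requests library. The okapi_url and headers
   arguments (carrying the usual Okapi tenant/token headers) are assumed to be set up elsewhere,
   and the response field names follow the description above rather than a verified schema.

      import math
      import requests

      def create_upload_definition(okapi_url, headers, filename, size_bytes):
          """POST the upload definition; returns (upload_def, ud_id, file_id)."""
          # The size is reported in KB as an int, not in bytes.
          body = {"fileDefinitions": [{"name": filename,
                                       "size": math.ceil(size_bytes / 1024)}]}
          resp = requests.post(f"{okapi_url}/data-import/uploadDefinitions",
                               json=body, headers=headers)
          resp.raise_for_status()                        # expect 201
          upload_def = resp.json()
          return (upload_def,
                  upload_def["id"],
                  upload_def["fileDefinitions"][0]["id"])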

2. Request upload URL

   GET /data-import/uploadUrl?filename=(filename)
   returns:
     200 and a hash containing a url, key, and uploadId. All three will be needed.

   The URL points to S3, but contains all the necessary credentials so it can be treated as
   a generic upload link.
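
   A sketch of this request under the same assumptions as the step 1 sketch; the url, key, and
   uploadId field names come from the description above.

      import requests

      def request_upload_url(okapi_url, headers, filename):
          """GET a pre-signed S3 upload URL; returns (url, key, upload_id)."""
          resp = requests.get(f"{okapi_url}/data-import/uploadUrl",
                              params={"filename": filename}, headers=headers)
          resp.raise_for_status()                        # expect 200
          data = resp.json()
          return data["url"], data["key"], data["uploadId"]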
     

3. Upload the file

   PUT to URL from step 2
    with header: Content-Length: (size of file contents in bytes)
    with body: (file contents)
    returns:
      200; be sure to grab the contents of the ETag header in the response

   No multipart encoding is needed in the body, as you would use for a file upload from a web
   form. Just send the raw byte data of the file.
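
   A sketch of the upload itself. Note that no Okapi headers are sent; the pre-signed URL carries
   its own credentials. The explicit Content-Length header mirrors the description above, though
   the requests library would also compute it from the byte string.

      import requests

      def upload_file(upload_url, file_contents):
          """PUT the raw bytes to the pre-signed URL; returns the ETag header value."""
          resp = requests.put(upload_url, data=file_contents,
                              headers={"Content-Length": str(len(file_contents))})
          resp.raise_for_status()                        # expect 200
          return resp.headers["ETag"]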

4. Request the file be assembled for import

   POST to /data-import/uploadDefinitions/(ud_id)/files/(file_id)/assembleStorageFile
     json of {"uploadId": (uploadId), "key": (key), "tags": [(etag)]}
    returns:
      204

   In the URL, the ud_id and the file_id are the upload definition id and the file definition id
   from the upload definition.
   The uploadId and key are from step 2, and the tag is from step 3. The ETag value will already
   be quoted in the data, and will end up quoted again in the compiled json. It looks weird, but
   it works.

5. Get fresh copy of upload definition

   GET /data-import/uploadDefinitions/(ud_id)
   returns:
     200 and a hash of the upload definition

   The new copy of the upload definition reflects the status of the upload as ready to import.
   Keep the whole thing to send back with a later request.
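
   A sketch of the re-fetch; the returned hash is held onto for step 7.

      import requests

      def get_upload_definition(okapi_url, headers, ud_id):
          """Fetch the refreshed upload definition and return the whole hash."""
          resp = requests.get(f"{okapi_url}/data-import/uploadDefinitions/{ud_id}",
                              headers=headers)
          resp.raise_for_status()                        # expect 200
          return resp.json()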

6. Get the job profile information

   GET /data-import-profiles/jobProfiles?query=name=="(url-encoded profile name)"
    returns:
      200 and a list of job profile hashes containing (hopefully) just the one record.

   The double equals is an exact-match query, so it shouldn't retrieve another profile whose name
   merely contains the profile name you're looking for.
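
   A sketch of the profile lookup. The requests library handles the URL encoding of the query,
   and the jobProfiles collection field name is an assumption based on the usual FOLIO response
   shape.

      import requests

      def get_job_profile(okapi_url, headers, profile_name):
          """Look up a job profile by exact name; returns the single matching hash."""
          resp = requests.get(f"{okapi_url}/data-import-profiles/jobProfiles",
                              params={"query": f'name=="{profile_name}"'},
                              headers=headers)
          resp.raise_for_status()                        # expect 200
          profiles = resp.json()["jobProfiles"]
          if len(profiles) != 1:
              raise ValueError(f"expected exactly one profile named {profile_name!r}, "
                               f"found {len(profiles)}")
          return profiles[0]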

7. Launch import processing

   POST to /data-import/uploadDefinitions/(ud_id)/processFiles
     json of {"uploadDefinition": (upload definition hash),
              "jobProfileInfo": {'id': (profile id), "name": (profile name), "dataType": (profile dataType)}}
    returns:
      204

   The ud_id is again the id from the upload definition. The whole upload definition should be included in the
   submitted json, but only the three needed fields from the job profile. Attaching the whole profile, though
   it does contain these three fields, will result in an error.
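
   A sketch of launching the import, sending the whole upload definition but only the three
   needed profile fields.

      import requests

      def process_files(okapi_url, headers, ud_id, upload_def, profile):
          """Launch import processing for the assembled file."""
          body = {"uploadDefinition": upload_def,
                  # Only these three fields; attaching the whole profile errors out.
                  "jobProfileInfo": {"id": profile["id"],
                                     "name": profile["name"],
                                     "dataType": profile["dataType"]}}
          resp = requests.post(
              f"{okapi_url}/data-import/uploadDefinitions/{ud_id}/processFiles",
              json=body, headers=headers)
          resp.raise_for_status()                        # expect 204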

Optional

8. Monitor job until complete

   GET /metadata-provider/jobExecutions?statusNot=DISCARDED&uiStatusAny=PREPARING_FOR_PREVIEW&
        uiStatusAny=READY_FOR_PREVIEW&uiStatusAny=RUNNING&limit=50&sortBy=completed_date,desc&
        subordinationTypeNotAny=COMPOSITE_CHILD
    returns:
      list of hashes of all running data import jobs

   I lifted this query directly from the DI UI, where it is used to monitor any running jobs.
   There may be room for improvement. I iterate through the list of running jobs to find any
   with a sourcePath that matches the sourcePath from my upload definition hash. If none is found,
   the job is not running, which probably indicates it is done. (Unless you have more than 50 running
   jobs, but that seems implausible.)

   Once you find the correct running job, you can log progress by pulling the job execution's
   "progress" hash. From the progress hash, "current" of "total" records have been imported.

   When the job has just started, there may not be a progress hash yet. In that case the
   "totalRecordsInFile" gives you the record count, so 0 of "totalRecordsInFile" have been imported.

   When a job is very, very new, the "totalRecordsInFile" may also be absent. In that case, the job
   is still launching.

   I had good success polling for the job every five seconds and reporting progress according to
   the above description.
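
   A sketch of that polling loop. The source_path argument comes from the refreshed upload
   definition (step 5), and the jobExecutions collection field name is an assumption based on
   the usual FOLIO response shape.

      import time
      import requests

      RUNNING_PARAMS = {
          "statusNot": "DISCARDED",
          "uiStatusAny": ["PREPARING_FOR_PREVIEW", "READY_FOR_PREVIEW", "RUNNING"],
          "limit": 50,
          "sortBy": "completed_date,desc",
          "subordinationTypeNotAny": "COMPOSITE_CHILD",
      }

      def monitor_job(okapi_url, headers, source_path, poll_seconds=5):
          """Poll the running-jobs query until our sourcePath no longer appears."""
          while True:
              resp = requests.get(f"{okapi_url}/metadata-provider/jobExecutions",
                                  params=RUNNING_PARAMS, headers=headers)
              resp.raise_for_status()
              jobs = resp.json().get("jobExecutions", [])
              mine = [j for j in jobs if j.get("sourcePath") == source_path]
              if not mine:
                  return                                 # not running; probably done
              job = mine[0]
              progress = job.get("progress")
              if progress:
                  print(f'{progress["current"]} of {progress["total"]} records imported')
              elif "totalRecordsInFile" in job:
                  print(f'0 of {job["totalRecordsInFile"]} records imported')
              else:
                  print("job is still launching")
              time.sleep(poll_seconds)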

9. Confirm completed job

   GET /metadata-provider/jobExecutions?statusAny=COMMITTED&statusAny=ERROR&statusAny=CANCELLED&
        profileIdNotAny=d0ebb7b0-2f0f-11eb-adc1-0242ac120002&
        profileIdNotAny=91f9b8d6-d80e-4727-9783-73fb53e3c786&
        fileNameNotAny=No%20file%20name&limit=40&sortBy=completed_date,desc&
        subordinationTypeNotAny=COMPOSITE_PARENT
    returns:
      list of recent but no longer running job executions

   Again, I lifted this query from what the UI was doing, and it includes arguments I don't understand.
   Because import jobs are divided into 1k-record chunks by the new import system, I increased the
   limit arg from 25 to 40 to make sure I'd get all the parts reported for particularly large jobs.

   Even in smaller jobs, the sourcePath field will have been modified to include a job part number
   (e.g. *_1). I was able to find all of the execution records relevant to my job by comparing the
   first 40 characters of the execution record's sourcePath with the first 40 characters of the
   upload definition's sourcePath.

   Once you find the part(s) of the job, the total number of parts can be read from totalJobParts
   in any one of them, which lets you confirm that all parts are represented. The status value
   COMMITTED is a success status, though I don't know a lot about what to look for if there are
   errors. For each successful part identified, we can pull the progress->total count to get the
   number of imported records for that part, then sum them to report back a total count of
   successful record imports. If that number doesn't match the number of records submitted, the
   automated job knows to raise a flag.
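
   A sketch of that confirmation check, under the same assumptions as the step 8 sketch;
   records_submitted is the count you already know from the file you sent.

      import requests

      FINISHED_PARAMS = {
          "statusAny": ["COMMITTED", "ERROR", "CANCELLED"],
          "profileIdNotAny": ["d0ebb7b0-2f0f-11eb-adc1-0242ac120002",
                              "91f9b8d6-d80e-4727-9783-73fb53e3c786"],
          "fileNameNotAny": "No file name",
          "limit": 40,
          "sortBy": "completed_date,desc",
          "subordinationTypeNotAny": "COMPOSITE_PARENT",
      }

      def confirm_job(okapi_url, headers, source_path, records_submitted):
          """Check that every part of the (possibly split) job committed, and count imports."""
          resp = requests.get(f"{okapi_url}/metadata-provider/jobExecutions",
                              params=FINISHED_PARAMS, headers=headers)
          resp.raise_for_status()
          executions = resp.json().get("jobExecutions", [])
          # Parts carry a suffix on sourcePath (e.g. *_1), so compare 40-character prefixes.
          parts = [j for j in executions
                   if j.get("sourcePath", "")[:40] == source_path[:40]]
          if not parts:
              raise RuntimeError("no completed job parts found")
          if len(parts) != parts[0]["totalJobParts"]:
              raise RuntimeError("not all job parts are reported yet")
          if any(p["status"] != "COMMITTED" for p in parts):
              raise RuntimeError("one or more job parts did not commit")
          imported = sum(p["progress"]["total"] for p in parts)
          if imported != records_submitted:
              raise RuntimeError(f"imported {imported} of {records_submitted} submitted records")
          return imported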