Creating data import jobs with the API when splitting is enabled
See slack discussion: https://folio-project.slack.com/archives/CA39M62BZ/p1706283473779509
Automating the new Data Import with Splitting of Large Files
Frances Webb @ Cornell University
Start with a filename, the file contents, and the exact name of the appropriate job profile.
1. Create Upload Definition
  POST to /data-import/uploadDefinitions
   json of {"fileDefinitions": [{"name":(filename),"size":(size of file in KB)}]}
  returns:
   201 and a hash of the newly created upload definition. Keep the id and the file definition id.
  The size should be in KB, not bytes! It is also passed as an int, not a string.
  I did not attempt to automate the ability to include multiple files in one upload
  batch. With the new capability of splitting larger import jobs, this is less likely to
  be useful. Some later steps will need to be iterated if sending multiple files.
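  A minimal sketch of this step in Python with the requests library, assuming a standard Okapi
  header setup; OKAPI_URL, the tenant, the token, and the filename are placeholders, and the
  field names read from the response follow the description above:
    import os
    import requests

    OKAPI_URL = "https://folio-okapi.example.edu"                     # placeholder
    HEADERS = {"x-okapi-tenant": "mytenant", "x-okapi-token": "..."}  # placeholders

    filename = "import.mrc"                                           # placeholder
    size_kb = max(1, os.path.getsize(filename) // 1024)               # int, KB not bytes

    resp = requests.post(
        f"{OKAPI_URL}/data-import/uploadDefinitions",
        headers=HEADERS,
        json={"fileDefinitions": [{"name": filename, "size": size_kb}]},
    )
    resp.raise_for_status()                               # expect 201
    upload_def = resp.json()
    ud_id = upload_def["id"]                              # upload definition id
    file_id = upload_def["fileDefinitions"][0]["id"]      # file definition id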
2. Request upload URL
  GET /data-import/uploadUrl?filename=(filename)
  returns:
   200 and a hash containing a url, key, and uploadId. All three will be needed.
  The URL points to S3, but contains all the necessary credentials so it can be treated as
  a generic upload link.
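  Continuing the sketch (OKAPI_URL, HEADERS, and filename as in step 1); the three fields are
  read from the response as described above:
    import requests

    resp = requests.get(
        f"{OKAPI_URL}/data-import/uploadUrl",
        headers=HEADERS,
        params={"filename": filename},
    )
    resp.raise_for_status()                    # expect 200
    upload_target = resp.json()
    s3_url = upload_target["url"]              # presigned S3 URL
    s3_key = upload_target["key"]
    s3_upload_id = upload_target["uploadId"]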
3. Upload the file
  PUT to URL from step 2
  with header: Content-Length: (size of file contents in bytes)
  with body: (file contents)
  returns:
   200. Be sure to grab the contents of the ETag header in the response.
  There's no need for the multipart encoding a web-form file upload would use; just send the
  raw byte data of the file, as in the sketch below.
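  Continuing the sketch; the presigned URL (s3_url) comes from step 2:
    import requests

    with open(filename, "rb") as fh:           # filename from step 1
        raw = fh.read()

    # PUT the raw bytes to the presigned URL; no Okapi headers and no multipart
    # encoding are needed.
    resp = requests.put(s3_url, data=raw, headers={"Content-Length": str(len(raw))})
    resp.raise_for_status()                    # expect 200
    etag = resp.headers["ETag"]                # keep for step 4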
4. Request the file be assembled for import
  POST to /data-import/uploadDefinitions/(ud_id)/files/(file_id)/assembleStorageFile
   json of {"uploadId": (uploadId), "key": (key), "tags": [(etag)]}
  returns:
   204
  In the URL, the ud_id and the file_id are the upload definition id and the file definition id
  from the upload definition.
  The uploadId and key are from step 2, and the tag is from step 3. The tag will already be
  quoted in the data, and will end up quoted again in the assembled json. It looks weird, but it works.
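  Continuing the sketch, posting the identifiers gathered in steps 1-3:
    import requests

    # ud_id and file_id are from step 1; s3_upload_id and s3_key from step 2;
    # etag from step 3 (still wrapped in its own quotes, which is expected).
    resp = requests.post(
        f"{OKAPI_URL}/data-import/uploadDefinitions/{ud_id}"
        f"/files/{file_id}/assembleStorageFile",
        headers=HEADERS,
        json={"uploadId": s3_upload_id, "key": s3_key, "tags": [etag]},
    )
    resp.raise_for_status()                    # expect 204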
5. Get fresh copy of upload definition
  GET /data-import/uploadDefinitions/(ud_id)
  returns:
   200 and a hash of the upload definition
  The fresh copy of the upload definition reflects that the upload is ready to import.
  Keep the whole thing to send back with the processFiles request in step 7.
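  Continuing the sketch; this refreshed hash replaces the one captured in step 1:
    import requests

    resp = requests.get(
        f"{OKAPI_URL}/data-import/uploadDefinitions/{ud_id}", headers=HEADERS
    )
    resp.raise_for_status()                    # expect 200
    upload_def = resp.json()                   # keep the whole hash for step 7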
6. Get the job profile information
  GET /data-import-profiles/jobProfiles?query=name=="(url-encoded profile name)"
  returns:
   200 and a list of job profile hashes containing (hopefully) just the one record.
  The double equals is an exact-match query, so it shouldn't retrieve other profiles whose
  names merely contain the profile name you're looking for.
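  Continuing the sketch; the collection key "jobProfiles" is assumed here, so adjust if the
  response is shaped differently:
    import requests

    profile_name = "My job profile"            # placeholder: exact profile name
    resp = requests.get(
        f"{OKAPI_URL}/data-import-profiles/jobProfiles",
        headers=HEADERS,
        params={"query": f'name=="{profile_name}"'},   # requests handles URL encoding
    )
    resp.raise_for_status()                    # expect 200
    profiles = resp.json()["jobProfiles"]      # collection key assumed
    assert len(profiles) == 1, "expected exactly one matching job profile"
    profile = profiles[0]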
7. Launch import processing
  POST to /data-import/uploadDefinitions/(ud_id)/processFiles
   json of {"uploadDefinition": (upload definition hash),
       "jobProfileInfo": {'id': (profile id), "name": (profile name), "dataType": (profile dataType)}}
  returns:
   204
  The ud_id is again the id from the upload definition. The whole upload definition should be included
  in the submitted json, but only the three needed fields from the job profile. Attaching the whole
  profile, even though it does contain these three fields, will result in an error.
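  Continuing the sketch, sending the whole upload definition from step 5 but only the three
  profile fields from step 6:
    import requests

    resp = requests.post(
        f"{OKAPI_URL}/data-import/uploadDefinitions/{ud_id}/processFiles",
        headers=HEADERS,
        json={
            "uploadDefinition": upload_def,    # the full hash from step 5
            "jobProfileInfo": {                # only these three profile fields
                "id": profile["id"],
                "name": profile["name"],
                "dataType": profile["dataType"],
            },
        },
    )
    resp.raise_for_status()                    # expect 204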
Optional
8. Monitor job until complete
  GET /metadata-provider/jobExecutions?statusNot=DISCARDED&uiStatusAny=PREPARING_FOR_PREVIEW&
    uiStatusAny=READY_FOR_PREVIEW&uiStatusAny=RUNNING&limit=50&sortBy=completed_date,desc&
    subordinationTypeNotAny=COMPOSITE_CHILD
  returns:
   list of hashes of all running data import jobs
  I lifted this query directly from the one the Data Import UI makes to monitor any running
  jobs; there may be room for improvement. I iterate through the list of running jobs to find any
  with a sourcePath that matches the sourcePath from my upload definition hash. If none is found,
  the job is not running, which probably indicates it is done. (Unless you have more than 50 running
  jobs, but that seems implausible.)
  Having found the correct running job, you can log progress by pulling the job execution's
  "progress" hash. From the progress hash, "current" of "total" records have been imported.
  When the job has just started, there may not be a progress hash yet. In that case the
  "totalRecordsInFile" gives you the record count, so 0 of "totalRecordsInFile" have been imported.
  When a job is very, very new, the "totalRecordsInFile" may also be absent. In that case, the job
  is still launching.
  I had good success polling for the job every five seconds and reporting progress as described
  above; a sketch follows.
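  A sketch of the polling loop, with the query parameters copied from above; the "jobExecutions"
  collection key and the location of sourcePath inside the upload definition are assumptions, so
  adjust to match the actual hashes:
    import time
    import requests

    # Location of sourcePath within the upload definition is an assumption.
    source_path = upload_def["fileDefinitions"][0]["sourcePath"]

    RUNNING_PARAMS = {
        "statusNot": "DISCARDED",
        "uiStatusAny": ["PREPARING_FOR_PREVIEW", "READY_FOR_PREVIEW", "RUNNING"],
        "limit": 50,
        "sortBy": "completed_date,desc",
        "subordinationTypeNotAny": "COMPOSITE_CHILD",
    }

    while True:
        resp = requests.get(
            f"{OKAPI_URL}/metadata-provider/jobExecutions",
            headers=HEADERS,
            params=RUNNING_PARAMS,             # requests repeats list-valued params
        )
        resp.raise_for_status()
        running = resp.json().get("jobExecutions", [])   # collection key assumed
        job = next((j for j in running if j.get("sourcePath") == source_path), None)
        if job is None:
            break                              # no longer running; presumably done
        progress = job.get("progress")
        if progress:
            print(f"{progress['current']} of {progress['total']} records imported")
        elif "totalRecordsInFile" in job:
            print(f"0 of {job['totalRecordsInFile']} records imported")
        else:
            print("job is still launching")
        time.sleep(5)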
9. Confirm completed job
  GET /metadata-provider/jobExecutions?statusAny=COMMITTED&statusAny=ERROR&statusAny=CANCELLED&
    profileIdNotAny=d0ebb7b0-2f0f-11eb-adc1-0242ac120002&
    profileIdNotAny=91f9b8d6-d80e-4727-9783-73fb53e3c786&
    fileNameNotAny=No%20file%20name&limit=40&sortBy=completed_date,desc&
    subordinationTypeNotAny=COMPOSITE_PARENT
  returns:
   list of recent but no longer running job executions
  Again, I lifted this query from what the UI was doing, and there are arguments I don't understand.
  Because the new import system splits jobs into 1k-record chunks, I increased the limit arg
  from 25 to 40 to make sure I'd get all the parts reported for particularly large jobs.
  Even in smaller jobs, the sourcePath field will have been modified to include a job part number
  (e.g. *_1). I was able to find all of the records relevant to my job by comparing the first 40
  characters of each execution record's sourcePath with the first 40 characters of the upload
  definition's sourcePath.
  Having found part(s) of the job, the total number of parts can be read from totalJobParts in any
  one of them to confirm that all parts are represented. The status value COMMITTED is a success
  status, though I don't know a lot about what to look for if there are errors. For each successful
  part identified, pull the progress->total count to get the number of imported records for that
  part, and sum them to report a total count of successful record imports. If that number doesn't
  match the number of records submitted, the automated job can know to raise flags. A sketch follows.
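  A sketch of the confirmation step, again with the query parameters copied from above and the
  same assumptions about the response shape (source_path as in step 8):
    import requests

    FINISHED_PARAMS = {
        "statusAny": ["COMMITTED", "ERROR", "CANCELLED"],
        "profileIdNotAny": ["d0ebb7b0-2f0f-11eb-adc1-0242ac120002",
                            "91f9b8d6-d80e-4727-9783-73fb53e3c786"],
        "fileNameNotAny": "No file name",
        "limit": 40,
        "sortBy": "completed_date,desc",
        "subordinationTypeNotAny": "COMPOSITE_PARENT",
    }

    resp = requests.get(
        f"{OKAPI_URL}/metadata-provider/jobExecutions",
        headers=HEADERS,
        params=FINISHED_PARAMS,
    )
    resp.raise_for_status()
    finished = resp.json().get("jobExecutions", [])      # collection key assumed

    # Match job parts by the first 40 characters of sourcePath, since the
    # splitting system appends a part suffix (e.g. *_1).
    parts = [j for j in finished
             if j.get("sourcePath", "")[:40] == source_path[:40]]

    if parts:
        total_parts = parts[0].get("totalJobParts", len(parts))
        committed = all(j.get("status") == "COMMITTED" for j in parts)
        imported = sum(j.get("progress", {}).get("total", 0) for j in parts)
        print(f"{len(parts)} of {total_parts} parts found; "
              f"{imported} records imported; all committed: {committed}")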