Spike: Make records with bad data available to the harvester
Mar 16, 2023
As established in the previous spike, once batch processing fails, each record in the batch is processed separately, and the errors are collected (logged) along with the bad data. This spike investigates how to make the bad data and the errors available to the harvester.
There are at least two ways to show errors and bad data to the harvester: add the errors directly to the response, or save them to the DB and display them in the UI on request.
Add errors to the response
Errors can be included in the response after the last record, inside the <ListRecords> tag.
In this case, the errors are appended to <ListRecords> as an additional tag and listed one by one without any reference to the instance ID. Optionally, the instance ID can be included in the <error> tag.
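A minimal sketch of what this variant might look like is given here. It assumes an <errors> wrapper analogous to the one in the per-record example; the `instanceId` attribute name is hypothetical and only illustrates the optional instance reference.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:marc="http://www.loc.gov/MARC21/slim">
  <responseDate>2023-03-15T09:13:19Z</responseDate>
  <request verb="ListRecords" metadataPrefix="marc21_withholdings">http://folio.org/oai</request>
  <ListRecords>
    <record>
      <header>...</header>
      <metadata>...</metadata>
    </record>
    <!-- further records of the batch go here -->
    <!-- Assumed shape: a single <errors> block after the last record -->
    <errors>
      <!-- Plain form: no reference to the instance -->
      <error>Statistical code ID not found: {UUID}</error>
      <!-- Optional form: "instanceId" is a hypothetical attribute name -->
      <error instanceId="{UUID}">The following control character cannot be parsed: {character}</error>
    </errors>
  </ListRecords>
</OAI-PMH>
```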
Alternatively, errors can be included in the response after the <metadata> tag of each record:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd" xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:marc="http://www.loc.gov/MARC21/slim" xmlns:oai-identifier="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <responseDate>2023-03-15T09:13:19Z</responseDate>
  <request verb="ListRecords" metadataPrefix="marc21_withholdings">http://folio.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
        <datestamp>2023-03-14T12:52:38Z</datestamp>
        <setSpec>all</setSpec>
      </header>
      <metadata>
        <marc:record>
          <marc:leader>17171cjm a2201609 a 4500</marc:leader>
          ...
          <marc:datafield tag="999" ind1="f" ind2="f">
            <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
          </marc:datafield>
        </marc:record>
      </metadata>
      <errors>
        <error>Statistical code ID not found: {UUID}</error>
        <error>Some other error</error>
        ...
      </errors>
    </record>
    <record>
      <header>
        <identifier>oai:folio.org:diku/f9f85e47-4673-452b-9ab5-a7f06ec12a7a</identifier>
        <datestamp>2023-03-14T12:52:38Z</datestamp>
        <setSpec>all</setSpec>
      </header>
      <metadata>
        <marc:record>
          <marc:leader>17171cjm a2201609 a 4500</marc:leader>
          ...
          <marc:datafield tag="999" ind1="f" ind2="f">
            <marc:subfield code="s">cb6e660a-beaf-4b98-9882-5f2f49fb0dc3</marc:subfield>
          </marc:datafield>
        </marc:record>
      </metadata>
      <errors>
        <error>The following control character cannot be parsed: {character}</error>
        <error>Some other error</error>
        ...
      </errors>
    </record>
    ...
  </ListRecords>
</OAI-PMH>
In this case, each error is bound to a specific instance, and this is clearly shown to the user.
This approach requires storing each error locally or in memory as it occurs, which may affect performance when there are many errors and a lot of bad data. In addition, every record in the batch has to be processed and validated separately, which increases the response time. However, it is unlikely that a single instance being processed will contain that much bad data or that many errors.
Advantages:
Errors are always shown to the user immediately, which may be useful in some cases
No UI work is needed
Disadvantages:
The response size can grow significantly if there are many errors
Storing errors requires additional memory and affects performance
Save errors into DB and display them on UI side
This approach requires adding a new table to the oai-pmh schema, or using an existing one, to save the errors. It also requires a separate endpoint to retrieve the records from the DB and a new UI page to display the errors. This UI page may look like the following:
Advantages:
A separate thread can be used to save the errors, minimizing the impact on performance
Errors are shown to the user only on request
Disadvantages:
UI work is required
A new endpoint has to be implemented on the back end
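To make the new-table option more concrete, the sketch below shows how such a table could be declared as a Liquibase changeSet. This assumes the schema is managed with Liquibase, which this page does not confirm; the table name `harvest_errors`, every column name, and the changeSet id and author are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog
    xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
        http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-latest.xsd">

  <!-- Hypothetical table: one row per error collected during a harvest -->
  <changeSet id="create-harvest-errors-table" author="example">
    <createTable tableName="harvest_errors">
      <column name="id" type="uuid">
        <constraints primaryKey="true" nullable="false"/>
      </column>
      <!-- Groups the errors that belong to one harvest request -->
      <column name="request_id" type="uuid">
        <constraints nullable="false"/>
      </column>
      <!-- Nullable: not every error can be tied to an instance -->
      <column name="instance_id" type="uuid"/>
      <column name="error_message" type="text">
        <constraints nullable="false"/>
      </column>
      <column name="created_at" type="timestamp" defaultValueComputed="CURRENT_TIMESTAMP">
        <constraints nullable="false"/>
      </column>
    </createTable>

    <!-- The retrieval endpoint would most likely filter by request -->
    <createIndex tableName="harvest_errors" indexName="idx_harvest_errors_request_id">
      <column name="request_id"/>
    </createIndex>
  </changeSet>
</databaseChangeLog>
```

Keeping `instance_id` nullable lets the same table hold both instance-bound errors and errors that cannot be attributed to a single instance.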
Save errors into S3
In this approach, each error is written to a local file as it occurs, and the file is uploaded to S3 at the end of the harvest. A link to the file in S3 lets the user access the error logs and bad data as soon as the harvest is done, for example through the UI or directly via the link. The link can also be appended to the response.
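If the link is appended to the response, the result might look roughly like the fragment below. The `<errorsLink>` element name and its position after the last record are assumptions made for illustration only, and the URL is a placeholder.

```xml
<ListRecords>
  <record>
    <header>...</header>
    <metadata>...</metadata>
  </record>
  <!-- further records of the batch go here -->
  <!-- Hypothetical element carrying the link to the error file in S3 -->
  <errorsLink>{link to the error file in S3}</errorsLink>
</ListRecords>
```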