Investigate approach for harvesting bad data

The bad data currently reported during harvesting includes:

  • string values in UUID fields
  • control characters in MARC fields

1. String values in UUID fields

The investigation was done in the scope of MODOAIPMH-450. The cause of the issue is an incorrect data type encountered during harvesting: the logic expects a UUID but receives a string that is not a UUID. The required data is retrieved by the SQL function get_items_and_holdings_view of mod-inventory-storage, which runs on the database side. The database logs contain errors such as:

2022-08-09 14:31:28 UTC:10.23.44.180(54668):${tenant}_mod_inventory_storage@folio:[831]:ERROR:  invalid input syntax for type uuid: "REF"
2022-08-09 14:31:28 UTC:10.23.44.180(54668):${tenant}_mod_inventory_storage@folio:[831]:CONTEXT:  SQL function "getnatureofcontentname" statement 1
    SQL function "get_items_and_holdings_view" statement 1
2022-08-09 14:31:28 UTC:10.23.44.180(54668):${tenant}_mod_inventory_storage@folio:[831]:STATEMENT:  select * from get_items_and_holdings_view($1,$2);

The function is executed for a set of instance IDs, and it fails for the whole set even if the data of only one instance is invalid. Logging for this kind of issue was implemented in the scope of MODOAIPMH-391. The mod-oai-pmh logs contain errors such as:

2022-08-03T13:20:21,880 ERROR [vert.x-eventloop-thread-2] OaiPmhJsonParser Error position at error part of json is 2
2022-08-03T13:20:21,881 ERROR [vert.x-eventloop-thread-2] OaiPmhJsonParser ERROR: invalid input syntax for type uuid: "REF" (22P02)
2022-08-03T13:20:21,883 ERROR [vert.x-eventloop-thread-2] MarcWithHoldingsRequestHelper Got error response from inventory-storage, uri: 'inventory-hierarchy/items-and-holdings' message: Internal Server Error500 

So it is possible to detect the invalid value ("REF" in the example above). For this kind of issue harvesting stops, because the function fails for the entire set of instances that contains the invalid one.

Possible solutions:

1) Change the UUID casting to string. In that case the incorrect data may be lost, but the database will raise no errors.

2) Create validation scripts to run before harvesting, so that data issues can be found and fixed before the harvest runs. This approach can be combined with the first one: remove the casting to UUID and run validation scripts that check the correctness of the data.

3) Update the current implementation of mod-oai-pmh to execute the database logic (get_items_and_holdings_view) one by one for every instance ID in the set when a harvesting error appears. In that case it will be possible to log the invalid instance and continue harvesting without it. The harvesting time will increase if many instance IDs have invalid data.
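
Solutions 2 and 3 can be illustrated with a minimal Python sketch. It is not the mod-oai-pmh implementation (which is written in Java). The names is_uuid, harvest_with_fallback and fetch_batch are hypothetical; fetch_batch stands in for whatever call runs get_items_and_holdings_view for a list of instance IDs and is assumed to raise an exception when the database reports an error.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Canonical 8-4-4-4-12 hex form only. PostgreSQL also accepts a few other
# spellings (for example without hyphens), so this check is stricter.
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid(value: str) -> bool:
    """Validation helper (solution 2): True if value looks like a UUID."""
    return isinstance(value, str) and _UUID_RE.fullmatch(value) is not None


def harvest_with_fallback(
    instance_ids: List[str],
    fetch_batch: Callable[[List[str]], Iterable[dict]],
) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Fallback (solution 3): try the whole set, then one ID at a time.

    Returns the harvested records and a list of (instance_id, error message)
    pairs for the instances that had to be skipped.
    """
    try:
        # Fast path: one database call for the whole set.
        return list(fetch_batch(instance_ids)), []
    except Exception:
        records: List[dict] = []
        failed: List[Tuple[str, str]] = []
        # Slow path: isolate the invalid instance(s) and keep the rest.
        for instance_id in instance_ids:
            try:
                records.extend(fetch_batch([instance_id]))
            except Exception as exc:
                failed.append((instance_id, str(exc)))
        return records, failed
```

The slow path runs only when the batch call fails, so a clean set still costs a single database call. A set with invalid data costs one extra call per instance ID, which is the harvesting-time increase mentioned in solution 3.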

2. Control characters in MARC fields

During harvesting, the following errors were found in the mod-oai-pmh logs:

1645139143830,"2022-02-17T23:05:43,829 ERROR [vert.x-eventloop-thread-4] MarcWithHoldingsRequestHelper Error occurred while converting record to xml representation: The byte array cannot be converted to JAXB object response.."
1645139143830,java.lang.IllegalStateException: The byte array cannot be converted to JAXB object response.
1645139143830,	at org.folio.oaipmh.ResponseConverter.bytesToObject(ResponseConverter.java:164) ~[ms.jar:?]
1645139143830,	at org.folio.oaipmh.helpers.AbstractHelper.buildOaiMetadata(AbstractHelper.java:481) ~[ms.jar:?]

The issue was investigated in the scope of MODOAIPMH-396. Records cannot be converted to XML because they contain control characters.
The investigation found character codes such as &#1, &#7, &#19, &#20, &#21, &#24, &#25, &#27 and many others. All of them are control characters, which the XML converter cannot handle.
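
The effect can be reproduced outside the module with any XML 1.0 parser. The sketch below uses Python's standard xml.etree.ElementTree rather than the JAXB converter from the stack trace, so it only illustrates the same XML 1.0 rule. The helper name is_well_formed and the subfield element are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if an XML 1.0 parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


clean = "<subfield>Title</subfield>"
raw_control = "<subfield>Ti\x07tle</subfield>"   # raw BEL character (code 7)
escaped_control = "<subfield>Ti&#7;tle</subfield>"  # same character as a reference

print(is_well_formed(clean))
print(is_well_formed(raw_control))
print(is_well_formed(escaped_control))
```

Both the raw control character and its numeric character reference are rejected, so escaping the character does not make the record valid XML 1.0.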

The following solutions were proposed in the scope of MODOAIPMH-402:

1) Skip records that contain control characters. Provide logs with record id and information about the field that has data with control characters.

2) Remove control characters from the record before converting to XML.

3) Currently harvesting produces XML output with XML version 1.0. The XML 1.0 specification does not allow most control characters. The XML 1.1 specification (https://www.w3.org/TR/2006/REC-xml11-20060816/, section on character ranges) changes this: control characters are allowed, but their use is discouraged. A spike is needed to investigate whether harvesting output can be produced as XML 1.1 with control characters.
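
Solutions 1 and 2 both depend on finding the offending characters. A minimal Python sketch, assuming only the C0 control characters that XML 1.0 forbids need handling (codes 0-8, 11, 12 and 14-31; tab, line feed and carriage return are allowed). The function names are hypothetical and do not come from mod-oai-pmh.

```python
import re
from typing import List, Tuple

# C0 control characters that are not legal in XML 1.0.
# Tab (9), line feed (10) and carriage return (13) are deliberately excluded.
_XML10_FORBIDDEN_C0 = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def find_control_chars(text: str) -> List[Tuple[int, int]]:
    """Solution 1 support: list (position, character code) for each offender,
    so a skipped record can be logged with details about the bad field."""
    return [(m.start(), ord(m.group())) for m in _XML10_FORBIDDEN_C0.finditer(text)]


def strip_control_chars(text: str) -> str:
    """Solution 2: remove the forbidden characters before XML conversion."""
    return _XML10_FORBIDDEN_C0.sub("", text)
```

For example, find_control_chars("ab\x01c\x1b") reports positions 2 and 4 with codes 1 and 27, and strip_control_chars on the same input returns "abc". Stripping changes the record content silently, so logging the findings first keeps the removal traceable.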