ARCH-30 Architectural notes on Authority control reporting for Orchid/Poppy releases
Jira links
- ARCH-30
Overview
The system needs to support reporting that lets a cataloger see updates to linked bibliographic fields and authorities over a period of time (see Phase 2: Reporting updates - Technical Designs and Decisions - FOLIO Wiki).
As noted in the related document (Technical approach for update MARC Bib fields controlled by related Authority records - Technical Designs and Decisions - FOLIO Wiki), entities are updated in two phases: first the entity itself is updated, then its links. Hence two types of reports need to be retrievable by the specified filter criteria.
Scope
This document covers implementation details of the interaction with the Reporting mechanism to retrieve insights on updated authorities and on the linkage of bibliographic record fields for a specific date range:
- interaction with the Reporting mechanism via API to push statistic data (and its structure) and to request a report by specified filter criteria
- filter criteria needed to form the proper report on a user's request
- interaction with mod-exporter to generate the .csv report, and potential changes required
- interaction of mod-exporter with mod-entities-links to retrieve data for the report
- variations of output (.csv file or an error when the request cannot be processed)
- data flow overview
Solution
Flyover diagram
Discussed questions
- Format of Input statistics data for authorities and linked bibs
- success / error result payload on each step
- Actual component details for Reporting Mechanism
- determine interaction via new API of mod-data-export and Reporting mechanism
- Determine retention policy of reports generated and stored to cloud storage
- Determine sufficient / insufficient filtering criteria
- Define actual changeset on mod-data-export and Reporting mechanism
- API to check state on .csv report generation by exportJobId
Implementation details
Reporting mechanism overview
The Reporting mechanism is an abstraction that can be represented, for instance, by the Metadata Provider of the mod-source-record-manager API. Under the hood, journal records are stored in a Postgres DB (journal table).
The current API allows fetching journal records by jobExecutionId.
Option 1: extend the API to fetch journal records by filter criteria (type, date range, status, action, heading fields updated or retained).
Concern: statistic data of linked entities would be aggregated separately and retrieved from a module with a different role.
Recommendation: consider the alternative of storing linkage stats data within mod-entities-links.
Alternative: build the reporting mechanism functionality on the side of mod-entities-links.
The rationale is that the data on bibliographic records linked to authorities, and the actual mapping, reside in the DB of mod-entities-links. Therefore, saving and retrieving stats data at the actual source should reduce performance and storage costs.
Input Data
Schema of reporting stats data on authority and linked bibliographic records updates.
In case of bulk updates of bibliographic records, writes to Reporting mechanism are performed in batch to reduce IO cost on the Database side when persisting new records.
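The batching idea can be illustrated with a minimal sketch. It assumes stats events arrive as an arbitrary iterable and that some callable persists one batch per call; the names `batched`, `persist_stats`, `write_batch` and the default batch size are hypothetical and not part of any FOLIO module.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists holding at most batch_size events each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def persist_stats(
    events: Iterable[T],
    write_batch: Callable[[List[T]], None],  # hypothetical: one DB round trip per call
    batch_size: int = 100,                   # assumed default, not taken from the design
) -> int:
    """Persist events batch by batch and return the number of write calls made."""
    calls = 0
    for batch in batched(events, batch_size):
        write_batch(batch)
        calls += 1
    return calls
```

With 250 events and a batch size of 100, the sketch issues three writes instead of 250, which is the IO saving the paragraph above refers to.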
Database layer
The database layer is represented by two tables.
links_stats_authorities - stores one record per event that modifies an authority (update, creation, deletion). It must be initialized with the data that already exists in instance_link.
(NOTE: avoid duplicating data if the module is re-initialized.)
It contains status, error cause, entity id, date created, action type, data import job id, and authority heading (see the fields from the table described in the related issue).
links_stats_bibs - stores one record per event that modifies the linkage of instances (links of linked bibliographic records added, updated, or removed).
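As an illustration only, the fields listed for links_stats_authorities and the "no duplicates on re-initialization" note could be sketched as follows. The field names, the in-memory dict standing in for the table, and the choice to key the initial seed by entity id are all assumptions made for this example, not the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional
from uuid import UUID


@dataclass(frozen=True)
class AuthorityStatsRecord:
    """Hypothetical shape of one links_stats_authorities row."""
    entity_id: UUID
    action_type: str                 # e.g. "create", "update", "delete"
    status: str                      # e.g. "success" or "fail"
    date_created: datetime
    authority_heading: str
    error_cause: Optional[str] = None
    data_import_job_id: Optional[UUID] = None


def initialize(
    store: Dict[UUID, AuthorityStatsRecord],
    existing: Iterable[AuthorityStatsRecord],
) -> int:
    """Seed the store from existing link data; entities already present are skipped.

    Returns the number of records inserted, so a repeated run returns 0.
    """
    inserted = 0
    for record in existing:
        if record.entity_id not in store:
            store[record.entity_id] = record
            inserted += 1
    return inserted
```

Running `initialize` a second time with the same input inserts nothing, which is one simple way to make module re-initialization idempotent.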
This ER diagram represents the data model for the module.
- authority_data - this table contains the data of authority records stored in the mod-source-record-storage that are or were referenced by the inventory instances. The fields in this table contain only the authority data needed for audit trail and reporting.
- instance_data - this table contains the data of instance records that have references to the authority records in mod-source-record-storage. The fields in this table contain only the instance data needed for audit trail and reporting.
- instance_authority_link - this table is an association between instances and authorities representing many-to-many relations between these entities.
- authority_data_stats - this table acts as an audit trail for authority_data and contains changes applied to the authority_data records.
- instance_data_stats - this table contains records that reflect changes applied to instance records as a consequence of an authority record update.
Reports generation
Exposing report generation requires changes to mod-data-export-spring and mod-data-export-worker so that a new export Job can be created to retrieve authority control reporting data.
After mod-data-export-spring receives a data export Job request via REST API, it stores the Job request and sends export commands to mod-data-export-worker via Kafka.
The mod-data-export-worker module retrieves data from other Folio modules via their REST API and adds it to CSV file parts. Once all required data is retrieved, the worker uploads the file parts to the Folio Object storage.
Once the file is uploaded, the module generates a download URL and sends it back to mod-data-export-spring via Kafka.
Efforts: mod-data-export-worker requires changes to support interaction with mod-entities-links.
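The worker-side step described above (page through a module's REST API, append rows to CSV file parts) can be sketched roughly as below. This is not the mod-data-export-worker implementation: `fetch_page` is a hypothetical stand-in for the call to mod-entities-links, the column names are supplied by the caller, and the upload to object storage is left out.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]


def build_csv_parts(
    fetch_page: Callable[[int, int], List[Row]],  # hypothetical: (offset, limit) -> rows
    columns: List[str],
    page_size: int = 500,
) -> List[str]:
    """Fetch rows page by page and render each page as one CSV file part."""
    parts: List[str] = []
    offset = 0
    while True:
        rows = fetch_page(offset, page_size)
        if not rows:
            break
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
        )
        if offset == 0:
            writer.writeheader()  # only the first part carries the header row
        writer.writerows(rows)
        parts.append(buffer.getvalue())
        if len(rows) < page_size:
            break  # a short page means there is no more data
        offset += page_size
    return parts
```

Concatenating the returned parts in order yields a single well-formed CSV document, so the parts can be uploaded independently and assembled later.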
Addressed Points
- Support reporting that allows a cataloger to know
  - Which Authority headings (1XX) have changed over a period of time
    Feasible by querying the /links/stats API (mod-entities-links) with params startDate, endDate, updatedByTag.
  - Which linked Bibliographic fields failed to update (including the reason why) when the linked Authority 1XX/010 $a was updated over a period of time
    Feasible by querying the /links/stats API (mod-entities-links) with params startDate, endDate, updatedByTag=1XX,010$a, status=error, type=bib.
  - When an authority heading (1XX) is not linked to any bib field over a period of time
    ! It is NOT feasible by interacting with the mod-entities-links schema alone, because only linked authorities are stored there, so authorities that exist outside of mod-entities-links are unknown to it. It is therefore proposed to use mod-search to find such authority records. The prerequisite is the MSEARCH-485 spike, because one more field, representing the number of instance records linked to a particular authority record, must be added to the index structure.
- These reports may be accessible from Inventory app or Authority app or Export Manager or Data export (depends on technical discussion)
  The universal point of access is supposed to be mod-data-export-spring, which interacts with mod-entities-links (Reporting), while mod-data-export-worker handles retention of .csv reports in Cloud Storage.
- These reports may be available as a csv export
  Feasible according to the functionality described in Data export by using Spring Batch (aka Export Manager) - Technical Designs and Decisions - FOLIO Wiki.
- In addition, the Authority app may have additional facets/filters to allow a user
  - To filter Authority records based on whether the record is linked to a bib field/record or not. The prerequisite is the MSEARCH-485 spike, because one more field, representing the number of instance records linked to a particular authority record, must be added to the index structure.
Reporting mechanism API
POST /links/stats (alternative: consuming via Kafka inventory.authority.bib.stats topic)
request body or Events structure
GET /links/stats
parameter name | values | description |
---|---|---|
type | authority, instance | entity type to retrieve. If omitted, stats for both authorities and instances are returned |
status | success, fail | status of the finished action (update, create or remove). If omitted, any status is returned |
startDate | date format | |
endDate | date format | |
actionType | create, update, delete | action performed on linkage for bibs (in case of type instance) or on authorities (in case of type authority) |
linkedOn | string (formatted) | linkage by specific fields or subfields. Can be set as follows: 1XX,010$a,any,none. If omitted, all not linked entities should be returned |
offset | integer | |
limit | integer | |
Examples:
Fetch first 500 unlinked authorities by specified date range:
GET /links/stats?limit=500&type=authority&linkedOn=none&startDate=today&endDate=today
Get all bibliographic records that failed to update for a specific date range:
GET /links/stats?type=instance&status=fail&startDate=05/07/24&endDate=11/07/24
Get all authorities whose heading changed over a period of time:
GET /links/stats?type=authority&linkedOn=1XX&actionType=update&startDate=05/07/24&endDate=11/07/24
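A client such as the export worker could assemble these requests with a small helper. The sketch below is an assumption about how a caller might do it; `build_stats_query` is a hypothetical name, dates are passed through as opaque strings because the date format is not fixed here, and "/" is left unescaped only to mirror the examples above.

```python
from typing import Optional, Union
from urllib.parse import urlencode


def build_stats_query(
    *,
    entity_type: Optional[str] = None,   # maps to the "type" parameter
    status: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    action_type: Optional[str] = None,
    linked_on: Optional[str] = None,
    offset: Optional[int] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a GET /links/stats path, leaving out every filter that was not given."""
    params: dict[str, Union[str, int, None]] = {
        "type": entity_type,
        "status": status,
        "startDate": start_date,
        "endDate": end_date,
        "actionType": action_type,
        "linkedOn": linked_on,
        "offset": offset,
        "limit": limit,
    }
    present = {name: value for name, value in params.items() if value is not None}
    query = urlencode(present, safe="/")
    return "/links/stats" + ("?" + query if query else "")
```

For instance, `build_stats_query(entity_type="instance", status="fail", start_date="05/07/24", end_date="11/07/24")` produces the path of the second example above. Characters such as `$` and `,` in a linkedOn value are percent-encoded by this helper.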
Delivery Plan
- mod-entities-links modifications (accept stats and query stats API)
- mod-data-export-worker modifications (interaction with mod-entities-links to retrieve report data by query)
- mod-data-export-spring modifications (to create export job for authority control reporting data)
- apply changes to mod-search, ui-quick-marc and mod-quick-marc according to the research results
- write performance tests on forming multiple reports; research and document bottlenecks
- apply optimizations after perf analysis (if required)
LOE
- mod-entities-links
- implement ingestion and storing of statistic data (2 sprints)
- implement filter query API to fetch corresponding stats records (1 sprint)
- mod-data-export-worker
- implement integration with mod-entities-links API to fetch data (1 sprint)
- implement functionality to store and provision .csv file report (1 sprint)
- mod-data-export-spring
- implement API to get report by linked data query (` sprint)
- Not yet defined scope (about 3 sprints; includes modifications to mod-search and the corresponding UI changes)
Rationale
To address the requirements on authority control reporting, it was decided to ingest and store linking statistics within mod-entities-links and to provide a corresponding API for obtaining that data by extensible filter criteria. This is expected to keep the data consistent and to improve report generation performance, since reports are built from the actual data without having to interact with another module, such as mod-source-record-manager, to aggregate data that is not specific to it.