2018-02-13 Reporting SIG POC/Data Lake Small Group Meeting notes

Date

Attendees

Summary

EBSCO is designing a POC (Proof of Concept) for a Data Lake. They have just 3 weeks to set it up (by 3/2/18). They would like a representative from the Reporting SIG to generate one report out of the data lake using a reporting tool of our choice. This will serve as a proof of concept for designing a Folio Data Lake environment. A POC/Data Lake small group has been formed out of the Reporting SIG to work on this project. For the POC, we need to identify 3 things: what type of report to build, the reporting tool to use, and who will build the report. The POC/Data Lake small group met on 2/13/18 at 9am to address these questions. A recording of this meeting is available on the Reporting_SIG shared drive in the Data_Lake_Working_Group/Recordings folder.


What type of Report should we build?

Harry Kaplanian suggested building the report from the patron, circulation, or inventory area of Folio. The group agreed it would be good to build a report using data from all 3 areas. The report will show a particular patron group (faculty) with a set of items charged to them across a call number range within a specific date range.


Tod Olson suggested we load and structure our own data through a Python script, then report on that data. This way, we know the nature of the data going in and can verify the data coming out in the report. Tod will look into writing this script. All agreed that reporting from data we load was a good plan. We will need different, additional data loaded via the Python script at the same time to be sure we can pull only the report data we specify (i.e., data on multiple patron groups like faculty and students, data from call number ranges outside the report range, and data from dates outside the report range). Ryan Laddusaw, from Texas A&M, will assist Tod with the Python script that will automate loading loan data into Folio.
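As a starting point for discussion, a minimal sketch of what the load script might look like is below. It assumes loans are POSTed as JSON through the Okapi gateway; the endpoint path, field names, and header names are assumptions based on how FOLIO's circulation module is generally described, and would need to be confirmed against the actual tenant before the POC.

```python
import json
import urllib.request
import uuid


def build_loan(user_id, item_id, loan_date, due_date):
    """Assemble one loan record. Field names here are assumptions
    about the FOLIO loan schema, not a confirmed contract."""
    return {
        "id": str(uuid.uuid4()),
        "userId": user_id,
        "itemId": item_id,
        "loanDate": loan_date,
        "dueDate": due_date,
        "action": "checkedout",
        "status": {"name": "Open"},
    }


def post_loan(okapi_url, tenant, token, loan):
    """POST one loan to the Okapi gateway. The /circulation/loans
    path and X-Okapi-* headers are assumptions to verify."""
    req = urllib.request.Request(
        okapi_url + "/circulation/loans",
        data=json.dumps(loan).encode("utf-8"),
        headers={
            "X-Okapi-Tenant": tenant,
            "X-Okapi-Token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Looping `build_loan`/`post_loan` over generated patron groups, call number ranges, and dates would also cover the "additional data" requirement above, since the same script can load both in-scope and out-of-scope records.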


What Reporting Tool will we use?

For this initial POC/Data Lake project, we have selected BIRT, an open source reporting tool. 


Who will Build the Report?

Chris Creswell, from Lehigh University and a member of the SysOps SIG, will generate the report for the POC using BIRT. The reporting will be done in 2 steps. First, EBSCO will give Chris a data extract from the data lake (Chris would like the extract as a Postgres database), and Chris will use BIRT to build the report from the extract. Second, EBSCO will build a direct connection from BIRT to the data lake, and Chris will build the report using BIRT over this connection.


What elements will the Report include?

  • Patron Name
  • Patron ID
  • Item Barcode
  • Item Due Date
  • Brief Bib information
  • Owning Library
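For the first step (reporting against the Postgres extract), the report query behind these elements might be sketched as follows. All table and column names are hypothetical placeholders; the real schema will come from the EBSCO extract. The `%(...)s` placeholders follow psycopg2's pyformat style.

```python
def build_report_query():
    """Parameterized SQL mapping the six report elements above.
    Table/column names are hypothetical, pending the extract schema."""
    return """
        SELECT u.name            AS patron_name,
               u.patron_id,
               i.barcode         AS item_barcode,
               l.due_date        AS item_due_date,
               i.brief_bib,
               i.owning_library
          FROM loans l
          JOIN users u ON u.id = l.user_id
          JOIN items i ON i.id = l.item_id
         WHERE u.patron_group = %(patron_group)s
           AND i.call_number BETWEEN %(cn_start)s AND %(cn_end)s
           AND l.loan_date BETWEEN %(date_start)s AND %(date_end)s
         ORDER BY i.call_number
    """


def report_params(patron_group, cn_start, cn_end, date_start, date_end):
    """Bind values for the report: one patron group (e.g. faculty),
    a call number range, and a date range, per the report spec."""
    return {
        "patron_group": patron_group,
        "cn_start": cn_start,
        "cn_end": cn_end,
        "date_start": date_start,
        "date_end": date_end,
    }
```

Because the query filters on patron group, call number range, and loan date range, running it against the loaded test data should return only the report rows we specified and exclude the deliberately out-of-range records.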

Who is setting up the Data Lake?

Mathew Reno from EBSCO will be setting up the data lake and will work with Tod Olson and Ryan Laddusaw on the data load as well as Chris Creswell on the report from BIRT.


Contacts 

Tod Olson - Python script for data load into Folio tenant

Christopher Creswell - BIRT reports against Data Lake tenant and data extracts

Matt Reno - Data Lake setup, architecture, interfaces

Ryan Laddusaw - assisting with the Python script for the data load


Acceptance Criteria (from Mark Veksler) for the POC Data Lake project

  • AWS data lake instance is set up
  • Data extracted from FOLIO (refer to report requirements to determine what data elements to include) and loaded into the lake
    • one-time data load
    • stretch goal: OKAPI module to capture/inject transactional data into the lake
  • Test data generator — a utility/script to generate a ‘reasonable’ number of transactions to fill an actual report
  • One report (circulation) implemented using BIRT
    • Patron name
    • Patron id
    • Item barcode
    • Item due date
    • Brief bib information
    • Owning library
  • Outline of Findings and Recommendations
    • high level architecture
    • process for populating the data