2018-02-26 Reporting SIG Notes

Date: 2018-02-26

  • Attendees (X = present)

    Present?  Name                        Organization
              Vince Bareau                EBSCO
              Katalin Lovagne Szucs       Qulto
              Sharon Beltaine             Cornell University
              John McDonald               EBSCO
              Elizabeth Berney            Duke University
              Peter Murray                Index Data
              Ginny Boyer                 Duke University
              Erin Nettifee               Duke University
              Joyce Chapman               Duke University
              Karen Newbery               Duke University
              Elizabeth Edwards           University of Chicago
    X         Tod Olson                   University of Chicago
    X         Claudius Herkt-Januschek    SUB Hamburg
              Scott Perry                 University of Chicago
    X         Doreen Herold               Lehigh University
              Robert Sass                 Qulto
    X         Anne L. Highsmith           Texas A&M
              Simona Tabacaru             Texas A&M
              Filip Jakobsen              Index Data
    X         Mark Veksler                EBSCO
              Harry Kaplanian             EBSCO
    X         Kevin Walker                The University of Alabama
    X         Ingolf Kuss                 hbz
              Charlotte Whitt             Index Data
              Lina Lakhia                 SOAS
    X         Michael Winkler             Cornell University
    X         Joanne Leary                Cornell University
              Christine Wise              SOAS
    X         Michael Patrick             The University of Alabama
    X         Matt Reno                   EBSCO

Discussion items

  • Assign Notetaker, Take Attendance, Review agenda (Sharon Markus)

    Previous Notetaker: Simona Tabacaru

    Today's Notetaker: Sharon Beltaine (see Minutes below)

  • POC Data Lake

    Update on the progress of the Proof-of-Concept Data Lake project:

      • test data loaded into Folio?
      • documenting the process?
      • report in BIRT?
      • other issues?

  • Giving Folio Feedback (Sharon Markus)

    Providing Feedback - instructions on giving Folio development feedback from Cate Boerma (Product Owner lead)

  • Sprint Reviews (Sharon Markus)

    Link to Sprint Review recordings, presentations, etc.:

    https://drive.google.com/drive/folders/0B7-0x1EqPZQKNTJHbkw4M2NRelU

  • Wolf Con update (Sharon Markus)

    Wolf Con dates most likely to move to June 2018

  • Reporting Tools (Ingolf Kuss)

    Reporting Tools used in Germany

  • Future Topics (Sharon Markus)

    Topics for Future Reporting SIG Meetings

Meeting Minutes

Status of the Data Lake POC (Proof of Concept) Project:


Loading Data into a Folio tenant

-Ryan Laddusaw from Texas A&M has worked on a Python script which now successfully loads data into a Folio tenant 

-Ryan would still like to work on these modifications to the script (a rough sketch follows the list below):

  • Modify the GetUsers routine to get the Patron Group as a parameter
  • Modify GetItems to take a Call Number range as a parameter
  • Modify the MakeLoans method to take a loan date as a parameter
  • String the above methods together to create several groups of data that we can query against
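
For illustration only, a minimal sketch of what these parameterized routines might look like in a requests-based Python script against Okapi; the endpoint paths, CQL field names, and function names are assumptions, not Ryan's actual code:

    import requests

    OKAPI_URL = "https://folio-test.example.edu"   # assumed test-tenant URL
    HEADERS = {"X-Okapi-Tenant": "diku", "X-Okapi-Token": "<token from login>"}

    def get_users(patron_group_id, limit=100):
        """Fetch users restricted to a single patron group (passed as a CQL filter)."""
        params = {"query": f'patronGroup=="{patron_group_id}"', "limit": limit}
        resp = requests.get(f"{OKAPI_URL}/users", headers=HEADERS, params=params)
        resp.raise_for_status()
        return resp.json()["users"]

    def get_items(call_number_start, call_number_end, limit=100):
        """Fetch items whose call number falls inside the given range."""
        query = (f'callNumber>="{call_number_start}" and '
                 f'callNumber<="{call_number_end}"')
        resp = requests.get(f"{OKAPI_URL}/item-storage/items", headers=HEADERS,
                            params={"query": query, "limit": limit})
        resp.raise_for_status()
        return resp.json()["items"]

    def make_loan(user, item, loan_date):
        """Create one loan with an explicit loan date."""
        payload = {"userId": user["id"], "itemId": item["id"], "loanDate": loan_date,
                   "action": "checkedout", "status": {"name": "Open"}}
        resp = requests.post(f"{OKAPI_URL}/loan-storage/loans", headers=HEADERS, json=payload)
        resp.raise_for_status()
        return resp.json()

    # Stringing the methods together: one batch of loans per patron group / call number range.
    for user, item in zip(get_users("<patron-group-uuid>"), get_items("QA1", "QA999")):
        make_loan(user, item, "2018-02-26T00:00:00Z")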

-Ryan feels he can get this working within the next week

-Tod not sure we will be able to provide different patron groups in the test instance

-Ryan has created a Python script that will automatically create loans (generate transactions) and can point at whatever instance we want it to

-Ryan has given script to Matt Reno to use in the EBSCO Folio instance

-Mark says the additional work (listed above) is not needed for the POC, explaining that the purpose of the POC was to:

  • generate some transactions in a Folio instance
  • feed the transactional data into the Data Lake
  • produce a report from the Data Lake

-Anne asks: if we have put out specifications for the sample report, how can we be sure what we are going to report against? If we do random transactions against random items, how do we know that we have checked out something in a given call number range?

-Tod says we can print the list of patron barcodes and item barcodes that the loans are being made for, and compare that to what has actually made it into the data lake. The fact that the loans are generated randomly doesn't matter, as long as they are reported reliably from the tool.
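
A check of that kind could be scripted as a simple set comparison; this sketch assumes the loan script writes out the barcodes it used and that the lake-side list is pulled back out with a query (all values below are made up):

    # Barcodes the loan-generation script recorded vs. barcodes found in the data lake.
    generated_barcodes = {"31924001234567", "31924007654321"}   # written out by the loan script
    lake_barcodes = {"31924001234567"}                          # e.g. the result of an Athena query

    missing = generated_barcodes - lake_barcodes
    print(f"{len(missing)} loan(s) did not make it into the data lake:", sorted(missing))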

-Mark says the data doesn't need to be reported reliably, because the POC is just to see whether we can get the data out of Folio into the Data Lake and report on that data out of the data lake; we are not building a production system in the POC; our goal is to learn from the technical complexity so that we can then propose what the architectural design needs to be to move forward with the implementation of a data lake

-Mark discussed what we have done so far for the POC

  • Matt set up an AWS tenant for the data lake
  • We wanted to extract data from Folio
    • do one-time data load into the data lake
    • (stretch goal) develop an Okapi module that will capture and then move transactions into the data lake
  • Tod, then Ryan created a script to load transactional data

Getting Data into the Data Lake

-Matt explained that the data going into the data lake from Folio is in JSON format right now

-the data includes what is coming into the client and what is going out from the client, and it is just being stored as a larger JSON object in the data lake

-each record is structured JSON, but its shape depends on the API call, so the lake as a whole is effectively unstructured; even so, we can create schemas based on what is in the data lake
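
As a rough illustration of what one of those stored objects might hold (field names are invented; the real payloads follow whatever the FOLIO module sends):

    # One captured API exchange, stored as a single JSON document in the lake.
    record = {
        "timestamp": "2018-02-26T15:04:05Z",
        "tenant": "diku",
        "path": "/circulation/loans",
        "method": "POST",
        "request":  {"itemId": "<item-uuid>", "userId": "<user-uuid>",
                     "loanDate": "2018-02-26T15:04:05Z"},
        "response": {"id": "<loan-uuid>", "action": "checkedout",
                     "status": {"name": "Open"}},
    }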

-once the data is in the data lake, it can be transformed and queried with the Amazon tooling described below

-Matt set up the data lake so that it uses Amazon components to allow for Okapi to stream in each transaction, which gets stored in a bucket in S3 and accumulates there; there is another component in Amazon called Glue that allows you to transform the data, build a schema, build SQL tables on top of the JSON data to be used by whatever reporting tool you are using
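
A minimal sketch of the capture side, assuming each transaction is written to the S3 bucket as its own object (bucket name and key layout are assumptions, not the POC's actual configuration):

    import boto3, datetime, json, uuid

    s3 = boto3.client("s3")

    def capture(transaction: dict):
        """Append one raw transaction to the lake as its own S3 object."""
        key = f"raw/{datetime.date.today():%Y/%m/%d}/{uuid.uuid4()}.json"
        s3.put_object(Bucket="poc-data-lake", Key=key,
                      Body=json.dumps(transaction).encode("utf-8"))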

-Amazon has a tool called Athena that lets you do on-the-fly SQL queries; Matt has been using Athena to validate the data coming into the data lake
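
For example, a validation query could be kicked off from Python like this (database, table, and result-bucket names are assumptions for the POC):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")
    athena.start_query_execution(
        QueryString=("SELECT count(*) AS loan_posts "
                     "FROM raw_transactions WHERE path = '/circulation/loans'"),
        QueryExecutionContext={"Database": "poc_lake"},
        ResultConfiguration={"OutputLocation": "s3://poc-data-lake/athena-results/"},
    )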

-Once in the data lake, the data could be moved into Redshift or another type of database (e.g., a relational database)

-Sharon asks if there was a plan to set up the data to be consumed directly, and Matt answered that this would be a challenge because the data is unstructured

-Matt explained we would always use tools to access the data that are on top of the data lake in whatever format we choose; this is where we create the schema, and the schema can evolve over time as the data changes; it's likely that the data would change over time; we are finding that the schemas in Folio are changing rapidly right now

-developers are storing their data in Folio however they want right now, and this could change; each microservice can have its own database, and the data can be stored in multiple places


Data Lake Structure

-For the POC, Matt is setting up a schema in the data lake that will allow generating a report from BIRT, specifically the report Chris Creswell is building; the data will be structured as if it were in a SQL database

-Matt is using AWS Glue to structure the data
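
One way Glue can be pointed at the bucket is with a crawler that infers a table schema from the JSON objects; the names below are placeholders, not necessarily how Matt configured the POC:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")
    glue.create_crawler(
        Name="poc-lake-crawler",
        Role="GlueServiceRole",                                   # assumed IAM role
        DatabaseName="poc_lake",
        Targets={"S3Targets": [{"Path": "s3://poc-data-lake/raw/"}]},
    )
    glue.start_crawler(Name="poc-lake-crawler")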

-Anne asked what about those schools not using an AWS environment, and Matt said his setup was strictly for the POC, and that those schools not using AWS would need a different data lake setup

-Matt will set up a table that Chris can query to get the data desired in the report; the schema in the data lake will be set up to allow SQL query, which will be used by BIRT for the report
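
The sample report's exact specification is not restated in these notes, but the query BIRT runs against that table would be roughly of this shape (table and column names assumed):

    # SQL handed to BIRT's data set, shown here as a query string.
    REPORT_QUERY = """
        SELECT patron_group, COUNT(*) AS loan_count
        FROM   poc_lake.loans
        WHERE  call_number BETWEEN 'QA1' AND 'QA999'
        GROUP  BY patron_group
    """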

-Matt explained that more work needs to be done, for instance, UUIDs would need to be looked up during the transform process and there is not a good way to do that with Folio right now (he believes edge APIs may be developed after the POC to allow rapid lookups)
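
Pending such edge APIs, the transform would have to resolve UUIDs itself; a cached lookup against the existing Okapi APIs is one way to sketch it (endpoint and field names are assumptions):

    import functools, requests

    OKAPI_URL = "https://folio-test.example.edu"
    HEADERS = {"X-Okapi-Tenant": "diku", "X-Okapi-Token": "<token from login>"}

    @functools.lru_cache(maxsize=None)
    def patron_group_name(group_id: str) -> str:
        """Resolve a patron group UUID to its name, caching so repeated loans don't re-query."""
        resp = requests.get(f"{OKAPI_URL}/groups/{group_id}", headers=HEADERS)
        resp.raise_for_status()
        return resp.json()["group"]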


Transforming the Data

-Ingolf asks what the data looks like in the data lake: is it a flat structure? (in contrast to a data warehouse, which is typically a relational database)

-Mark explains that for the POC, the data is just a collection of JSON documents, and that you want to transform the data into a columnar view that can then be consumed by SQL

-Matt said the data does not have to stay in the data lake and can be transformed and then moved into another data store, such as a data warehouse (using Redshift, for instance); alternatively, the data could potentially stay in the data lake until it is needed and be transformed on the fly (depending on performance issues); the data lake is an S3 bucket, so it is hard to say how it will perform under heavy data load
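
Outside of Glue, that JSON-to-columnar step could look like the following sketch, which flattens the raw documents and writes them out in a columnar format that SQL engines can read (pandas/pyarrow used purely for illustration):

    import glob, json
    import pandas as pd

    records = []
    for path in glob.glob("raw/*.json"):          # the raw JSON objects pulled from S3
        with open(path) as f:
            records.append(json.load(f))

    flat = pd.json_normalize(records)             # nested JSON -> one column per (dotted) field
    flat.to_parquet("loans.parquet", index=False) # columnar output; requires pyarrow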

-Ingolf points out that using the data lake in this way may be fine for the POC, but may not work in the real world


Data Usage, Recency, Performance

-Tod asks how the transactions come into the data lake: do they just accumulate or do they overwrite previous state? 

-Matt said the intent for the POC is to store everything that comes into and goes out of Folio in the data lake; right now, there is no aggregation or transformation being done on load; the raw data keeps coming in and, like a log, it just keeps appending

-Tod and Matt agreed that with this setup, it will quickly run into scale problems, and we will need to figure out what data we need to move into the data lake

-Matt said these decisions will depend on data usage; we may want to partition the data into separate places; the POC implementation is very simple and the actual data structure set will need to be expanded
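
Partitioning could be as simple as encoding tenant, module, and date into the object keys so queries only scan the slice they need; this layout is an assumption, not the POC's current structure:

    import uuid

    # Hive-style partition keys that Glue/Athena can recognize and prune on.
    key = f"raw/tenant=diku/module=circulation/date=2018-02-26/{uuid.uuid4()}.json"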

-Matt has seen complicated data lake setups that are used for very big data, so based on those other setups he knows we can achieve the type of performance we require

-If you are hosting your own Folio and Data lake environment, and you know what you want to report on, you would only send the type of data that you care about from the microservices to the data lake; whereas, if you are using a hosted environment, you might have far more data 

-It may be a matter of configuring for the data you need in the data lake module by module, perhaps as a Folio level setting that defines what data you want to collect; this should be included in the recommendations after the POC
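
Such a Folio-level setting might reduce to something as simple as a per-module capture flag; this is purely hypothetical, since no such setting exists yet:

    # Which modules' transactions get forwarded to the data lake (hypothetical config).
    DATA_LAKE_CAPTURE = {
        "circulation": True,    # loans, returns, renewals
        "inventory":   True,    # item and holdings updates
        "users":       False,   # skip patron record changes
    }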


Data and Privacy Issues

-Tod asks if there is data that could be requested by subpoena that we need to identify

-Matt said we will need to adhere to the requirements for data privacy, and that we might have to scrub data

-Ingolf points out that the privacy issues should be addressed in the privacy SIG

-Mike Winkler pointed out that the Privacy SIG may not know enough about the architecture to know there are privacy implications (for the data lake setup)

-Ingolf suggested it could be useful to be able to turn a switch off and on for data privacy, because there may be times when you need the data (e.g., an audit trail); also, different institutions will have different privacy requirements
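
The on/off switch idea could be sketched as a per-institution setting that decides whether patron identifiers are kept (e.g., for an audit trail) or scrubbed before a record is written to the lake; the setting and field names below are assumptions:

    RETAIN_PATRON_DATA = False   # would come from institution-level configuration

    def scrub(record: dict) -> dict:
        """Drop personally identifying fields unless the institution chose to retain them."""
        if RETAIN_PATRON_DATA:
            return record
        cleaned = dict(record)
        for field in ("userId", "patronBarcode", "personal"):
            cleaned.pop(field, None)
        return cleaned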

-Sharon said that it sounds like each institution will need to configure how the data will be transformed for its needs, and that the configuration tasks may be considerable


Institutions building Data Lakes

-Mike pointed out that how much configuration each institution needs to do to get its data is an analysis point; reporting is a critical operation to libraries, and if we are recommending something that is untenable, then we need to deal with that.

