Notes from Past Meetings (November 2018 - January 2019)
20181107 Reporting Prototype Meeting
-to connect to the database on the Glint server, send Nassib a direct message with the IP address you will connect from (for security purposes)
-can connect from your reporting application of choice
-Nassib is working on setting up a way to allow HTTP requests for report/query results
-this is a suggested view, but it would be interesting to try different views of the data
-it would be helpful to try the query both with and without crosstab
The dataset
-sample data created with this report in mind
-approx 50 transactions with user, loan, date, patron group
Questions
-how is the data distributed, and should the distribution be different for the query?
-do you have any feedback or questions? how is the process working?
-Sharon: BIRT
-Angela: Tableau, R, PGAdmin
-Tod: Aqua Data Studio, maybe something additional
-Kevin: MS Access, Excel, Crystal Reports
For background, see https://wiki.folio.org/display/RPT/Building+Report+Prototypes
-- Requires the tablefunc extension for crosstab():
--   create extension if not exists tablefunc;
create view v as
select '2017-01-01 - 2018-12-31' as checkout_date_range,
       *,
       coalesce(faculty, 0) +
       coalesce(graduate, 0) +
       coalesce(staff, 0) +
       coalesce(undergrad, 0) as total_for_library
from crosstab(
    'select li.location_name, g.groupname, count(l.id) as ct
     from (
         select * from loans
         where loan_date >= ''2017-01-01'' and loan_date <= ''2018-12-31''
     ) l
     left join loans_item li
         on l.id = li.loan_id
     left join users u
         on l.user_id = u.id
     left join groups g
         on u.patron_group_id = g.id
     group by li.location_name, g.groupname
     order by 1, 2'
) as ct ("library" text, "faculty" bigint, "graduate" bigint,
         "staff" bigint, "undergrad" bigint)
;
select *
from v
union all
select '',
       'total_for_patron_group',
       sum(faculty),
       sum(graduate),
       sum(staff),
       sum(undergrad),
       sum(total_for_library)
from v
;
Report
   checkout_date_range   |        library         | faculty | graduate | staff | undergrad | total_for_library
-------------------------+------------------------+---------+----------+-------+-----------+-------------------
 2017-01-01 - 2018-12-31 | Annex                  |       2 |        1 |     2 |           |                 5
 2017-01-01 - 2018-12-31 | Main Library           |       5 |       13 |     1 |        13 |                32
 2017-01-01 - 2018-12-31 | SECOND FLOOR           |       1 |        2 |       |           |                 3
                         | total_for_patron_group |       8 |       16 |     3 |        13 |                40
Open a Postgres client to try the query directly
20181112 Meeting Notes
Tableau
-Angela is able to connect with Tableau (PostgreSQL connection)
-tried Custom SQL, but it cannot create views; it can access an existing view and do the union, and can also just run the simple data query without the crosstab
-note: Custom SQL should *not* include trailing semicolon
BIRT
-still installing and configuring
-Sharon to follow up about her IP address and JDBC drivers
-Angela recommended this for JDBC -- http://wiki.arcs.com/article/BIRT_With_Postgres
Crosstab
-crosstab queries are less likely to be supported by reporting tools
-need to consider whether or not to use the crosstab
Aqua Data Studio
-test SQL worked fine
MS Access
-#deleted showing up on data points
pgAdmin 4
-seems to connect fine
-was able to run complete crosstab code, including creating a view (named it “w” to make sure it actually created a new view)
-can save SQL query as local .sql file, which would be easy for sharing with others
-can save results as .csv
-interface is very complex, though - no easy query builder, just straight SQL
Simple query, without crosstab:
select li.location_name, g.groupname, count(l.id) as ct
from (
    select * from loans
    where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'
) l
left join loans_item li
    on l.id = li.loan_id
left join users u
    on l.user_id = u.id
left join groups g
    on u.patron_group_id = g.id
group by li.location_name, g.groupname
order by 1, 2;
Example without aggregation:
select li.location_name, g.groupname, l.id, l.loan_date
from (
    select * from loans
    where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'
) l
left join loans_item li
    on l.id = li.loan_id
left join users u
    on l.user_id = u.id
left join groups g
    on u.patron_group_id = g.id;
Example without aggregation or filter:
select li.location_name, g.groupname, l.id, l.loan_date
from loans l
left join loans_item li
    on l.id = li.loan_id
left join users u
    on l.user_id = u.id
left join groups g
    on u.patron_group_id = g.id;
OLAP Data Model
-Nassib starting to work on this now
GitHub Repository for LDP
-https://github.com/folio-org/ldp
Loans query is in
https://github.com/folio-org/ldp/tree/master/database/query
20181119 Reporting Prototype Meeting Notes
-some tools may not be able to create temporary views
-try a WITH clause (common table expression) instead of creating a view; see the sketch below
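For tools that cannot create views, a minimal sketch of the WITH approach, wrapping the simple checkout query from the earlier notes in a common table expression:
-- CTE version of the simple checkout query; no view is created
with checkouts as (
    select li.location_name, g.groupname, count(l.id) as ct
    from (
        select * from loans
        where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'
    ) l
    left join loans_item li
        on l.id = li.loan_id
    left join users u
        on l.user_id = u.id
    left join groups g
        on u.patron_group_id = g.id
    group by li.location_name, g.groupname
)
select * from checkouts
order by 1, 2;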
Storing and Sharing Queries
-stored in GitHub
*Make sure you send your GitHub username to Nassib
Testing
-need to test queries against large datasets
-need to look at complexity of data relationships
Data Model
-can create OLAP-style schemas if needed
-see https://docs.google.com/spreadsheets/d/1O8orRx2dBedaeWIkbN30v2ZHyRG6BgecLXgy5xL35bQ/edit?usp=sharing
Stress Test
-table showing the number of rows in each table
-100 patron groups
-100,000 users
-100 locations
-20 million loan transactions per year (maybe 5 years?)
Normal Data Test
- 12 patron groups
- 30,000 users
- 20 locations
- 10 million loan transactions per year (1 year)
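As a starting point for these tests, a hedged sketch of generating loan rows at the normal-test scale with generate_series; the loans table and integer id columns are assumptions based on the queries above:
-- Hypothetical test-data sketch: 10 million loans over one year,
-- spread across 30,000 assumed user ids; scale up for the stress test.
insert into loans (id, user_id, loan_date)
select gs.i,
       (random() * 29999)::int + 1,
       date '2017-01-01' + (random() * 364)::int
from generate_series(1, 10000000) as gs(i);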
20181126 Reporting Prototypes Meeting Notes
-still need FOLIO attributes documented to do data loads
-can only work on loans, users, and groups right now
-latest documentation due soon
-Sharon to check in with Nassib week of Dec 10 to see next steps
-what data do we want next?
-which data elements do we need next?
-focus on Item Detail Report next
Tod: data integrity check?
Nassib: not ready yet because the data are not yet synchronized
Data Integrity Check
-make sure all loans point to real users and real items
-data comes from storage modules and is stored in Postgres databases
-corrupt foreign keys will not be brought into the LDP; LDP currently has foreign key constraints
-should we have foreign key constraints or not?
-use the data load to detect problems with the data
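A minimal sketch of using a query to detect dangling references, e.g., loans whose user does not exist (an analogous check would apply to items):
-- Find loans that do not point to a real user; zero rows means the
-- user references are intact (table names follow the queries above)
select l.id, l.user_id
from loans l
left join users u
    on l.user_id = u.id
where u.id is null;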
Privacy/GDPR
-how will we move it out of the system without breaking data integrity?
-we could move the references to another table so that they could still be included in reports
Prioritizing
-loans, users, and groups are in the sprint now and could be brought into the LDP
Testing Reporting Applications
-Access: not getting access to the tables in the LDP; “#deleted” errors
-Crystal Reports: able to connect to LDP, but hard to use the native SQL against LDP
20181203 Reporting Prototypes Meeting Notes
Today’s Agenda:
Circ Detail Report and updates on connecting with various reporting applications
MS Access
-MS Access having problems connecting; Access is adding the word “public” to table names; just use an ODBC driver
BIRT
-both Angela and Tod are finding that BIRT on the Mac does not work well; can start building a report, but it won’t preview in the web viewer; Java issue
Other Apps
-Angela trying Mode Studio
R
-Mac driver installation is complicated
Python
-Angela to try a test
Apache Superset
-visual SQL builder; can be run with Docker
20181210 Reporting Prototypes Meeting Notes
-Kevin created the Circ Detail Report and made notes on each of the FOLIO attributes - https://wiki.folio.org/display/RPT/Circulation+Detail+Report+Prototype
-some data element definitions had to be inferred because they were not documented
-FOLIO attribute documentation is not consistent: attributes sometimes appear in the Request, sometimes in the Response
-question: will the data be organized as the API documentation indicates?
Connecting with Access
-set up ODBC
-need to import (rather than link) the data to get good data
-a linked table gives “#deleted” as the name
-tables show up
-exported data looks good, but the tables in Access are not right
-“cannot join on Memo, OLE, or Hyperlink Object” error
-does Access expect a foreign key flag?
-remove “public_” from the SQL to get it to work
-changed type of query to “pass-through”; point to Postgres SQL 3.0
-error message saying attempt to do joins on memo fields
-maybe there is a security issue?
-maybe a problem with underscores in names of tables
-works if you replace underscores with periods in the SQL (e.g., public_loans becomes public.loans)
Crystal Reports
-like Access, the name of the database is added to the table names
-renaming tables to remove “public” may help
BIRT
-still challenging to get Mac version working
How will we authenticate to the LDP?
-IP address restriction is too limiting
-can we have Shibboleth?
-“stunnel” connection?
20181217 Reporting Prototype Meeting
-eUsage data model development
-consider COUNTER 4 to COUNTER 5 changes
Before Circ Detail Report can be tested with Reporting Apps...
-developers need to finish documenting loans, users, and groups
-Nassib and Roman need to load more data in the LDP for loans, users, and groups
-developers also need to document new attributes in second circ report
-Roman now working on generating sample data for LDP
-ask Bjorn and Annika to document the data model for eUsage before we schedule time to build data model in LDP
-also ask what form the data elements take, e.g., JSON, string, etc.
Database Schema
-“public” is the schema
-schema.table naming is required, e.g., public.loans
-best to run the database queries in Postgres
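A one-line illustration of the schema-qualified naming, using the loans table from the queries above:
-- schema "public", table "loans"
select count(*) from public.loans;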
Performance
-Postgres has query optimizer to determine the best strategy for satisfying the query
Documentation
-would be good for group to document how best to write queries using the data model
-will depend somewhat on work that is done by analyst vs data administrator, e.g., creating indexes
20190114 Reporting Prototype Meeting Notes
-what other CIRC reports could be built from the original model?
-eUsage update from Sharon: eUsage team is writing FOLIO attribute documentation we can use to build a data model
-testing with other database systems such as MonetDB and Vertica
-location module time of checkout issue
-use star schema and simplify query for circ report
-more data in data warehouse
-possible new diagram showing star schema
-having developers document FOLIO attributes is still an important dependency for reporting (see UXPROD-1414)
Applications to consider for reporting
-CU likes Tableau as a high-end tool
-BIRT has flexibility and is free
Mirroring
-would allow queries “on the edges” of what we support
Star Schema
-joins of two or more steps are flattened into a single join (see the sketch below)
-reports are easier to write, and database performance is better
-center (fact) tables and outer (dimension) tables
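A hedged sketch of how the circ query might look against a star schema; the fact and dimension table names (loan_fact, user_dim, location_dim) are assumptions, not the actual LDP schema:
-- Each dimension is one join away from the fact table, instead of the
-- multi-step hops through loans_item/users/groups in the queries above.
select ld.location_name, ud.groupname, count(*) as ct
from loan_fact lf
join user_dim ud
    on lf.user_key = ud.user_key
join location_dim ld
    on lf.location_key = ld.location_key
where lf.loan_date >= '2017-01-01' and lf.loan_date <= '2018-12-31'
group by ld.location_name, ud.groupname
order by 1, 2;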
MonetDB
-might be a good choice for smaller libraries because it is fast; can run fast aggregate functions across different datasets
Vertica
-commercial
-fastest database for this schema
-Nassib looking into cost
-compatible with Postgres
Sample Data Generator
-Roman is working on a sample data generator
-will let us load the kind of data we need for reporting
Platform Plans
-start with Postgres with an option for participants to later upgrade to Vertica if performance is an issue (Postgres and Vertica designed by same person)
Next Steps
-Nassib to add more data and more attributes to Library Reporting Database, and we will move to a star schema
Plans for future meeting of this group
-next meeting January 28th
20190128 Reporting Prototype Meeting Notes
Agenda:
-(Nassib) How we are handling "missing data", temporary anomalies caused by incremental updates.
-(Nassib) incremental updater, which we can later use to load streaming updates.
-(Nassib) feedback on star schema
-(Nassib) feedback on test data
-(Sharon) What are our goals for this week?
“Not Available” flag
-“eventual consistency” model and message queue
-reporting database being developed to enforce referential integrity
-a placeholder record is created when a referenced record is missing, e.g., a new user takes out a loan but there is no matching user record; a placeholder record gets created in the user table with the label “not available”
-the star schema has loan transactions in the center and points to users, but the group reference has been denormalized because there is no groups table
-“not available” could show up because data arrived out of order, because there is a problem with the message queue, or because data is missing in the operational database
-need to document what to do if this happens (e.g., escalate with developers)
-should say “data not found”
-when filling fields with “not available,” some fields are unique, e.g., username, so the placeholder will say “not available” + a random number
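A minimal sketch of this placeholder idea in SQL, assuming the users and loans tables from the earlier queries; the column names and the random-suffix scheme are illustrative, not the actual LDP implementation:
-- Create "not available" placeholder users for any loan whose user is
-- missing; a random suffix keeps a unique username column satisfied.
insert into users (id, username)
select missing.user_id,
       'not available ' || substr(md5(random()::text), 1, 8)
from (
    select distinct l.user_id
    from loans l
    left join users u
        on l.user_id = u.id
    where u.id is null
) missing;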
Historical data
-how to deal with historical data? Label “inactive”? If data are deleted from the operational store, would we accumulate historical data in a secondary table?
-should historical data be kept in the data warehouse only?
-at what point does data size become a problem? 100 million data points can be supported by the Postgres structure
-data purging affects our data warehouse schema design
-if okay to delete from operational, okay to delete from data warehouse?
-institutions should set data retention policies
-in-app reporting not longitudinal
-developers should develop features that let institutions make their own decisions about data retention
-what happens when data gets too large? Don’t need to worry too much about this in the data warehouse, but could become a problem for the operational system
-request: take a look at sample database with missing user records labeled “not available” - Nassib to send out examples
In-App Reporting
-confusion about in-app vs. data warehouse reporting
-Chalmers downgraded in-app reports because the expectation is that reports will come out of the data warehouse at go-live
https://docs.google.com/document/d/1AyxUqU3_sZRUdAJ7jATu4OjJWmqfY7KGQXuOIwrHnB4/edit?usp=sharing