Skip to end of banner
Go to start of banner

2019-01-28 Reporting Prototype Subgroup notes

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Current »

Date

Attendees

Notes from Past Meetings (November 2018 - January 2019)


20181107

Reporting Prototype Meeting

-to connect to database on Glint server, give Nassib IP address you will connect from in a direct message to Nassib (for security purposes)

-can connect from your reporting application of choice

-Nassib working on setting up a way to allow HTTP request for results of report/query

-this is a suggested view, but would be interesting to use different views of the data

-would be helpful to try with and without crosstab

The dataset

-sample data created with this report in mind

-approx 50 transactions with user, loan, date, patron group

Questions

-how is the distribution of the data, should it be different for the query?

-do you have any feedback, questions, how is the process working?

-Sharon: BIRT

-Angela: Tableau, R, PGAdmin

-Tod: Aqua Data Studio, maybe something additional

-Kevin: MS Access, Excel, Crystal Reports


For background, see https://wiki.folio.org/display/RPT/Building+Report+Prototypes

create view v as

select '2017-01-01 - 2018-12-31' as checkout_date_range,

      *,

      coalesce(faculty, 0) +

          coalesce(graduate, 0) +

          coalesce(staff, 0) +

          coalesce(undergrad, 0) as total_for_library

   from crosstab(

   'select li.location_name, g.groupname, count(l.id) as ct

       from (

       select * from loans

           where loan_date >= ''2017-01-01'' and loan_date <= ''2018-12-31''

       ) l

           left join loans_item li

               on l.id = li.loan_id

           left join users u

               on l.user_id = u.id

           left join groups g

               on u.patron_group_id = g.id

       group by li.location_name, g.groupname

       order by 1, 2'

   ) as ct ("library" text, "faculty" bigint, "graduate" bigint,

            "staff" bigint, "undergrad" bigint)

;

select *

   from v

union all

select '',

      'total_for_patron_group',

      sum(faculty),

      sum(graduate),

      sum(staff),

      sum(undergrad),

      sum(total_for_library)

   from v

;


Report

checkout_date_range   | library   | faculty | graduate | staff | undergrad | total_for_library

-------------------------+------------------------+---------+----------+-------+-----------+-------------------

2017-01-01 - 2018-12-31 | Annex                  | 2 | 1 | 2 | |         5

2017-01-01 - 2018-12-31 | Main Library           | 5 | 13 | 1 | 13 |        32

2017-01-01 - 2018-12-31 | SECOND FLOOR           | 1 | 2 | | |         3

                        | total_for_patron_group |       8 | 16 | 3 | 13 |                40


Open Postgres client to try query directly


20181112 Meeting Notes

Tableau

-Angela is able to connect with Tableau (PostgreSQL connection)

-tried Custom SQL, but it does not create views; can access existing view and do the union, can also just get the simple data query without the crosstab

-note: Custom SQL should *not* include trailing semicolon

BIRT

-still installing and configuring

-Sharon to follow up about her IP address and JDBC drivers

-Angela recommended this for JDBC -- http://wiki.arcs.com/article/BIRT_With_Postgres

Crosstab

-crosstabs less likely to be supported

-need to consider whether or not to use the crosstab

Aqua Studio

-test SQL worked fine

MS Access

-#deleted showing up on data points

Pgadmin4

-seems to connect fine

-was able to run complete crosstab code, including creating a view (named it “w” to make sure it actually created a new view)

-can save SQL query as local .sql file, which would be easy for sharing with others

-can save results as .csv

-interface is very complex, though - no easy query builder, just straight SQL

Simple query, without crosstab:

select li.location_name, g.groupname, count(l.id) as ct

       from (

       select * from loans

           where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'

       ) l

           left join loans_item li

               on l.id = li.loan_id

           left join users u

               on l.user_id = u.id

           left join groups g

               on u.patron_group_id = g.id

       group by li.location_name, g.groupname

       order by 1, 2;



Example without aggregation:

select li.location_name, g.groupname, l.id, l.loan_date

 from (

   select * from loans

     where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'

 ) l

 left join loans_item li

   on l.id = li.loan_id

 left join users u

   on l.user_id = u.id

 left join groups g

   on u.patron_group_id = g.id;


Example without aggregation or filter

select li.location_name, g.groupname, l.id, l.loan_date

 from loans l

 left join loans_item li

   on l.id = li.loan_id

 left join users u

   on l.user_id = u.id

 left join groups g

   on u.patron_group_id = g.id;




OLAP Data Model

-Nassib starting to work on this now


GitHub Repository for LDP

-https://github.com/folio-org/ldp

Loans query is in

https://github.com/folio-org/ldp/tree/master/database/query


2018-11-19 Reporting Prototype Meeting Notes

-some tools may not create temporary views

-try ‘select with’ to create the view

Storing and Sharing Queries

-stored in GitHub


*Make sure you send your GitHub username to Nassib

Testing

-need to test queries against large datasets

-need to look at complexity of data relationships


Data Model

-can create OLAP-style schemas if needed

-see https://docs.google.com/spreadsheets/d/1O8orRx2dBedaeWIkbN30v2ZHyRG6BgecLXgy5xL35bQ/edit?usp=sharing

Stress Test

-table showing for each table #rows

-100 patron groups

-100,000 users

-100 locations

-20 million loan transactions per year (maybe 5 years?)


Normal Data Test

- 12 patron groups

- 30,000 users

- 20 locations

- 10 million loan transactions per year (1 year)



20181126  Reporting Prototypes Meeting Notes

-still need Folio Attributes documented to do data loads

-can only work on loans, users, and groups right now

-latest documentation due soon

-Sharon to check in with Nassib week of Dec 10 to see next steps


-what data do we want next?

-which data elements do we need next?

-focus on Item Detail Report next

Tod: data integrity check?

Nassib: not ready yet because data not synchronized yet

Data Integrity Check

-make sure all loans point to real users and real items

-data comes from storage modules, stored in PostGres databases

-corrupt foreign keys will not be brought into the LDP; LDP currently has foreign key constraints

-should we have foreign key constraints or not?

-use the data load to detect problems with the data


Privacy/GDPR

-how will we move it out of the system without breaking data integrity?

-we could move the references to another table so that they could still be included in reports


Prioritizing

-loans, users, and groups are in the sprint now and could be brought into the LDP

Testing Reporting Applications

-Access: not getting access to the tables in the LDP; “#deleted” errors

-Crystal Reports: able to connect to LDP, but hard to use the native SQL against LDP



20181203   Reporting Prototypes Meeting Notes

Today’s Agenda:


Circ Detail Report and updates on connecting with various reporting applications

MSAccess

-MS Access having problems connecting; Access adding the word “public” to table names; just use an OCBC driver

BIRT

-both Angela and Tod finding BIRT on the Mac not working well; can start building report, but won’t preview in the web viewer; Java issue


Other Apps

-Angela trying ModeStudio


R

-Mac driver installation is advanced


Python

-Angela to try a test

Apache Superset

-visual SQL builder that supports Docker


20181210  Reporting Prototypes Meeting Notes

-Kevin created Circ Detail Report  made notes on each of the Folio attributes - https://wiki.folio.org/display/RPT/Circulation+Detail+Report+Prototype

-some data element definitions had to be inferred because they were not documented

-FOLIO attributes documentation not consistent in that sometimes in Request, sometimes Response

-question: will the data be organized as the API documentation indicates?


Connecting with Access

-set up ODBC

-need to import data to get good data

-linked table gives #deleted name

-tables show

-exported data looks good, but tables in Access are not right

-”cannot join on Memo, OLE, or Hyperlink Object” error

-does Access expect a foreign key flag?

-remove “public_” from the SQL to get it to work

-changed type of query to “pass-through”; point to Postgres SQL 3.0

-error message saying attempt to do joins on memo fields

-maybe there is a security issue?

-maybe a problem with underscores in names of tables

-works if you replace underscores with periods in the SQL


Crystal Reports

-like Access, the name of the database is added to the table names

-renaming tables to remove “public” may help

BIRT

-still challenging to get Mac version working

How will we authenticate to the LDP?

-IP address is too restricted

-can we have shibboleth?

-”S-Tunnel” connection?


20181217 Reporting Prototype Meeting

-eUsage data model development

-consider Counter 4 to Counter 5 changes

Before Circ Detail Report can be tested with Reporting Apps...

-developers need to finish documenting loans, users, and groups

-Nassib and Roman need to load more data in the LDP for loans, users, and groups

-developers also need to document new attributes in second circ report

-Roman now working on generating sample data for LDP

-ask Bjorn and Annika to document the data model for eUsage before we schedule time to build data model in LDP

-also ask what form the data elements take, e.g., JSON, string, etc.


Database Schema

-public is schema

-schema.table is required in the naming

-best to run the database queries in Postgres

Performance

-Postgres has query optimizer to determine the best strategy for satisfying the query

Documentation

-would be good for group to document how best to write queries using the data model

-will depend somewhat on work that is done by analyst vs data administrator, e.g., creating indexes



2019-1-14 Reporting Prototype Meeting Notes

-what other CIRC reports could be build from original model?

-eUsage update from Sharon: eUsage team is writing FOLIO attribute documentation we can use to build a data model

-testing with other reporting systems such as MonetDB and Vertica

-location module time of checkout issue

-use star schema and simplify query for circ report

-more data in data warehouse

-possible new diagram showing star schema

-having developers document Folio attributes still an important dependency for reporting (see UXPROD-1414)

Applications to consider reporting

-CU likes Tableau as high-end tool

-BIRT has flexibility and is free

Mirroring

-would allow queries “on the edges” of what we support

Star Schema

-2 or more step joins are flattened into a single join

-easier to write reports, and performance of database is better

-center (fact) tables and outer (dimension) tables

MonetDB

-might be a good choice for smaller libraries because it is fast; can do fast aggregate functions from different datasets

Vertica

-commercial

-fastest database for this schema

-Nassib looking into cost

-compatible with Postgres

Sample Data Generator

-Roman is working on a sample data generator

-will let us load the kind of data we need for reporting

Platform Plans

-start with Postgres with an option for participants to later upgrade to Vertica if performance is an issue (Postgres and Vertica designed by same person)


Next Steps

-Nassib to  add more data and more attributes to Library Reporting Database, and we will move to a star schema

Plans for future meeting of this group

-next meeting January 28th



----------------------------------

1/28/19

Agenda:

-(Nassib) How we are handling "missing data", temporary anomalies caused by incremental updates.

“Not Available” flag

-“eventual consistency” model and message queue

-reporting database being developed to enforce referential integrity

-placeholder record is created when referential integrity is missing, e.g., new user takes a loan but no UserID; placeholder record gets created in the UserID table with label “not available”

-star schema has loan transactions in the center and points to user but reference has been denormalized because there is no groups table

- “not available” could show due to data arriving out of order or because there is a problem with the message queue or because there is missing data in the operational database

-need to document what to do if this happens (e.g., escalate with developers)

-should say “data not found”

-when filling fields with “not available,” some fields are unique, e.g. username, so the error will say “not available” + random#

Historical data

-how to deal with historical data? Label “inactive”? If data are deleted from the operational store, would we accumulate historical data in a secondary table?

-should historical data be kept in the data warehouse only?

-at what point does data size become a problem? 100 million data points can be supported by Postgres structure

-data purging affects our data warehouse schema design

-if okay to delete from operational, okay to delete from data warehouse?

-institutions should set data retention policies

-in-app reporting not longitudinal

-developers should develop features that let institutions make their own decisions about data retention

-what happens when data gets too large? Don’t need to worry too much about this in the data warehouse, but could become a problem for the operational system

-request: take a look at sample database with missing user records labeled “not available” - Nassib to send out examples

In-App Reporting

-confusion about in-app vs dw reporting

-Chalmers downgraded in-app reports because expectation will be reports to come out of dw at go live

https://docs.google.com/document/d/1AyxUqU3_sZRUdAJ7jATu4OjJWmqfY7KGQXuOIwrHnB4/edit?usp=sharing



-(Nassib) incremental updater, which we can later use to load streaming updates.

-(Nassib) feedback on star schema

-(Nasssib) feedback on test data

-(Sharon) What are our goals for this week?



  • No labels