Notes from Past Meetings (November 2018 - January 2019)
20181107 Reporting Prototype Meeting
-to connect to the database on the Glint server, send Nassib a direct message with the IP address you will connect from (for security purposes)
-can connect from your reporting application of choice
-Nassib is working on setting up a way to allow HTTP requests for report/query results
-this is a suggested view, but it would be interesting to try different views of the data
-it would be helpful to try the query both with and without crosstab
The dataset
-sample data created with this report in mind
-approx 50 transactions with user, loan, date, patron group
Questions
-how is the data distributed, and should the distribution be different for the query?
-do you have any feedback or questions? how is the process working?
-Sharon: BIRT
-Angela: Tableau, R, PGAdmin
-Tod: Aqua Data Studio, maybe something additional
-Kevin: MS Access, Excel, Crystal Reports
For background, see https://wiki.folio.org/display/RPT/Building+Report+Prototypes
-- Requires the tablefunc extension for crosstab():
--   create extension if not exists tablefunc;
create view v as
select '2017-01-01 - 2018-12-31' as checkout_date_range,
       *,
       coalesce(faculty, 0) +
       coalesce(graduate, 0) +
       coalesce(staff, 0) +
       coalesce(undergrad, 0) as total_for_library
from crosstab(
    'select li.location_name, g.groupname, count(l.id) as ct
     from (
         select * from loans
         where loan_date >= ''2017-01-01'' and loan_date <= ''2018-12-31''
     ) l
     left join loans_item li
         on l.id = li.loan_id
     left join users u
         on l.user_id = u.id
     left join groups g
         on u.patron_group_id = g.id
     group by li.location_name, g.groupname
     order by 1, 2'
) as ct ("library" text, "faculty" bigint, "graduate" bigint,
         "staff" bigint, "undergrad" bigint)
;
select *
from v
union all
select '',
       'total_for_patron_group',
       sum(faculty),
       sum(graduate),
       sum(staff),
       sum(undergrad),
       sum(total_for_library)
from v
;
Report
   checkout_date_range   |        library         | faculty | graduate | staff | undergrad | total_for_library
-------------------------+------------------------+---------+----------+-------+-----------+-------------------
 2017-01-01 - 2018-12-31 | Annex                  |       2 |        1 |     2 |           |                 5
 2017-01-01 - 2018-12-31 | Main Library           |       5 |       13 |     1 |        13 |                32
 2017-01-01 - 2018-12-31 | SECOND FLOOR           |       1 |        2 |       |           |                 3
                         | total_for_patron_group |       8 |       16 |     3 |        13 |                40
Open a Postgres client to try the query directly
20181112 Meeting Notes
Tableau
-Angela is able to connect with Tableau (PostgreSQL connection)
-tried Custom SQL, but it cannot create views; it can access an existing view and do the union, and can also just run the simple data query without the crosstab
-note: Custom SQL should *not* include trailing semicolon
BIRT
-still installing and configuring
-Sharon to follow up about her IP address and JDBC drivers
-Angela recommended this for JDBC -- http://wiki.arcs.com/article/BIRT_With_Postgres
Crosstab
-crosstab queries are less likely to be supported by reporting tools
-need to consider whether or not to use the crosstab
Aqua Data Studio
-test SQL worked fine
MS Access
-#deleted showing up on data points
pgAdmin 4
-seems to connect fine
-was able to run complete crosstab code, including creating a view (named it “w” to make sure it actually created a new view)
-can save SQL query as local .sql file, which would be easy for sharing with others
-can save results as .csv
-interface is very complex, though - no easy query builder, just straight SQL
Simple query, without crosstab:
select li.location_name, g.groupname, count(l.id) as ct
from (
    select * from loans
    where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'
) l
left join loans_item li
    on l.id = li.loan_id
left join users u
    on l.user_id = u.id
left join groups g
    on u.patron_group_id = g.id
group by li.location_name, g.groupname
order by 1, 2;
Example without aggregation:
select li.location_name, g.groupname, l.id, l.loan_date
from (
    select * from loans
    where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'
) l
left join loans_item li
    on l.id = li.loan_id
left join users u
    on l.user_id = u.id
left join groups g
    on u.patron_group_id = g.id;
Example without aggregation or filter:
select li.location_name, g.groupname, l.id, l.loan_date
from loans l
left join loans_item li
    on l.id = li.loan_id
left join users u
    on l.user_id = u.id
left join groups g
    on u.patron_group_id = g.id;
OLAP Data Model
-Nassib starting to work on this now
GitHub Repository for LDP
-https://github.com/folio-org/ldp
Loans query is in
https://github.com/folio-org/ldp/tree/master/database/query
20181119 Reporting Prototype Meeting Notes
-some tools may not be able to create temporary views
-try a WITH clause (common table expression) instead of creating a view; see the sketch below
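For tools that cannot create views, a minimal sketch of the WITH approach, wrapping the simple checkout query from the earlier notes in a common table expression:
-- CTE version of the simple checkout query; no view is created
with checkouts as (
    select li.location_name, g.groupname, count(l.id) as ct
    from (
        select * from loans
        where loan_date >= '2017-01-01' and loan_date <= '2018-12-31'
    ) l
    left join loans_item li
        on l.id = li.loan_id
    left join users u
        on l.user_id = u.id
    left join groups g
        on u.patron_group_id = g.id
    group by li.location_name, g.groupname
)
select * from checkouts
order by 1, 2;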
Storing and Sharing Queries
-stored in GitHub
*Make sure you send your GitHub username to Nassib
Testing
-need to test queries against large datasets
-need to look at complexity of data relationships
Data Model
-can create OLAP-style schemas if needed
-see https://docs.google.com/spreadsheets/d/1O8orRx2dBedaeWIkbN30v2ZHyRG6BgecLXgy5xL35bQ/edit?usp=sharing
Stress Test
-table showing the number of rows in each table
-100 patron groups
-100,000 users
-100 locations
-20 million loan transactions per year (maybe 5 years?)
Normal Data Test
- 12 patron groups
- 30,000 users
- 20 locations
- 10 million loan transactions per year (1 year)
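As a starting point for these tests, a hedged sketch of generating loan rows at the normal-test scale with generate_series; the loans table and integer id columns are assumptions based on the queries above:
-- Hypothetical test-data sketch: 10 million loans over one year,
-- spread across 30,000 assumed user ids; scale up for the stress test.
insert into loans (id, user_id, loan_date)
select gs.i,
       (random() * 29999)::int + 1,
       date '2017-01-01' + (random() * 364)::int
from generate_series(1, 10000000) as gs(i);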
20181126 Reporting Prototypes Meeting Notes
-still need FOLIO attributes documented to do data loads
-can only work on loans, users, and groups right now
-latest documentation due soon
-Sharon to check in with Nassib week of Dec 10 to see next steps
-what data do we want next?
-which data elements do we need next?
-focus on Item Detail Report next
Tod: data integrity check?
Nassib: not ready yet because the data are not yet synchronized
Data Integrity Check
-make sure all loans point to real users and real items
-data comes from storage modules and is stored in Postgres databases
-corrupt foreign keys will not be brought into the LDP; LDP currently has foreign key constraints
-should we have foreign key constraints or not?
-use the data load to detect problems with the data
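A minimal sketch of using a query to detect dangling references, e.g., loans whose user does not exist (an analogous check would apply to items):
-- Find loans that do not point to a real user; zero rows means the
-- user references are intact (table names follow the queries above)
select l.id, l.user_id
from loans l
left join users u
    on l.user_id = u.id
where u.id is null;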
Privacy/GDPR
-how will we move it out of the system without breaking data integrity?
-we could move the references to another table so that they could still be included in reports
Prioritizing
-loans, users, and groups are in the sprint now and could be brought into the LDP
Testing Reporting Applications
-Access: not getting access to the tables in the LDP; “#deleted” errors
-Crystal Reports: able to connect to LDP, but hard to use the native SQL against LDP
20181203 Reporting Prototypes Meeting Notes
Today’s Agenda:
Circ Detail Report and updates on connecting with various reporting applications
MS Access
-MS Access having problems connecting; Access is adding the word “public” to table names; just use an ODBC driver
BIRT
-both Angela and Tod are finding that BIRT on the Mac does not work well; can start building a report, but it won’t preview in the web viewer; Java issue
Other Apps
-Angela trying Mode Studio
R
-Mac driver installation is complicated
Python
-Angela to try a test
Apache Superset
-visual SQL builder; can be run with Docker
20181210 Reporting Prototypes Meeting Notes
-Kevin created the Circ Detail Report and made notes on each of the FOLIO attributes - https://wiki.folio.org/display/RPT/Circulation+Detail+Report+Prototype
-some data element definitions had to be inferred because they were not documented
-FOLIO attribute documentation is not consistent: attributes sometimes appear in the Request, sometimes in the Response
-question: will the data be organized as the API documentation indicates?
Connecting with Access
-set up ODBC
-need to import (rather than link) the data to get good data
-a linked table gives “#deleted” as the name
-tables show up
-exported data looks good, but the tables in Access are not right
-“cannot join on Memo, OLE, or Hyperlink Object” error
-does Access expect a foreign key flag?
-remove “public_” from the SQL to get it to work
-changed type of query to “pass-through”; point to Postgres SQL 3.0
-error message saying attempt to do joins on memo fields
-maybe there is a security issue?
-maybe a problem with underscores in names of tables
-works if you replace underscores with periods in the SQL (e.g., public_loans becomes public.loans)
Crystal Reports
-like Access, the name of the database is added to the table names
-renaming tables to remove “public” may help
BIRT
-still challenging to get Mac version working
How will we authenticate to the LDP?
-IP address restriction is too limiting
-can we have Shibboleth?
-“stunnel” connection?
20181217 Reporting Prototype Meeting
-eUsage data model development
-consider COUNTER 4 to COUNTER 5 changes
Before Circ Detail Report can be tested with Reporting Apps...
-developers need to finish documenting loans, users, and groups
-Nassib and Roman need to load more data in the LDP for loans, users, and groups
-developers also need to document new attributes in second circ report
-Roman now working on generating sample data for LDP
-ask Bjorn and Annika to document the data model for eUsage before we schedule time to build data model in LDP
-also ask what form the data elements take, e.g., JSON, string, etc.
Database Schema
-“public” is the schema
-schema.table naming is required, e.g., public.loans
-best to run the database queries in Postgres
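A one-line illustration of the schema-qualified naming, using the loans table from the queries above:
-- schema "public", table "loans"
select count(*) from public.loans;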
Performance
-Postgres has query optimizer to determine the best strategy for satisfying the query
Documentation
-would be good for group to document how best to write queries using the data model
-will depend somewhat on work that is done by analyst vs data administrator, e.g., creating indexes
20190114 Reporting Prototype Meeting Notes
-what other CIRC reports could be built from the original model?
-eUsage update from Sharon: eUsage team is writing FOLIO attribute documentation we can use to build a data model
-testing with other database systems such as MonetDB and Vertica
-location module time of checkout issue
-use star schema and simplify query for circ report
-more data in data warehouse
-possible new diagram showing star schema
-having developers document FOLIO attributes is still an important dependency for reporting (see UXPROD-1414)
Applications to consider for reporting
-CU likes Tableau as a high-end tool
-BIRT has flexibility and is free
Mirroring
-would allow queries “on the edges” of what we support
Star Schema
-joins of two or more steps are flattened into a single join (see the sketch below)
-reports are easier to write, and database performance is better
-center (fact) tables and outer (dimension) tables
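A hedged sketch of how the circ query might look against a star schema; the fact and dimension table names (loan_fact, user_dim, location_dim) are assumptions, not the actual LDP schema:
-- Each dimension is one join away from the fact table, instead of the
-- multi-step hops through loans_item/users/groups in the queries above.
select ld.location_name, ud.groupname, count(*) as ct
from loan_fact lf
join user_dim ud
    on lf.user_key = ud.user_key
join location_dim ld
    on lf.location_key = ld.location_key
where lf.loan_date >= '2017-01-01' and lf.loan_date <= '2018-12-31'
group by ld.location_name, ud.groupname
order by 1, 2;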
MonetDB
-might be a good choice for smaller libraries because it is fast; can run fast aggregate functions across different datasets
Vertica
-commercial
-fastest database for this schema
-Nassib looking into cost
-compatible with Postgres
Sample Data Generator
-Roman is working on a sample data generator
-will let us load the kind of data we need for reporting
Platform Plans
-start with Postgres with an option for participants to later upgrade to Vertica if performance is an issue (Postgres and Vertica designed by same person)
Next Steps
-Nassib to add more data and more attributes to Library Reporting Database, and we will move to a star schema
Plans for future meeting of this group
-next meeting January 28th
20190128 Reporting Prototype Meeting Notes
Agenda:
-(Nassib) How we are handling "missing data", temporary anomalies caused by incremental updates.
-(Nassib) incremental updater, which we can later use to load streaming updates.
-(Nassib) feedback on star schema
-(Nassib) feedback on test data
-(Sharon) What are our goals for this week?
“Not Available” flag
-“eventual consistency” model and message queue
-reporting database being developed to enforce referential integrity
-a placeholder record is created when a referenced record is missing, e.g., a new user takes out a loan but there is no matching user record; a placeholder record gets created in the user table with the label “not available”
-the star schema has loan transactions in the center and points to users, but the group reference has been denormalized because there is no groups table
-“not available” could show up because data arrived out of order, because there is a problem with the message queue, or because data is missing in the operational database
-need to document what to do if this happens (e.g., escalate with developers)
-should say “data not found”
-when filling fields with “not available,” some fields are unique, e.g., username, so the placeholder will say “not available” + a random number
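A minimal sketch of this placeholder idea in SQL, assuming the users and loans tables from the earlier queries; the column names and the random-suffix scheme are illustrative, not the actual LDP implementation:
-- Create "not available" placeholder users for any loan whose user is
-- missing; a random suffix keeps a unique username column satisfied.
insert into users (id, username)
select missing.user_id,
       'not available ' || substr(md5(random()::text), 1, 8)
from (
    select distinct l.user_id
    from loans l
    left join users u
        on l.user_id = u.id
    where u.id is null
) missing;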
Historical data
-how to deal with historical data? Label “inactive”? If data are deleted from the operational store, would we accumulate historical data in a secondary table?
-should historical data be kept in the data warehouse only?
-at what point does data size become a problem? 100 million data points can be supported by the Postgres structure
-data purging affects our data warehouse schema design
-if okay to delete from operational, okay to delete from data warehouse?
-institutions should set data retention policies
-in-app reporting not longitudinal
-developers should develop features that let institutions make their own decisions about data retention
-what happens when data gets too large? Don’t need to worry too much about this in the data warehouse, but could become a problem for the operational system
-request: take a look at sample database with missing user records labeled “not available” - Nassib to send out examples
In-App Reporting
-confusion about in-app vs. data warehouse reporting
-Chalmers downgraded in-app reports because the expectation is that reports will come out of the data warehouse at go-live
https://docs.google.com/document/d/1AyxUqU3_sZRUdAJ7jATu4OjJWmqfY7KGQXuOIwrHnB4/edit?usp=sharing