2022-03-28 Reporting SIG Meeting notes

Date: 2022-03-28

Attendees (x = present)

Present? | Name | Organization
x | Arthur Aguilera | University of Colorado, Boulder
x | Sharon Beltaine | Cornell University
  | Erin Block | University of Colorado, Boulder
  | Nancy Bolduc | Cornell University
  | Lloyd Chittenden | Marmot
x | Axel Doerrer | University Mainz
  | Shelley Doljack | Stanford University
x | Stefan Dombek | Leipzig University
x | Jennifer Eustis | U. Massachusetts Amherst / Five College
  | Alissa Hafele | Stanford University
  | Jeanette Kalchik | Stanford University
  | Kevin Kishimoto | Stanford University
  | Ingolf Kuss | hbz
  | Kim Laine | Cornell University
  | Joanne Leary | Cornell University
x | Eliana Lima | Fenway Library Organization
  | Eric Luhrs | Lehigh University
x | Linda Miller | Cornell University
x | Nassib Nassar | Index Data
x | Elena O'Malley | Emerson
x | Tod Olson | University of Chicago
x | Jean Pajerek | Cornell University
x | Michael Patrick | The University of Alabama
x | Eric Pennington | Texas A&M
x | Scott Perry | University of Chicago
  | Natalya Pikulik | Cornell University
  | Vandana Shah | Cornell University
x | Amelia Sutton | U. Massachusetts
x | Simona Tabacaru | Texas A&M
  | Huey-Ning Tan | Stanford University
  | Irina Trapido | Stanford University
x | Kevin Walker | The University of Alabama
x | Angela Zoss | Duke University

Discussion Items

Attendance & Notes (Angela)

  • Today's attendance-taker: ?
  • Today's note-takers:  Team Leads for project updates
LDP query performance (Nassib)

Basic tips for faster queries in LDP databases

  • Indexes
    • an index is like a card catalog for a table in a database
    • it is worthwhile to create indexes on some columns in large tables (i.e., tables with many rows)
    • Which columns? Any columns that will be used in:
      • WHERE column ...
      • JOIN table ON column ...
      • ORDER BY column
    • do this with: CREATE INDEX ON table (column) (see the sketch below)
    • Types of indexes:
      • B-tree (default; helps with WHERE or JOIN with operators like =, <>, etc.; also helps with ORDER BY)
      • GIN (helps with LIKE and ILIKE pattern matching operators); CREATE INDEX ON table USING GIN (column gin_trgm_ops)
      • you can create both kinds of indexes on the same column, but you usually don't need to; even if you only have a GIN index, you can still use an equality operator
      • the only column in LDP that has a GIN index is the "content" column in the srs_marc table
      • if you create a large table and you have a varchar/text field that you want to do a lot of pattern matching on, you can create a GIN index
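A minimal sketch of both index types; the schema, table, and column names (local_tables.loan_details, loan_date, item_title) are hypothetical placeholders, not actual LDP tables:

    -- B-tree index (the default): helps WHERE/JOIN comparisons and ORDER BY
    CREATE INDEX ON local_tables.loan_details (loan_date);

    -- GIN trigram index: helps LIKE/ILIKE pattern matching
    -- (assumes the pg_trgm extension is available, as the notes above suggest it is on LDP)
    CREATE INDEX ON local_tables.loan_details USING GIN (item_title gin_trgm_ops);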
  • Vacuum analyze
    • helpful to run this after creating a large table or modifying its contents (e.g., updated or deleted a lot of rows)
    • create indexes before running vacuum analyze, especially when creating a GIN index (see the example below)
    • database system will automatically run this on tables after some period of time, but you can run it manually if you plan to use the table right away
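For example, after building and indexing a large table (same hypothetical name as above):

    -- update the query planner's statistics and tidy up the new table;
    -- run this after the table's indexes have been created
    VACUUM ANALYZE local_tables.loan_details;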
  • Temporary tables
    • if you need a large table but only briefly, use a temporary table
    • temporary tables are faster to create than normal tables, because the database system does not worry about trying to recover their data in case of a crash
    • CREATE TEMP TABLE table AS SELECT ... (see the sketch below)
    • when your connection to the database closes, your temporary tables are automatically deleted
    • note that temporary tables are not visible to other users
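A sketch; the source table and columns (public.circulation_loans, id, item_id, loan_date) are illustrative:

    -- dropped automatically when the database connection closes
    CREATE TEMP TABLE recent_loans AS
    SELECT id, item_id, loan_date
    FROM public.circulation_loans
    WHERE loan_date >= '2022-01-01';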
  • Breaking apart large queries
    • queries with subqueries SELECT ... (SELECT ...) or CTEs (e.g., the WITH parameters AS... part at the top of some of our report queries)
    • if subqueries or CTEs use large tables, they are not always handled well by the database system
    • might be helpful to create the subqueries and CTEs as separate, intermediate tables (see the example below)
    • if only needed temporarily, could be temporary tables
    • if useful for other queries, can just create those tables
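As an example of the rewrite, with illustrative table names (public.circulation_loans, public.inventory_items): instead of a WITH loan_counts AS (...) clause at the top of a report query, the CTE can be materialized first:

    -- materialize what was a CTE or subquery as its own (here temporary) table
    CREATE TEMP TABLE loan_counts AS
    SELECT item_id, count(*) AS loans
    FROM public.circulation_loans
    GROUP BY item_id;

    VACUUM ANALYZE loan_counts;

    -- the main query now reads from a plain table instead of a CTE
    SELECT i.barcode, lc.loans
    FROM loan_counts AS lc
    JOIN public.inventory_items AS i ON i.id = lc.item_id;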
  • Use NOT EXISTS instead of NOT IN
    • NOT IN is slow with large tables
    • WHERE column1 NOT IN (SELECT column2 FROM table)
    • instead, use NOT EXISTS (see the sketch below):
      • WHERE NOT EXISTS (
        SELECT 1 FROM table WHERE table.column2 = column1
        )
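A concrete version of the rewrite; the tables and columns (public.inventory_holdings, public.inventory_instances) are illustrative:

    -- slow on large tables:
    SELECT h.id
    FROM public.inventory_holdings AS h
    WHERE h.instance_id NOT IN (SELECT i.id FROM public.inventory_instances AS i);

    -- usually much faster (and also safer if the subquery can return NULLs):
    SELECT h.id
    FROM public.inventory_holdings AS h
    WHERE NOT EXISTS (
        SELECT 1 FROM public.inventory_instances AS i WHERE i.id = h.instance_id
    );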
  • Simple column names
    • if a WHERE condition wraps a column in an expression or function call, the database cannot use an index on that column
      • WHERE column + expression = ...
      • WHERE function(column) = ...
    • where possible, keep the bare column on the left and move the expression to the right side of the comparison (see the sketch below)
      • WHERE column = ... - expression
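For example, assuming an indexed loan_date column on an illustrative table:

    -- the index on loan_date cannot be used: the column sits inside an expression
    SELECT id FROM public.circulation_loans
    WHERE loan_date + interval '30 days' < now();

    -- rewritten with the bare column on the left, the index can be used
    SELECT id FROM public.circulation_loans
    WHERE loan_date < now() - interval '30 days';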
  • System configuration
    • ask system administrator:
      • increase CPU cores, memory, and I/O specs
      • use recommended database configuration settings in the LDP Administrator Guide (should match the size of the system)
        • if you increase the size of the system, you have to update the configuration settings because they are conservative by default
      • upgrade to recent versions of LDP software
    • contact Nassib with questions!

Announcements / Reminders (Angela)

Trying to find our latest recordings


Recruiting New Query Developers

  • The Reporting SIG is always on the look-out for new query developers. Please let us know if you are interested in doing query development or if there are others at your institution who might be a good fit.


LDLite review (very brief)

LDLite notes from previous meeting

https://github.com/library-data-platform/ldlite

  • Missouri State first US institution to go live with FOLIO
  • had reports that were generated in previous system that library staff wanted to retain
  • info not really available via API
  • setting up LDLite was something we wanted to do anyway, gave us a chance to do other things 
  • are an EBSCO-hosted library, so don't have access to LDP
  • if we wanted to do anything using SQL, LDLite was only option
  • don't mind doing stuff with the APIs, enjoy that, but really like SQL
  • having things in an SQL format gives us some additional options we wouldn't normally have
  • using the variation Nassib mentioned - MSU is taking data from LDLite and putting it into a Postgres database
  • 1 database - complete harvest of all data; not updated often; have 1.4 million bib records, full harvest takes about 33 hours (VM with 4 processors 8 GB RAM, 300 GB storage), including SRS (15 hours just by themselves)
  • set that up first just to see what it was like
  • soon after, librarian needed to do some stats, didn't want them to have all the data, so created a smaller database and only harvested the data they need (collections, circulation) - harvest took about 9 hours
  • have been around libraries since 1993; back in the mid-'90s, if you wanted a free scripting language, PHP and MySQL were the thing; don't have a background in Postgres or Python, but have been able to figure them out.
  • have some scripts that run the harvest - start it end of day, let it run overnight. 
  • working on something else, with help from Nassib - want to do incremental updates. want to run a script that only brings changes since the last run of the script. hopefully run in the morning and get the data for working in the afternoon.
  • like the idea of using multiple databases to control access - just don't even include sensitive data in certain databases
  • the script in use is a variation of the script Nassib has on the GitHub page
  • have added the call into the database
  • have also created a text output; says when it starts and ends, also writes the endpoint and the tables it wrote for that endpoint; could also have it print table structure
  • have DBeaver connected to the VM for running queries; the VM is an Ubuntu server where everything is done at the command line, so DBeaver provides an interface for queries
  • Questions: institutions accessing legacy data through LDP
    • Cornell has saved off data in a Postgres database, hope to bring it into the LDP; have pulled data out of legacy, do have a bunch that has been pulled into the local schema 
    • have been pulling requests data from Ares into the LDP to blend with FOLIO
    • have been talking about getting the data from the Voyager archive and using that; would really love that
    • have been talking about getting bits and pieces of the Voyager data separately (it's static, so just have to do it once)
    • getting the data from the legacy system into a format that will work is a big project; would have been nice to bring it in just before implementation
    • during migration, some things may have accidentally gotten skipped, so good to have that backup, would be easier if it was already in LDP
    • hard to query out the circ data and then bring it into the LDP
    • would be easier if it were all there
    • Stanford is likely to go straight to Metadb, which will eventually have a concept of multiple input sources; FOLIO would be one input source, but you could possibly set up the legacy ILS as a second input source, and Metadb could pull everything automatically (or a defined subset); and then if you needed to run both systems at the same time, it could continue to pull data from the legacy system
  • any lessons learned for implementation?
    • not terribly difficult, just install LDLite (one dependency, maybe); had more trouble getting Postgres installed and configured than LDLite
    • originally tried just using LDLite without database, but not great for harvesting absolutely everything; if you are going to use it for a lot of data, recommend go ahead and install Postgres and connect LDLite to that
    • that means you'll have to learn Postgres, at least a little
    • have gotten everything done with Google
    • If you're using LDLite in place of an LDP or Metadb instance, it probably makes sense to have a shared database that mimics an LDP and to use a regular schedule to have LDLite update the data; this probably works better for a smaller library that can download the data regularly, while a larger university probably needs full LDP or Metadb
    • if Mark can get the incremental updates working, the plan is to write up the script and schedule it to run every night, which would give the users access to yesterday's data just like LDP users
  • what kind of documentation is there? you talk about having different databases that cover different things, how do you share that with users?
    • there are only two people using it right now at MSU, so not too many people need to know; just shared the database connection details and the other user went ahead with DBeaver; he doesn't know about the other database he doesn't have access to
    • we're a small library - 45 employees, 3 IT staff, Mark is only one doing FOLIO stuff. documentation is just for him.
    • documentation for other universities
    • Note: we hope to have LDLite page up for the Juniper version of the FOLIO docs soon: https://juniper.docs.folio.org/docs/reporting/
  • Mark is happy to share the script, just message him on Slack


Question: interacting with Postgres via Python?

  • psycopg2 is typical, but note that it defaults to autocommit being turned off, so a long-running query can leave a transaction open, and updates to the tables it touched are basically blocked (the database holds locks on them)
  • autocommit off is a big problem as a default
  • LDLite opens the database connection for you and sets autocommit to on; it is very important to have that on
  • there are other less critical things to be aware of: 
    • how psycopg2 transactions map to database transactions is not entirely straightforward
    • look at the LDLite code, which is thought to do this the right way
    • Nassib can also help with that
    • if you do a SELECT on a lot of data, psycopg2 defaults to loading all of the results into memory on your local machine; that doesn't happen with a regular database client, which streams the data over and does something sensible
    • to avoid this, use a named cursor: when you create the cursor that iterates through the results, give it a name, and the results are then kept on the server and fetched incrementally instead of all being loaded into memory
    • pretty tricky
    • Python is the standard language for data science now, so it is a good language to learn, but it has some issues and is not ideal as a general-purpose language
    • another language worth looking at is Go. Metadb is written in Go. It is not quite as quick to get up and running with as Python; Python sometimes makes things look easy but hides subtle issues, while Go is more literal. It is a very tiny language, quick to learn, and it borrowed good ideas from other languages, including Python. It is not quite as expressive as Python, but it has fewer hidden things going on. For data scientists using Jupyter notebooks, Python makes sense; for software or complex scripting that supports institutional operations, Go is worth a look.
Demonstration of GitHub Desktop (Angela)
Updates and Query Demonstrations from Various Reporting-Related Groups and Efforts (Community & Coordination, Reporting Subgroup Leads)

Project updates

Reporting development is using small subgroups to address priorities and complete work on report queries.  Each week, these groups will share reports/queries with the Reporting SIG.  Reporting development team leads are encouraged to enter a summary of their work group activities below.

RA/UM Working Group


MM Working Group

  • Meetings are the 1st Tuesday of the month, 12-1pm ET, via Zoom using the usual FOLIO password. Our lab sessions are open to everyone. Please bring your questions, examples, and comments about reporting and metadata.
  • We are still looking for help with derived tables and for reviewers. See our sign-up sheet.
  • In our next lab session, we will use the LDP instance identifiers and go through the process of porting them over to Metadb, how to contribute a pull request with the changes and update runlist.txt, and how to go through a review.


ERM Working Group


RM Working Group


Reporting SIG Documentation Subgroup

  • Honeysuckle documentation is live on https://docs.folio.org/docs/
  • Iris documentation is in progress, due December 15
  • Additional Context
    • The Reporting SIG has representation on the Documentation Working Group, which is building end-user documentation for https://docs.folio.org/docs/ (mostly linking to existing documentation over on GitHub)


External Statistics Working Group

  • no updates currently
  • new organizational/tracking scheme for JIRA, with pointers to queries in folio-analytics repository
  • New organizational structure for External Statistics reports
    • external statistics reports (e.g., ACRL) typically require running queries from different functional reporting areas
    • these reports will be captured in JIRA under one UXPROD-XXXX report cluster issue, then the descriptions will point to each of the queries required to run them on the folio-analytics repository
    • institutions will need to rank each of these 8 new UXPROD-XXXX report cluster issues
    • each reporting development team will take responsibility for the queries in their area for the external statistics clusters


Product Council



For all recent work on FOLIO Reporting SQL development:


Demonstration of GitHub Desktop (probably next meeting) (Angela)
Topics for Future Meetings (All)
  • Next: Reviewing a pull request 
  • How to deal with External Stats reports?
    • maybe subteam leads check in about that
    • probably wait until after Metadb conversion is more complete
  • Definitely do GitHub Desktop demonstration at a SIG meeting in future, probably about 15 minutes
  • Query optimization by Nassib: Mar 28 and Apr 7


Review and update Topics for Future Reporting SIG Meetings 





  • A test Action Item (Ingolf)