| Topic | Who | Notes |
|---|---|---|
Attendance & Notes | Angela | Attendance & Notes |
LDP query performance | Nassib | Basic tips for faster queries in LDP databases |
Announcements / Reminders | Angela |
- Trying to find our latest recordings
- Recruiting New Query Developers
LDLite review | (very brief) | LDLite notes from previous meeting: https://github.com/library-data-platform/ldlite
- Missouri State was the first US institution to go live with FOLIO
  - had reports generated in the previous system that library staff wanted to retain; that info is not really available via the API
  - setting up LDLite was something we wanted to do anyway, and it gave us a chance to do other things
  - we are an EBSCO-hosted library, so we don't have access to LDP; if we wanted to do anything using SQL, LDLite was the only option
  - don't mind doing stuff with the APIs, enjoy that, but really like SQL; having things in an SQL format gives us some additional options we wouldn't normally have
- Using the variation Nassib mentioned: MSU takes data from LDLite and puts it into a Postgres database
  - database 1: a complete harvest of all data, not updated often; with 1.4 million bib records, a full harvest takes about 33 hours (VM with 4 processors, 8 GB RAM, 300 GB storage), including SRS (15 hours just by themselves)
  - set that up first just to see what it was like
  - soon after, a librarian needed to do some stats; didn't want them to have all the data, so created a smaller database and only harvested the data they need (collections, circulation); that harvest took about 9 hours
- Have been around libraries since 1993; back in the mid-'90s, if you wanted a free scripting language, PHP and MySQL were the thing; no background in Postgres or Python, but have been able to figure it out
- Have some scripts that run the harvest: start it at the end of the day, let it run overnight
- Working on something else, with help from Nassib: incremental updates; want a script that only brings in changes since its last run; hopefully run it in the morning and have the data ready for working in the afternoon
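The incremental-update idea could be sketched roughly as below. Everything here is illustrative, not MSU's actual script: the state-file name and function names are invented, and the commented `ld.query(...)` call follows the LDLite README, so treat it as an assumption rather than a tested recipe.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

STATE_FILE = Path("last_harvest.txt")  # hypothetical file remembering the last run


def load_last_run() -> Optional[str]:
    """Return the timestamp of the previous harvest, or None on the first run."""
    return STATE_FILE.read_text().strip() if STATE_FILE.exists() else None


def save_run_time() -> str:
    """Record the current UTC time for the next incremental run."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    STATE_FILE.write_text(now)
    return now


def incremental_cql(last_run: Optional[str]) -> str:
    """Full harvest on the first run; afterwards only records changed since then."""
    if last_run is None:
        return "cql.allRecords=1"
    return f'metadata.updatedDate > "{last_run}"'


# The filter would then be handed to LDLite per endpoint, e.g.:
#   ld.query(table="folio.loans", path="/loan-storage/loans",
#            query=incremental_cql(load_last_run()))
```

Whether a given FOLIO endpoint honors filtering on `metadata.updatedDate` varies, so each endpoint would need to be checked individually.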
- Like the idea of using multiple databases to control access: just don't include sensitive data in certain databases
- The script in use is a variation of the script Nassib has on the GitHub page; have added the call into the database
- Have also created a text output: it records when the harvest starts and ends, and writes the endpoint and the tables it wrote for that endpoint; could also have it print the table structure
- Have DBeaver connected to the VM for running queries; the VM is an Ubuntu server and everything is done on the command line, so wanted DBeaver as an interface for queries
- Question: are institutions accessing legacy data through the LDP?
  - Cornell has saved off data in a Postgres database and hopes to bring it into the LDP; have pulled data out of the legacy system, and a bunch has been pulled into the local schema
  - Have been pulling requests data from Ares into the LDP to blend with FOLIO
  - Have been talking about getting the data from the Voyager archive and using that; would really love that
  - Have been talking about getting bits and pieces of the Voyager data separately (it's static, so it only has to be done once)
  - Getting the data from the legacy system into a format that will work is a big project; it would have been nice to bring it in just before implementation
  - During migration, some things may have accidentally gotten skipped, so it's good to have that backup; it would be easier if it were already in the LDP
  - It's hard to query out the circ data and then bring it into the LDP; it would be easier if it were all there
  - Stanford is likely to go straight to Metadb, which will eventually have a concept of multiple input sources: FOLIO would be one input source, but you could possibly set up the legacy ILS as a second input source, and Metadb could pull everything (or a defined subset) automatically; if you needed to run both systems at the same time, it could continue to pull data from the legacy system
- Any lessons learned for implementation?
  - Not terribly difficult: just install LDLite (it has one dependency, maybe); had more trouble getting Postgres installed and configured than LDLite
  - Originally tried using LDLite without a database, but that's not great for harvesting absolutely everything; if you are going to use it for a lot of data, go ahead and install Postgres and connect LDLite to that
  - That means you'll have to learn Postgres, at least a little; have gotten everything done with Google
- If you're using LDLite in place of an LDP or Metadb instance, it probably makes sense to have a shared database that mimics an LDP, and use a regularly scheduled job to have LDLite update the data; this probably works better for a smaller library where you can download the data regularly; a larger university probably needs the full LDP or Metadb
- If Mark can get the incremental updates working, the plan is to write up the script and schedule it to run every night, which would give users access to yesterday's data, just like LDP users
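As a sketch of the recommended setup (LDLite connected to Postgres rather than run standalone): the URL, tenant, and credential values below are placeholders, and the `ldlite` calls follow its README, so treat all of it as an assumption rather than a tested recipe.

```python
def pg_dsn(host: str, db: str, user: str, password: str, port: int = 5432) -> str:
    """Assemble a libpq-style connection string for the Postgres backend."""
    return f"host={host} port={port} dbname={db} user={user} password={password}"


def setup_ldlite(dsn: str):
    """Point LDLite at FOLIO and at Postgres (untested; needs a live FOLIO)."""
    import ldlite  # imported lazily; `pip install ldlite` pulls in its dependency
    ld = ldlite.LDLite()
    ld.connect_folio(url="https://folio-snapshot-okapi.dev.folio.org",  # placeholder
                     tenant="diku", user="diku_admin", password="admin")
    ld.connect_db_postgresql(dsn=dsn)
    return ld
```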
- What kind of documentation is there? With different databases covering different things, how do you share that with users?
  - There are only two people using it right now at MSU, so not too many people need to know; just shared the database connection details and he went ahead with DBeaver; he doesn't know about the other database he doesn't have access to
  - We're a small library: 45 employees, 3 IT staff, and Mark is the only one doing FOLIO work, so the documentation is just for him
- Documentation for other universities: we hope to have an LDLite page up for the Juniper version of the FOLIO docs soon: https://juniper.docs.folio.org/docs/reporting/
- Mark is happy to share the script; just message him on Slack
- Question: interacting with Postgres via Python?
  - psycopg2 is typical, but note that it defaults to autocommit being turned off: you can run a long query and some kind of transaction stays open, and basically all updates get blocked on those tables (the DB locks the tables); autocommit off is a big problem as a default
  - LDLite opens the DB connection for you and sets autocommit to on; it's very important to have that on
  - There are other, less critical things to be aware of: how psycopg2's transactions map to database transactions is not entirely straightforward; look at the LDLite code, which we think does it the right way; Nassib can also help with that
  - If you do a SELECT on a lot of data, psycopg2 defaults to loading all of the results into memory on your local machine; that doesn't happen with a regular database client, which streams the data over and does something sensible
  - To stream, you have to use a named cursor: when you create the cursor that iterates through the results, give it a name, and the results are held on the server (possibly cached on disk) instead of being loaded into memory; pretty tricky
- Python is the standard language for data science now, so it's a good language to learn, but it has some issues, and Nassib doesn't like it as a general-purpose language
- Another language worth looking at is Go; Metadb is written in Go
  - Not quite as quick to get up and running with as Python; Python sometimes makes things look easy but hides subtle issues, while Go is more literal
  - A very tiny language, quick to learn; it borrowed good ideas from other languages, including Python; not quite as expressive as Python, but doesn't have as many hidden things going on
  - If you're talking about data scientists using Jupyter notebooks, Python makes sense; for software or complex scripting that supports institutional operations, would suggest looking at Go
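The two psycopg2 caveats above (autocommit and named cursors) might look like the sketch below. The DSN is a placeholder, `stream_query` is a hypothetical helper, and `batched` is a plain utility added here only for illustration.

```python
def batched(iterable, n):
    """Yield lists of at most n items (plain helper, illustration only)."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch


def stream_query(dsn, sql, batch_size=1000):
    """Stream a big SELECT without pulling every row into client memory.
    Untested sketch; requires psycopg2 and a reachable Postgres."""
    import psycopg2
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # the default is off: a long-lived open transaction
                            # can end up blocking updates on the tables read
    # Naming the cursor makes it server-side, so rows arrive in batches of
    # itersize; withhold=True lets a named cursor coexist with autocommit.
    with conn.cursor(name="reporting_stream", withhold=True) as cur:
        cur.itersize = batch_size
        cur.execute(sql)
        for row in cur:
            yield row
    conn.close()
```

An unnamed cursor would instead fetch the entire result set on `execute`, which is exactly the memory problem described above.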
Demonstration of GitHub Desktop | Angela | |
Updates and Query Demonstrations from Various Reporting Related Groups and Efforts | Community & Coordination, Reporting Subgroup Leads |
Project updates: Reporting development is using small subgroups to address priorities and complete work on report queries. Each week, these groups will share reports/queries with the Reporting SIG. Reporting development team leads are encouraged to enter a summary of their work group activities below.
- RA/UM Working Group
  - Meetings have become more of a lab session, working through specific problems
  - Angela has been attending RA SIG meetings to open the lines of communication, now once a month
  - Contact Angela if you would like to join these meetings; second Wednesdays at 1pm Eastern (but for February, actually Tuesday the 8th at 2pm Eastern)
- MM Working Group
  - Meetings are the 1st Tuesday of the month, 12-1pm ET via Zoom, using the usual FOLIO password
  - Our lab sessions are open to everyone; please bring your questions, examples, and comments about reporting and metadata
  - We are still looking for help with derived tables and reviewers; see our sign-up sheet
  - In our next lab session, we will use the LDP instance identifiers and go through the process of porting this over to Metadb, how to contribute a pull request with the changes and update runlist.txt, and go through a review
- ERM Working Group
  - See ERM Prototype and Query Development Status
  - Meetings are biweekly on Tuesdays at 11am ET, alternating with the RM Working Group
  - Next meeting will be March 29
  - Contact @Axel Dörrer if you would like to get a calendar invitation
Mermaid diagrams can now display directly in GitHub markdown! Example
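For instance, a fenced code block marked `mermaid` now renders as a diagram on GitHub; this small flowchart (its contents are invented purely for illustration) shows the syntax:

```mermaid
flowchart LR
    FOLIO[FOLIO APIs] --> LDLite
    LDLite --> PG[(Postgres)]
    PG --> DBeaver[DBeaver / SQL queries]
```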
- RM Working Group
  - The group is working on RM-related derived tables for Metadb
  - For the latest query development updates, see RM Prototype and Query Development Status
  - Meetings are biweekly on Tuesdays, 11am-noon ET; contact @Sharon Markus if you would like to join us
  - Next meeting is Tuesday, March 22, 11am-noon ET
- Reporting SIG Documentation Subgroup: no updates currently
- External Statistics Working Group
  - New organizational/tracking scheme for JIRA, with pointers to queries in the folio-analytics repository
  - New organizational structure for External Statistics reports:
    - External statistics reports (e.g., ACRL) typically require running queries from different functional reporting areas
    - These reports will be captured in JIRA under one UXPROD-XXXX report cluster issue each; the descriptions will then point to each of the queries in the folio-analytics repository required to run them
    - Institutions will need to rank each of these 8 new UXPROD-XXXX report cluster issues
    - Each reporting development team will take responsibility for the queries in their area for the external statistics clusters
- Product Council
  - For all recent work on FOLIO Reporting SQL development: |
Demonstration of GitHub Desktop (probably next meeting) | Angela | |
Topics for Future Meetings | All |
- Next: reviewing a pull request
- How to deal with External Stats reports?
- Definitely do the GitHub Desktop demonstration at a SIG meeting in the future, probably about 15 minutes
- Query optimization by Nassib: Mar 28 and Apr 7
Review and update Topics for Future Reporting SIG Meetings |