[FOLIO-1042] Look at https://airflow.apache.org/ Created: 30/Jan/18  Updated: 18/Jan/19

Status: Open
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P3
Reporter: Heikki Levanto Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: madrid, sprint31, sprint32, sprint33
Remaining Estimate: 3 hours
Time Spent: 4 hours, 30 minutes
Original estimate: Not Specified

Issue links:
Blocks
blocks FOLIO-1081 High-level design of FOLIO workflow e... Closed
Sprint:
Development Team: Core: Platform

 Description   

At our workflow discussion in Madrid it was recommended to look at Airbnb's Airflow, to see whether it could be of use in FOLIO.



 Comments   
Comment by Heikki Levanto [ 08/Feb/18 ]

Taking a quick read through the docs, writing down first impressions:

  • It looks like users would have to understand Python to use Airflow, at least to work with the templates.
  • All the examples seem to execute bash commands. I am sure manual steps can be added too, but it may be clumsy.
  • The installation seems to need access to local file storage, at least by default. There are some clustering options, but I can't make much sense of them.
  • Airflow is not tenant-aware: everything seems to happen in the same space, all logs go to one place, etc. It might be possible to define one Airflow user per tenant, but in a cluster that user would need to be defined on every worker node.
  • There are HTTP request operators, which could be used against a FOLIO system.
  • Scheduling of repeated runs is based on cron strings. Sufficient, but we had problems with that in our harvester.
  • There is some sort of UI for managing it, but of course it will not fit in with FOLIO UI guidelines.

My first impression is not very positive. Airflow is big and complex, requires workflow authors to understand Python, and seems mostly designed for running repeating scripts with some dependencies, not for controlling manual operations. There are also some underlying assumptions, such as that all operations must be idempotent (safe to rerun in case of problems, so a task cannot simply insert an item or increment a counter).
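To make the idempotency point concrete, here is a minimal sketch in plain Python (not Airflow code); the item store and helper names are hypothetical, purely for illustration:

```python
# Hypothetical illustration of why workflow tasks are expected to be
# idempotent: a retried run must not change the outcome.

items = {}  # stand-in for a FOLIO item store, keyed by barcode

def insert_item_naive(barcode):
    # NOT idempotent: a retry after a partial failure raises an error
    # (and an increment-a-counter task would silently double-count).
    if barcode in items:
        raise ValueError("duplicate item")
    items[barcode] = {"barcode": barcode}

def upsert_item(barcode):
    # Idempotent: running it twice leaves the system in the same state.
    items[barcode] = {"barcode": barcode}

upsert_item("b-001")
upsert_item("b-001")  # safe to rerun after a failure
print(len(items))     # 1
```

The point is that a scheduler which freely retries failed tasks only works if every task is written in the second style.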

Comment by shale99 [ 08/Feb/18 ]

this is a Python-heavy workflow engine: everything is in Python, so knowing it is a requirement. Do we expect users to write their own tasks, or do we come out of the box with a fixed set, where adding a task is a rarity? One thing we could do is create a template, something like "API = expected response, API = expected response, etc." as a pipeline with some notification options (email, etc.), and translate that into Python, so that jobs can be declarative. Then, as long as it's not too complicated, the Python can be abstracted away from end users... (just thinking out loud)
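The "API = expected response" idea above could be sketched roughly like this; note that the endpoint names, the `call_api` stub, and the spec format are all made up for illustration, not actual FOLIO or Airflow APIs:

```python
# Sketch of a declarative pipeline: a list of (API, expected response)
# pairs, run in order, with a notification hook on failure.
# All names here (endpoints, call_api) are hypothetical.

PIPELINE = [
    ("/instances", 200),
    ("/holdings", 200),
    ("/items", 201),
]

def call_api(endpoint):
    # Stand-in for a real HTTP call (which would carry the FOLIO
    # tenant header, auth token, etc.).
    fake_responses = {"/instances": 200, "/holdings": 200, "/items": 201}
    return fake_responses.get(endpoint, 500)

def run_pipeline(pipeline, notify=print):
    # Walk the declarative spec; stop and notify on the first mismatch.
    for endpoint, expected in pipeline:
        status = call_api(endpoint)
        if status != expected:
            notify(f"step {endpoint} failed: got {status}, expected {expected}")
            return False
    return True

print(run_pipeline(PIPELINE))  # True
```

A generator could emit an Airflow DAG from such a spec, which is one way the Python could be hidden from end users.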

the tasks are all Python code, so we can think of it this way: whatever the programming language can do, you can probably do as well. Call APIs to update something; fail a task if the record was already updated; send an email when something fails, which may require someone, like a librarian, to make a change in FOLIO that corrects the status, and then kick off another pipeline from that point on.

the DAGs (files) are found on disk; it is usually recommended to store them in S3 or something similar. We may be able to get creative and store them in the DB, moving them to temporary local storage for running, but this is just a guess.

the UI is sort of tenant-aware: you can have a user per tenant and the UI will only display that user's info. But I am not sure how Airflow handles this at the DB layer (row-level security? just a trigger? I only see a userid column in the chart table... I can look deeper into it if needed), and I don't know if anything else is tenant-aware. I guess if we work with APIs only, the tenant header will be the separator.

as for scheduling, it is cron syntax (is it actually cron that runs it, though?), and there are some shortcuts as well, like daily and hourly.
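For reference, the schedule shortcuts are just aliases for cron strings, so resolving them is trivial. A rough sketch of the mapping (worth double-checking against the Airflow docs; cron itself does not run the schedules, Airflow's own scheduler process does):

```python
# Rough sketch of how schedule shortcuts expand to cron strings
# (the exact preset list should be verified against the Airflow docs).
PRESETS = {
    "@hourly":  "0 * * * *",
    "@daily":   "0 0 * * *",
    "@weekly":  "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly":  "0 0 1 1 *",
}

def to_cron(schedule):
    # A preset expands to its cron string; anything else is assumed
    # to already be a cron expression and passes through unchanged.
    return PRESETS.get(schedule, schedule)

print(to_cron("@daily"))     # 0 0 * * *
print(to_cron("0 2 * * *"))  # 0 2 * * *
```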

i am personally undecided. It would be useful to collect a bunch of workflows that give some idea of the needed functionality; I think that would help us understand whether this should be written in-house or whether we should go with something like Airflow.

Comment by Mike Taylor [ 08/Feb/18 ]

Interesting stuff.

My first impression is not very positive. Airflow is big and complex, requires workflow authors to understand python, and seems mostly to be designed for running repeating scripts, with some dependencies, not controlling manual operations.

That is a huge issue. When I was thinking in detail about FOLIO workflow a while back, everything came down to the problem of integrating human actions into otherwise automated sequences. A solution that doesn't handle that is no solution at all – we really might just as well invoke curl from cron otherwise.

There are some underlying assumptions, like that all operations must be idempotent (can be rerun in case of problems, so can not insert an item or increment a counter).

Inability to do things like adding records is also going to be a deal-breaker.

The use of Python for writing steps doesn't bother me. It's probably the best programming language for non-programmers to use, and as shale99 implies, those steps will in any case mostly come from a set that we provide out of the box.

It sounds like if we did use Airflow, we'd probably need to either maintain a custom derivative, or get a bunch of modifications accepted back into the mainstream.

Comment by shale99 [ 08/Feb/18 ]

just a caveat to the comments below: I have spent only about half a day on Airflow, so I am not an expert and am giving my opinion based on what I have learned so far.
1. I don't think idempotency is a hard requirement. If a task cannot be retried, that is OK, and the task can be configured accordingly; you are free to do whatever you want with every task in the pipeline. Many Airflow use cases take data from A, convert it to B, and store it in C, which is very similar to what we will be doing.
2. I agree that Airflow works best with recurring automatic tasks, but it can trigger tasks manually as well.
3. It also seems quite possible to combine automatic and manual tasks in a pipeline using Airflow sensors that 'poke' to check the status of something every X minutes/hours/days. This may not be ideal, but it may not be the only option either.
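The 'poke' pattern the sensors use can be sketched in plain Python. This is only an illustration of the polling idea, not Airflow's actual sensor API, and the `check_status` stub is hypothetical:

```python
import time

def poke_until(check_status, poke_interval=1.0, timeout=5.0):
    """Poll check_status() until it returns True or the timeout expires.

    Mirrors the sensor idea: the pipeline pauses here until a human
    (or another system) has changed some status elsewhere.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check_status():
            return True
        time.sleep(poke_interval)
    return False  # timed out; a real sensor would fail or reschedule

# Stub: pretend the status flips to "done" on the third check.
calls = {"n": 0}
def check_status():
    calls["n"] += 1
    return calls["n"] >= 3

print(poke_until(check_status, poke_interval=0.01))  # True
```

In Airflow terms, a dedicated sensor would occupy a worker slot while poking, which is why a separate pool for sensors (as mentioned in a later comment) matters.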

Comment by Mike Taylor [ 08/Feb/18 ]

That sounds rather more encouraging than the impression I formed from your and Heikki's earlier comments.

Comment by shale99 [ 11/Feb/18 ]

so after a bit more digging:
Airflow can be used to run both automated and async manual workflows. (You can use sensors to create manual workflows, and even create a dedicated pool of workers for sensors only, so that they don't compete with the task workers that actually run the workflows.) Having said that, async manual workflows are not the main focus of Airflow; it is no doubt more oriented toward automated workflows.
So if most of our workflows are manual with some automated ones sprinkled in, I would probably not go with Airflow.

Generated at Thu Feb 08 23:10:24 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.