Automating tasks with R

James Laird-Smith

Data Scientist, Bank of England

Agenda

  • Why learn to automate?
  • What tools are available for automation?
  • The future?

Why learn automation?

Automation is the logical extension of coding.

Suppose you have some operation that you currently perform by hand.

Step 1: Re-write in code

  • Pain: Learn to code
  • Gain: Can then re-run at zero effort (among other things)

Step 2: Automate

  • Pain: Learn an automation tool
  • Gain: Can then re-run at zero effort even when you are not there

How can you automate?

Note

R itself is not an automation tool. Rather, it interfaces with other purpose-built tools.

  • Cron
  • Apache Airflow
  • Windows Task Scheduler
  • RStudio Connect
  • Jenkins
  • GitHub Actions
  • Many more…

R does have tools for interfacing with these.

What do we want?

What does an ideal automation tool look like?


  • Simple to use.
  • Feature rich, e.g. has triggers (or listeners).
  • Developer friendly, e.g. has an API.
  • User friendly, e.g. has a good graphical user interface (GUI).
  • Free and open source with no vendor lock-in.
  • Language agnostic? Or at least work with R?

Cron

In the beginning God created the heavens and the earth … but also the Cron utility for Unix.

  • Very old, widely used and well understood.
    • If you are running macOS or Linux, it’s likely already on your computer.
  • Minimal and easy to get started with.
  • Cron syntax is the lingua franca of automation. Other tools will accept it as input.




Cron expressions

Cron has its own syntax for job frequency and schedule.

* * * * *
| | | | |
| | | | +---- Day of the Week   (range: 0-6, 0 is Sunday)
| | | +------ Month of the Year (range: 1-12)
| | +-------- Day of the Month  (range: 1-31)
| +---------- Hour              (range: 0-23)
+------------ Minute            (range: 0-59)


crontab file
# Comment lines start with a hash sign (#)
0 0 * * * echo 'Hello midnight!'
0 0 1 1 * echo 'Hello New Year!'
0 0 25 * * echo 'Hello payday!'
0 0 * * 6 echo 'Hello Saturday!'
* * * * * Rscript -e 'print("Hello R!")'
* * * * * python -c 'print("Hello Python!")'


Cron’s syntax is richer than this. To learn more, go to crontab.guru or consult the cron specification.
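To make the five-field layout concrete, here is a minimal sketch in Python that checks whether a given moment satisfies a Cron expression. It is an illustration under stated assumptions, not how Cron itself is implemented: the helper names `parse_cron` and `matches` are made up for this example, and only `*` and single integers are supported (no ranges, lists or steps). Real Cron also treats day-of-month and day-of-week specially when both are restricted, which this sketch ignores.

```python
from datetime import datetime

# Field order of a Cron expression, left to right.
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a five-field Cron expression into a dict keyed by field name."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return dict(zip(FIELD_NAMES, fields))


def matches(expression, moment):
    """Return True if `moment` satisfies `expression`.

    Simplified (hypothetical helper): each field must be '*' or one integer.
    """
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # Cron counts Sunday as 0; isoweekday() counts Sunday as 7.
        "day_of_week": moment.isoweekday() % 7,
    }
    parsed = parse_cron(expression)
    return all(
        field == "*" or int(field) == actual[name]
        for name, field in parsed.items()
    )


new_year = datetime(2022, 1, 1, 0, 0)    # midnight on a Saturday
print(matches("0 0 * * *", new_year))    # True: it is midnight
print(matches("0 0 1 1 *", new_year))    # True: 1 January
print(matches("0 0 25 * *", new_year))   # False: not the 25th
print(matches("0 0 * * 6", new_year))    # True: Saturday
```

The four expressions are the schedules from the crontab file above, evaluated at a single moment.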



cronR

  • Originally written by Kevin Ushey, now maintained by Jan Wijffels.
  • Helpers for writing Cron expressions and writing to crontab files.
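library(cronR)

# Schedule an R one-liner to run every day at 7 AM.
cron_add(
  command = "Rscript -e 'print(\"Hello R!\")'",
  frequency = "daily",
  at = "7AM",
  dry_run = TRUE, # show the crontab entry without installing it
  ask = FALSE     # skip the interactive confirmation prompt
)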
Adding cronjob:
---------------

## cronR job
## id:   2d2a128c4688230eb141b52d86f3befa
## tags: 
## desc: 
0 7 * * * Rscript -e 'print("Hello R!")'

Apache Airflow

Airflow is a platform […] to programmatically author, schedule and monitor workflows.

  • Perhaps the most mature and well understood modern automation tool for data.
  • Fully featured and extensible.

  • Started by Airbnb.
  • Open source and with a community of developers.

  • Powerful GUI.
  • Notable for its use of Directed Acyclic Graphs (DAGs).
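As a rough illustration of why a DAG is a useful way to describe a workflow, the sketch below uses Python's standard-library `graphlib` to turn task dependencies into a valid run order. It does not use Airflow, and the task names (`extract`, `clean`, `model`, `report`) and the helper `run_order` are hypothetical, chosen only for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "model": {"clean"},
    "report": {"clean", "model"},
}


def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)  # ['extract', 'clean', 'model', 'report']

# The "acyclic" part matters: with a cycle there is no valid run order.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    run_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

A scheduler can walk the graph in this order, and because each task lists its dependencies explicitly, it also knows which downstream tasks to skip when an upstream task fails.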


Apache Airflow (3)


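from datetime import datetime

from airflow import DAG
# Airflow 2.x import path; Airflow 1.x used airflow.operators.python_operator
from airflow.operators.python import PythonOperator


def print_hello():
    # The returned value appears in the task log
    return 'Hello world from first Airflow DAG!'


# Run daily at 12:00, given as a Cron expression; catchup=False skips
# the runs that were missed between start_date and today.
dag = DAG(
    'hello_world',
    description='Hello World DAG',
    schedule_interval='0 12 * * *',
    start_date=datetime(2017, 3, 20),
    catchup=False,
)

# A single task that calls print_hello when the DAG runs
hello_operator = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag,
)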



GitHub Actions

  • Widely used for package testing as part of continuous integration and delivery (CI/CD).
  • Triggers can be repo events (like push and merge) or Cron schedules.
  • Workflows are specified in YAML.
  • Good support for R from work done by RStudio:
name: GitHub Actions Demo
on: [push]
jobs:
  Explore-GitHub-Actions:
    runs-on: ubuntu-latest
    steps:
      - run: echo "🎉 Job automatically triggered by ${{ github.event_name }}."
      - run: echo "🍏 This job's status is ${{ job.status }}."



Existing tools: Others

  • Windows Task Scheduler
    • Heavily tied into Windows.
    • Your PC always has to be on.
  • RStudio Connect
    • Good at managing R dependencies.
    • Scheduling is still very GUI-based.
  • Nothing really has the feature set of Airflow.

Existing tools: summary

Tools compared: Cron, Airflow, GitHub Actions, RStudio Connect

Easy to use 🟡
Trigger and DAG support
Good R support 🟡 ✅✅
Language agnostic 🟡
Has an API 🟡
Nice GUI ✅✅
Free & open source
Not tied to a vendor





The future

Focus on the open source technologies.


  • Cron for the simpler automations
    • Synchronise crontab from a YAML file.
  • Apache Airflow for large-scale deployments
    • R bindings to the API.
    • Create DAGs in R? Transpile them to Python?
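One way the "synchronise crontab from a YAML file" idea might look is sketched below in Python. This is a speculative illustration, not an existing tool or the author's design: the job list stands in for whatever a YAML parser would return, and the keys (`name`, `schedule`, `command`), the script names and the function `render_crontab` are all hypothetical.

```python
# Hypothetical job spec, shaped like the result of parsing a YAML file.
jobs = [
    {"name": "daily-report", "schedule": "0 7 * * *",
     "command": "Rscript report.R"},
    {"name": "weekly-backup", "schedule": "0 0 * * 6",
     "command": "Rscript backup.R"},
]


def render_crontab(jobs):
    """Render the job spec as crontab text, one comment line per job."""
    lines = []
    for job in jobs:
        # Basic sanity check: a Cron schedule has exactly five fields.
        if len(job["schedule"].split()) != 5:
            raise ValueError(f"bad schedule for {job['name']!r}")
        lines.append(f"# {job['name']}")
        lines.append(f"{job['schedule']} {job['command']}")
    return "\n".join(lines) + "\n"


crontab_text = render_crontab(jobs)
print(crontab_text, end="")
```

Keeping the spec in a version-controlled file and regenerating the crontab from it would make the schedule reviewable like any other code. The rendered text could then be written to a file and installed with `crontab <file>`.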



Thank you!

Questions?