Dask is a parallel computing library for Python. This talk is a gentle introduction to Dask, showing how you can speed up data science code on your laptop after a simple "pip install", and then use the same code to process big data on a cluster of machines. We will walk through an end-to-end data science pipeline, from ETL and exploratory analysis to machine learning model training and scoring.
We will cover:
- Example using publicly available data and single-node Python
- Pandas for data cleaning/transformation
- Scikit-learn for machine learning
- How to parallelize this workflow on a laptop and then a cluster using Dask
- Distributed model training
- Distributed inference/scoring