Topic
  Getting up to speed with Dask
  
  By: Aaron Richter
  
  Date: April 8, 2021, 6 p.m.
  
  Dask is a parallel computing library for Python people. This talk will be a gentle introduction to Dask, showing how you can improve the speed of data science code on your laptop with a simple "pip install". Then we will use the same code to process big data on a cluster of machines. We will be going through an end-to-end data science pipeline, from ETL and exploratory analysis to machine learning model training and scoring.
We will cover:
- Example using publicly available data and single-node Python 
- Pandas for data cleaning/transformation 
- Scikit-learn for machine learning 
- How to parallelize this workflow on a laptop and then a cluster using Dask 
- Distributed model training 
- Distributed inference/scoring