{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Parallelism in Python\n", "\n", "_The last 2 hours of this lecture are modified from the [Anaconda Dask tutorial](https://github.com/dask/dask-tutorial). These lectures are used under [licence](./LICENSE.txt)._ \n", "\n", "You may find various packages aren't installed on your system. We recommend you run the following in an Anaconda prompt (on Windows) or a terminal (on Mac):\n", "\n", "```\n", "conda env create -f environment.yml\n", "conda activate dask-tutorial\n", "jupyter notebook lecture.ipynb\n", "```\n", "\n", "The first section only uses builtin modules, so we can work through it while that installation completes.\n", "\n", "\n", "## What is Parallel computation?\n", "\n", "> In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:\n", ">\n", "> * A problem is broken into discrete parts that can be solved concurrently\n", "> * Each part is further broken down to a series of instructions\n", "> * Instructions from each part execute simultaneously on different processors\n", "> * An overall control/coordination mechanism is employed\n", " \n", "[_Blaise Barney, Lawrence Livermore National Laboratory_](https://computing.llnl.gov/tutorials/parallel_comp/#Whatis)\n", "\n", "### Threads and Processes\n", "\n", "To perform a piece of music, one could have a single musician playing many notes at once (for example a pianist), or many musicians playing one note each (for example a string quartet). Similarly, to perform a parallel task on a modern computer, one could have a single instance of a program do many things at once, or have many copies of a program do one thing each. In more exact language, task parallelism on a multi-processor computing system can be multi-threaded or multi-process.\n", "\n", "Each version has its own advantages, and which version is best (or whether to combine both!) depends on the nature of the problem to be solved. Threads are lightweight, which makes them cheap to create and destroy. They also share resources, including program memory, which makes communication between them cheap. However, this also requires thought to ensure that two threads don't modify the same piece of memory at the same time. Processes are larger, and thus more expensive to create and destroy. Communication between them is also more involved. However, there is no reason processes need to run on the same machine to be able to communicate, so multiprocessing works well at the scale of computer clusters, particularly if the problem being worked on divides naturally into independent parts.\n", "\n", "### Writing parallel algorithms\n", "\n", "Not all problems parallelise easily. If each step in a problem depends explicitly and nonlinearly on the one before, then it is likely to be hard to divide the workload. Conversely, some problems are \"embarrassingly parallel\", where a single set of operations needs to be performed on all the elements of an array of data. This is precisely the form of problem which is well suited to a GPU hardware accelerator.\n", "\n", "### Synchronous versus Asynchronous computation\n", "\n", "As soon as code interacts with an external process, it can operate in two different modes: a synchronous one, where the call waits each time for the operation to return its result before continuing, or an asynchronous one, in which the original process continues, perhaps checking back later to collect a result. 
Jobs which absolutely require waiting for them are called \"blocking\", while jobs which never need to be waited for are called \"non-blocking\". Modern Python 3 contains the keywords `async`, to declare a function as asynchronous, and `await`, to wait for an asynchronous call and collect its final result. A useful builtin module, `asyncio`, gives us some more control over the running of these processes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "import asyncio\n", "\n", "# fun1 is synchronous: the list comprehension and the sleep run one after the other\n", "def fun1(n, m):\n", " a = [_**2 for _ in range(m)]\n", " time.sleep(n)\n", " return a\n", " \n", "async def _fun2a(n):\n", " t = await asyncio.sleep(n)\n", " return t\n", "\n", "async def _fun2b(m):\n", " a = [_**2 for _ in range(m)]\n", " return a\n", "\n", "# fun2 is asynchronous: gather lets the sleep and the list comprehension overlap\n", "async def fun2(n, m):\n", " a = await asyncio.gather(_fun2a(n), _fun2b(m))\n", " return a[1]  # the list of squares, matching fun1's return value" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t1 = time.time()\n", "a = fun1(2.0, 1000000)\n", "print(time.time()-t1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t1 = time.time()\n", "a = await fun2(2.0, 1000000)\n", "print(time.time()-t1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Asynchronous routines are notoriously hard to get straight, and as with many issues to do with parallel programming, can be very hard to debug.\n", "\n", "### Lazy evaluation\n", "\n", "A concept directly related to asynchronous computation is [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation). In general, the fastest code is code that does not run. Computer programs frequently contain multiple intermediate values which aren't necessarily required in a final computation, so if we delay evaluating the value of an expression until it is absolutely needed to calculate a later value, we can save both memory and time. 
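Python's generator expressions are a built-in example of this idea: they describe a computation but only carry it out as values are requested. A minimal sketch (the names below are purely illustrative):\n", "\n", "```python\n", "# Nothing is squared when the generator expression is created...\n", "squares = (n * n for n in range(10**8))\n", "\n", "# ...the work happens only as values are requested\n", "first_three = [next(squares) for _ in range(3)]  # just three squares are computed\n", "```\n", "\n", "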
This philosophy is termed \"lazy\" evaluation; the opposite standpoint, in which all expressions are evaluated immediately, is called \"eager\" evaluation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from functools import total_ordering\n", "\n", "@total_ordering\n", "class Lazy_factorial:\n", " \"\"\"Example of a factorial class with lazy evaluation & caching.\"\"\"\n", " \n", " def __init__(self, n):\n", " \"\"\"Factorial class, returns n factorial when evaluated.\"\"\"\n", " \n", " self._result = None\n", " self.n = n\n", " \n", " def compute(self):\n", " if not self._result:\n", " self._result = self.factorial(self.n)\n", " return self._result\n", " \n", " def factorial(self, n):\n", " res = 1\n", " for _ in range(1, n+1):\n", " res *= _\n", " return res\n", " \n", " def __repr__(self):\n", " return f'Lazy_factorial({self.n})'\n", " \n", " def __lt__(self, other):\n", " return self.n < other.n\n", " \n", " def __eq__(self, other):\n", " return self.n == other.n\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time a=Lazy_factorial(1)\n", "%time b=Lazy_factorial(1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time a.compute()\n", "%time b.compute()\n", "%time b.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The `threading` and `multiprocessing` modules\n", "\n", "The builtin `threading` module provides basic access to allow loops and functions to be multithreaded. Let's look at an example with some simple functions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import threading\n", "import multiprocessing\n", "import numpy as np\n", "\n", "def do_nothing(n):\n", " print(f\"Run {n}.\")\n", " time.sleep(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "threads = []\n", "for i in range(10):\n", " threads.append(threading.Thread(target=do_nothing, args=(i,)))\n", " threads[-1].start()\n", "for thread in threads:\n", " thread.join()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice the rather messed up IO. This is because all the different threads are sending their output at the same time. We can fix this by using a [lock](https://en.wikipedia.org/wiki/Lock_(computer_science)) so that only one thread is allowed to write to the screen at a time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def do_nothing_with_lock(n, lock):\n", " lock.acquire()\n", " print(f\"Run {n}.\")\n", " lock.release()\n", " time.sleep(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "threads = []\n", "lock = threading.Lock()\n", "for i in range(10):\n", " threads.append(threading.Thread(target=do_nothing_with_lock, args=(i, lock)))\n", " threads[-1].start()\n", "for thread in threads:\n", " thread.join()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are actually a few locks built in to the IO system anyway, which is why the result is as tidy as it is, with the mess confined to new lines only.\n", "\n", "## The Global Interpreter Lock\n", "\n", "So far we've been speeding up a do-nothing function using `time.sleep`. What happens if we try to do something useful instead? Maybe something in `numpy`?"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arr0 = np.arange(1000000, dtype=float)\n", "arr1 = arr0.copy()\n", "\n", "def do_something_in_numpy(arr):\n", " arr[:] = np.sin(arr)\n", "\n", "%time do_something_in_numpy(arr1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arr2 = arr0.copy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "threads = []\n", "n = 2\n", "N = arr2.size//n\n", "for i in range(n):\n", " threads.append(threading.Thread(target=do_something_in_numpy,\n", " args=(arr2[i*N:(i+1)*N],)))\n", " threads[-1].start()\n", "for thread in threads:\n", " thread.join()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(arr1 == arr2).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That worked out ok! Now let's try some regular Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def do_something_in_python(n):\n", " print(f\"Run {n}.\")\n", " a = [_**2 for _ in range(1000000)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# Our serial run\n", "for i in range(10):\n", " do_something_in_python(i)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "threads = []\n", "for i in range(10):\n", " threads.append(threading.Thread(target=do_something_in_python, args=(i,)))\n", " threads[-1].start()\n", "for thread in threads:\n", " thread.join()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm, that doesn't look so good.\n", "\n", "It turns out that for thread safety, and to maximise the speed of serial code, Python implements a single, global lock (the Global Interpreter Lock, or GIL), which is held while running most pure Python code. So what's the solution? The `multiprocessing` module implements process-based parallelism for Python problems, and since these are separate processes, they each have their own GIL." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "processes = []\n", "for i in range(10):\n", " processes.append(multiprocessing.Process(target=do_something_in_python, args=(i,)))\n", " processes[-1].start()\n", "for process in processes:\n", " process.join()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fn(x):\n", " return x**3\n", "\n", "a = range(100)\n", "\n", "pool = multiprocessing.Pool(processes=4)\n", "pool.map(fn, a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will note we have to write a lot of \"boilerplate\" code to apply the builtin Python methods for parallel processing. This gets even worse if we are running on a multi-machine computing cluster, where the processes we wish to communicate with aren't all running on the same computer. The `dask` package simplifies a lot of this process. Let's start a new `conda` environment supporting the `dask` tutorial and begin the first Dask notebook.\n", "\n", "## The Dask delayed function\n", "\n", "In this section we parallelize simple for-loop style code with Dask and `dask.delayed`. 
Often, this is the only function that you will need to convert functions for use with Dask.\n", "\n", "This is a simple way to use `dask` to parallelize existing codebases or build [complex systems](https://blog.dask.org/2018/02/09/credit-models-with-dask). This will also help us to develop an understanding for later sections.\n", "\n", "**Related Documentation**\n", "\n", "* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)\n", "* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)\n", "* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)\n", "* [Delayed examples](https://examples.dask.org/delayed.html)\n", "* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we'll see in the [distributed scheduler section](http://localhost:8888/notebooks/lecture.ipynb#Dask-Distributed), Dask has several ways of executing code in parallel. We'll use the distributed scheduler by creating a `dask.distributed.Client` object. For now, this will provide us with some nice diagnostics. We'll talk about schedulers in depth later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "## n_workers could be the number of threads or processes\n", "client = Client(n_workers=4, processes=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basics\n", "\n", "First let's make some toy functions, `inc` and `add`, that sleep for a while to simulate work. We'll then time running these functions normally.\n", "\n", "In the next section we'll parallelize this code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import sleep\n", "\n", "def inc(x):\n", " sleep(1)\n", " return x + 1\n", "\n", "def add(x, y):\n", " sleep(1)\n", " return x + y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# This takes three seconds to run because we call each\n", "# function sequentially, one after the other\n", "\n", "x = inc(1)\n", "y = inc(2)\n", "z = add(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Parallelize with the `dask.delayed` decorator\n", "\n", "Those two increment calls *could* be called in parallel, because they are totally independent of one another.\n", "\n", "We'll transform the `inc` and `add` functions using the `dask.delayed` function (it can equally be applied as a decorator, hence the title of this section). When we call the delayed version by passing the arguments, exactly as before, the original function isn't actually called yet, which is why the cell execution finishes very quickly.\n", "Instead, a *delayed object* is made, which keeps track of the function to call and the arguments to pass to it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask import delayed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# This runs immediately, all it does is build a graph\n", "\n", "x = delayed(inc)(1)\n", "y = delayed(inc)(2)\n", "z = delayed(add)(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This ran immediately, since nothing has really happened yet.\n", "\n", "To get the result, call `compute`. Notice that this runs faster than the original code."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# This actually runs our computation using a local thread pool\n", "\n", "z.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What just happened?\n", "\n", "The `z` object is a lazy `Delayed` object. This object holds everything we need to compute the final result, including references to all of the functions that are required, together with their inputs and their relationships to one another. We can evaluate the result with `.compute()` as above or we can visualize the task graph for this value with `.visualize()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Look at the task graph for `z`\n", "z.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this includes the names of the functions from before, and the logical flow of the outputs of the `inc` functions to the inputs of `add`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Some questions to consider:\n", "\n", "- Why did we go from 3s to 2s? Why weren't we able to parallelize down to 1s?\n", "- What would have happened if the `inc` and `add` functions didn't include the `sleep(1)`? Would Dask still be able to speed up this code?\n", "- What if we have multiple outputs or also want to get access to x or y?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Parallelize a for loop\n", "\n", "`for` loops are one of the most common things that we want to parallelize. Use `dask.delayed` on `inc` and `sum` to parallelize the computation below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = [1, 2, 3, 4, 5, 6, 7, 8]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Sequential code\n", "\n", "results = []\n", "for x in data:\n", " y = inc(x)\n", " results.append(y)\n", " \n", "total = sum(results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Your parallel code here. See solutions.ipynb for a model answer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How does the graph visualization of the given solution compare with that of a version in which the `sum` function is used directly rather than wrapped with `delayed`? Can you explain the latter version? You might find the result of the following expression illuminating:\n", "```python\n", "delayed(inc)(1) + delayed(inc)(2)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Parallelizing for-loop code with control flow\n", "\n", "Often we want to delay only *some* functions, running a few of them immediately. This is especially helpful when those functions are fast and help us to determine what other slower functions we should call. This decision, to delay or not to delay, is usually where we need to be thoughtful when using `dask.delayed`.\n", "\n", "In the example below we iterate through a list of inputs. If that input is even then we want to call `double`. If the input is odd then we want to call `inc`. 
This `is_even` decision to call `inc` or `double` has to be made immediately (not lazily) in order for our graph-building Python code to proceed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def double(x):\n", " sleep(1)\n", " return 2 * x\n", "\n", "def is_even(x):\n", " return not x % 2\n", "\n", "data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Sequential code\n", "\n", "results = []\n", "for x in data:\n", " if is_even(x):\n", " y = double(x)\n", " else:\n", " y = inc(x)\n", " results.append(y)\n", " \n", "total = sum(results)\n", "print(total)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Your parallel code here...\n", "# TODO: parallelize the sequential code above using dask.delayed\n", "# You will need to delay some functions, but not all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Some questions to consider:\n", "\n", "- What are other examples of control flow where we can't use delayed?\n", "- What would have happened if we had delayed the evaluation of `is_even(x)` in the example above?\n", "- What are your thoughts on delaying `sum`? This function is both computational but also fast to run." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Parallelizing a Pandas Groupby Reduction\n", "\n", "In this exercise we read several CSV files and perform a groupby operation in parallel. We are given sequential code to do this and parallelize it with `dask.delayed`.\n", "\n", "The computation we will parallelize is to compute the mean departure delay per airport from some historical flight data. We will do this by using `dask.delayed` together with `pandas`. In a future section we will do this same exercise with `dask.dataframe`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data\n", "\n", "Run this code to prep some data.\n", "\n", "This downloads and extracts some historical flight data for flights out of NYC between 1990 and 2000. The data is originally from [here](http://stat-computing.org/dataexpo/2009/the-data.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d flights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Inspect data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "sorted(os.listdir(os.path.join('data', 'nycflights')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read one file with `pandas.read_csv` and compute mean departure delay" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv(os.path.join('data', 'nycflights', '1990.csv'))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What is the schema?\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What originating airports are in the data?\n", "df.Origin.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Mean departure delay per-airport for one year\n", "df.groupby('Origin').DepDelay.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequential code: Mean Departure Delay Per Airport\n", "\n", "The above cell computes the mean departure delay per-airport for one year. Here we expand that to all years using a sequential for loop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from glob import glob\n", "filenames = sorted(glob(os.path.join('data', 'nycflights', '*.csv')))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "sums = []\n", "counts = []\n", "for fn in filenames:\n", " # Read in file\n", " df = pd.read_csv(fn)\n", " \n", " # Groupby origin airport\n", " by_origin = df.groupby('Origin')\n", " \n", " # Sum of all departure delays by origin\n", " total = by_origin.DepDelay.sum()\n", " \n", " # Number of flights by origin\n", " count = by_origin.DepDelay.count()\n", " \n", " # Save the intermediates\n", " sums.append(total)\n", " counts.append(count)\n", "\n", "# Combine intermediates to get total mean-delay-per-origin\n", "total_delays = sum(sums)\n", "n_flights = sum(counts)\n", "mean = total_delays / n_flights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Parallelize the code above\n", "\n", "Use `dask.delayed` to parallelize the code above. Some extra things you will need to know.\n", "\n", "1. Methods and attribute access on delayed objects work automatically, so if you have a delayed object you can perform normal arithmetic, slicing, and method calls on it and it will produce the correct delayed calls.\n", "\n", " ```python\n", " x = delayed(np.arange)(10)\n", " y = (x + 1)[::2].sum() # everything here was delayed\n", " ```\n", "2. Calling the `.compute()` method works well when you have a single output. 
When you have multiple outputs you might want to use the `dask.compute` function:\n", "\n", " ```python\n", " >>> from dask import compute\n", " >>> x = delayed(np.arange)(10)\n", " >>> y = x ** 2\n", " >>> min_, max_ = compute(y.min(), y.max())\n", " >>> min_, max_\n", " (0, 81)\n", " ```\n", " \n", " This way Dask can share the intermediate values (like `y = x**2`)\n", " \n", "So your goal is to parallelize the code above (which has been copied below) using `dask.delayed`. You may also want to visualize a bit of the computation to see if you're doing it correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask import compute" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# copied sequential code\n", "\n", "sums = []\n", "counts = []\n", "for fn in filenames:\n", " # Read in file\n", " df = pd.read_csv(fn)\n", " \n", " # Groupby origin airport\n", " by_origin = df.groupby('Origin')\n", " \n", " # Sum of all departure delays by origin\n", " total = by_origin.DepDelay.sum()\n", " \n", " # Number of flights by origin\n", " count = by_origin.DepDelay.count()\n", " \n", " # Save the intermediates\n", " sums.append(total)\n", " counts.append(count)\n", "\n", "# Combine intermediates to get total mean-delay-per-origin\n", "total_delays = sum(sums)\n", "n_flights = sum(counts)\n", "mean = total_delays / n_flights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ensure the results still match\n", "mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Some questions to consider:\n", "\n", "- How much speedup did you get? Is this how much speedup you'd expect?\n", "- Experiment with where to call `compute`. What happens when you call it on `sums` and `counts`? What happens if you wait and call it on `mean`?\n", "- Experiment with delaying the call to `sum`. What does the graph look like if `sum` is delayed? What does the graph look like if it isn't?\n", "- Can you think of any reason why you'd want to do the reduction one way over the other?\n", "\n", "#### Learn More\n", "\n", "Visit the [Delayed documentation](https://docs.dask.org/en/latest/delayed.html). In particular, this [delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU) will reinforce the concepts you learned here and the [delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html) document collects advice on using `dask.delayed` well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Close the Client\n", "\n", "Before moving on to the next exercise, make sure to close your client or stop this kernel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bag: Parallel Lists for semi-structured data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask-bag excels in processing data that can be represented as a sequence of arbitrary inputs. We'll refer to this as \"messy\" data, because it can contain complex nested structures, missing fields, mixtures of data types, etc. 
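A couple of hypothetical records of this kind (the field names and values below are made up purely for illustration) might look like:\n", "\n", "```python\n", "# Two \"messy\" records: inconsistent fields, nested structures and mixed types\n", "records = [\n", "    {\"id\": 1, \"name\": \"Alice\", \"transactions\": [{\"amount\": 1, \"id\": 123}]},\n", "    {\"id\": \"2\", \"name\": None, \"city\": \"London\"},  # no transactions, id stored as a string\n", "]\n", "```\n", "\n", "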
The *functional* programming style fits very nicely with standard Python iteration, such as can be found in the `itertools` module.\n", "\n", "Messy data is often encountered at the beginning of data processing pipelines when large volumes of raw data are first consumed. The initial set of data might be JSON, CSV, XML, or any other format that does not enforce strict structure and datatypes.\n", "For this reason, the initial data massaging and processing is often done with Python `list`s, `dict`s, and `set`s.\n", "\n", "These core data structures are optimized for general-purpose storage and processing. Adding streaming computation with iterators/generator expressions or libraries like `itertools` or [`toolz`](https://toolz.readthedocs.io/en/latest/) let us process large volumes in a small space. If we combine this with parallel processing then we can churn through a fair amount of data.\n", "\n", "Dask.bag is a high level Dask collection to automate common workloads of this form. In a nutshell\n", "\n", " dask.bag = map, filter, toolz + parallel execution\n", " \n", "**Related Documentation**\n", "\n", "* [Bag documentation](https://docs.dask.org/en/latest/bag.html)\n", "* [Bag screencast](https://youtu.be/-qIiJ1XtSv0)\n", "* [Bag API](https://docs.dask.org/en/latest/bag-api.html)\n", "* [Bag examples](https://examples.dask.org/bag.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d accounts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we'll use the distributed scheduler. Schedulers will be explained in depth [later](http://localhost:8888/notebooks/lecture.ipynb#Dask-Distributed)." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "client = Client(n_workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create a `Bag` from a Python sequence, from files, from data on S3, etc.\n", "We demonstrate using `.take()` to show elements of the data. (Doing `.take(1)` results in a tuple with one element)\n", "\n", "Note that the data are partitioned into blocks, and there are many items per block. In the first example, the two partitions contain five elements each, and in the following two, each file is partitioned into one or more bytes blocks." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# each element is an integer\n", "import dask.bag as db\n", "b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], npartitions=2)\n", "b.take(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# each element is a text file, where each line is a JSON object\n", "# note that the compression is handled automatically\n", "import os\n", "b = db.read_text(os.path.join('data', 'accounts.*.json.gz'))\n", "b.take(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Edit sources.py to configure source locations\n", "import sources\n", "sources.bag_url" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Requires `s3fs` library\n", "# each partition is a remote CSV text file\n", "b = db.read_text(sources.bag_url,\n", " storage_options={'anon': True})\n", "b.take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Manipulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Bag` objects hold the standard functional API found in projects like the Python standard library, `toolz`, or `pyspark`, including `map`, `filter`, `groupby`, etc.\n", "\n", "Operations on `Bag` objects create new bags. Call the `.compute()` method to trigger execution, as we saw for `Delayed` objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def is_even(n):\n", " return n % 2 == 0\n", "\n", "b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])\n", "c = b.filter(is_even).map(lambda x: x ** 2)\n", "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# blocking form: wait for completion (which is very fast in this case)\n", "c.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: Accounts JSON data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've created a fake dataset of gzipped JSON data in your data directory. This is like the dataset used in the `DataFrame` example we will see later, except that it has bundled up all of the entries for each individual `id` into a single record. This is similar to data that you might collect off of a document store database or a web API.\n", "\n", "Each line is a JSON encoded dictionary with the following keys:\n", "\n", "* id: Unique identifier of the customer\n", "* name: Name of the customer\n", "* transactions: List of `transaction-id`, `amount` pairs, one for each transaction for the customer in that file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = os.path.join('data', 'accounts.*.json.gz')\n", "lines = db.read_text(filename)\n", "lines.take(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our data comes out of the file as lines of text. Notice that file decompression happened automatically. We can make this data look more reasonable by mapping the `json.loads` function onto our bag."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "js = lines.map(json.loads)\n", "# take: inspect first few elements\n", "js.take(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Basic Queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we parse our JSON data into proper Python objects (`dict`s, `list`s, etc.) we can perform more interesting queries by creating small Python functions to run on our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# filter: keep only some elements of the sequence\n", "js.filter(lambda record: record['name'] == 'Alice').take(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def count_transactions(d):\n", " return {'name': d['name'], 'count': len(d['transactions'])}\n", "\n", "# map: apply a function to each element\n", "(js.filter(lambda record: record['name'] == 'Alice')\n", " .map(count_transactions)\n", " .take(5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pluck: select a field, as from a dictionary, element[field]\n", "(js.filter(lambda record: record['name'] == 'Alice')\n", " .map(count_transactions)\n", " .pluck('count')\n", " .take(5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Average number of transactions for all of the Alice entries\n", "(js.filter(lambda record: record['name'] == 'Alice')\n", " .map(count_transactions)\n", " .pluck('count')\n", " .mean()\n", " .compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use `flatten` to de-nest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example below we see the use of `.flatten()` to flatten results. We compute the average amount for all transactions for all Alices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "js.filter(lambda record: record['name'] == 'Alice').pluck('transactions').take(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(js.filter(lambda record: record['name'] == 'Alice')\n", " .pluck('transactions')\n", " .flatten()\n", " .take(3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(js.filter(lambda record: record['name'] == 'Alice')\n", " .pluck('transactions')\n", " .flatten()\n", " .pluck('amount')\n", " .take(3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(js.filter(lambda record: record['name'] == 'Alice')\n", " .pluck('transactions')\n", " .flatten()\n", " .pluck('amount')\n", " .mean()\n", " .compute())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Groupby and Foldby" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often we want to group data by some function or key. We can do this either with the `.groupby` method, which is straightforward but forces a full shuffle of the data (expensive) or with the harder-to-use but faster `.foldby` method, which does a streaming combined groupby and reduction.\n", "\n", "* `groupby`: Shuffles data so that all items with the same key are in the same key-value pair\n", "* `foldby`: Walks through the data accumulating a result per key\n", "\n", "*Note: the full groupby is particularly bad. 
In actual workloads you would do well to use `foldby` or switch to `DataFrame`s if possible.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `groupby`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Groupby collects items in your collection so that all items with the same value under some function are collected together into a key-value pair." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = db.from_sequence(['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'])\n", "b.groupby(len).compute() # names grouped by length" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = db.from_sequence(list(range(10)))\n", "b.groupby(lambda x: x % 2).compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b.groupby(lambda x: x % 2).starmap(lambda k, v: (k, max(v))).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `foldby`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Foldby can be quite odd at first. It is similar to the following functions from other libraries:\n", "\n", "* [`toolz.reduceby`](http://toolz.readthedocs.io/en/latest/streaming-analytics.html#streaming-split-apply-combine)\n", "* [`pyspark.RDD.combineByKey`](http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/)\n", "\n", "When using `foldby` you provide \n", "\n", "1. A key function on which to group elements\n", "2. A binary operator, such as you would pass to `reduce`, that is used to perform the reduction within each group\n", "3. A combine binary operator that can combine the results of two `reduce` calls on different parts of your dataset.\n", "\n", "Your reduction must be associative. It will happen in parallel in each of the partitions of your dataset. Then all of these intermediate results will be combined by the `combine` binary operator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "is_odd = lambda x: x % 2  # grouping key: 1 for odd numbers, 0 for even\n", "b.foldby(is_odd, binop=max, combine=max).compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example with account data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We find the number of people with the same name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# Warning, this one takes a while...\n", "result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()\n", "print(sorted(result))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# This one is comparatively fast and produces the same result.\n", "from operator import add\n", "def incr(tot, _):\n", " return tot+1\n", "\n", "result = js.foldby(key='name', \n", " binop=incr, \n", " initial=0, \n", " combine=add, \n", " combine_initial=0).compute()\n", "print(sorted(result))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: compute total amount per name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to groupby (or foldby) the `name` key, then add up all of the amounts for each name.\n", "\n", "Steps\n", "\n", "1. Create a small function that, given a dictionary like \n", "\n", " {'name': 'Alice', 'transactions': [{'amount': 1, 'id': 123}, {'amount': 2, 'id': 456}]}\n", " \n", " produces the sum of the amounts, e.g. 
`3`\n", " \n", "2. Slightly change the binary operator of the `foldby` example above so that the binary operator doesn't count the number of entries, but instead accumulates the sum of the amounts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the same reasons that Pandas is often faster than pure Python, `dask.dataframe` can be faster than `dask.bag`. We will work more with DataFrames later, but from the bag point of view, they are frequently the end-point of the \"messy\" part of data ingestion—once the data can be made into a data-frame, then complex split-apply-combine logic will become much more straightforward and efficient.\n", "\n", "You can transform a bag with a simple tuple or flat dictionary structure into a `dask.dataframe` with the `to_dataframe` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df1 = js.to_dataframe()\n", "df1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This now looks like a well-defined DataFrame, and we can apply Pandas-like computations to it efficiently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a Dask DataFrame, how long does it take to do our prior computation of numbers of people with the same name? It turns out that `dask.dataframe.groupby()` beats `dask.bag.groupby()` by more than an order of magnitude; but it still cannot match `dask.bag.foldby()` for this case." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df1.groupby('name').id.count().compute().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Denormalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This DataFrame format is less-than-optimal because the `transactions` column is filled with nested data so Pandas has to revert to `object` dtype, which is quite slow in Pandas. Ideally we want to transform to a dataframe only after we have flattened our data so that each record is a single `int`, `string`, `float`, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def denormalize(record):\n", " # returns a list for every nested item, each transaction of each person\n", " return [{'id': record['id'], \n", " 'name': record['name'], \n", " 'amount': transaction['amount'], \n", " 'transaction-id': transaction['transaction-id']}\n", " for transaction in record['transactions']]\n", "\n", "transactions = js.map(denormalize).flatten()\n", "transactions.take(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = transactions.to_dataframe()\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# number of transactions per name\n", "# note that the time here includes the data load and ingestion\n", "df.groupby('name')['transaction-id'].count().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Limitations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bags provide very general computation (any Python function). This generality\n", "comes at a cost. Bags have the following known limitations\n", "\n", "1. 
Bag operations tend to be slower than array/dataframe computations in the\n", " same way that Python tends to be slower than NumPy/Pandas\n", "2. ``Bag.groupby`` is slow. You should try to use ``Bag.foldby`` if possible.\n", " Using ``Bag.foldby`` requires more thought. Even better, consider creating\n", " a normalised dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Learn More\n", "\n", "* [Bag documentation](https://docs.dask.org/en/latest/bag.html)\n", "* [Bag screencast](https://youtu.be/-qIiJ1XtSv0)\n", "* [Bag API](https://docs.dask.org/en/latest/bag-api.html)\n", "* [Bag examples](https://examples.dask.org/bag.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shutdown" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Dask array provides a parallel, larger-than-memory, n-dimensional array using blocked algorithms. Simply put: distributed Numpy.\n", "\n", "* **Parallel**: Uses all of the cores on your computer\n", "* **Larger-than-memory**: Lets you work on datasets that are larger than your available memory by breaking up your array into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.\n", "* **Blocked Algorithms**: Perform large computations by performing many smaller computations\n", "\n", "In this notebook, we'll build some understanding by implementing some blocked algorithms from scratch.\n", "We'll then use Dask Array to analyze large datasets, in parallel, using a familiar NumPy-like API.\n", "\n", "**Related Documentation**\n", "\n", "* [Array documentation](https://docs.dask.org/en/latest/array.html)\n", "* [Array screencast](https://youtu.be/9h_61hXCDuI)\n", "* [Array API](https://docs.dask.org/en/latest/array-api.html)\n", "* [Array examples](https://examples.dask.org/array.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "client = Client(n_workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blocked Algorithms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A *blocked algorithm* executes on a large dataset by breaking it up into many small blocks.\n", "\n", "For example, consider taking the sum of a billion numbers. 
We might instead break up the array into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.\n", "\n", "We achieve the intended result (one sum on one billion numbers) by computing many smaller, intermediate results (one thousand sums on one million numbers each, followed by another sum of a thousand numbers).\n", "\n", "We do exactly this with Python and NumPy in the following example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load data with h5py\n", "# this creates a pointer to the data, but does not actually load\n", "import h5py\n", "import os\n", "f = h5py.File(os.path.join('data', 'random.hdf5'), mode='r')\n", "dset = f['/x']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Compute sum using blocked algorithm**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before using dask, let's consider the concept of blocked algorithms. We can compute the sum of a large number of elements by loading them chunk-by-chunk, and keeping a running total.\n", "\n", "Here we compute the sum of this large array on disk by \n", "\n", "1. Computing the sum of each 1,000,000 sized chunk of the array\n", "2. Computing the sum of the 1,000 intermediate sums\n", "\n", "Note that this is a sequential process in the notebook kernel, both the loading and summing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute sum of large array, one million numbers at a time\n", "sums = []\n", "for i in range(0, 1000000000, 1000000):\n", " chunk = dset[i: i + 1000000] # pull out numpy array\n", " sums.append(chunk.sum())\n", "\n", "total = sum(sums)\n", "print(total)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Compute the mean using a blocked algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've seen the simple example above, try a slightly more complicated problem: compute the mean of the array, assuming for a moment that we don't happen to already know how many elements are in the data. You can do this by changing the code above with the following alterations:\n", "\n", "1. Compute the sum of each block\n", "2. Compute the length of each block\n", "3. Compute the sum of the 1,000 intermediate sums and the sum of the 1,000 intermediate lengths and divide one by the other\n", "\n", "This approach is overkill for our case but does nicely generalize if we don't know the size of the array or individual blocks beforehand." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the mean of the array. A model answer is in solutions.ipynb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dask.array` contains these algorithms\n", "--------------------------------------------\n", "\n", "Dask.array is a NumPy-like library that does these kinds of tricks to operate on large datasets that don't fit into memory. It extends beyond the linear problems discussed above to full N-Dimensional algorithms and a decent subset of the NumPy interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Create `dask.array` object**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create a `dask.array` `Array` object with the `da.from_array` function. This function accepts\n", "\n", "1. `data`: Any object that supports NumPy slicing, like `dset`\n", "2. 
`chunks`: A chunk size to tell us how to block up our array, like `(1000000,)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.array as da\n", "x = da.from_array(dset, chunks=(1000000,))\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Manipulate a `dask.array` object as you would a numpy array**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have an `Array` we can perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc.\n", "\n", "The interface is familiar, but the actual work is different. `dask_array.sum()` does not do the same thing as `numpy_array.sum()`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What's the difference?**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`dask_array.sum()` builds an expression of the computation. It does not do the computation yet. `numpy_array.sum()` computes the sum immediately." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Why the difference?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask arrays are split into chunks, and computations are run on each chunk explicitly. If the desired answer comes from a small slice of the entire dataset, running the computation over all data would be wasteful of CPU and memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = x.sum()\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Compute result**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask.array objects are lazily evaluated. Operations like `.sum` build up a graph of blocked tasks to execute. \n", "\n", "We ask for the final result with a call to `.compute()`. This triggers the actual computation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Compute the mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the variance, std, etc. This should be a small change to the example above.\n", "\n", "Look at what other operations you can do with the Jupyter notebook's tab-completion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does this match your result from before?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance and Parallelism\n", "-------------------------------\n", "\n", "In our first examples we used `for` loops to walk through the array one block at a time. For simple operations like `sum` this is optimal. However, for complex operations we may want to traverse through the array differently. In particular we may want the following:\n", "\n", "1. Use multiple cores in parallel\n", "2. Chain operations on a single block before moving on to the next one\n", "\n", "Dask.array translates your array operations into a graph of inter-related tasks with data dependencies between them. Dask then executes this graph in parallel with multiple threads. We'll discuss more about this in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. 
Construct a 20000x20000 array of normally distributed random values broken up into 1000x1000 sized chunks\n", "2. Take the mean along one axis\n", "3. Take every 100th element" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import dask.array as da\n", "\n", "x = da.random.normal(10, 0.1, size=(20000, 20000), # 400 million element array \n", " chunks=(1000, 1000)) # Cut into 1000x1000 sized chunks\n", "y = x.mean(axis=0)[::100] # Perform NumPy-style operations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x.nbytes / 1e9 # Gigabytes of the input processed lazily" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "y.compute() # Time to compute the result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Performance comparison\n", "---------------------------\n", "\n", "The following experiment was performed on a powerful personal laptop. Your performance may vary. If you attempt the NumPy version then please ensure that you have more than 4GB of main memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NumPy: 19s, Needs gigabytes of memory**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "import numpy as np\n", "\n", "%%time \n", "x = np.random.normal(10, 0.1, size=(20000, 20000)) \n", "y = x.mean(axis=0)[::100] \n", "y\n", "\n", "CPU times: user 19.6 s, sys: 160 ms, total: 19.8 s\n", "Wall time: 19.7 s\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Dask Array: 4s, Needs megabytes of memory**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "import dask.array as da\n", "\n", "%%time\n", "x = da.random.normal(10, 0.1, size=(20000, 20000), chunks=(1000, 1000))\n", "y = x.mean(axis=0)[::100] \n", "y.compute() \n", "\n", "CPU times: user 29.4 s, sys: 1.07 s, total: 30.5 s\n", "Wall time: 4.01 s\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Discussion**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the Dask array computation ran in 4 seconds, but used 29.4 seconds of user CPU time. The numpy computation ran in 19.7 seconds and used 19.6 seconds of user CPU time.\n", "\n", "Dask finished faster, but used more total CPU time, because it was able to transparently parallelize the computation thanks to the chunk size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Questions*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* What happens if the dask chunks=(20000,20000)?\n", " * Will the computation run in 4 seconds?\n", " * How much memory will be used?\n", "* What happens if the dask chunks=(25,25)?\n", " * What happens to CPU and memory?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Meteorological data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is 2GB of somewhat artificial weather data in HDF5 files in `data/weather-big/*.hdf5`. We'll use the `h5py` library to interact with this data and `dask.array` to compute on it.\n", "\n", "Our goal is to visualize the average temperature on the surface of the Earth for this month. This will require a mean over all of this data. We'll do this in the following steps\n", "\n", "1. Create `h5py.Dataset` objects for each of the days of data on disk (`dsets`)\n", "2. Wrap these with `da.from_array` calls \n", "3. 
Stack these datasets along time with a call to `da.stack`\n", "4. Compute the mean along the newly stacked time axis with the `.mean()` method\n", "5. Visualize the result with `matplotlib.pyplot.imshow`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d weather" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import h5py\n", "from glob import glob\n", "import os\n", "\n", "filenames = sorted(glob(os.path.join('data', 'weather-big', '*.hdf5')))\n", "dsets = [h5py.File(filename, mode='r')['/t2m'] for filename in filenames]\n", "dsets[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dsets[0][:5, :5] # Slicing into h5py.Dataset object gives a numpy array" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "fig = plt.figure(figsize=(16, 8))\n", "plt.imshow(dsets[0][::4, ::4], cmap='RdBu_r');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Integrate with `dask.array`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a list of `dask.array` objects out of your list of `h5py.Dataset` objects using the `da.from_array` function with a chunk size of `(500, 500)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "arrays = [da.from_array(dset, chunks=(500, 500)) for dset in dsets]\n", "arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Stack this list of `dask.array` objects into a single `dask.array` object with `da.stack`**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stack these along the first axis so that the shape of the resulting array is `(31, 5760, 11520)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = da.stack(arrays, axis=0)\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plot the mean of this array along the time (`0th`) axis**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "# complete the following:\n", "fig = plt.figure(figsize=(16, 8))\n", "plt.imshow(..., cmap='RdBu_r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plot the difference of the first day from the mean**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code here.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Subsample and store" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above exercise the result of our computation is small, so we can call `compute` safely. Sometimes our result is still too large to fit into memory and we want to save it to disk. In these cases you can use one of the following two functions\n", "\n", "1. `da.store`: Store dask.array into any object that supports numpy setitem syntax, e.g.\n", "\n", " f = h5py.File('myfile.hdf5')\n", " output = f.create_dataset(shape=..., dtype=...)\n", " \n", " da.store(my_dask_array, output)\n", " \n", "2. 
`da.to_hdf5`: A specialized function that creates and stores a `dask.array` object into an `HDF5` file.\n", "\n", " da.to_hdf5('data/myfile.hdf5', '/output', my_dask_array)\n", " \n", "The task in this exercise is to **use numpy step slicing to subsample the full dataset by a factor of two in both the latitude and longitude direction and then store this result to disk** using one of the functions listed above.\n", "\n", "As a reminder, Python slicing takes three elements\n", "\n", " start:stop:step\n", "\n", " >>> L = [1, 2, 3, 4, 5, 6, 7]\n", " >>> L[::3]\n", " [1, 4, 7]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Lennard-Jones potential" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [Lennard-Jones](https://en.wikipedia.org/wiki/Lennard-Jones_potential) is used in partical simuluations in physics, chemistry and engineering. It is highly parallelizable.\n", "\n", "First, we'll run and profile the Numpy version on 7,000 particles." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# make a random collection of particles\n", "def make_cluster(natoms, radius=40, seed=1981):\n", " np.random.seed(seed)\n", " cluster = np.random.normal(0, radius, (natoms,3))-0.5\n", " return cluster\n", "\n", "def lj(r2):\n", " sr6 = (1./r2)**3\n", " pot = 4.*(sr6*sr6 - sr6)\n", " return pot\n", "\n", "# build the matrix of distances\n", "def distances(cluster):\n", " diff = cluster[:, np.newaxis, :] - cluster[np.newaxis, :, :]\n", " mat = (diff*diff).sum(-1)\n", " return mat\n", "\n", "# the lj function is evaluated over the upper traingle\n", "# after removing distances near zero\n", "def potential(cluster):\n", " d2 = distances(cluster)\n", " dtri = np.triu(d2)\n", " energy = lj(dtri[dtri > 1e-6]).sum()\n", " return energy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cluster = make_cluster(int(7e3), radius=500)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time potential(cluster)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the most time consuming function is `distances`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this would open in another browser tab\n", "# %load_ext snakeviz\n", "# %snakeviz potential(cluster)\n", "\n", "# alternative simple version given text results in this tab\n", "%prun -s tottime potential(cluster)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dask version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's the Dask version. Only the `potential` function needs to be rewritten to best utilize Dask.\n", "\n", "Note that `da.nansum` has been used over the full $NxN$ distance matrix to improve parallel efficiency." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.array as da\n", "\n", "# compute the potential on the entire\n", "# matrix of distances and ignore division by zero\n", "def potential_dask(cluster):\n", " d2 = distances(cluster)\n", " energy = da.nansum(lj(d2))/2.\n", " return energy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's convert the NumPy array to a Dask array. 
Since the entire NumPy array fits in memory it is more computationally efficient to chunk the array by number of CPU cores." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from os import cpu_count\n", "\n", "dcluster = da.from_array(cluster, chunks=cluster.shape[0]//cpu_count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step should scale quite well with number of cores. The warnings are complaining about dividing by zero, which is why we used `da.nansum` in `potential_dask`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "e = potential_dask(dcluster)\n", "%time e.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Limitations\n", "-----------\n", "\n", "Dask Array does not implement the entire numpy interface. Users expecting this\n", "will be disappointed. Notably Dask Array has the following failings:\n", "\n", "1. Dask does not implement all of ``np.linalg``. This has been done by a\n", " number of excellent BLAS/LAPACK implementations and is the focus of\n", " numerous ongoing academic research projects.\n", "2. Dask Array does not support some operations where the resulting shape\n", " depends on the values of the array. For those that it does support\n", " (for example, masking one Dask Array with another boolean mask),\n", " the chunk sizes will be unknown, which may cause issues with other\n", " operations that need to know the chunk sizes.\n", "3. Dask Array does not attempt operations like ``sort`` which are notoriously\n", " difficult to do in parallel and are of somewhat diminished value on very\n", " large data (you rarely actually need a full sort).\n", " Often we include parallel-friendly alternatives like ``topk``.\n", "4. Dask development is driven by immediate need, and so many lesser used\n", " functions, like ``np.sometrue`` have not been implemented purely out of\n", " laziness. These would make excellent community contributions.\n", " \n", "* [Array documentation](https://docs.dask.org/en/latest/array.html)\n", "* [Array screencast](https://youtu.be/9h_61hXCDuI)\n", "* [Array API](https://docs.dask.org/en/latest/array-api.html)\n", "* [Array examples](https://examples.dask.org/array.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask DataFrames\n", "\n", "We finished the Dask Delayed Chapter by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`. In this section we use `dask.dataframe` to automatically build similiar computations, for the common case of tabular computations. Dask dataframes look and feel like Pandas dataframes but they run on the same infrastructure that powers `dask.delayed`.\n", "\n", "In this notebook we use the same airline data as before, but now rather than write for-loops we let `dask.dataframe` construct our computations for us. The `dask.dataframe.read_csv` function can take a globstring like `\"data/nycflights/*.csv\"` and build parallel computations on all of our data at once.\n", "\n", "### When to use `dask.dataframe`\n", "\n", "Pandas is great for tabular datasets that fit in memory. Dask becomes useful when the dataset you want to analyze is larger than your machine's RAM. 
The demo dataset we're working with is only about 200MB, so that you can download it in a reasonable time, but `dask.dataframe` will scale to datasets much larger than memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a large subset of the Pandas `DataFrame`. One Dask `DataFrame` is comprised of many in-memory pandas `DataFrames` separated along the index. One operation on a Dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.\n", "\n", "**Related Documentation**\n", "\n", "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", "* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)\n", "\n", "**Main Take-aways**\n", "\n", "1. Dask DataFrame should be familiar to Pandas users\n", "2. The partitioning of dataframes is important for efficient execution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d flights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "client = Client(n_workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create artifical data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from prep import accounts_csvs\n", "accounts_csvs()\n", "\n", "import os\n", "import dask\n", "filename = os.path.join('data', 'accounts.*.csv')\n", "filename" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Filename includes a glob pattern `*`, so all files in the path matching that pattern will be read into the same Dask DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "df = dd.read_csv(filename)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load and count number of rows\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happened here?\n", "- Dask investigated the input path and found that there are three matching files \n", "- a set of jobs was intelligently created for each chunk - one per original CSV file in this case\n", "- each file was loaded into a pandas dataframe, had `len()` applied to it\n", "- the subtotals were combined to give you the final grand total." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Real Data\n", "\n", "Lets try this with an extract of flights in the USA across several years. This data is specific to flights out of the three airports in the New York City area." 
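, "\n", "\n", "Before we load it, here is a rough, plain-pandas sketch of the per-file work that `dask.dataframe` automated for the accounts example above (illustrative only; Dask actually builds a task graph rather than running a Python loop):\n", "\n", "```python\n", "import os\n", "from glob import glob\n", "import pandas as pd\n", "\n", "# one pandas job per CSV file, then combine the per-file subtotals\n", "filenames = sorted(glob(os.path.join('data', 'accounts.*.csv')))\n", "subtotals = [len(pd.read_csv(fn)) for fn in filenames]\n", "total_rows = sum(subtotals)\n", "```"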
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n", " parse_dates={'Date': [0, 1, 2]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the respresentation of the dataframe object contains no data - Dask has just done enough to read the start of the first file, and infer the column names and dtypes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can view the start and end of the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "df.tail() # this fails" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What just happened?\n", "\n", "Unlike `pandas.read_csv` which reads in the entire file before inferring datatypes, `dask.dataframe.read_csv` only reads in a sample from the beginning of the file (or first file if using a glob). These inferred datatypes are then enforced when reading all partitions.\n", "\n", "In this case, the datatypes inferred in the sample are incorrect. The first `n` rows have no value for `CRSElapsedTime` (which pandas infers as a `float`), and later on turn out to be strings (`object` dtype). Note that Dask gives an informative error message about the mismatch. When this happens you have a few options:\n", "\n", "- Specify dtypes directly using the `dtype` keyword. This is the recommended solution, as it's the least error prone (better to be explicit than implicit) and also the most performant.\n", "- Increase the size of the `sample` keyword (in bytes)\n", "- Use `assume_missing` to make `dask` assume that columns inferred to be `int` (which don't allow missing values) are actually floats (which do allow missing values). In our particular case this doesn't apply.\n", "\n", "In our case we'll use the first option and directly specify the `dtypes` of the offending columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n", " parse_dates={'Date': [0, 1, 2]},\n", " dtype={'TailNum': str,\n", " 'CRSElapsedTime': float,\n", " 'Cancelled': bool})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.tail() # now works" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computations with `dask.dataframe`\n", "\n", "We compute the maximum of the `DepDelay` column. With just pandas, we would loop over each file to find the individual maximums, then find the final maximum over all the individual maximums\n", "\n", "```python\n", "maxes = []\n", "for fn in filenames:\n", " df = pd.read_csv(fn)\n", " maxes.append(df.DepDelay.max())\n", " \n", "final_max = max(maxes)\n", "```\n", "\n", "We could wrap that `pd.read_csv` with `dask.delayed` so that it runs in parallel. Regardless, we're still having to think about loops, intermediate results (one per file) and the final reduction (`max` of the intermediate maxes). 
This is just noise around the real task, which pandas solves with\n", "\n", "```python\n", "df = pd.read_csv(filename, dtype=dtype)\n", "df.DepDelay.max()\n", "```\n", "\n", "`dask.dataframe` lets us write pandas-like code, that operates on larger than memory datasets in parallel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df.DepDelay.max().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This writes the delayed computation for us and then runs it. \n", "\n", "Some things to note:\n", "\n", "1. As with `dask.delayed`, we need to call `.compute()` when we're done. Up until this point everything is lazy.\n", "2. Dask will delete intermediate results (like the full pandas dataframe for each file) as soon as possible.\n", " - This lets us handle datasets that are larger than memory\n", " - This means that repeated computations will have to load all of the data in each time (run the code above again, is it faster or slower than you would expect?)\n", " \n", "As with `Delayed` objects, you can view the underlying task graph using the `.visualize` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# notice the parallelism\n", "df.DepDelay.max().visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "In this section we do a few `dask.dataframe` computations. If you are comfortable with Pandas then these should be familiar. You will have to think about when to call `compute`.\n", "\n", "#### 1.) How many rows are in our dataset?\n", "\n", "If you aren't familiar with pandas, how would you check how many records are in a list of tuples?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.) In total, how many non-canceled flights were taken?\n", "\n", "With pandas, you would use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.) In total, how many non-cancelled flights were taken from each airport?\n", "\n", "*Hint*: use [`df.groupby`](https://pandas.pydata.org/pandas-docs/stable/groupby.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.) What was the average departure delay from each airport?\n", "\n", "Note, this is the same computation you did in the previous notebook (is this approach faster or slower?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.) What day of the week has the worst average departure delay?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sharing Intermediate Results\n", "\n", "When computing all of the above, we sometimes did the same operation more than once. 
For most operations, `dask.dataframe` hashes the arguments, allowing duplicate computations to be shared, and only computed once.\n", "\n", "For example, lets compute the mean and standard deviation for departure delay of all non-canceled flights. Since dask operations are lazy, those values aren't the final results yet. They're just the recipe required to get the result.\n", "\n", "If we compute them with two calls to compute, there is no sharing of intermediate computations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "non_cancelled = df[~df.Cancelled]\n", "mean_delay = non_cancelled.DepDelay.mean()\n", "std_delay = non_cancelled.DepDelay.std()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "mean_delay_res = mean_delay.compute()\n", "std_delay_res = std_delay.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But lets try by passing both to a single `compute` call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "mean_delay_res, std_delay_res = dask.compute(mean_delay, std_delay)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `dask.compute` takes roughly 1/2 the time. This is because the task graphs for both results are merged when calling `dask.compute`, allowing shared operations to only be done once instead of twice. In particular, using `dask.compute` only does the following once:\n", "\n", "- the calls to `read_csv`\n", "- the filter (`df[~df.Cancelled]`)\n", "- some of the necessary reductions (`sum`, `count`)\n", "\n", "To see what the merged task graphs between multiple results look like (and what's shared), you can use the `dask.visualize` function (we might want to use `filename='graph.pdf'` to zoom in on the graph better):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dask.visualize(mean_delay, std_delay)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How does this compare to Pandas?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is more mature and fully featured than `dask.dataframe`. If your data fits in memory then you should use Pandas. The `dask.dataframe` module gives you a limited `pandas` experience when you operate on datasets that don't fit comfortably in memory.\n", "\n", "During this tutorial we provide a small dataset consisting of a few CSV files. This dataset is 45MB on disk that expands to about 400MB in memory. This dataset is small enough that you would normally use Pandas.\n", "\n", "We've chosen this size so that exercises finish quickly. Dask.dataframe only really becomes meaningful for problems significantly larger than this, when Pandas breaks with the dreaded \n", "\n", " MemoryError: ...\n", " \n", "Furthermore, the distributed scheduler allows the same dataframe expressions to be executed across a cluster. To enable massive \"big data\" processing, one could execute data ingestion functions such as `read_csv`, where the data is held on storage accessible to every worker node (e.g., amazon's S3), and because most operations begin by selecting only some columns, transforming and filtering the data, only relatively small amounts of data need to be communicated between the machines.\n", "\n", "Dask.dataframe operations use `pandas` operations internally. 
Generally they run at about the same speed except in the following two cases:\n", "\n", "1. Dask introduces a bit of overhead, around 1ms per task. This is usually negligible.\n", "2. When Pandas releases the GIL (coming to `groupby` in the next version) `dask.dataframe` can call several pandas operations in parallel within a process, increasing speed somewhat proportional to the number of cores. For operations which don't release the GIL, multiple processes would be needed to get the same speedup." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dask DataFrame Data Model\n", "\n", "For the most part, a Dask DataFrame feels like a pandas DataFrame.\n", "So far, the biggest difference we've seen is that Dask operations are lazy; they build up a task graph instead of executing immediately (more details coming in [Schedulers](05_distributed.ipynb)).\n", "This lets Dask do operations in parallel and out of core.\n", "\n", "In [Dask Arrays](03_array.ipynb), we saw that a `dask.array` was composed of many NumPy arrays, chunked along one or more dimensions.\n", "It's similar for `dask.dataframe`: a Dask DataFrame is composed of many pandas DataFrames. For `dask.dataframe` the chunking happens only along the index.\n", "\n", "\n", "\n", "We call each chunk a *partition*, and the upper / lower bounds are *divisions*.\n", "Dask *can* store information about the divisions. For now, partitions come up when you write custom functions to apply to Dask DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting `CRSDepTime` to a timestamp\n", "\n", "This dataset stores timestamps as `HHMM`, which are read in as integers in `read_csv`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crs_dep_time = df.CRSDepTime.head(10)\n", "crs_dep_time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To convert these to timestamps of scheduled departure time, we need to convert these integers into `pd.Timedelta` objects, and then combine them with the `Date` column.\n", "\n", "In pandas we'd do this using the `pd.to_timedelta` function, and a bit of arithmetic:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Get the first 10 dates to complement our `crs_dep_time`\n", "date = df.Date.head(10)\n", "\n", "# Get hours as an integer, convert to a timedelta\n", "hours = crs_dep_time // 100\n", "hours_timedelta = pd.to_timedelta(hours, unit='h')\n", "\n", "# Get minutes as an integer, convert to a timedelta\n", "minutes = crs_dep_time % 100\n", "minutes_timedelta = pd.to_timedelta(minutes, unit='m')\n", "\n", "# Apply the timedeltas to offset the dates by the departure time\n", "departure_timestamp = date + hours_timedelta + minutes_timedelta\n", "departure_timestamp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Custom code and Dask Dataframe\n", "\n", "We could swap out `pd.to_timedelta` for `dd.to_timedelta` and do the same operations on the entire dask DataFrame. But let's say that Dask hadn't implemented a `dd.to_timedelta` that works on Dask DataFrames. 
What would you do then?\n", "\n", "`dask.dataframe` provides a few methods to make applying custom functions to Dask DataFrames easier:\n", "\n", "- [`map_partitions`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions)\n", "- [`map_overlap`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_overlap)\n", "- [`reduction`](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.reduction)\n", "\n", "Here we'll just be discussing `map_partitions`, which we can use to implement `to_timedelta` on our own:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Look at the docs for `map_partitions`\n", "\n", "help(df.CRSDepTime.map_partitions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basic idea is to apply a function that operates on a DataFrame to each partition.\n", "In this case, we'll apply `pd.to_timedelta`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hours = df.CRSDepTime // 100\n", "# hours_timedelta = pd.to_timedelta(hours, unit='h')\n", "hours_timedelta = hours.map_partitions(pd.to_timedelta, unit='h')\n", "\n", "minutes = df.CRSDepTime % 100\n", "# minutes_timedelta = pd.to_timedelta(minutes, unit='m')\n", "minutes_timedelta = minutes.map_partitions(pd.to_timedelta, unit='m')\n", "\n", "departure_timestamp = df.Date + hours_timedelta + minutes_timedelta" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "departure_timestamp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "departure_timestamp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise: Rewrite above to use a single call to `map_partitions`\n", "\n", "This will be slightly more efficient than two separate calls, as it reduces the number of tasks in the graph." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compute_departure_timestamp(df):\n", " pass # TODO: implement this" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "departure_timestamp = df.map_partitions(compute_departure_timestamp)\n", "\n", "departure_timestamp.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Limitations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What doesn't work?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask.dataframe only covers a small but well-used portion of the Pandas API.\n", "This limitation is for two reasons:\n", "\n", "1. The Pandas API is *huge*\n", "2. Some operations are genuinely hard to do in parallel (e.g. sort)\n", "\n", "Additionally, some important operations like ``set_index`` work, but are slower\n", "than in Pandas because they include substantial shuffling of data, and may write out to disk." 
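, "\n", "\n", "As a rough sketch of that `set_index` trade-off (using the flights dataframe from above; the timings and the benefit will vary with your data and machine):\n", "\n", "```python\n", "# one-off shuffle: every row moves to the partition that owns its key\n", "df_by_origin = df.set_index('Origin')\n", "\n", "# after the shuffle, index lookups only touch the relevant partitions\n", "df_by_origin.loc['JFK'].head()\n", "```"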
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Learn More\n", "\n", "\n", "* [DataFrame documentation](https://docs.dask.org/en/latest/dataframe.html)\n", "* [DataFrame screencast](https://youtu.be/AT2XtFehFSQ)\n", "* [DataFrame API](https://docs.dask.org/en/latest/dataframe-api.html)\n", "* [DataFrame examples](https://examples.dask.org/dataframe.html)\n", "* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask Distributed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we have seen so far, Dask allows you to simply construct graphs of tasks with dependencies, as well as have graphs created automatically for you using functional, Numpy or Pandas syntax on data collections. None of this would be very useful, if there weren't also a way to execute these graphs, in a parallel and memory-aware way. So far we have been calling `thing.compute()` or `dask.compute(thing)` without worrying what this entails. Now we will discuss the options available for that execution, and in particular, the distributed scheduler, which comes with additional functionality.\n", "\n", "Dask comes with four available schedulers:\n", "- \"threaded\": a scheduler backed by a thread pool\n", "- \"processes\": a scheduler backed by a process pool\n", "- \"single-threaded\" (aka \"sync\"): a synchronous scheduler, good for debugging\n", "- distributed: a distributed scheduler for executing graphs on multiple machines, see below.\n", "\n", "To select one of these for computation, you can specify at the time of asking for a result, e.g.,\n", "```python\n", "myvalue.compute(scheduler=\"single-threaded\") # for debugging\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or set the current default, either temporarily or globally\n", "```python\n", "with dask.config.set(scheduler='processes'):\n", " # set temporarily for this block only\n", " myvalue.compute()\n", "\n", "dask.config.set(scheduler='processes')\n", "# set until further notice\n", "```\n", "\n", "Lets see the difference for the familiar case of the flights data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d flights" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "import os\n", "df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),\n", " parse_dates={'Date': [0, 1, 2]},\n", " dtype={'TailNum': object,\n", " 'CRSElapsedTime': float,\n", " 'Cancelled': bool})\n", "\n", "# Maximum average non-cancelled delay grouped by Airport\n", "largest_delay = df[~df.Cancelled].groupby('Origin').DepDelay.mean().max()\n", "largest_delay" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# each of the following gives the same results (you can check!)\n", "# any surprises?\n", "import time\n", "for sch in ['threading', 'processes', 'sync']:\n", " t0 = time.time()\n", " r = largest_delay.compute(scheduler=sch)\n", " t1 = time.time()\n", " print(f\"{sch:>10}, {t1 - t0:0.4f} s; result, {r:0.2f} hours\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some Questions to Consider:\n", "\n", "- How much speedup is possible for this task (hint, look at the graph).\n", "- Given how many cores are on this machine, how 
much faster could the parallel schedulers be than the single-threaded scheduler.\n", "- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?\n", "- Why is the multiprocessing scheduler so much slower here?\n", "\n", "The `threaded` scheduler is a fine choice for working with large datasets out-of-core on a single machine, as long as the functions being used release the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) most of the time. NumPy and pandas release the GIL in most places, so the `threaded` scheduler is the default for `dask.array` and `dask.dataframe`. The distributed scheduler, perhaps with `processes=False`, will also work well for these workloads on a single machine.\n", "\n", "For workloads that do hold the GIL, as is common with `dask.bag` and custom code wrapped with `dask.delayed`, we recommend using the distributed scheduler, even on a single machine. Generally speaking, it's more intelligent and provides better diagnostics than the `processes` scheduler.\n", "\n", "https://docs.dask.org/en/latest/scheduling.html provides some additional details on choosing a scheduler.\n", "\n", "For scaling out work across a cluster, the distributed scheduler is required." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making a cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `dask.distributed` system is composed of a single centralized scheduler and one or more worker processes. [Deploying](https://docs.dask.org/en/latest/setup.html) a remote Dask cluster involves some additional effort. But doing things locally is just involves creating a `Client` object, which lets you interact with the \"cluster\" (local threads or processes on your machine). For more information see [here](https://docs.dask.org/en/latest/setup/single-distributed.html). \n", "\n", "Note that `Client()` takes a lot of optional [arguments](https://distributed.dask.org/en/latest/local-cluster.html#api), to configure the number of processes/threads, memory limits and other" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "# Setup a local cluster.\n", "# By default this sets up 1 worker per core\n", "client = Client()\n", "client.cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be sure to click the `Dashboard` link to open up the diagnostics dashboard.\n", "\n", "### Executing with the distributed client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider some trivial calculation, such as we've used before, where we have added sleep statements in order to simulate real work being done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask import delayed\n", "import time\n", "\n", "def inc(x):\n", " time.sleep(5)\n", " return x + 1\n", "\n", "def dec(x):\n", " time.sleep(3)\n", " return x - 1\n", "\n", "def add(x, y):\n", " time.sleep(7)\n", " return x + y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, creating a `Client` makes it the default scheduler. Any calls to `.compute` will use the cluster your `client` is attached to, unless you specify otherwise, as above." 
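, "\n", "\n", "A minimal sketch of the two options, using the `total` object constructed in the next cell (the second call is only needed if you really want to bypass the client for a particular computation):\n", "\n", "```python\n", "total.compute()                     # runs on the distributed cluster created above\n", "total.compute(scheduler='threads')  # explicitly falls back to the local thread pool\n", "```"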
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = delayed(inc)(1)\n", "y = delayed(dec)(2)\n", "total = delayed(add)(x, y)\n", "total.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tasks will appear in the web UI as they are processed by the cluster and, eventually, a result will be printed as output of the cell above. Note that the kernel is blocked while waiting for the result. The resulting tasks block graph might look something like below. Hovering over each block gives which function it related to, and how long it took to execute. ![this](images/tasks.png)\n", "\n", "You can also see a simplified version of the graph being executed on Graph pane of the dashboard, so long as the calculation is in-flight." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's return to the flights computation from before, and see what happens on the dashboard (you may wish to have both the notebook and dashboard side-by-side). How did does this perform compared to before?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time largest_delay.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this particular case, this should be as fast or faster than the best case, threading, above. Why do you suppose this is? You should start your reading [here](https://distributed.dask.org/en/latest/index.html#architecture), and in particular note that the distributed scheduler was a complete rewrite with more intelligence around sharing of intermediate results and which tasks run on which worker. This will result in better performance in *some* cases, but still larger latency and overhead compared to the threaded scheduler, so there will be rare cases where it performs worse. Fortunately, the dashboard now gives us a lot more [diagnostic information](https://distributed.dask.org/en/latest/diagnosing-performance.html). Look at the Profile page of the dashboard to find out what takes the biggest fraction of CPU time for the computation we just performed?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If all you want to do is execute computations created using delayed, or run calculations based on the higher-level data collections, then that is about all you need to know to scale your work up to cluster scale. However, there is more detail to know about the distributed scheduler that will help with efficient usage. See the chapter Distributed, Advanced." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise\n", "\n", "Run the following computations while looking at the diagnostics page. In each case what is taking the most time?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of flights\n", "_ = len(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of non-cancelled flights\n", "_ = len(df[~df.Cancelled])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of non-cancelled flights per-airport\n", "_ = df[~df.Cancelled].groupby('Origin').Origin.count().compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Average departure delay from each airport?\n", "_ = df[~df.Cancelled].groupby('Origin').DepDelay.mean().compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Average departure delay per day-of-week\n", "_ = df.groupby(df.Date.dt.dayofweek).DepDelay.mean().compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask Distributed, Advanced" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distributed futures" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "c = Client(n_workers=4)\n", "c.cluster" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In chapter Distributed, we showed that executing a calculation (created using delayed) with the distributed executor is identical to any other executor. However, we now have access to additional functionality, and control over what data is held in memory.\n", "\n", "To begin, the `futures` interface (derived from the built-in `concurrent.futures`) allow map-reduce like functionality. We can submit individual functions for evaluation with one set of inputs, or evaluated over a sequence of inputs with `submit()` and `map()`. Notice that the call returns immediately, giving one or more *futures*, whose status begins as \"pending\" and later becomes \"finished\". There is no blocking of the local Python session." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the simplest example of `submit` in action:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def inc(x):\n", " return x + 1\n", "\n", "fut = c.submit(inc, 1)\n", "fut" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can re-execute the following cell as often as we want as a way to poll the status of the future. This could of course be done in a loop, pausing for a short time on each iteration. We could continue with our work, or view a progressbar of work still going on, or force a wait until the future is ready. \n", "\n", "In the meantime, the `status` dashboard (link above next to the Cluster widget) has gained a new element in the task stream, indicating that `inc()` has completed, and the progress section at the problem shows one task complete and held in memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fut" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Possible alternatives you could investigate:\n", "```python\n", "from dask.distributed import wait, progress\n", "progress(fut)\n", "```\n", "would show a progress bar in *this* notebook, rather than having to go to the dashboard. 
This progress bar is also asynchronous, and doesn't block the execution of other code in the meanwhile.\n", "\n", "```python\n", "wait(fut)\n", "```\n", "would block and force the notebook to wait until the computation pointed to by `fut` was done. However, note that the result of `inc()` is sitting in the cluster, it would take **no time** to execute the computation now, because Dask notices that we are asking for the result of a computation it already knows about. More on this later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# grab the information back - this blocks if fut is not ready\n", "c.gather(fut)\n", "# equivalent action when only considering a single future\n", "# fut.result()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see an alternative way to execute work on the cluster: when you submit or map with the inputs as futures, the *computation moves to the data* rather than the other way around, and the client, in the local Python session, need never see the intermediate values. This is similar to building the graph using delayed, and indeed, delayed can be used in conjunction with futures. Here we use the delayed object `total` from before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some trivial work that takes time\n", "# repeated from the Distributed chapter.\n", "\n", "from dask import delayed\n", "import time\n", "\n", "def inc(x):\n", " time.sleep(5)\n", " return x + 1\n", "\n", "def dec(x):\n", " time.sleep(3)\n", " return x - 1\n", "\n", "def add(x, y):\n", " time.sleep(7)\n", " return x + y\n", "\n", "x = delayed(inc)(1)\n", "y = delayed(dec)(2)\n", "total = delayed(add)(x, y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# notice the difference from total.compute()\n", "# notice that this cell completes immediately\n", "fut = c.compute(total)\n", "fut" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.gather(fut) # waits until result is ready" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `Client.submit`\n", "\n", "`submit` takes a function and arguments, pushes these to the cluster, returning a *Future* representing the result to be computed. The function is passed to a worker process for evaluation. Note that this cell returns immediately, while computation may still be ongoing on the cluster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fut = c.submit(inc, 1)\n", "fut" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks a lot like doing `compute()`, above, except now we are passing the function and arguments directly to the cluster. To anyone used to `concurrent.futures`, this will look familiar. This new `fut` behaves the same way as the one above. Note that we have now over-written the previous definition of `fut`, which will get garbage-collected, and, as a result, that previous result is released by the cluster\n", "\n", "### Exercise: Rebuild the above delayed computation using `Client.submit` instead\n", "\n", "The arguments passed to `submit` can be futures from other submit operations or delayed objects. The former, in particular, demonstrated the concept of *moving the computation to the data* which is one of the most powerful elements of programming with Dask." 
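, "\n", "\n", "As a side note (a sketch of the idea, not the exercise solution, and the array size is purely illustrative), passing a future into another `submit` keeps the large intermediate result on the cluster and only ships the small final answer back:\n", "\n", "```python\n", "import numpy as np\n", "\n", "big = c.submit(np.random.random, 10_000_000)   # ~80 MB array stays on a worker\n", "small = c.submit(np.mean, big)                 # runs where the data already lives\n", "small.result()                                 # only a single float is gathered\n", "```"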
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each futures represents a result held, or being evaluated by the cluster. Thus we can control caching of intermediate values - when a future is no longer referenced, its value is forgotten. In the solution, above, futures are held for each of the function calls. These results would not need to be re-evaluated if we chose to submit more work that needed them.\n", "\n", "We can explicitly pass data from our local session into the cluster using `scatter()`, but usually better is to construct functions that do the loading of data within the workers themselves, so that there is no need to serialise and communicate the data. Most of the loading functions within Dask, sudh as `dd.read_csv`, work this way. Similarly, we normally don't want to `gather()` results that are too big in memory.\n", "\n", "The [full API](http://distributed.readthedocs.io/en/latest/api.html) of the distributed scheduler gives details of interacting with the cluster, which remember, can be on your local machine or possibly on a massive computational resource." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The futures API offers a work submission style that can easily emulate the map/reduce paradigm (see `c.map()`) that may be familiar to many people. The intermediate results, represented by futures, can be passed to new tasks without having to bring the pull locally from the cluster, and new work can be assigned to work on the output of previous jobs that haven't even begun yet.\n", "\n", "Generally, any Dask operation that is executed using `.compute()` can be submitted for asynchronous execution using `c.compute()` instead, and this applies to all collections. Here is an example with the calculation previously seen in the Bag chapter. We have replaced the `.compute()` method there with the distributed client version, so, again, we could continue to submit more work (perhaps based on the result of the calculation), or, in the next cell, follow the progress of the computation. A similar progress-bar appears in the monitoring UI page." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d accounts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.bag as db\n", "import os\n", "import json\n", "filename = os.path.join('data', 'accounts.*.json.gz')\n", "lines = db.read_text(filename)\n", "js = lines.map(json.loads)\n", "\n", "f = c.compute(js.filter(lambda record: record['name'] == 'Alice')\n", " .pluck('transactions')\n", " .flatten()\n", " .pluck('amount')\n", " .mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import progress\n", "# note that progress must be the last line of a cell\n", "# in order to show up\n", "progress(f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get result.\n", "c.gather(f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# release values by deleting the futures\n", "del f, fut, x, y, total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Persist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Considering which data should be loaded by the workers, as opposed to passed, and which intermediate values to persist in worker memory, will in many cases determine the computation efficiency of a process.\n", "\n", "In the example here, we repeat a calculation from the Array chapter - notice that each call to `compute()` is roughly the same speed, because the loading of the data is included every time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d random" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import h5py\n", "import os\n", "f = h5py.File(os.path.join('data', 'random.hdf5'), mode='r')\n", "dset = f['/x']\n", "import dask.array as da\n", "x = da.from_array(dset, chunks=(1000000,))\n", "\n", "%time x.sum().compute()\n", "%time x.sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If, instead, we persist the data to RAM up front (this takes a few seconds to complete - we could `wait()` on this process), then further computations will be much faster." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# changes x from a set of delayed prescriptions\n", "# to a set of futures pointing to data in RAM\n", "# See this on the UI dashboard.\n", "x = c.persist(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time x.sum().compute()\n", "%time x.sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Naturally, persisting every intermediate along the way is a bad idea, because this will tend to fill up all available RAM and make the whole system slow (or break!). The ideal persist point is often at the end of a set of data cleaning steps, when the data is in a form which will get queried often." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: how is the memory associated with `x` released, once we know we are done with it?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Asynchronous computation\n", "\n", "\n", "One benefit of using the futures API is that you can have dynamic computations that adjust as things progress. Here we implement a simple naive search by looping through results as they come in, and submit new points to compute as others are still running.\n", "\n", "Watching the [diagnostics dashboard](../../9002/status) as this runs you can see computations are being concurrently run while more are being submitted. This flexibility can be useful for parallel algorithms that require some level of synchronization.\n", "\n", "Lets perform a very simple minimization using dynamic programming. The function of interest is known as Rosenbrock:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# a simple function with interesting minima\n", "import time\n", "\n", "def rosenbrock(point):\n", " \"\"\"Compute the rosenbrock function and return the point and result\"\"\"\n", " time.sleep(0.1)\n", " score = (1 - point[0])**2 + 2 * (point[1] - point[0]**2)**2\n", " return point, score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initial setup, including creating a graphical figure. We use Bokeh for this, which allows for dynamic update of the figure as results come in." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bokeh.io import output_notebook, push_notebook\n", "from bokeh.models.sources import ColumnDataSource\n", "from bokeh.plotting import figure, show\n", "import numpy as np\n", "output_notebook()\n", "\n", "# set up plot background\n", "N = 500\n", "x = np.linspace(-5, 5, N)\n", "y = np.linspace(-5, 5, N)\n", "xx, yy = np.meshgrid(x, y)\n", "d = (1 - xx)**2 + 2 * (yy - xx**2)**2\n", "d = np.log(d)\n", "\n", "p = figure(x_range=(-5, 5), y_range=(-5, 5))\n", "p.image(image=[d], x=-5, y=-5, dw=10, dh=10, palette=\"Spectral11\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start off with a point at (0, 0), and randomly scatter test points around it. Each evaluation takes ~100ms, and as result come in, we test to see if we have a new best point, and choose random points around that new best point, as the search box shrinks.\n", "\n", "We print the function value and current best location each time we have a new best value." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import as_completed\n", "from random import uniform\n", "\n", "scale = 5 # Intial random perturbation scale\n", "best_point = (0, 0) # Initial guess\n", "best_score = float('inf') # Best score so far\n", "startx = [uniform(-scale, scale) for _ in range(10)]\n", "starty = [uniform(-scale, scale) for _ in range(10)]\n", "\n", "# set up plot\n", "source = ColumnDataSource({'x': startx, 'y': starty, 'c': ['grey'] * 10})\n", "p.circle(source=source, x='x', y='y', color='c')\n", "t = show(p, notebook_handle=True)\n", "\n", "# initial 10 random points\n", "futures = [c.submit(rosenbrock, (x, y)) for x, y in zip(startx, starty)]\n", "iterator = as_completed(futures)\n", "\n", "for res in iterator:\n", " # take a completed point, is it an improvement?\n", " point, score = res.result()\n", " if score < best_score:\n", " best_score, best_point = score, point\n", " print(score, point)\n", "\n", " x, y = best_point\n", " newx, newy = (x + uniform(-scale, scale), y + uniform(-scale, scale))\n", " \n", " # update plot\n", " source.stream({'x': [newx], 'y': [newy], 'c': ['grey']}, rollover=20)\n", " push_notebook(document=t)\n", " \n", " # add new point, dynamically, to work on the cluster\n", " new_point = c.submit(rosenbrock, (newx, newy))\n", " iterator.add(new_point) # Start tracking new task as well\n", "\n", " # Narrow search and consider stopping\n", " scale *= 0.99\n", " if scale < 0.001:\n", " break\n", "point" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Debugging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When something goes wrong in a distributed job, it is hard to figure out what the problem was and what to do about it. When a task raises an exception, the exception will show up when that result, or other result that depend upon it, is gathered.\n", "\n", "Consider the following delayed calculation to be computed by the cluster. As usual, we get back a future, which the cluster is working on to compute (this happens very slowly for the trivial procedure)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@delayed\n", "def ratio(a, b):\n", " return a // b\n", "\n", "@delayed\n", "def summation(*a):\n", " return sum(*a)\n", "\n", "ina = [5, 25, 30]\n", "inb = [5, 5, 6]\n", "out = summation([ratio(a, b) for (a, b) in zip(ina, inb)])\n", "f = c.compute(out)\n", "f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only get to know what happened when we gather the result (this is also true for `out.compute()`, except we could not have done other stuff in the meantime). For the first set of inputs, it works fine." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c.gather(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But if we introduce bad input, an exception is raised. The exception happens in `ratio`, but only comes to our attention when calculating `summation`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "ina = [5, 25, 30]\n", "inb = [5, 0, 6]\n", "out = summation([ratio(a, b) for (a, b) in zip(ina, inb)])\n", "f = c.compute(out)\n", "c.gather(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The display in this case makes the origin of the exception obvious, but this is not always the case. 
How should this be debugged? How would we go about finding out the exact conditions that caused the exception? \n", "\n", "The first step, of course, is to write well-tested code which makes appropriate assertions about its input and gives clear warnings and error messages when something goes wrong. This applies to all code.\n", "\n", "The most typical thing to do is to execute some portion of the computation in a local thread, so that we can run the Python debugger and query the state of things at the time that the exception happened. Obviously, this cannot be performed on the whole data-set when dealing with Big Data on a cluster, but a suitable sample will probably do even then." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "import dask\n", "with dask.config.set(scheduler=\"sync\"):\n", "    # do NOT use c.compute(out) here - we specifically do not\n", "    # want the distributed scheduler\n", "    out.compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# uncomment to enter post-mortem debugger\n", "# %debug" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The trouble with this approach is that Dask is meant for the execution of large datasets/computations - you probably can't simply run the whole thing in one local thread, or you wouldn't have needed Dask in the first place. So the code above should only be used on a small part of the data that also exhibits the error.\n", "Furthermore, the method will not work when you are dealing with futures (such as `f`, above, or after persisting) instead of delayed-based computations.\n", "\n", "As an alternative, you can ask the scheduler to analyze your calculation, find the specific sub-task responsible for the error, and pull only it and its dependencies locally for execution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "raises-exception" ] }, "outputs": [], "source": [ "c.recreate_error_locally(f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# uncomment to enter post-mortem debugger\n", "# %debug" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, there are errors other than exceptions, for which we need to look at the state of the scheduler/workers. In the standard \"LocalCluster\" we started, we have direct access to these." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(k, v.state) for k, v in c.cluster.scheduler.tasks.items() if v.exception is not None]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dask Data Storage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Efficient storage can dramatically improve performance, particularly when operating repeatedly from disk.\n", "\n", "Decompressing text and parsing CSV files is expensive. One of the most effective strategies with medium data is to use a binary storage format like HDF5. Often the performance gains from doing this are sufficient that you can switch back to using Pandas instead of `dask.dataframe`.\n", "\n", "In this section we'll learn how to efficiently arrange and store your datasets in on-disk binary formats. We'll use the following:\n", "\n", "1. [Pandas `HDFStore`](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) format on top of `HDF5`\n", "2. 
Categoricals for storing text data numerically\n", "\n", "**Main Take-aways**\n", "\n", "1. Storage formats affect performance by an order of magnitude\n", "2. Text data will keep even a fast format like HDF5 slow\n", "3. A combination of binary formats, column storage, and partitioned data turns one-second wait times into 80ms wait times." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run prep.py -d accounts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read CSV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we read our CSV data as before.\n", "\n", "CSV and other text-based file formats are the most common storage for data from many sources, because they require minimal pre-processing, can be written line-by-line and are human-readable. Since Pandas' `read_csv` is well-optimized, CSVs are a reasonable input, but far from optimal, since reading requires extensive text parsing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "filename = os.path.join('data', 'accounts.*.csv')\n", "filename" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "df_csv = dd.read_csv(filename)\n", "df_csv.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Write to HDF5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HDF5 and netCDF are binary array formats very commonly used in the scientific realm.\n", "\n", "Pandas contains a specialized HDF5 format, `HDFStore`. The ``dd.DataFrame.to_hdf`` method works exactly like the ``pd.DataFrame.to_hdf`` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target = os.path.join('data', 'accounts.h5')\n", "target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert to binary format, takes some time up-front\n", "%time df_csv.to_hdf(target, '/data')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same data as before\n", "df_hdf = dd.read_hdf(target, '/data')\n", "df_hdf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare CSV to HDF5 speeds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do a simple computation that requires reading a column of our dataset and compare performance between the CSV files and our newly created HDF5 file. Which do you expect to be faster?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df_csv.amount.sum().compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df_hdf.amount.sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sadly they are about the same, or perhaps the HDF5 version is even slower.\n", "\n", "The culprit here is the `names` column, which is of `object` dtype and thus hard to store efficiently. There are two problems here:\n", "\n", "1. How do we store text data like `names` efficiently on disk?\n", "2. Why did we have to read the `names` column when all we wanted was `amount`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1. 
Store text efficiently with categoricals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use Pandas categoricals to replace our object dtypes with a numerical representation. This takes a bit more time up front, but results in better performance.\n", "\n", "More on categoricals at the [pandas docs](http://pandas.pydata.org/pandas-docs/stable/categorical.html) and [this blogpost](http://matthewrocklin.com/blog/work/2015/06/18/Categoricals)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Categorize data, then store in HDFStore\n", "%time df_hdf.categorize(columns=['names']).to_hdf(target, '/data2')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# It looks the same\n", "df_hdf = dd.read_hdf(target, '/data2')\n", "df_hdf.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# But loads more quickly\n", "%time df_hdf.amount.sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is now definitely faster than before. This tells us that it is not only the file type we use but also how we represent our variables that influences storage performance.\n", "\n", "How does the performance of reading depend on the scheduler we use? You can try this with the threaded, multiprocessing and distributed schedulers.\n", "\n", "However, this can still be improved. We had to read all of the columns (`names` and `amount`) in order to compute the sum of one (`amount`). We'll improve further on this with `parquet`, an on-disk column store; setting an index on the dask.dataframe, which we try in the exercise below, also helps with fast look-ups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`fastparquet` is a library for interacting with parquet-format files, which are a very common format in the Big Data ecosystem, used by tools such as Hadoop, Spark and Impala." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target = os.path.join('data', 'accounts.parquet')\n", "df_csv.categorize(columns=['names']).to_parquet(target, storage_options={\"has_nulls\": True}, engine=\"fastparquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Investigate the file structure in the resultant new directory - what do you suppose those files are for?\n", "\n", "`to_parquet` comes with many options, such as compression, whether to explicitly write NULLs information (not necessary in this case), and how to encode strings. You can experiment with these to see what effect they have on the file size and the processing times, below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls -l data/accounts.parquet/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_p = dd.read_parquet(target)\n", "# note that the dtype of the 'names' column shows how its values are stored - we\n", "# could choose to load it as a categorical column or not\n", "df_p.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rerun the sum computation above for this version of the data, and time how long it takes. You may want to try this more than once - it is common for many libraries to do various setup work when called for the first time."
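] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because Parquet is a column store, a computation that only touches `amount` does not need to read the `names` data from disk at all. If you want to make that column selection explicit, `dd.read_parquet` accepts a `columns=` argument - a small sketch, reusing the `target` path written above:\n", "\n", "```python\n", "# load only the column we need; the column store can skip the rest on disk\n", "df_amount = dd.read_parquet(target, columns=['amount'])\n", "df_amount.amount.sum().compute()\n", "```\n", "\n", "You can compare its timing against the cell below, which reads via `df_p` and selects the `amount` column afterwards."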
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df_p.amount.sum().compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When archiving data, it is common to sort and partition by a column with unique identifiers, to facilitate fast look-ups later. For this data, that column is `id`. Time how long it takes to retrieve the rows corresponding to `id==100` from the raw CSV, from the HDF5 and parquet versions, and finally from a new parquet version written after applying `set_index('id')`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# df_p.set_index('id').to_parquet(...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remote files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask can access various cloud- and cluster-oriented data storage services such as Amazon S3 or HDFS.\n", "\n", "Advantages:\n", "* scalable, secure storage\n", "\n", "Disadvantages:\n", "* network speed becomes the bottleneck" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way to set up dataframes (and other collections) remains very similar to before. Note that the data here is available anonymously, but in general an extra parameter `storage_options=` can be passed with further details about how to interact with the remote storage.\n", "\n", "```python\n", "taxi = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2015-*.csv',\n", "                   storage_options={'anon': True})\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: operations over the Internet can take a long time to run. Such operations work really well in a clustered cloud set-up, e.g., Amazon EC2 machines reading from S3, Microsoft Azure VMs or Google Compute Engine machines reading from GCS." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "352px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }