Integration with Dask

Dask is a powerful and flexible tool for scaling Python analytics across a cluster. Dask works out-of-the-box with JupyterHub, but there are several things you can configure to make the experience nicer.

Install Dependencies

To run Dask on Hadoop you’ll need to install dask-yarn in the notebook environment.

# Install with conda
$ conda install -c conda-forge dask-yarn

# Or install with pip
$ pip install dask-yarn

Install Optional Dependencies

For a nicer UI experience and access to the Dask Dashboard, you’ll also want to install jupyter-server-proxy and ipywidgets in the notebook environment.

# Install with conda
$ conda install -c conda-forge jupyter-server-proxy ipywidgets

#  Or install with pip
$ pip install jupyter-server-proxy ipywidgets

Configuration

While Dask will work as long as dask-yarn is installed, as an administrator you can preconfigure a few default fields to make things easier for your users.

Dask’s default configuration is collected from the following locations (earlier sources taking precedence):

  • Environment variables like DASK_YARN__ENVIRONMENT
  • YAML files found in $DASK_CONFIG (defaults to ~/.config/dask/)
  • YAML files found in <ENV>/etc/dask where <ENV> is a user’s Conda/Virtual environment
  • YAML files found in the system /etc/dask/ directory
  • Default settings within the various dask libraries

As such, you have a few options for how to pass configuration on to your users. If you’re using archived notebook environments, a good option is to put configuration in <ENV>/etc/dask/config.yaml (where <ENV> is the top-directory of the environment). This allows having different configuration settings for different environemnts. These can also be combined with environment variables set via YarnSpawner.environment (useful for things determined at runtime, or overrides for the static config files).

Below we provide an example configuration file:

yarn:
  # Specify the default Python environment to use for the workers.
  # In most cases this should be the same as the environment used for the
  # user's notebook server.
  # Here we use an archived environment stored on hdfs
  environment: hdfs:///jupyterhub/example.tar.gz',

  # To instead specify an environment already on every node, use
  # environment: local:///path/to/environment

  # The YARN queue to submit applications to by default
  queue: dask

  # Configure the default worker vcores and memory
  # These can be overridden by the user as needed
  worker:
    vcores: 2
    memory: 2 GiB

  # Use `local` deploy mode. This runs the scheduler in the same container
  # as the notebook, and allows for viewing the dashboard using
  # jupyter-server-proxy.
  deploy-mode: local

distributed:
  dashboard:
    # Configure the link template for the dask dashboard. This updates the
    # dashboard links to proxy through jupyter-server-proxy
    link: /user/{JUPYTERHUB_USER}/proxy/{port}/status

The above are likely parameters you’ll want to set, but Dask has many more configuration options. For more information see the Dask configuration documentation.

Usage

Given a fully configured system, users should be able to create a Dask Cluster as follows:

import dask_yarn
cluster = dask_yarn.YarnCluster()

Default parameters can be overridden by specifying them at runtime:

import dask_yarn
# Use different worker resources than the default
cluster = dask_yarn.YarnCluster(
    worker_vcores=4,
    worker_memory='4 GiB'
)

Users can then connect to their cluster and start doing work:

from dask.distributed import Client
client = Client(cluster)

# Start doing computations
import dask.dataframe as dd
ddf = dd.read_parquet("hdfs:///path/to/my/data.parquet")

Clusters can be scaled dynamically at runtime:

# Scale up to 10 workers
cluster.scale(10)

# Scale down to 4 workers
cluster.scale(4)

If you installed jupyter-server-proxy and ipywidgets you’ll also get a nice UI for interacting with the cluster:

Interactive usage of Dask

Clusters can be shutdown manually, or will be automatically shutdown on notebook exit.

Further Reading

Dask integrates well with the extensive Python datascience ecosystem. For more information please see the following resources: