Integration with Dask
=====================

Dask_ is a powerful and flexible tool for scaling Python analytics across a
cluster. Dask works out-of-the-box with JupyterHub, but there are several
things you can configure to make the experience nicer.

Install Dependencies
--------------------

To run Dask on Hadoop you'll need to install `dask-yarn`_ in the
:ref:`notebook environment `.

.. code-block:: shell

    # Install with conda
    $ conda install -c conda-forge dask-yarn

    # Or install with pip
    $ pip install dask-yarn

Install Optional Dependencies
-----------------------------

For a nicer UI experience and access to the `Dask Dashboard`_, you'll also
want to install `jupyter-server-proxy`_ and ipywidgets_ in the
:ref:`notebook environment `.

.. code-block:: shell

    # Install with conda
    $ conda install -c conda-forge jupyter-server-proxy ipywidgets

    # Or install with pip
    $ pip install jupyter-server-proxy ipywidgets

Configuration
-------------

While Dask will work as long as ``dask-yarn`` is installed, as an
administrator you can preconfigure a few default fields to make things easier
for your users. Dask's default configuration is collected from the following
locations (earlier sources taking precedence):

- Environment variables like ``DASK_YARN__ENVIRONMENT``
- YAML files found in ``$DASK_CONFIG`` (defaults to ``~/.config/dask/``)
- YAML files found in ``<prefix>/etc/dask`` where ``<prefix>`` is a user's
  Conda/Virtual environment
- YAML files found in the system ``/etc/dask/`` directory
- Default settings within the various dask libraries

As such, you have a few options for how to pass configuration on to your
users. If you're using :ref:`archived notebook environments `, a good option
is to put configuration in ``<prefix>/etc/dask/config.yaml`` (where
``<prefix>`` is the top-level directory of the environment). This allows
having different configuration settings for different environments.
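
To make the precedence rules more concrete, the following is a minimal,
standard-library-only sketch of how a variable such as
``DASK_YARN__WORKER__VCORES`` could be mapped onto a nested configuration key
and layered over values read from a YAML file. It is an illustration under
stated assumptions, not Dask's actual implementation: the helper names
``env_to_config`` and ``merge`` and the sample values are hypothetical.

.. code-block:: python

    import ast


    def env_to_config(environ):
        """Turn DASK_* variables into a nested dict (illustration only)."""
        config = {}
        for name, raw in environ.items():
            if not name.startswith("DASK_"):
                continue
            # DASK_YARN__WORKER__VCORES -> ["yarn", "worker", "vcores"]
            keys = name[len("DASK_"):].lower().split("__")
            try:
                value = ast.literal_eval(raw)  # e.g. "4" -> 4
            except (ValueError, SyntaxError):
                value = raw  # plain strings are kept as-is
            node = config
            for key in keys[:-1]:
                node = node.setdefault(key, {})
            node[keys[-1]] = value
        return config


    def merge(low, high):
        """Recursively overlay ``high`` on ``low``; ``high`` wins on conflicts."""
        merged = dict(low)
        for key, value in high.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge(merged[key], value)
            else:
                merged[key] = value
        return merged


    # Hypothetical values standing in for a config file and the environment
    file_config = {"yarn": {"queue": "dask", "worker": {"vcores": 2}}}
    environ = {"DASK_YARN__WORKER__VCORES": "4", "HOME": "/home/alice"}

    effective = merge(file_config, env_to_config(environ))
    print(effective)
    # {'yarn': {'queue': 'dask', 'worker': {'vcores': 4}}}

In this sketch the environment variable overrides ``yarn.worker.vcores`` from
the file, while ``yarn.queue`` keeps the file's value.
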
These configuration files can also be combined with environment variables set
via ``YarnSpawner.environment`` (useful for values determined at runtime, or
for overriding the static config files).

Below we provide an example configuration file:

.. code-block:: yaml

    yarn:
      # Specify the default Python environment to use for the workers.
      # In most cases this should be the same as the environment used for the
      # user's notebook server.
      # Here we use an archived environment stored on hdfs
      environment: hdfs:///jupyterhub/example.tar.gz

      # To instead specify an environment already on every node, use
      # environment: local:///path/to/environment

      # The YARN queue to submit applications to by default
      queue: dask

      # Configure the default worker vcores and memory
      # These can be overridden by the user as needed
      worker:
        vcores: 2
        memory: 2 GiB

      # Use `local` deploy mode. This runs the scheduler in the same container
      # as the notebook, and allows for viewing the dashboard using
      # jupyter-server-proxy.
      deploy-mode: local

    distributed:
      dashboard:
        # Configure the link template for the dask dashboard. This updates the
        # dashboard links to proxy through jupyter-server-proxy
        link: /user/{JUPYTERHUB_USER}/proxy/{port}/status

The above are the parameters you'll most likely want to set, but Dask has many
more configuration options. For more information see the
`Dask configuration documentation`_.

Usage
-----

Given a fully configured system, users should be able to create a Dask
cluster as follows:

.. code-block:: python

    import dask_yarn

    cluster = dask_yarn.YarnCluster()

Default parameters can be overridden by specifying them at runtime:

.. code-block:: python

    import dask_yarn

    # Use different worker resources than the default
    cluster = dask_yarn.YarnCluster(
        worker_vcores=4,
        worker_memory='4 GiB'
    )

Users can then connect to their cluster and start doing work:

.. code-block:: python

    from dask.distributed import Client

    client = Client(cluster)

    # Start doing computations
    import dask.dataframe as dd

    ddf = dd.read_parquet("hdfs:///path/to/my/data.parquet")

Clusters can be scaled dynamically at runtime:

.. code-block:: python

    # Scale up to 10 workers
    cluster.scale(10)

    # Scale down to 4 workers
    cluster.scale(4)

If you installed ``jupyter-server-proxy`` and ``ipywidgets`` you'll also get a
nice UI for interacting with the cluster:

.. image:: /_images/dask-usage.gif
    :width: 90 %
    :align: center
    :alt: Interactive usage of Dask

Clusters can be shut down manually, or will be shut down automatically on
notebook exit.

Further Reading
---------------

Dask integrates well with the extensive Python data science ecosystem. For
more information please see the following resources:

- `Dask documentation`_
- `Dask on YARN`_
- `Dask Examples`_

.. _Dask documentation:
.. _Dask: https://dask.org/
.. _Dask on Yarn: https://yarn.dask.org
.. _Dask-Yarn: https://yarn.dask.org
.. _Dask Dashboard: https://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
.. _Dask configuration documentation: https://docs.dask.org/en/latest/configuration.html
.. _jupyter-server-proxy: https://jupyter-server-proxy.readthedocs.io/
.. _ipywidgets: https://ipywidgets.readthedocs.io/
.. _Dask Examples: https://examples.dask.org/