JupyterHub on Hadoop
====================
JupyterHub_ provides a secure, multi-user interactive notebook_ environment. It
allows users to access a shared computing environment conveniently through a
webpage, with no local installation required.
JupyterHub is flexible and can be deployed in many different environments. In
the spirit of Zero-to-JupyterHub-Kubernetes_, this guide aims to help you set
up your own JupyterHub on an existing `Hadoop Cluster`_.
Note that this guide is under active development. If you find things unclear or
incorrect, or have any questions/comments, feel free to `create an issue on
github`_.
Why JupyterHub?
---------------
JupyterHub is not the only option for providing users a notebook environment
with Hadoop integration, but we believe this setup has some benefits over other
options.
- **Familiar**: JupyterHub provides the same Jupyter_ interface users know and
love. It integrates well with the existing Data Science ecosystem, and is
used extensively in both the private and public sector.
- **Extensible**: JupyterHub is open source and community supported, and has a
large ecosystem of plugins. It can support `dozens of languages`_ (Python, R,
Julia, Scala...), and user interfaces (Jupyter Notebooks, JupyterLab,
RStudio...).
- **Scalable**: With JupyterHub, each user gets their own environment running
in their own private container. This reduces the load on a single node, and
allows resource usage to scale dynamically with the number of users. For
large data, tools such as Spark_ and Dask_ work natively with no additional
overhead.
- **Portable**: JupyterHub is flexible and isn't bound to a single cluster
manager. It runs great on clusters (Kubernetes, Hadoop, HPC...) as well
as single machines. This means that if you change your infrastructure in the
future you can still keep using JupyterHub.
Architecture Overview
---------------------
JupyterHub is divided into three separate components:
- Multiple **single-user notebook servers** (one per active user).
- An **HTTP proxy** for proxying traffic between users and their respective
servers.
- A central **Hub** that manages authentication and single-user server
startup/shutdown.
When deploying JupyterHub on a Hadoop cluster, the **Hub** and **HTTP proxy**
are run on a single node (typically an edge node), and the **single-user
servers** are distributed throughout the cluster.
.. image:: /_images/architecture.svg
:width: 90 %
:align: center
:alt: JupyterHub on Hadoop high-level architecture
The resource requirements for the Hub node are minimal (1 GB of RAM should be
sufficient), as users' notebook servers (where the actual work is done) are
distributed throughout the Hadoop cluster, reducing the load on any single
node.
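To make this architecture concrete, the sketch below shows what a minimal
``jupyterhub_config.py`` on the Hub node might look like. It assumes the
jupyterhub-yarnspawner package is used to launch single-user servers in YARN
containers; the HDFS path and resource limits are placeholders you would
adapt to your cluster.

.. code-block:: python

   # jupyterhub_config.py -- a minimal sketch, assuming jupyterhub-yarnspawner
   # is installed and a packaged Python environment has been uploaded to HDFS.
   import socket

   # Launch each user's notebook server in its own YARN container.
   c.JupyterHub.spawner_class = "yarnspawner.YarnSpawner"

   # The Hub runs on an edge node; advertise an address that YARN
   # containers can reach (localhost is not routable from them).
   c.JupyterHub.hub_ip = socket.gethostname()

   # Per-user resource limits, enforced by YARN.
   c.YarnSpawner.mem_limit = "2 G"
   c.YarnSpawner.cpu_limit = 1

   # Ship a pre-packaged environment to each container and activate it
   # before starting the single-user server (hypothetical HDFS path).
   c.YarnSpawner.localize_files = {
       "environment": "hdfs:///jupyterhub/environment.tar.gz",
   }
   c.YarnSpawner.prologue = "source environment/bin/activate"

Note that ``c`` is the configuration object JupyterHub injects when it loads
this file; the file is not run standalone.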
Installation
------------
As cluster management practices differ, we hope to provide several options for
installation. Currently, only a manual installation tutorial is provided; if
you're interested in providing alternative options (`Cloudera Parcels`_,
RPMs_...) please `get in touch on github`_.
- :doc:`manual-installation`
Customization
-------------
Once basic installation is complete, there are several options for additional
customization.
- :doc:`enable-https`
- :doc:`contents-managers`
- :doc:`jupyterlab`
- :doc:`dask`
- :doc:`spark`
.. toctree::
:hidden:
installation
customization
demo
.. _JupyterHub: https://jupyterhub.readthedocs.io/
.. _Jupyter:
.. _notebook: https://jupyter.org/
.. _Zero-to-JupyterHub-Kubernetes: https://zero-to-jupyterhub.readthedocs.io/
.. _Hadoop Cluster: https://hadoop.apache.org/
.. _create an issue on github: https://github.com/jupyterhub/jupyterhub-on-hadoop/issues
.. _dozens of languages: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
.. _Dask: https://dask.org/
.. _Spark: https://spark.apache.org/
.. _Cloudera Parcels: https://github.com/jupyterhub/jupyterhub-on-hadoop/issues/1
.. _RPMs: https://github.com/jupyterhub/jupyterhub-on-hadoop/issues/8
.. _get in touch on github: https://github.com/jupyterhub/jupyterhub-on-hadoop/