JupyterHub on Hadoop

JupyterHub provides a secure, multi-user interactive notebook environment. It allows users to access a shared computing environment conveniently through a webpage, with no local installation required.

JupyterHub is flexible and can be deployed in many different environments. In the spirit of Zero-to-JupyterHub-Kubernetes, this guide aims to help you set up your own JupyterHub on an existing Hadoop Cluster.

Note that this guide is under active development. If you find things unclear or incorrect, or have any questions/comments, feel free to create an issue on github.

Walkthrough

Why JupyterHub?

JupyterHub is not the only option for providing users a notebook environment with Hadoop integration, but we believe this setup has some benefits over other options.

  • Familiar: JupyterHub provides the same Jupyter interface users know and love. It integrates well with the existing Data Science ecosystem, and is used extensively in both the private and public sector.
  • Extensible: JupyterHub is open source and community supported, and has a large ecosystem of plugins. It can support dozens of languages (Python, R, Julia, Scala…), and user interfaces (Jupyter Notebooks, JupyterLab, RStudio…).
  • Scalable: With JupyterHub, each user gets their own environment running in their own private container. This reduces the load on a single node, and allows resource usage to scale dynamically with the number of users. For large data, tools such as Spark and Dask work natively with no additional overhead.
  • Portable: JupyterHub is flexible and isn’t bound to a single cluster manager. It runs great on clusters (Kubernetes, Hadoop, HPC…) as well as single machines. This means that if you change your infrastructure in the future you can still keep using JupyterHub.

Architecture Overview

JupyterHub is divided into three separate components:

  • Multiple single-user notebook servers (one per active user)
  • An HTTP proxy for proxying traffic between users and their respective servers.
  • A central Hub that manages authentication and single-user server startup/shutdown.

When deploying JupyterHub on a Hadoop cluster, the Hub and HTTP proxy are run on a single node (typically an edge node), and the single-user servers are distributed throughout the cluster.

JupyterHub on Hadoop high-level architecture

The resource requirements for the Hub node are minimal (a minimum of 1 GB RAM should be sufficient), as user’s notebooks (where the actual work is being done) are distributed throughout the Hadoop cluster reducing the load on any single node.

Installation

As cluster management practices differ, we hope to provide several options for installation. Currently only a manual installation tutorial is provided - if you’re interested in providing alternative options (Cloudera Parcels, RPMs…) please get in touch on github.

Customization

Once basic installation is complete, there are several options for additional customization.