Integration with Spark

By using JupyterHub, users get secure access to a container running inside the Hadoop cluster, which means they can interact with Spark directly (rather than going through a proxy like Livy). This is both simpler and faster, as results don’t need to be serialized through Livy.

Installation

Spark must be installed on your cluster before use. Follow the installation guidelines from your distribution, or refer to the Spark on YARN documentation.
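
A quick way to sanity-check the installation from a cluster node is to verify that the Spark launcher and the PySpark libraries are present (a minimal sketch; the paths below assume a typical /usr/lib/spark layout and may differ for your distribution):

import os
import shutil

# Paths and command names here are assumptions; adjust for your distribution
assert shutil.which('spark-submit') is not None, 'spark-submit not found on PATH'
assert os.path.exists('/usr/lib/spark/python'), 'PySpark libraries not found at the expected location'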

Configuration

PySpark isn’t installed like a normal Python library; rather, it’s packaged separately and needs to be added to PYTHONPATH to be importable. This can be done in jupyterhub_config.py by locating the required libraries and setting PYTHONPATH in the user’s notebook environment. You’ll also want to set PYSPARK_PYTHON to the Python executable used by the notebook environment.

import os
import glob
# Find pyspark modules to add to PYTHONPATH, so they can be used as regular
# libraries
pyspark = '/usr/lib/spark/python/'
py4j = glob.glob(os.path.join(pyspark, 'lib', 'py4j-*.zip'))[0]
pythonpath = ':'.join([pyspark, py4j])

# Set PYTHONPATH and PYSPARK_PYTHON in the user's notebook environment
c.YarnSpawner.environment = {
    'PYTHONPATH': pythonpath,
    'PYSPARK_PYTHON': '/opt/jupyterhub/miniconda/bin/python',
}
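
With those environment variables set, PySpark should be importable from a user’s notebook without any extra setup. A quick check, assuming the configuration above:

# Run from a user's notebook to confirm PySpark resolves from PYTHONPATH
import pyspark
print(pyspark.__version__)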

If you’re using an archived notebook environment, you may instead want to bundle a Spark configuration directory in the archive and set SPARK_CONF_DIR to the extracted path. This lets you specify the path to the archive in the Spark configuration itself, so your users don’t have to specify it themselves. This might look like:

# A custom spark-defaults.conf
# Stored at `<ENV>/etc/spark/spark-defaults.conf`, where `<ENV>` is the top
# directory of the unarchived Conda/virtual environment.

# Common configuration
spark.master yarn
spark.submit.deployMode client
spark.yarn.queue myqueue

# If the spark jars are already on every node, avoid serializing them
spark.yarn.jars local:/usr/lib/spark/jars/*

# Path to the archived Python environment
spark.yarn.dist.archives hdfs:///jupyterhub/example.tar.gz#environment

# Pyspark configuration
spark.pyspark.python ./environment/bin/python
spark.pyspark.driver.python ./environment/bin/python

And the jupyterhub_config.py file:

# Add PySpark to PYTHONPATH, same as above
# ...

# Set PYTHONPATH and SPARK_CONF_DIR in the user's notebook environment
c.YarnSpawner.environment = {
    'PYTHONPATH': pythonpath,
    'SPARK_CONF_DIR': './environment/etc/spark'
}

Usage

Given configuration like the above, users may not need to pass any parameters when creating a SparkContext; the defaults set in the configuration may already be sufficient:

import pyspark

# Create a spark context from the defaults set in configuration
sc = pyspark.SparkContext()
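
If the defaults were picked up correctly, the context should be running against YARN. A quick way to check (optional; both attributes are standard SparkContext properties):

# Confirm the context is using the configured defaults
print(sc.master)         # should print 'yarn'
print(sc.applicationId)  # the YARN application id backing this context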

Of course, overrides can always be provided at runtime if needed:

import pyspark

conf = pyspark.SparkConf()

# Override a few default parameters
conf.set('spark.executor.memory', '512m')
conf.set('spark.executor.instances', 1)

# Create a spark context with the overrides
sc = pyspark.SparkContext(conf=conf)

If all nodes are configured to use the same Python path or archive as the notebook, then the notebook’s dependencies are also available on every worker:

def some_function(x):
    # Libraries are imported and available from the same environment as the
    # notebook
    import sklearn
    import pandas as pd
    import numpy as np

    # Use the libraries to do work
    return ...


# Note: `take` returns a list of results, not an RDD
results = sc.parallelize(range(1000)).map(some_function).take(10)

When you’re done, the Spark application can be shut down manually by stopping the SparkContext, or it will be shut down automatically when the notebook exits.
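
For example, to shut down manually:

# Stop the SparkContext, releasing its YARN resources
sc.stop()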

Further Reading

There are additional Jupyter and Spark integrations that may be useful for your installation. Please refer to their documentation for more information:

  • sparkmonitor: Realtime monitoring of Spark applications from inside the notebook
  • jupyter-spark: Simpler progress indicators for running Spark jobs

Additionally, you may find the following resources useful: