Connect Jupyter Notebooks
Ocean Spark’s integration with Jupyter Notebooks enables you to run Jupyter kernels with Spark support on an Ocean Spark cluster. You can connect your notebooks from a Jupyter or JupyterLab server running locally or from a hosted JupyterHub.
Assumption: You already know how to create and manage configuration templates for Ocean Spark.
Connect a Local Jupyter Server
The Jupyter notebook server has an option to specify a gateway service in charge of running kernels on its behalf. Ocean Spark can fill this role and enables you to run Jupyter Spark kernels on the platform.
Install the Jupyter notebook Python package locally. Be sure to use the latest version (at least 6.0.0):
pip install notebook --upgrade
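You can confirm the installed version from Python (a quick check; the version printed depends on your environment):
# Check the installed notebook package version (should be at least 6.0.0)
import notebook
print(notebook.__version__)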
Launch a local Jupyter notebook server configured to interact with an Ocean Spark cluster:
jupyter notebook \
--GatewayClient.url=https://api.spotinst.io/ocean/spark/cluster/<your ocean spark cluster id>/notebook/ \
--GatewayClient.auth_token=<spot token> \
--GatewayClient.request_timeout=600
# With Notebook v7+, add this option:
--GatewayWebSocketConnection.kernel_ws_protocol=""
- The GatewayClient.url points to your Ocean Spark cluster. The cluster ID has the format osc-xxxxxxxx and can be found in the Clusters list in the Spot console.
- The GatewayClient.auth_token is a Spot API token.
- The GatewayClient.request_timeout parameter sets the maximum amount of time Jupyter waits for the Spark driver to start. If capacity is available in your cluster, the wait should be very short. If not, the Kubernetes cluster requests a new node from the cloud provider, which usually takes a couple of minutes. Set request_timeout to 10 minutes (600 seconds) to give yourself a safety margin; omitting this parameter can prevent your notebook from starting, because the default timeout is shorter than the time needed to provision a node and start the driver.
- The GatewayWebSocketConnection.kernel_ws_protocol option selects the legacy WebSocket subprotocol for compatibility reasons.
Tip: If you run into issues starting the Jupyter notebook server, ensure that your Ocean for Apache Spark cluster is marked as available in the Spot console.
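If you prefer not to pass these options on the command line every time, they can also be set in a Jupyter configuration file. Below is a minimal sketch, assuming a configuration file generated with jupyter notebook --generate-config (depending on your version, the file may be ~/.jupyter/jupyter_notebook_config.py or ~/.jupyter/jupyter_server_config.py); the cluster ID and token are placeholders:
# The `c` object is provided by Jupyter when it loads this configuration file
c.GatewayClient.url = "https://api.spotinst.io/ocean/spark/cluster/<your ocean spark cluster id>/notebook/"
c.GatewayClient.auth_token = "<spot token>"
c.GatewayClient.request_timeout = 600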
Ocean Spark is also compatible with JupyterLab. Install with:
pip install jupyterlab --upgrade
and run with:
jupyter lab \
--GatewayClient.url=https://api.spotinst.io/ocean/spark/cluster/<your ocean spark cluster id>/notebook/ \
--GatewayClient.request_timeout=600 \
--GatewayClient.auth_token=<spot token>
# With JupyterLab v4+, add this option:
--GatewayWebSocketConnection.kernel_ws_protocol=""
Define Jupyter kernels with configuration templates
Jupyter uses kernels to provide support for different languages and to configure notebook behavior. When a Jupyter server is connected to Ocean Spark, any Configuration template can be used as a kernel.
You can use the Spot console or the API to create a Configuration template. Here’s a configuration template example to help you get started:
{
  "type": "Python",
  "sparkVersion": "3.2.1",
  "sparkConf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "10",
    "spark.dynamicAllocation.initialExecutors": "1"
  }
}
After you create it in the Spot console, the configuration template (named “notebook-template” in this example) appears in the list of kernels in the Jupyter dashboard.
Scala Kernels
Ocean Spark also supports Jupyter Scala kernels. To open a Scala kernel, change the type field in your configuration template. Here's an example configuration for a Scala kernel:
{
  "type": "Scala",
  "sparkVersion": "3.2.1",
  "sparkConf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "0",
    "spark.dynamicAllocation.maxExecutors": "10",
    "spark.dynamicAllocation.initialExecutors": "1"
  }
}
Warning: Adding external JAR dependencies to Scala Notebooks
The deps.jars field in the application configuration does not work with Scala Notebooks and should not be set. The JARs specified in this field are not available on the driver Java classpath.
Instead, you can add external JARs to the Spark context from the notebook with these magic commands (once the Spark session is up):
- Add a JAR by URL:
  %AddJar <URL>
  For example: %AddJar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.20/postgresql-42.2.20.jar
- Add a dependency from a Maven repository:
  %AddDeps <group-id> <artifact-id> <version>
  For example: %AddDeps org.postgresql postgresql 42.2.20
  If the dependency has transitive dependencies, add the --transitive flag to include them as well.
More documentation for these magic commands is available in the Toree documentation.
Use a notebook
When you open a notebook, you need to wait for the kernel (i.e., the Spark driver) to be ready. As long as the kernel is marked as "busy" in the top right corner of the page, it means it has not started yet. This can take a few minutes. You can track the progress by looking at your Spark application page in the Spot console.
Here are the objects you can use to interact with Spark:
- The Spark context in variable sc
- The Spark SQL context in variable sqlContext
If those objects are not ready yet, you should see something like this upon invocation:
<__main__.WaitingForSparkSessionToBeInitialized at 0x7f8c15f4f240>
After a few seconds, they should be ready and you can use them to run Spark commands.
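For example, here is a quick sanity check you can run in a Python notebook cell (a minimal sketch using the sc and sqlContext objects above; the numbers are arbitrary):
# Run a small job through the Spark SQL context
df = sqlContext.range(0, 1000)   # DataFrame with a single "id" column
print(df.count())                # triggers a Spark job on the Ocean Spark cluster

# Equivalent check through the Spark context (RDD API)
rdd = sc.parallelize(range(100))
print(rdd.sum())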
You can install your own libraries by running:
!pip3 install <some-library>
Installing libraries this way makes them available only on the driver. If a library needs to be available on both the driver and the executors, install it directly in the Docker image.
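For instance, a library installed this way can still be useful for driver-side post-processing of small results collected from Spark (a minimal sketch; pandas is just an illustrative choice):
# Install pandas on the driver only
!pip3 install pandas

# Collect a small Spark result to the driver and post-process it with pandas
pdf = sqlContext.range(0, 10).toPandas()   # toPandas() runs on the driver, so driver-only pandas is enough
print(pdf.head())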
If you are new to Jupyter notebooks, you can use this tutorial as a starting point.
Close a Notebook
To close a notebook application, do not use the "Kill" action from the Spot console: Jupyter interprets this as a kernel failure and restarts your kernel, causing a new notebook application to appear.
Close your notebooks from Jupyter (File > Close & Halt). This terminates the Spark app in the Ocean Spark cluster.
Important Note
In some cases, a notebook may be "leaked", for example if the Jupyter server (running on your laptop) quits abruptly or loses its internet connection. This can leave a notebook application running on Ocean Spark without a linked Jupyter server. In this scenario, use the Kill action to terminate it. If no action is taken, the inactive kernel is culled after 60 minutes.