This is Part 2 of the Preparing a Spark Docker image for submitting a job to Spark on Kubernetes article series (see Part 1).
In this article we explain two options for submitting a job to the cluster:
- Create an inherited custom Docker image that contains the job Jar and is able to submit it.
- Mount the job Jar into the original image as a Kubernetes volume.
Introduction
Given the Spark Standalone cluster we built in the previous article, we now create a sample Spark job to be submitted to the cluster. Then we describe two options for providing the Jar file to the cluster: via an inherited Docker image or via a Kubernetes volume.
Sample Spark Job
We have prepared an example Spark job in the examples/sql-example directory.
The job has a simple pom.xml with only two dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
Both of them are defined as provided, since these Jars are part of the Spark distribution.
Beware of the versions, however. There are properties in pom.xml
<properties>
    <project.java.version>1.8</project.java.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.binary.version>2.12</scala.binary.version>
    <spark.version>3.0.1</spark.version>
</properties>
which should match the Spark distribution version used in the Docker images.
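If in doubt, a quick way to check which Spark version is baked into the base image is to run spark-submit --version inside it (a sketch only; it assumes the image has no conflicting entrypoint and uses the SPARK_HOME=/opt/spark path visible later in the driver logs):
> docker run --rm michalklempa/spark:3.0.1-hadoop2.7 /opt/spark/bin/spark-submit --version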
The source code of the project is very simple. It just prints a small Dataset using Spark SQL.
Building this project with Maven:
examples/sql-example > mvn package
Produces a tiny 5KB Jar file:
-rw-rw-r-- 1 michalklempa michalklempa 4,7K jan 6 13:10 spark-sql-example-1.0.0-SNAPSHOT.jar
Submitting a job
We describe two ways of providing the Jar file to the driver container:
- building an inherited Docker image with the Jar file included, or
- mounting the Jar file into the original image as a Kubernetes volume.
Depending on the surrounding environment and circumstances, you may find each of them useful in different scenarios. If your build system keeps versions of Docker images built from your source code, the former option may be suitable. If your build system keeps track of Jar files and versions and your deployment scripts are capable of providing volumes to Kubernetes dynamically, the latter is the way to go.
Inherited Docker image
We create a multi-stage Docker image for our example SQL Spark job. The first stage (called build) builds the Jar file and the second stage places the Jar inside the inherited Docker image:
ARG SPARK_VERSION=3.0.1
ARG HADOOP_VERSION=2.7
ARG ARG_JAR_NAME=spark-sql-example-1.0.0-SNAPSHOT.jar
ARG ARG_MAIN_CLASS=com.michalklempa.spark.sql.example.Main
FROM maven:3-openjdk-8 as build
ARG ARG_JAR_NAME
ARG ARG_MAIN_CLASS
RUN mkdir /work
COPY pom.xml /work/pom.xml
COPY src /work/src
WORKDIR /work
RUN mvn clean package
FROM michalklempa/spark:${SPARK_VERSION}-hadoop${HADOOP_VERSION}
ARG ARG_JAR_NAME
ARG ARG_MAIN_CLASS
COPY --from=build /work/target/${ARG_JAR_NAME} /${ARG_JAR_NAME}
ENV MAIN_CLASS=${ARG_MAIN_CLASS}
ENV JAR="/${ARG_JAR_NAME}"
The build arguments for the Spark version and Hadoop version, although not used in the first stage, need to be declared on the first lines of the Dockerfile, since they are re-used in the FROM statement later.
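If you need a different Spark or Hadoop version, these arguments can be overridden at build time; for example (the values below are just the defaults repeated and must match an existing base image tag):
> docker build \
    --build-arg SPARK_VERSION=3.0.1 \
    --build-arg HADOOP_VERSION=2.7 \
    -t michalklempa/spark-sql-example .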
As we can see, besides building the Jar, the only thing we do is place the Jar into the root directory of the final image and set two environment variables:
- MAIN_CLASS
- JAR
Both of them are used by the submit.sh script from the original image.
The submit.sh is an interesting piece:
env
export SPARK_PUBLIC_DNS=$(hostname -i)
java ${JAVA_OPTS} \
-cp "${JAR}:${SPARK_HOME}/conf:${SPARK_HOME}/jars/*" \
org.apache.spark.deploy.SparkSubmit \
--deploy-mode client \
--master spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} \
--class ${MAIN_CLASS} \
${SUBMIT_OPTS} \
--conf 'spark.driver.host='$(hostname -i) \
local://${JAR}
It is possible to specify custom JAVA_OPTS just by setting the environment variable. Custom submission options for the SparkSubmit class can be specified via SUBMIT_OPTS.
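For example, if you were to run the driver container manually with Docker, the options could be passed as plain environment variables (a sketch only; the values are illustrative and networking towards the master is not addressed here). In Kubernetes, the same variables would simply go into the env section of the driver container spec:
> docker run --rm \
    -e SPARK_MASTER_HOST=spark-master \
    -e SPARK_MASTER_PORT=7077 \
    -e JAVA_OPTS="-Xmx512m" \
    -e SUBMIT_OPTS="--conf spark.executor.memory=1g --conf spark.executor.cores=1" \
    michalklempa/spark-sql-example /submit.sh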
The URL used to specify the Jar file starts with the local:// scheme, which requires the Jar file to be present on every worker node when submitting a job to the Spark cluster.
Usually, the Jar is distributed by Spark itself, by YARN, or by some other means (HDFS, S3, Azure Blob Storage).
Since in this setup we have complete control over the Docker image used, we can simply put the Jar file in place and do not need to manage the Jar distribution.
If your build system versions Docker images built from your artifacts, putting the Jar file inside makes perfect sense.
Using the Dockerfile above, let's build the image:
> eval $(minikube docker-env)
> docker build -t michalklempa/spark-sql-example .
...
---> Running in 5990a696bf1c
Removing intermediate container 5990a696bf1c
---> 1c52373acb80
Step 16/18 : COPY --from=build /work/target/${ARG_JAR_NAME} /${ARG_JAR_NAME}
---> 83a7fae7826f
Step 17/18 : ENV MAIN_CLASS=${ARG_MAIN_CLASS}
---> Running in 7c206f9e6499
Removing intermediate container 7c206f9e6499
---> 9927df4584e1
Step 18/18 : ENV JAR="/${ARG_JAR_NAME}"
---> Running in 26992958fe9e
Removing intermediate container 26992958fe9e
---> 2233ba372a4c
Successfully built 2233ba372a4c
Successfully tagged michalklempa/spark-sql-example:latest
Once built, we can test the job with the manifest.yml from the examples directory.
> kubectl apply -f manifest.yml
> kubectl get pod
NAME READY STATUS RESTARTS AGE
spark-driver-z5542 1/1 Running 0 3m
spark-master-deployment-6ff87fc9df-n5g2s 1/1 Running 0 3m
spark-worker-deployment-56fb57c88d-5jhgc 1/1 Running 0 3m
spark-worker-deployment-56fb57c88d-xbwx4 1/1 Running 0 3m
The driver pod should log the output Dataset:
> kubectl logs spark-driver-z5542
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_SERVICE_PORT=443
HOSTNAME=spark-driver
LANGUAGE=en_US:en
JAVA_HOME=/usr/lib/jvm/zulu11-ca-amd64
SPARK_MASTER_HOST=spark-master
PWD=/opt/spark
HOME=/root
LANG=en_US.UTF-8
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
SPARK_MASTER_PORT=7077
SHLVL=1
SPARK_HOME=/opt/spark
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
MAIN_CLASS=com.michalklempa.spark.sql.example.Main
...
21/01/06 19:41:14 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-master:7077...
21/01/06 19:41:34 INFO TransportClientFactory: Successfully created connection to spark-master/10.97.134.250:7077 after 2 ms (0 ms spent in bootstraps)
21/01/06 19:41:34 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20210106194134-0000
21/01/06 19:41:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45031.
...
21/01/06 19:41:54 INFO CodeGenerator: Code generated in 27.510135 ms
+-----------+----+
|temperature|snow|
+-----------+----+
| 1.0|30.0|
| 1.5|60.0|
| -5.5|70.0|
| -7.0|55.0|
+-----------+----+
21/01/06 19:41:54 INFO SparkUI: Stopped Spark web UI at http://172.18.0.4:4040
After the run finishes, the status of the driver pod is Completed:
> kubectl get pod
NAME READY STATUS RESTARTS AGE
spark-driver-z5542 0/1 Completed 0 39m
spark-master-deployment-6ff87fc9df-n5g2s 1/1 Running 0 39m
spark-worker-deployment-56fb57c88d-5jhgc 1/1 Running 0 39m
spark-worker-deployment-56fb57c88d-xbwx4 1/1 Running 0 39m
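Before moving on to the volume-based approach, it is a good idea to remove the resources created by the first manifest, so the two setups do not collide:
> kubectl delete -f manifest.yml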
Mounting a Kubernetes volume
Assuming minikube with the docker driver, we first need to mount our examples/sql-example directory into the minikube cluster:
> pwd
examples/sql-example
> minikube mount $(pwd):/mnt
📁 Mounting host path /home/michalklempa/github/docker-spark/examples/sql-example into VM as /mnt ...
▪ Mount type:
▪ User ID: docker
▪ Group ID: docker
▪ Version: 9p2000.L
▪ Message Size: 262144
▪ Permissions: 755 (-rwxr-xr-x)
▪ Options: map[]
▪ Bind Address: 172.17.0.1:39291
🚀 Userspace file server: ufs starting
✅ Successfully mounted /home/michalklempa/github/docker-spark/examples/sql-example to /mnt
📌 NOTE: This process must stay alive for the mount to be accessible ...
This mounts our working source code directory into the running docker container of minikube, at the /mnt directory.
When we build the project on our host machine, the target/spark-sql-example-1.0.0-SNAPSHOT.jar file appears as /mnt/target/spark-sql-example-1.0.0-SNAPSHOT.jar for Kubernetes.
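We can quickly verify that the Jar is indeed visible inside the minikube node (the path follows from the mount above):
> minikube ssh -- ls -l /mnt/target/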
We will use the stock Docker image michalklempa/spark:3.0.1-hadoop2.7 for running the Spark cluster and mount the Jar file as a volume.
We need to add a volume specification to the Pod template spec:
volumes:
  - name: jar
    hostPath:
      path: /mnt/target/spark-sql-example-1.0.0-SNAPSHOT.jar
      type: File
And a volume mount to the container spec:
volumeMounts:
  - mountPath: /spark-sql-example-1.0.0-SNAPSHOT.jar
    name: jar
These two snippets are needed for the worker deployment pods and the driver pod. The complete snippet of the worker pod spec:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: spark-worker-deployment
    labels:
      app: spark
      role: worker
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: spark
        role: worker
    template:
      metadata:
        labels:
          app: spark
          role: worker
      spec:
        restartPolicy: Always
        enableServiceLinks: false
        containers:
          - name: spark-worker
            image: michalklempa/spark:3.0.1-hadoop2.7
            imagePullPolicy: Never
            command:
              - /worker.sh
            env:
              - name: SPARK_MASTER_HOST
                value: "spark-master"
              - name: SPARK_MASTER_PORT
                value: "7077"
            volumeMounts:
              - mountPath: /spark-sql-example-1.0.0-SNAPSHOT.jar
                name: jar
        volumes:
          - name: jar
            hostPath:
              path: /mnt/target/spark-sql-example-1.0.0-SNAPSHOT.jar
              type: File
and driver spec:
spec:
  completions: 1
  template:
    metadata:
      labels:
        app: spark
        role: driver
    spec:
      restartPolicy: OnFailure
      enableServiceLinks: false
      hostname: spark-driver
      containers:
        - name: spark-driver
          image: michalklempa/spark:3.0.1-hadoop2.7
          imagePullPolicy: Never
          command:
            - /submit.sh
          env:
            - name: SPARK_MASTER_HOST
              value: "spark-master"
            - name: SPARK_MASTER_PORT
              value: "7077"
            - name: MAIN_CLASS
              value: "com.michalklempa.spark.sql.example.Main"
            - name: JAR
              value: "/spark-sql-example-1.0.0-SNAPSHOT.jar"
          volumeMounts:
            - mountPath: /spark-sql-example-1.0.0-SNAPSHOT.jar
              name: jar
      volumes:
        - name: jar
          hostPath:
            path: /mnt/target/spark-sql-example-1.0.0-SNAPSHOT.jar
            type: File
The driver pod also needs the environment variables JAR and MAIN_CLASS for submit.sh to work.
The complete manifest example is available as manifest-volumes.yml.
Apply this manifest:
> kubectl apply -f manifest-volumes.yml
That's it.
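As before, we can check the pods and the driver log to confirm the Dataset was printed (the label selector matches the role: driver label from the manifest):
> kubectl get pod
> kubectl logs -l role=driver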
Conclusion
All the project files and roles are available in the docker-spark GitHub repository. Pre-built Docker images are available at hub.docker.com/r/michalklempa/spark.