Apache Spark on Kubernetes - Submitting a job to Spark on Kubernetes (Part 2)

This is Part 2 of the article series: preparing a Spark Docker image for submitting a job to Spark on Kubernetes (see Part 1).

In this article we explain two options for submitting a job to the cluster:

  1. Create a custom Docker image, inherited from the original, with the job jar and submit capability (see the sketch after this list).
  2. Mount a volume with the job jar into the original image.
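
For illustration, a minimal sketch of the two options, assuming the base image from Part 1 is called my-spark and using made-up job class, jar and path names:

# Option 1: derive a custom image that bakes the job jar in
cat > Dockerfile <<'EOF'
# base image built in Part 1 (name is an assumption)
FROM my-spark:latest
COPY target/my-job.jar /opt/spark/jars/my-job.jar
EOF
docker build -t my-spark-job:latest .

# Option 2 instead mounts a volume containing my-job.jar into the original
# my-spark image (e.g. as a Kubernetes volume), so no image rebuild is needed.

# In both cases the job is submitted against the standalone master service:
/opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.MyJob \
  /opt/spark/jars/my-job.jar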

Apache Spark on Kubernetes - Docker image for Spark Standalone cluster (Part 1)

In this series of articles we create an Apache Spark on Kubernetes deployment. Spark will be running in standalone cluster mode, not using Spark's native Kubernetes support, as we do not want spark-submit to spin up new pods for us.

This is the Docker image for Spark Standalone cluster (Part 1), where we create a custom Docker image with our Spark distribution and scripts to start up the Spark master and Spark workers.
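
As a rough sketch of what such start-up scripts boil down to (paths and the master host name are assumptions, not the article's actual scripts):

# master (runs in the foreground, suitable as a container entrypoint)
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.master.Master \
  --host 0.0.0.0 --port 7077 --webui-port 8080

# worker, pointing at the master's Kubernetes service name
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker \
  spark://spark-master:7077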

The complete guide comprises 3 parts:

  • Docker image for Spark Standalone cluster (Part 1)
  • Submitting a job to Spark on Kubernetes (Part 2)
  • Publishing Spark UIs on Kubernetes (Part 3)

Running Ansible from inside Docker image for CI/CD pipeline

In this article we prepare a simple Docker image packed with our Ansible roles, ready-made for provisioning just by running a container from this image.

We describe the process of encapsulating the ansible executable, Ansible roles, dependent Galaxy roles, SSH key material and group variables into a Docker image for CI/CD use. We also present a way to run the prepared image from the command line without installing Ansible.
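
A minimal sketch of what running such an image could look like (the image name, playbook and inventory paths are assumptions):

docker run --rm \
  -v "$(pwd)/inventory:/ansible/inventory:ro" \
  my-ansible-roles:latest \
  ansible-playbook -i /ansible/inventory/hosts site.yml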

tt: Simplest and fastest time tracking

When it comes to tracking time on activities precisely, one has to start coping with time tracking applications. I was not able to find a simple solution: all the options were UI-based, packed with features and annoying to use. Then I started searching for command-line options, where the situation is much cleaner, but still, too many features.

Finally I came up with tt: a simple script of 6 lines of code. In this article I name some other options and inspirations you may use, and present the script itself along with installation steps.

Example:

tt hello
... some time
tt writing blog
... some time
tt lunch 
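
For a rough idea, a minimal sketch of what such a script can boil down to; the actual tt script may differ:

#!/bin/sh
# append a timestamped entry per activity to a log file;
# durations can later be derived from consecutive lines
echo "$(date '+%Y-%m-%d %H:%M') $*" >> "$HOME/.tt.log"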

Composing Avro Schemas from Subtypes

While working with Avro Schemas, one can quickly come to a point where schema definitions for multiple entities start to overlap and schema files grow in number of lines. As with object-oriented design of classes in your program, the same principle can be applied to the design of your Avro schema collection. Unfortunately, the Avro Schema Definition language does not have a native require or import syntax.

One possible solution is to rewrite all the schemas in the Avro Interface Definition Language, which has the import feature (see [1]).

If you do not want to rewrite all the schemas, or simply like the JSON schema definitions more, in this article we introduce a mechanism to:

  • design small schema file units, containing Avro named types
  • programmatically compose the files into large Avro schemas, one file per type

The article is accompanied by a full usage example and the source code of Avro Compose, an automatic schema composition tool.
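
For illustration, a tiny sketch of the kind of schema units meant above, written here as shell heredocs (the type names and fields are made up):

cat > Address.avsc <<'EOF'
{"type": "record", "name": "Address", "namespace": "com.example",
 "fields": [{"name": "city", "type": "string"}]}
EOF

cat > User.avsc <<'EOF'
{"type": "record", "name": "User", "namespace": "com.example",
 "fields": [{"name": "name",    "type": "string"},
            {"name": "address", "type": "com.example.Address"}]}
EOF

# the composition step then inlines com.example.Address into the User schema,
# producing one self-contained .avsc file per type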

Move Flink Savepoint to a different S3 location

Users of Apache Flink are familiar with creating a savepoint and restarting a job from a savepoint.

The issue with savepoints is how to move a savepoint to a different location and still be able to start a Flink job from the new location. The problem lies in the _metadata file of the savepoint, which contains absolute URIs (see the documentation on moving savepoints).

In this article, we go step by step through how to move a Flink savepoint from one S3 bucket to another and how to safely (without corrupting it) alter the _metadata file at the destination, so that the Flink job starts smoothly from the new savepoint location. The setup is tested with S3 and the filesystem state backend.
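
A minimal sketch of the copy and restart steps (bucket names, savepoint id and jar name are assumptions); the in-between step of fixing the _metadata URIs is what the article covers:

# copy the savepoint directory to the new bucket
aws s3 sync \
  s3://old-bucket/savepoints/savepoint-abc123/ \
  s3://new-bucket/savepoints/savepoint-abc123/

# ... fix the absolute URIs inside _metadata at the destination ...

# restart the job from the new location
flink run -s s3://new-bucket/savepoints/savepoint-abc123/ my-job.jar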

NiFi Registry behind nginx proxy with (client) SSL/TLS and basic auth

Running NiFi Registry behind an nginx proxy with SSL/TLS and basic_auth (inside nginx) is a bit tricky. In this article, we go step by step to create this hybrid setup:

  1. NiFi Registry listening on plain HTTP on port 18080 and without authentication
  2. nginx reverse proxy listening on port 18443 with a server-side SSL/TLS certificate and optional client SSL/TLS authentication
  3. nginx reverse proxy falling back to basic auth for clients which do not present a valid client SSL/TLS certificate
  4. Apache NiFi configured to use a pre-baked keystore and truststore to authenticate itself with client SSL/TLS against nginx
  5. NiFi Registry web UI accessible from a browser using basic auth

In this setup, NiFi does not authenticate against NiFi Registry (we still use anonymous access), but the communication between NiFi and nginx is encrypted. By using two-way SSL between NiFi and nginx, we can be sure that only a NiFi instance with the supplied private key and certificate will be able to talk to our NiFi Registry. By using basic auth when no client-side SSL certificate is supplied, we can be sure that only web browsers (users) who know the correct user/password are allowed to access the NiFi Registry web UI.

We will prepare the certificates and truststores in a way that assures nginx of the authenticity of the NiFi client and vice versa (using our own CA, but you can buy commercial certificates if you want).
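
A rough sketch of the nginx side of such a setup (the certificate paths, file names and the conditional basic auth pattern are assumptions, not necessarily the article's exact config):

cat > /etc/nginx/conf.d/nifi-registry.conf <<'EOF'
# skip basic auth only when a valid client certificate was presented
map $ssl_client_verify $registry_auth {
    SUCCESS  off;
    default  "NiFi Registry";
}

server {
    listen 18443 ssl;
    ssl_certificate         /etc/nginx/certs/server.crt;
    ssl_certificate_key     /etc/nginx/certs/server.key;
    ssl_client_certificate  /etc/nginx/certs/ca.crt;
    ssl_verify_client       optional;

    location / {
        auth_basic            $registry_auth;
        auth_basic_user_file  /etc/nginx/.htpasswd;
        proxy_pass            http://127.0.0.1:18080;
    }
}
EOF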

NiFi Registry in Docker with Git auto-cloning on startup

tl;dr

Running NiFi Registry with Git and auto-cloning on startup is possible with three authentication options:

  1. HTTPS user and password
  2. git+ssh (~/.ssh bind mount; sketched below)
  3. git+ssh (SSH keys as environment variables)
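
For illustration, a minimal sketch of option 2 (the image name, the mount target and the environment variable are hypothetical placeholders, not necessarily what the actual image expects):

docker run -d --name nifi-registry \
  -p 18080:18080 \
  -v "$HOME/.ssh:/home/nifi/.ssh:ro" \
  -e GIT_CLONE_URL=git@example.com:flows/nifi-flows.git \
  my-nifi-registry-git:latest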