The Most Important Kafka Producer Setting You're Probably Ignoring

When setting up Kafka producers, most developers focus on bootstrap.servers, maybe acks, and call it a day. But there’s one setting that can reduce your storage costs by 10x and improve throughput: compression.type.

TLDR:

On producer configuration set this:

compression.type: 'zstd'  # Just do this

If you are interested in why, continue reading.
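A minimal sketch of where this setting lives, using the confluent-kafka Python client (the broker address is a placeholder; the keys follow librdkafka configuration names):

```python
# Hedged sketch: producer configuration with zstd compression enabled.
# "localhost:9092" is a placeholder bootstrap address, not a real deployment.
producer_config = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
    # The setting this article is about: compress record batches with zstd.
    "compression.type": "zstd",
}

# Passing this dict to confluent_kafka.Producer(producer_config) would apply
# the setting; shown here as plain data so it runs without a broker.
print(producer_config["compression.type"])
```

Compression is applied per batch on the producer side, so the broker stores and replicates the compressed payload as-is.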

Flink One-to-Many and Many-to-One Pattern

Enriching a stream of events with a secondary stream of events is a common use case for Apache Flink.

However, when each object in the main stream references many objects in the secondary stream, we face a challenge in Apache Flink: the join is no longer 1:1 and we cannot use the join operators directly.

In the SQL world, one would use the UNNEST or CROSS JOIN operator; this article walks through the same pattern with the Apache Flink DataStream API.

We take orders as the main stream events; each order carries a list of product ids to be linked with real products from the secondary stream.
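The unnest-then-join idea can be sketched in plain Python (this is a conceptual illustration with made-up sample data, not the Flink DataStream code itself):

```python
# Conceptual sketch of the one-to-many join pattern.
# Sample orders and products are illustrative, not from the article.
orders = [
    {"order_id": 1, "product_ids": ["a", "b"]},
    {"order_id": 2, "product_ids": ["b"]},
]
products = {"a": "Apples", "b": "Bananas"}  # secondary stream, keyed by id

# 1:N step - the equivalent of a flatMap in the DataStream API:
# "unnest" each order into one (product_id, order_id) pair per product.
pairs = [(pid, o["order_id"]) for o in orders for pid in o["product_ids"]]

# N:1 step - a keyed join of each pair against the product state.
enriched = [(order_id, products[pid]) for pid, order_id in pairs]
print(enriched)
```

In Flink, the flatMap output would be keyed by product id and joined against the product stream held in keyed state, but the shape of the transformation is the same.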

The pattern itself is reusable and open-sourced, available for download at michalklempa/flink-tools

Kafka to SQL - easy & dirty, done

```mermaid
graph LR
  Kafka -->|events| SQL[SQL Database]
```

Kafka messaging system is commonly used for communication between services.

Oftentimes we need to get messages from Kafka into some materialized form, e.g. an SQL database, for analytical purposes. In this article, we discuss a way to achieve this very simply and quickly, without the need to deploy new frameworks.

When your use-case:

  • handles a low volume of data, let's say at most a hundred messages per second
  • does not require the analytical SQL database to have the data in near real-time; a minute of delay is not an issue

you may consider our approach a valid solution, avoiding coding and deploying:

  • Kafka Consumer/Kafka Connect
  • Beam/Flink/Kafka Streams/Spark Structured Streaming

and saving a lot of effort on maintaining these deployments.

Curious how? Continue reading.

Combining Docker Images - a way to compose multiple images into one

From time to time, we need to combine or compose multiple tools into a single Docker image. Instead of running a tedious apt-get/yum install and waiting for your image to build, try combining the upstream Docker images directly into one.
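One way to do this is a multi-stage build that copies binaries straight out of the upstream images (the image and binary names below are illustrative, not the article's actual choices):

```dockerfile
# Hedged sketch: compose tools from upstream images via COPY --from.
# Base image and source images are placeholders for illustration.
FROM ubuntu:22.04

# Copy a prebuilt binary out of an upstream image instead of installing it.
COPY --from=bitnami/kubectl:latest /opt/bitnami/kubectl/bin/kubectl /usr/local/bin/kubectl
```

Each COPY --from pulls only the files you need, so the final image stays small and you inherit updates by rebuilding against the upstream tags.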

Docker Azcopy image

Although there is a request to provide an official image and there are a few AzCopy images out there, they are either not updated or very bare, with a minimal toolset.

In this article we introduce, publish, and maintain our michalklempa/azcopy-all image, which includes az-cli, azcopy, and kubectl, all with bash completion.

Apache Spark on Kubernetes - Publishing Spark UIs on Kubernetes (Part 3)

This is Part 3, Publishing Spark UIs on Kubernetes, of the article series (see Part 1 and Part 2).

In this article we go through the process of publishing the Spark master, worker, and driver UIs in our Kubernetes setup.

Apache Spark on Kubernetes - Submitting a job to Spark on Kubernetes (Part 2)

This is Part 2, Preparing a Spark Docker image for submitting a job to Spark on Kubernetes, of the article series (see Part 1).

In this article we explain two options for submitting a job to the cluster:

  1. Create an inherited custom Docker image with the job jar and submit capability.
  2. Mount a volume with the job jar into the original image.
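Option 1 boils down to a short Dockerfile layered on top of the cluster image (the image name and jar path below are illustrative placeholders, not the article's actual values):

```dockerfile
# Hedged sketch of option 1: inherit the cluster image and bake the job jar in.
# Base image name and jar paths are placeholders for illustration.
FROM michalklempa/spark:latest
COPY target/my-job.jar /opt/spark/jars/my-job.jar
```

Option 2 keeps the original image untouched and instead mounts the jar in via a Kubernetes volume, which avoids rebuilding the image for every job change.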

Apache Spark on Kubernetes - Docker image for Spark Standalone cluster (Part 1)

In this series of articles we create an Apache Spark on Kubernetes deployment. Spark will run in standalone cluster mode, not using the Spark Kubernetes support, as we do not want spark-submit to spin up new pods for us.

This is Part 1, Docker image for Spark Standalone cluster, where we create a custom Docker image with our Spark distribution and scripts to start up the Spark master and Spark workers.

The complete guide comprises 3 parts:

  • Docker image for Spark Standalone cluster (Part 1)
  • Submitting a job to Spark on Kubernetes (Part 2)
  • Publishing Spark UIs on Kubernetes (Part 3)