Docker Azcopy image
In this article we introduce our michalklempa/azcopy-all image, which we publish and maintain. It includes az-cli, azcopy, and kubectl, all with bash completion.
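A quick taste of using the image (a minimal sketch; the default command and tool versions depend on the image tag):

    # Start an interactive shell in the container (latest tag assumed):
    docker run -it --rm michalklempa/azcopy-all bash
    # Inside, the bundled tools are on the PATH:
    #   az login
    #   azcopy --help
    #   kubectl version --client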
Apache Spark on Kubernetes - Publishing Spark UIs on Kubernetes (Part 3)
In this article we go through the process of publishing the Spark master, worker, and driver UIs in our Kubernetes setup.
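As a quick local stand-in for publishing, the UIs can be reached with kubectl port-forward; the Service name below is an assumption (the article covers exposing the UIs properly):

    # Forward the Spark master UI (default port 8080) to localhost:
    kubectl port-forward service/spark-master 8080:8080
    # Then browse http://localhost:8080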
Apache Spark on Kubernetes - Submitting a job to Spark on Kubernetes (Part 2)
This is Part 2 of the article series (see Part 1), in which we prepare a Spark Docker image for submitting a job to Spark on Kubernetes.
In this article we explain two options for submitting a job to the cluster (a sketch of the second option follows the list):
- Create a custom Docker image, inherited from the original, with the job jar and submit capability baked in.
- Mount a volume with the job jar into the original image.
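A minimal sketch of the second option, assuming a pod named spark-client runs the original image with the job jar mounted at /opt/spark-jobs (pod name, paths, and job class are illustrative):

    # Submit the mounted jar against the standalone master (default port 7077):
    kubectl exec -it spark-client -- \
      /opt/spark/bin/spark-submit \
        --master spark://spark-master:7077 \
        --class com.example.MyJob \
        /opt/spark-jobs/my-job.jar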
Apache Spark on Kubernetes - Docker image for Spark Standalone cluster (Part 1)
In this series of articles we create an Apache Spark on Kubernetes deployment. Spark will run in standalone cluster mode, not using the native Spark Kubernetes support, as we do not want spark-submit to spin up new pods for us.
This is the Docker image for Spark Standalone cluster (Part 1), where we create a custom Docker image with our Spark distribution and scripts to start up the Spark master and Spark workers.
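For orientation, the commands underneath such start-up scripts look roughly like this; the /opt/spark install path and the spark-master hostname are assumptions:

    # Run the master in the foreground, suitable as a container command:
    /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    # Run a worker in the foreground, registering with the master:
    /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077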
The complete guide comprises 3 parts:
- Docker image for Spark Standalone cluster (Part 1)
- Submitting a job to Spark on Kubernetes (Part 2)
- Publishing Spark UIs on Kubernetes (Part 3)
Running Ansible from inside Docker image for CI/CD pipeline
In this article we prepare a simple Docker image packed with our Ansible roles, ready-made for provisioning just by running a container from the image. We describe the process of encapsulating the ansible executable, Ansible roles, dependent Galaxy roles, SSH key material, and group variables into a Docker image for CI/CD use. We also present a way to run the prepared image from the command line without installing Ansible.
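With everything baked in, provisioning boils down to a single command; the image name, inventory path, and playbook below are placeholders:

    # No Ansible on the host: roles, Galaxy dependencies, SSH keys and
    # group variables all live inside the image.
    docker run --rm my-ansible-image ansible-playbook -i inventory/production site.yml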
tt: Simplest and fastest time tracking
When it comes to tracking time on activities precisely, one has to start coping with time tracking applications. I was not able to find a simple solution: all the options were UI-based, feature-packed, and annoying to use. Then I started searching for command-line options, where the situation is much cleaner, but there were still too many features.
Finally I came up with tt: a simple script of 6 lines of code. In this article I name some other options and inspirations you may use, and present the script itself with installation steps.
tt hello ... some time passes ... tt writing blog ... some time passes ... tt lunch
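The gist of such a script can be sketched in one line of bash: append a timestamp and the activity name to a log file, and read durations off the differences between consecutive lines (the log location is my assumption; the actual script in the article may differ):

    #!/bin/bash
    # tt: log the moment an activity starts; arguments name the activity.
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "${HOME}/.tt.log"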
Composing Avro Schemas from Subtypes
While working with Avro schemas, one can quickly come to the point where schema definitions for multiple entities start to overlap and schema files grow in number of lines. As with object-oriented design of classes in your program, the same principle can be applied to the design of your Avro schema collection.
Unfortunately, the Avro schema definition language does not have a native include mechanism for referencing types defined in other schema files (Avro IDL does, but that means rewriting schemas in IDL). If you do not want to rewrite all the schemas, or simply like the JSON schema definitions more, in this article we introduce a mechanism to:
- design small schema file units, containing Avro named types
- programmatically compose the files into large Avro schemas, one file per type
The article is accompanied by a full usage example and the source code of Avro Compose, an automatic schema composition tool.
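To illustrate the first point, a named type defined in one small file can be referenced by its full name from another file; the composition step then inlines the definition (file names and namespace below are illustrative):

    address.avsc:
    {
      "type": "record",
      "name": "Address",
      "namespace": "com.example",
      "fields": [
        {"name": "street", "type": "string"},
        {"name": "city", "type": "string"}
      ]
    }

    user.avsc:
    {
      "type": "record",
      "name": "User",
      "namespace": "com.example",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "address", "type": "com.example.Address"}
      ]
    }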
Move Flink Savepoint to a different S3 location
The issue with savepoints is how to move a savepoint to a different location and still be able to start a Flink job from the new location. The problem lies in the _metadata file of the savepoint, which contains absolute URIs (see the documentation on moving savepoints).
In this article, we go step by step through moving a Flink savepoint from one S3 bucket to another, and through safely (without corruption) altering the _metadata file in the destination, so that the Flink job starts smoothly from the new savepoint location. The setup is tested with S3 and the filesystem state backend.
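As a rough sketch of the copy step, with bucket names and savepoint id as placeholders (the _metadata rewrite itself is the subject of the article):

    # Copy the whole savepoint directory to the new bucket:
    aws s3 sync s3://old-bucket/savepoints/savepoint-abc123 s3://new-bucket/savepoints/savepoint-abc123
    # After the absolute URIs inside _metadata are fixed,
    # resume the job from the new location:
    flink run -s s3://new-bucket/savepoints/savepoint-abc123 my-job.jar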