When setting up Kafka producers, most developers focus on bootstrap.servers, maybe acks, and call it a day.
But there’s one setting that can cut your storage footprint by an order of magnitude and even improve throughput: compression.type.
TLDR:
In your producer configuration, set this:

```yaml
compression.type: 'zstd'   # Just set this
```

If you are interested in why, continue reading.
The Problem
Kafka stores messages as-is by default. If you’re producing JSON messages (and let’s be honest, most of us are), you’re storing verbose, repetitive text. A typical system monitoring payload might look like this:
```json
{
  "timestamp": "2025-08-08T09:45:32.789Z",
  "hostname": "production-server-cluster-01.example.com",
  "system_info": {
    "cpu": {"model": "AMD EPYC 7742 64-Core Processor", "cores": 64},
    "memory": {"total_mb": 131072, "used_mb": 89456}
  },
  "processes": [...]
}
```
Multiply this by millions of messages per day, and you’re looking at terabytes of storage filled with repeated keys like "cpu_percent", "memory_percent", "timestamp".
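The effect is easy to see without Kafka at all. Here is a minimal sketch using stdlib zlib (a DEFLATE codec in the same family as gzip; zstd exploits the same redundancy, just faster and harder). The payload fields are hypothetical, modeled on the sample above:

```python
import json
import zlib

# Sixty hypothetical monitoring payloads: the keys repeat in every
# message, only the values change.
messages = [
    json.dumps({
        "timestamp": f"2025-08-08T09:45:{i:02d}.789Z",
        "hostname": "production-server-cluster-01.example.com",
        "cpu_percent": i % 100,
        "memory_percent": (i * 7) % 100,
    })
    for i in range(60)
]

batch = "\n".join(messages).encode("utf-8")
compressed = zlib.compress(batch, level=6)

print(f"raw: {len(batch)} B, compressed: {len(compressed)} B, "
      f"ratio {len(batch) / len(compressed):.1f}:1")
```

Even this generic codec collapses the repeated keys to almost nothing once messages are batched together, which is exactly what Kafka's producer-side batch compression does.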
The Solution: Producer-Side Compression
Kafka supports five compression modes:
| Compression | CPU Cost | Compression Ratio | Best For |
|---|---|---|---|
| none | Zero | 1:1 | Already compressed data |
| lz4 | Low | ~4:1 | High throughput, balanced |
| snappy | Low | ~4:1 | High throughput, balanced |
| zstd | Medium | ~12:1 | Storage optimization |
| gzip | High | ~7:1 | Maximum compatibility |
The setting is simple:
```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'compression.type': 'zstd'
})
```
Compression is completely transparent to consumers. Any application built on a standard Kafka consumer library reads the compressed data and decompresses it automatically, so the change can be deployed to existing producer/consumer workflows as-is, without a code change on either side. Neat.
Benchmarking Setup
Theory is nice, but numbers are better. Here’s how to test compression on your actual data. The benchmark repository is available at github.com/michalklempa/kafka-compression-test.
Generate Test Data
First, create realistic test messages. The create_data.py script generates system snapshot JSON files with randomized but realistic content:

```bash
python create_data.py --start 1 --count 100
```
This creates 100 JSON files (system_snapshot_01.json through system_snapshot_100.json), each containing system metrics, process lists, and resource usage data.
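If you just want to reproduce the idea without the repository, a generator along these lines does the job. The exact fields create_data.py emits are an assumption here; this sketch only mirrors the sample payload shown earlier:

```python
import json
import pathlib
import random
import tempfile

def make_snapshot(i: int) -> dict:
    # Shape modeled on the sample payload above; the exact schema of
    # create_data.py is an assumption, not copied from the repo.
    return {
        "timestamp": f"2025-08-08T09:45:{i % 60:02d}.789Z",
        "hostname": f"production-server-cluster-{i:02d}.example.com",
        "system_info": {
            "cpu": {"model": "AMD EPYC 7742 64-Core Processor", "cores": 64},
            "memory": {"total_mb": 131072,
                       "used_mb": random.randint(4096, 131072)},
        },
        "processes": [
            {"pid": random.randint(1, 32768),
             "cpu_percent": round(random.random() * 100, 1)}
            for _ in range(random.randint(5, 20))
        ],
    }

out_dir = pathlib.Path(tempfile.mkdtemp())
for i in range(1, 101):
    path = out_dir / f"system_snapshot_{i:02d}.json"
    path.write_text(json.dumps(make_snapshot(i), indent=2))

print(f"wrote {len(list(out_dir.glob('system_snapshot_*.json')))} files to {out_dir}")
```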
Run the Benchmark
The main.py script produces messages to Kafka with each compression type:

```bash
python main.py --files 100 --messages 200000 --topic run3
```
Arguments:
- `--files`: Number of JSON files to load as message templates
- `--messages`: Total messages to produce per compression type
- `--compression`: Test a specific compression type (or `all`)
- `--topic`: Base name for topics
The script creates separate topics for each compression type (e.g., run3-zstd-n100-m200000), making it easy to compare storage usage.
Check the Results
After running the benchmark, log into the Kafka container to see actual disk usage:

```bash
docker exec -it kafka bash
cd /tmp/kraft-combined-logs
du -sh * | grep run3
```
Output:

```
121M    run3-zstd-n100-m200000-0
1.5G    run3-none-n100-m200000-0
204M    run3-gzip-n100-m200000-0
381M    run3-lz4-n100-m200000-0
381M    run3-snappy-n100-m200000-0
```
Benchmark Results
Message count: 200,000 (files: 100)
| Compression | Time | Log Size |
|---|---|---|
| zstd | 5.24s | 121M |
| none | 4.17s | 1.5G |
| gzip | 51.66s | 204M |
| lz4 | 3.27s | 381M |
| snappy | 4.38s | 381M |
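Turning the table into derived numbers makes the trade-offs concrete. A few lines of plain Python compute the compression ratio relative to `none` and the effective produce rate (1.5G is taken as 1536 MB):

```python
# Time (seconds) and on-disk log size (MB) from the benchmark table above.
results = {
    'zstd':   (5.24, 121),
    'none':   (4.17, 1536),   # 1.5G expressed in MB
    'gzip':   (51.66, 204),
    'lz4':    (3.27, 381),
    'snappy': (4.38, 381),
}
messages = 200_000
baseline = results['none'][1]

for codec, (secs, size_mb) in results.items():
    ratio = baseline / size_mb      # storage reduction vs uncompressed
    rate = messages / secs          # effective produce throughput
    print(f"{codec:6s}  ratio {ratio:5.1f}:1   {rate:9,.0f} msg/s")
```

This is where the "~12:1 for zstd, ~7:1 for gzip, ~4:1 for lz4/snappy" figures in the comparison table come from.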
Analysis
The results speak clearly:
zstd wins on storage - 121MB vs 1.5GB uncompressed. That’s a 12x reduction. For JSON payloads with repetitive structure, zstd’s dictionary-based compression excels.
lz4 wins on speed - 3.27s to produce 200k messages, faster than even uncompressed (4.17s). The reduced network I/O more than compensates for compression CPU cost.
gzip is the worst choice - 51 seconds is painful. Unless you need maximum compatibility with legacy consumers, avoid it.
snappy and lz4 are equivalent - Same compression ratio, similar speed. Pick either for balanced workloads.
Recommendation
For JSON messages:
- Use zstd if storage cost matters and you can tolerate slightly higher CPU usage
- Use lz4 if throughput is critical and storage is cheap
- Never use gzip unless forced by compatibility requirements
- Never use none unless your data is already compressed (images, video, etc.)
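The bullet points above can be encoded as a tiny decision helper. This is a hypothetical sketch for illustration, not part of any Kafka client API:

```python
def pick_compression(storage_matters: bool,
                     throughput_critical: bool,
                     data_precompressed: bool) -> str:
    """Toy encoding of the recommendations above (hypothetical helper)."""
    if data_precompressed:
        return 'none'   # already-compressed payloads gain nothing
    if throughput_critical and not storage_matters:
        return 'lz4'    # fastest option in the benchmark
    return 'zstd'       # best storage ratio; gzip is never chosen

print(pick_compression(storage_matters=True,
                       throughput_critical=False,
                       data_precompressed=False))
```

Note that `gzip` never appears in the decision tree, matching the recommendation to avoid it unless compatibility forces your hand.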
The setting takes 30 seconds to change. The savings compound forever.

```yaml
compression.type: 'zstd'   # Just set this
```
What about binary formats: Avro and Protobuf
“But I’m using Avro/Protobuf, my data is already binary-encoded. Compression won’t help much, right?”
Not necessarily.
Both Avro and Protobuf encode strings as UTF-8 bytes with a length prefix. That’s it. No compression, no transformation.
The string "production-server-cluster-01.example.com" takes exactly 40 bytes in Protobuf, same as in JSON (minus the quotes).
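This is easy to verify with plain Python, no protobuf library needed. (A real Protobuf string field also carries a one-byte field tag and a varint length prefix on the wire, but the payload itself is the raw UTF-8 bytes.)

```python
hostname = "production-server-cluster-01.example.com"
encoded = hostname.encode("utf-8")   # what both JSON and Protobuf actually store
print(len(encoded))  # 40
```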
Look at a typical message payload. What takes up the most space?
- Hostnames: strings
- File paths: strings
- Command lines: strings
- User names: strings
- Timestamps: often strings
- UUIDs: strings
- Error messages: strings
Binary formats eliminate field name overhead (no more "cpu_percent": repeated millions of times), but the actual values - the strings - remain uncompressed UTF-8 bytes.
And strings dominate most real-world payloads.
The math is simple: if 80% of your message is string data, switching from JSON to Protobuf might save you 15-20% (field names).
Enabling zstd compression saves you 90%. Do both if you can, but don’t skip compression thinking binary format solved the problem.