The Most Important Kafka Producer Setting You're Probably Ignoring


When setting up Kafka producers, most developers focus on bootstrap.servers, maybe acks, and call it a day. But there’s one setting that can reduce your storage costs by 10x and improve throughput: compression.type.

TLDR:

In your producer configuration, set this:

compression.type: 'zstd'  # Just add this

If you're interested in why, keep reading.

The Problem

Kafka stores messages as-is by default. If you’re producing JSON messages (and let’s be honest, most of us are), you’re storing verbose, repetitive text. A typical system monitoring payload might look like this:

{
  "timestamp": "2025-08-08T09:45:32.789Z",
  "hostname": "production-server-cluster-01.example.com",
  "system_info": {
    "cpu": {"model": "AMD EPYC 7742 64-Core Processor", "cores": 64},
    "memory": {"total_mb": 131072, "used_mb": 89456}
  },
  "processes": [...]
}

Multiply this by millions of messages per day, and you’re looking at terabytes of storage filled with repeated keys like "cpu_percent", "memory_percent", "timestamp".
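To get a rough feel for how compressible this kind of payload is, here's a minimal sketch using Python's built-in zlib (DEFLATE, the same family as gzip) as a stand-in, since zstd isn't in the standard library. The payload below is a trimmed-down version of the snapshot above; Kafka compresses whole batches, which is where the repeated keys really pay off:

```python
import json
import zlib

# A trimmed-down version of the snapshot payload above
payload = {
    "timestamp": "2025-08-08T09:45:32.789Z",
    "hostname": "production-server-cluster-01.example.com",
    "system_info": {
        "cpu": {"model": "AMD EPYC 7742 64-Core Processor", "cores": 64},
        "memory": {"total_mb": 131072, "used_mb": 89456},
    },
}

# Simulate a batch of similar messages -- compression operates on
# the batch, so repeated keys and values across messages collapse
batch = b"".join(json.dumps(payload).encode() for _ in range(100))
compressed = zlib.compress(batch, 6)

print(f"raw: {len(batch)} bytes, compressed: {len(compressed)} bytes")
```

The exact ratio depends on your data, but for repetitive JSON the compressed batch comes out dramatically smaller than the raw one.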

The Solution: Producer-Side Compression

Kafka supports five compression modes:

Compression   CPU Cost   Compression Ratio   Best For
none          Zero       1:1                 Already compressed data
lz4           Low        ~4:1                High throughput, balanced
snappy        Low        ~4:1                High throughput, balanced
zstd          Medium     ~12:1               Storage optimization
gzip          High       ~7:1                Maximum compatibility

The setting is simple:

from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'compression.type': 'zstd'
})

Compression is completely transparent to consumers. Any code using a standard Kafka consumer library will read the compressed data and decompress it automatically; the application never sees compressed bytes. The change can therefore be rolled out to existing producer/consumer workflows as-is, without a code change on either side. Neat.
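To illustrate the point, here's a sketch of what the consumer side of a zstd-compressed topic might look like. The broker address, group id, and topic name are placeholders; what matters is what's absent: no compression setting anywhere.

```python
# Config for a consumer of a zstd-compressed topic. Note there is
# no compression-related key -- decompression happens inside the
# client library, before messages ever reach application code.
consumer_config = {
    'bootstrap.servers': 'localhost:9092',   # placeholder broker
    'group.id': 'my-consumer-group',         # hypothetical group
    'auto.offset.reset': 'earliest',
}

# With confluent-kafka installed, usage would be the usual:
#   from confluent_kafka import Consumer
#   consumer = Consumer(consumer_config)
#   consumer.subscribe(['my-topic'])
#   msg = consumer.poll(1.0)   # msg.value() is already decompressed
```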

Benchmarking Setup

Theory is nice, but numbers are better. Here's how to test compression on your actual data. The repository with the benchmark is available at github.com/michalklempa/kafka-compression-test.

Generate Test Data

First, create realistic test messages. The create_data.py script generates system snapshot JSON files with randomized, but realistic content:

python create_data.py --start 1 --count 100

This creates 100 JSON files (system_snapshot_01.json through system_snapshot_100.json), each containing system metrics, process lists, and resource usage data.

Run the Benchmark

The main.py script produces messages to Kafka with each compression type:

python main.py --files 100 --messages 200000 --topic run3

Arguments:

  • --files: Number of JSON files to load as message templates
  • --messages: Total messages to produce per compression type
  • --compression: Test specific compression (or all)
  • --topic: Base name for topics

The script creates separate topics for each compression type (e.g., run3-zstd-n100-m200000), making it easy to compare storage usage.
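The per-compression topic naming and producer configs can be sketched roughly like this (the helper names and exact scheme below are illustrative; the real script may differ):

```python
def topic_name(base: str, compression: str, files: int, messages: int) -> str:
    """Build one topic name per compression type, mirroring names
    like run3-zstd-n100-m200000 from the benchmark output."""
    return f"{base}-{compression}-n{files}-m{messages}"

def producer_config(compression: str) -> dict:
    # Placeholder broker address; 'none' is a valid compression.type too
    return {
        'bootstrap.servers': 'localhost:9092',
        'compression.type': compression,
    }

for c in ['none', 'lz4', 'snappy', 'zstd', 'gzip']:
    print(topic_name('run3', c, 100, 200000))
```

Producing the same message set into a differently named topic per codec makes the on-disk comparison a single du command.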

Check the Results

After running the benchmark, log into the Kafka container to see actual disk usage:

docker exec -it kafka bash
cd /tmp/kraft-combined-logs
du -sh * | grep run3

Output:

121M    run3-zstd-n100-m200000-0
1.5G    run3-none-n100-m200000-0
204M    run3-gzip-n100-m200000-0
381M    run3-lz4-n100-m200000-0
381M    run3-snappy-n100-m200000-0

Benchmark Results

Message count: 200,000 (files: 100)

Compression   Time     Log Size
zstd          5.24s    121M
none          4.17s    1.5G
gzip          51.66s   204M
lz4           3.27s    381M
snappy        4.38s    381M

Analysis

The results speak clearly:

zstd wins on storage - 121MB vs 1.5GB uncompressed. That's a 12x reduction. For JSON payloads with repetitive structure, zstd's long-range matching and entropy coding excel.

lz4 wins on speed - 3.27s to produce 200k messages, faster than even uncompressed (4.17s). The reduced network I/O more than compensates for compression CPU cost.

gzip is the worst choice - 51 seconds is painful. Unless you need maximum compatibility with legacy consumers, avoid it.

snappy and lz4 are equivalent - Same compression ratio, similar speed. Pick either for balanced workloads.

Recommendation

For JSON messages:

  1. Use zstd if storage cost matters and you can tolerate slightly higher CPU usage
  2. Use lz4 if throughput is critical and storage is cheap
  3. Never use gzip unless forced by compatibility requirements
  4. Never use none unless your data is already compressed (images, video, etc.)

The setting takes 30 seconds to change. The savings compound forever.

compression.type: 'zstd'  # Just add this

What about binary formats: Avro and Protobuf

“But I’m using Avro/Protobuf, my data is already binary-encoded. Compression won’t help much, right?”

Not necessarily.

Both Avro and Protobuf encode strings as UTF-8 bytes with a length prefix. That’s it. No compression, no transformation. The string "production-server-cluster-01.example.com" takes exactly 40 bytes in Protobuf, same as in JSON (minus the quotes).
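You can verify the byte counts directly. A Protobuf length-delimited string field adds only a one-byte tag and a varint length on top of the raw UTF-8 bytes:

```python
hostname = "production-server-cluster-01.example.com"
encoded = hostname.encode("utf-8")
print(len(encoded))  # 40 -- same payload bytes as in JSON

# A Protobuf string field (field number 1, wire type 2) on the wire:
# one tag byte + a varint length + the raw UTF-8 bytes
tag = bytes([(1 << 3) | 2])     # field 1, wire type 2 (length-delimited)
length = bytes([len(encoded)])  # varint length; 40 fits in one byte
field = tag + length + encoded
print(len(field))  # 42 -- the string itself is untouched
```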

Look at a typical message payload. What takes up the most space?

  • Hostnames: strings
  • File paths: strings
  • Command lines: strings
  • User names: strings
  • Timestamps: often strings
  • UUIDs: strings
  • Error messages: strings

Binary formats eliminate field name overhead (no more "cpu_percent": repeated millions of times), but the actual values - the strings - remain uncompressed UTF-8 bytes. And strings dominate most real-world payloads.

The math is simple: if 80% of your message is string data, switching from JSON to Protobuf might save you 15-20% (field names). Enabling zstd compression saves you 90%. Do both if you can, but don’t skip compression thinking binary format solved the problem.
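A crude way to see that split for yourself, again using zlib as a stand-in for zstd (the message below is made up for illustration): stripping field names mimics roughly what a binary encoding saves, while compressing the batch captures what compression saves.

```python
import json
import zlib

# Hypothetical string-heavy message, typical of logs and metrics
message = {
    "hostname": "production-server-cluster-01.example.com",
    "error_message": "connection refused by upstream load balancer",
    "timestamp": "2025-08-08T09:45:32.789Z",
}

# A batch of similar messages, as Kafka would compress them
raw = json.dumps(message).encode() * 100

# Crude stand-in for a binary encoding: drop the field names,
# keep only the string values
no_keys = json.dumps(list(message.values())).encode() * 100

compressed = zlib.compress(raw)

print(f"raw: {len(raw)}, field names dropped: {len(no_keys)}, "
      f"compressed: {len(compressed)}")
```

Dropping field names shaves off a modest fraction; compressing the batch shrinks it by an order of magnitude, because the string values themselves repeat across messages.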