STRM Course

Real-time data streaming

Lessons: 9 modules
Total: 90m full study
Quick: 7m trailer
Projects: 9 docker labs
CHEATSHEET 01 · Streaming · master cheatsheet
Mental model
  • Topic = ordered, partitioned, durable log
  • Partition = the unit of parallelism per consumer group
  • Offset = your bookmark; commits are how you mark 'done'
  • Consumer group = a parallel scan of the topic; each partition is read by at most one consumer in the group
  • Tombstone (key, null) = delete record for compacted topics
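The "same key → same partition" invariant can be sketched in a few lines. Note the real Kafka clients hash keys with murmur2; this sketch substitutes a generic stable hash purely to illustrate the invariant, and `partition_for` is an illustrative name, not a client API.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in for Kafka's default partitioner (which uses murmur2):
    any stable hash demonstrates that the same key always lands on the
    same partition, preserving per-key ordering."""
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return digest % num_partitions

# Same entity key -> same partition -> per-entity order is preserved.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2
```

This is also why repartitioning a topic breaks key-to-partition mapping: the modulus changes, so existing keys may land elsewhere.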
Producer commitments
  • enable.idempotence=true (default since Kafka 3.0) — dedup retries
  • acks=all — wait for full ISR replication before ACK
  • Pair acks=all with min.insync.replicas≥2 in prod
  • Use keys for ordering invariants — same entity → same partition
  • Transactions for cross-topic atomic writes (Lesson 6)
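The producer checklist above maps onto standard Kafka client configuration keys. A minimal sketch in the librdkafka/confluent-kafka dict style (the broker address is a hypothetical placeholder; `min.insync.replicas` is a topic/broker setting, not a producer one, hence the comment):

```python
# Producer settings matching the checklist; key names are standard Kafka configs.
producer_conf = {
    "bootstrap.servers": "broker:9092",  # hypothetical address
    "enable.idempotence": True,          # broker dedups retried batches
    "acks": "all",                       # ACK only after full ISR replication
    # Pair with min.insync.replicas=2 on the topic/broker so that
    # acks=all actually means "at least two replicas have the write".
}
```

With `min.insync.replicas=2` and replication factor 3, writes survive one broker loss without acknowledging data that only one replica holds.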
Consumer commitments
  • enable.auto.commit=false — commit AFTER processing
  • isolation.level=read_committed when transactions are in play
  • Tune max.poll.interval.ms to 2–3× your worst-case process() time
  • Process idempotently — replay is normal, not exceptional
  • kill -9 should never lose data; design for it
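"Process idempotently" is the property that makes commit-after-processing safe: if the process dies before the commit, the same records are redelivered, and idempotent processing makes that harmless. A minimal replay-safety sketch (all names illustrative; in production the dedup set would live in a bounded state store, not an in-memory set):

```python
# Idempotent processing keyed on a stable record id: redelivery after an
# uncommitted crash (or kill -9) cannot double-apply effects.
state = {"total": 0}
applied = set()  # production: a TTL-bounded store, not an unbounded set

def process(record_id: str, amount: int) -> None:
    if record_id in applied:  # already applied on a previous delivery
        return
    state["total"] += amount
    applied.add(record_id)

batch = [("evt-1", 10), ("evt-2", 5)]
for rid, amt in batch:
    process(rid, amt)
for rid, amt in batch:  # simulate redelivery after a crash before commit
    process(rid, amt)
assert state["total"] == 15  # replay did not double-count
```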
Time semantics
  • Event time = when the event happened in the world
  • Ingest time = when the broker received it
  • Processing time = when your operator saw it (lies on rebalance / replay)
  • Watermarks = 'I will not see events earlier than this' — the operator's promise
  • Late events = events past the watermark; route to a side-output, never drop silently
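The watermark-and-side-output rule above can be sketched with a bounded-out-of-orderness watermark (watermark = max event time seen minus an allowed lateness). This is a toy illustration of the idea, not a Flink API; names and the 1-second lateness are assumptions.

```python
def route(events, allowed_lateness_ms=1000):
    """Events older than the current watermark go to a side output
    instead of being silently dropped. Each event is (event_time_ms, payload)."""
    on_time, late = [], []
    watermark = float("-inf")
    for ts, payload in events:
        watermark = max(watermark, ts - allowed_lateness_ms)  # never regresses
        (late if ts < watermark else on_time).append(payload)
    return on_time, late

on_time, late = route([(1000, "a"), (3000, "b"), (1500, "c")])
# "b" advanced the watermark to 2000, so "c" (event time 1500) arrives late.
assert on_time == ["a", "b"] and late == ["c"]
```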
State semantics
  • Stateless first; stateful only when you must aggregate or join
  • RocksDB-backed state stores (Flink, Kafka Streams) survive restarts
  • Checkpoints persist state — incremental, async, exactly-once-on-restore
  • Bound your state — TTL, windows, or LRU eviction
  • State size is the new memory pressure; budget it like RAM
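"Bound your state" combines the two mechanisms listed above: a TTL so stale keys expire, and a cap so the store cannot grow without limit. A minimal sketch (illustrative only, not a real Flink/Kafka Streams store API; the TTL and capacity values are arbitrary):

```python
from collections import OrderedDict

class TtlStore:
    """Keyed state bounded two ways: entries expire after `ttl` seconds,
    and the map is capped with eviction of the oldest entry."""
    def __init__(self, ttl: float, max_entries: int):
        self.ttl, self.max_entries = ttl, max_entries
        self.data = OrderedDict()          # key -> (value, written_at)

    def put(self, key, value, now: float):
        self.data[key] = (value, now)
        self.data.move_to_end(key)
        while len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least recently written

    def get(self, key, now: float):
        item = self.data.get(key)
        if item is None or now - item[1] > self.ttl:
            self.data.pop(key, None)       # lazily expire on read
            return None
        return item[0]

store = TtlStore(ttl=60.0, max_entries=2)
store.put("a", 1, now=0.0)
store.put("b", 2, now=0.0)
store.put("c", 3, now=0.0)                # capacity 2: "a" is evicted
assert store.get("a", now=0.0) is None    # evicted by capacity
assert store.get("b", now=100.0) is None  # expired past TTL
assert store.get("c", now=30.0) == 3
```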
Production guardrails
  • Quotas at the broker — protect tenants from each other
  • Consumer-lag SLO + page on lag > P99 baseline × 3
  • Schema Registry compatibility CI gate — block backward-incompatible PRs
  • DLQ for poison messages — never drop, always quarantine
  • Backfill is a deploy event — practice it before you need it
  • Tag every producer/consumer with service + version for tracing
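The lag-SLO guardrail above is just a threshold rule; sketching it makes the multiplier explicit (the function name and example lag figures are illustrative, not from any alerting tool):

```python
def should_page(current_lag: int, p99_baseline: int, factor: int = 3) -> bool:
    """Page when consumer lag exceeds `factor` times the P99 baseline
    observed for this consumer group under normal operation."""
    return current_lag > p99_baseline * factor

assert should_page(current_lag=40_000, p99_baseline=10_000) is True
assert should_page(current_lag=25_000, p99_baseline=10_000) is False
```

Anchoring the threshold to a measured baseline, rather than a fixed lag number, keeps the alert meaningful as topic throughput grows.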
CHEATSHEET 02 · Platform & framework picks · 2026
Default broker for new builds
  • Apache Kafka 4.x — standard, JVM, KRaft only (ZooKeeper removed in 4.0). Use when ecosystem breadth matters.
  • Redpanda — single-binary C++, Kafka API compatible, lower ops cost. Use for small teams without JVM expertise.
  • Apache Pulsar 3.x/4.x — tiered storage native, multi-tenancy first-class. Use when SaaS-grade tenant isolation dominates.
Storage-decoupled brokers
  • Confluent WarpStream / Kora — stateless brokers backed by S3; vendor-claimed ~10× cheaper at high throughput.
  • AutoMQ — open-source stateless Kafka on S3.
  • Use when egress + replication costs dominate your bill — typical for global, write-heavy workloads.
Stream processing
  • Apache Flink 2.x — SQL + DataStream, disaggregated state (ForSt), materialized tables (FLIP-435), exactly-once. Cross-team production default.
  • Kafka Streams — JVM library, in-app, no separate cluster. Use for service-local processing only.
  • RisingWave / Materialize — streaming SQL databases. Use when you want a Postgres-shaped face on a stream.
  • ksqlDB — being deprecated in favour of Flink SQL on Confluent Cloud. Migrate.
Schema Registry
  • Confluent Schema Registry — the de facto standard. Core is free under the Community License; advanced features are paid.
  • Karapace — Aiven's open-source SR (Apache 2.0). Drop-in replacement, lighter footprint.
  • Apicurio Registry 3.x — Red Hat's. Avro/Protobuf/JSON Schema/AsyncAPI, integrates with Kafka.
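The CI compatibility gate from the guardrails section checks rules like Avro's BACKWARD mode: a new (reader) schema can read old data if every field it adds carries a default. A deliberately simplified sketch, with schemas reduced to `{name: {"default": ...?}}` dicts rather than real Avro:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Simplified BACKWARD check: any field present in the new schema but
    not the old must declare a default, or old records become unreadable.
    (Real registries also check type changes, aliases, unions, etc.)"""
    added = set(new_fields) - set(old_fields)
    return all("default" in new_fields[f] for f in added)

old = {"id": {}, "amount": {}}
ok  = {"id": {}, "amount": {}, "currency": {"default": "USD"}}
bad = {"id": {}, "amount": {}, "currency": {}}  # new required field
assert backward_compatible(old, ok) is True
assert backward_compatible(old, bad) is False
```

In practice you would call the registry's compatibility endpoint from CI rather than reimplement the rules; the sketch just shows why "new required field" is the classic PR to block.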
CDC + Lakehouse
  • Debezium 2.x — Postgres / MySQL / MongoDB / SQL Server tail. Engine + Server modes.
  • Confluent Tableflow (GA 2025) — exposes Kafka topics as native Iceberg tables.
  • Apache Iceberg + Kafka Connect S3 sink — DIY lakehouse stream landing.
  • Estuary Flow / Decodable — managed CDC + transforms when you don't want to run Debezium yourself.
Observability
  • OpenTelemetry — instrument producers/consumers/Flink jobs with kafka.* spans.
  • Prometheus + Grafana — RED metrics + consumer lag from JMX/Kafka exporter.
  • Confluent Health+ / Lenses / Conduktor — operator-grade dashboards.
  • Streamdal — runtime data observability (in-flight payload validation).
Avoid / migrate
  • ZooKeeper-based Kafka (3.x and below) — KRaft is the only supported mode in 4.x.
  • ksqlDB — Confluent's own roadmap points to Flink SQL.
  • Custom JSON-on-the-bus without a Schema Registry — every regret in streaming starts here.
  • Spark Structured Streaming for sub-second latency — Flink wins clearly here.