STRM Course

Real-time data streaming

Lessons: 9 modules
Total: 90m full study
Quick: 7m trailer
Projects: 9 docker labs
CHEATSHEET 01 · Streaming · master cheatsheet
Mental model
  • Topic = ordered, partitioned, durable log
  • Partition = the unit of parallelism per consumer group
  • Offset = your bookmark; commits are how you mark 'done'
  • Consumer group = a parallel scan of the topic; each partition is read by at most one consumer in the group
  • Tombstone (key, null) = delete record for compacted topics
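The "same key → same partition" invariant can be sketched in a few lines. Note the real Kafka clients hash keys with murmur2; this sketch substitutes a generic stable hash purely to illustrate the invariant, and `partition_for` is an illustrative name, not a client API.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Stand-in for Kafka's default partitioner (which uses murmur2):
    any stable hash demonstrates that the same key always lands on the
    same partition, preserving per-key ordering."""
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return digest % num_partitions

# Same entity key -> same partition -> per-entity order is preserved.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
assert p1 == p2
```

This is also why repartitioning a topic breaks key-to-partition mapping: the modulus changes, so existing keys may land elsewhere.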
Producer commitments
  • enable.idempotence=true (default since Kafka 3.0) — dedup retries
  • acks=all — wait for full ISR replication before ACK
  • Pair acks=all with min.insync.replicas≥2 in prod
  • Use keys for ordering invariants — same entity → same partition
  • Transactions for cross-topic atomic writes (Lesson 6)
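The producer checklist above maps onto standard Kafka client configuration keys. A minimal sketch in the librdkafka/confluent-kafka dict style (the broker address is a hypothetical placeholder; `min.insync.replicas` is a topic/broker setting, not a producer one, hence the comment):

```python
# Producer settings matching the checklist; key names are standard Kafka configs.
producer_conf = {
    "bootstrap.servers": "broker:9092",  # hypothetical address
    "enable.idempotence": True,          # broker dedups retried batches
    "acks": "all",                       # ACK only after full ISR replication
    # Pair with min.insync.replicas=2 on the topic/broker so that
    # acks=all actually means "at least two replicas have the write".
}
```

With `min.insync.replicas=2` and replication factor 3, writes survive one broker loss without acknowledging data that only one replica holds.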
Consumer commitments
  • enable.auto.commit=false — commit AFTER processing
  • isolation.level=read_committed when transactions are in play
  • Tune max.poll.interval.ms to 2–3× your worst-case process() time
  • Process idempotently — replay is normal, not exceptional
  • kill -9 should never lose data; design for it
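"Process idempotently" is the property that makes commit-after-processing safe: if the process dies before the commit, the same records are redelivered, and idempotent processing makes that harmless. A minimal replay-safety sketch (all names illustrative; in production the dedup set would live in a bounded state store, not an in-memory set):

```python
# Idempotent processing keyed on a stable record id: redelivery after an
# uncommitted crash (or kill -9) cannot double-apply effects.
state = {"total": 0}
applied = set()  # production: a TTL-bounded store, not an unbounded set

def process(record_id: str, amount: int) -> None:
    if record_id in applied:  # already applied on a previous delivery
        return
    state["total"] += amount
    applied.add(record_id)

batch = [("evt-1", 10), ("evt-2", 5)]
for rid, amt in batch:
    process(rid, amt)
for rid, amt in batch:  # simulate redelivery after a crash before commit
    process(rid, amt)
assert state["total"] == 15  # replay did not double-count
```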
Time semantics
  • Event time = when the event happened in the world
  • Ingest time = when the broker received it
  • Processing time = when your operator saw it (lies on rebalance / replay)
  • Watermarks = 'I will not see events earlier than this' — the operator's promise
  • Late events = events past the watermark; route to a side-output, never drop silently
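The watermark-and-side-output rule above can be sketched with a bounded-out-of-orderness watermark (watermark = max event time seen minus an allowed lateness). This is a toy illustration of the idea, not a Flink API; names and the 1-second lateness are assumptions.

```python
def route(events, allowed_lateness_ms=1000):
    """Events older than the current watermark go to a side output
    instead of being silently dropped. Each event is (event_time_ms, payload)."""
    on_time, late = [], []
    watermark = float("-inf")
    for ts, payload in events:
        watermark = max(watermark, ts - allowed_lateness_ms)  # never regresses
        (late if ts < watermark else on_time).append(payload)
    return on_time, late

on_time, late = route([(1000, "a"), (3000, "b"), (1500, "c")])
# "b" advanced the watermark to 2000, so "c" (event time 1500) arrives late.
assert on_time == ["a", "b"] and late == ["c"]
```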
State semantics
  • Stateless first; stateful only when you must aggregate or join
  • RocksDB-backed state stores (Flink, Kafka Streams) survive restarts
  • Checkpoints persist state — incremental, async, exactly-once-on-restore
  • Bound your state — TTL, windows, or LRU eviction
  • State size is the new memory pressure; budget it like RAM
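"Bound your state" combines the two mechanisms listed above: a TTL so stale keys expire, and a cap so the store cannot grow without limit. A minimal sketch (illustrative only, not a real Flink/Kafka Streams store API; the TTL and capacity values are arbitrary):

```python
from collections import OrderedDict

class TtlStore:
    """Keyed state bounded two ways: entries expire after `ttl` seconds,
    and the map is capped with eviction of the oldest entry."""
    def __init__(self, ttl: float, max_entries: int):
        self.ttl, self.max_entries = ttl, max_entries
        self.data = OrderedDict()          # key -> (value, written_at)

    def put(self, key, value, now: float):
        self.data[key] = (value, now)
        self.data.move_to_end(key)
        while len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least recently written

    def get(self, key, now: float):
        item = self.data.get(key)
        if item is None or now - item[1] > self.ttl:
            self.data.pop(key, None)       # lazily expire on read
            return None
        return item[0]

store = TtlStore(ttl=60.0, max_entries=2)
store.put("a", 1, now=0.0)
store.put("b", 2, now=0.0)
store.put("c", 3, now=0.0)                # capacity 2: "a" is evicted
assert store.get("a", now=0.0) is None    # evicted by capacity
assert store.get("b", now=100.0) is None  # expired past TTL
assert store.get("c", now=30.0) == 3
```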
Production guardrails
  • Quotas at the broker — protect tenants from each other
  • Consumer-lag SLO + page on lag > P99 baseline × 3
  • Schema Registry compatibility CI gate — block backward-incompatible PRs
  • DLQ for poison messages — never drop, always quarantine
  • Backfill is a deploy event — practice it before you need it
  • Tag every producer/consumer with service + version for tracing
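The lag-SLO guardrail above is just a threshold rule; sketching it makes the multiplier explicit (the function name and example lag figures are illustrative, not from any alerting tool):

```python
def should_page(current_lag: int, p99_baseline: int, factor: int = 3) -> bool:
    """Page when consumer lag exceeds `factor` times the P99 baseline
    observed for this consumer group under normal operation."""
    return current_lag > p99_baseline * factor

assert should_page(current_lag=40_000, p99_baseline=10_000) is True
assert should_page(current_lag=25_000, p99_baseline=10_000) is False
```

Anchoring the threshold to a measured baseline, rather than a fixed lag number, keeps the alert meaningful as topic throughput grows.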
CHEATSHEET 02 · Platform & framework picks · 2026
Default broker for new builds
  • Apache Kafka 4.x — standard, JVM, KRaft only (ZooKeeper removed in 4.0). Use when ecosystem breadth matters.
  • Redpanda — single-binary C++, Kafka API compatible, lower ops cost. Use for small teams without JVM expertise.
  • Apache Pulsar 3.x/4.x — tiered storage native, multi-tenancy first-class. Use when SaaS-grade tenant isolation dominates.
Storage-decoupled brokers
  • Confluent WarpStream / Kora — stateless brokers backed by S3; vendor-claimed ~10× cheaper at high throughput.
  • AutoMQ — open-source stateless Kafka on S3.
  • Use when egress + replication costs dominate your bill — typical for global, write-heavy workloads.
Stream processing
  • Apache Flink 2.x — SQL + DataStream, disaggregated state (ForSt), materialized tables (FLIP-435), exactly-once. Cross-team production default.
  • Kafka Streams — JVM library, in-app, no separate cluster. Use for service-local processing only.
  • RisingWave / Materialize — streaming SQL databases. Use when you want a Postgres-shaped face on a stream.
  • ksqlDB — being deprecated in favour of Flink SQL on Confluent Cloud. Migrate.
Schema Registry
  • Confluent Schema Registry — the de facto standard. Core is free under the Community License; advanced features are paid.
  • Karapace — Aiven's open-source SR (Apache 2.0). Drop-in replacement, lighter footprint.
  • Apicurio Registry 3.x — Red Hat's. Avro/Protobuf/JSON Schema/AsyncAPI, integrates with Kafka.
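The CI compatibility gate from the guardrails section checks rules like Avro's BACKWARD mode: a new (reader) schema can read old data if every field it adds carries a default. A deliberately simplified sketch, with schemas reduced to `{name: {"default": ...?}}` dicts rather than real Avro:

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Simplified BACKWARD check: any field present in the new schema but
    not the old must declare a default, or old records become unreadable.
    (Real registries also check type changes, aliases, unions, etc.)"""
    added = set(new_fields) - set(old_fields)
    return all("default" in new_fields[f] for f in added)

old = {"id": {}, "amount": {}}
ok  = {"id": {}, "amount": {}, "currency": {"default": "USD"}}
bad = {"id": {}, "amount": {}, "currency": {}}  # new required field
assert backward_compatible(old, ok) is True
assert backward_compatible(old, bad) is False
```

In practice you would call the registry's compatibility endpoint from CI rather than reimplement the rules; the sketch just shows why "new required field" is the classic PR to block.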
CDC + Lakehouse
  • Debezium 2.x — Postgres / MySQL / MongoDB / SQL Server tail. Engine + Server modes.
  • Confluent Tableflow (GA 2025) — exposes Kafka topics as native Iceberg tables.
  • Apache Iceberg + Kafka Connect S3 sink — DIY lakehouse stream landing.
  • Estuary Flow / Decodable — managed CDC + transforms when you don't want to run Debezium yourself.
Observability
  • OpenTelemetry — instrument producers/consumers/Flink jobs with kafka.* spans.
  • Prometheus + Grafana — RED metrics + consumer lag from JMX/Kafka exporter.
  • Confluent Health+ / Lenses / Conduktor — operator-grade dashboards.
  • Streamdal — runtime data observability (in-flight payload validation).
Avoid / migrate
  • ZooKeeper-based Kafka (3.x and below) — KRaft is the only supported mode in 4.x.
  • ksqlDB — Confluent's own roadmap points to Flink SQL.
  • Custom JSON-on-the-bus without a Schema Registry — every regret in streaming starts here.
  • Spark Structured Streaming for sub-second latency — Flink wins clearly here.