Quick Intro · 7 MIN · STRM

Real-time data streaming


A scannable trailer of the 9-lesson course. Read top to bottom — no clicks needed.


Stream-first design that holds under load.

89% of 4,175 IT leaders in Confluent's 2025 Data Streaming Report rate streaming as critical — but most teams are still bolting it onto a batch architecture and wondering why it hurts. This trailer shows the difference between a streaming product and a batch product with Kafka stapled on.


The one-line difference

A streaming platform is a durable, ordered, partitioned log that producers append to and consumers read from at their own pace. It is not a queue, not a pub-sub, not a database — it is a log. Once you internalise the log, every partition / consumer-group / offset / watermark question answers itself. If the only thing your 'streaming' system does is move events into Postgres every 10 minutes via a cron job, you don't have a streaming system. You have batch with extra steps.
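The log abstraction in that paragraph can be modeled in a few lines of dependency-free Python. This is an illustrative sketch, not Kafka's implementation; the class and consumer names are made up:

```python
class Log:
    """A minimal append-only log: ordered, offset-addressed, multi-reader."""

    def __init__(self):
        self.entries = []  # the durable, ordered record of events

    def append(self, event):
        self.entries.append(event)
        return len(self.entries) - 1  # offset of the new entry

    def read(self, offset):
        return self.entries[offset]


log = Log()
for e in ["order-created", "order-paid", "order-shipped"]:
    log.append(e)

# Two independent consumers track their own offsets; neither affects the other.
offsets = {"invoice": 0, "fraud": 0}
invoice_batch = []
while offsets["invoice"] < len(log.entries):
    invoice_batch.append(log.read(offsets["invoice"]))
    offsets["invoice"] += 1  # the fraud consumer is still at offset 0

assert invoice_batch == ["order-created", "order-paid", "order-shipped"]
assert offsets["fraud"] == 0  # unread, unaffected
```

The point of the toy: reads never mutate the log, so any number of consumers can replay from any offset at any pace. Every real feature (partitions, groups, watermarks) is a refinement of this picture.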
TIP: Pick the platform LAST. Pick the partitioning key, time semantics, and consumer guarantees FIRST — those are 95% of the design.
WATCH OUT: Confluent's 2024 incident review reported that 9 of the top 10 production outages traced back to consumer-group rebalance storms triggered by a single misconfigured client. Streaming is sharp.
GOTCHA: enable.auto.commit=true is the most expensive default footgun in distributed systems. We turn it off in Lesson 3 and never speak of it again.
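The footgun is about when the offset is recorded relative to processing. A toy simulation, where the hypothetical `crash_at` index stands in for a consumer dying mid-stream (real auto-commit is periodic and runs in the background, which only widens the loss window):

```python
def run_consumer(messages, commit_before_processing, crash_at):
    """Simulate a consumer that crashes at index `crash_at`.
    Returns (processed, committed_offset): what a restart would resume from."""
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_before_processing:
            committed = i + 1  # roughly what auto-commit does
        if i == crash_at:
            return processed, committed  # crash before handling this message
        processed.append(msg)
        if not commit_before_processing:
            committed = i + 1  # manual commit AFTER processing
    return processed, committed


msgs = ["m0", "m1", "m2"]

# Auto-commit style: offset 2 is committed but m1 was never processed.
# A restart resumes at m2, so m1 is silently lost.
done, offset = run_consumer(msgs, commit_before_processing=True, crash_at=1)
assert done == ["m0"] and offset == 2

# Commit after processing: the restart replays m1. At-least-once, nothing lost.
done, offset = run_consumer(msgs, commit_before_processing=False, crash_at=1)
assert done == ["m0"] and offset == 1
```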

Producer → topic partitions → consumer groups

PRODUCER → [ PART 0 | PART 1 | PART 2 ] → consumer groups: INVOICE · FRAUD · BI/ICEBERG
ONE topic. Many consumer groups. Each group reads at its own pace, with its own offsets, without affecting any other consumer.
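The "same entity, same partition" routing in the diagram comes from keyed partitioning. A plain-Python sketch — Kafka's default partitioner uses murmur2, so the sha256 here is only an illustrative stand-in; the invariant (stable hash of the key, mod partition count) is the same:

```python
import hashlib

NUM_PARTITIONS = 3


def partition_for(key: str) -> int:
    # Stand-in for Kafka's keyed partitioner: any stable hash works for the
    # illustration. Same key -> same hash -> same partition, every time.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


# Every event for order-7 lands on one partition, so per-order processing
# sees its events in the order they were produced.
p7 = partition_for("order-7")
assert all(partition_for("order-7") == p7 for _ in range(100))
assert 0 <= partition_for("order-8") < NUM_PARTITIONS
```

This is also why rule 02 below says the key comes from business invariants: ordering only holds within a partition, so the key must name the entity whose ordering you care about.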

A correct producer + consumer in one screen

PYTHON
from confluent_kafka import Producer, Consumer
import json

p = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,  # broker dedups producer retries
    "acks": "all",               # wait for full ISR replication
    "linger.ms": 5,              # batch up to 5 ms
})
p.produce("orders", key="order-7", value=json.dumps({"id": 7, "amt": 99.0}))
p.flush()

c = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "invoice-processor",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,         # commit AFTER processing
    "isolation.level": "read_committed",
})
c.subscribe(["orders"])
try:
    while True:
        msg = c.poll(1.0)
        if msg is None:
            continue                     # poll timeout, keep waiting
        if msg.error():
            raise RuntimeError(msg.error())
        process(json.loads(msg.value())) # your handler
        c.commit(message=msg)            # replay-safe: only after success
finally:
    c.close()
enable.idempotence + acks=all is the modern producer default. enable.auto.commit is off, and read_committed pairs with transactional producers (Lesson 6). The commit runs only after process() succeeds, so a crash replays the message instead of losing it.
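Commit-after-processing gives at-least-once delivery, which means replays will happen; the other half of the discipline is a sink that absorbs them. A minimal sketch, where a set stands in for what is, in production, a unique constraint or an upsert:

```python
class IdempotentSink:
    """Dedup by a stable message key so replays have no double effect."""

    def __init__(self):
        self.seen = set()
        self.rows = []

    def write(self, msg_id, payload):
        if msg_id in self.seen:
            return False  # replayed message: skip, nothing double-counted
        self.seen.add(msg_id)
        self.rows.append(payload)
        return True


sink = IdempotentSink()
events = [("order-7", 99.0), ("order-8", 12.5), ("order-7", 99.0)]  # last is a replay
for msg_id, amt in events:
    sink.write(msg_id, amt)

assert sink.rows == [99.0, 12.5]  # the replay was absorbed
```

At-least-once delivery plus an idempotent sink is the cheapest route to effectively-once results; the transactional machinery in Lesson 6 is for when the sink can't be made idempotent.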

The 6 rules every 2026 streaming shipper knows

01 · The log is the source of truth. Caches and DBs are derivations.
02 · Pick the partitioning key from your business invariants — same entity = same partition = ordered.
03 · Idempotent producer + acks=all + manual commit. Always.
04 · Event time, never processing time. Wall clocks are not honest.
05 · Schema Registry is non-negotiable. Untyped JSON on a bus is technical debt with a return address.
06 · Backfill = replay. Make consumers idempotent or you'll discover this the hard way.
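Rule 04 hinges on watermarks. Here is a plain-Python model of a bounded out-of-orderness watermark in the style Flink uses; the timestamps and the lateness bound are illustrative numbers, not defaults:

```python
def watermark(max_event_time_seen, max_out_of_orderness):
    """Bounded out-of-orderness: 'no event older than this is still expected'."""
    return max_event_time_seen - max_out_of_orderness


# (event_time, payload): note 97 and 101 arrive after newer events
events = [(100, "a"), (105, "b"), (97, "c"), (112, "d"), (101, "e")]
MAX_LATENESS = 5

max_seen = 0
on_time, late = [], []
for event_time, payload in events:
    max_seen = max(max_seen, event_time)
    if event_time < watermark(max_seen, MAX_LATENESS):
        late.append(payload)   # route to a side output, don't silently drop
    else:
        on_time.append(payload)

assert on_time == ["a", "b", "d"]
assert late == ["c", "e"]
```

The watermark advances only with the maximum event time seen, never with the wall clock, which is exactly why processing time is "not honest": it tells you when you looked, not when things happened.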

Quick check — true or false?

Kafka, Redpanda, and Pulsar all speak the same producer/consumer wire protocol.

What you'll ship in the full study

Nine lessons. Eight docker projects. By the end you will have:
  • A Redpanda single-broker compose stack with a correct Python producer/consumer (lift-to-work for any new service).
  • A Postgres → Debezium → Kafka → console CDC tail you can point at your real OLTP database.
  • A Flink SQL tumbling/hopping/session window job over event-time with watermarks.
  • An exactly-once pipeline with transactional producer + idempotent sink + read_committed.
  • A backfill replayer that re-emits from offset 0 into a parallel consumer group without reprocessing the live workload.
  • A Karapace Schema Registry + CI compatibility gate that breaks the build on a backward-incompatible schema change.
  • A Tableflow / Iceberg sink that turns a Kafka topic into a live lakehouse table.
  • A full observability stack (OTel + Prometheus + Grafana) over a producer/consumer/Flink job, including consumer-lag alerts.
Every docker project is meant to be lifted into your real work — not a demo.
INCLUDED: Each project ships with composeYaml, expectedOutcome, and a 'lift to work' note explaining how to drop it into your team's repo.

That's the trailer.

NEXT: Lesson 1 · Kafka vs Redpanda vs Pulsar
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Pick Kafka / Redpanda / Pulsar by trade-off · Working

    Place all three on the ops-cost vs ecosystem vs multi-tenancy axes; defend the choice in a design review without resorting to vendor decks.

  • Design stream-first systems · Production

    Identify when the log should be the source of truth (vs polling/batch), pick a partitioning key from business invariants, and avoid the 'batch with Kafka stapled on' anti-pattern.

  • Build durable producers and consumers · Production

    Idempotent producer + acks=all + manual commit + read_committed — the four-line discipline that turns a demo into a service.

  • Reason about event time and watermarks · Production

    Distinguish event/ingest/processing time; configure watermark strategy with bounded out-of-orderness; route late events to side-outputs instead of dropping them.

  • Implement stateful Flink jobs · Production

    Write tumbling/hopping/session window aggregates in Flink SQL with RocksDB state, checkpointing, and graceful rescaling — the daily bread of cross-team stream processing.

  • Ship exactly-once pipelines · Advanced

    Wire a transactional producer + read_committed consumer + idempotent sink, understand the two-phase commit cost, and explain why exactly-once is per-pipeline (not per-system).

  • Stream Postgres CDC into a lakehouse · Production

    Run Debezium 2.x against Postgres, land into Kafka topics with Avro schemas, expose as Iceberg tables via Tableflow / Iceberg sink — production medallion in a docker compose.

  • Govern schemas across teams · Production

    Configure backward/forward/full compatibility per topic, set CI gates that fail breaking changes before they merge, document the upgrade dance for every Avro/Protobuf change.

  • Observe streaming systems in production · Production

    Define RED metrics + lag SLOs, instrument with OTel, alert on rebalance storms and DLQ growth, and maintain a runbook every on-call can execute at 03:00.

  • Run a streaming production rollout · Advanced

    Sequence the rollout — shadow → dual-write → cutover → backfill — with quotas, rate limits, and a kill switch; document the ADR that lets the next team replicate the playbook.
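The lag SLO behind the observability skill reduces to simple arithmetic per partition: head offset minus committed offset. A sketch with made-up offsets (partition numbering and values are illustrative):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: how far the group trails the head of the log."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}


end = {0: 1_500, 1: 1_498, 2: 9_900}        # head of the log, per partition
committed = {0: 1_500, 1: 1_490, 2: 4_000}  # where the group has committed to

lag = consumer_lag(end, committed)
assert lag == {0: 0, 1: 8, 2: 5_900}
# Partition 2 is the alerting story: summed lag hides a single hot partition,
# so put the SLO on max-per-partition lag, not the total.
```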

Career & income delta

Career moves
  • Title yourself credibly as a 'streaming engineer' or 'data platform engineer' — the 2026 hiring channel for senior IC roles at $200-360K in US/EU markets.
  • Lead a streaming initiative on your team — most enterprise roadmaps have a 'real-time' line item that nobody owns; that ownership is the staff-promo lever.
  • Pick up consulting work at $200-400/hr — the most common 2026 inquiry is 'we have Kafka but it's slow / lossy / costing too much'.
  • Move from generic backend role to platform / data-platform team where streaming expertise is the entry ticket and the path to staff/principal.
Income impact
  • $25-50K bump for senior backend ICs adding production streaming to their resume in 2026.
  • $60-150K bump moving from a generic role to a data-platform / streaming-platform team at a series-B+ company.
  • Freelance / consulting rates: $200-400/hr — Debezium + Flink SQL + exactly-once is the rate-bumping triple play.
  • Enterprise sales engineering: closing one 6-figure analytics deal per quarter often requires demonstrating the CDC → Iceberg path live.
Market resilience
  • The log abstraction is durable — every framework and platform consolidation in the last 12 years has reinforced it, not replaced it.
  • The Kafka wire protocol is the de facto interop standard; investments transfer across Kafka, Redpanda, WarpStream, AutoMQ, and Confluent Cloud.
  • CDC + Iceberg is the cross-vendor lakehouse pattern (Snowflake, Databricks, Trino, BigQuery all read it natively) — protocol fluency outlives any single vendor.
  • Production discipline (lag SLO, schema CI, exactly-once, observability) carries forward to whatever the 2027 stream stack is — the tools change, the discipline doesn't.