INTRO · BLOCK · 01
DIST · 7 MIN PREVIEW
Distributed processing, OLAP & query opt
MapReduce mental model. Spark in 2026. Where DuckDB wins. Tune queries with the planner — not vibes.
CONCEPT · BLOCK · 02
Distributed = shuffle is the bottleneck
Most distributed queries that join or aggregate are dominated by one operation: the shuffle. Map-side and reduce-side operations scale roughly linearly with cores; the shuffle scales with data movement, which is gated by network and disk. Three quarters of query optimisation in 2026 is shuffle reduction: better partitioning, broadcast joins where they fit, predicate pushdown, columnar formats. The other quarter is skew handling: one hot key can stall a 10-machine job behind a single straggler.
TIP · When in doubt, measure shuffle bytes. It is usually the single most predictive number for query cost.
WATCH OUT · DuckDB on a single node often beats Spark below ~500 GB. Don't reach for a cluster for problems a laptop solves.
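To make the straggler problem concrete, here is a minimal pure-Python sketch of a hash-partitioned shuffle. It is not Spark code: `partition_for`, `partition_sizes`, `skew_ratio`, and the workload (10 partitions, 1,000 users, one hot key) are all hypothetical, and SHA-256 only stands in for whatever hash the engine actually uses.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic hash (stand-in for the engine's)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition after the shuffle."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical workload: 100,000 rows spread evenly over 1,000 users...
uniform_keys = [f"user_{i % 1000}" for i in range(100_000)]
# ...versus the same row count where one hot key owns half the rows.
skewed_keys = ["user_0"] * 50_000 + [f"user_{i % 1000}" for i in range(50_000)]

uniform_sizes = partition_sizes(uniform_keys, 10)
skewed_sizes = partition_sizes(skewed_keys, 10)

print("uniform skew ratio:", round(skew_ratio(uniform_sizes), 2))
print("skewed skew ratio:", round(skew_ratio(skewed_sizes), 2))
```

The job finishes when the largest partition finishes, so the skew ratio is a rough proxy for how much longer the straggler runs than the average task.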
DIAGRAM · BLOCK · 03
Stages, shuffles, and where time goes
Filter early. Shuffle less. Salt skewed keys. Three rules carry the field.
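The salting rule can be sketched in plain Python. This is an illustration under assumptions, not a Spark API: `salt_key`, `max_partition_rows`, the 16 salts, and the 8 partitions are hypothetical choices, and SHA-256 stands in for the engine's partitioning hash.

```python
import hashlib
import random
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (stand-in for the engine's hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def salt_key(key: str, num_salts: int, rng: random.Random) -> str:
    """Append a random salt so one hot key becomes up to num_salts distinct keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def max_partition_rows(keys, num_partitions):
    """Size of the largest partition, i.e. the work given to the straggler."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return max(counts.values())


rng = random.Random(42)  # fixed seed so the sketch is reproducible
hot_rows = ["user_0"] * 80_000  # hypothetical hot key

# Without salting, every row hashes to the same partition.
unsalted_max = max_partition_rows(hot_rows, 8)

# With salting, the hot key is spread over several partitions.
salted_rows = [salt_key(k, 16, rng) for k in hot_rows]
salted_max = max_partition_rows(salted_rows, 8)

print("largest partition without salt:", unsalted_max)
print("largest partition with salt:", salted_max)
```

The trade-off is on the other side of the join: each matching row there has to be replicated once per salt value so that every salted key still finds its partner.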
CODE · BLOCK · 04
Spark broadcast join — the simplest 10x win
PYTHON
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("s3://lake/orders/")  # 200 GB
users = spark.read.parquet("s3://lake/users/")    # 8 MB

# WRONG: if Spark does not auto-broadcast, this shuffles 200 GB by user_id
slow = orders.join(users, "user_id")

# RIGHT: ships the 8 MB table to every executor; the 200 GB side is not shuffled
fast = orders.join(broadcast(users), "user_id")
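If one side fits comfortably in executor memory (tens of MB), broadcast it. Spark auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but that depends on its size estimate; an explicit broadcast() removes the guesswork.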
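The arithmetic behind the win is worth writing down. The sketch below is a back-of-envelope cost model, not something Spark exposes: `shuffle_join_bytes`, `broadcast_join_bytes`, `break_even_executors`, and the 50-executor cluster are hypothetical, and the model ignores compression and the gap between on-disk and in-memory size.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def shuffle_join_bytes(left_bytes: int, right_bytes: int) -> int:
    """Rough upper bound: both sides are repartitioned by the join key."""
    return left_bytes + right_bytes


def broadcast_join_bytes(small_bytes: int, num_executors: int) -> int:
    """The small side is copied to every executor; the large side stays in place."""
    return small_bytes * num_executors


def break_even_executors(left_bytes: int, right_bytes: int, small_bytes: int) -> int:
    """Executor count at which broadcasting moves as many bytes as shuffling."""
    return shuffle_join_bytes(left_bytes, right_bytes) // small_bytes


orders_bytes = 200 * GB  # sizes taken from the example above
users_bytes = 8 * MB
num_executors = 50       # assumption: a hypothetical 50-executor cluster

shuffle_cost = shuffle_join_bytes(orders_bytes, users_bytes)
broadcast_cost = broadcast_join_bytes(users_bytes, num_executors)
break_even = break_even_executors(orders_bytes, users_bytes, users_bytes)

print(f"shuffle join moves about {shuffle_cost / GB:.1f} GB")
print(f"broadcast join moves about {broadcast_cost / MB:.0f} MB")
print(f"break-even executor count: {break_even}")
```

Broadcast cost grows with the size of the small side times the number of executors, which is why the technique stops paying off as the "small" table grows.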
CHEATSHEET · BLOCK · 05
Five things to remember
01 · Shuffle is the bottleneck. Always measure shuffle bytes.
02 · Broadcast small tables. Skip the shuffle entirely.
03 · Salt skewed keys to spread hot partitions.
04 · Columnar + predicate pushdown beats row-format scans for analytical queries that read a few columns and filter hard.
05 · DuckDB on one big box often beats Spark on a small cluster up to ~500 GB.
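Point 04 rests on the reader skipping data it never has to decode. Here is a minimal sketch of that idea, assuming per-chunk min/max statistics in the spirit of columnar file formats; `RowGroup` and `scan_with_pushdown` are hypothetical names, not any library's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics stored alongside it."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_with_pushdown(row_groups: List[RowGroup], lower_bound: int) -> Tuple[List[int], int]:
    """Return values >= lower_bound and the number of row groups actually read."""
    matched: List[int] = []
    groups_read = 0
    for group in row_groups:
        if group.max_value < lower_bound:
            continue  # skipped from metadata alone; no row is touched
        groups_read += 1
        matched.extend(v for v in group.rows if v >= lower_bound)
    return matched, groups_read


# Hypothetical column split into three row groups of 100 values each.
groups = [
    RowGroup(0, 99, list(range(0, 100))),
    RowGroup(100, 199, list(range(100, 200))),
    RowGroup(200, 299, list(range(200, 300))),
]

matched, groups_read = scan_with_pushdown(groups, lower_bound=250)
print("row groups read:", groups_read, "of", len(groups))
print("rows matched:", len(matched))
```

Pruning only helps when the filtered column is clustered enough that the min/max ranges exclude whole groups; on randomly ordered data every group overlaps the predicate and nothing is skipped.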
MINIGAME · RAPIDFIRETF · BLOCK · 06
True or false: 6 seconds each
Broadcast joins scale with the size of the broadcast side.
LESSON COMPLETE · BLOCK · 07
Distributed mental model: locked.
NEXT · Hello Spark: a 10x win in one DAG
WHAT YOU'LL WALK AWAY WITH
Real skills, real career delta.
Skills you'll gain (9)
- Reason about shuffles and skew · Working
- Pick Trino vs DuckDB vs Spark · Working
- Tune queries with the planner, not vibes · Working
- MapReduce mental model · Working
- Spark in 2026 · Working
- DuckDB vs Trino · Working
- Query optimisation tactics · Working
- OLAP fundamentals · Working
- GPU-accelerated processing · Working
The first three are outcomes from completing the course; the remaining six are topics covered in the lesson sequence.
Career & income delta
Career moves
- Lead a distributed processing, OLAP, and query optimisation initiative on your team; most orgs have it on the roadmap and few have shipped it.
- Consulting work at $150-300/hr: distributed processing shipped to production is a sought-after specialty in 2026.
- Move from generic IC to a platform or AI-platform team, where distributed processing, OLAP, and query optimisation expertise is the entry ticket.
Income impact
- $15-40K bump for senior ICs adding distributed processing, OLAP, and query optimisation to their resume.
- Freelance / consulting demand for the same skill: $150-300/hr in 2026.
- Closing enterprise deals often hinges on demonstrating the production patterns from this course.
Market resilience
- Distributed processing, OLAP, and query optimisation is a durable skill across model and framework consolidations.
- Production guardrails (cost caps, observability, audit, evals) carry forward to whatever the 2027 stack is.
- Core patterns transfer to cloud, on-prem, and hybrid deployments.