Distributed processing, OLAP & query opt

INTRO BLOCK · 01

DIST · 7 MIN PREVIEW

MapReduce mental model. Spark in 2026. Where DuckDB wins. Tune queries with the planner — not vibes.

CONCEPT BLOCK · 02

Distributed = shuffle is the bottleneck

Almost every distributed query is dominated by one operation: the shuffle. Map-side and reduce-side work scales roughly linearly with cores; the shuffle scales with the volume of data moved between nodes, which is gated by network and disk. As a rule of thumb, three quarters of query optimisation in 2026 is shuffle reduction: better partitioning, broadcast joins where one side is small enough, predicate pushdown, and columnar formats. The other quarter is skew handling: one hot key can stall a 10-machine job behind a single straggler.

TIP: When in doubt, measure shuffle bytes. It's the single most predictive number for query cost.
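To make "measure shuffle bytes" concrete, here is a minimal sketch in plain Python rather than Spark. It counts the bytes that would have to leave their current worker when rows are repartitioned by user_id, once with the filter applied after the shuffle and once before it. The cluster size, row size, data layout, and the modulo partitioner are all assumptions chosen for illustration; a real engine reports this number in its own metrics.

```python
from dataclasses import dataclass

NUM_WORKERS = 4          # hypothetical cluster size
ROWS_PER_WORKER = 1000   # hypothetical data volume
ROW_BYTES = 100          # assumed fixed wire size per row


@dataclass(frozen=True)
class Row:
    user_id: int
    country: str


def make_partitions():
    """Place rows on workers by arrival order, not by key."""
    partitions = []
    for worker in range(NUM_WORKERS):
        rows = []
        for i in range(ROWS_PER_WORKER):
            user_id = worker * ROWS_PER_WORKER + i
            country = "DE" if i % 10 == 0 else "US"
            rows.append(Row(user_id, country))
        partitions.append(rows)
    return partitions


def shuffle_bytes(partitions):
    """Bytes that leave their current worker when rows are repartitioned by user_id."""
    moved = 0
    for worker, rows in enumerate(partitions):
        for row in rows:
            target = row.user_id % NUM_WORKERS  # toy stand-in for a hash partitioner
            if target != worker:
                moved += ROW_BYTES
    return moved


partitions = make_partitions()

# Filter late: repartition everything, then keep only the "DE" rows.
late_bytes = shuffle_bytes(partitions)

# Filter early: drop non-matching rows on each worker before the shuffle.
filtered = [[r for r in rows if r.country == "DE"] for rows in partitions]
early_bytes = shuffle_bytes(filtered)

print(f"filter late:  {late_bytes} bytes shuffled")
print(f"filter early: {early_bytes} bytes shuffled")
```

In this toy setup the early filter keeps one row in ten, so the shuffle volume drops by the same factor. The point is the shape of the comparison, not the specific numbers.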

WATCH OUT: DuckDB beats Spark below ~500 GB on a single node. Don't reach for a cluster for problems a laptop solves.

DIAGRAM BLOCK · 03

Stages, shuffles, and where time goes

Filter early. Shuffle less. Salt skewed keys. Three rules carry the field.
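The "salt skewed keys" rule can be sketched as a two-stage count in plain Python. Stage one partitions by (key, salt) so the hot key is spread over several partitions; stage two strips the salt and merges the partial counts. The partition count, the salt fan-out, the round-robin salt, and the additive toy partitioner are assumptions for illustration only; a real engine would hash the composite key.

```python
from collections import Counter, defaultdict

NUM_PARTITIONS = 10  # hypothetical number of shuffle partitions
NUM_SALTS = 10       # hypothetical salt fan-out


def partition_of(key, salt=0):
    """Toy deterministic partitioner; real engines hash the composite key."""
    return (key + salt) % NUM_PARTITIONS


# 9,000 rows for one hot key (7) plus 1,000 rows spread over keys 0..999.
keys = [7] * 9000 + list(range(1000))

# Unsalted: every row for the hot key lands in the same partition.
plain_load = Counter(partition_of(k) for k in keys)

# Stage 1 (salted): assign a round-robin salt and count per (key, salt).
salted_load = Counter()
partials = defaultdict(int)
for i, k in enumerate(keys):
    salt = i % NUM_SALTS
    salted_load[partition_of(k, salt)] += 1
    partials[(k, salt)] += 1

# Stage 2: drop the salt and merge the partial counts per original key.
totals = Counter()
for (k, _salt), n in partials.items():
    totals[k] += n

print("largest partition, unsalted:", max(plain_load.values()))
print("largest partition, salted:  ", max(salted_load.values()))
```

Salting an aggregation costs a second, much smaller merge step. For a join, the other side has to be replicated once per salt value so that every salted key still finds its match, which is why salting is usually applied only to keys known to be hot.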

CODE BLOCK · 04

Spark broadcast join — the simplest 10x win

PYTHON

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("s3://lake/orders/")  # 200 GB
users = spark.read.parquet("s3://lake/users/")    # 8 MB

# WRONG: shuffles 200 GB by user_id (unless Spark auto-broadcasts users)
slow = orders.join(users, "user_id")

# RIGHT: ships the 8 MB table to every executor; the 200 GB side is not shuffled
fast = orders.join(broadcast(users), "user_id")

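If one side fits comfortably in each executor's memory (tens of MB), broadcast it. Spark auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but that depends on its size estimates; an explicit broadcast() removes the guesswork.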
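To show why broadcasting avoids the shuffle, the following sketch implements a broadcast hash join in plain Python. It is not Spark's implementation, only an illustration of the idea: build a lookup table from the small side once, then probe it from each partition of the large side without moving any large-side rows. The sample rows, the function names, and the executor count of 50 are hypothetical.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each large-side partition against a full copy of the small side."""
    # Build phase: one hash table over the small side.
    # In a cluster, this table is what gets copied to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: each large-side partition is joined where it already lives.
    joined_partitions = []
    for partition in large_partitions:
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined_partitions.append(out)
    return joined_partitions


def broadcast_cost_bytes(small_side_bytes, num_executors):
    """Network cost of a broadcast: one copy of the small side per executor."""
    return small_side_bytes * num_executors


# Hypothetical sample data: two partitions of orders, one small users table.
orders_parts = [
    [{"order_id": 1, "user_id": 10}, {"order_id": 2, "user_id": 11}],
    [{"order_id": 3, "user_id": 10}, {"order_id": 4, "user_id": 99}],
]
users = [{"user_id": 10, "name": "Ada"}, {"user_id": 11, "name": "Lin"}]

joined = broadcast_hash_join(orders_parts, users, "user_id")
cost = broadcast_cost_bytes(8 * 1024**2, num_executors=50)

print([len(p) for p in joined])
print(cost, "bytes sent for the broadcast")
```

The cost function makes the trade-off visible: data movement depends on the size of the broadcast side and the number of executors, not on the size of the large table. That is why the technique only pays off while the small side stays small.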

CHEATSHEET BLOCK · 05

Five things to remember

01 Shuffle is the bottleneck. Always measure shuffle bytes.

02 Broadcast small tables. Skip the shuffle entirely.

03 Salt skewed keys to spread hot partitions.

04 Columnar + predicate pushdown beats row-format scans for analytical queries.

05 DuckDB on one big box beats Spark on a small cluster up to ~500 GB.
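Item 04 can be illustrated with a small plain-Python model of a columnar file. Each row group stores per-column value lists plus min/max statistics for a timestamp column, and the scan uses those statistics to skip whole groups before reading any values. The RowGroup class, the column names, and the data are hypothetical; real formats such as Parquet keep similar statistics but with their own layout and rules.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_ts: int      # smallest ts in this group (precomputed statistic)
    max_ts: int      # largest ts in this group (precomputed statistic)
    columns: dict    # column name -> list of values, stored column by column


def scan(row_groups, column, ts_lo, ts_hi):
    """Return values of `column` where ts_lo <= ts <= ts_hi, and how many groups were read."""
    values = []
    groups_read = 0
    for group in row_groups:
        # Predicate pushdown: skip the group using statistics alone.
        if group.max_ts < ts_lo or group.min_ts > ts_hi:
            continue
        groups_read += 1
        # Columnar read: only the "ts" column and the requested column are touched.
        for ts, value in zip(group.columns["ts"], group.columns[column]):
            if ts_lo <= ts <= ts_hi:
                values.append(value)
    return values, groups_read


# Four hypothetical row groups of ten rows each, ordered by ts.
row_groups = []
for start in range(0, 40, 10):
    ts = list(range(start, start + 10))
    row_groups.append(
        RowGroup(
            min_ts=ts[0],
            max_ts=ts[-1],
            columns={
                "ts": ts,
                "amount": [t * 2 for t in ts],
                "note": [f"row {t}" for t in ts],  # never read by the query below
            },
        )
    )

amounts, groups_read = scan(row_groups, "amount", ts_lo=12, ts_hi=25)
print(f"read {groups_read} of {len(row_groups)} row groups, {len(amounts)} rows matched")
```

Skipping works here because the data is ordered by the filtered column, so each group covers a narrow range. On unsorted data the min/max ranges overlap and fewer groups can be skipped.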

MINIGAME · RAPIDFIRETF BLOCK · 06

True or false: 6 seconds each

Broadcast joins scale with the size of the broadcast side.


LESSON COMPLETE BLOCK · 07

Distributed mental model: locked.

NEXT: Hello Spark: a 10x win in one DAG

WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

Outcomes from completing the course:
Reason about shuffles and skew · Working
Pick Trino vs DuckDB vs Spark · Working
Tune queries with the planner, not vibes · Working

Covered in the lesson sequence (drop-in ready):
MapReduce mental model · Working
Spark in 2026 · Working
DuckDB vs Trino · Working
Query optimisation tactics · Working
OLAP fundamentals · Working
GPU-accelerated processing · Working

Career & income delta

Career moves

Lead a Distributed processing, OLAP & query opt initiative on your team — most orgs have it on the roadmap and few have shipped it.
Consulting work at $150-300/hr — 'DIST shipped to production' is a sought-after specialty in 2026.
Move from a generic IC role to a platform or AI-platform team, where Distributed processing, OLAP & query opt expertise is the entry ticket.

Income impact

$15-40K bump for senior ICs adding Distributed processing, OLAP & query opt to their resume.
Freelance / consulting demand for the same skill: $150-300/hr in 2026.
Closing enterprise deals often hinges on demonstrating the production patterns from this course.

Market resilience

Distributed processing, OLAP & query opt is a durable skill across model and framework consolidations.
Production guardrails (cost caps, observability, audit, evals) carry forward to whatever the 2027 stack is.
Core patterns transfer to cloud, on-prem, and hybrid deployments.