Quick Intro · ~7 min · AIDQ

AI-ready data quality

Full Study

A scannable trailer of the 8-lesson course. Read top to bottom — no clicks needed.

INTROBLOCK · 01
AIDQ · 7 MIN PREVIEW

Your model is only as smart as the data behind it.

Gartner: 60% of AI projects will be abandoned through 2026 because of inadequate AI-ready data. Average annual cost of poor data quality: $12.9M per organisation. 63% of orgs say they aren't sure their data practices are right for AI. This trailer shows what 'AI-ready' actually means — and how to ship it.

CONCEPTBLOCK · 02

Eight dimensions, not six

Traditional data quality has six dimensions: accuracy, completeness, consistency, timeliness, validity, uniqueness. Necessary, but not sufficient. AI-ready data adds two more: **representativeness** (does the dataset cover every distribution the model will see in prod?) and **provenance** (can you cite where every row came from, when, and under what consent?). The rest of this course is the operating manual: contracts that block bad data at the gate, lakehouse formats that evolve safely, vector pipelines that score recall@K, lineage that satisfies the EU AI Act, and an eval harness that gates your CI.
TIP: AI-ready data is a CI/CD problem, not a one-time clean-up. Every commit to the schema or transform stack should pass DQ + recall + lineage gates.
WATCH OUT: Klarna shipped a RAG bot in Jan 2024 — 2.3M conversations in month one — and reversed course in May 2025 to re-hire humans. They optimised CSAT proxies; they did not measure trust erosion. The fix was AI-ready DATA discipline, not a different LLM.
GOTCHA: EU AI Act GPAI obligations have applied since 2 Aug 2025. Full enforcement powers (fines, recalls) start 2 Aug 2026. If you ship AI in the EU, you need documented data provenance and a public training-data summary.
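Representativeness, the first of the two added dimensions, can be checked mechanically. A minimal sketch, assuming categorical rows; the 0.8 coverage threshold and the country data are illustrative, not a standard:

```python
from collections import Counter

def representativeness_gaps(train_rows, prod_rows, key, min_coverage=0.8):
    """Flag categories the model will see in prod that are thin or absent
    in training data. Threshold is illustrative, not a standard."""
    train_counts = Counter(r[key] for r in train_rows)
    prod_counts = Counter(r[key] for r in prod_rows)
    gaps = {}
    for category, n_prod in prod_counts.items():
        prod_share = n_prod / len(prod_rows)
        train_share = train_counts.get(category, 0) / len(train_rows)
        # Coverage ratio: does the training share keep up with the prod share?
        coverage = train_share / prod_share if prod_share else 1.0
        if coverage < min_coverage:
            gaps[category] = round(coverage, 2)
    return gaps

# Illustrative: training data is US-heavy; prod traffic is not.
train = [{"country": "US"}] * 90 + [{"country": "EU"}] * 10
prod = [{"country": "US"}] * 60 + [{"country": "EU"}] * 30 + [{"country": "UK"}] * 10
print(representativeness_gaps(train, prod, "country"))  # {'EU': 0.33, 'UK': 0.0}
```

EU is under-covered and UK is entirely absent from training — exactly the kind of gap the six traditional dimensions never surface.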
DIAGRAMBLOCK · 03

The AI-ready data stack — one picture

SOURCES → CONTRACTS → PII GATE → ICEBERG LH → VECTOR IDX → EVAL/CI → LINEAGE
Sources hit a contract gate AND a PII gate. Clean rows land in Iceberg + the vector index. Both feed an eval/CI step that gates the next index version. Lineage records every hop.
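In code, that hop order reads as a pipeline of gates. A minimal sketch — `contract_gate` and `pii_gate` are toy stand-ins for the contract CLI and Presidio, not their real APIs:

```python
import re

def contract_gate(row):
    # Stand-in for the ODCS contract check: required, correctly typed fields.
    return isinstance(row.get("id"), int) and isinstance(row.get("email"), str)

def pii_gate(row):
    # Stand-in for Presidio: mask the local part of the email address.
    row = dict(row)
    row["email"] = re.sub(r"^[^@]+", "***", row["email"])
    return row

def ingest(rows):
    clean, rejected = [], []
    for row in rows:
        if not contract_gate(row):
            rejected.append(row)      # blocked at the contract gate
            continue
        clean.append(pii_gate(row))   # lands in Iceberg + the vector index

    return clean, rejected

rows = [{"id": 1, "email": "ana@snap.dev"}, {"id": "oops", "email": "x@y.z"}]
clean, rejected = ingest(rows)
print(clean)     # [{'id': 1, 'email': '***@snap.dev'}]
print(rejected)  # [{'id': 'oops', 'email': 'x@y.z'}]
```

The point of the ordering: the PII gate only ever sees rows the contract already accepted, and nothing reaches storage unredacted.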
CODEBLOCK · 04

An ODCS v3 contract that blocks bad data

YAML
# customers.contract.yaml — Open Data Contract Standard (ODCS) v3
apiVersion: v3.0.0
kind: DataContract
id: customers-prod
name: customers
version: 1.4.0
status: active
schema:
  - name: customers
    logicalType: object
    physicalType: table
    properties:
      - name: id
        logicalType: integer
        required: true
        unique: true
      - name: email
        logicalType: string
        required: true
      - name: created_at
        logicalType: date
        required: true
      - name: country
        logicalType: string
        quality:
          - rule: validValues
            validValues: [US, EU, UK, OTHER]
    quality:
      - type: sql
        description: zero null emails
        query: SELECT COUNT(*) FROM customers WHERE email IS NULL
        mustBe: 0
The typed schema blocks wrong types at the gate. The `country` constraint fails values outside the allowed set before they hit the warehouse. The SQL quality rule is run by the contract CLI in your CI. Producers publish; consumers subscribe.
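The SQL quality rule translates directly into a CI check. A minimal sketch using stdlib `sqlite3` as a stand-in warehouse — the real gate is `datacontract test` against your actual engine, and `run_quality_rule` is a hypothetical helper:

```python
import sqlite3

def run_quality_rule(conn, query, must_be):
    """Run one contract quality rule and mirror the contract's `mustBe`
    check: return (passed, actual_value)."""
    actual = conn.execute(query).fetchone()[0]
    return actual == must_be, actual

# Stand-in warehouse with one contract-breaking row (null email).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@x.dev"), (2, None)])

ok, actual = run_quality_rule(
    conn, "SELECT COUNT(*) FROM customers WHERE email IS NULL", must_be=0)
print(ok, actual)  # False 1 — one null email, so CI should block the merge
```

Fail the build when `ok` is false and bad data never lands — that is the whole producer-side contract idea in one conditional.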
CHEATSHEETBLOCK · 05

The 5 rules every 2026 AI-ready data shipper knows

01 · Eight dimensions, not six. Always check representativeness + provenance.
02 · Contracts at the producer boundary (ODCS v3 + dbt). Block bad data before it lands.
03 · Iceberg + REST catalog (Polaris or Nessie) — Delta is now Databricks-first; Iceberg is the open default.
04 · Recall@5 ≥ 0.85 on a 50-question gold set is the realistic production bar for RAG.
05 · Lineage with OpenLineage 1.x — required for EU AI Act Annex IV and right-to-be-forgotten.
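The Recall@5 promotion gate is a few lines of Python. A sketch, assuming a gold set that maps each question to its one relevant doc id — the question and doc names are illustrative:

```python
def recall_at_k(gold, retrieved, k=5):
    """Fraction of gold questions whose relevant doc id appears in the
    top-k retrieved ids."""
    hits = sum(1 for q, relevant in gold.items()
               if relevant in retrieved[q][:k])
    return hits / len(gold)

# Illustrative 4-question gold set; production uses 50+ questions.
gold = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-c", "q4": "doc-d"}
retrieved = {
    "q1": ["doc-a", "doc-z"],
    "q2": ["doc-x", "doc-b", "doc-y"],
    "q3": ["doc-z", "doc-q", "doc-r", "doc-s", "doc-t"],  # miss
    "q4": ["doc-d"],
}
score = recall_at_k(gold, retrieved, k=5)
print(score, "PROMOTE" if score >= 0.85 else "HOLD")  # 0.75 HOLD
```

The same function, pointed at each candidate index version, is the "should we promote this RAG to prod" gate the course builds out.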
MINIGAME · RAPIDFIRETFBLOCK · 06

Quick check — true or false?

60% of AI projects will be abandoned through 2026 due to non-AI-ready data.
CLAIM 1/5
CONCEPTBLOCK · 07

What you'll ship in the full study

Eight lessons. Eight Docker projects. By the end you'll have:
  • A dbt + Soda + DuckDB DQ gate pipeline that fails CI on contract breaches.
  • An ODCS v3 contracts lab with `datacontract-cli` and `dbt model contracts`.
  • A MinIO + Apache Polaris (Iceberg REST catalog) lakehouse with safe schema evolution.
  • A vector-readiness checker that benchmarks 3 chunking strategies × N embedding models against your gold set.
  • An OpenLineage end-to-end emitter (Airflow + dbt) feeding Marquez.
  • A self-hosted Presidio PII redaction microservice.
  • A streaming CDC pipeline (Postgres → Debezium → RisingWave → Qdrant) for second-scale RAG freshness.
  • A bias-audit bench (Fairlearn + AIF360 + Aequitas) producing a governance-ready HTML report.
Every project is meant to be lifted into your real work — not a demo.
INCLUDED: Each project ships with a Compose file, an expected-stdout check, and a 'lift to work' note explaining how to drop it into your team's repo.
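A taste of the lineage project: what an emitted event looks like. This is a hand-built sketch of a minimal OpenLineage run event — the namespaces, job name, and URLs are illustrative, and in the actual project the events come from the Airflow and dbt integrations, not hand-written dicts:

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Shape of a minimal OpenLineage run event, for illustration only.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "snap.dev", "name": "dbt.customers_clean"},
    "inputs": [{"namespace": "postgres://prod", "name": "public.customers"}],
    "outputs": [{"namespace": "iceberg://lakehouse", "name": "clean.customers"}],
    "producer": "https://example.com/aidq-course",  # illustrative producer URI
}
print(json.dumps(event, indent=2)[:80])
```

One event per hop, inputs and outputs named on every one — that is what lets Marquez answer "which downstream datasets contain this customer's data" for right-to-be-forgotten requests.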
LESSON COMPLETEBLOCK · 08

That's the trailer.

NEXT: Lesson 1 · Why AI-ready data is different
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Diagnose AI-readiness gaps across the 8 dimensions · Working

    Run the 8-dimension scorecard on any dataset (accuracy / completeness / consistency / timeliness / validity / uniqueness + AI-specific representativeness + provenance). Map each gap to a concrete fix in the contract / lineage / eval stack.

  • Author and enforce data contracts with ODCS v3 + dbt · Production

    Write Open Data Contract Standard v3 contracts, generate dbt models with `contract: enforced`, run `datacontract test` in CI, block merges on schema-breaking changes — producer-side, before data lands.

  • Architect a lakehouse with Iceberg + REST catalog · Production

    Stand up Apache Polaris (or Nessie) as an Iceberg REST catalog, evolve schemas safely (add/drop/reorder/rename + type promotion), and integrate with Spark / DuckDB / Trino without rewriting data.

  • Build an embedding-readiness checker (chunking + recall@K) · Production

    Benchmark 3 chunking strategies × 3 embedding models on a 50–200 question gold set; report Recall@5 and faithfulness; ship the 'should we promote this RAG to prod' gate.

  • Wire end-to-end data lineage with OpenLineage · Working

    Emit OpenLineage 1.x events from Airflow + dbt + Spark; receive in Marquez; surface column-level + run-level + dataset-version-level lineage; demonstrate right-to-be-forgotten propagation.

  • Detect and redact PII at ingest with Presidio · Production

    Self-host Presidio analyzer + anonymizer; integrate as a FastAPI gateway in front of training-data ingestion; configure per-entity policies (mask vs hash vs synthetic) for Art. 10 EU AI Act compliance.

  • Ship streaming CDC → vector pipelines for second-scale RAG · Working

    Capture changes from Postgres with Debezium, materialize through RisingWave, upsert into Qdrant within seconds — replacing nightly batch RAG re-indexing.

  • Run a tabular bias audit ready for governance review · Working

    Fairlearn for exploratory disparity scan, AIF360 for mitigation, Aequitas for HTML/CSV reports; cover demographic parity, equalized odds, disparate impact (4/5ths), calibration within groups.

  • Build eval-driven ingestion gates (Soda + RAGAS in CI) · Production

    Soda Core in CI for tabular DQ, DeepEval/RAGAS at staging for RAG, TruLens in production for drift; gate every merge on a tolerance budget for cost, latency, and recall.

  • Comply with EU AI Act Art. 10 + Annex IV data governance · Advanced

    Author the public training-data summary using the AI Office template, document data governance per Art. 10 (relevance, representativeness, error-freedom), keep Annex IV provenance evidence — in time for the 2 Aug 2026 enforcement deadline.
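The 4/5ths disparate-impact test from the bias-audit skill is simple enough to sketch in pure Python. Numbers are illustrative; Fairlearn and Aequitas produce the full governance report:

```python
def disparate_impact(selected_by_group):
    """Selection-rate ratio of each group against the most-selected group.
    The 4/5ths rule flags any ratio below 0.8."""
    rates = {g: sel / total for g, (sel, total) in selected_by_group.items()}
    best = max(rates.values())
    return {g: round(r / best, 2) for g, r in rates.items()}

# (selected, total) per group — illustrative numbers.
ratios = disparate_impact({"A": (50, 100), "B": (30, 100)})
print(ratios)  # {'A': 1.0, 'B': 0.6} — group B fails the 4/5ths test
```

A ratio of 0.6 for group B is well under the 0.8 bar, so this model would not pass a governance review without mitigation — the kind of finding the AIF360 step in the project then tries to fix.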

Career & income delta

Career moves
  • Title yourself credibly as 'AI data engineer' / 'AI data platform engineer' — the 2026 hiring channel for senior IC roles at $180–340K base.
  • Lead the data-readiness workstream on your AI platform team — the biggest unowned mandate in most series-B/C orgs.
  • Pick up contracting work at $180–350/hr fixing the 60% of AI projects Gartner says will be abandoned for non-AI-ready data.
  • Own the 'why is this AI feature failing' line item — the answer is almost always upstream of the model.
  • Become the EU AI Act point person for your org — a rare, durable specialty going into Aug 2026 enforcement.
Income impact
  • $25–60K bump moving from generic data-engineering into an AI-data-platform team in 2026.
  • $50–150K bump for senior ICs adding production AI-ready discipline (contracts + lineage + eval) to their resume.
  • Freelance / consulting rates: $180–350/hr — 'we shipped a RAG and it hallucinates' is the most common 2026 inquiry, and the fix is always data.
  • Enterprise: every six-figure deal that touches the EU now needs an Annex IV / Art. 10 story; the engineer who can produce it commands a premium.
Market resilience
  • Data quality + lineage is upstream of every model — survives any foundation-model consolidation.
  • EU AI Act, NIS2, and emerging US/UK regimes all converge on documented data governance — the demand only grows through 2027.
  • Iceberg + OpenLineage are LF-governed standards — protocol fluency is durable across cloud vendors.
  • Vector + RAG is the visible AI; the invisible foundation is AI-ready data. Recruiters know the second one is harder to hire.
  • If model APIs commoditise, the differentiator becomes proprietary, well-governed data — Bloomberg's lesson, restated for everyone.