Quick Intro · ~7 min · AIDQ

AI-ready data quality

Full Study

A scannable trailer of the 8-lesson course. Read top to bottom — no clicks needed.

INTROBLOCK · 01
AIDQ · 7 MIN PREVIEW

Your model is only as smart as the data behind it.

Gartner: 60% of AI projects will be abandoned through 2026 because of inadequate AI-ready data. Average annual cost of poor data quality: $12.9M per organisation. 63% of orgs say they aren't sure their data practices are right for AI. This trailer shows what 'AI-ready' actually means — and how to ship it.

CONCEPTBLOCK · 02

Eight dimensions, not six

Traditional data quality has six dimensions: accuracy, completeness, consistency, timeliness, validity, uniqueness. Necessary, but not sufficient. AI-ready data adds two more: **representativeness** (does the dataset cover every distribution the model will see in prod?) and **provenance** (can you cite where every row came from, when, and under what consent?). The rest of this course is the operating manual: contracts that block bad data at the gate, lakehouse formats that evolve safely, vector pipelines that score recall@K, lineage that satisfies the EU AI Act, and an eval harness that gates your CI.
TIP: AI-ready data is a CI/CD problem, not a one-time clean-up. Every commit to the schema or transform stack should pass DQ + recall + lineage gates.
WATCH OUT: Klarna shipped a RAG bot in Jan 2024 — 2.3M conversations in month one — and reversed course in May 2025 to re-hire humans. They optimised CSAT proxies; they did not measure trust erosion. The fix was AI-ready DATA discipline, not a different LLM.
GOTCHA: EU AI Act GPAI obligations have applied since 2 Aug 2025. Full enforcement powers (fines, recalls) start 2 Aug 2026. If you ship AI in the EU, you need documented data provenance and a public training-data summary.
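Representativeness, the first of the two added dimensions, can be checked mechanically. A minimal sketch, assuming categorical rows; the 0.8 coverage threshold and the country data are illustrative, not a standard:

```python
from collections import Counter

def representativeness_gaps(train_rows, prod_rows, key, min_coverage=0.8):
    """Flag categories the model will see in prod that are thin or absent
    in training data. Threshold is illustrative, not a standard."""
    train_counts = Counter(r[key] for r in train_rows)
    prod_counts = Counter(r[key] for r in prod_rows)
    gaps = {}
    for category, n_prod in prod_counts.items():
        prod_share = n_prod / len(prod_rows)
        train_share = train_counts.get(category, 0) / len(train_rows)
        # Coverage ratio: does the training share keep up with the prod share?
        coverage = train_share / prod_share if prod_share else 1.0
        if coverage < min_coverage:
            gaps[category] = round(coverage, 2)
    return gaps

# Illustrative: training data is US-heavy; prod traffic is not.
train = [{"country": "US"}] * 90 + [{"country": "EU"}] * 10
prod = [{"country": "US"}] * 60 + [{"country": "EU"}] * 30 + [{"country": "UK"}] * 10
print(representativeness_gaps(train, prod, "country"))  # {'EU': 0.33, 'UK': 0.0}
```

EU is under-covered and UK is entirely absent from training — exactly the kind of gap the six traditional dimensions never surface.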
DIAGRAMBLOCK · 03

The AI-ready data stack — one picture

SOURCES → CONTRACTS → PII GATE → ICEBERG LH → VECTOR IDX → EVAL/CI → LINEAGE
Sources hit a contract gate AND a PII gate. Clean rows land in Iceberg + the vector index. Both feed an eval/CI step that gates the next index version. Lineage records every hop.
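In code, that hop order reads as a pipeline of gates. A minimal sketch — `contract_gate` and `pii_gate` are toy stand-ins for the contract CLI and Presidio, not their real APIs:

```python
import re

def contract_gate(row):
    # Stand-in for the ODCS contract check: required, correctly typed fields.
    return isinstance(row.get("id"), int) and isinstance(row.get("email"), str)

def pii_gate(row):
    # Stand-in for Presidio: mask the local part of the email address.
    row = dict(row)
    row["email"] = re.sub(r"^[^@]+", "***", row["email"])
    return row

def ingest(rows):
    clean, rejected = [], []
    for row in rows:
        if not contract_gate(row):
            rejected.append(row)      # blocked at the contract gate
            continue
        clean.append(pii_gate(row))   # lands in Iceberg + the vector index

    return clean, rejected

rows = [{"id": 1, "email": "ana@snap.dev"}, {"id": "oops", "email": "x@y.z"}]
clean, rejected = ingest(rows)
print(clean)     # [{'id': 1, 'email': '***@snap.dev'}]
print(rejected)  # [{'id': 'oops', 'email': 'x@y.z'}]
```

The point of the ordering: the PII gate only ever sees rows the contract already accepted, and nothing reaches storage unredacted.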
CODEBLOCK · 04

An ODCS v3 contract that blocks bad data

YAML
# customers.contract.yaml — Open Data Contract Standard (ODCS) v3
apiVersion: v3.0.0
kind: DataContract
id: customers-prod
name: customers
version: 1.4.0
status: active
schema:
  - name: customers
    logicalType: object
    physicalType: table
    properties:
      - name: id
        logicalType: integer
        required: true
        unique: true
      - name: email
        logicalType: string
        required: true
      - name: created_at
        logicalType: date
        required: true
      - name: country
        logicalType: string
        quality:
          - rule: validValues
            validValues: [US, EU, UK, OTHER]
    quality:
      - type: sql
        description: zero null emails
        query: SELECT COUNT(*) FROM customers WHERE email IS NULL
        mustBe: 0
The typed schema blocks wrong types at the gate. The `country` constraint fails values outside the allowed set before they hit the warehouse. The SQL quality rule is run by the contract CLI in your CI. Producers publish; consumers subscribe.
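The SQL quality rule translates directly into a CI check. A minimal sketch using stdlib `sqlite3` as a stand-in warehouse — the real gate is `datacontract test` against your actual engine, and `run_quality_rule` is a hypothetical helper:

```python
import sqlite3

def run_quality_rule(conn, query, must_be):
    """Run one contract quality rule and mirror the contract's `mustBe`
    check: return (passed, actual_value)."""
    actual = conn.execute(query).fetchone()[0]
    return actual == must_be, actual

# Stand-in warehouse with one contract-breaking row (null email).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@x.dev"), (2, None)])

ok, actual = run_quality_rule(
    conn, "SELECT COUNT(*) FROM customers WHERE email IS NULL", must_be=0)
print(ok, actual)  # False 1 — one null email, so CI should block the merge
```

Fail the build when `ok` is false and bad data never lands — that is the whole producer-side contract idea in one conditional.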
CHEATSHEETBLOCK · 05

The 5 rules every 2026 AI-ready data shipper knows

01 · Eight dimensions, not six. Always check representativeness + provenance.
02 · Contracts at the producer boundary (ODCS v3 + dbt). Block bad data before it lands.
03 · Iceberg + REST catalog (Polaris or Nessie) — Delta is now Databricks-first; Iceberg is the open default.
04 · Recall@5 ≥ 0.85 on a 50-question gold set is the realistic production bar for RAG.
05 · Lineage with OpenLineage 1.x — required for EU AI Act Annex IV and right-to-be-forgotten.
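The Recall@5 promotion gate is a few lines of Python. A sketch, assuming a gold set that maps each question to its one relevant doc id — the question and doc names are illustrative:

```python
def recall_at_k(gold, retrieved, k=5):
    """Fraction of gold questions whose relevant doc id appears in the
    top-k retrieved ids."""
    hits = sum(1 for q, relevant in gold.items()
               if relevant in retrieved[q][:k])
    return hits / len(gold)

# Illustrative 4-question gold set; production uses 50+ questions.
gold = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-c", "q4": "doc-d"}
retrieved = {
    "q1": ["doc-a", "doc-z"],
    "q2": ["doc-x", "doc-b", "doc-y"],
    "q3": ["doc-z", "doc-q", "doc-r", "doc-s", "doc-t"],  # miss
    "q4": ["doc-d"],
}
score = recall_at_k(gold, retrieved, k=5)
print(score, "PROMOTE" if score >= 0.85 else "HOLD")  # 0.75 HOLD
```

The same function, pointed at each candidate index version, is the "should we promote this RAG to prod" gate the course builds out.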
MINIGAME · RAPIDFIRETFBLOCK · 06

Quick check — true or false?

60% of AI projects will be abandoned through 2026 due to non-AI-ready data.
CLAIM 1/5
CONCEPTBLOCK · 07

What you'll ship in the full study

Eight lessons. Eight Docker projects. By the end you'll have:
  • A dbt + Soda + DuckDB DQ gate pipeline that fails CI on contract breaches.
  • An ODCS v3 contracts lab with `datacontract-cli` and `dbt model contracts`.
  • A MinIO + Apache Polaris (Iceberg REST catalog) lakehouse with safe schema evolution.
  • A vector-readiness checker that benchmarks 3 chunking strategies × N embedding models against your gold set.
  • An OpenLineage end-to-end emitter (Airflow + dbt) feeding Marquez.
  • A self-hosted Presidio PII redaction microservice.
  • A streaming CDC pipeline (Postgres → Debezium → RisingWave → Qdrant) for second-scale RAG freshness.
  • A bias-audit bench (Fairlearn + AIF360 + Aequitas) producing a governance-ready HTML report.
Every project is meant to be lifted into your real work — not a demo.
INCLUDED: Each project ships with a Compose file, an expected-stdout check, and a 'lift to work' note explaining how to drop it into your team's repo.
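A taste of the lineage project: what an emitted event looks like. This is a hand-built sketch of a minimal OpenLineage run event — the namespaces, job name, and URLs are illustrative, and in the actual project the events come from the Airflow and dbt integrations, not hand-written dicts:

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Shape of a minimal OpenLineage run event, for illustration only.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "snap.dev", "name": "dbt.customers_clean"},
    "inputs": [{"namespace": "postgres://prod", "name": "public.customers"}],
    "outputs": [{"namespace": "iceberg://lakehouse", "name": "clean.customers"}],
    "producer": "https://example.com/aidq-course",  # illustrative producer URI
}
print(json.dumps(event, indent=2)[:80])
```

One event per hop, inputs and outputs named on every one — that is what lets Marquez answer "which downstream datasets contain this customer's data" for right-to-be-forgotten requests.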
LESSON COMPLETEBLOCK · 08

That's the trailer.

NEXT: Lesson 1 · Why AI-ready data is different
WHAT YOU'LL WALK AWAY WITH

Real skills, real career delta.

Skills you'll gain

  • Diagnose AI-readiness gaps across the 8 dimensions · Working

    Run the 8-dimension scorecard on any dataset (accuracy / completeness / consistency / timeliness / validity / uniqueness + AI-specific representativeness + provenance). Map each gap to a concrete fix in the contract / lineage / eval stack.

  • Author and enforce data contracts with ODCS v3 + dbt · Production

    Write Open Data Contract Standard v3 contracts, generate dbt models with `contract: enforced`, run `datacontract test` in CI, block merges on schema-breaking changes — producer-side, before data lands.

  • Architect a lakehouse with Iceberg + REST catalog · Production

    Stand up Apache Polaris (or Nessie) as an Iceberg REST catalog, evolve schemas safely (add/drop/reorder/rename + type promotion), and integrate with Spark / DuckDB / Trino without rewriting data.

  • Build an embedding-readiness checker (chunking + recall@K) · Production

    Benchmark 3 chunking strategies × 3 embedding models on a 50–200 question gold set; report Recall@5 and faithfulness; ship the 'should we promote this RAG to prod' gate.

  • Wire end-to-end data lineage with OpenLineage · Working

    Emit OpenLineage 1.x events from Airflow + dbt + Spark; receive in Marquez; surface column-level + run-level + dataset-version-level lineage; demonstrate right-to-be-forgotten propagation.

  • Detect and redact PII at ingest with Presidio · Production

    Self-host Presidio analyzer + anonymizer; integrate as a FastAPI gateway in front of training-data ingestion; configure per-entity policies (mask vs hash vs synthetic) for Art. 10 EU AI Act compliance.

  • Ship streaming CDC → vector pipelines for second-scale RAG · Working

    Capture changes from Postgres with Debezium, materialize through RisingWave, upsert into Qdrant within seconds — replacing nightly batch RAG re-indexing.

  • Run a tabular bias audit ready for governance review · Working

    Fairlearn for exploratory disparity scan, AIF360 for mitigation, Aequitas for HTML/CSV reports; cover demographic parity, equalized odds, disparate impact (4/5ths), calibration within groups.

  • Build eval-driven ingestion gates (Soda + RAGAS in CI) · Production

    Soda Core in CI for tabular DQ, DeepEval/RAGAS at staging for RAG, TruLens in production for drift; gate every merge on a tolerance budget for cost, latency, and recall.

  • Comply with EU AI Act Art. 10 + Annex IV data governance · Advanced

    Author the public training-data summary using the AI Office template, document data governance per Art. 10 (relevance, representativeness, error-freedom), keep Annex IV provenance evidence — in time for the 2 Aug 2026 enforcement deadline.
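The 4/5ths disparate-impact test from the bias-audit skill is simple enough to sketch in pure Python. Numbers are illustrative; Fairlearn and Aequitas produce the full governance report:

```python
def disparate_impact(selected_by_group):
    """Selection-rate ratio of each group against the most-selected group.
    The 4/5ths rule flags any ratio below 0.8."""
    rates = {g: sel / total for g, (sel, total) in selected_by_group.items()}
    best = max(rates.values())
    return {g: round(r / best, 2) for g, r in rates.items()}

# (selected, total) per group — illustrative numbers.
ratios = disparate_impact({"A": (50, 100), "B": (30, 100)})
print(ratios)  # {'A': 1.0, 'B': 0.6} — group B fails the 4/5ths test
```

A ratio of 0.6 for group B is well under the 0.8 bar, so this model would not pass a governance review without mitigation — the kind of finding the AIF360 step in the project then tries to fix.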

Career & income delta

Career moves
  • Title yourself credibly as 'AI data engineer' / 'AI data platform engineer' — the 2026 hiring channel for senior IC roles at $180–340K base.
  • Lead the data-readiness workstream on your AI platform team — the biggest unowned mandate in most series-B/C orgs.
  • Pick up contracting work at $180–350/hr fixing the 60% of AI projects Gartner says will be abandoned for non-AI-ready data.
  • Own the 'why is this AI feature failing' line item — the answer is almost always upstream of the model.
  • Become the EU AI Act point person for your org — a rare, durable specialty going into Aug 2026 enforcement.
Income impact
  • $25–60K bump moving from generic data-engineering into an AI-data-platform team in 2026.
  • $50–150K bump for senior ICs adding production AI-ready discipline (contracts + lineage + eval) to their resume.
  • Freelance / consulting rates: $180–350/hr — 'we shipped a RAG and it hallucinates' is the most common 2026 inquiry, and the fix is always data.
  • Enterprise: every six-figure deal that touches the EU now needs an Annex IV / Art. 10 story; the engineer who can produce it commands a premium.
Market resilience
  • Data quality + lineage is upstream of every model — survives any foundation-model consolidation.
  • EU AI Act, NIS2, and emerging US/UK regimes all converge on documented data governance — the demand only grows through 2027.
  • Iceberg + OpenLineage are LF-governed standards — protocol fluency is durable across cloud vendors.
  • Vector + RAG is the visible AI; the invisible foundation is AI-ready data. Recruiters know the second one is harder to hire.
  • If model APIs commoditise, the differentiator becomes proprietary, well-governed data — Bloomberg's lesson, restated for everyone.