AIDQMOD.AIDQ-08 · v1.0

Pipelines for
real workloads,
not demos.

8 micro-lessons · ~84 min · Real Docker images

QUALITY CALIBRATOR (CALIBRATOR.A) · LIVE · 6 DIM · 1.5s tick
COMPLETENESS 94% · FRESHNESS 88% · ACCURACY 92% · CONSISTENCY 76% · VALIDITY 81% · UNIQUENESS 99%
PASS >85 · WARN 76-85 · FAIL <76
5/6 PASS
AIDQ · DATA ENGINEERING · TRENDING

AI-ready data quality

Data your AI can actually use — contracts, lakehouse, lineage, eval.

Gartner (Feb 2025) predicts that through 2026, organizations will abandon 60% of AI projects due to data that isn't AI-ready. Poor data quality costs organizations an average of $12.9M per year. EU AI Act GPAI obligations have applied since 2 Aug 2025; full enforcement powers begin 2 Aug 2026. AI-ready data is now a CI/CD discipline, not a quarterly cleanup.
WHAT YOU'LL LEARN
01 · Why AI-ready data is different
02 · Data contracts: ODCS v3 + dbt
03 · Lakehouse: Iceberg + Polaris/Nessie
04 · Embedding readiness: chunking + recall@K
05 · Lineage with OpenLineage + Marquez
06 · PII + EU AI Act compliance
07 · Streaming freshness: CDC → vectors
08 · Eval-driven ingestion + bias audit + production
YOU'LL BE ABLE TO
Score any dataset across the 8 AI-ready dimensions and fix the lowest
Ship producer-side contracts (ODCS v3 + dbt) that block bad data before it lands
Stand up an Iceberg + REST-catalog lakehouse with safe schema evolution
Bake off chunkers × embedders against a 50-question gold set; gate CI on Recall@5 ≥ 0.85
Wire OpenLineage + Marquez for EU AI Act Annex IV-grade provenance
Redact PII at ingest with self-hosted Presidio; produce the AI Office training-data summary
Replace nightly batch RAG re-indexing with second-scale CDC → vector pipelines
Produce a governance-ready bias audit (Fairlearn + AIF360 + Aequitas) for any classifier
SKILLS YOU'LL GAIN

Real skills, real career delta.

10 skills
  • Diagnose AI-readiness gaps across the 8 dimensions · Working

    Run the 8-dimension scorecard on any dataset (accuracy / completeness / consistency / timeliness / validity / uniqueness + AI-specific representativeness + provenance). Map each gap to a concrete fix in the contract / lineage / eval stack.
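The scorecard-and-gate idea can be sketched in a few lines of Python. This is a toy, not the course's tooling: the dimension scores are made up, and `grade`/`scorecard` are illustrative names, but the bands mirror the PASS >85 / WARN 76-85 / FAIL <76 thresholds used throughout.

```python
# Toy 8-dimension AI-readiness scorecard (illustrative names and scores).
PASS, WARN = 85, 76  # score > 85 passes, 76-85 warns, < 76 fails

def grade(score: float) -> str:
    if score > PASS:
        return "PASS"
    return "WARN" if score >= WARN else "FAIL"

def scorecard(scores: dict[str, float]) -> dict:
    graded = {dim: grade(s) for dim, s in scores.items()}
    worst = min(scores, key=scores.get)  # lowest dimension -> fix it first
    return {"grades": graded, "fix_first": worst}

scores = {
    "accuracy": 92, "completeness": 94, "consistency": 76,
    "timeliness": 88, "validity": 81, "uniqueness": 99,
    "representativeness": 83, "provenance": 90,
}
report = scorecard(scores)
```

The "fix the lowest dimension first" rule is the whole point of scoring: it turns a vague "our data is bad" into one concrete next action.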

  • Author and enforce data contracts with ODCS v3 + dbt · Production

    Write Open Data Contract Standard v3 contracts, generate dbt models with `contract: enforced`, run `datacontract test` in CI, block merges on schema-breaking changes — producer-side, before data lands.
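A hypothetical producer-side check in plain Python illustrates the blocking idea. Real contracts are ODCS v3 YAML exercised with `datacontract test` and dbt's `contract: enforced`, so treat `CONTRACT` and `violations` here as illustrative stand-ins:

```python
# Hypothetical contract: field -> (type, required). A real ODCS v3
# contract is YAML and carries far more (SLAs, owners, quality checks).
CONTRACT = {
    "order_id": (int, True),
    "email": (str, True),
    "discount": (float, False),
}

def violations(row: dict) -> list[str]:
    errs = []
    for field, (ftype, required) in CONTRACT.items():
        if field not in row:
            if required:
                errs.append(f"missing required field: {field}")
        elif not isinstance(row[field], ftype):
            errs.append(f"{field}: expected {ftype.__name__}")
    extra = set(row) - set(CONTRACT)  # schema-breaking additions
    errs += [f"unexpected field: {f}" for f in sorted(extra)]
    return errs

good = {"order_id": 1, "email": "a@b.co"}
bad = {"order_id": "1", "comment": "hi"}
```

In CI, a non-empty `violations` list is what fails the merge: the producer fixes the data (or versions the contract) before anything lands downstream.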

  • Architect a lakehouse with Iceberg + REST catalog · Production

    Stand up Apache Polaris (or Nessie) as an Iceberg REST catalog, evolve schemas safely (add/drop/reorder/rename + type promotion), and integrate with Spark / DuckDB / Trino without rewriting data.
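Iceberg only permits certain type promotions without rewriting data files (e.g. int → long, float → double). A toy pre-flight check, assuming this simplified subset of the rules; real evolution goes through the catalog and table API:

```python
# Simplified subset of Iceberg's safe type promotions.
SAFE_PROMOTIONS = {("int", "long"), ("float", "double")}

def promotion_safe(old: str, new: str) -> bool:
    """True if changing a column from `old` to `new` needs no rewrite."""
    return old == new or (old, new) in SAFE_PROMOTIONS
```

Note the asymmetry: widening is safe, narrowing is not, which is why a review gate on schema changes checks direction, not just "the type changed".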

  • Build an embedding-readiness checker (chunking + recall@K) · Production

    Benchmark 3 chunking strategies × 3 embedding models on a 50–200 question gold set; report Recall@5 and faithfulness; ship the 'should we promote this RAG to prod' gate.
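Recall@K itself is simple to compute; a sketch with made-up retrieval results and gold labels:

```python
# Recall@K: fraction of gold questions whose relevant chunk appears in
# the top-K retrieved results for that question.
def recall_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    hits = sum(g in r[:k] for r, g in zip(results, gold))
    return hits / len(gold)

retrieved = [["c3", "c1", "c9", "c2", "c7"],   # gold chunk c1 -> hit
             ["c4", "c8", "c5", "c1", "c6"]]   # gold chunk c2 -> miss
gold_chunks = ["c1", "c2"]
score = recall_at_k(retrieved, gold_chunks, k=5)
gate_passes = score >= 0.85  # the promotion gate from the lesson
```

Run this once per (chunker, embedder) cell of the bake-off grid and the winner falls out of a table rather than an opinion.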

  • Wire end-to-end data lineage with OpenLineage · Working

    Emit OpenLineage 1.x events from Airflow + dbt + Spark; receive in Marquez; surface column-level + run-level + dataset-version-level lineage; demonstrate right-to-be-forgotten propagation.
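Right-to-be-forgotten propagation reduces to a reachability query over the lineage graph. A toy sketch with a hypothetical dataset graph (Marquez's real API and event model differ):

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets built from it.
EDGES = {
    "raw.users": ["staging.users"],
    "staging.users": ["marts.user_features", "marts.emails"],
    "marts.user_features": ["ml.training_set"],
}

def downstream(dataset: str) -> set[str]:
    """Every transitive descendant a deletion request must reach."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

affected = downstream("raw.users")
```

The point of dataset-version-level lineage is that `affected` is computed from recorded events, not from tribal knowledge of which jobs read what.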

  • Detect and redact PII at ingest with Presidio · Production

    Self-host Presidio analyzer + anonymizer; integrate as a FastAPI gateway in front of training-data ingestion; configure per-entity policies (mask vs hash vs synthetic) for Art. 10 EU AI Act compliance.
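A toy stand-in for the gateway's per-entity policies, using two regexes instead of Presidio's recognizers (patterns and policy names are illustrative, not Presidio's API):

```python
import hashlib
import re

# Illustrative detectors; Presidio ships far more robust recognizers.
PATTERNS = {
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}
# Per-entity policy: mask destroys the value, hash keeps a stable
# pseudonym so records can still be joined without raw PII.
POLICY = {"PHONE": "mask", "EMAIL": "hash"}

def redact(text: str) -> str:
    for entity, pattern in PATTERNS.items():
        if POLICY[entity] == "mask":
            text = pattern.sub(f"<{entity}>", text)
        else:
            text = pattern.sub(
                lambda m: hashlib.sha256(m.group().encode()).hexdigest()[:10],
                text,
            )
    return text

out = redact("mail ana@example.com or call +49 171 2345678")
```

The mask-vs-hash distinction is the policy decision that matters for Art. 10: hashing preserves analytic utility, masking minimizes risk.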

  • Ship streaming CDC → vector pipelines for second-scale RAG · Working

    Capture changes from Postgres with Debezium, materialize through RisingWave, upsert into Qdrant within seconds — replacing nightly batch RAG re-indexing.
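The upsert semantics that make this safe to replay can be sketched without the real stack (a dict stands in for Qdrant; the event shape is illustrative, not Debezium's envelope):

```python
# Apply CDC events keyed by primary key, idempotently: replays and
# out-of-order duplicates of the same final state converge.
def apply_cdc(store: dict, events: list[dict]) -> dict:
    for ev in events:
        if ev["op"] == "delete":
            store.pop(ev["pk"], None)
        else:  # create and update both upsert the latest embedding
            store[ev["pk"]] = ev["vector"]
    return store

events = [
    {"op": "create", "pk": 1, "vector": [0.1, 0.2]},
    {"op": "update", "pk": 1, "vector": [0.3, 0.4]},
    {"op": "create", "pk": 2, "vector": [0.5, 0.6]},
    {"op": "delete", "pk": 2},
]
index = apply_cdc({}, events)
```

Upsert-by-key is why the streaming path can replace the nightly batch: the index is always the fold of the change stream, never a stale snapshot.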

  • Run a tabular bias audit ready for governance review · Working

    Fairlearn for exploratory disparity scan, AIF360 for mitigation, Aequitas for HTML/CSV reports; cover demographic parity, equalized odds, disparate impact (4/5ths), calibration within groups.
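The four-fifths (disparate impact) check is easy to state directly; a sketch on toy predictions and group labels, independent of any of the three libraries:

```python
# Disparate impact: ratio of the least-favored group's selection rate
# to the most-favored group's. The four-fifths rule flags ratio < 0.8.
def selection_rates(preds: list[int], groups: list[str]) -> dict[str, float]:
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(preds[i] for i in idx) / len(idx)
    return rates

def disparate_impact(preds: list[int], groups: list[str]) -> float:
    rates = selection_rates(preds, groups)
    return min(rates.values()) / max(rates.values())

preds  = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = disparate_impact(preds, groups)   # 0.25 / 0.75
passes_four_fifths = ratio >= 0.8
```

Aequitas computes the same quantity per group in its reports; having the formula in your head makes those reports readable.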

  • Build eval-driven ingestion gates (Soda + RAGAS in CI) · Production

    Soda Core in CI for tabular DQ, DeepEval/RAGAS at staging for RAG, TruLens in production for drift; gate every merge on a tolerance budget for cost, latency, and recall.
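A minimal sketch of such a merge gate, with illustrative budget numbers (the metric names and thresholds are assumptions, not the course's):

```python
# Tolerance budget: cost and latency have ceilings, recall has a floor.
BUDGET = {"cost_usd_per_1k": 0.50, "p95_latency_ms": 800, "recall_at_5": 0.85}

def gate(metrics: dict) -> list[str]:
    """Return the list of budget violations; empty list means merge."""
    failures = []
    if metrics["cost_usd_per_1k"] > BUDGET["cost_usd_per_1k"]:
        failures.append("cost over budget")
    if metrics["p95_latency_ms"] > BUDGET["p95_latency_ms"]:
        failures.append("latency over budget")
    if metrics["recall_at_5"] < BUDGET["recall_at_5"]:
        failures.append("recall below floor")
    return failures

ok = gate({"cost_usd_per_1k": 0.30, "p95_latency_ms": 640, "recall_at_5": 0.91})
bad = gate({"cost_usd_per_1k": 0.30, "p95_latency_ms": 950, "recall_at_5": 0.80})
```

Keeping all three metrics in one budget is deliberate: a PR that improves recall by doubling cost should fail, not quietly ship.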

  • Comply with EU AI Act Art. 10 + Annex IV data governance · Advanced

    Author the public training-data summary using the AI Office template, document data governance per Art. 10 (relevance, representativeness, error-freedom), keep Annex IV provenance evidence — in time for the 2 Aug 2026 enforcement deadline.

RUNNABLE ON YOUR MACHINE
$ docker pull snap/ai-ready-data:dq-gate
$ docker run --rm -it snap/ai-ready-data:dq-gate
QUICK PREVIEW · 7 MIN
VERIFIED ENGINEER REVIEWS
The dq-gate-pipeline Docker is now the first PR check on our dbt repo. Caught two breaking-schema PRs in week one.
@aidq_amelie · VERIFY ON GITHUB
I used the rag-readiness-checker bake-off to retire our text-embedding-3-large + recursive-128 setup. BGE-M3 + recursive-512 won — same recall, 1/8 the cost.
@platform_renee · VERIFY ON GITHUB
LESSONS 8
HOURS ~1.4
LEARNERS 1,980
THIS WEEK +21%