AIDQMOD.AIDQ-08 · v1.0

Pipelines for
real workloads,
not demos.

8 micro-lessons · ~84 min · Real Docker images

QUALITY CALIBRATOR (CALIBRATOR.A) · LIVE · 6 DIM · 1.5s tick
COMPLETENESS 94% · FRESHNESS 88% · ACCURACY 92% · CONSISTENCY 76% · VALIDITY 81% · UNIQUENESS 99%
PASS >85 · WARN 76-85 · FAIL <76
5/6 PASS
AIDQ · DATA ENGINEERING · TRENDING

AI-ready data quality

Data your AI can actually use — contracts, lakehouse, lineage, eval.

Gartner (Feb 2025) predicts that through 2026, organizations will abandon 60% of AI projects due to data that isn't AI-ready. Poor data quality costs organizations an average of $12.9M per year. EU AI Act GPAI obligations have applied since 2 Aug 2025; full enforcement powers begin 2 Aug 2026. AI-ready data is now a CI/CD discipline, not a quarterly cleanup.
WHAT YOU'LL LEARN
01 · Why AI-ready data is different
02 · Data contracts: ODCS v3 + dbt
03 · Lakehouse: Iceberg + Polaris/Nessie
04 · Embedding readiness: chunking + recall@K
05 · Lineage with OpenLineage + Marquez
06 · PII + EU AI Act compliance
07 · Streaming freshness: CDC → vectors
08 · Eval-driven ingestion + bias audit + production
YOU'LL BE ABLE TO
Score any dataset across the 8 AI-ready dimensions and fix the lowest
Ship producer-side contracts (ODCS v3 + dbt) that block bad data before it lands
Stand up an Iceberg + REST-catalog lakehouse with safe schema evolution
Bake off chunkers × embedders against a 50-question gold set; gate CI on Recall@5 ≥ 0.85
Wire OpenLineage + Marquez for EU AI Act Annex IV-grade provenance
Redact PII at ingest with self-hosted Presidio; produce the AI Office training-data summary
Replace nightly batch RAG re-indexing with second-scale CDC → vector pipelines
Produce a governance-ready bias audit (Fairlearn + AIF360 + Aequitas) for any classifier
SKILLS YOU'LL GAIN

Real skills, real career delta.

10 skills
  • Diagnose AI-readiness gaps across the 8 dimensions · Working

    Run the 8-dimension scorecard on any dataset (accuracy / completeness / consistency / timeliness / validity / uniqueness + AI-specific representativeness + provenance). Map each gap to a concrete fix in the contract / lineage / eval stack.
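The scorecard-and-gate idea can be sketched in a few lines of Python. This is a toy, not the course's tooling: the dimension scores are made up, and `grade`/`scorecard` are illustrative names, but the bands mirror the PASS >85 / WARN 76-85 / FAIL <76 thresholds used throughout.

```python
# Toy 8-dimension AI-readiness scorecard (illustrative names and scores).
PASS, WARN = 85, 76  # score > 85 passes, 76-85 warns, < 76 fails

def grade(score: float) -> str:
    if score > PASS:
        return "PASS"
    return "WARN" if score >= WARN else "FAIL"

def scorecard(scores: dict[str, float]) -> dict:
    graded = {dim: grade(s) for dim, s in scores.items()}
    worst = min(scores, key=scores.get)  # lowest dimension -> fix it first
    return {"grades": graded, "fix_first": worst}

scores = {
    "accuracy": 92, "completeness": 94, "consistency": 76,
    "timeliness": 88, "validity": 81, "uniqueness": 99,
    "representativeness": 83, "provenance": 90,
}
report = scorecard(scores)
```

The "fix the lowest dimension first" rule is the whole point of scoring: it turns a vague "our data is bad" into one concrete next action.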

  • Author and enforce data contracts with ODCS v3 + dbt · Production

    Write Open Data Contract Standard v3 contracts, generate dbt models with `contract: enforced`, run `datacontract test` in CI, block merges on schema-breaking changes — producer-side, before data lands.
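A hypothetical producer-side check in plain Python illustrates the blocking idea. Real contracts are ODCS v3 YAML exercised with `datacontract test` and dbt's `contract: enforced`, so treat `CONTRACT` and `violations` here as illustrative stand-ins:

```python
# Hypothetical contract: field -> (type, required). A real ODCS v3
# contract is YAML and carries far more (SLAs, owners, quality checks).
CONTRACT = {
    "order_id": (int, True),
    "email": (str, True),
    "discount": (float, False),
}

def violations(row: dict) -> list[str]:
    errs = []
    for field, (ftype, required) in CONTRACT.items():
        if field not in row:
            if required:
                errs.append(f"missing required field: {field}")
        elif not isinstance(row[field], ftype):
            errs.append(f"{field}: expected {ftype.__name__}")
    extra = set(row) - set(CONTRACT)  # schema-breaking additions
    errs += [f"unexpected field: {f}" for f in sorted(extra)]
    return errs

good = {"order_id": 1, "email": "a@b.co"}
bad = {"order_id": "1", "comment": "hi"}
```

In CI, a non-empty `violations` list is what fails the merge: the producer fixes the data (or versions the contract) before anything lands downstream.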

  • Architect a lakehouse with Iceberg + REST catalog · Production

    Stand up Apache Polaris (or Nessie) as an Iceberg REST catalog, evolve schemas safely (add/drop/reorder/rename + type promotion), and integrate with Spark / DuckDB / Trino without rewriting data.
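Iceberg only permits certain type promotions without rewriting data files (e.g. int → long, float → double). A toy pre-flight check, assuming this simplified subset of the rules; real evolution goes through the catalog and table API:

```python
# Simplified subset of Iceberg's safe type promotions.
SAFE_PROMOTIONS = {("int", "long"), ("float", "double")}

def promotion_safe(old: str, new: str) -> bool:
    """True if changing a column from `old` to `new` needs no rewrite."""
    return old == new or (old, new) in SAFE_PROMOTIONS
```

Note the asymmetry: widening is safe, narrowing is not, which is why a review gate on schema changes checks direction, not just "the type changed".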

  • Build an embedding-readiness checker (chunking + recall@K) · Production

    Benchmark 3 chunking strategies × 3 embedding models on a 50–200 question gold set; report Recall@5 and faithfulness; ship the 'should we promote this RAG to prod' gate.
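Recall@K itself is simple to compute; a sketch with made-up retrieval results and gold labels:

```python
# Recall@K: fraction of gold questions whose relevant chunk appears in
# the top-K retrieved results for that question.
def recall_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    hits = sum(g in r[:k] for r, g in zip(results, gold))
    return hits / len(gold)

retrieved = [["c3", "c1", "c9", "c2", "c7"],   # gold chunk c1 -> hit
             ["c4", "c8", "c5", "c1", "c6"]]   # gold chunk c2 -> miss
gold_chunks = ["c1", "c2"]
score = recall_at_k(retrieved, gold_chunks, k=5)
gate_passes = score >= 0.85  # the promotion gate from the lesson
```

Run this once per (chunker, embedder) cell of the bake-off grid and the winner falls out of a table rather than an opinion.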

  • Wire end-to-end data lineage with OpenLineage · Working

    Emit OpenLineage 1.x events from Airflow + dbt + Spark; receive in Marquez; surface column-level + run-level + dataset-version-level lineage; demonstrate right-to-be-forgotten propagation.
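Right-to-be-forgotten propagation reduces to a reachability query over the lineage graph. A toy sketch with a hypothetical dataset graph (Marquez's real API and event model differ):

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets built from it.
EDGES = {
    "raw.users": ["staging.users"],
    "staging.users": ["marts.user_features", "marts.emails"],
    "marts.user_features": ["ml.training_set"],
}

def downstream(dataset: str) -> set[str]:
    """Every transitive descendant a deletion request must reach."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in EDGES.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

affected = downstream("raw.users")
```

The point of dataset-version-level lineage is that `affected` is computed from recorded events, not from tribal knowledge of which jobs read what.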

  • Detect and redact PII at ingest with Presidio · Production

    Self-host Presidio analyzer + anonymizer; integrate as a FastAPI gateway in front of training-data ingestion; configure per-entity policies (mask vs hash vs synthetic) for Art. 10 EU AI Act compliance.
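A toy stand-in for the gateway's per-entity policies, using two regexes instead of Presidio's recognizers (patterns and policy names are illustrative, not Presidio's API):

```python
import hashlib
import re

# Illustrative detectors; Presidio ships far more robust recognizers.
PATTERNS = {
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}
# Per-entity policy: mask destroys the value, hash keeps a stable
# pseudonym so records can still be joined without raw PII.
POLICY = {"PHONE": "mask", "EMAIL": "hash"}

def redact(text: str) -> str:
    for entity, pattern in PATTERNS.items():
        if POLICY[entity] == "mask":
            text = pattern.sub(f"<{entity}>", text)
        else:
            text = pattern.sub(
                lambda m: hashlib.sha256(m.group().encode()).hexdigest()[:10],
                text,
            )
    return text

out = redact("mail ana@example.com or call +49 171 2345678")
```

The mask-vs-hash distinction is the policy decision that matters for Art. 10: hashing preserves analytic utility, masking minimizes risk.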

  • Ship streaming CDC → vector pipelines for second-scale RAG · Working

    Capture changes from Postgres with Debezium, materialize through RisingWave, upsert into Qdrant within seconds — replacing nightly batch RAG re-indexing.
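The upsert semantics that make this safe to replay can be sketched without the real stack (a dict stands in for Qdrant; the event shape is illustrative, not Debezium's envelope):

```python
# Apply CDC events keyed by primary key, idempotently: replays and
# out-of-order duplicates of the same final state converge.
def apply_cdc(store: dict, events: list[dict]) -> dict:
    for ev in events:
        if ev["op"] == "delete":
            store.pop(ev["pk"], None)
        else:  # create and update both upsert the latest embedding
            store[ev["pk"]] = ev["vector"]
    return store

events = [
    {"op": "create", "pk": 1, "vector": [0.1, 0.2]},
    {"op": "update", "pk": 1, "vector": [0.3, 0.4]},
    {"op": "create", "pk": 2, "vector": [0.5, 0.6]},
    {"op": "delete", "pk": 2},
]
index = apply_cdc({}, events)
```

Upsert-by-key is why the streaming path can replace the nightly batch: the index is always the fold of the change stream, never a stale snapshot.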

  • Run a tabular bias audit ready for governance review · Working

    Fairlearn for exploratory disparity scan, AIF360 for mitigation, Aequitas for HTML/CSV reports; cover demographic parity, equalized odds, disparate impact (4/5ths), calibration within groups.
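The four-fifths (disparate impact) check is easy to state directly; a sketch on toy predictions and group labels, independent of any of the three libraries:

```python
# Disparate impact: ratio of the least-favored group's selection rate
# to the most-favored group's. The four-fifths rule flags ratio < 0.8.
def selection_rates(preds: list[int], groups: list[str]) -> dict[str, float]:
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = sum(preds[i] for i in idx) / len(idx)
    return rates

def disparate_impact(preds: list[int], groups: list[str]) -> float:
    rates = selection_rates(preds, groups)
    return min(rates.values()) / max(rates.values())

preds  = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = disparate_impact(preds, groups)   # 0.25 / 0.75
passes_four_fifths = ratio >= 0.8
```

Aequitas computes the same quantity per group in its reports; having the formula in your head makes those reports readable.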

  • Build eval-driven ingestion gates (Soda + RAGAS in CI) · Production

    Soda Core in CI for tabular DQ, DeepEval/RAGAS at staging for RAG, TruLens in production for drift; gate every merge on a tolerance budget for cost, latency, and recall.
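A minimal sketch of such a merge gate, with illustrative budget numbers (the metric names and thresholds are assumptions, not the course's):

```python
# Tolerance budget: cost and latency have ceilings, recall has a floor.
BUDGET = {"cost_usd_per_1k": 0.50, "p95_latency_ms": 800, "recall_at_5": 0.85}

def gate(metrics: dict) -> list[str]:
    """Return the list of budget violations; empty list means merge."""
    failures = []
    if metrics["cost_usd_per_1k"] > BUDGET["cost_usd_per_1k"]:
        failures.append("cost over budget")
    if metrics["p95_latency_ms"] > BUDGET["p95_latency_ms"]:
        failures.append("latency over budget")
    if metrics["recall_at_5"] < BUDGET["recall_at_5"]:
        failures.append("recall below floor")
    return failures

ok = gate({"cost_usd_per_1k": 0.30, "p95_latency_ms": 640, "recall_at_5": 0.91})
bad = gate({"cost_usd_per_1k": 0.30, "p95_latency_ms": 950, "recall_at_5": 0.80})
```

Keeping all three metrics in one budget is deliberate: a PR that improves recall by doubling cost should fail, not quietly ship.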

  • Comply with EU AI Act Art. 10 + Annex IV data governance · Advanced

    Author the public training-data summary using the AI Office template, document data governance per Art. 10 (relevance, representativeness, error-freedom), keep Annex IV provenance evidence — in time for the 2 Aug 2026 enforcement deadline.

RUNNABLE ON YOUR MACHINE
$ docker pull snap/ai-ready-data:dq-gate
$ docker run --rm -it snap/ai-ready-data:dq-gate
QUICK PREVIEW · 7 MIN
VERIFIED ENGINEER REVIEWS
The dq-gate-pipeline Docker is now the first PR check on our dbt repo. Caught two breaking-schema PRs in week one.
@aidq_amelie · VERIFY ON GITHUB
I used the rag-readiness-checker bake-off to retire our text-embedding-3-large + recursive-128 setup. BGE-M3 + recursive-512 won — same recall, 1/8 the cost.
@platform_renee · VERIFY ON GITHUB
LESSONS 8
HOURS ~1.4
LEARNERS 1,980
THIS WEEK +21%