AIDQ Course

AI-ready data quality

Lessons: 8 modules

Total: 86 min full study

Quick: 7 min trailer

Projects: 8 Docker labs

CHEATSHEET · 01 · AI-ready data · master cheatsheet

The 8 dimensions

Freshness budgets for AI

Where to enforce contracts

Lineage discipline

·Emit OpenLineage 1.x events from every job (Airflow, dbt, Spark, Flink).
·Marquez or DataHub as the receiver; Atlan/Collibra for enterprise UX.
·Column-level + run-level + dataset-version-level — table-only is not enough in 2026.
·Hash inputs (Shopify Tangle pattern) so unchanged inputs skip downstream rebuilds.
·Feeds the technical documentation that EU AI Act Annex IV requires for high-risk systems.
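
The input-hashing bullet above can be sketched as follows. This is a minimal illustration of content-based skipping, not Shopify's implementation: `fingerprint_inputs`, `run_if_changed`, and the in-memory `cache` dict are hypothetical names, and a real pipeline would persist the fingerprint next to the output dataset version.

```python
import hashlib
import json


def fingerprint_inputs(inputs: dict[str, bytes], params: dict) -> str:
    """Return a stable SHA-256 digest over input contents and job parameters."""
    digest = hashlib.sha256()
    # Sort by name so the digest does not depend on dict insertion order.
    for name in sorted(inputs):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(inputs[name]).digest())
    # Canonical JSON keeps parameter ordering from changing the digest.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def run_if_changed(job_name, inputs, params, cache, build):
    """Call build() only when the fingerprint differs from the cached one.

    `cache` is a hypothetical dict mapping job name to the last fingerprint.
    """
    fp = fingerprint_inputs(inputs, params)
    if cache.get(job_name) == fp:
        return False  # inputs unchanged: skip the downstream rebuild
    build()
    cache[job_name] = fp
    return True
```

Hashing the job parameters together with the input bytes means a config change also triggers a rebuild, even when the data itself is unchanged.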

Production eval gates

·Soda Core in CI for tabular DQ (YAML, SQL-native).
·DeepEval (pytest-style) for unit-style LLM-app tests.
·RAGAS at staging for batch RAG metrics (faithfulness, context precision/recall).
·TruLens / LangSmith / Phoenix in production for drift + traces.
·Enforce cost and latency budgets; fail the build when a measured value exceeds 1.2× its budget.
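
One way to read the budget bullet above, as a hedged sketch: a CI step compares measured cost and latency against agreed limits and fails once a value passes 1.2× the limit. The `Budget` class, the metric names, and the choice to treat a missing measurement as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    name: str
    limit: float             # agreed budget, e.g. p95 seconds or USD per 1k requests
    tolerance: float = 1.2   # fail once measured > limit * tolerance


def budget_violations(measured: dict[str, float], budgets: list[Budget]) -> list[str]:
    """Return one message per budget whose measured value exceeds limit * tolerance."""
    failures = []
    for b in budgets:
        value = measured.get(b.name)
        if value is None:
            # Assumption: a missing metric is treated as a failure, not a pass.
            failures.append(f"{b.name}: no measurement reported")
        elif value > b.limit * b.tolerance:
            failures.append(f"{b.name}: {value:.3f} exceeds {b.limit * b.tolerance:.3f}")
    return failures
```

In a pipeline, the calling script would print the returned messages and exit non-zero when the list is non-empty, which is what fails the build.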

CHEATSHEET · 02 · Tools by 2026 — what to pick

Data quality / contracts

Lakehouse + catalog

·Apache Iceberg — the open format default.
·Apache Polaris (1.0, Jul 2025) — Snowflake-donated, reference Iceberg REST catalog.
·Project Nessie — Git-style branching/merging (great for ML feature backfills).
·Databricks Unity Catalog OSS — also Iceberg-REST compatible since 2024.
·Avoid Delta-only stacks — treat as legacy.

Embedding + vector

Lineage + governance

PII / EU AI Act

·Microsoft Presidio — OSS detector + anonymizer (text + images + structured).
·BigID + Microsoft Purview DSPM — enterprise classification across hybrid + SaaS.
·Privacera (Securiti) — fine-grained access for AI workloads.
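·EU AI Act GPAI obligations have applied since 2 Aug 2025; most remaining provisions apply from 2 Aug 2026, and high-risk rules for AI embedded in regulated products from 2 Aug 2027.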
·Annex IV requires documented provenance for high-risk systems.
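
The detector + anonymizer split named above can be illustrated with a standard-library stand-in. This is not Presidio's API: the two regexes, the `Finding` tuple, and the `<ENTITY>` placeholder format are assumptions chosen to show the two-step shape (find spans, then rewrite them). Real detectors add NER models, checksums, and context words.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    entity: str
    start: int
    end: int


# Deliberately simple patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def detect(text: str) -> list[Finding]:
    """Step 1: return character spans of suspected PII, ordered by position."""
    findings = []
    for entity, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append(Finding(entity, m.start(), m.end()))
    return sorted(findings, key=lambda f: f.start)


def anonymize(text: str, findings: list[Finding]) -> str:
    """Step 2: replace each span with a placeholder such as <EMAIL>."""
    # Replace from the end so earlier offsets stay valid.
    # Overlapping findings are not handled in this sketch.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[: f.start] + f"<{f.entity}>" + text[f.end :]
    return text
```

Keeping detection separate from rewriting lets the findings be logged for provenance records before any text is changed.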

Streaming + CDC

·Debezium — reference CDC connector (Postgres/MySQL/SQL Server/Mongo/Oracle).
·RisingWave — Postgres-compatible streaming DB; ingests CDC directly from source databases, with no separate Kafka or Debezium deployment.
·Apache Flink + Flink CDC 3.x — heavy-duty general-purpose streaming.
·Materialize — strict-serializable for regulated joins (BSL; cloud or self-managed).
·Estuary Flow — multi-destination managed CDC (analytics + ops + AI from one capture).
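
For readers new to CDC, here is a minimal sketch of what a consumer does with a Debezium-style change event. It assumes the event value has already been decoded to a dict with the envelope fields `op`, `before`, and `after`, and that each row carries its primary key in an `id` column. `apply_change` is a hypothetical helper, not part of any connector.

```python
def apply_change(table: dict, event: dict, key: str = "id") -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, and update all carry the new row in `after`
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete carries the old row in `before`; `after` is null
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


replica = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
for e in events:
    apply_change(replica, e)
```

Because inserts and updates are both written as upserts, replaying the same events after a consumer restart leaves the replica in the same state.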

Synthetic + bias