AIDQCourse

AI-ready data quality

Lessons: 8 modules
Total: 86 min full study
Quick: 7 min trailer
Projects: 8 Docker labs
CHEATSHEET · 01 AI-ready data · master cheatsheet
The 8 dimensions
  • ·Accuracy — values match reality (sample + reconcile).
  • ·Completeness — every required field is non-null.
  • ·Consistency — same fact, same answer across systems.
  • ·Timeliness — freshness within the budget for the use case.
  • ·Validity — values match the schema/enum.
  • ·Uniqueness — no duplicate keys.
  • ·Representativeness (AI) — coverage of every distribution the model will see.
  • ·Provenance (AI) — every row carries source URI, version, and consent.
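Three of these dimensions (completeness, validity, uniqueness) reduce to simple ratios. The sketch below is one way to score them in plain Python; the sample rows, field names, and allowed enum are hypothetical, and it assumes a non-empty batch.

```python
# Hypothetical sample rows; field names are illustrative only.
ROWS = [
    {"id": 1, "country": "DE", "email": "a@example.com"},
    {"id": 2, "country": "FR", "email": None},
    {"id": 2, "country": "XX", "email": "c@example.com"},
]
REQUIRED = ("id", "country", "email")
ALLOWED_COUNTRIES = {"DE", "FR", "US"}


def completeness(rows, required):
    """Share of rows where every required field is present and non-null."""
    ok = sum(all(r.get(f) is not None for f in required) for r in rows)
    return ok / len(rows)


def validity(rows, field, allowed):
    """Share of rows whose value for `field` is in the allowed enum."""
    return sum(r.get(field) in allowed for r in rows) / len(rows)


def uniqueness(rows, key):
    """Share of distinct key values among all rows (1.0 means no duplicates)."""
    return len({r[key] for r in rows}) / len(rows)


if __name__ == "__main__":
    print("completeness", completeness(ROWS, REQUIRED))
    print("validity", validity(ROWS, "country", ALLOWED_COUNTRIES))
    print("uniqueness", uniqueness(ROWS, "id"))
```

Accuracy, consistency, and representativeness need an external reference (a source of truth, a second system, a target distribution), so they do not fit a single-table ratio like this.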
Freshness budgets for AI
  • ·Foundation training — weeks/months (cost amortises over compute).
  • ·Fine-tuning / DPO — days (capture recent intent/behaviour).
  • ·RAG indexes — minutes-to-hours (CDC + streaming territory).
  • ·Online features for inference — seconds (Kafka + Flink/RisingWave + Redis).
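A freshness budget can be checked as "age of the data versus allowed age for the use case". The sketch below assumes concrete budget values purely for illustration; the list above gives orders of magnitude only, and the use-case keys are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative budgets only: pick real values per workload.
FRESHNESS_BUDGETS = {
    "foundation_training": timedelta(days=90),
    "fine_tuning": timedelta(days=3),
    "rag_index": timedelta(hours=1),
    "online_features": timedelta(seconds=5),
}


def is_fresh(use_case: str, last_updated: datetime,
             now: Optional[datetime] = None) -> bool:
    """Return True when the data's age is within the budget for the use case."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= FRESHNESS_BUDGETS[use_case]


if __name__ == "__main__":
    stamp = datetime.now(timezone.utc) - timedelta(minutes=30)
    # The same 30-minute-old data passes one budget and fails another.
    print(is_fresh("rag_index", stamp), is_fresh("online_features", stamp))
```

Passing `now` explicitly keeps the check deterministic in tests; timestamps are assumed to be timezone-aware.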
Where to enforce contracts
  • ·Producer side: ODCS v3 + datacontract-cli + Buf for Protobuf streams.
  • ·Warehouse side: dbt model contracts (`contract: {enforced: true}` in the model's YAML config).
  • ·Storage side: Iceberg schema evolution + REST catalog RBAC (Polaris/Nessie).
  • ·Consumer side: Soda Core checks + Great Expectations suites.
  • ·Move enforcement LEFT — block bad data before it lands.
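"Block bad data before it lands" means the whole batch is rejected ahead of any write. The sketch below shows that idea with a hand-written contract in plain Python; it is not ODCS or dbt syntax, and the contract fields and the `ContractViolation` class are hypothetical.

```python
# Hypothetical contract: field name -> (expected type, required?).
CONTRACT = {
    "order_id": (int, True),
    "amount": (float, True),
    "coupon": (str, False),
}


class ContractViolation(Exception):
    """Raised when a batch breaks the contract; nothing should be written."""


def violations(row):
    """List human-readable contract violations for one row."""
    problems = []
    for field, (expected, required) in CONTRACT.items():
        value = row.get(field)
        if value is None:
            if required:
                problems.append(f"{field}: missing required value")
        elif not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Fields the contract does not know about are also rejected.
    for name in sorted(set(row) - set(CONTRACT)):
        problems.append(f"{name}: field not in contract")
    return problems


def enforce(batch):
    """Return the batch unchanged if every row passes, else raise before any write."""
    errors = {}
    for index, row in enumerate(batch):
        found = violations(row)
        if found:
            errors[index] = found
    if errors:
        raise ContractViolation(errors)
    return batch
```

A producer would call `enforce(batch)` immediately before publishing, so a violation stops the pipeline instead of reaching consumers.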
Lineage discipline
  • ·Emit OpenLineage 1.x events from every job (Airflow, dbt, Spark, Flink).
  • ·Marquez or DataHub as the receiver; Atlan/Collibra for enterprise UX.
  • ·Column-level + run-level + dataset-version-level — table-only is not enough in 2026.
  • ·Hash inputs (Shopify Tangle pattern) so unchanged inputs skip downstream rebuilds.
  • ·Lineage records support the technical documentation that EU AI Act Annex IV requires for high-risk systems.
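The input-hashing bullet can be illustrated with generic content hashing. This is a minimal sketch of the idea, not Shopify's Tangle implementation; the function names and the in-memory `seen` store are hypothetical, and inputs are assumed to be JSON-serialisable.

```python
import hashlib
import json


def fingerprint(inputs: dict) -> str:
    """Stable SHA-256 over a job's inputs (sorted keys make ordering irrelevant)."""
    payload = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_rebuild(job: str, inputs: dict, seen: dict) -> bool:
    """Return True and record the new fingerprint only when the inputs changed."""
    digest = fingerprint(inputs)
    if seen.get(job) == digest:
        return False
    seen[job] = digest
    return True


if __name__ == "__main__":
    seen = {}
    inputs = {"table": "orders", "snapshot_id": 42, "code_version": "abc123"}
    print(should_rebuild("daily_features", inputs, seen))  # first run
    print(should_rebuild("daily_features", inputs, seen))  # unchanged inputs
```

In practice the fingerprint would cover dataset versions and code or config versions, and `seen` would live in durable storage rather than a dict.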
Production eval gates
  • ·Soda Core in CI for tabular DQ (YAML, SQL-native).
  • ·DeepEval (pytest-style) for unit-style LLM-app tests.
  • ·RAGAS at staging for batch RAG metrics (faithfulness, context precision/recall).
  • ·TruLens / LangSmith / Phoenix in production for drift + traces.
  • ·Enforce cost and latency budgets; fail the build when a metric exceeds 1.2× its budget.
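The budget gate in the last bullet can be sketched as a small CI check. It reads "1.2× tolerance" as "measured value above 1.2 times the budget", which is an assumption; the metric names and budget values are hypothetical.

```python
import sys

TOLERANCE = 1.2  # Assumed meaning: fail above 1.2 times the budget.

# Hypothetical budgets and metric names.
BUDGETS = {"p95_latency_ms": 800.0, "cost_usd_per_1k_requests": 2.50}


def breaches(measured, budgets=BUDGETS, tolerance=TOLERANCE):
    """Return the metrics whose measured value exceeds budget * tolerance."""
    return {
        name: value
        for name, value in measured.items()
        if name in budgets and value > budgets[name] * tolerance
    }


if __name__ == "__main__":
    measured = {"p95_latency_ms": 1000.0, "cost_usd_per_1k_requests": 2.0}
    failed = breaches(measured)
    for name, value in failed.items():
        print(f"{name}: {value} exceeds {BUDGETS[name]} x {TOLERANCE}")
    sys.exit(1 if failed else 0)  # A non-zero exit code fails the build.
```

Metrics without a configured budget are ignored here; a stricter gate could treat them as failures.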
CHEATSHEET · 02 Tools by 2026: what to pick
Data quality / contracts
  • ·ODCS v3 + datacontract-cli — producer-side contracts (12+ export formats).
  • ·dbt model contracts (dbt-core 1.10+) — warehouse-side enforcement.
  • ·Soda Core — production monitoring (YAML + SQL-native).
  • ·Great Expectations — Python-first deep validation.
  • ·Buf — Protobuf governance for streaming schemas.
Lakehouse + catalog
  • ·Apache Iceberg — the open format default.
  • ·Apache Polaris (1.0, Oct 2025) — Snowflake-donated, reference Iceberg REST catalog.
  • ·Project Nessie — Git-style branching/merging (great for ML feature backfills).
  • ·Databricks Unity Catalog OSS — also Iceberg-REST compatible since 2024.
  • ·Avoid Delta-only stacks — treat as legacy.
Embedding + vector
  • ·Voyage-3-large — best for domain-specific (code/legal/medical/finance).
  • ·Cohere embed-v4 — best multilingual (100+ languages).
  • ·BGE-M3 — best open-source; dense + sparse + multi-vector in one model.
  • ·Qdrant / Milvus / Weaviate / pgvector / Turbopuffer — pick by ops shape.
  • ·MTEB v2 = sanity check; gold set on YOUR corpus = ground truth.
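A gold set on your own corpus is usually scored with a retrieval metric such as recall@k. The sketch below assumes a gold set that maps each query to its relevant document ids and a ranked result list per query; all ids and names are hypothetical, and no embedding model is called.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("gold set entry has no relevant documents")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, gold, k):
    """Average recall@k over every query in the gold set."""
    scores = [recall_at_k(runs.get(query, []), relevant, k)
              for query, relevant in gold.items()]
    return sum(scores) / len(scores)


# Hypothetical gold set (query -> relevant doc ids) and one model's ranked results.
GOLD = {"q1": {"d1", "d2"}, "q2": {"d7"}}
RUNS = {"q1": ["d1", "d9", "d2"], "q2": ["d3", "d4", "d7"]}

if __name__ == "__main__":
    for k in (2, 3):
        print(k, mean_recall_at_k(RUNS, GOLD, k))
```

Running the same gold set against each candidate model gives a like-for-like comparison that a public leaderboard cannot provide.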
Lineage + governance
  • ·OpenLineage 1.x — the LF AI & Data standard.
  • ·Marquez — the reference receiver (OSS).
  • ·DataHub — large OSS deployments + UI.
  • ·Atlan / Collibra / IBM MANTA / Acceldata — enterprise.
PII / EU AI Act
  • ·Microsoft Presidio — OSS detector + anonymizer (text + images + structured).
  • ·BigID + Microsoft Purview DSPM — enterprise classification across hybrid + SaaS.
  • ·Privacera (Securiti) — fine-grained access for AI workloads.
  • ·EU AI Act GPAI obligations apply from 2 Aug 2025; most remaining provisions apply from 2 Aug 2026.
  • ·Annex IV requires documented provenance for high-risk systems.
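Documented provenance is easier to prove when every row is checked for its provenance fields at ingest. The sketch below assumes three hypothetical field names (`source_uri`, `source_version`, `consent`) for the source URI, version, and consent that the provenance dimension calls for; it is not a compliance check.

```python
from urllib.parse import urlparse

# Field names are hypothetical placeholders for source URI, version, and consent.
PROVENANCE_FIELDS = ("source_uri", "source_version", "consent")


def provenance_gaps(row):
    """Return the provenance problems for one row (empty list means OK)."""
    gaps = [f"missing {field}" for field in PROVENANCE_FIELDS if not row.get(field)]
    uri = row.get("source_uri")
    if uri and not urlparse(uri).scheme:
        gaps.append("source_uri has no scheme")
    return gaps


if __name__ == "__main__":
    row = {"text": "...", "source_uri": "s3://bucket/raw/file.jsonl",
           "source_version": "2025-11-01", "consent": "opt-in"}
    print(provenance_gaps(row))
    print(provenance_gaps({"text": "...", "source_uri": "raw/file.jsonl"}))
```

Empty strings and `None` both count as missing here; how consent is encoded (flag, licence id, legal basis) is left open.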
Streaming + CDC
  • ·Debezium — reference CDC connector (Postgres/MySQL/SQL Server/Mongo/Oracle).
  • ·RisingWave — Postgres-compatible streaming DB; ingests CDC without Kafka or Debezium.
  • ·Apache Flink + Flink CDC 3.x — heavy-duty general-purpose streaming.
  • ·Materialize — strict-serializable for regulated joins (BSL, cloud-only).
  • ·Estuary Flow — multi-destination managed CDC (analytics + ops + AI from one capture).
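Whatever tool captures the changes, the consumer ends up applying create, update, and delete events in order. The sketch below applies simplified Debezium-style events (only the `op`, `before`, and `after` fields of the envelope) to an in-memory table; the `apply_change` helper and the `id` key are hypothetical.

```python
def apply_change(table, event, key="id"):
    """Apply one Debezium-style change event to an in-memory table keyed by `key`."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":  # delete
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


if __name__ == "__main__":
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
        {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}},
        {"op": "c", "before": None, "after": {"id": 2, "name": "c"}},
        {"op": "d", "before": {"id": 2, "name": "c"}, "after": None},
    ]
    table = {}
    for event in events:
        apply_change(table, event)
    print(table)
```

Upserting on create and tolerating deletes of missing keys keeps replays idempotent, which matters when a stream is re-read after a failure.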
Synthetic + bias
  • ·MOSTLY AI — financial / time-series; demographic-parity controls.
  • ·Gretel — developer-first; differential privacy native.
  • ·SDV — OSS go-to (CTGAN, GaussianCopula, multi-table).
  • ·Fairlearn — sklearn-native fairness scan.
  • ·AIF360 — 70+ metrics + 11 mitigation algorithms.
  • ·Aequitas — governance-ready HTML/CSV reports.
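Demographic parity, the control mentioned for MOSTLY AI and one of the metrics fairness libraries report, compares positive-prediction rates across groups. The hand-rolled sketch below only illustrates the definition; Fairlearn ships a metric of the same name, and the predictions and group labels here are made up.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Positive-prediction rate per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest group selection rate (0.0 means parity)."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    # Made-up predictions for two groups of four people each.
    y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(selection_rates(y_pred, groups))
    print(demographic_parity_difference(y_pred, groups))
```

The metric ignores ground-truth labels by design, so it is usually read alongside error-rate metrics such as equalized odds.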