CHEATSHEET · 01 · AI-ready data · master cheatsheet
The 8 dimensions
- ·Accuracy — values match reality (sample + reconcile).
- ·Completeness — every required field is non-null.
- ·Consistency — same fact, same answer across systems.
- ·Timeliness — freshness within the budget for the use case.
- ·Validity — values match the schema/enum.
- ·Uniqueness — no duplicate keys.
- ·Representativeness (AI) — coverage of every distribution the model will see.
- ·Provenance (AI) — every row carries source URI, version, and consent.
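Three of these dimensions (completeness, validity, uniqueness) can be checked mechanically. The following is a minimal sketch, assuming rows are plain dicts; the field names, the `ALLOWED_STATUS` enum, and `profile_rows` are hypothetical and not taken from any library.

```python
from collections import Counter

# Hypothetical schema: required fields and an enum for "status".
REQUIRED_FIELDS = ("id", "status", "source_uri")
ALLOWED_STATUS = {"active", "inactive"}


def profile_rows(rows):
    """Count completeness, validity, and uniqueness violations."""
    # Completeness: a required field is missing or None.
    incomplete = sum(
        1 for r in rows if any(r.get(f) is None for f in REQUIRED_FIELDS)
    )
    # Validity: status is present but outside the enum.
    invalid = sum(
        1 for r in rows
        if r.get("status") is not None and r["status"] not in ALLOWED_STATUS
    )
    # Uniqueness: count the extra rows for every key seen more than once.
    key_counts = Counter(r.get("id") for r in rows if r.get("id") is not None)
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    return {"incomplete": incomplete, "invalid": invalid, "duplicates": duplicates}


rows = [
    {"id": 1, "status": "active", "source_uri": "s3://bucket/a"},
    {"id": 1, "status": "active", "source_uri": "s3://bucket/a"},    # duplicate key
    {"id": 2, "status": "archived", "source_uri": "s3://bucket/b"},  # invalid enum
    {"id": 3, "status": "inactive", "source_uri": None},             # incomplete
]
report = profile_rows(rows)
```

Accuracy, consistency, and representativeness need a reference to compare against (a source system, a second store, the expected production distribution), so they do not reduce to a single-table scan like this one.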
Freshness budgets for AI
- ·Foundation training — weeks/months (cost amortises over compute).
- ·Fine-tuning / DPO — days (capture recent intent/behaviour).
- ·RAG indexes — minutes-to-hours (CDC + streaming territory).
- ·Online features for inference — seconds (Kafka + Flink/RisingWave + Redis).
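A freshness budget becomes enforceable once it is written down as a duration per use case. The sketch below assumes concrete numbers for each tier; the tiers come from the list above, but the exact values and the `is_fresh` helper are illustrative assumptions to tune per workload.

```python
from datetime import datetime, timedelta, timezone

# Illustrative budgets only: the exact durations are assumptions.
FRESHNESS_BUDGETS = {
    "foundation_training": timedelta(weeks=4),
    "fine_tuning": timedelta(days=3),
    "rag_index": timedelta(hours=1),
    "online_features": timedelta(seconds=5),
}


def is_fresh(use_case, last_updated, now=None):
    """Return True if the data's age is within the budget for the use case."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) <= FRESHNESS_BUDGETS[use_case]


# The same 30-minute-old data passes one budget and fails another.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
thirty_min_old = now - timedelta(minutes=30)
rag_ok = is_fresh("rag_index", thirty_min_old, now)
online_ok = is_fresh("online_features", thirty_min_old, now)
```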
Where to enforce contracts
- ·Producer side: ODCS v3 + datacontract-cli + Buf for Protobuf streams.
- ·Warehouse side: dbt model contracts (`contract: {enforced: true}` in the model's `config`).
- ·Storage side: Iceberg schema evolution + REST catalog RBAC (Polaris/Nessie).
- ·Consumer side: Soda Core checks + Great Expectations suites.
- ·Move enforcement LEFT — block bad data before it lands.
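The idea of moving enforcement left can be illustrated without any of the tools listed: validate each record against a contract before it is written anywhere. This is a minimal stdlib sketch, not the ODCS or datacontract-cli format; `ORDER_CONTRACT`, `violations`, and `publish` are hypothetical names.

```python
# Hypothetical contract: field name -> (expected type, required?).
ORDER_CONTRACT = {
    "order_id": (str, True),
    "amount_cents": (int, True),
    "coupon": (str, False),
}


def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, (expected_type, required) in contract.items():
        value = record.get(field)
        if value is None:
            if required:
                problems.append(f"{field}: missing required field")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


def publish(records, contract):
    """Split records into accepted and rejected before anything is written."""
    accepted, rejected = [], []
    for record in records:
        errs = violations(record, contract)
        if errs:
            rejected.append((record, errs))
        else:
            accepted.append(record)
    return accepted, rejected


accepted, rejected = publish(
    [
        {"order_id": "A1", "amount_cents": 1299},
        {"order_id": "A2", "amount_cents": "12.99"},  # wrong type
        {"amount_cents": 500},                        # missing key
    ],
    ORDER_CONTRACT,
)
```

Rejected records would typically go to a dead-letter location with their error list, so the producer can fix them without the bad rows ever reaching the warehouse.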
Lineage discipline
- ·Emit OpenLineage 1.x events from every job (Airflow, dbt, Spark, Flink).
- ·Marquez or DataHub as the receiver; Atlan/Collibra for enterprise UX.
- ·Column-level + run-level + dataset-version-level — table-only is not enough in 2026.
- ·Hash inputs (Shopify Tangle pattern) so unchanged inputs skip downstream rebuilds.
- ·Supports the technical documentation that EU AI Act Annex IV requires for high-risk systems.
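Input hashing can be sketched as generic content-addressed caching: fingerprint everything a job reads, and rebuild only when the fingerprint changes. This is an illustration of the general idea, not Shopify's implementation; `fingerprint`, `run_if_changed`, and the in-memory `cache` dict are hypothetical.

```python
import hashlib
import json


def fingerprint(inputs):
    """Stable SHA-256 over a JSON-serialisable description of a job's inputs."""
    # sort_keys makes the digest independent of dict insertion order.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_if_changed(job_name, inputs, cache, build):
    """Run build() only when the input fingerprint differs from the cached one."""
    digest = fingerprint(inputs)
    if cache.get(job_name) == digest:
        return "skipped"
    build()
    cache[job_name] = digest
    return "rebuilt"


cache = {}
builds = []
inputs = {"table": "orders", "snapshot_id": 42, "sql": "select * from orders"}

first = run_if_changed("daily_orders", inputs, cache, lambda: builds.append(1))
second = run_if_changed("daily_orders", dict(inputs), cache, lambda: builds.append(1))
third = run_if_changed(
    "daily_orders", {**inputs, "snapshot_id": 43}, cache, lambda: builds.append(1)
)
```

In practice the fingerprint would cover the dataset version (for example a table snapshot id), the transformation code, and its parameters, and the cache would live in durable storage.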
Production eval gates
- ·Soda Core in CI for tabular DQ (YAML, SQL-native).
- ·DeepEval (pytest-style) for unit-style LLM-app tests.
- ·RAGAS at staging for batch RAG metrics (faithfulness, context precision/recall).
- ·TruLens / LangSmith / Phoenix in production for drift + traces.
- ·Enforce cost and latency budgets; fail the build when either exceeds 1.2× its budget.
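The cost and latency gate is simple enough to sketch directly. The 1.2× tolerance comes from the list above; the budget values, metric names, and the `gate` function are assumptions for illustration.

```python
TOLERANCE = 1.2  # fail when a metric exceeds 1.2x its budget

# Hypothetical per-request budgets.
BUDGETS = {"cost_usd": 0.010, "p95_latency_ms": 800.0}


def gate(measured, budgets=BUDGETS, tolerance=TOLERANCE):
    """Return the metrics that exceed tolerance x budget; empty means pass."""
    return {
        name: value
        for name, value in measured.items()
        if value > budgets[name] * tolerance
    }


# Within tolerance on both metrics.
passing = gate({"cost_usd": 0.011, "p95_latency_ms": 900.0})
# Latency is over 1.2 x 800 ms, so this run should fail the build.
failing = gate({"cost_usd": 0.011, "p95_latency_ms": 1000.0})
```

In a CI job, a non-empty result would be printed and followed by a non-zero exit code so the pipeline stops.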
CHEATSHEET · 02 · Tools by 2026 — what to pick
Data quality / contracts
- ·ODCS v3 + datacontract-cli — producer-side contracts (12+ export formats).
- ·dbt model contracts (dbt-core 1.10+) — warehouse-side enforcement.
- ·Soda Core — production monitoring (YAML + SQL-native).
- ·Great Expectations — Python-first deep validation.
- ·Buf — Protobuf governance for streaming schemas.
Lakehouse + catalog
- ·Apache Iceberg — the open format default.
- ·Apache Polaris (1.0, Jul 2025) — Snowflake-donated, reference Iceberg REST catalog.
- ·Project Nessie — Git-style branching/merging (great for ML feature backfills).
- ·Databricks Unity Catalog OSS — also Iceberg-REST compatible since 2024.
- ·Avoid Delta-only stacks — treat as legacy.
Embedding + vector
- ·Voyage-3-large — best for domain-specific (code/legal/medical/finance).
- ·Cohere embed-v4 — best multilingual (100+ languages).
- ·BGE-M3 — best open-source; dense + sparse + multi-vector in one model.
- ·Qdrant / Milvus / Weaviate / pgvector / Turbopuffer — pick by ops shape.
- ·MTEB v2 = sanity check; gold set on YOUR corpus = ground truth.
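Evaluating on a gold set from your own corpus usually comes down to a retrieval metric such as recall@k. The sketch below assumes the gold set is a list of (query, relevant document ids) pairs and stubs the retriever with a dict; `fake_index` stands in for a real embedding model plus vector store and is purely hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(gold_set, search, k):
    """Average recall@k over a gold set of (query, relevant_ids) pairs."""
    scores = [recall_at_k(search(q), rel, k) for q, rel in gold_set]
    return sum(scores) / len(scores)


# Hypothetical gold set and a stubbed retriever returning ranked doc ids.
gold_set = [
    ("refund policy", ["doc1", "doc7"]),
    ("reset password", ["doc3"]),
]
fake_index = {
    "refund policy": ["doc1", "doc2", "doc7"],
    "reset password": ["doc9", "doc3", "doc4"],
}
score_k1 = mean_recall_at_k(gold_set, fake_index.__getitem__, k=1)
score_k3 = mean_recall_at_k(gold_set, fake_index.__getitem__, k=3)
```

Running the same gold set against each candidate embedding model gives a like-for-like comparison on the documents that actually matter, which a public leaderboard cannot provide.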
Lineage + governance
- ·OpenLineage 1.x — the LF AI & Data standard.
- ·Marquez — the reference receiver (OSS).
- ·DataHub — large OSS deployments + UI.
- ·Atlan / Collibra / IBM MANTA / Acceldata — enterprise.
PII / EU AI Act
- ·Microsoft Presidio — OSS detector + anonymizer (text + images + structured).
- ·BigID + Microsoft Purview DSPM — enterprise classification across hybrid + SaaS.
- ·Privacera (Securiti) — fine-grained access for AI workloads.
- ·EU AI Act GPAI obligations apply from 2 Aug 2025; most remaining provisions apply from 2 Aug 2026, with high-risk rules for Annex I products following on 2 Aug 2027.
- ·Annex IV requires documented provenance for high-risk systems.
Streaming + CDC
- ·Debezium — reference CDC connector (Postgres/MySQL/SQL Server/Mongo/Oracle).
- ·RisingWave — Postgres-compatible streaming DB; ingests CDC directly, without a separate Kafka or Debezium deployment.
- ·Apache Flink + Flink CDC 3.x — heavy-duty general-purpose streaming.
- ·Materialize — strict-serializable for regulated joins (BSL, cloud-only).
- ·Estuary Flow — multi-destination managed CDC (analytics + ops + AI from one capture).
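Whatever tool captures the changes, the consumer ends up applying a stream of create/update/delete events to some target. The sketch below uses a simplified version of the Debezium-style envelope (`op`, `before`, `after`) and an in-memory dict as the target; real events carry additional metadata (source, timestamps, schema) that is omitted here, and `apply_change` is a hypothetical helper.

```python
def apply_change(state, event, key="id"):
    """Apply one change event to an in-memory table keyed by `key`."""
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update -> upsert
        row = event["after"]
        state[row[key]] = row
    elif op == "d":                # delete -> remove using the old row's key
        state.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return state


events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "b@example.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@example.com"},
     "after": {"id": 1, "email": "a2@example.com"}},
    {"op": "d", "before": {"id": 2, "email": "b@example.com"}, "after": None},
]
table = {}
for event in events:
    apply_change(table, event)
```

Because upserts and deletes by key are idempotent, replaying the same events after a consumer restart leaves the target in the same state, which is why key-based application is the usual choice for RAG indexes and feature stores fed by CDC.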
Synthetic + bias
- ·MOSTLY AI — financial / time-series; demographic-parity controls.
- ·Gretel — developer-first; differential privacy native.
- ·SDV — OSS go-to (CTGAN, GaussianCopula, multi-table).
- ·Fairlearn — sklearn-native fairness scan.
- ·AIF360 — 70+ metrics + 11 mitigation algorithms.
- ·Aequitas — governance-ready HTML/CSV reports.
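The core quantity behind a demographic-parity scan is the gap in selection rate between groups. The sketch below computes it by hand with the standard library so the arithmetic is visible; Fairlearn ships a metric with the same name, but this code is an independent illustration, and the predictions and group labels are made-up sample data.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Share of positive predictions (1) per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, groups):
    """Largest gap in selection rate between any two groups (0 = parity)."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Made-up predictions for two groups of four people each.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = selection_rates(y_pred, groups)
gap = demographic_parity_difference(y_pred, groups)
```

Here group "a" is selected 75% of the time and group "b" 25%, so the gap is 0.5. The same check applies to synthetic data: compare the gap in the generated set against the gap in the source data before training on it.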