Pipelines for real workloads, not demos.
8 micro-lessons · ~84 min · Real Docker images
AI-ready data quality
Data your AI can actually use — contracts, lakehouse, lineage, eval.
Real skills, real career delta.
Skills you'll gain (10)
- Diagnose AI-readiness gaps across the 8 dimensions · Working
Run the 8-dimension scorecard on any dataset: accuracy, completeness, consistency, timeliness, validity, and uniqueness, plus the AI-specific representativeness and provenance. Map each gap to a concrete fix in the contract / lineage / eval stack.
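As a rough illustration of what one scorecard pass might compute, the sketch below scores just two of the eight dimensions (completeness and uniqueness) over a list of records. The function name `scorecard`, the sample rows, and the exact scoring rules are assumptions made for this example, not the course's own scorecard.

```python
from collections import Counter

def scorecard(rows, key, required):
    """Score two of the eight dimensions for a list of dict records."""
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "uniqueness": 0.0}
    # Completeness: share of rows where every required field is present and non-empty.
    complete = sum(
        all(row.get(field) not in (None, "") for field in required) for row in rows
    )
    # Uniqueness: share of rows whose key value occurs exactly once.
    counts = Counter(row.get(key) for row in rows)
    unique = sum(1 for row in rows if counts[row.get(key)] == 1)
    return {"completeness": complete / total, "uniqueness": unique / total}

# Hypothetical sample data: one empty email, one duplicated id.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "c@example.com"},
    {"id": 3, "email": "d@example.com"},
]
scores = scorecard(rows, key="id", required=["id", "email"])
```

Each score is a fraction between 0 and 1, so a gap can be compared against a per-dimension threshold before it is mapped to a fix.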
- Author and enforce data contracts with ODCS v3 + dbt · Production
Write Open Data Contract Standard v3 contracts, generate dbt models with `contract: enforced`, run `datacontract test` in CI, block merges on schema-breaking changes — producer-side, before data lands.
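To make "schema-breaking change" concrete, here is a minimal sketch of the kind of comparison a merge gate could run. It is not how `datacontract test` or dbt work internally; `breaking_changes` and both schemas are hypothetical, and the rule that added columns are non-breaking is an assumption.

```python
def breaking_changes(old, new):
    """Compare two {column: type} mappings and list changes that break consumers."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    # Columns that exist only in `new` are treated as non-breaking additions.
    return problems

# Hypothetical contract versions: one type change, one removal, one addition.
old_schema = {"order_id": "bigint", "amount": "decimal(10,2)", "status": "string"}
new_schema = {"order_id": "bigint", "amount": "double", "created_at": "timestamp"}
problems = breaking_changes(old_schema, new_schema)
```

A CI job could fail the merge whenever `problems` is non-empty, which keeps the check on the producer side.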
- Architect a lakehouse with Iceberg + REST catalog · Production
Stand up Apache Polaris (or Nessie) as an Iceberg REST catalog, evolve schemas safely (add/drop/reorder/rename + type promotion), and integrate with Spark / DuckDB / Trino without rewriting data.
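The type-promotion part of safe schema evolution can be sketched as a small check. The rules encoded here are the widenings listed in the Iceberg table spec for v1/v2 tables (int to long, float to double, and decimal precision increase at fixed scale); the helper name `is_safe_promotion` and the string type notation are assumptions for illustration, and a real engine performs this validation itself.

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")

def is_safe_promotion(old_type, new_type):
    """Return True if old_type -> new_type is a widening allowed in place."""
    if old_type == new_type:
        return True
    if (old_type, new_type) in {("int", "long"), ("float", "double")}:
        return True
    old_dec = _DECIMAL.fullmatch(old_type)
    new_dec = _DECIMAL.fullmatch(new_type)
    if old_dec and new_dec:
        old_precision, old_scale = map(int, old_dec.groups())
        new_precision, new_scale = map(int, new_dec.groups())
        # Precision may grow; scale must stay the same.
        return new_precision > old_precision and new_scale == old_scale
    return False
```

For example, `is_safe_promotion("int", "long")` is allowed, while narrowing from `long` to `int` or changing a decimal's scale is rejected.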
- Build an embedding-readiness checker (chunking + recall@K) · Production
Benchmark 3 chunking strategies × 3 embedding models on a 50–200 question gold set; report Recall@5 and faithfulness; ship the 'should we promote this RAG to prod' gate.
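Recall@K can be computed in a few lines once each gold question is paired with the chunk IDs that answer it. The sketch below uses one common definition (the share of a question's gold chunks found in the top K results, averaged over questions); the question and chunk IDs are made up for the example.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Mean over questions of (gold chunks found in top-k) / (gold chunks)."""
    scores = []
    for question, gold in relevant.items():
        if not gold:
            continue  # skip questions that have no gold chunks
        top_k = set(retrieved.get(question, [])[:k])
        scores.append(len(top_k & set(gold)) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical gold set and ranked retrieval results.
gold = {"q1": ["c1", "c7"], "q2": ["c3"]}
runs = {"q1": ["c1", "c2", "c4", "c5", "c6", "c7"], "q2": ["c9", "c3"]}
score = recall_at_k(runs, gold, k=5)
```

Running the same function for every chunking strategy and embedding model pair gives the grid of scores that a promotion gate can compare against a threshold.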
- Wire end-to-end data lineage with OpenLineage · Working
Emit OpenLineage 1.x events from Airflow + dbt + Spark; receive in Marquez; surface column-level + run-level + dataset-version-level lineage; demonstrate right-to-be-forgotten propagation.
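Right-to-be-forgotten propagation comes down to walking the lineage graph downstream from the dataset that held the record. The sketch below does that over a plain dictionary; the dataset names are hypothetical, and in practice the edges would be read from the lineage store rather than hard-coded.

```python
from collections import deque

def downstream(edges, start):
    """Breadth-first walk of a {dataset: [consumers]} lineage graph."""
    affected = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in visited:
                visited.add(child)
                affected.append(child)
                queue.append(child)
    return affected

# Hypothetical dataset-level lineage edges.
lineage = {
    "raw.users": ["staging.users"],
    "staging.users": ["marts.customers", "features.user_embeddings"],
    "marts.customers": ["reports.churn"],
}
affected = downstream(lineage, "raw.users")
```

Every dataset in `affected` is a candidate for the same deletion or re-derivation as the source record.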
- Detect and redact PII at ingest with Presidio · Production
Self-host Presidio analyzer + anonymizer; integrate as a FastAPI gateway in front of training-data ingestion; configure per-entity policies (mask vs hash vs synthetic) for Art. 10 EU AI Act compliance.
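The idea of per-entity policies (mask one entity type, hash another) can be shown without Presidio. The sketch below swaps in two simple regular expressions as the detector, so it is a stand-in for the concept and not Presidio's API; the patterns, the `POLICY` table, and the placeholder format are all assumptions.

```python
import hashlib
import re

# Simplified stand-in detectors; a real recognizer covers far more cases.
PII = re.compile(
    r"(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<PHONE>\+?\d[\d -]{7,}\d)"
)
# Hypothetical per-entity policy: hash emails, mask phone numbers.
POLICY = {"EMAIL": "hash", "PHONE": "mask"}

def redact(text):
    """Replace each detected entity according to its policy, in one pass."""
    def replace(match):
        entity = match.lastgroup
        if POLICY[entity] == "hash":
            digest = hashlib.sha256(match.group().encode("utf-8")).hexdigest()[:12]
            return f"<{entity}:{digest}>"
        return f"<{entity}>"
    return PII.sub(replace, text)

clean = redact("Contact jane@example.com or +1 555 010 9999.")
```

Hashing keeps the same email mapped to the same token across records, which preserves joins, while masking discards the value entirely.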
- Ship streaming CDC → vector pipelines for second-scale RAG · Working
Capture changes from Postgres with Debezium, materialize through RisingWave, upsert into Qdrant within seconds — replacing nightly batch RAG re-indexing.
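The upsert logic at the end of such a pipeline can be sketched against an in-memory index. The events below follow the Debezium envelope convention of an `op` code with `before` and `after` row images, simplified to the fields needed here; the `id` and `body` columns, `toy_embed`, and the dictionary standing in for the vector store are assumptions.

```python
def apply_change(index, event, embed):
    """Apply one Debezium-style change event to an {id: vector} index."""
    op = event["op"]  # c = create, u = update, r = snapshot read, d = delete
    if op == "d":
        index.pop(event["before"]["id"], None)
    elif op in ("c", "u", "r"):
        row = event["after"]
        index[row["id"]] = embed(row["body"])  # upsert: overwrite by primary key
    return index

def toy_embed(text):
    # Placeholder embedding, not a real model.
    return [float(len(text)), float(text.count(" "))]

index = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "body": "refund policy"}},
    {"op": "u", "before": {"id": 1, "body": "refund policy"},
     "after": {"id": 1, "body": "refund policy v2"}},
    {"op": "c", "before": None, "after": {"id": 2, "body": "shipping"}},
    {"op": "d", "before": {"id": 2, "body": "shipping"}, "after": None},
]
for event in events:
    apply_change(index, event, toy_embed)
```

Keying the index by primary key makes replayed events idempotent, which matters when a consumer restarts mid-stream.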
- Run a tabular bias audit ready for governance review · Working
Use Fairlearn for an exploratory disparity scan, AIF360 for mitigation, and Aequitas for HTML/CSV reports; cover demographic parity, equalized odds, disparate impact (the 4/5ths rule), and calibration within groups.
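Two of these metrics are simple enough to compute by hand, which helps when sanity-checking library output. The sketch below derives per-group selection rates, the demographic parity difference, and the disparate impact ratio with its 4/5ths check; the predictions and group labels are invented, and the function names are not taken from any of the libraries above.

```python
def selection_rates(y_pred, groups):
    """Share of positive predictions per group."""
    totals, positives = {}, {}
    for pred, group in zip(y_pred, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}

def parity_report(y_pred, groups):
    rates = selection_rates(y_pred, groups)
    low, high = min(rates.values()), max(rates.values())
    ratio = low / high if high > 0 else 1.0
    return {
        "demographic_parity_difference": high - low,
        "disparate_impact_ratio": ratio,
        "passes_four_fifths": ratio >= 0.8,  # the 4/5ths rule
    }

# Hypothetical binary predictions for two groups of four people each.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
report = parity_report(y_pred, groups)
```

Here group "a" is selected three times as often as group "b", so the ratio falls well below 0.8 and the check fails.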
- Build eval-driven ingestion gates (Soda + RAGAS in CI) · Production
Run Soda Core in CI for tabular DQ, DeepEval/RAGAS at staging for RAG, and TruLens in production for drift; gate every merge on a tolerance budget for cost, latency, and recall.
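A tolerance budget can be expressed as a small table of floors and ceilings that a CI step checks before a merge. The metric names and limits below are hypothetical, and the sketch is independent of Soda, DeepEval, RAGAS, or TruLens; it only shows the gating logic applied to whatever numbers those tools report.

```python
# Hypothetical budget: ("min", x) is a floor, ("max", x) is a ceiling.
BUDGET = {
    "recall_at_5": ("min", 0.80),
    "p95_latency_ms": ("max", 1200),
    "cost_per_1k_queries_usd": ("max", 0.50),
}

def gate(metrics, budget=BUDGET):
    """Return a list of budget violations; an empty list means the gate passes."""
    failures = []
    for name, (kind, limit) in budget.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return failures

failures = gate(
    {"recall_at_5": 0.76, "p95_latency_ms": 900, "cost_per_1k_queries_usd": 0.42}
)
```

In a pipeline, the step would exit with a non-zero status when `failures` is non-empty so the merge is blocked; treating a missing metric as a failure avoids passing by omission.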
- Comply with EU AI Act Art. 10 + Annex IV data governance · Advanced
Author the public training-data summary using the AI Office template, document data governance per Art. 10 (data that is relevant, sufficiently representative and, to the best extent possible, free of errors and complete), and keep Annex IV provenance evidence, all in time for the 2 Aug 2026 enforcement deadline.