AIDQ Course

AI-ready data quality

Lessons: 8 modules
Total: 86 min full study
Quick: 7 min trailer
Projects: 8 Docker labs

Skills you'll gain (10)
  • Diagnose AI-readiness gaps across the 8 dimensions (Working)

    Run the 8-dimension scorecard on any dataset: six classic dimensions (accuracy, completeness, consistency, timeliness, validity, uniqueness) plus two AI-specific ones (representativeness, provenance). Map each gap to a concrete fix in the contract / lineage / eval stack.
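
As a rough illustration of the scorecard idea, the sketch below scores three of the eight dimensions (completeness, uniqueness, validity) on a handful of records and lists the ones that fall below a threshold. The field names, the age rule, and the 0.95 threshold are hypothetical choices for this example, not values from the course.

```python
from typing import Callable

Record = dict[str, object]

def completeness(rows: list[Record], field: str) -> float:
    """Share of rows where `field` is present and neither None nor empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def uniqueness(rows: list[Record], key: str) -> float:
    """Distinct non-null key values divided by total non-null key values."""
    values = [r.get(key) for r in rows if r.get(key) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)

def validity(rows: list[Record], field: str, rule: Callable[[object], bool]) -> float:
    """Share of non-null values of `field` that satisfy `rule`."""
    values = [r.get(field) for r in rows if r.get(field) is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if rule(v)) / len(values)

def gaps(scores: dict[str, float], threshold: float = 0.95) -> list[str]:
    """Names of the dimensions scoring below the (hypothetical) threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)

# Hypothetical sample: one missing email and one duplicated id.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "c@example.com", "age": 41},
    {"id": 4, "email": "d@example.com", "age": 51},
]

scores = {
    "completeness": completeness(rows, "email"),
    "uniqueness": uniqueness(rows, "id"),
    "validity": validity(rows, "age", lambda a: isinstance(a, int) and 0 <= a <= 120),
}
failing = gaps(scores)
```

Each name in `failing` would then be mapped to a fix, for example a not-null or unique constraint in the data contract.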

  • Author and enforce data contracts with ODCS v3 + dbt (Production)

    Write Open Data Contract Standard v3 contracts, generate dbt models with `contract: enforced`, run `datacontract test` in CI, block merges on schema-breaking changes — producer-side, before data lands.

  • Architect a lakehouse with Iceberg + REST catalog (Production)

    Stand up Apache Polaris (or Nessie) as an Iceberg REST catalog, evolve schemas safely (add/drop/reorder/rename + type promotion), and integrate with Spark / DuckDB / Trino without rewriting data.
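
To make "type promotion" concrete, here is a minimal sketch of a pre-merge check that accepts only the widening promotions the Iceberg v1/v2 table spec allows: `int` to `long`, `float` to `double`, and a `decimal` whose precision grows while its scale stays fixed. Representing types as plain strings and the function name `is_safe_promotion` are assumptions made for this example; this is not an Iceberg API.

```python
import re

# Matches type strings such as "decimal(10, 2)".
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")

# Primitive widenings permitted by the Iceberg v1/v2 spec.
_PRIMITIVE_PROMOTIONS = {("int", "long"), ("float", "double")}

def is_safe_promotion(old: str, new: str) -> bool:
    """Return True if changing a column from `old` to `new` needs no data rewrite."""
    if old == new:
        return True  # no change at all
    if (old, new) in _PRIMITIVE_PROMOTIONS:
        return True
    m_old, m_new = _DECIMAL.fullmatch(old), _DECIMAL.fullmatch(new)
    if m_old and m_new:
        p_old, s_old = map(int, m_old.groups())
        p_new, s_new = map(int, m_new.groups())
        # Precision may widen; scale must not change.
        return s_old == s_new and p_new > p_old
    return False

# Hypothetical proposed schema changes: (column, old type, new type).
proposed = [
    ("order_id", "int", "long"),
    ("amount", "decimal(10, 2)", "decimal(18, 2)"),
    ("quantity", "long", "int"),
]
rejected = [col for col, old, new in proposed if not is_safe_promotion(old, new)]
```

Here only `quantity` is rejected, because narrowing `long` to `int` could lose data.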

  • Build an embedding-readiness checker (chunking + recall@K) (Production)

    Benchmark 3 chunking strategies × 3 embedding models on a 50–200 question gold set; report Recall@5 and faithfulness; ship the 'should we promote this RAG to prod' gate.
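
For reference, Recall@K over a gold set can be computed as in the sketch below. The chunk IDs, the two sample questions, and the 0.8 promotion threshold are hypothetical. In a real benchmark the `retrieved` lists would come from each chunking strategy and embedding model pair.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each gold question needs at least one relevant chunk")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@K over (retrieved, relevant) pairs, one pair per gold question."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Hypothetical results for two gold questions.
gold_runs = [
    # c3 is in the top 5, c4 only shows up at rank 6: recall 1/2
    (["c1", "c7", "c3", "c9", "c2", "c4"], {"c3", "c4"}),
    # the single relevant chunk is ranked first: recall 1/1
    (["c5", "c6", "c8", "c0", "c1"], {"c5"}),
]

score = mean_recall_at_k(gold_runs, k=5)
PROMOTE_THRESHOLD = 0.8  # hypothetical gate value
promote = score >= PROMOTE_THRESHOLD
```

Faithfulness is not covered here; it typically needs an LLM-based judge rather than a set comparison.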

  • Wire end-to-end data lineage with OpenLineage (Working)

    Emit OpenLineage 1.x events from Airflow + dbt + Spark; receive in Marquez; surface column-level + run-level + dataset-version-level lineage; demonstrate right-to-be-forgotten propagation.

  • Detect and redact PII at ingest with Presidio (Production)

    Self-host Presidio analyzer + anonymizer; integrate as a FastAPI gateway in front of training-data ingestion; configure per-entity policies (mask vs hash vs synthetic) for Art. 10 EU AI Act compliance.

  • Ship streaming CDC → vector pipelines for second-scale RAG (Working)

    Capture changes from Postgres with Debezium, materialize through RisingWave, upsert into Qdrant within seconds — replacing nightly batch RAG re-indexing.

  • Run a tabular bias audit ready for governance review (Working)

    Fairlearn for exploratory disparity scan, AIF360 for mitigation, Aequitas for HTML/CSV reports; cover demographic parity, equalized odds, disparate impact (4/5ths), calibration within groups.
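
Two of the metrics named above reduce to simple arithmetic on per-group selection rates. The sketch below is hand-rolled for illustration and does not use the Fairlearn, AIF360, or Aequitas APIs. The predictions and group labels are made up.

```python
from collections import defaultdict

def selection_rates(y_pred: list[int], groups: list[str]) -> dict[str, float]:
    """Share of positive predictions (1) within each group."""
    totals: dict[str, int] = defaultdict(int)
    positives: dict[str, int] = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(rates: dict[str, float]) -> float:
    """Gap between the highest and lowest group selection rate (0 means parity)."""
    return max(rates.values()) - min(rates.values())

def disparate_impact_ratio(rates: dict[str, float]) -> float:
    """Lowest selection rate divided by the highest."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody selected in any group; treated as parity in this sketch
    return min(rates.values()) / highest

# Hypothetical predictions: group "a" selected 4 of 5, group "b" selected 2 of 5.
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 1]
groups = ["a"] * 5 + ["b"] * 5

rates = selection_rates(y_pred, groups)
dpd = demographic_parity_difference(rates)
ratio = disparate_impact_ratio(rates)
passes_four_fifths = ratio >= 0.8  # the 4/5ths rule
```

With these numbers the ratio is 0.5, so the 4/5ths check fails and the model would be flagged for mitigation.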

  • Build eval-driven ingestion gates (Soda + RAGAS in CI) (Production)

    Soda Core in CI for tabular DQ, DeepEval/RAGAS at staging for RAG, TruLens in production for drift; gate every merge on a tolerance budget for cost, latency, and recall.
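
One way to read "tolerance budget" is as a set of allowed regressions relative to a baseline run. The sketch below compares a candidate run against a baseline and returns the metrics that exceed the budget. The metric names, the budget values, and the `Budget` and `gate` names are all hypothetical, and the metrics themselves would come from tools such as Soda Core or RAGAS.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    max_cost_increase: float     # relative, e.g. 0.10 allows +10%
    max_latency_increase: float  # relative, e.g. 0.10 allows +10%
    max_recall_drop: float       # absolute, e.g. 0.03 allows a drop of 0.03

def gate(baseline: dict[str, float], candidate: dict[str, float], budget: Budget) -> list[str]:
    """Return the names of the metrics whose regression exceeds the budget."""
    violations: list[str] = []
    cost_delta = (candidate["cost_usd"] - baseline["cost_usd"]) / baseline["cost_usd"]
    latency_delta = (candidate["latency_ms"] - baseline["latency_ms"]) / baseline["latency_ms"]
    recall_drop = baseline["recall_at_5"] - candidate["recall_at_5"]
    if cost_delta > budget.max_cost_increase:
        violations.append("cost_usd")
    if latency_delta > budget.max_latency_increase:
        violations.append("latency_ms")
    if recall_drop > budget.max_recall_drop:
        violations.append("recall_at_5")
    return violations

# Hypothetical numbers: cost +5%, latency +25%, recall down 0.02.
baseline = {"cost_usd": 1.00, "latency_ms": 800.0, "recall_at_5": 0.82}
candidate = {"cost_usd": 1.05, "latency_ms": 1000.0, "recall_at_5": 0.80}
budget = Budget(max_cost_increase=0.10, max_latency_increase=0.10, max_recall_drop=0.03)

violations = gate(baseline, candidate, budget)
```

In a CI job, a non-empty `violations` list would translate into a non-zero exit code so the merge is blocked.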

  • Comply with EU AI Act Art. 10 + Annex IV data governance (Advanced)

    Author the public training-data summary using the AI Office template, document data governance per Art. 10 (relevance, representativeness, freedom from errors), and keep Annex IV provenance evidence, all in time for the 2 Aug 2026 enforcement deadline.