DCNT Course

Data contracts for data platforms

Lessons: 8 modules
Total: 80m full study
Quick: 7m trailer
Projects: 8 docker labs
CHEATSHEET · 01 · Data contracts · operations cheatsheet
ODCS v3.1.0 contract anatomy
  • dataContractSpecification: '3.1.0' at root; version, description, owner required
  • servers: [{type: 'kafka'|'postgres'|'bigquery'|'s3', ...}] defines where data lives
  • schema.fields: [{name, type, required, pii, description, ...}] — field types follow Avro naming
  • quality.type: 'SodaCL'|'GreatExpectations'|'dbt_expectations'; checks array
  • slaProperties: {freshness, completeness, accuracy, latency} with thresholds
  • models: [{name, version, deprecationDate, ...}] for dbt model versioning
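The anatomy above fits in one file. A minimal sketch, assuming illustrative names (orders, checkout-team, the kafka topic) plugged into the keys from the bullets; exact key spellings can vary between spec minor versions:

```yaml
# orders contract (sketch — ids, names, and thresholds are illustrative)
dataContractSpecification: "3.1.0"
id: orders-contract
version: 1.0.0
description: Orders emitted by the checkout service
owner: checkout-team
servers:
  - type: kafka
    topic: orders.v1
schema:
  fields:
    - name: order_id
      type: string
      required: true
      pii: false
      description: Unique order identifier
    - name: customer_email
      type: string
      required: true
      pii: true          # flags the field for masking downstream
      description: Buyer contact address
quality:
  type: SodaCL
  checks:
    - missing_count(order_id) = 0
slaProperties:
  freshness: 2h
  completeness: 99%
```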
dbt 1.9 contract enforcement
  • contract: {enforced: true} in the model's config block; build fails if a contracted column is removed
  • columns: [{name, data_type, description, constraints: [not_null, ...]}]
  • dbt-checkpoint 2.x pre-commit hook: dbt-check-model-columns-exist
  • dbt model versions: version: 2, latest_version: 2, deprecation_date: '2025-06-01'
  • dbt build --select model_name exits non-zero on contract violation
  • dbt docs generate includes contract status; dbt parse validates YAML syntax
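The enforcement keys above combine in a single schema.yml entry. A sketch, assuming hypothetical model and column names (dim_customers, customer_id):

```yaml
# models/schema.yml (sketch)
models:
  - name: dim_customers
    latest_version: 2
    config:
      contract:
        enforced: true          # dbt build fails if columns or types drift from this spec
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
    versions:
      - v: 1
        deprecation_date: 2025-06-01   # v1 consumers see a deprecation warning
      - v: 2
```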
Kafka Schema Registry compatibility (v7.8+)
  • BACKWARD: new schema reads old data; safe to delete fields or add fields with defaults
  • FORWARD: old schema reads new data; safe to add fields or delete optional fields
  • FULL: both directions; in practice, add or remove only fields that carry defaults
  • NONE: no compatibility check; use only in dev (checked modes reject incompatible schemas with HTTP 409)
  • curl -X POST -H 'Content-Type: application/vnd.schemaregistry.v1+json' http://registry:8081/subjects/{subject}-value/versions -d '{"schema": "..."}'
  • Protobuf field numbers are immutable; reusing a number breaks both BACKWARD and FORWARD (Avro matches fields by name instead)
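The BACKWARD rule above can be approximated in plain Python: a reader on the new schema can decode old records as long as every field the new schema adds carries a default (deleted fields are fine, because the new reader simply ignores them). A toy check on dict-shaped Avro schemas — an illustration, not a substitute for the registry's /compatibility endpoint:

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Toy BACKWARD check: every field added in new_schema must have a
    default so that records written with old_schema can still be read."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False  # new required field: old records can't supply a value
    return True

old = {"fields": [{"name": "order_id", "type": "string"}]}
added_with_default = {"fields": old["fields"] + [{"name": "channel", "type": "string", "default": "web"}]}
added_required = {"fields": old["fields"] + [{"name": "channel", "type": "string"}]}

print(is_backward_compatible(old, added_with_default))  # True
print(is_backward_compatible(old, added_required))      # False
```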
Breaking-change CI gates
  • buf breaking --against '.git#branch=main' for Protobuf; FILE + WIRE_JSON rule categories
  • dbt-checkpoint dbt-check-model-columns-exist in .pre-commit-config.yaml
  • datacontract-cli diff v1.yaml v2.yaml shows breaking vs non-breaking changes
  • GitHub Actions: run breaking checks on every PR; fail if exit code != 0
  • Schema Registry HTTP 409 Conflict blocks incompatible schema registration
  • sqlfluff parse + dbt parse catch syntax errors before schema validation
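Wired into CI, the gates above might look like this workflow (a sketch: file paths, the contract name, and action versions are illustrative — adjust to the repo layout):

```yaml
# .github/workflows/contract-gates.yml (sketch)
name: contract-gates
on: pull_request
jobs:
  breaking-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0                         # buf needs main's history to diff against
      - uses: bufbuild/buf-setup-action@v1
      - run: buf breaking --against '.git#branch=main'
      - run: pip install datacontract-cli
      - run: datacontract lint contracts/orders.yaml   # non-zero exit fails the PR
```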
Value-level enforcement (GX Core 1.x + Soda Core 3.x)
  • GX file-backed context: context = gx.get_context(mode='file'); no DB credentials in code
  • GX Checkpoint: checkpoint.run() returns CheckpointResult; .success == False fails the DAG
  • Soda SodaCL checks.yml: freshness(updated_at) < 2h, missing_percent(email) < 1%, missing_count(id) = 0
  • Soda: scan = Scan(); scan.set_data_source_name('dw'); scan.add_sodacl_yaml_file('checks.yml'); scan.execute()
  • GX Data Docs: context.build_data_docs() generates HTML; commit to repo or S3
  • Airflow: @task(trigger_rule='all_done') after GX/Soda; skip downstream on failure
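The freshness and completeness thresholds above reduce to simple arithmetic. A toy value-level gate in plain Python (column names `updated_at` and `email` are illustrative; a real pipeline would run GX or Soda against the warehouse instead):

```python
from datetime import datetime, timedelta, timezone

def sla_violations(rows, *, freshness_limit=timedelta(hours=2),
                   min_completeness=0.99, now=None):
    """Toy gate mirroring the SodaCL thresholds above: freshness < 2h on the
    newest updated_at, completeness > 99% on email. Returns violated check
    names; an empty list means the batch passes."""
    now = now or datetime.now(timezone.utc)
    violations = []
    newest = max(r["updated_at"] for r in rows)
    if now - newest >= freshness_limit:              # newest row is already stale
        violations.append("freshness")
    complete = sum(r["email"] is not None for r in rows) / len(rows)
    if complete <= min_completeness:                 # too many missing emails
        violations.append("completeness")
    return violations

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh_rows = [{"updated_at": now - timedelta(minutes=30), "email": "a@x.io"},
              {"updated_at": now - timedelta(hours=1), "email": "b@x.io"}]
stale_rows = [{"updated_at": now - timedelta(hours=5), "email": None}]

print(sla_violations(fresh_rows, now=now))   # []
print(sla_violations(stale_rows, now=now))   # ['freshness', 'completeness']
```

Raising an exception when the returned list is non-empty is enough to fail the surrounding Airflow task and skip downstream consumers.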
Contract-driven code generation + discoverability
  • datacontract-cli 0.10.x export --format dbt contract.yaml > schema.yml
  • datacontract-cli export --format avro contract.yaml > schema.avsc
  • datacontract-cli export --format great-expectations contract.yaml > suite.json
  • CI job: regenerate all three; fail if committed artifacts differ from generated
  • Backstage: register contract YAML as a Location; catalog-info.yaml references it
  • datacontract-cli publish --server backstage-url pushes contract to Backstage API
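The regenerate-and-diff CI job from the bullets can be sketched as one workflow job; the contract path and output directory are illustrative, and format names vary by CLI version (check `datacontract export --help`):

```yaml
# CI job (sketch): fail the build when committed artifacts drift from the contract
codegen-drift:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install datacontract-cli
    - run: |
        datacontract export --format dbt contracts/orders.yaml > generated/schema.yml
        datacontract export --format avro contracts/orders.yaml > generated/orders.avsc
        datacontract export --format great-expectations contracts/orders.yaml > generated/suite.json
        git diff --exit-code generated/      # non-zero exit if anything drifted
```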
CHEATSHEET · 02 · Data contracts · 2 AM debugging cheatsheet
Contract validation failures
  • datacontract-cli lint <file.yaml> → check ODCS v3.1.0 syntax, required fields, enum values
  • datacontract-cli diff v1.yaml v2.yaml → pinpoint breaking changes (removed fields, type shifts)
  • dbt build --select <model> → enforced: true models exit non-zero if contracted columns missing
  • dbt-checkpoint hook → pre-commit blocks commits with schema violations before dbt runs
  • buf breaking --against '.git#branch=main' → Protobuf field number reuse, wire-type changes caught
Schema Registry / Kafka compatibility
  • curl http://localhost:8081/subjects/<topic>-value/versions → list all registered schema versions
  • curl -X POST -H 'Content-Type: application/vnd.schemaregistry.v1+json' http://localhost:8081/compatibility/subjects/<topic>-value/versions/latest -d '{"schema":"..."}' → test compatibility before registering
  • BACKWARD mode: new schema must read old data; blocks required-field addition without a default, type narrowing
  • FORWARD mode: old schema must read new data; blocks required-field deletion without a default
  • FULL mode: both BACKWARD + FORWARD; safest but most restrictive; use for shared dimensions
dbt contract enforcement
  • enforced: true in contract block → dbt build fails if column missing or type mismatch
  • version: 2 + deprecation_date: '2025-06-01' → signals v1 end-of-life; dbt logs warning to consumers
  • dbt build --select state:modified+ → test only changed models + downstream; catches contract breaks early
  • dbt parse → validates contract YAML syntax before build; fails fast on typos in column names
  • dbt docs generate → contract metadata appears in dbt Cloud lineage; consumers see SLOs, owners
Quality gate failures (GX / Soda)
  • GX Checkpoint: checkpoint.run() → CheckpointResult; fail the DAG when .success is False
  • GX Expectation: expect_column_values_to_not_be_null → catches 40% nulls in revenue before dashboard
  • Soda SodaCL: freshness(updated_at) < 2h, missing_percent(col) < 1% → SLA breach blocks downstream tasks
  • Soda scan → exit code 1 if any check fails; wire to Airflow on_failure_callback or dbt post-hook
  • GX Data Docs HTML → inspect failed expectations; shows row counts, null %, distribution by value
CI/CD contract gates
  • GitHub Actions: datacontract-cli lint on every PR → fail if ODCS YAML invalid or breaking
  • buf breaking in CI → block PR if .proto field number reused, wire type changed, required field deleted
  • dbt-checkpoint pre-commit → run before git commit; blocks schema changes locally, no CI wait
  • Schema Registry HTTP 409 → compatibility check failed; inspect error body for which field/type broke
  • dbt build --fail-fast → stop on first contract violation; speeds up feedback loop in CI
Contract discoverability & debugging
  • Backstage: datacontract-cli export --format backstage → register ODCS as API entity in catalog
  • datacontract-cli export --format dbt → regenerate schema.yml from ODCS; diff against committed
  • datacontract-cli export --format avro → regenerate .avsc from ODCS; validate against Kafka schema
  • dbt meta: contract_owner, sla_freshness_hours → searchable in dbt Cloud; link to runbook
  • grep -r 'deprecated: true' dbt/models/ → find all deprecated models; audit consumer migrations