Plan
The normalized execution object. YAML, presets and kwargs become an immutable plan before data is written.
Core concepts
The framework deliberately keeps the number of concepts small. Most behavior comes from how a contract combines source, transformation, quality, schema policy, write mode and operational metadata.
The source reader. It should retrieve data and expose source metadata, not hide business transformations.
Physical normalization before quality/write: mapping, JSON parsing, flattening, arrays, projections and deduplication.
The Delta write semantics: append, overwrite, upsert, hash diff, SCD2 or full snapshot soft delete.
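The idea of folding YAML, presets and kwargs into one immutable plan can be sketched as follows. This is a minimal illustration, not the framework's implementation: the `Plan` fields, the `build_plan` helper and the precedence order (preset, then contract, then kwargs) are all assumptions made for the example, and the contract dict stands in for an already parsed YAML file.

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical normalized plan; field names are illustrative only."""
    source: Mapping[str, Any]
    target: Mapping[str, Any]
    mode: str


def build_plan(contract: Mapping[str, Any],
               preset: Optional[Mapping[str, Any]] = None,
               **overrides: Any) -> Plan:
    # Assumed precedence: preset < contract < kwargs (shallow merge only).
    merged = {**(preset or {}), **contract, **overrides}
    return Plan(
        # Read-only views so nested settings cannot be mutated after planning.
        source=MappingProxyType(dict(merged["source"])),
        target=MappingProxyType(dict(merged["target"])),
        mode=merged["mode"],
    )


plan = build_plan(
    {"source": {"path": "/tmp/in"}, "target": {"table": "s_customers"}},
    preset={"mode": "append"},
    mode="upsert",  # a kwarg overrides the preset value
)
print(plan.mode)
```

Freezing the object before any write means every later stage reads the same settings, which is what makes the plan safe to log and compare across runs.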
ContractForge supports split contracts because ingestion logic, catalog metadata, operational ownership and access control usually have different owners and approval paths.
| File | Primary owner | Contains | Applied by |
|---|---|---|---|
| *.ingestion.yaml | Data engineering | Source, target, mode, schema policy, transformations, quality, idempotency and execution controls. | ingest_bundle() or contractforge validate-bundle + job execution. |
| *.annotations.yaml | Data governance / domain owners | Table descriptions, column descriptions, aliases, PII classification, deprecation metadata and tags. | Ingestion workflow or apply_annotations_bundle(). |
| *.operations.yaml | SRE / platform / data product owner | Business owner, technical owner, criticality, expected frequency, freshness SLA, runbook and alert metadata. | Ingestion workflow records it to control tables. |
| *.access.yaml | Security / compliance | Grants, row filters, column masks and drift policy. | Dedicated access workflow with a privileged principal. |
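To make the file-suffix convention in the table concrete, the sketch below groups contract files into bundles by their shared stem. The `group_bundle` helper and the example paths are hypothetical; the framework's own bundle discovery may work differently.

```python
from pathlib import PurePath

# Suffixes taken from the table above.
BUNDLE_SUFFIXES = (
    ".ingestion.yaml",
    ".annotations.yaml",
    ".operations.yaml",
    ".access.yaml",
)


def group_bundle(paths):
    """Group file paths by stem, keyed by contract kind (hypothetical helper)."""
    bundles = {}
    for raw in paths:
        name = PurePath(raw).name
        for suffix in BUNDLE_SUFFIXES:
            if name.endswith(suffix):
                stem = name[: -len(suffix)]
                kind = suffix.split(".")[1]  # ".access.yaml" -> "access"
                bundles.setdefault(stem, {})[kind] = raw
                break  # files matching no suffix are simply ignored
    return bundles


bundles = group_bundle([
    "contracts/customers.ingestion.yaml",
    "contracts/customers.access.yaml",
    "contracts/README.md",
])
print(bundles)
```

Keeping one file per owner lets each file go through its own review path while the shared stem still ties the pieces to one table.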
The execution order is part of the product contract. It allows users to reason about cost, failure modes and observability without reading the implementation.
layer is a logical classification used by presets, validations and observability. target.schema is the physical schema in the catalog. They can be the same, but they do not have to be.
target:
  catalog: main
  schema: crm_curated
  table: s_customers
  layer: silver
ContractForge distinguishes between a thrown Python exception and the persisted operational state of a run. When an ingestion fails, the framework writes the failed run, the short error message and the full traceback to control tables, then raises ContractForgeExecutionError to the caller by default.
Because the exception propagates, notebook and job tasks stop on their own when a run ends as FAILED or ABORTED.
Use raise_on_failure=False in tests or exploratory notebooks when the failed result itself is the assertion target.
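The record-then-raise behavior described above can be illustrated with a self-contained stand-in. Nothing here calls the real framework: `run_contract`, the in-memory `control_log` list and the `SUCCESS` status value are assumptions for the sketch, and the exception class is a local placeholder that only borrows the name from the text.

```python
import traceback


class ContractForgeExecutionError(Exception):
    """Local placeholder for the framework exception named in the text."""


def run_contract(step, control_log, raise_on_failure=True):
    """Run `step`, persist the outcome, then raise or return (hypothetical)."""
    try:
        rows = step()
    except Exception as exc:
        record = {
            "status": "FAILED",
            "error_message": str(exc),                # short message
            "stack_trace": traceback.format_exc(),    # full traceback
        }
        control_log.append(record)  # evidence is written before raising
        if raise_on_failure:
            raise ContractForgeExecutionError(record["error_message"]) from exc
        return record
    record = {"status": "SUCCESS", "rows_written": rows}  # assumed status name
    control_log.append(record)
    return record


def broken_step():
    raise ValueError("source unavailable")


log = []
result = run_contract(broken_step, log, raise_on_failure=False)
print(result["status"], result["error_message"])
```

With `raise_on_failure=False` a test can assert on the returned record directly; with the default, the same record is still persisted but the caller sees the exception.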
The result payload and control tables share the same operational vocabulary. A notebook can inspect the result immediately, while jobs and dashboards query persisted evidence.
| Question | Result payload | Control-table evidence |
|---|---|---|
| Did the run succeed? | status, error_message | ctrl_ingestion_runs.status, ctrl_ingestion_errors.stack_trace |
| How much data moved? | rows_read, rows_written, rows_inserted, rows_updated, rows_deleted | ctrl_ingestion_runs metrics and operation_metrics_json |
| Did quality pass? | quality_status, rows_quarantined | ctrl_ingestion_quality, ctrl_ingestion_quarantine |
| Did schema change? | schema_changes | ctrl_ingestion_schema_changes |
| Which runtime was used? | runtime_type, spark_version, python_version, framework_version | ctrl_ingestion_runs and ctrl_ingestion_metadata |
| Did governance apply? | governance, annotations_status | ctrl_ingestion_annotations, ctrl_ingestion_operations, ctrl_ingestion_access |
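A notebook could answer the first few questions in the table with a small helper like the one below. It assumes the result payload behaves like a dict keyed by the field names listed above; the `summarize` function and the sample values are invented for illustration.

```python
def summarize(result):
    """Turn a result payload (assumed dict-like) into short report lines."""
    status = result.get("status")
    lines = [f"status={status}"]
    if status in ("FAILED", "ABORTED"):
        lines.append(f"error={result.get('error_message')}")
    lines.append(
        f"rows read/written={result.get('rows_read', 0)}/{result.get('rows_written', 0)}"
    )
    if result.get("rows_quarantined", 0):
        lines.append(
            f"quality={result.get('quality_status')} "
            f"({result['rows_quarantined']} quarantined)"
        )
    if result.get("schema_changes"):
        lines.append(f"schema changes={len(result['schema_changes'])}")
    return lines


# Made-up payload using field names from the table.
sample = {
    "status": "FAILED",
    "error_message": "boom",
    "rows_read": 10,
    "rows_written": 0,
    "quality_status": "FAILED",
    "rows_quarantined": 3,
    "schema_changes": [],
}
summary = summarize(sample)
print("\n".join(summary))
```

The same questions can be answered later from the control tables, so the helper is only a convenience for interactive runs.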
The same contract should be explicit about runtime-sensitive behavior. Some capabilities depend on Databricks serverless, classic clusters, Unity Catalog external locations or installed Spark connector libraries.
| Area | Serverless guidance | Classic cluster guidance |
|---|---|---|
| Object storage | Prefer Volumes or External Locations. | Can use direct Hadoop/S3A/ABFS configuration when credentials are allowed. |
| Auto Loader | Use checkpoint/schema locations available to the workspace. | Use cloudFiles with cluster-level storage access. |
| JDBC | Requires driver availability and network route. | Install driver and configure VPC/firewall/peering as needed. |
| External Spark connectors | Depends on serverless library support. | Install connector jars/packages on the cluster. |