Plan

The normalized execution object. YAML, presets and kwargs become an immutable plan before data is written.

Connector

The source reader. It should retrieve data and expose source metadata, not hide business transformations.

Transform

Physical normalization before quality/write: mapping, JSON parsing, flattening, arrays, projections and deduplication.

Writer mode

The Delta write semantics: append, overwrite, upsert, hash diff, SCD2 or full snapshot soft delete.
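
The Plan definition above can be illustrated with a minimal sketch. The Plan dataclass and build_plan helper are hypothetical and are not ContractForge's API. The precedence order (preset, then YAML, then kwargs) and the mode spellings are assumptions made for the example.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a normalized, immutable execution plan."""
    settings: Mapping[str, Any]


def build_plan(preset: dict, yaml_contract: dict, **kwargs: Any) -> Plan:
    # Assumed precedence: preset defaults < YAML contract < explicit kwargs.
    merged = {**preset, **yaml_contract, **kwargs}
    # MappingProxyType exposes a read-only view, so settings cannot be mutated.
    return Plan(settings=MappingProxyType(merged))


plan = build_plan(
    preset={"mode": "append", "layer": "silver"},
    yaml_contract={"mode": "upsert", "table": "s_customers"},
    mode="scd2",
)
print(plan.settings["mode"])  # scd2 under the assumed precedence
```

Freezing both the dataclass and the mapping means any later attempt to change the plan raises an error instead of silently altering a run that is about to write data.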

Contract files and responsibility boundaries

ContractForge supports split contracts because ingestion logic, catalog metadata, operational ownership and access control usually have different owners and approval paths.

| File | Primary owner | Contains | Applied by |
|---|---|---|---|
| *.ingestion.yaml | Data engineering | Source, target, mode, schema policy, transformations, quality, idempotency and execution controls. | ingest_bundle() or contractforge validate-bundle + job execution. |
| *.annotations.yaml | Data governance / domain owners | Table descriptions, column descriptions, aliases, PII classification, deprecation metadata and tags. | Ingestion workflow or apply_annotations_bundle(). |
| *.operations.yaml | SRE / platform / data product owner | Business owner, technical owner, criticality, expected frequency, freshness SLA, runbook and alert metadata. | Ingestion workflow records it to control tables. |
| *.access.yaml | Security / compliance | Grants, row filters, column masks and drift policy. | Dedicated access workflow with a privileged principal. |
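
As a rough illustration of the split, the sketch below groups the files of one bundle by suffix and keeps the access contract out of the ingestion workflow. The classify_bundle helper, the role labels and the file paths are hypothetical. Only the four suffixes come from the table above.

```python
from pathlib import Path

# Suffixes come from the table above; the role labels are illustrative.
CONTRACT_ROLES = {
    ".ingestion.yaml": "ingestion",
    ".annotations.yaml": "annotations",
    ".operations.yaml": "operations",
    ".access.yaml": "access",
}


def classify_bundle(paths):
    """Group contract files by role; hypothetical helper, not ContractForge API."""
    bundle = {}
    for path in paths:
        name = Path(path).name
        for suffix, role in CONTRACT_ROLES.items():
            if name.endswith(suffix):
                bundle[role] = name
                break
    return bundle


files = [
    "contracts/customers.ingestion.yaml",
    "contracts/customers.annotations.yaml",
    "contracts/customers.access.yaml",
]
bundle = classify_bundle(files)
# Access is kept apart because a dedicated, privileged workflow applies it.
ingestion_workflow = {k: v for k, v in bundle.items() if k != "access"}
print(sorted(ingestion_workflow))  # ['annotations', 'ingestion']
```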

Execution order

The execution order is part of the product contract. It allows users to reason about cost, failure modes and observability without reading the implementation.

  • Source resolution happens before DataFrame transformations.
  • Schema and quality gates run before write operations.
  • Annotations are applied after the table and final columns exist.
  • Access governance is intentionally deferred to dedicated commands.
sequenceDiagram
    participant User
    participant Contract
    participant Spark
    participant Delta
    participant Ctrl as Control tables
    participant Gov as Governance
    rect rgb(255, 247, 236)
        User->>Contract: YAML or Python contract
        Contract->>Contract: presets, enums and plan validation
    end
    rect rgb(238, 246, 248)
        Contract->>Spark: resolve source connector
        Spark->>Spark: mapping, transform.shape and dedup
    end
    rect rgb(244, 240, 255)
        Spark->>Spark: schema policy and quality gates
    end
    rect rgb(237, 247, 239)
        Spark->>Delta: write with selected mode
    end
    rect rgb(248, 238, 238)
        Spark->>Ctrl: persist runs, metrics, errors and lineage
        Spark->>Gov: apply annotations and operations
        Gov-->>Ctrl: record governance evidence
    end
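
The ordering guarantee can be sketched as a sequential runner that stops at the first failing stage, so a failed gate never reaches the write. The stage names mirror the order described above. The run_in_order helper and the placeholder stage bodies are hypothetical and do not reflect the framework's internals.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[dict], bool]]


def run_in_order(stages: List[Stage], state: dict) -> List[str]:
    """Run stages sequentially and stop at the first one that reports failure."""
    completed = []
    for name, step in stages:
        if not step(state):
            break
        completed.append(name)
    return completed


# Stage names mirror the documented order; the bodies are placeholders.
stages: List[Stage] = [
    ("validate_plan", lambda s: True),
    ("resolve_source", lambda s: True),
    ("transform", lambda s: True),
    ("schema_and_quality_gates", lambda s: s["quality_ok"]),
    ("write", lambda s: True),
    ("persist_evidence", lambda s: True),
    ("apply_annotations", lambda s: True),
]

print(run_in_order(stages, {"quality_ok": False}))
# ['validate_plan', 'resolve_source', 'transform']
```

This sketch only shows the ordering. It does not model how a failed run is recorded in control tables.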

Layer and physical schema are separate

layer is a logical classification used by presets, validations and observability. target.schema is the physical schema in the catalog. They can be the same, but they do not have to be.

target:
  catalog: main
  schema: crm_curated
  table: s_customers

layer: silver
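
To make the separation concrete, here is a small sketch in which the contract above is held as a plain Python dict. The qualified_table_name helper is hypothetical. It builds the physical name from target alone, so changing layer leaves the name untouched.

```python
contract = {
    "target": {"catalog": "main", "schema": "crm_curated", "table": "s_customers"},
    "layer": "silver",
}


def qualified_table_name(contract: dict) -> str:
    """Build catalog.schema.table from target only; layer plays no part."""
    target = contract["target"]
    return f'{target["catalog"]}.{target["schema"]}.{target["table"]}'


print(qualified_table_name(contract))  # main.crm_curated.s_customers
```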

Failure model

ContractForge distinguishes between a thrown Python exception and the persisted operational state of a run. When an ingestion fails, the framework writes the failed run, the short error message and the full traceback to control tables, then raises ContractForgeExecutionError to the caller by default.

Default

Fail fast to the caller

Notebook and job tasks stop naturally when a run ends in FAILED or ABORTED, because the error is raised to the caller.

Diagnostics

Inspect payloads explicitly

Use raise_on_failure=False in tests or exploratory notebooks when the failed result itself is the assertion target.
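
The two behaviors can be sketched as follows. This is a self-contained illustration, not the framework's code: run_contract is a hypothetical runner, the exception class is a local stand-in with the same name, the control table is a plain list, and the SUCCEEDED status value is an assumption.

```python
import traceback


class ContractForgeExecutionError(Exception):
    """Local stand-in so the sketch runs without the framework installed."""


control_table = []  # stand-in for the persisted control tables


def run_contract(step, raise_on_failure=True):
    """Hypothetical runner: persist the failure first, then raise or return."""
    try:
        step()
    except Exception as exc:
        record = {
            "status": "FAILED",
            "error_message": str(exc),
            "stack_trace": traceback.format_exc(),
        }
        control_table.append(record)  # evidence is written before raising
        if raise_on_failure:
            raise ContractForgeExecutionError(str(exc)) from exc
        return record
    # "SUCCEEDED" is an assumed status value for the happy path.
    return {"status": "SUCCEEDED", "error_message": None}


def broken_step():
    raise ValueError("source path not found")


result = run_contract(broken_step, raise_on_failure=False)
print(result["status"], "-", result["error_message"])
# FAILED - source path not found
```

With the default raise_on_failure=True, the same call raises after the record is appended, so the persisted state is available either way.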

Observability model

The result payload and control tables share the same operational vocabulary. A notebook can inspect the result immediately, while jobs and dashboards query persisted evidence.

| Question | Result payload | Control-table evidence |
|---|---|---|
| Did the run succeed? | status, error_message | ctrl_ingestion_runs.status, ctrl_ingestion_errors.stack_trace |
| How much data moved? | rows_read, rows_written, rows_inserted, rows_updated, rows_deleted | ctrl_ingestion_runs metrics and operation_metrics_json |
| Did quality pass? | quality_status, rows_quarantined | ctrl_ingestion_quality, ctrl_ingestion_quarantine |
| Did schema change? | schema_changes | ctrl_ingestion_schema_changes |
| Which runtime was used? | runtime_type, spark_version, python_version, framework_version | ctrl_ingestion_runs and ctrl_ingestion_metadata |
| Did governance apply? | governance, annotations_status | ctrl_ingestion_annotations, ctrl_ingestion_operations, ctrl_ingestion_access |
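
As an illustration of inspecting the result immediately, the sketch below answers three of the questions from a payload. It assumes the payload behaves like a plain dict and that schema_changes is a list. The field names follow the table, while the triage helper and the sample values are made up.

```python
def triage(result: dict) -> list:
    """Summarize run, quality and schema findings from a result payload."""
    findings = []
    # FAILED and ABORTED are the failure states named in the failure model.
    if result.get("status") in {"FAILED", "ABORTED"}:
        findings.append(f"run failed: {result.get('error_message')}")
    if result.get("rows_quarantined", 0) > 0:
        findings.append(f"{result['rows_quarantined']} rows quarantined")
    if result.get("schema_changes"):
        findings.append("schema changed")
    return findings


# Example payload; field names follow the table, values are invented.
result = {
    "status": "FAILED",
    "error_message": "quality gate failed",
    "rows_read": 100,
    "rows_written": 0,
    "rows_quarantined": 7,
    "schema_changes": [],
}
print(triage(result))
# ['run failed: quality gate failed', '7 rows quarantined']
```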

Runtime model

A contract should be explicit about runtime-sensitive behavior, because some capabilities depend on Databricks serverless, classic clusters, Unity Catalog external locations or installed Spark connector libraries.

| Area | Serverless guidance | Classic cluster guidance |
|---|---|---|
| Object storage | Prefer Volumes or External Locations. | Can use direct Hadoop/S3A/ABFS configuration when credentials are allowed. |
| Auto Loader | Use checkpoint/schema locations available to the workspace. | Use cloudFiles with cluster-level storage access. |
| JDBC | Requires driver availability and network route. | Install driver and configure VPC/firewall/peering as needed. |
| External Spark connectors | Depends on serverless library support. | Install connector jars/packages on the cluster. |