The vocabulary behind a ContractForge pipeline.

The framework is intentionally small in concept count. Most behavior comes from how a contract combines source, transformation, quality, schema policy, write mode and operational metadata.

Execution object

Plan

The normalized execution object. YAML, presets and kwargs become an immutable plan before data is written.

Source boundary

Connector

The source reader. It should retrieve data and expose source metadata, not hide business transformations.

Preparation boundary

Transform

Physical normalization before quality/write: mapping, JSON parsing, flattening, arrays, projections, casts, string cleanup, derivations and deduplication.

Target semantics

Writer mode

The Delta write semantics: append, overwrite, upsert, hash diff, historical or full snapshot soft delete.

Contract files and responsibility boundaries

ContractForge supports split contracts because ingestion logic, catalog metadata, operational ownership and access control usually have different owners and approval paths.

File	Primary owner	Contains	Applied by
`*.ingestion.yaml`	Data engineering	Source, target, mode, schema policy, transformations, quality, idempotency and execution controls.	`ingest_bundle()` or `contractforge validate-bundle` + job execution.
`*.annotations.yaml`	Data governance / domain owners	Table descriptions, column descriptions, aliases, PII classification, deprecation metadata and tags.	Ingestion workflow or `apply_annotations_bundle()`.
`*.operations.yaml`	SRE / platform / data product owner	Business owner, technical owner, criticality, expected frequency, freshness SLA, runbook and alert metadata.	Ingestion workflow records it to control tables.
`*.access.yaml`	Security / compliance	Grants, row filters, column masks and drift policy.	Dedicated access workflow with a privileged principal.

Decision guides

Use these guides when a contract can be implemented in more than one valid way. The goal is not to hide design choices, but to make the default choice explicit and reviewable.

Decision	Default choice	Use another path when
YAML, Python or bundle	Use YAML for repeatable pipelines and bundles when ingestion, annotations, operations and access have different owners.	Use `ingest()` directly in notebooks for exploratory work, complex gold transformations or migration steps that still need ContractForge write/quality/observability.
Direct connector or landed files	Use direct connectors for bounded reads with explicit page, record, byte and timeout limits.	Use object storage landing when the source behaves like a feed, produces large exports, needs replay, or has strict provider rate limits.
Serverless or classic cluster	Use serverless when the source can be reached through governed workspace capabilities, External Locations, Volumes, Lakehouse Federation or supported libraries.	Use classic clusters when you need direct Hadoop credential configuration, custom network routing, custom connector packages or dependency control.
Transform in connector or contract	Keep connectors focused on reading and source metadata. Put business shaping under `transform`.	Only parse enough in the connector to expose records safely. Column naming, flattening, casting, array explosion and deduplication belong to the contract.
Bronze or silver shaping	Keep bronze as raw and auditable as practical; apply deterministic structure in silver.	Shape in bronze only when the raw source is unusable for storage or when downstream governance requires a stable bronze schema immediately.
Write mode	Use the simplest mode that preserves source semantics: append for immutable events, overwrite for replaceable snapshots, current-state for current-state dimensions, historical for history.	Use `snapshot_reconcile_soft_delete` only for full snapshots; do not combine it with watermarks or source filters because that changes the meaning of missing rows.
Schema location	Declare source schemas where Spark needs them to read bytes correctly. Declare shape schemas where JSON strings or nested payloads need deterministic parsing.	Do not duplicate schema definitions for convenience. If the same schema appears in multiple places, check whether one is a file-read schema and the other is a transformation schema.

Execution order

The execution order is part of the product contract. It allows users to reason about cost, failure modes and observability without reading the implementation.

Source resolution happens before DataFrame transformations.
Schema and quality gates run before write operations.
Annotations are applied after the table and final columns exist.
Access governance is intentionally deferred to dedicated commands.

Layer and physical schema are separate

layer is a logical classification used by presets, validations and observability. target.schema is the physical schema in the catalog. They can be the same, but they do not have to be.

target:
  catalog: main
  schema: crm_curated
  table: s_customers

layer: silver

Failure model

ContractForge distinguishes between a thrown Python exception and the persisted operational state of a run. When an ingestion fails, the framework writes the failed run, the short error message and the full traceback to control tables, then raises ContractForgeExecutionError to the caller by default.

Default

Fail fast to the caller

Notebook and job tasks stop naturally when the contract returns FAILED or ABORTED.

Diagnostics

Inspect payloads explicitly

Use raise_on_failure=False in tests or exploratory notebooks when the failed result itself is the assertion target.

Observability model

The result payload and control tables share the same operational vocabulary. A notebook can inspect the result immediately, while jobs and dashboards query persisted evidence.

Question	Result payload	Control-table evidence
Did the run succeed?	`status`, `error_message`	`ctrl_ingestion_runs.status`, `ctrl_ingestion_errors.stack_trace`
How much data moved?	`rows_read`, `rows_written`, `rows_inserted`, `rows_updated`, `rows_deleted`	`ctrl_ingestion_runs` metrics and `operation_metrics_json`
Did quality pass?	`quality_status`, `rows_quarantined`	`ctrl_ingestion_quality`, `ctrl_ingestion_quarantine`
Did schema change?	`schema_changes`	`ctrl_ingestion_schema_changes`
Which runtime was used?	`runtime_type`, `spark_version`, `python_version`, `framework_version`	`ctrl_ingestion_runs` and `ctrl_ingestion_metadata`
Did governance apply?	`governance`, `annotations_status`	`ctrl_ingestion_annotations`, `ctrl_ingestion_operations`, `ctrl_ingestion_access`

Runtime model

The same contract should be explicit about runtime-sensitive behavior. Some capabilities depend on Databricks serverless, classic clusters, Unity Catalog external locations or installed Spark connector libraries.

Area	Serverless guidance	Classic cluster guidance
Object storage	Prefer Volumes or External Locations.	Can use direct Hadoop/S3A/ABFS configuration when credentials are allowed.
Auto Loader	Use checkpoint/schema locations available to the workspace.	Use cloudFiles with cluster-level storage access.
JDBC	Requires driver availability and network route.	Install driver and configure VPC/firewall/peering as needed.
Externalized systems	Prefer federation, Lakeflow/native connectors or landed tables/files.	Install project-owned connector jars/packages only when you intentionally use a custom resolver.

Plan​

Connector​

Transform​

Writer mode​

Contract files and responsibility boundaries​

Decision guides​

Execution order​

Layer and physical schema are separate​

Failure model​

Fail fast to the caller​

Inspect payloads explicitly​

Observability model​

Runtime model​