Choose table semantics explicitly.
A write mode is not just an implementation detail. It defines how new data relates to existing target state, how history is preserved and what source completeness is required.
Public mode names
Use these names in contracts:
| Mode | Display name | Meaning |
|---|---|---|
append | Append | Add prepared rows without reconciling existing target rows. |
overwrite | Overwrite | Replace the full target or declared target scope. |
upsert | Current-state upsert | Keep one current row per key by merging source rows into the target. |
hash_diff_upsert | Hash-diff upsert | Upsert only rows whose compared content changed. |
historical | Historical | Preserve old and current versions with validity windows. |
snapshot_reconcile_soft_delete | Snapshot reconciliation with soft delete | Reconcile a complete snapshot and mark missing active keys inactive/deleted. |
Decision guide
| Need | Recommended mode | Design note |
|---|---|---|
| Capture everything as received | append | Keep Bronze simple and auditable. |
| Rebuild a small aggregate | overwrite | Use when full recompute is cheaper than incremental logic. |
| Keep one row per key | upsert | Validate null and duplicate keys before MERGE. |
| Track current-state changes cheaply | hash_diff_upsert | Needs stable dedup ordering for the latest state. |
| Audit historical versions | historical | Restrict change columns to meaningful business attributes. |
| Deactivate records absent from source | snapshot_reconcile_soft_delete | Do not combine with watermark or partial filters. |
Mode requirements
The table below is the practical contract checklist. If a field is listed as required, validation should fail before the framework performs a dangerous or ambiguous write.
| Mode | Required fields | Recommended quality rules | Do not use when |
|---|---|---|---|
append | None beyond source and target. | required_columns, not_null for business keys, optional expressions. | The same batch can be replayed without an idempotency key. |
overwrite | None beyond source and target. | min_rows and required columns to avoid replacing a table with an empty or malformed source. | The source is partial or not deterministic. |
upsert | merge_keys. | not_null and unique_key on merge keys; deduplicate with deterministic order. | Multiple source rows can update the same target key. |
hash_diff_upsert | merge_keys plus either hash_keys or hash_strategy: all_columns_except; deterministic latest-state order when history exists. | not_null, unique_key after dedup, accepted values for low-cardinality dimensions. | The target has multiple versions per key and no reliable ordering column. |
historical | merge_keys or mode-specific keys, plus change columns when business comparison should be scoped. | Required keys, uniqueness after dedup, non-null effective date when used. | You only need current state or the source cannot produce stable business keys. |
snapshot_reconcile_soft_delete | merge_keys and source completeness evidence. | not_null, unique_key, min_rows for complete snapshots. | The source is filtered, incremental, watermarked or represents only a slice of current state. |
Optimization and table layout
Write semantics and physical layout are separate choices. Keep the mode focused on business state and use layout settings to improve query performance or maintenance behavior.
| Field | Use | Guidance |
|---|---|---|
partition_column, partition_value | Scope overwrite/replace patterns to a physical partition. | Only use partition_value with partition_column. Do not confuse partition filtering with source completeness. |
merge_strategy | Select the MERGE strategy used by merge modes. | Use the default unless the target is partitioned and the contract explicitly declares the partition strategy. |
merge_partition_column | Partition key for partition-aware merge strategies. | Required for partition merge strategies. |
replace_partitions_source_complete | Declares that the source contains the complete partition slice being replaced. | Use for partition replacement only when the source truly covers the full affected partition. |
cluster_columns | Liquid clustering columns for Databricks tables. | Prefer for new Databricks-managed Delta tables when query patterns are known. |
zorder_columns | Columns for explicit ZORDER after write. | Use with optimize_after_write=true; do not declare it as a passive hint. |
delta_properties | Delta table properties at creation/update time. | Use for CDF, retention, auto optimize and table-level behavior that belongs with the target contract. |
Partition replacement
upsert with merge_strategy: replace_partitions is for complete partition snapshots. ContractForge uses Databricks SQL INSERT INTO ... REPLACE WHERE ... SELECT so rows matching the partition predicate are replaced by the incoming query as one Delta operation.
Use this only when the source covers the full affected partition:
mode: upsert
merge_strategy: replace_partitions
merge_partition_column: order_date
replace_partitions_source_complete: true
merge_keys: [order_id]
If the source is incremental, late-arriving or filtered, use delta or delta_by_partition instead.
Snapshot reconciliation with soft delete
This mode is intentionally defensive. The source must represent the complete current state, otherwise the framework cannot tell if a target row disappeared or was simply outside the loaded slice.
- YAML
mode: snapshot_reconcile_soft_delete
merge_keys: [device_id]
source:
type: connector
connector: table
table: main.raw.devices_snapshot
read:
source_complete: true
Operational note
Use incremental modes for partial sources. A snapshot with watermark is a contradiction and should fail before data is touched.
Merge safety
MERGE-based modes fail before writing when the source contains null or duplicate keys that would make the target update ambiguous. The guardrail is generic and protects every merge target from non-deterministic updates.
mode: upsert
merge_keys: [customer_id]
quality_rules:
not_null: [customer_id]
unique_key: [customer_id]
transform:
deduplicate:
keys: [customer_id]
order_by: updated_at DESC NULLS LAST
Metrics by mode
Each mode returns logical metrics even when Delta history does not expose the same operation counters across runtimes. Use metrics_source to understand whether values came from logical calculations, Delta operation metrics or a mix.
| Mode | Important metrics | Dashboard interpretation |
|---|---|---|
append | rows_inserted, rows_written | Batch size and ingestion throughput. |
upsert | rows_inserted, rows_updated | Change ratio in current-state tables. |
hash_diff_upsert | rows_written, hash-change counts when available, hash_diff_candidate_rows on AWS | Whether source batches are actually changing target state. On Iceberg, file rewrite counters can be larger than the business change count. |
historical | expired versions and inserted versions | Historical churn and slowly changing dimension behavior. |
snapshot_reconcile_soft_delete | inserted, updated, soft-deleted rows | Completeness and unexpected disappearance monitoring. |
Mode examples
- YAML
- Python
# Current-state table
source:
type: connector
connector: jdbc
options:
url: "{{ secret:crm/jdbc_url }}"
dbtable: public.customers
driver: org.postgresql.Driver
target:
catalog: main
schema: crm_curated
table: s_customers
layer: silver
mode: upsert
merge_keys: [customer_id]
transform:
deduplicate:
keys: [customer_id]
order_by: updated_at DESC NULLS LAST
quality_rules:
not_null: [customer_id]
unique_key: [customer_id]
from contractforge import ingest
result = ingest(
source={
"type": "connector",
"connector": "jdbc",
"options": {
"url": "{{ secret:crm/jdbc_url }}",
"dbtable": "public.customers",
"driver": "org.postgresql.Driver",
},
},
target_catalog="main",
target_schema="crm_curated",
target_table="s_customers",
layer="silver",
mode="upsert",
merge_keys=["customer_id"],
transform={
"deduplicate": {
"keys": ["customer_id"],
"order_by": "updated_at DESC NULLS LAST",
},
},
quality_rules={
"not_null": ["customer_id"],
"unique_key": ["customer_id"],
},
)
Raise-on-failure behavior
Failed writes are not silent payloads by default. The API writes run evidence and then raises ContractForgeExecutionError, so Databricks Jobs and notebooks fail naturally.
from contractforge import ContractForgeExecutionError, ingest
try:
result = ingest(...)
except ContractForgeExecutionError as exc:
display(exc.result)
raise