Skip to main content

Normalize structure before quality and write.

Connectors retrieve data. Transformations make the physical DataFrame usable before quality rules and write modes run. Use this namespace for mutations that change rows, columns or values. Keep catalog descriptions, PII, aliases and business ownership in split governance contracts.

The canonical namespace

Use transform for physical DataFrame preparation. The top-level shape shortcut remains supported for simple notebooks, but new contracts should use the canonical namespace.

Execution order is deterministic:

  1. select_columns
  2. column_mapping
  3. transform.shape
  4. transform.cast
  5. transform.standardize
  6. transform.derive
  7. filter_expression
  8. custom_keys
  9. watermark_columns
  10. transform.deduplicate
  11. encoding cleanup and control columns
transform:
shape:
parse_json:
- column: raw_response
alias: payload
schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING>>>"
arrays:
- path: payload.events
mode: explode_outer
alias: event
columns:
event.id:
alias: event_id
cast: STRING
event.title:
alias: event_title
cast: STRING
cast:
event_id: string
standardize:
event_title:
trim: true
normalize_whitespace: true
derive:
event_date: to_date(ingestion_ts_utc)
deduplicate:
keys: [event_id]
order_by:
- column: ingestion_ts_utc
direction: desc
nulls: last

Design principles

Explicit schema

Parsing JSON requires Spark DDL. Avoid relying on runtime inference for important contracts.

Cardinality is intentional

Explode operations change row counts. Bronze blocks unsafe cardinality changes unless explicitly allowed.

Columns are projections

When columns is declared, only declared aliases remain as business columns.

Connectors stay neutral

Business structuring belongs in transformations, not in connector-specific workarounds.

Metadata stays out of transform

If a field describes meaning, ownership, PII, classification or catalog discoverability, it belongs in annotations or operations, not in transform.

Where schema belongs

ContractForge uses schema declarations in two different places. They should not be treated as interchangeable fields.

LocationPurposeUse it forDo not use it for
source.read.schemaPhysical source-row schema used while a connector or Spark reader materializes data.CSV, JSON, XML, Avro, file folders and REST response.mode: records where inference would be fragile.Business JSON inside an already-read string column.
Top-level schemas with shape.parse_json.schema_refReusable Spark DDL for parsing a JSON string after a DataFrame already exists.Event Hubs body, REST raw response, text payload columns and envelope formats.Configuring how Spark reads the original source files or connector records.
annotationsCatalog and governance metadata.Descriptions, aliases, tags, PII, sensitivity and data-product context.Any transformation, cast, explode or parsing behavior.

Practical rule

If Spark needs the schema to read the source, put it in source.read.schema. If ContractForge already has a string column and must parse its JSON content, put the DDL in top-level schemas and reference it from transform.shape.parse_json.

Transform field reference

FieldTypeRequiredBehavior
shapeobjectNoParses JSON strings, flattens structs, handles arrays and projects nested columns.
castmap string -> stringNoCasts existing columns to Spark SQL types such as string, double, timestamp or decimal(18,2).
standardizemap string -> objectNoApplies deterministic string cleanup to declared columns.
standardize.<column>.trimbooleanNoTrims leading and trailing whitespace.
standardize.<column>.lowerbooleanNoConverts text to lowercase. Cannot be combined with upper.
standardize.<column>.upperbooleanNoConverts text to uppercase. Cannot be combined with lower.
standardize.<column>.normalize_whitespacebooleanNoReplaces repeated whitespace with a single space.
standardize.<column>.empty_as_nullbooleanNoConverts the final empty string to null.
derivemap string -> stringNoCreates or replaces columns using Spark SQL expressions.
deduplicate.keysstring or listYes when deduplicate existsKey columns used to keep one row per key.
deduplicate.order_bystring or listYes when deduplicate existsDeterministic ordering. Prefer the structured list form for new contracts.
deduplicate.order_by[].columnstringYes for structured order itemsColumn used for ordering.
deduplicate.order_by[].directionasc or descNoSort direction. Defaults to desc.
deduplicate.order_by[].nullsfirst or lastNoNull ordering clause.

Cast, standardize and derive

Use these operations when the source is already tabular enough and the target contract needs explicit value preparation without custom notebook code.

transform:
cast:
order_id: string
amount: decimal(18,2)
updated_at: timestamp
standardize:
customer_email:
trim: true
lower: true
empty_as_null: true
status:
trim: true
upper: true
normalize_whitespace: true
derive:
order_date: to_date(updated_at)
net_amount: amount - coalesce(discount_amount, 0)

Use cast for target type preparation, standardize for simple string cleanup and derive for Spark SQL expressions that create columns. If the expression becomes large business logic, promote it to a reviewed view, notebook step or downstream model.

Shape field reference

The shape contract runs after the source connector and before quality rules and write modes. It is intended for physical normalization of nested or semi-structured records, not for business aggregation.

FieldTypeRequiredBehavior
parse_jsonlistNoParses declared string columns with an explicit Spark DDL schema using from_json.
parse_json[].columnstringYesSource string column or nested string path to parse.
parse_json[].schemastringYesSpark DDL schema. Use STRUCT<...>, ARRAY<...> or any valid Spark data type accepted by from_json.
parse_json[].aliasstringNoOutput column. Required when parsing a nested path. Must be a simple top-level column name.
parse_json[].drop_sourcebooleanNoDrops the original source column after parsing. Only supported for top-level source columns.
flatten.enabledbooleanNoFlattens struct fields into top-level columns.
flatten.separatorstringNoSeparator used in generated column names. Default behavior follows the implementation default.
flatten.max_depthintegerNoMaximum struct nesting depth to flatten.
flatten.includelistNoTop-level struct columns to flatten. Other columns are kept as-is.
flatten.excludelistNoPaths to exclude from flattening.
zip_arrayslistNoCombines parallel arrays into one array of structs before exploding.
zip_arrays[].aliasstringYesOutput array column containing structs.
zip_arrays[].columnsmapYesMap of source array path to output struct field name. Requires at least two arrays.
arrayslistNoTransforms arrays by keeping, serializing, sizing, taking first element or exploding.
arrays[].pathstringYesArray path. Use dot notation; do not use [] syntax.
arrays[].modeenumNokeep, to_json, size, first, explode or explode_outer.
arrays[].aliasstringNoOutput column. Required implicitly for cardinality-changing modes when no default alias is desired.
arrays[].allow_cartesianbooleanNoAllows multiple sibling explodes that may multiply rows. Default protects against accidental Cartesian expansion.
columnsmapNoFinal projection. When declared, only projected columns are kept.
columns.<path>.aliasstringNoOutput column. Defaults to the path with dots replaced by underscores when no expression is used.
columns.<path>.caststringNoSpark SQL type used to cast the projected expression.
columns.<path>.expressionstringNoSpark SQL expression. Requires an explicit alias.
allow_cardinality_change_on_bronzebooleanNoAllows explode/explode_outer in bronze contracts when the contract intentionally changes row cardinality.

Parallel arrays

Use zip_arrays before exploding when an API returns multiple arrays that represent aligned observations.

transform:
shape:
zip_arrays:
- alias: hourly_observation
columns:
payload.hourly.time: time
payload.hourly.temperature_2m: temperature
payload.hourly.relative_humidity_2m: humidity
arrays:
- path: hourly_observation
mode: explode_outer
alias: observation

Example patterns

NASA EONET

Raw REST JSON to rows

The REST connector captures raw JSON, parse_json applies an explicit DDL schema, and array explosion turns events into silver records.

Open-Meteo

Parallel arrays

Hourly weather responses return aligned arrays. zip_arrays preserves positional meaning before exploding.

USGS earthquakes

Nested structs and coordinates

Struct fields and array indexes can be projected declaratively, keeping notebooks free of Spark column plumbing.

Blob/S3 files

Schema first

Explicit file schemas plus transform.deduplicate make object storage examples deterministic and reviewable.

Deterministic deduplication

Use transform.deduplicate when a batch can contain multiple records for the same business key and the contract has a clear ordering rule.

transform:
deduplicate:
keys: [order_id]
order_by:
- column: updated_at
direction: desc
nulls: last
- column: sequence
direction: desc

For existing contracts, SQL fragments such as updated_at DESC NULLS LAST are still accepted. For new contracts, prefer the structured list because it is easier to validate and review.

Column mapping versus shape

Use column_mapping for simple source-to-target renames before technical columns are added. Use transform.shape.columns when the target projection needs casts, nested paths, expressions, array indexes or a complete structural rewrite.

column_mapping:
id: customer_id
ingestion_date: source_ingestion_date

transform:
shape:
columns:
properties.mag:
alias: magnitude
cast: DOUBLE
geometry.coordinates[0]:
alias: longitude
cast: DOUBLE

Nested arrays in any order

For nested arrays, declare the parent explode and then reference the generated alias in child operations. ContractForge resolves pending array paths iteratively, so the contract can express the intended hierarchy without custom Spark code.

transform:
shape:
arrays:
- path: payload.orders
mode: explode_outer
alias: order
- path: order.items
mode: explode_outer
alias: item
columns:
order.id:
alias: order_id
cast: STRING
item.sku:
alias: sku
cast: STRING
item.quantity:
alias: quantity
cast: INT

Cardinality rule

Two sibling explodes under the same parent are blocked unless allow_cartesian is true. Parent-child explodes are valid because each step has a clear row-expansion path.

Guardrails

  • JSON parsing requires a declared Spark DDL schema; this avoids accidental inference drift.
  • Changing cardinality in Bronze is blocked by default because raw layers should normally preserve source records.
  • Sibling arrays are not exploded independently unless the contract avoids accidental Cartesian products.
  • Deduplication order must be deterministic for merge and hash-diff modes.
  • lower and upper cannot be enabled together for the same standardized column.
  • cast, standardize and deduplicate.keys require columns to exist after shape and before write.