Transformations | ContractForge

The canonical namespace

Use transform.shape and transform.deduplicate for new contracts. The top-level shape shortcut remains convenient, but the transform namespace makes room for future transformation families.

transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING>>>"
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
    columns:
      event.id:
        alias: event_id
        cast: STRING
      event.title:
        alias: event_title
        cast: STRING
  deduplicate:
    keys: [event_id]
    order_by: ingestion_ts_utc DESC NULLS LAST

Design principles

Explicit schema

parse_json requires an explicit Spark DDL schema. Avoid relying on runtime inference for important contracts.

Cardinality is intentional

Explode operations change row counts. Bronze contracts block explode and explode_outer unless allow_cardinality_change_on_bronze is set.
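
As a rough sketch of that opt-in: the fragment below assumes it belongs to a contract declared as bronze elsewhere, and that the flag sits directly under shape, as the field reference lists it. The array path and alias are reused from the example above.

```yaml
# Sketch only: assumes this fragment is part of a bronze contract.
transform:
  shape:
    # Explicit opt-in; without it the explode below would be blocked in bronze.
    allow_cardinality_change_on_bronze: true
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
```

Keeping the flag next to the explode makes the row-count change visible in review.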

Columns are projections

When columns is declared, only declared aliases remain as business columns.
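
A minimal sketch of the projection rule, reusing the event struct from the example above. According to the field reference, a column without an alias or expression takes its path with dots replaced by underscores. The comments describe the expected outcome under that rule, not verified output.

```yaml
transform:
  shape:
    columns:
      event.id:
        cast: STRING          # no alias: output column should default to event_id
      event.title:
        alias: event_title    # explicit alias
        cast: STRING
      # Business columns not declared here (for example payload) are not kept.
```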

Connectors stay neutral

Business structuring belongs in transformations, not in connector-specific workarounds.

Shape field reference

The shape contract runs after the source connector and before quality rules and write modes. It is intended for physical normalization of nested or semi-structured records, not for business aggregation.

Field	Type	Required	Behavior
`parse_json`	list	No	Parses declared string columns with an explicit Spark DDL schema using `from_json`.
`parse_json[].column`	string	Yes	Source string column or nested string path to parse.
`parse_json[].schema`	string	Yes	Spark DDL schema. Use `STRUCT<...>`, `ARRAY<...>` or any valid Spark data type accepted by `from_json`.
`parse_json[].alias`	string	No	Output column. Required when parsing a nested path. Must be a simple top-level column name.
`parse_json[].drop_source`	boolean	No	Drops the original source column after parsing. Only supported for top-level source columns.
`flatten.enabled`	boolean	No	Flattens struct fields into top-level columns.
`flatten.separator`	string	No	Separator used in generated column names. When omitted, the implementation default applies.
`flatten.max_depth`	integer	No	Maximum struct nesting depth to flatten.
`flatten.include`	list	No	Top-level struct columns to flatten. Other columns are kept as-is.
`flatten.exclude`	list	No	Paths to exclude from flattening.
`zip_arrays`	list	No	Combines parallel arrays into one array of structs before exploding.
`zip_arrays[].alias`	string	Yes	Output array column containing structs.
`zip_arrays[].columns`	map	Yes	Map of source array path to output struct field name. Requires at least two arrays.
`arrays`	list	No	Transforms arrays by keeping, serializing, sizing, taking first element or exploding.
`arrays[].path`	string	Yes	Array path. Use dot notation; do not use `[]` syntax.
`arrays[].mode`	enum	No	`keep`, `to_json`, `size`, `first`, `explode` or `explode_outer`.
`arrays[].alias`	string	No	Output column. Set it explicitly for cardinality-changing modes (`explode`, `explode_outer`) when the default alias is not wanted.
`arrays[].allow_cartesian`	boolean	No	Allows multiple sibling explodes that may multiply rows. Blocked by default to protect against accidental Cartesian expansion.
`columns`	map	No	Final projection. When declared, only projected columns are kept.
`columns.<path>.alias`	string	No	Output column. Defaults to the path with dots replaced by underscores when no expression is used.
`columns.<path>.cast`	string	No	Spark SQL type used to cast the projected expression.
`columns.<path>.expression`	string	No	Spark SQL expression. Requires an explicit alias.
`allow_cardinality_change_on_bronze`	boolean	No	Allows `explode`/`explode_outer` in bronze contracts when the contract intentionally changes row cardinality.
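
The flatten and drop_source fields have no example elsewhere on this page, so here is a hedged sketch that combines them. The schema, the separator value and the excluded path are hypothetical. It also assumes that parse_json runs before flatten, so the parsed payload struct is available to flatten.

```yaml
transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<meta:STRUCT<source:STRING,page:INT>,status:STRING>"
        drop_source: true        # raw_response is a top-level column, so it can be dropped
    flatten:
      enabled: true
      separator: "_"             # hypothetical value; the default is implementation-defined
      max_depth: 2               # payload -> meta -> leaf fields
      include: [payload]         # flatten only the parsed struct, keep other columns as-is
      exclude: [payload.meta.page]
```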

Parallel arrays

Use zip_arrays before exploding when an API returns multiple arrays that represent aligned observations.

transform:
  shape:
    zip_arrays:
      - alias: hourly_observation
        columns:
          payload.hourly.time: time
          payload.hourly.temperature_2m: temperature
          payload.hourly.relative_humidity_2m: humidity
    arrays:
      - path: hourly_observation
        mode: explode_outer
        alias: observation
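
After the explode, each row carries one observation struct. As a sketch, its fields could then be projected with columns. The output aliases and casts are illustrative assumptions, not taken from a published contract.

```yaml
transform:
  shape:
    columns:
      observation.time:
        alias: observed_at
        cast: STRING             # kept as a string here; cast to TIMESTAMP if the format allows
      observation.temperature:
        alias: temperature_2m
        cast: DOUBLE
      observation.humidity:
        alias: relative_humidity_2m
        cast: DOUBLE
```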

Example patterns

NASA EONET

Raw REST JSON to rows

The REST connector captures raw JSON, parse_json applies an explicit DDL schema, and array explosion turns events into silver records.

Open-Meteo

Parallel arrays

Hourly weather responses return aligned arrays. zip_arrays preserves positional meaning before exploding.

USGS earthquakes

Nested structs and coordinates

Struct fields and array indexes can be projected declaratively, keeping notebooks free of Spark column plumbing.

Blob/S3 files

Schema first

Explicit file schemas plus transform.deduplicate make object storage examples deterministic and reviewable.

Column mapping versus shape

Use column_mapping for simple source-to-target renames before technical columns are added. Use transform.shape.columns when the target projection needs casts, nested paths, expressions, array indexes or a complete structural rewrite.

column_mapping:
  id: customer_id
  ingestion_date: source_ingestion_date

transform:
  shape:
    columns:
      properties.mag:
        alias: magnitude
        cast: DOUBLE
      geometry.coordinates[0]:
        alias: longitude
        cast: DOUBLE
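
An expression-based projection might look like the sketch below. It assumes that properties.time holds epoch milliseconds, that the runtime is Spark 3.1 or later (for timestamp_millis), and that the map key stays a source path while alias names the output column. The field reference only states that an expression requires an explicit alias.

```yaml
transform:
  shape:
    columns:
      properties.time:
        alias: event_time_utc                         # required when expression is used
        expression: "timestamp_millis(properties.time)"
```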

Nested arrays in any order

For nested arrays, declare the parent explode and reference its generated alias in child operations. ContractForge resolves pending array paths iteratively, so entries do not have to be listed parent-first, and the contract can express the intended hierarchy without custom Spark code.

transform:
  shape:
    arrays:
      - path: payload.orders
        mode: explode_outer
        alias: order
      - path: order.items
        mode: explode_outer
        alias: item
    columns:
      order.id:
        alias: order_id
        cast: STRING
      item.sku:
        alias: sku
        cast: STRING
      item.quantity:
        alias: quantity
        cast: INT
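
Because pending array paths are resolved iteratively, the same hierarchy should also be expressible child-first. A sketch, assuming that reading of "any order" is correct:

```yaml
transform:
  shape:
    arrays:
      # Child listed first: order.items stays pending until the order alias exists.
      - path: order.items
        mode: explode_outer
        alias: item
      # Parent listed second: resolving it unblocks the entry above.
      - path: payload.orders
        mode: explode_outer
        alias: order
```

Listing parents first is still easier to read in review, even if both orders resolve to the same plan.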

Cardinality rule

Two sibling explodes under the same parent are blocked unless allow_cartesian is true. Parent-child explodes are valid because each step has a clear row-expansion path.
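
To illustrate the rule, the sketch below explodes two sibling arrays under the same order parent. The payments array is hypothetical, and setting allow_cartesian on both siblings is an assumption; the field reference does not say whether one entry is enough. An order with 3 items and 2 payments would produce 3 × 2 = 6 rows.

```yaml
transform:
  shape:
    arrays:
      - path: payload.orders
        mode: explode_outer
        alias: order
      - path: order.items
        mode: explode_outer
        alias: item
        allow_cartesian: true    # explicit opt-in to row multiplication
      - path: order.payments     # hypothetical sibling of order.items
        mode: explode_outer
        alias: payment
        allow_cartesian: true
```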

Guardrails

JSON parsing requires a declared Spark DDL schema; this avoids accidental inference drift.
Changing cardinality in Bronze is blocked by default because raw layers should normally preserve source records.
Sibling arrays are not exploded independently unless the contract sets allow_cartesian; the default prevents accidental Cartesian products.
Deduplication order must be deterministic for merge and hash-diff modes.
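
Where a bronze contract should keep one row per source record, the non-exploding array modes from the field reference are an alternative to explode. The sketch below uses hypothetical paths and aliases, and assumes each mode writes its result to the given alias.

```yaml
transform:
  shape:
    arrays:
      - path: payload.events
        mode: size               # number of elements, row count unchanged
        alias: event_count
      - path: payload.tags       # hypothetical array
        mode: to_json            # serialize the array to a JSON string
        alias: tags_json
      - path: payload.sources    # hypothetical array
        mode: first              # keep only the first element
        alias: primary_source
```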