The canonical namespace

Use transform.shape and transform.deduplicate for new contracts. The top-level shape shortcut remains convenient, but the transform namespace makes room for future transformation families.

transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING>>>"
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
    columns:
      event.id:
        alias: event_id
        cast: STRING
      event.title:
        alias: event_title
        cast: STRING
  deduplicate:
    keys: [event_id]
    order_by: ingestion_ts_utc DESC NULLS LAST
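
To make the deduplicate block above concrete, here is a minimal plain-Python sketch of one plausible reading: sort by ingestion_ts_utc descending with nulls last, then keep the first row seen for each key. The function name deduplicate and the "first row per key wins" rule are assumptions made for illustration, not ContractForge's implementation.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def deduplicate(rows: List[Row], keys: List[str], order_col: str) -> List[Row]:
    """Keep one row per key, preferring the largest order_col (DESC NULLS LAST)."""
    # Rows with a value come first, sorted descending; rows with None go last.
    present = [r for r in rows if r.get(order_col) is not None]
    missing = [r for r in rows if r.get(order_col) is None]
    ordered = sorted(present, key=lambda r: r[order_col], reverse=True) + missing

    seen = set()
    kept = []
    for row in ordered:
        key = tuple(row.get(k) for k in keys)
        if key not in seen:  # assumption: the first row in sort order wins
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"event_id": "a", "event_title": "old", "ingestion_ts_utc": "2024-01-01T00:00:00Z"},
    {"event_id": "a", "event_title": "new", "ingestion_ts_utc": "2024-01-02T00:00:00Z"},
    {"event_id": "b", "event_title": "no timestamp", "ingestion_ts_utc": None},
]
result = deduplicate(rows, ["event_id"], "ingestion_ts_utc")
```

In this sample, the newer row for event_id "a" survives, and the row with a null timestamp is kept only because no other row shares its key.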

Design principles

Explicit schema

Parsing JSON requires an explicit Spark DDL schema. Avoid relying on runtime inference for important contracts.

Cardinality is intentional

Explode operations change row counts. Bronze contracts block explode and explode_outer unless allow_cardinality_change_on_bronze is set.
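
The row-count effect can be shown with a small plain-Python model of Spark's two explode variants: explode drops rows whose array is null or empty, while explode_outer keeps them with a null element. The helper name explode and the dictionary row format are illustrative assumptions; the sketch models the semantics and does not call Spark.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def explode(rows: List[Row], column: str, alias: str, outer: bool = False) -> List[Row]:
    """Emit one output row per array element; optionally keep rows with no elements."""
    out = []
    for row in rows:
        items = row.get(column) or []
        if not items:
            if outer:
                # explode_outer semantics: keep the row, element is null.
                out.append({**row, alias: None})
            continue
        for item in items:
            out.append({**row, alias: item})
    return out


rows = [
    {"id": 1, "events": ["a", "b"]},
    {"id": 2, "events": []},
    {"id": 3, "events": None},
]
inner = explode(rows, "events", "event")
outer = explode(rows, "events", "event", outer=True)
```

Three input rows become two rows with explode and four with explode_outer, which is why the mode is a deliberate contract decision rather than a default.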

Columns are projections

When columns is declared, only declared aliases remain as business columns.

Connectors stay neutral

Business structuring belongs in transformations, not in connector-specific workarounds.

Shape field reference

The shape contract runs after the source connector and before quality rules and write modes. It is intended for physical normalization of nested or semi-structured records, not for business aggregation.

| Field | Type | Required | Behavior |
| --- | --- | --- | --- |
| parse_json | list | No | Parses declared string columns with an explicit Spark DDL schema using from_json. |
| parse_json[].column | string | Yes | Source string column or nested string path to parse. |
| parse_json[].schema | string | Yes | Spark DDL schema. Use STRUCT<...>, ARRAY<...> or any valid Spark data type accepted by from_json. |
| parse_json[].alias | string | No | Output column. Required when parsing a nested path. Must be a simple top-level column name. |
| parse_json[].drop_source | boolean | No | Drops the original source column after parsing. Only supported for top-level source columns. |
| flatten.enabled | boolean | No | Flattens struct fields into top-level columns. |
| flatten.separator | string | No | Separator used in generated column names. Default behavior follows the implementation default. |
| flatten.max_depth | integer | No | Maximum struct nesting depth to flatten. |
| flatten.include | list | No | Top-level struct columns to flatten. Other columns are kept as-is. |
| flatten.exclude | list | No | Paths to exclude from flattening. |
| zip_arrays | list | No | Combines parallel arrays into one array of structs before exploding. |
| zip_arrays[].alias | string | Yes | Output array column containing structs. |
| zip_arrays[].columns | map | Yes | Map of source array path to output struct field name. Requires at least two arrays. |
| arrays | list | No | Transforms arrays by keeping, serializing, sizing, taking first element or exploding. |
| arrays[].path | string | Yes | Array path. Use dot notation; do not use [] syntax. |
| arrays[].mode | enum | No | keep, to_json, size, first, explode or explode_outer. |
| arrays[].alias | string | No | Output column. Required implicitly for cardinality-changing modes when no default alias is desired. |
| arrays[].allow_cartesian | boolean | No | Allows multiple sibling explodes that may multiply rows. Default protects against accidental Cartesian expansion. |
| columns | map | No | Final projection. When declared, only projected columns are kept. |
| columns.<path>.alias | string | No | Output column. Defaults to the path with dots replaced by underscores when no expression is used. |
| columns.<path>.cast | string | No | Spark SQL type used to cast the projected expression. |
| columns.<path>.expression | string | No | Spark SQL expression. Requires an explicit alias. |
| allow_cardinality_change_on_bronze | boolean | No | Allows explode/explode_outer in bronze contracts when the contract intentionally changes row cardinality. |
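
The alias rules for columns entries in the reference can be summarized in a short sketch: an explicit alias wins, an expression without an alias is rejected, and otherwise the path's dots become underscores. The function resolve_alias is hypothetical and exists only to illustrate the documented rule. How paths containing array indexes are named by default is not stated in the reference, so the sketch does not attempt it.

```python
from typing import Optional


def resolve_alias(
    path: str,
    alias: Optional[str] = None,
    expression: Optional[str] = None,
) -> str:
    """Return the output column name for one columns.<path> entry."""
    if alias:
        return alias
    if expression is not None:
        # Documented rule: an expression requires an explicit alias.
        raise ValueError(f"columns.{path}: expression requires an explicit alias")
    # Documented default: dots in the path are replaced by underscores.
    return path.replace(".", "_")


default_name = resolve_alias("event.id")
explicit_name = resolve_alias("properties.mag", alias="magnitude")
```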

Parallel arrays

Use zip_arrays before exploding when an API returns multiple arrays that represent aligned observations.

transform:
  shape:
    zip_arrays:
      - alias: hourly_observation
        columns:
          payload.hourly.time: time
          payload.hourly.temperature_2m: temperature
          payload.hourly.relative_humidity_2m: humidity
    arrays:
      - path: hourly_observation
        mode: explode_outer
        alias: observation
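
The positional pairing that zip_arrays is described as performing can be modeled in a few lines of plain Python. This sketch assumes that shorter arrays are padded with nulls, which is how Spark's arrays_zip behaves; whether ContractForge relies on arrays_zip is an assumption, and the function name zip_arrays here is only illustrative.

```python
from itertools import zip_longest
from typing import Any, Dict, List


def zip_arrays(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Combine parallel arrays into one list of structs, matched by position."""
    names = list(columns)
    return [
        dict(zip(names, values))
        # Pad shorter arrays with None so every position yields a full struct.
        for values in zip_longest(*columns.values(), fillvalue=None)
    ]


hourly = {
    "time": ["2024-01-01T00:00", "2024-01-01T01:00"],
    "temperature": [3.1, 2.8],
    "humidity": [81, 84],
}
observations = zip_arrays(hourly)
```

Exploding the zipped array yields one row per position. Exploding the three source arrays independently would instead multiply them into every combination of time, temperature and humidity.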

Example patterns

NASA EONET

Raw REST JSON to rows

The REST connector captures raw JSON, parse_json applies an explicit DDL schema, and array explosion turns events into silver records.

Open-Meteo

Parallel arrays

Hourly weather responses return aligned arrays. zip_arrays preserves positional meaning before exploding.

USGS earthquakes

Nested structs and coordinates

Struct fields and array indexes can be projected declaratively, keeping notebooks free of Spark column plumbing.

Blob/S3 files

Schema first

Explicit file schemas plus transform.deduplicate make object storage examples deterministic and reviewable.

Column mapping versus shape

Use column_mapping for simple source-to-target renames before technical columns are added. Use transform.shape.columns when the target projection needs casts, nested paths, expressions, array indexes or a complete structural rewrite.

column_mapping:
  id: customer_id
  ingestion_date: source_ingestion_date

transform:
  shape:
    columns:
      properties.mag:
        alias: magnitude
        cast: DOUBLE
      geometry.coordinates[0]:
        alias: longitude
        cast: DOUBLE
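
To illustrate what a projected path such as geometry.coordinates[0] refers to, the following sketch walks a nested record using dot notation and integer indexes. It is a plain-Python model, not the Spark column resolution ContractForge performs; the function resolve_path and the choice to return None for missing fields or out-of-range indexes are assumptions of this sketch.

```python
import re
from typing import Any

# One path segment: a field name followed by zero or more [n] indexes.
_SEGMENT = re.compile(r"([^.\[\]]+)((?:\[\d+\])*)")


def resolve_path(record: Any, path: str) -> Any:
    """Follow a dotted path with optional array indexes through nested data."""
    value = record
    for segment in path.split("."):
        match = _SEGMENT.fullmatch(segment)
        if match is None:
            raise ValueError(f"unsupported path segment: {segment!r}")
        name, indexes = match.groups()
        if not isinstance(value, dict) or name not in value:
            return None
        value = value[name]
        for index in re.findall(r"\[(\d+)\]", indexes):
            if not isinstance(value, list) or int(index) >= len(value):
                return None
            value = value[int(index)]
    return value


feature = {
    "properties": {"mag": 4.6},
    "geometry": {"coordinates": [-122.8, 38.8, 2.1]},
}
magnitude = resolve_path(feature, "properties.mag")
longitude = resolve_path(feature, "geometry.coordinates[0]")
```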

Nested arrays in any order

For nested arrays, declare the parent explode and reference its generated alias in the child operations. Because ContractForge resolves pending array paths iteratively, the parent and child entries can be listed in any order, and the contract expresses the intended hierarchy without custom Spark code.

transform:
  shape:
    arrays:
      - path: payload.orders
        mode: explode_outer
        alias: order
      - path: order.items
        mode: explode_outer
        alias: item
    columns:
      order.id:
        alias: order_id
        cast: STRING
      item.sku:
        alias: sku
        cast: STRING
      item.quantity:
        alias: quantity
        cast: INT
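
One way to picture the iterative resolution is a loop that repeatedly applies every array entry whose root is already available and registers its alias for later entries. The sketch below is a hypothetical model of that idea, not ContractForge's code; the function resolution_order and the assumption that payload is the only column available at the start are illustrative.

```python
from typing import Dict, Iterable, List


def resolution_order(arrays: List[Dict[str, str]], available: Iterable[str]) -> List[str]:
    """Return array paths in an order where each parent is resolved before its children."""
    known = set(available)
    pending = list(arrays)
    order = []
    while pending:
        # An entry is ready once the first segment of its path is a known column or alias.
        ready = [a for a in pending if a["path"].split(".")[0] in known]
        if not ready:
            unresolved = ", ".join(a["path"] for a in pending)
            raise ValueError(f"unresolvable array paths: {unresolved}")
        for entry in ready:
            order.append(entry["path"])
            known.add(entry["alias"])
        pending = [a for a in pending if a not in ready]
    return order


# The child is declared before its parent on purpose.
arrays = [
    {"path": "order.items", "alias": "item"},
    {"path": "payload.orders", "alias": "order"},
]
order = resolution_order(arrays, {"payload"})
```

Even with the child listed first, the parent path payload.orders is resolved before order.items.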

Cardinality rule

Two sibling explodes under the same parent are blocked unless allow_cartesian is true. Parent-child explodes are valid because each step has a clear row-expansion path.
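
The rule can be approximated by grouping exploding entries by their parent path and rejecting any group with more than one member unless the entries opt in. This is a sketch under assumptions: the function check_sibling_explodes is hypothetical, "parent" is taken to mean the path without its last segment, and every sibling is required to set allow_cartesian, which may be stricter than the real validation.

```python
from collections import defaultdict
from typing import Any, Dict, List

EXPLODING_MODES = {"explode", "explode_outer"}


def check_sibling_explodes(arrays: List[Dict[str, Any]]) -> None:
    """Raise if two explodes share a parent and have not opted into Cartesian expansion."""
    by_parent = defaultdict(list)
    for entry in arrays:
        if entry.get("mode") in EXPLODING_MODES:
            # "payload.orders" -> parent "payload"; a top-level path has parent "".
            parent = entry["path"].rpartition(".")[0]
            by_parent[parent].append(entry)
    for parent, entries in by_parent.items():
        opted_in = all(e.get("allow_cartesian", False) for e in entries)
        if len(entries) > 1 and not opted_in:
            paths = ", ".join(e["path"] for e in entries)
            raise ValueError(f"sibling explodes under {parent!r}: {paths}")


# Parent-child chain: each explode has a different parent, so this passes.
check_sibling_explodes([
    {"path": "payload.orders", "mode": "explode_outer", "alias": "order"},
    {"path": "order.items", "mode": "explode_outer", "alias": "item"},
])
```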

Guardrails

  • JSON parsing requires a declared Spark DDL schema; this avoids accidental inference drift.
  • Changing cardinality in Bronze is blocked by default because raw layers should normally preserve source records.
  • Sibling arrays under the same parent are not exploded independently unless the contract sets allow_cartesian to true; this prevents accidental Cartesian products.
  • Deduplication order must be deterministic for merge and hash-diff modes.