Explicit schema
Parsing JSON requires an explicit Spark DDL schema. Avoid relying on runtime schema inference for important contracts.
Transformations
Connectors retrieve data. Transformations make the physical shape usable: parse JSON, flatten structs, handle arrays, project columns and deduplicate batches.
Use transform.shape and transform.deduplicate for new contracts. The top-level shape shortcut remains convenient, but the transform namespace makes room for future transformation families.
transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING>>>"
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
    columns:
      event.id:
        alias: event_id
        cast: STRING
      event.title:
        alias: event_title
        cast: STRING
  deduplicate:
    keys: [event_id]
    order_by: ingestion_ts_utc DESC NULLS LAST
Explode operations change row counts. Bronze blocks unsafe cardinality changes unless explicitly allowed.
When columns is declared, only declared aliases remain as business columns.
Business structuring belongs in transformations, not in connector-specific workarounds.
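The effect of explode_outer and of a declared columns projection can be illustrated outside Spark. The following is a minimal plain-Python sketch, not ContractForge's implementation: the helpers `get_path`, `explode_outer` and `project` are hypothetical, and `json.loads` stands in for `from_json` without enforcing a schema. It shows that two input rows become three output rows and that only the declared aliases survive the projection.

```python
import json


def get_path(row, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    value = row
    for key in path.split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value


def explode_outer(rows, path, alias):
    """Emit one row per array element; keep one row with None for a missing or empty array."""
    out = []
    for row in rows:
        elements = get_path(row, path) or [None]
        for element in elements:
            out.append({**row, alias: element})
    return out


def project(rows, columns):
    """Keep only the declared aliases; `columns` maps a source path to its output alias."""
    return [
        {alias: get_path(row, path) for path, alias in columns.items()}
        for row in rows
    ]


raw_rows = [
    {"raw_response": '{"events": [{"id": "a", "title": "A"}, {"id": "b", "title": "B"}]}'},
    {"raw_response": '{"events": []}'},
]

# Stand-in for parse_json: json.loads does not apply a DDL schema.
parsed = [{**row, "payload": json.loads(row["raw_response"])} for row in raw_rows]
exploded = explode_outer(parsed, "payload.events", "event")
events = project(exploded, {"event.id": "event_id", "event.title": "event_title"})

print(len(raw_rows), "input rows ->", len(events), "output rows")
for event in events:
    print(event)
```

The row with an empty events array is kept with null business columns, which is the difference between explode_outer and explode. The row-count change is also why the bronze safeguard exists.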
The shape contract runs after the source connector and before quality rules and write modes. It is intended for physical normalization of nested or semi-structured records, not for business aggregation.
| Field | Type | Required | Behavior |
|---|---|---|---|
| parse_json | list | No | Parses declared string columns with an explicit Spark DDL schema using from_json. |
| parse_json[].column | string | Yes | Source string column or nested string path to parse. |
| parse_json[].schema | string | Yes | Spark DDL schema. Use STRUCT<...>, ARRAY<...> or any valid Spark data type accepted by from_json. |
| parse_json[].alias | string | No | Output column. Required when parsing a nested path. Must be a simple top-level column name. |
| parse_json[].drop_source | boolean | No | Drops the original source column after parsing. Only supported for top-level source columns. |
| flatten.enabled | boolean | No | Flattens struct fields into top-level columns. |
| flatten.separator | string | No | Separator used in generated column names. When omitted, the implementation default applies. |
| flatten.max_depth | integer | No | Maximum struct nesting depth to flatten. |
| flatten.include | list | No | Top-level struct columns to flatten. Other columns are kept as-is. |
| flatten.exclude | list | No | Paths to exclude from flattening. |
| zip_arrays | list | No | Combines parallel arrays into one array of structs before exploding. |
| zip_arrays[].alias | string | Yes | Output array column containing structs. |
| zip_arrays[].columns | map | Yes | Map of source array path to output struct field name. Requires at least two arrays. |
| arrays | list | No | Transforms arrays by keeping, serializing, sizing, taking first element or exploding. |
| arrays[].path | string | Yes | Array path. Use dot notation; do not use [] syntax. |
| arrays[].mode | enum | No | keep, to_json, size, first, explode or explode_outer. |
| arrays[].alias | string | No | Output column. Set it explicitly for cardinality-changing modes (explode, explode_outer) when the default alias is not wanted. |
| arrays[].allow_cartesian | boolean | No | Allows multiple sibling explodes that may multiply rows. Sibling explodes are blocked by default to prevent accidental Cartesian expansion. |
| columns | map | No | Final projection. When declared, only projected columns are kept. |
| columns.<path>.alias | string | No | Output column. Defaults to the path with dots replaced by underscores when no expression is used. |
| columns.<path>.cast | string | No | Spark SQL type used to cast the projected expression. |
| columns.<path>.expression | string | No | Spark SQL expression. Requires an explicit alias. |
| allow_cardinality_change_on_bronze | boolean | No | Allows explode/explode_outer in bronze contracts when the contract intentionally changes row cardinality. |
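The flatten options are easier to read against a concrete record. The sketch below is plain Python and rests on several assumptions that the table does not confirm: the hypothetical `flatten` helper uses `_` as its separator, counts `max_depth` as the number of struct levels expanded, and keeps excluded paths as nested values. It is meant only to show how include, exclude and max_depth could interact.

```python
def flatten(row, separator="_", max_depth=None, include=None, exclude=()):
    """Flatten nested dicts into top-level keys (all defaults here are assumptions)."""
    out = {}

    def walk(name, path, value, depth):
        can_descend = (
            isinstance(value, dict)
            and (max_depth is None or depth < max_depth)
            and path not in exclude
        )
        if can_descend:
            for key, child in value.items():
                walk(f"{name}{separator}{key}", f"{path}.{key}", child, depth + 1)
        else:
            out[name] = value

    for column, value in row.items():
        if include is not None and column not in include:
            out[column] = value  # columns outside `include` are kept as-is
        else:
            walk(column, column, value, 0)
    return out


record = {
    "id": 1,
    "properties": {"mag": 4.2, "place": {"name": "X", "code": "Y"}},
    "meta": {"v": 1},
}

one_level = flatten(record, include=["properties"], max_depth=1)
print(one_level)

everything = flatten(record)
print(everything)
```

With `include=["properties"]` and `max_depth=1`, only the first level of `properties` is expanded, so `properties_place` stays a struct and `meta` is untouched.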
Use zip_arrays before exploding when an API returns multiple arrays that represent aligned observations.
transform:
  shape:
    zip_arrays:
      - alias: hourly_observation
        arrays:
          time: payload.hourly.time
          temperature: payload.hourly.temperature_2m
          humidity: payload.hourly.relative_humidity_2m
    arrays:
      - path: hourly_observation
        mode: explode_outer
        alias: observation
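To see why zipping comes before exploding, here is a minimal plain-Python sketch of the same idea. The `zip_arrays` helper and the sample values are hypothetical, and padding shorter arrays with None is an assumption of this sketch rather than documented ContractForge behavior.

```python
from itertools import zip_longest


def zip_arrays(fields):
    """Combine parallel arrays into a list of structs, position by position.

    `fields` maps an output struct field name to its source array.
    Shorter arrays are padded with None (an assumption of this sketch).
    """
    names = list(fields)
    return [dict(zip(names, values)) for values in zip_longest(*fields.values())]


hourly = {
    "time": ["2024-01-01T00:00", "2024-01-01T01:00", "2024-01-01T02:00"],
    "temperature_2m": [3.1, 2.8, 2.5],
    "relative_humidity_2m": [81, 83],  # one reading missing
}

hourly_observation = zip_arrays(
    {
        "time": hourly["time"],
        "temperature": hourly["temperature_2m"],
        "humidity": hourly["relative_humidity_2m"],
    }
)

# Exploding the zipped array yields one row per hour, not one row per combination.
for observation in hourly_observation:
    print(observation)
```

Exploding the three source arrays independently would pair every time with every temperature and humidity value. Zipping first keeps each reading attached to its own position.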
The REST connector captures raw JSON, parse_json applies an explicit DDL schema, and array explosion turns events into silver records.
Hourly weather responses return aligned arrays. zip_arrays preserves positional meaning before exploding.
Struct fields and array indexes can be projected declaratively, keeping notebooks free of Spark column plumbing.
Explicit file schemas plus transform.deduplicate make object storage examples deterministic and reviewable.
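The determinism that transform.deduplicate provides comes from pairing keys with an explicit ordering. The sketch below mimics `keys: [event_id]` with `order_by: ingestion_ts_utc DESC NULLS LAST` in plain Python. The `deduplicate` helper is hypothetical, and keeping the first row seen on a tie is an assumption of this sketch.

```python
def _rank(row, order_column):
    """Sort key for DESC NULLS LAST: any non-null value outranks None."""
    value = row.get(order_column)
    return (value is not None, value)


def deduplicate(rows, keys, order_column):
    """Keep one row per key: the row with the greatest non-null order_column."""
    best = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        current = best.get(key)
        if current is None or _rank(row, order_column) > _rank(current, order_column):
            best[key] = row
    return list(best.values())


batch = [
    {"event_id": "a", "title": "old", "ingestion_ts_utc": "2024-01-01T00:00:00Z"},
    {"event_id": "a", "title": "new", "ingestion_ts_utc": "2024-01-02T00:00:00Z"},
    {"event_id": "a", "title": "unknown", "ingestion_ts_utc": None},
    {"event_id": "b", "title": "only", "ingestion_ts_utc": None},
]

latest = deduplicate(batch, keys=["event_id"], order_column="ingestion_ts_utc")
for row in latest:
    print(row)
```

A row with a null timestamp wins only when no other row shares its key, which is the point of placing nulls last.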
Use column_mapping for simple source-to-target renames before technical columns are added. Use transform.shape.columns when the target projection needs casts, nested paths, expressions, array indexes or a complete structural rewrite.
column_mapping:
  id: customer_id
  ingestion_date: source_ingestion_date
transform:
  shape:
    columns:
      properties.mag:
        alias: magnitude
        cast: DOUBLE
      geometry.coordinates[0]:
        alias: longitude
        cast: DOUBLE
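A projection path such as `geometry.coordinates[0]` combines struct access with an array index. As a rough illustration of how such a path reads, the plain-Python sketch below resolves it against a nested record and applies a cast. The `resolve` helper, the sample record and the use of `float` in place of a DOUBLE cast are all assumptions made for this example.

```python
import re

# A path token is either a field name or a bracketed array index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(record, path):
    """Walk a path like 'geometry.coordinates[0]'; return None if any step is missing."""
    value = record
    for name, index in _TOKEN.findall(path):
        if value is None:
            return None
        if name:
            value = value.get(name) if isinstance(value, dict) else None
        else:
            position = int(index)
            in_range = isinstance(value, list) and position < len(value)
            value = value[position] if in_range else None
    return value


feature = {
    "properties": {"mag": "4.7"},
    "geometry": {"coordinates": [-122.5, 37.8, 10.0]},
}

# Source path -> (output alias, cast function standing in for a Spark SQL type).
columns = {
    "properties.mag": ("magnitude", float),
    "geometry.coordinates[0]": ("longitude", float),
}

row = {}
for path, (alias, cast) in columns.items():
    value = resolve(feature, path)
    row[alias] = cast(value) if value is not None else None

print(row)
```

A plain rename such as `id: customer_id` needs none of this machinery, which is why column_mapping remains the simpler choice for one-to-one renames.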
For nested arrays, declare the parent explode and then reference the generated alias in child operations. ContractForge resolves pending array paths iteratively, so the contract can express the intended hierarchy without custom Spark code.
transform:
  shape:
    arrays:
      - path: payload.orders
        mode: explode_outer
        alias: order
      - path: order.items
        mode: explode_outer
        alias: item
    columns:
      order.id:
        alias: order_id
        cast: STRING
      item.sku:
        alias: sku
        cast: STRING
      item.quantity:
        alias: quantity
        cast: INT
Two sibling explodes under the same parent are blocked unless allow_cartesian is true. Parent-child explodes are valid because each step has a clear row-expansion path.
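The difference between the two cases shows up in row counts. The plain-Python sketch below compares a parent-child explode (orders, then items) with two sibling explodes under the same order. The `discounts` array and the `outer` helper are hypothetical, added only to create a sibling of `items`.

```python
def outer(elements):
    """explode_outer semantics: an empty or missing array still yields one None element."""
    return elements if elements else [None]


payload = {
    "orders": [
        {
            "id": "o1",
            "items": [{"sku": "A", "quantity": 1}, {"sku": "B", "quantity": 2}],
            "discounts": ["d1", "d2", "d3"],  # hypothetical sibling array
        },
        {"id": "o2", "items": [], "discounts": []},
    ]
}

# Parent-child: each order expands into its own items, so rows add up.
parent_child = [
    {
        "order_id": order["id"],
        "sku": item["sku"] if item else None,
        "quantity": item["quantity"] if item else None,
    }
    for order in payload["orders"]
    for item in outer(order["items"])
]

# Siblings: items and discounts are unrelated, so every pair is emitted and rows multiply.
siblings = [
    (order["id"], item, discount)
    for order in payload["orders"]
    for item in outer(order["items"])
    for discount in outer(order["discounts"])
]

print("parent-child rows:", len(parent_child))
print("sibling rows:", len(siblings))
```

For order o1, two items and three discounts produce six rows that pair values with no real relationship, which is the expansion that allow_cartesian guards against.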