REST API connector | ContractForge

When to use it

Good fit

Bounded APIs

Daily snapshots, reference endpoints, small paginated APIs, public datasets and controlled operational feeds.

Use another path

Large extracts

If the payload is very large or the extraction is long-running, land files in object storage first and use file or Auto Loader ingestion.

Connector role

Retrieve, do not model

The connector should fetch pages and expose metadata. Use transform.shape for nested records, arrays and domain projections.

Runtime

Driver-side HTTP

Requests are made from the Python driver. Network egress, DNS, proxy and API allowlists must be configured at the platform level; the connector cannot work around them.

Runtime requirements

Requirement	Details
Driver egress	The Python driver must reach the API endpoint, including DNS, proxy, firewall and allowlist configuration.
API limits	Use explicit page, record, byte, timeout and retry limits.
Authentication	Store tokens, keys and headers in secrets. Redacted metadata is written to control tables.
Payload model	Use record mode for tabular arrays and raw mode for complex or evolving JSON documents.

Basic example

Use record extraction when the response contains a clear list of records and the JSON shape is already close to tabular.

source:
  type: connector
  connector: rest_api
  request:
    method: GET
    url: https://api.example.com/v1/orders
    headers:
      Accept: application/json
      Authorization: "Bearer {{ secret:orders-api/token }}"
    query:
      status: closed
  response:
    records_path: $.data
  pagination:
    type: cursor
    cursor_param: cursor
    next_cursor_path: $.next_cursor
  limits:
    max_pages: 50
    max_records: 100000
    max_page_bytes: 10485760
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2

target:
  catalog: main
  schema: bronze_api
  table: b_orders_api

layer: bronze
mode: scd0_append
schema_policy: additive_only
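To make the interaction between cursor pagination and the page and record limits concrete, here is a minimal Python sketch. It is not ContractForge's implementation: the function `iter_cursor_records`, the injected `fetch_page` callable and the in-memory `FAKE_PAGES` data are all hypothetical, and the sketch assumes extraction simply stops when a limit is reached (the real connector may raise instead).

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_cursor_records(
    fetch_page: Callable[[Optional[str]], Page],
    max_pages: int = 50,
    max_records: int = 100_000,
) -> Iterator[Dict[str, Any]]:
    """Yield records page by page until the cursor runs out or a limit is hit."""
    cursor: Optional[str] = None
    emitted = 0
    for _ in range(max_pages):                 # hard stop, like limits.max_pages
        page = fetch_page(cursor)
        for record in page.get("data", []):    # mirrors records_path: $.data
            if emitted >= max_records:         # hard stop, like limits.max_records
                return
            yield record
            emitted += 1
        cursor = page.get("next_cursor")       # mirrors next_cursor_path: $.next_cursor
        if not cursor:                         # no token means no further pages
            return


# Hypothetical in-memory "API", used only to exercise the loop.
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {"data": [{"order_id": 1}, {"order_id": 2}], "next_cursor": "p2"},
    "p2": {"data": [{"order_id": 3}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return FAKE_PAGES[cursor]


orders = list(iter_cursor_records(fake_fetch))
```

Injecting `fetch_page` keeps the loop independent of any HTTP client, so the stopping conditions can be checked without network access.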

Raw payload mode

Use raw mode when the response is a document, when the payload is deeply nested or when the API shape may evolve. This is the preferred pattern for APIs such as NASA EONET where a top-level document contains arrays of domain objects.

source:
  type: connector
  connector: rest_api
  request:
    method: GET
    url: https://eonet.gsfc.nasa.gov/api/v3/events
    query:
      status: open
      days: "30"
  response:
    mode: raw
    raw_column: raw_response
  limits:
    max_pages: 1
    max_page_bytes: 10485760
    timeout_seconds: 60

target:
  catalog: main
  schema: bronze_public
  table: b_nasa_eonet_raw

layer: bronze
mode: scd0_overwrite

Then apply transform.shape, either in the same contract or in a downstream contract, depending on the layer design.

transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING,geometry:ARRAY<STRUCT<date:STRING,type:STRING,coordinates:ARRAY<DOUBLE>>>>>>"
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
      - path: event.geometry
        mode: explode_outer
        alias: geometry
    columns:
      event.id: event_id
      event.title: title
      geometry.date:
        alias: event_ts
        cast: TIMESTAMP
      geometry.type: geometry_type
      geometry.coordinates[0]:
        alias: longitude
        cast: DOUBLE
      geometry.coordinates[1]:
        alias: latitude
        cast: DOUBLE
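The shape above produces one row per event geometry. The plain-Python sketch below illustrates that row logic only; it is not how ContractForge or Spark executes it. The function `flatten_events` is hypothetical, the sample document is made up rather than real EONET data, and the `TIMESTAMP` cast is left out, so `event_ts` stays a string here.

```python
import json
from typing import Any, Dict, List


def flatten_events(raw_response: str) -> List[Dict[str, Any]]:
    """Return one row per (event, geometry) pair, keeping empty arrays as null rows."""
    payload = json.loads(raw_response)
    rows: List[Dict[str, Any]] = []
    # explode_outer semantics: an empty or missing array still yields one row of nulls.
    for event in payload.get("events") or [None]:
        event = event or {}
        for geometry in event.get("geometry") or [None]:
            geometry = geometry or {}
            coords = geometry.get("coordinates") or []
            rows.append({
                "event_id": event.get("id"),
                "title": event.get("title"),
                "event_ts": geometry.get("date"),
                "geometry_type": geometry.get("type"),
                "longitude": coords[0] if len(coords) > 0 else None,
                "latitude": coords[1] if len(coords) > 1 else None,
            })
    return rows


# Made-up sample document with the same top-level layout as the schema above.
SAMPLE = json.dumps({"events": [
    {"id": "EV-1", "title": "Sample wildfire", "geometry": [
        {"date": "2024-01-01T00:00:00Z", "type": "Point", "coordinates": [-120.5, 38.2]},
        {"date": "2024-01-02T00:00:00Z", "type": "Point", "coordinates": [-120.4, 38.3]},
    ]},
    {"id": "EV-2", "title": "Sample storm", "geometry": []},
]})

rows = flatten_events(SAMPLE)
```

The second event has no geometry, yet it still appears as a row with null coordinates. That is the difference between `explode_outer` and a plain explode, which would drop it.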

Pagination strategies

Strategy	Use when	Typical fields
`none`	The endpoint returns the complete bounded response in one call.	`max_pages: 1`
`cursor`	The response returns a token for the next page.	`cursor_param`, `next_cursor_path`
`page`	The API accepts page numbers.	`page_param`, `start_page`, `max_pages`
`offset`	The API uses offset/limit semantics.	`offset_param`, `limit_param`, `page_size`

pagination:
  type: offset
  offset_param: offset
  limit_param: limit
  page_size: 1000
  start_offset: 0
limits:
  max_pages: 100
  max_records: 100000
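As a rough illustration of how these settings interact, the sketch below generates the query parameters an offset strategy would send. The helper `offset_queries` is hypothetical and only assumes the common convention that each page advances the offset by `page_size`; the connector's actual stopping rules may differ.

```python
from typing import Dict, Iterator


def offset_queries(
    page_size: int,
    max_pages: int,
    start_offset: int = 0,
    offset_param: str = "offset",
    limit_param: str = "limit",
) -> Iterator[Dict[str, int]]:
    """Yield the query parameters for each page, bounded by max_pages."""
    for page_index in range(max_pages):
        yield {
            offset_param: start_offset + page_index * page_size,
            limit_param: page_size,
        }


queries = list(offset_queries(page_size=1000, max_pages=100))
first, last = queries[0], queries[-1]
# Upper bound on records the pagination settings alone allow.
max_reachable_records = len(queries) * 1000
```

With `page_size: 1000` and `max_pages: 100`, pagination alone can return at most 100 × 1000 = 100000 records, which matches `max_records` in the example. If the two limits disagree, whichever is reached first determines where extraction stops.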

Authentication and secrets

Keep credentials in Databricks Secrets or another supported secret source. Secret values are redacted from result payloads, source metadata and persisted control tables.

request:
  url: https://api.example.com/v1/customers
  headers:
    Authorization: "Bearer {{ secret:crm-api/token }}"
    x-api-key: "{{ secret:crm-api/key }}"
  query:
    tenant: "{{ secret:crm-api/tenant }}"

Do not log raw headers

Use the source metadata fields in control tables for diagnostics. They are designed to store redacted options and request details.
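If custom diagnostics are unavoidable, mask sensitive values before anything is printed or stored. The sketch below shows one way to do that. The function `redact_headers`, the `SENSITIVE_HEADERS` list and the mask text are assumptions for illustration and do not describe ContractForge's internal redaction.

```python
import re
from typing import Dict

# Matches secret placeholders such as "{{ secret:crm-api/token }}".
SECRET_REF = re.compile(r"\{\{\s*secret:[^}]+\}\}")

# Hypothetical deny-list of header names that should never be logged.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "x-api-key", "cookie"}


def redact_headers(headers: Dict[str, str], mask: str = "***REDACTED***") -> Dict[str, str]:
    """Return a copy of headers that is safe to log."""
    redacted: Dict[str, str] = {}
    for name, value in headers.items():
        is_sensitive_name = name.lower() in SENSITIVE_HEADERS
        has_secret_ref = bool(SECRET_REF.search(str(value)))
        redacted[name] = mask if (is_sensitive_name or has_secret_ref) else value
    return redacted


safe = redact_headers({
    "Accept": "application/json",
    "Authorization": "Bearer abc123",
    "X-Tenant": "{{ secret:crm-api/tenant }}",
})
```

Matching on both the header name and the placeholder pattern covers resolved values (a real bearer token) as well as unresolved templates.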

Limits and failure behavior

REST extraction is intentionally bounded. Do not leave API jobs open-ended.

Limit	Why it matters
`timeout_seconds`	Prevents hung requests from blocking a job indefinitely.
`retry_attempts`	Handles transient API failures without hiding persistent errors.
`retry_backoff_seconds`	Reduces pressure on rate-limited APIs.
`max_pages`	Protects the driver from unbounded pagination loops.
`max_records`	Caps the materialized record set.
`max_page_bytes`	Caps the size of a single response page, guarding against unexpectedly large responses.
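The retry settings can be read as a bounded loop. The sketch below is one possible interpretation, not the connector's documented behavior: it assumes `retry_attempts` counts retries after the first call and that the wait grows linearly with each attempt. `TransientError` and `call_with_retries` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures, for example HTTP 429 or 503."""


def call_with_retries(
    call: Callable[[], T],
    retry_attempts: int = 3,
    retry_backoff_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying transient failures a bounded number of times."""
    for attempt in range(retry_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == retry_attempts:
                raise                      # persistent failure: surface it, do not hide it
            sleep(retry_backoff_seconds * (attempt + 1))   # wait 2s, 4s, 6s, ...
    raise AssertionError("unreachable")    # the loop always returns or raises


# Usage: a call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("503")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append)
```

Only errors marked as transient are retried; anything else propagates immediately, which keeps authentication or schema problems visible instead of burning the retry budget.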

Incremental API pattern

If the API supports a time filter, combine ContractForge watermark state with request query parameters. Keep the API's filter semantics explicit and test late-arriving data behavior.

source:
  type: connector
  connector: rest_api
  request:
    url: https://api.example.com/v1/orders
    query:
      updated_after: "{{ watermark:updated_at }}"
  response:
    records_path: $.orders
  incremental:
    watermark_column: updated_at

watermark_columns: [updated_at]
mode: scd1_upsert
merge_keys: [order_id]
transform:
  deduplicate:
    keys: [order_id]
    order_by: updated_at DESC NULLS LAST

Late updates

Watermark-based extraction assumes that the source update column changes whenever a record changes. If the API can mutate old records without updating the watermark column, those changes are never picked up; use periodic reconciliation or snapshot patterns instead.
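The deduplication and watermark steps in the incremental pattern can be illustrated in plain Python. This is a sketch under stated assumptions, not ContractForge's watermark logic: `latest_per_key` and `next_watermark` are hypothetical helpers, and `updated_at` values are assumed to be ISO-8601 strings in one consistent format and timezone, so that string comparison matches chronological order.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, Any]


def _rank(record: Record, order_by: str) -> Tuple[bool, str]:
    """Sort key where missing values rank lowest, mirroring DESC NULLS LAST."""
    value = record.get(order_by)
    return (value is not None, value or "")


def latest_per_key(records: Iterable[Record], key: str = "order_id",
                   order_by: str = "updated_at") -> List[Record]:
    """Keep the most recently updated record for each key."""
    latest: Dict[Any, Record] = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or _rank(record, order_by) > _rank(current, order_by):
            latest[record[key]] = record
    return list(latest.values())


def next_watermark(records: Iterable[Record], previous: Optional[str],
                   column: str = "updated_at") -> Optional[str]:
    """Advance the watermark to the highest value seen; never move it backwards."""
    values = [r[column] for r in records if r.get(column) is not None]
    if previous is not None:
        values.append(previous)
    return max(values) if values else None


# Made-up batch: order 1 appears twice, order 2 has no update timestamp.
batch = [
    {"order_id": 1, "updated_at": "2024-05-01T10:00:00Z"},
    {"order_id": 1, "updated_at": "2024-05-02T09:00:00Z"},
    {"order_id": 2, "updated_at": None},
]
deduplicated = latest_per_key(batch)
watermark = next_watermark(batch, previous="2024-04-30T00:00:00Z")
```

Order 2 has no `updated_at`, so it can never advance the watermark. Records like this are the ones a periodic reconciliation or snapshot run has to catch.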

Operational metadata

Each REST run records the connector name, URL, redacted request metadata, pagination details and source metrics for troubleshooting.

SELECT
  run_id,
  source_connector,
  source_request_redacted_json,
  source_pagination_redacted_json,
  source_response_redacted_json,
  source_limits_redacted_json,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector = 'rest_api'
ORDER BY started_at_utc DESC;

Common issues

Symptom	Likely cause	Action
DNS or connection timeout	Runtime cannot reach the API endpoint.	Fix workspace network, egress, allowlist, proxy or private endpoint.
401 or 403	Invalid token, expired secret or missing scope.	Verify secret value and API permissions.
Payload too large	Endpoint returned more data than expected.	Use pagination, filters or land files externally.
Incorrect rows after parsing	Payload was nested and flattened in notebook code.	Move structure handling into `transform.shape` with explicit schema.