
REST API

validated

Use the REST API connector for bounded HTTP API extraction with explicit request configuration, pagination, limits, redacted secrets and either record extraction or raw payload capture.

When to use it

Operational APIs

Bounded APIs

Daily snapshots, reference endpoints, small paginated APIs, public datasets and controlled operational feeds.

Alternative path

Large extracts

If the payload is very large or the extraction is long-running, land files in object storage first and use file or Auto Loader ingestion.

Connector boundary

Retrieve, do not model

The connector should fetch pages and expose metadata. Use transform.shape for nested records, arrays and domain projections.

Runtime boundary

Driver-side HTTP

Requests are made from the Python driver. Network egress, DNS, proxy and API allowlists must be configured at the platform level; the connector does not manage them.

Runtime requirements

| Requirement | Details |
| --- | --- |
| Driver egress | The Python driver must reach the API endpoint, including DNS, proxy, firewall and allowlist configuration. |
| API limits | Use explicit page, record, byte, timeout and retry limits. |
| Authentication | Store tokens, keys and headers in secrets. Redacted metadata is written to control tables. |
| Payload model | Use record mode for tabular arrays and raw mode for complex or evolving JSON documents. |

The shared runtime client validates request URLs and OAuth token URLs before fetching. Only http and https schemes are accepted, private/link-local hosts are rejected unless the operator explicitly opts in, and HTTP redirects are refused instead of followed.
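The checks described above can be sketched in plain Python. This is an illustration only, not the runtime client's code: `validate_url`, `allow_private` and `RefuseRedirects` are hypothetical names, and the sketch only inspects literal IP addresses, whereas a real client would also need to check the addresses a DNS name resolves to.

```python
import ipaddress
import urllib.request
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def validate_url(url: str, allow_private: bool = False) -> str:
    """Return the URL unchanged if it passes the checks, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a DNS name; resolved addresses are not checked in this sketch
    if ip is not None and not allow_private:
        if ip.is_private or ip.is_link_local or ip.is_loopback:
            raise ValueError(f"private or link-local host rejected: {host}")
    return url


class RefuseRedirects(urllib.request.HTTPRedirectHandler):
    """Returning None makes urllib raise HTTPError on a 3xx instead of following it."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


# Building the opener sends no request; it only wires in the handler.
opener = urllib.request.build_opener(RefuseRedirects)
```

Refusing redirects matters because a validated public URL could otherwise redirect the driver to an internal address that would have failed validation.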

Basic example

Use record extraction when the response contains a clear list of records and the JSON shape is already close to tabular.

source:
  type: connector
  connector: rest_api
  request:
    method: GET
    url: https://api.example.com/v1/orders
    headers:
      Accept: application/json
      Authorization: "Bearer {{ secret:orders-api/token }}"
    query:
      status: closed
  response:
    records_path: $.data
  pagination:
    type: cursor
    cursor_param: cursor
    next_cursor_path: $.next_cursor
  limits:
    max_pages: 50
    max_records: 100000
    max_page_bytes: 10485760
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2

target:
  catalog: main
  schema: bronze_api
  table: b_orders_api

layer: bronze
mode: append
schema_policy: additive_only
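To make the cursor settings concrete, here is a minimal sketch of a bounded cursor loop. It is not the connector's implementation: `fetch_all` and the injected `fetch_page` callable are hypothetical, the keys `data` and `next_cursor` mirror `records_path` and `next_cursor_path` from the contract above, and raising when a limit is hit is an assumption about failure behavior.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]


def fetch_all(
    fetch_page: Callable[[Optional[str]], Page],
    records_key: str = "data",
    cursor_key: str = "next_cursor",
    max_pages: int = 50,
    max_records: int = 100_000,
) -> List[Any]:
    """Follow cursors until none is returned, never exceeding the limits."""
    records: List[Any] = []
    cursor: Optional[str] = None  # the first request carries no cursor
    for _ in range(max_pages):
        page = fetch_page(cursor)
        records.extend(page.get(records_key) or [])
        if len(records) > max_records:
            raise RuntimeError(f"max_records exceeded: {len(records)} > {max_records}")
        cursor = page.get(cursor_key)
        if not cursor:
            return records  # the API signalled the last page
    # Assumption: treat an exhausted page budget as an error rather than truncating.
    raise RuntimeError("max_pages reached while a next cursor was still present")
```

Passing the HTTP call in as `fetch_page` keeps the loop testable without network access; in practice it would send `cursor_param=<cursor>` as a query parameter.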

Raw payload mode

Use raw mode when the response is a document, when the payload is deeply nested or when the API shape may evolve. This is the preferred pattern for APIs such as NASA EONET where a top-level document contains arrays of domain objects.

source:
  type: connector
  connector: rest_api
  request:
    method: GET
    url: https://eonet.gsfc.nasa.gov/api/v3/events
    query:
      status: open
      days: "30"
  response:
    mode: raw
    raw_column: raw_response
  limits:
    max_pages: 1
    max_page_bytes: 10485760
    timeout_seconds: 60

target:
  catalog: main
  schema: bronze_public
  table: b_nasa_eonet_raw

layer: bronze
mode: overwrite

Then use transform.shape in the next contract or the same contract, depending on the layer design.

transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING,geometry:ARRAY<STRUCT<date:STRING,type:STRING,coordinates:ARRAY<DOUBLE>>>>>>"
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
      - path: event.geometry
        mode: explode_outer
        alias: geometry
    columns:
      event.id: event_id
      event.title: title
      geometry.date:
        alias: event_ts
        cast: TIMESTAMP
      geometry.type: geometry_type
      geometry.coordinates[0]:
        alias: longitude
        cast: DOUBLE
      geometry.coordinates[1]:
        alias: latitude
        cast: DOUBLE
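For readers who want to see what the two `explode_outer` steps and the column projection produce, the following plain-Python sketch mimics them on a parsed payload. It is only an illustration of the semantics, not how transform.shape is implemented; `shape_events` is a hypothetical name, the sketch assumes Point geometries with flat `[longitude, latitude]` coordinates, and it leaves `event_ts` as a string rather than casting it.

```python
import json
from typing import Any, Dict, Iterator, List, Optional


def explode_outer(values: Optional[List[Any]]) -> Iterator[Any]:
    """Yield each element, or a single None when the array is missing or empty."""
    if not values:
        yield None
    else:
        yield from values


def shape_events(raw_response: str) -> List[Dict[str, Any]]:
    """Flatten events and their geometries into one row per geometry."""
    payload = json.loads(raw_response)
    rows: List[Dict[str, Any]] = []
    for event in explode_outer(payload.get("events")):
        event = event or {}
        for geometry in explode_outer(event.get("geometry")):
            geometry = geometry or {}
            coords = geometry.get("coordinates") or []
            rows.append({
                "event_id": event.get("id"),
                "title": event.get("title"),
                "event_ts": geometry.get("date"),  # TIMESTAMP cast omitted here
                "geometry_type": geometry.get("type"),
                "longitude": float(coords[0]) if len(coords) > 0 else None,
                "latitude": float(coords[1]) if len(coords) > 1 else None,
            })
    return rows
```

The outer variant keeps an event that has no geometry as a row with empty geometry columns, which is usually what a bronze-to-silver step wants so that events are not silently dropped.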

Pagination strategies

| Strategy | Use when | Typical fields |
| --- | --- | --- |
| none | The endpoint returns the complete bounded response in one call. | max_pages: 1 |
| cursor | The response returns a token for the next page. | cursor_param, next_cursor_path |
| page | The API accepts page numbers. | page_param, start_page, max_pages |
| offset | The API uses offset/limit semantics. | offset_param, limit_param, page_size |

pagination:
  type: offset
  offset_param: offset
  limit_param: limit
  page_size: 1000
  start_offset: 0
limits:
  max_pages: 100
  max_records: 100000
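As a rough illustration of offset/limit semantics, the generator below walks offsets in steps of `page_size`. The names are hypothetical, and two behaviors are assumptions rather than documented connector behavior: it stops as soon as a page comes back shorter than `page_size`, and it simply stops when `max_pages` is used up.

```python
from typing import Any, Callable, Dict, Iterator, List


def offset_pages(
    fetch: Callable[[Dict[str, int]], List[Any]],
    page_size: int = 1000,
    start_offset: int = 0,
    max_pages: int = 100,
    offset_param: str = "offset",
    limit_param: str = "limit",
) -> Iterator[List[Any]]:
    """Yield pages of records, advancing the offset by page_size each call."""
    offset = start_offset
    for _ in range(max_pages):
        rows = fetch({offset_param: offset, limit_param: page_size})
        if rows:
            yield rows
        if len(rows) < page_size:
            return  # assumption: a short page means the data is exhausted
        offset += page_size
```

With the settings shown above, the page budget caps the run at 100 requests of up to 1000 rows each, which lines up with `max_records: 100000`.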

Authentication and secrets

Keep credentials in Databricks Secrets or another supported secret source. Secret values are redacted from result payloads, source metadata and persisted control tables.

request:
  url: https://api.example.com/v1/customers
  headers:
    Authorization: "Bearer {{ secret:crm-api/token }}"
    x-api-key: "{{ secret:crm-api/key }}"
  query:
    tenant: "{{ secret:crm-api/tenant }}"

Do not log raw headers

Use the source metadata fields in control tables for diagnostics. They are designed to store redacted options and request details.
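One way to picture the redaction is to build two views of the same request section: one with secrets resolved for the live call, and one with every secret reference masked for logging and control tables. The sketch below assumes the `{{ secret:scope/key }}` placeholder syntax shown above; `prepare`, the mask string and the in-memory `secrets` mapping are hypothetical stand-ins, not ContractForge's resolver.

```python
import re
from typing import Dict, Mapping, Tuple

# Matches placeholders such as {{ secret:crm-api/token }}
SECRET_REF = re.compile(r"\{\{\s*secret:([^}\s]+)\s*\}\}")
MASK = "***REDACTED***"


def prepare(
    section: Mapping[str, str], secrets: Mapping[str, str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (live, redacted) copies of a headers or query mapping."""
    live: Dict[str, str] = {}
    redacted: Dict[str, str] = {}
    for name, template in section.items():
        # Resolve each reference from the secret store for the real request.
        live[name] = SECRET_REF.sub(lambda m: secrets[m.group(1)], template)
        # Mask the reference itself so the secret value never reaches logs.
        redacted[name] = SECRET_REF.sub(MASK, template)
    return live, redacted
```

Only the redacted copy should ever be serialized; the live copy is handed to the HTTP client and then discarded.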

Limits and failure behavior

REST extraction is intentionally bounded. Do not leave API jobs open-ended.

| Limit | Why it matters |
| --- | --- |
| timeout_seconds | Prevents hung requests from blocking a job indefinitely. |
| retry_attempts | Handles transient API failures without hiding persistent errors. |
| retry_backoff_seconds | Reduces pressure on rate-limited APIs. |
| max_pages | Protects the driver from unbounded pagination loops. |
| max_records | Caps the materialized record set. |
| max_page_bytes | Prevents unexpectedly large response pages. |
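The interaction between retries, backoff and the page size cap can be sketched as follows. Several details are assumptions made for the example: `retry_attempts` is read as the number of retries after the first attempt, the backoff doubles on each retry, `TransientError` is a hypothetical marker for retryable failures, and `send` stands in for an HTTP call that already applies `timeout_seconds`.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example HTTP 429 or 503)."""


def fetch_with_limits(
    send: Callable[[], bytes],
    retry_attempts: int = 3,
    retry_backoff_seconds: float = 2.0,
    max_page_bytes: int = 10_485_760,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call send(), retrying transient failures and rejecting oversized pages."""
    last_error: Optional[Exception] = None
    for attempt in range(retry_attempts + 1):
        try:
            body = send()  # send() is expected to enforce timeout_seconds itself
        except TransientError as exc:
            last_error = exc
            if attempt < retry_attempts:
                # Assumed exponential backoff: 2s, 4s, 8s with the defaults.
                sleep(retry_backoff_seconds * (2 ** attempt))
            continue
        if len(body) > max_page_bytes:
            raise ValueError(f"page of {len(body)} bytes exceeds max_page_bytes")
        return body
    # Persistent failures surface instead of being retried forever.
    raise RuntimeError(f"gave up after {retry_attempts} retries") from last_error
```

Only errors marked transient are retried, so an authentication failure or an oversized page fails immediately rather than being masked by the retry loop.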

Incremental API pattern

If the API supports a time filter, combine ContractForge watermark state with request query parameters. Keep the API's filter semantics explicit and test late-arriving data behavior.

source:
  type: connector
  connector: rest_api
  request:
    url: https://api.example.com/v1/orders
    query:
      updated_after: "{{ watermark:updated_at }}"
  response:
    records_path: $.orders
  incremental:
    watermark_column: updated_at

watermark_columns: [updated_at]
mode: upsert
merge_keys: [order_id]
transform:
  deduplicate:
    keys: [order_id]
    order_by: updated_at DESC NULLS LAST
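The deduplication and watermark steps in this contract can be pictured with a small in-memory sketch. It is not ContractForge's watermark logic: the function names are hypothetical, and it assumes `updated_at` values are ISO-8601 strings in a single format and time zone so that string comparison matches chronological order.

```python
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def _newer(a: Record, b: Record, col: str) -> bool:
    """True if a should replace b under 'col DESC NULLS LAST' ordering."""
    av, bv = a.get(col), b.get(col)
    if av is None:
        return False  # nulls sort last, so they never win
    if bv is None:
        return True
    return av > bv


def dedupe_latest(
    records: Iterable[Record], key: str = "order_id", order_by: str = "updated_at"
) -> List[Record]:
    """Keep one record per key: the one with the greatest non-null order_by."""
    latest: Dict[Any, Record] = {}
    for rec in records:
        current = latest.get(rec[key])
        if current is None or _newer(rec, current, order_by):
            latest[rec[key]] = rec
    return list(latest.values())


def next_watermark(
    records: Iterable[Record], current: Optional[str], col: str = "updated_at"
) -> Optional[str]:
    """Advance the watermark to the largest value seen; never move it backwards."""
    values = [r[col] for r in records if r.get(col) is not None]
    if current is not None:
        values.append(current)
    return max(values) if values else None
```

Deduplicating before the upsert matters because a page boundary or an overlapping time filter can return the same `order_id` more than once in a single run.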

Late updates

Watermark-based APIs assume that the source update column is suitable for incremental extraction. If the API can mutate old records without updating the watermark column, use periodic reconciliation or snapshot patterns.

Operational metadata

Each REST run records the connector, URL, redacted request metadata, pagination details and source metrics for troubleshooting.

SELECT
  run_id,
  source_connector,
  source_request_redacted_json,
  source_pagination_redacted_json,
  source_response_redacted_json,
  source_limits_redacted_json,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector = 'rest_api'
ORDER BY started_at_utc DESC;

Common issues

| Symptom | Likely cause | Action |
| --- | --- | --- |
| DNS or connection timeout | Runtime cannot reach the API endpoint. | Fix workspace network, egress, allowlist, proxy or private endpoint. |
| 401 or 403 | Invalid token, expired secret or missing scope. | Verify secret value and API permissions. |
| Payload too large | Endpoint returned more data than expected. | Use pagination, filters or land files externally. |
| Incorrect rows after parsing | Payload was nested and flattened in notebook code. | Move structure handling into transform.shape with explicit schema. |