Bounded APIs
Daily snapshots, reference endpoints, small paginated APIs, public datasets and controlled operational feeds.
Connector
Use the REST API connector for bounded HTTP API extraction with explicit request configuration, pagination, limits, redacted secrets and either record extraction or raw payload capture.
If the payload is very large or the extraction is long-running, land files in object storage first and use file or Auto Loader ingestion.
The connector should fetch pages and expose metadata. Use transform.shape for nested records, arrays and domain projections.
Requests are made from the Python driver. Network egress, DNS, proxy and API allowlists must be configured at the platform level.
| Requirement | Details |
|---|---|
| Driver egress | The Python driver must reach the API endpoint, including DNS, proxy, firewall and allowlist configuration. |
| API limits | Use explicit page, record, byte, timeout and retry limits. |
| Authentication | Store tokens, keys and headers in secrets. Redacted metadata is written to control tables. |
| Payload model | Use record mode for tabular arrays and raw mode for complex or evolving JSON documents. |
Use record extraction when the response contains a clear list of records and the JSON shape is already close to tabular.
source:
  type: connector
  connector: rest_api
  request:
    method: GET
    url: https://api.example.com/v1/orders
    headers:
      Accept: application/json
      Authorization: "Bearer {{ secret:orders-api/token }}"
    query:
      status: closed
  response:
    records_path: $.data
  pagination:
    type: cursor
    cursor_param: cursor
    next_cursor_path: $.next_cursor
  limits:
    max_pages: 50
    max_records: 100000
    max_page_bytes: 10485760
    timeout_seconds: 60
    retry_attempts: 3
    retry_backoff_seconds: 2
target:
  catalog: main
  schema: bronze_api
  table: b_orders_api
layer: bronze
mode: scd0_append
schema_policy: additive_only
Use raw mode when the response is a document, when the payload is deeply nested or when the API shape may evolve. This is the preferred pattern for APIs such as NASA EONET where a top-level document contains arrays of domain objects.
source:
  type: connector
  connector: rest_api
  request:
    method: GET
    url: https://eonet.gsfc.nasa.gov/api/v3/events
    query:
      status: open
      days: "30"
  response:
    mode: raw
    raw_column: raw_response
  limits:
    max_pages: 1
    max_page_bytes: 10485760
    timeout_seconds: 60
target:
  catalog: main
  schema: bronze_public
  table: b_nasa_eonet_raw
layer: bronze
mode: scd0_overwrite
Then use transform.shape in the next contract or the same contract, depending on the layer design.
transform:
  shape:
    parse_json:
      - column: raw_response
        alias: payload
        schema: "STRUCT<events:ARRAY<STRUCT<id:STRING,title:STRING,geometry:ARRAY<STRUCT<date:STRING,type:STRING,coordinates:ARRAY<DOUBLE>>>>>>"
    arrays:
      - path: payload.events
        mode: explode_outer
        alias: event
      - path: event.geometry
        mode: explode_outer
        alias: geometry
    columns:
      event.id: event_id
      event.title: title
      geometry.date:
        alias: event_ts
        cast: TIMESTAMP
      geometry.type: geometry_type
      geometry.coordinates[0]:
        alias: longitude
        cast: DOUBLE
      geometry.coordinates[1]:
        alias: latitude
        cast: DOUBLE
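To make the effect of the two `explode_outer` steps and the column projection concrete, here is a rough pure-Python equivalent. It is a sketch under assumptions: `explode_outer` and `shape_events` are hypothetical helpers written for this example, the sample payload is made up, and the `TIMESTAMP` cast is omitted, so `event_ts` stays a string.

```python
from typing import Any, Dict, Iterator, List, Optional


def explode_outer(items: Optional[List[Any]]) -> Iterator[Any]:
    """Yield each element, or a single None when the array is missing or empty."""
    if not items:
        yield None
        return
    yield from items


def shape_events(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Produce one flat row per (event, geometry) pair."""
    rows: List[Dict[str, Any]] = []
    for event in explode_outer(payload.get("events")):
        event = event or {}
        for geometry in explode_outer(event.get("geometry")):
            geometry = geometry or {}
            coords = geometry.get("coordinates") or []
            rows.append({
                "event_id": event.get("id"),
                "title": event.get("title"),
                "event_ts": geometry.get("date"),  # cast to TIMESTAMP omitted
                "geometry_type": geometry.get("type"),
                "longitude": coords[0] if len(coords) > 0 else None,
                "latitude": coords[1] if len(coords) > 1 else None,
            })
    return rows


# Made-up sample document in the same general shape as the schema above.
sample = {
    "events": [
        {"id": "EV1", "title": "Sample wildfire", "geometry": [
            {"date": "2024-05-01T00:00:00Z", "type": "Point", "coordinates": [-120.5, 38.2]},
            {"date": "2024-05-02T00:00:00Z", "type": "Point", "coordinates": [-120.4, 38.3]},
        ]},
        {"id": "EV2", "title": "Sample storm", "geometry": []},
    ]
}
rows = shape_events(sample)
```

The outer variant matters for the second event: with an empty `geometry` array it still yields one row with null geometry fields instead of dropping the event.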
| Strategy | Use when | Typical fields |
|---|---|---|
| none | The endpoint returns the complete bounded response in one call. | max_pages: 1 |
| cursor | The response returns a token for the next page. | cursor_param, next_cursor_path |
| page | The API accepts page numbers. | page_param, start_page, max_pages |
| offset | The API uses offset/limit semantics. | offset_param, limit_param, page_size |
pagination:
  type: offset
  offset_param: offset
  limit_param: limit
  page_size: 1000
  start_offset: 0
limits:
  max_pages: 100
  max_records: 100000
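With these values the two limits agree: 100 pages of 1,000 records is exactly 100,000 records. The sketch below shows one way the query parameters for each page could be derived from such a configuration. `offset_requests` is a hypothetical helper, not part of the connector, and shrinking the last page to respect `max_records` is an assumption.

```python
from typing import Dict, Iterator


def offset_requests(
    page_size: int = 1000,
    start_offset: int = 0,
    max_pages: int = 100,
    max_records: int = 100_000,
    offset_param: str = "offset",
    limit_param: str = "limit",
) -> Iterator[Dict[str, int]]:
    """Yield query parameters per page until max_pages or max_records is reached."""
    requested = 0
    for _ in range(max_pages):
        if requested >= max_records:
            break
        # Assumption: shrink the final page so the total never exceeds max_records.
        limit = min(page_size, max_records - requested)
        yield {offset_param: start_offset + requested, limit_param: limit}
        requested += limit


params = list(offset_requests())
```

A caller would normally also stop as soon as a page returns fewer records than the requested limit, since that signals the end of the data.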
Keep credentials in Databricks Secrets or another supported secret source. Secret values are redacted from result payloads, source metadata and persisted control tables.
request:
  url: https://api.example.com/v1/customers
  headers:
    Authorization: "Bearer {{ secret:crm-api/token }}"
    x-api-key: "{{ secret:crm-api/key }}"
  query:
    tenant: "{{ secret:crm-api/tenant }}"
Use the source metadata fields in control tables for diagnostics. They are designed to store redacted options and request details.
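As an illustration of what redacted request metadata might look like, the sketch below masks every `{{ secret:... }}` reference in a request definition before it is logged or stored. The `redact` helper, the `***` mask and the regular expression are assumptions made for this example; the actual redaction format used in the control tables may differ.

```python
import re
from typing import Any

# Matches placeholders such as "{{ secret:crm-api/token }}".
SECRET_REF = re.compile(r"\{\{\s*secret:[^}]+\}\}")


def redact(value: Any) -> Any:
    """Return a copy of value with every secret reference replaced by a mask."""
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return SECRET_REF.sub("***", value)
    return value


request = {
    "url": "https://api.example.com/v1/customers",
    "headers": {
        "Authorization": "Bearer {{ secret:crm-api/token }}",
        "x-api-key": "{{ secret:crm-api/key }}",
    },
    "query": {"tenant": "{{ secret:crm-api/tenant }}"},
}
redacted = redact(request)
```

Redacting the unresolved template, rather than the resolved request, means the secret value never has to be present in the object that gets persisted.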
REST extraction is intentionally bounded. Do not leave API jobs open-ended.
| Limit | Why it matters |
|---|---|
| timeout_seconds | Prevents hung requests from blocking a job indefinitely. |
| retry_attempts | Handles transient API failures without hiding persistent errors. |
| retry_backoff_seconds | Reduces pressure on rate-limited APIs. |
| max_pages | Protects the driver from unbounded pagination loops. |
| max_records | Caps the materialized record set. |
| max_page_bytes | Prevents unexpectedly large response pages. |
If the API supports a time filter, combine ContractForge watermark state with request query parameters. Keep the API's filter semantics explicit and test late-arriving data behavior.
source:
  type: connector
  connector: rest_api
  request:
    url: https://api.example.com/v1/orders
    query:
      updated_after: "{{ watermark:updated_at }}"
  response:
    records_path: $.orders
  incremental:
    watermark_column: updated_at
watermark_columns: [updated_at]
mode: scd1_upsert
merge_keys: [order_id]
transform:
  deduplicate:
    keys: [order_id]
    order_by: updated_at DESC NULLS LAST
Watermark-based APIs assume that the source update column is suitable for incremental extraction. If the API can mutate old records without updating the watermark column, use periodic reconciliation or snapshot patterns.
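The deduplication rule (`order_by: updated_at DESC NULLS LAST`, keep the first row per `order_id`) and the watermark advance can be sketched in plain Python. `latest_per_key` and `next_watermark` are hypothetical helpers and the records are made up. The sketch assumes `updated_at` values are ISO-8601 strings in one consistent format and time zone, so string comparison matches chronological order.

```python
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def _newer(candidate: Optional[str], current: Optional[str]) -> bool:
    """True when candidate should replace current; None always loses (NULLS LAST)."""
    if candidate is None:
        return False
    return current is None or candidate > current


def latest_per_key(
    records: Iterable[Record], key: str = "order_id", order_by: str = "updated_at"
) -> List[Record]:
    """Keep one row per key: the one with the greatest order_by value."""
    latest: Dict[Any, Record] = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or _newer(record.get(order_by), current.get(order_by)):
            latest[record[key]] = record
    return list(latest.values())


def next_watermark(
    records: Iterable[Record], previous: Optional[str], column: str = "updated_at"
) -> Optional[str]:
    """Advance the watermark to the greatest value seen; never move it backwards."""
    values = [r[column] for r in records if r.get(column) is not None]
    if previous is not None:
        values.append(previous)
    return max(values) if values else None


batch = [
    {"order_id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"order_id": 1, "updated_at": "2024-01-03T00:00:00Z"},
    {"order_id": 2, "updated_at": None},
    {"order_id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]
deduplicated = latest_per_key(batch)
watermark = next_watermark(batch, previous="2024-01-02T12:00:00Z")
```

Note that a record updated at the source without a change to `updated_at` never enters `batch` at all, which is why the reconciliation or snapshot fallback described above is still needed.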
Each REST run records the connector, URL, redacted request metadata, pagination details and source metrics for troubleshooting.
SELECT
  run_id,
  source_connector,
  source_request_redacted_json,
  source_pagination_redacted_json,
  source_response_redacted_json,
  source_limits_redacted_json,
  source_metrics_json
FROM main.ops.ctrl_ingestion_runs
WHERE source_connector = 'rest_api'
ORDER BY started_at_utc DESC;
| Symptom | Likely cause | Action |
|---|---|---|
| DNS or connection timeout | Runtime cannot reach the API endpoint. | Fix workspace network, egress, allowlist, proxy or private endpoint. |
| 401 or 403 | Invalid token, expired secret or missing scope. | Verify secret value and API permissions. |
| Payload too large | Endpoint returned more data than expected. | Use pagination, filters or land files externally. |
| Incorrect rows after parsing | Payload was nested and flattened in notebook code. | Move structure handling into transform.shape with explicit schema. |