
/crawl - Crawl web content

The /crawl endpoint scrapes content from a starting URL and follows links across the site, up to a configurable depth or page limit. Responses can be returned as HTML, Markdown, or JSON.

Endpoint

https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl

Required fields

  • url (string)

Refer to optional parameters for additional customization options.

Common use cases

  • Building knowledge bases or training AI systems (such as RAG applications) with up-to-date web content
  • Scraping and analyzing content across multiple pages for research, summarization, or monitoring

How it works

There are two steps to using the /crawl endpoint:

  1. Initiate the crawl job — Send a POST request with the starting URL; the response includes a job id.
  2. Request results of the crawl job — Send a GET request with the job id to check the status or retrieve the results of the crawl.

Crawl jobs have a maximum run time of seven days. If a job does not finish within this time, it will be cancelled due to timeout. Job results are available for 14 days after the job completes, after which the job data is deleted.

Initiate the crawl job

Send a POST request with a url to start a crawl job. The API responds immediately with a job id you will use to retrieve results. Refer to optional parameters for additional customization options.

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://developers.cloudflare.com/workers/"
  }'

Example response:

{
  "success": true,
  "result": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
}
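
The same request in JavaScript, as a minimal sketch (the startCrawl helper is illustrative, not part of an official SDK):

JavaScript
async function startCrawl(accountId, apiToken, url) {
  // POST the starting URL; the API responds immediately with a job id.
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url }),
    },
  );
  const data = await response.json();
  if (!data.success) {
    throw new Error(`Failed to start crawl: ${JSON.stringify(data)}`);
  }
  return data.result; // the job id
}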

Request results of the crawl job

To check the status or request the results of your crawl job, use the job id you received:

Terminal window
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'

The response includes a status field indicating the current state of the crawl job. The possible job statuses are:

  • running — The crawl job is currently in progress.
  • cancelled_due_to_timeout — The crawl job exceeded the maximum run time of seven days.
  • cancelled_due_to_limits — The crawl job was cancelled because it hit account limits.
  • cancelled_by_user — The crawl job was manually cancelled by the user.
  • errored — The crawl job encountered an error.
  • completed — The crawl job finished successfully.

Polling for completion

Since crawl jobs run asynchronously, you can poll the endpoint periodically to check when the job finishes. Add ?limit=1 to the request URL so the response stays lightweight — you only need the job status, not the full set of crawled records.

JavaScript
async function waitForCrawl(accountId, jobId, apiToken) {
  // Poll every 5 seconds, for up to 5 minutes.
  const maxAttempts = 60;
  const delayMs = 5000;
  for (let i = 0; i < maxAttempts; i++) {
    const response = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}?limit=1`,
      {
        headers: {
          Authorization: `Bearer ${apiToken}`,
        },
      },
    );
    const data = await response.json();
    const status = data.result.status;
    // Any status other than "running" is terminal.
    if (status !== "running") {
      return data.result;
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("Crawl job did not complete within timeout");
}
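
Putting the two together (using the hypothetical startCrawl sketch from the previous section):

JavaScript
// Assumes accountId and apiToken are defined, and top-level await is available.
const jobId = await startCrawl(accountId, apiToken, "https://developers.cloudflare.com/workers/");
const result = await waitForCrawl(accountId, jobId, apiToken);
console.log(`Job ${jobId} finished with status ${result.status}`);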

Once the job reaches a terminal status, fetch the full results without the limit parameter. You can also use the following query parameters to filter and paginate results:

  • cursor — Cursor for pagination. If the response exceeds 10 MB, a cursor value will be included. Pass it as a query parameter to retrieve the next page of results.
  • limit — Maximum number of records to return.
  • status — Filter by URL status: queued, completed, disallowed, skipped, errored, or cancelled.

Example with query parameters:

Terminal window
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e?cursor=10&limit=10&status=completed' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'

Example response:

{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e",
    "status": "completed",
    "browserSecondsUsed": 134.7,
    "total": 50,
    "finished": 50,
    "records": [
      {
        "url": "https://developers.cloudflare.com/workers/",
        "status": "completed",
        "markdown": "# Cloudflare Workers\nBuild and deploy serverless applications...",
        "metadata": {
          "status": 200,
          "title": "Cloudflare Workers · Cloudflare Workers docs",
          "url": "https://developers.cloudflare.com/workers/"
        }
      },
      {
        "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/",
        "status": "completed",
        "markdown": "## Quickstarts\nGet up and running with a simple Hello World...",
        "metadata": {
          "status": 200,
          "title": "Quickstarts · Cloudflare Workers docs",
          "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/"
        }
      }
      // ... 48 more entries omitted for brevity
    ],
    "cursor": 10
  },
  "success": true
}
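
To retrieve every record from a large job, follow the cursor until no further pages remain. A minimal sketch, assuming the cursor field is absent once the last page has been returned:

JavaScript
async function fetchAllRecords(accountId, jobId, apiToken) {
  const records = [];
  let cursor;
  do {
    const url = new URL(
      `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}`,
    );
    if (cursor !== undefined) url.searchParams.set("cursor", cursor);
    const response = await fetch(url, {
      headers: { Authorization: `Bearer ${apiToken}` },
    });
    const data = await response.json();
    records.push(...data.result.records);
    cursor = data.result.cursor; // assumption: omitted when no pages remain
  } while (cursor !== undefined && cursor !== null);
  return records;
}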

Cancel a crawl job

To cancel a crawl job that is currently in progress, use the job id you received:

Terminal window
curl -X DELETE 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'

A successful cancellation returns a 200 OK status code. The job status is updated to cancelled, and any URLs still queued for crawling are cancelled.

Optional parameters

The following optional parameters can be used in your crawl request, in addition to the required url parameter. For the full list, refer to the API docs.

  • limit (number) — Maximum number of pages to crawl (default is 10, maximum is 100,000).
  • depth (number) — Maximum link depth to crawl from the starting URL (default is 100,000, maximum is 100,000).
  • source (string) — Source for discovering URLs. Options are all, sitemaps, or links. Default is all.
  • formats (array of strings) — Response format (default is HTML; other options are Markdown and JSON). The JSON format leverages Workers AI by default for data extraction, which incurs usage on Workers AI. Refer to the /json endpoint to learn more, including how to use a custom model and fallbacks.
  • render (boolean) — If false, does a fast HTML fetch without executing JavaScript (default is true; learn more about render).
  • jsonOptions (object) — Only required if formats includes json. Contains prompt, response_format, and custom_ai properties (same types as the /json endpoint).
  • maxAge (number) — Maximum length of time in seconds the crawler can use a cached resource before it must re-fetch it from the origin server (default is 86,400, maximum is 604,800). Cache is served from R2 only if the URL and parameters exactly match.
  • modifiedSince (number) — Unix timestamp (in seconds) indicating to only crawl pages that were modified since this time.
  • options.includeExternalLinks (boolean) — If true, follows links to external domains (default is false).
  • options.includeSubdomains (boolean) — If true, follows links to subdomains of the starting URL (default is false).
  • options.includePatterns (array of strings) — Only visits URLs that match one of these wildcard patterns. Use * to match any characters except /, or ** to match any characters including /.
  • options.excludePatterns (array of strings) — Does not visit URLs that match any of these wildcard patterns. Use * to match any characters except /, or ** to match any characters including /.

Pattern behavior

excludePatterns takes precedence over includePatterns: if a URL matches an exclude rule, it is skipped, regardless of whether it also matches an include rule.

  • No rules — Everything is indexed.
  • Exclude only — Everything is indexed except items matching the exclude patterns.
  • Include only — Only items matching the include patterns are indexed; everything else is ignored.
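
The following sketch makes the precedence concrete. It is a local illustration of the documented matching rules (* matches any characters except /, ** matches any characters including /), not the crawler's actual implementation:

JavaScript
// Convert a wildcard pattern to a RegExp. "**" is replaced first (via a
// placeholder) so the remaining single "*" occurrences do not clobber it.
function patternToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const regex = escaped
    .replace(/\*\*/g, "\u0000")
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${regex}$`);
}

// Exclude rules win: a URL matching any exclude pattern is skipped even if
// it also matches an include pattern.
function shouldVisit(url, includePatterns = [], excludePatterns = []) {
  if (excludePatterns.some((p) => patternToRegExp(p).test(url))) return false;
  if (includePatterns.length === 0) return true; // no include rules: allow all
  return includePatterns.some((p) => patternToRegExp(p).test(url));
}

// Excluded despite matching the include pattern:
shouldVisit(
  "https://example.com/docs/changelog/2024/",
  ["https://example.com/docs/**"],
  ["https://example.com/docs/changelog/**"],
); // false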

Viewing skipped URLs

To view URLs that were discovered but skipped, query the crawl job results with status=skipped. URLs can be skipped due to includeExternalLinks, includeSubdomains, includePatterns/excludePatterns, or the modifiedSince parameter. Skipped URLs will also be visible in the dashboard in a future release.

Terminal window
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}?status=skipped' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'

render parameter

If you use render: true, which is the default, the crawl endpoint spins up a headless browser and executes page JavaScript. If you use render: false, the crawl endpoint does a fast HTML fetch without executing JavaScript.

Use render: true when the page builds content in the browser. Use render: false when the content you need is already in the initial HTML response.

Crawls that use render: true use a headless browser and are billed under typical Browser Rendering pricing. Crawls that use render: false run on Workers instead of a headless browser. During the beta, render: false crawls are not billed. After the beta, they will be billed under Workers pricing.

Example with all optional parameters

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://www.exampledocs.com/docs/",
    "limit": 50,
    "depth": 2,
    "formats": ["markdown"],
    "render": false,
    "maxAge": 7200,
    "modifiedSince": 1704067200,
    "source": "all",
    "options": {
      "includeExternalLinks": true,
      "includeSubdomains": true,
      "includePatterns": ["**/api/v1/*"],
      "excludePatterns": ["*/learning-paths/*"]
    }
  }'

Advanced usage

Documentation site crawl

Crawl only documentation pages and exclude specific sections:

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/docs",
    "limit": 200,
    "depth": 5,
    "formats": ["markdown"],
    "options": {
      "includePatterns": ["https://example.com/docs/**"],
      "excludePatterns": [
        "https://example.com/docs/changelog/**",
        "https://example.com/docs/archive/**"
      ]
    }
  }'

Product catalog extraction with AI

Extract structured product data using the json format. This leverages Workers AI by default.

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://shop.example.com/products",
    "limit": 50,
    "formats": ["json"],
    "jsonOptions": {
      "prompt": "Extract product name, price, description, and availability",
      "response_format": {
        "type": "json_schema",
        "json_schema": {
          "name": "product",
          "properties": {
            "name": "string",
            "price": "number",
            "currency": "string",
            "description": "string",
            "inStock": "boolean"
          }
        }
      }
    },
    "options": {
      "includePatterns": ["https://shop.example.com/products/*"]
    }
  }'

Fast static content fetch

Fetch static HTML without rendering for faster crawling of static sites:

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "render": false,
    "formats": ["html", "markdown"]
  }'

Crawl with authentication

Crawl pages behind HTTP authentication or with custom headers:

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://secure.example.com",
    "limit": 50,
    "authenticate": {
      "username": "user",
      "password": "pass"
    }
  }'

You can also use cookies or custom headers for token-based authentication:

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://api.example.com/docs",
    "limit": 100,
    "setExtraHTTPHeaders": {
      "X-API-Key": "your-api-key"
    }
  }'

Wait for dynamic content

Crawl single-page applications that load content dynamically:

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://app.example.com",
    "limit": 50,
    "gotoOptions": {
      "waitUntil": "networkidle2",
      "timeout": 60000
    },
    "waitForSelector": {
      "selector": "[data-content-loaded]",
      "timeout": 30000,
      "visible": true
    }
  }'

Block unnecessary resources

Speed up crawling by blocking images and media:

Terminal window
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "rejectResourceTypes": ["image", "media", "font", "stylesheet"]
  }'

Crawler behavior

How the crawler discovers URLs

The crawler discovers and processes URLs in the following order (when using source: all, the default):

  1. Starting URL — The URL specified in your request.
  2. Sitemap links — URLs found in the site's sitemap.
  3. Page links — Links scraped from pages, if not already found in the sitemap.

Use the source parameter to customize which sources the crawler uses. The available options are:

  • all — Uses both sitemaps and page links (default).
  • sitemaps — Only crawls URLs found in the site's sitemap.
  • links — Only crawls links found on pages, ignoring sitemaps.
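
For example, to restrict discovery to the sitemap only (a sketch in JavaScript; assumes accountId and apiToken are defined and top-level await is available):

JavaScript
// Start a crawl that only visits URLs listed in the site's sitemap.
const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url: "https://example.com/",
      source: "sitemaps",
    }),
  },
);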

robots.txt and bot protection

The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed". For guidance on configuring robots.txt and sitemaps for sites you plan to crawl, refer to robots.txt and sitemaps.

Set a custom user agent

You can change the user agent at the page level by passing userAgent as a top-level parameter in the JSON body. This is useful if the target website serves different content based on the user agent.

The /crawl endpoint uses CloudflareBrowserRenderingCrawler/1.0 as its default User-Agent, which is different from the other REST API endpoints. For a full list of default User-Agent strings, refer to Automatic request headers.
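
For example (a sketch; the user agent string itself is illustrative):

JavaScript
// Crawl with a custom user agent for sites that vary content by UA.
const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url: "https://example.com/",
      userAgent: "Mozilla/5.0 (compatible; MyCrawler/1.0)",
    }),
  },
);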

Troubleshooting

Crawl job returns no results or all URLs are skipped

If your crawl job completes but returns an empty records array, or all URLs show skipped or disallowed status:

  • robots.txt blocking — The crawler respects robots.txt rules. The /crawl endpoint identifies itself as CloudflareBrowserRenderingCrawler/1.0. Check the target site's robots.txt file to verify this user agent is allowed. Blocked URLs appear with "status": "disallowed".
  • Pattern filters too restrictive — Your includePatterns may not match any URLs on the site. Try crawling without patterns first to confirm URLs are discoverable, then add patterns.
  • No links found — The starting URL may not contain links. Try using source: "sitemaps", increasing the depth parameter, or setting includeSubdomains or includeExternalLinks to true.

Crawl job takes too long

If a crawl job remains in running status for an extended period:

  • Slow page loads — Pages with heavy JavaScript take longer to render. Use render: false if the content you need is in the initial HTML.
  • Rate limiting — Sites with strict rate limits slow crawling. The crawler respects robots.txt Crawl-delay and implements backoff. Reduce limit and run multiple smaller crawls.
  • Unnecessary resources — Block resources that are not needed for content extraction using rejectResourceTypes (for example, image, media, font).

Crawl job cancelled due to limits

A cancelled_due_to_limits status means your account hit its browser time limit. Workers Free plan accounts are capped at 10 minutes of browser use per day. To resolve this:

  • Upgrade to a Workers Paid plan for higher limits.
  • Use render: false for static content to avoid consuming browser time.
  • Increase maxAge to use cached results where possible.
  • Reduce the limit parameter.

JSON extraction errors

If the json format returns null or empty results:

  • Provide a clear prompt — Be specific about what data to extract and where it appears on the page (for example, "Extract the product name, price, and description from the main product section").
  • Define a response schema — Use response_format with a JSON schema to enforce the expected output structure.
  • Use a custom model — If the default Workers AI model does not produce the desired results, use the custom_ai parameter to specify a different model. Refer to Using a custom model (BYO API Key) for details.

If you have questions or encounter other errors, refer to the Browser Rendering FAQ and troubleshooting guide.
