
The crawl action type is navigation-first: start from known URLs (or URLs produced by an earlier step) and follow links within configured depth and breadth limits. Use it when discovery requires systematically expanding a site or subgraph, rather than a single hop into a fixed list of pages.
All examples use the TITAN_API_URL and TITAN_TOKEN environment variables; Rust examples use the ureq and serde_json crates.

When to use crawl

  • You know entry URLs but not every target page in advance.
  • Layout changes often, but internal linking is stable enough to traverse.
  • You will hand off URLs to scrape for schema-shaped extraction, or combine crawl with search upstream.

Inputs and wiring

  • static_urls — crawl starts from URLs on the task.
  • previous_step — crawl consumes URLs emitted by search or another step.
  • task_url_inventory — crawl is driven from the URL inventory when your Titan environment supports that flow.
Always set limits (depth, max pages, domain rules, or whatever else your template and script require) so runs stay bounded and predictable.

Single-action example (POST /api/v1/tasks)

Field names under limits are template-specific; align them with the script you bind.
curl -sS -X POST "$TITAN_API_URL/api/v1/tasks" \
  -H "Authorization: Bearer $TITAN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Vendor docs crawl",
    "objective": "Enumerate documentation pages under https://docs.vendor.example up to depth 2",
    "execution_type": "single",
    "urls": ["https://docs.vendor.example/"],
    "action_type": "crawl",
    "input_source": "static_urls",
    "template_slug": "docs-site-crawl",
    "limits": {
      "max_depth": 2,
      "max_pages": 200
    }
  }'

Chained example (crawl → scrape)

In an execution_plan, keep each step self-contained. Crawl discovers URLs; scrape reads them into records:
{
  "name": "Crawl vendor docs then extract API tables",
  "objective": "Walk the docs tree, then scrape each page for REST tables",
  "execution_type": "single",
  "urls": ["https://docs.vendor.example/"],
  "execution_plan": {
    "steps": [
      {
        "step_id": "walk",
        "action_type": "crawl",
        "input_source": "static_urls",
        "template_slug": "docs-crawl",
        "limits": { "max_depth": 2, "max_pages": 150 }
      },
      {
        "step_id": "extract",
        "action_type": "scrape",
        "input_source": "previous_step",
        "template_slug": "docs-api-table-scrape",
        "output_schema": {
          "type": "object",
          "properties": {
            "endpoint": { "type": "string" },
            "method": { "type": "string" }
          }
        }
      }
    ]
  }
}

Run

curl -sS -X POST "$TITAN_API_URL/api/v1/tasks/$TASK_ID/run" \
  -H "Authorization: Bearer $TITAN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{}'