crawlshot/API_DOCUMENTATION.md
2025-08-11 02:35:35 +08:00

Crawlshot API Documentation

Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. Its API provides endpoints for extracting HTML and capturing screenshots, with ad, tracker, and cookie banner filtering, webhook notifications, and automatic webhook retries.

Overview

Core Capabilities:

  • HTML Crawling: Extract clean HTML content from web pages with ad/tracker blocking
  • Screenshot Capture: Generate high-quality WebP screenshots with adjustable quality settings
  • Webhook Notifications: Real-time status updates with event filtering and progressive retry
  • Background Processing: Asynchronous job processing via Laravel Horizon
  • Smart Filtering: EasyList integration for ad/tracker/cookie banner blocking
  • Auto-cleanup: 24-hour file retention with automated cleanup

Typical use cases:

  • Content extraction and monitoring
  • Website screenshot automation
  • Quality assurance and testing
  • Social media preview generation
  • Compliance and archival systems

Base URL

https://crawlshot.test

Replace crawlshot.test with your actual Crawlshot service URL.

Quick Start

1. Authentication

All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.

Authentication Header:

Authorization: Bearer {your-api-token}

Example API Token:

1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
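
As a rough illustration of the header above, the following Python sketch builds (but does not send) an authenticated request using only the standard library. The helper name `build_request` and the `BASE_URL` constant are assumptions made for this example and are not part of Crawlshot.

```python
import json
import urllib.request

BASE_URL = "https://crawlshot.test"  # placeholder host used throughout this document


def build_request(path, token, body=None, method=None):
    """Build an authenticated request object without sending it (hypothetical helper)."""
    headers = {
        "Accept": "application/json",
        "Authorization": f"Bearer {token}",
    }
    data = None
    if body is not None:
        # JSON bodies need an explicit content type
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    # With method=None, urllib picks POST when a body is present, otherwise GET
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


req = build_request("/api/crawl", "YOUR_TOKEN", {"url": "https://example.com"})
print(req.get_method(), req.full_url)
```

Sending is a separate step, for example `urllib.request.urlopen(req, timeout=10)`, which keeps request construction easy to test offline.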

2. Your First API Call

Simple HTML Crawl:

curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "message": "Crawl job initiated successfully"
}

3. Check Job Status

curl -H "Authorization: Bearer YOUR_TOKEN" \
  "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb"

Health Check

GET /api/health

Check if the Crawlshot service is running and healthy.

Authentication: Not required

Request Example

curl -X GET "https://crawlshot.test/api/health" \
  -H "Accept: application/json"

Response Example

{
  "status": "healthy",
  "timestamp": "2025-08-10T09:54:52.195383Z",
  "service": "crawlshot"
}

Web Crawling APIs

POST /api/crawl

Initiate a web crawling job to extract HTML content from a URL.

Authentication: Required

Request Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Target URL to crawl (max 2048 chars) |
| timeout | integer | No | 30 | Request timeout in seconds (5-300) |
| delay | integer | No | 0 | Wait time before capture in milliseconds (0-30000) |
| block_ads | boolean | No | true | Block ads using EasyList filters |
| block_cookie_banners | boolean | No | true | Block cookie consent banners |
| block_trackers | boolean | No | true | Block tracking scripts |
| webhook_url | string | No | null | URL to receive job status webhooks (max 2048 chars) |
| webhook_events_filter | array | No | ["queued","processing","completed","failed"] | Which job statuses trigger webhooks. Empty array [] disables webhooks |
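
The documented ranges can be checked on the client before a request is sent. The sketch below is one possible pre-flight check; `check_crawl_body` is a hypothetical helper, its messages are invented for illustration, and the server's own validation remains authoritative.

```python
ALLOWED_EVENTS = {"queued", "processing", "completed", "failed"}


def check_crawl_body(body):
    """Return {field: [messages]} for problems found in a POST /api/crawl body."""
    errors = {}

    def add(field, message):
        errors.setdefault(field, []).append(message)

    url = body.get("url")
    if not isinstance(url, str) or not url:
        add("url", "url is required")
    elif len(url) > 2048:
        add("url", "url must be at most 2048 characters")

    timeout = body.get("timeout", 30)  # documented default
    if not isinstance(timeout, int) or not 5 <= timeout <= 300:
        add("timeout", "timeout must be between 5 and 300 seconds")

    delay = body.get("delay", 0)  # documented default
    if not isinstance(delay, int) or not 0 <= delay <= 30000:
        add("delay", "delay must be between 0 and 30000 milliseconds")

    unknown = sorted(set(body.get("webhook_events_filter", [])) - ALLOWED_EVENTS)
    if unknown:
        add("webhook_events_filter", "unknown events: " + ", ".join(unknown))
    return errors


print(check_crawl_body({"url": "https://example.com", "timeout": 60}))  # {}
```

Returning a field-to-messages mapping mirrors the general shape of the 422 response, so client-side and server-side problems can be displayed the same way.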

Request Examples

Basic Crawl:

curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true
  }'

With Webhook Notifications:

curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "webhook_url": "https://myapp.com/webhooks/crawlshot",
    "webhook_events_filter": ["completed", "failed"],
    "block_ads": true,
    "timeout": 60
  }'

Response Example

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "message": "Crawl job initiated successfully"
}

GET /api/crawl/{uuid}

Check the status and retrieve results of a crawl job.

Authentication: Required

Path Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| uuid | string | Yes | Job UUID returned from crawl initiation |

Request Example

curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Examples

Queued Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z"
}

Processing Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z"
}

Completed Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n</head>\n<body>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}

Failed Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:00:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
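
When webhooks are not used, a client typically polls this endpoint until the job reaches completed or failed. The following is a minimal polling sketch; `wait_for_job` and its parameters are assumptions for illustration, and the HTTP call is injected as `fetch_status` so the loop can be exercised without a running service.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_status, uuid, interval=2.0, max_wait=300.0, sleep=time.sleep):
    """Poll until the job is completed or failed, then return its payload.

    fetch_status(uuid) must return the decoded JSON body of the status endpoint.
    """
    waited = 0.0
    while True:
        job = fetch_status(uuid)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= max_wait:
            raise TimeoutError(f"job {uuid} still {job['status']} after {max_wait}s")
        sleep(interval)
        waited += interval


# Offline demonstration with a scripted sequence of responses
scripted = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "result": "<!doctype html>..."},
])
job = wait_for_job(lambda _uuid: next(scripted), "demo-uuid", sleep=lambda _s: None)
print(job["status"])  # completed
```

A failed job is returned rather than raised, so the caller can inspect the `error` field shown in the Failed Status example.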

GET /api/crawl

List all crawl jobs with pagination (optional endpoint for debugging).

Authentication: Required

Request Example

curl -X GET "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:00:42.000000Z",
      "completed_at": "2025-08-10T10:01:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 100,
    "per_page": 20
  }
}
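
The pagination block makes it possible to walk every page of a listing. This sketch assumes the caller supplies `fetch_page`, a hypothetical function that returns the decoded listing for a given page number. How a page is selected (for example a `page` query parameter) is not documented here, so that detail is deliberately left outside the example.

```python
def iter_jobs(fetch_page):
    """Yield every job from a paginated listing, page by page."""
    page = 1
    while True:
        listing = fetch_page(page)
        yield from listing["jobs"]
        # Stop once the last page reported by the server has been consumed
        if page >= listing["pagination"]["total_pages"]:
            break
        page += 1


# Offline demonstration with two fake pages
fake_pages = {
    1: {"jobs": [{"uuid": "a"}, {"uuid": "b"}], "pagination": {"current_page": 1, "total_pages": 2}},
    2: {"jobs": [{"uuid": "c"}], "pagination": {"current_page": 2, "total_pages": 2}},
}
print([job["uuid"] for job in iter_jobs(fake_pages.__getitem__)])  # ['a', 'b', 'c']
```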

Screenshot APIs

POST /api/shot

Initiate a screenshot job to capture an image of a webpage.

Authentication: Required

Request Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Target URL to screenshot (max 2048 chars) |
| viewport_width | integer | No | 1920 | Viewport width in pixels (320-3840) |
| viewport_height | integer | No | 1080 | Viewport height in pixels (240-2160) |
| quality | integer | No | 90 | Image quality 1-100 (always WebP format) |
| timeout | integer | No | 30 | Request timeout in seconds (5-300) |
| delay | integer | No | 0 | Wait time before capture in milliseconds (0-30000) |
| block_ads | boolean | No | true | Block ads using EasyList filters |
| block_cookie_banners | boolean | No | true | Block cookie consent banners |
| block_trackers | boolean | No | true | Block tracking scripts |
| webhook_url | string | No | null | URL to receive job status webhooks (max 2048 chars) |
| webhook_events_filter | array | No | ["queued","processing","completed","failed"] | Which job statuses trigger webhooks. Empty array [] disables webhooks |
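
The numeric limits for screenshot jobs lend themselves to a table-driven check. The sketch below assembles a request body and rejects out-of-range values early; `build_shot_payload` and `SHOT_RANGES` are hypothetical names, and only the limits themselves come from the parameter list above.

```python
SHOT_RANGES = {
    "viewport_width": (320, 3840),
    "viewport_height": (240, 2160),
    "quality": (1, 100),
    "timeout": (5, 300),
    "delay": (0, 30000),
}


def build_shot_payload(url, **options):
    """Return a POST /api/shot body, raising ValueError on out-of-range values."""
    if not url or len(url) > 2048:
        raise ValueError("url is required and limited to 2048 characters")
    for name, value in options.items():
        if name in SHOT_RANGES:
            low, high = SHOT_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    # Options left out are filled in by the server's documented defaults
    return {"url": url, **options}


print(build_shot_payload("https://example.com", viewport_width=1200, viewport_height=800, quality=85))
```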

Request Examples

Basic Screenshot:

curl -X POST "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "viewport_width": 1920,
    "viewport_height": 1080,
    "quality": 90,
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true
  }'

With Webhook Notifications:

curl -X POST "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "webhook_url": "https://myapp.com/webhooks/crawlshot",
    "webhook_events_filter": ["completed"],
    "viewport_width": 1200,
    "viewport_height": 800,
    "quality": 85
  }'

Response Example

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "message": "Screenshot job initiated successfully"
}

GET /api/shot/{uuid}

Check the status and retrieve results of a screenshot job. When completed, returns base64 image data and download URL.

Authentication: Required

Path Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| uuid | string | Yes | Job UUID returned from screenshot initiation |

Request Example

curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Examples

Queued Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z"
}

Processing Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z"
}

Completed Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}

Failed Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:05:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
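
For a completed job, the base64 image data can be decoded locally instead of calling the download endpoint. The sketch below assumes `image_data` holds standard base64 of the raw file bytes, which this document implies but does not state outright. The helper names are hypothetical. The WebP check relies on the RIFF container layout: the bytes `RIFF`, a 4-byte size, then `WEBP`.

```python
import base64


def decode_screenshot(result):
    """Decode the image_data field of a completed screenshot result into bytes."""
    return base64.b64decode(result["image_data"])


def looks_like_webp(data):
    """Check the 12-byte RIFF/WEBP container header."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"


# Offline demonstration with a synthetic 12-byte header rather than a real image
header = b"RIFF" + (4).to_bytes(4, "little") + b"WEBP"
fake_result = {"image_data": base64.b64encode(header).decode("ascii")}
image_bytes = decode_screenshot(fake_result)
print(looks_like_webp(image_bytes))  # True
```

Writing the decoded bytes to disk is then a matter of `open("screenshot.webp", "wb").write(image_bytes)`.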

GET /api/shot/{uuid}/download

Download the screenshot file directly. Returns the actual image file with appropriate headers.

Authentication: Required

Path Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| uuid | string | Yes | Job UUID of a completed screenshot job |

Request Example

curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  --output screenshot.webp

Response

Returns the WebP image file directly with appropriate headers:

  • Content-Type: image/webp

GET /api/shot

List all screenshot jobs with pagination (optional endpoint for debugging).

Authentication: Required

Request Example

curl -X GET "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "jobs": [
    {
      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
      "type": "shot",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:05:42.000000Z",
      "completed_at": "2025-08-10T10:06:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 3,
    "total_items": 50,
    "per_page": 20
  }
}

Webhook System

Crawlshot supports real-time webhook notifications to keep your application informed about job status changes without constant polling.

How Webhooks Work

  1. Configure Webhook: Include webhook_url when creating jobs
  2. Filter Events: Use webhook_events_filter to specify which status changes trigger webhooks
  3. Receive Notifications: Your endpoint receives HTTP POST requests with job status data
  4. Automatic Retries: Failed webhooks are automatically retried with progressive backoff

Event Filtering

Control which job status changes trigger webhook calls:

{
  "webhook_events_filter": ["completed", "failed"]
}

Available Events:

  • queued - Job created and queued for processing
  • processing - Job started processing
  • completed - Job finished successfully
  • failed - Job encountered an error

Special Behaviors:

  • Default: ["queued", "processing", "completed", "failed"] (all events)
  • Disable: [] (empty array disables webhooks entirely)
  • Omitted: Same as default (all events)

Webhook Payload

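Webhooks are described as sending the same payload as the status endpoints (GET /api/crawl/{uuid} or GET /api/shot/{uuid}). The examples below, however, nest the output under result.html and result.image, while the status endpoint examples return result as a string (crawl) or with image_data and download_url keys (screenshot). Inspect a real delivery before depending on either shape.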

Crawl Webhook Example:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z", 
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": {
    "html": {
      "url": "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb.html",
      "raw": "<!doctype html>\n<html>..."
    }
  }
}

Screenshot Webhook Example:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image": {
      "url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851.webp",
      "raw": "iVBORw0KGgoAAAANSUhEUgAAAHg..."
    },
    "mime_type": "image/webp",
    "format": "webp", 
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}

Progressive Retry System

Failed webhook deliveries are automatically retried with exponential backoff. Each delay is measured from the previous failed attempt and doubles every time:

  • 1st retry: 1 minute after the initial failure
  • 2nd retry: 2 minutes after the 1st retry fails
  • 3rd retry: 4 minutes after the 2nd retry fails
  • 4th retry: 8 minutes after the 3rd retry fails
  • 5th retry: 16 minutes after the 4th retry fails
  • 6th retry: 32 minutes after the 5th retry fails
  • After the 6th retry fails: Stops retrying, webhook marked as failed

Total retry window: ~63 minutes (1+2+4+8+16+32)
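
The schedule follows a simple doubling rule, which the sketch below reproduces along with the cumulative offsets from the first failure. The function name is hypothetical. The arithmetic only restates the list above and ignores the time each delivery attempt itself takes.

```python
from itertools import accumulate


def retry_delays_minutes(max_retries=6, first_delay=1):
    """Delay before each retry, doubling from first_delay."""
    return [first_delay * 2 ** n for n in range(max_retries)]


delays = retry_delays_minutes()
offsets = list(accumulate(delays))  # minutes elapsed since the first failure
print(delays)   # [1, 2, 4, 8, 16, 32]
print(offsets)  # [1, 3, 7, 15, 31, 63]
```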

Webhook Requirements

Your webhook endpoint should:

  • Accept HTTP POST requests
  • Return HTTP 2xx status codes for successful processing
  • Respond within 5 seconds (webhook timeout)
  • Handle duplicate deliveries gracefully (use job UUID for idempotency)

Example webhook handler (PHP):

<?php

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/webhooks/crawlshot', function (Request $request) {
    $jobData = $request->all();
    $status = $jobData['status'] ?? null;

    // Process the job status update
    if ($status === 'completed') {
        // Handle successful completion
        $result = $jobData['result'] ?? null;
    } elseif ($status === 'failed') {
        // Handle failure
        $error = $jobData['error'] ?? null;
    }

    // A 2xx response marks the delivery as successful
    return response('OK', 200);
});
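
Because retries can deliver the same notification more than once, a handler benefits from a deduplication step. The sketch below keys on the pair of job UUID and status, since one job can legitimately produce several events; that choice and the class name are assumptions for illustration. The in-memory set would need to be replaced by persistent storage in a real deployment.

```python
class WebhookDeduplicator:
    """Process each (uuid, status) notification at most once."""

    def __init__(self):
        self._seen = set()

    def handle(self, payload, process):
        key = (payload["uuid"], payload["status"])
        if key in self._seen:
            return False  # duplicate delivery, already processed
        process(payload)
        # Record the key only after processing succeeded, so a crash allows a retry
        self._seen.add(key)
        return True


processed = []
dedup = WebhookDeduplicator()
event = {"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb", "status": "completed"}
print(dedup.handle(event, processed.append))  # True
print(dedup.handle(event, processed.append))  # False
print(len(processed))                         # 1
```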

Webhook Error Management

When webhooks fail, you can manage them through dedicated endpoints.

GET /api/webhook-errors

List all jobs with failed webhook deliveries.

Authentication: Required

Request Example

curl -X GET "https://crawlshot.test/api/webhook-errors" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com", 
      "status": "completed",
      "webhook_url": "https://myapp.com/webhook",
      "webhook_attempts": 6,
      "webhook_last_error": "Connection timeout",
      "webhook_next_retry_at": null,
      "created_at": "2025-08-10T10:00:42.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 1,
    "total_items": 1,
    "per_page": 20
  }
}
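
One way to act on this listing is to separate jobs that still have an automatic retry scheduled from those that do not. The sketch assumes that a null `webhook_next_retry_at` means no further automatic retry is planned, which matches the example above but is not stated explicitly. `split_webhook_errors` is a hypothetical helper.

```python
def split_webhook_errors(listing):
    """Return (pending, exhausted) lists of job UUIDs from a webhook-errors listing."""
    pending, exhausted = [], []
    for job in listing["jobs"]:
        if job.get("webhook_next_retry_at") is None:
            exhausted.append(job["uuid"])  # candidates for a manual retry or a clear
        else:
            pending.append(job["uuid"])    # another automatic attempt is scheduled
    return pending, exhausted


listing = {
    "jobs": [
        {"uuid": "job-1", "webhook_attempts": 6, "webhook_next_retry_at": None},
        {"uuid": "job-2", "webhook_attempts": 2, "webhook_next_retry_at": "2025-08-10T10:10:00.000000Z"},
    ]
}
print(split_webhook_errors(listing))  # (['job-2'], ['job-1'])
```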

POST /api/webhook-errors/{uuid}/retry

Manually retry a failed webhook immediately.

Authentication: Required

Request Example

curl -X POST "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/retry" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "message": "Webhook retry attempted"
}

DELETE /api/webhook-errors/{uuid}/clear

Clear webhook error status without retrying.

Authentication: Required

Request Example

curl -X DELETE "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/clear" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb", 
  "message": "Webhook error cleared"
}

Job Status Flow

Both crawl and screenshot jobs follow the same status progression. A job moves from queued to processing and then ends in exactly one of two final states:

  1. queued - Job created and waiting for processing
  2. processing - Job is currently being executed by a worker
  3. One of two final states:
     • completed - Job finished successfully, results available
     • failed - Job encountered an error and could not complete
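
The progression can be modeled as a small transition table, which is handy for rejecting out-of-order notifications on the client. The table below is an assumption drawn from the description above. In particular, whether a job can move straight from queued to failed is not documented, so the sketch does not allow it.

```python
# Allowed next statuses for each status (assumed from the documented flow)
TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),  # final state
    "failed": set(),     # final state
}


def can_transition(current, new):
    """Return True if moving from current to new fits the documented flow."""
    return new in TRANSITIONS.get(current, set())


print(can_transition("queued", "processing"))     # True
print(can_transition("processing", "failed"))     # True
print(can_transition("completed", "processing"))  # False
```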

Error Responses

401 Unauthorized

{
  "message": "Unauthenticated."
}

404 Not Found

{
  "error": "Job not found"
}

422 Validation Error

{
  "message": "The given data was invalid.",
  "errors": {
    "url": [
      "The url field is required."
    ],
    "timeout": [
      "The timeout must be between 5 and 300."
    ]
  }
}
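
A 422 body maps each field to a list of messages, so clients often flatten it for logging or display. A minimal sketch, using a hypothetical helper name and the example body above:

```python
def flatten_validation_errors(body):
    """Turn a 422 response body into a flat list of 'field: message' strings."""
    return [
        f"{field}: {message}"
        for field, messages in body.get("errors", {}).items()
        for message in messages
    ]


example = {
    "message": "The given data was invalid.",
    "errors": {
        "url": ["The url field is required."],
        "timeout": ["The timeout must be between 5 and 300."],
    },
}
for line in flatten_validation_errors(example):
    print(line)
```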

Features

Ad & Tracker Blocking

  • EasyList Integration: Automatically downloads and applies EasyList filters
  • Cookie Banner Blocking: Removes cookie consent prompts
  • Tracker Blocking: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
  • Custom Domain Blocking: Blocks common advertising and tracking domains

Image Processing

  • WebP Format: High-quality WebP screenshots with adjustable compression
  • Quality Control: Adjustable compression quality (1-100)
  • Efficient Processing: Optimized WebP encoding for fast delivery
  • Responsive Sizing: Custom viewport dimensions up to 4K resolution

Storage & Cleanup

  • 24-Hour TTL: All files automatically deleted after 24 hours
  • Scheduled Cleanup: Daily automated cleanup of expired files
  • Manual Cleanup: php artisan crawlshot:prune-storage command available

Performance

  • Background Processing: All jobs processed asynchronously via Laravel Horizon
  • Queue Management: Built-in retry logic and failure handling
  • Caching: EasyList filters cached for optimal performance
  • Monitoring: Horizon dashboard for real-time job monitoring at /horizon

Rate Limiting

API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.

Support

For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.