crawlshot/API_DOCUMENTATION.md
2025-08-10 21:10:33 +08:00

Crawlshot API Documentation

Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. Its API exposes endpoints for capturing a page's HTML and generating screenshots, with optional blocking of ads, cookie banners, and trackers.

Base URL

https://crawlshot.test

Authentication

All API endpoints except the health check require authentication using Laravel Sanctum API tokens.

Authentication Header

Authorization: Bearer {your-api-token}

Example API Token

1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
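
For client code, the header can be built once and reused for every authenticated call. The following is a minimal Python sketch; the helper name `auth_headers` is hypothetical and not part of Crawlshot.

```python
def auth_headers(token: str) -> dict:
    """Return the headers sent by the authenticated curl examples."""
    token = token.strip()
    if not token:
        raise ValueError("an API token is required")
    return {
        "Accept": "application/json",
        "Authorization": f"Bearer {token}",
    }


# The example token from this document, used purely for illustration.
headers = auth_headers("1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c")
print(headers["Authorization"])
```

In a real deployment the token would normally come from an environment variable or secret store rather than from source code.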

Health Check

GET /api/health

Check if the Crawlshot service is running and healthy.

Authentication: Not required

Request Example

curl -X GET "https://crawlshot.test/api/health" \
  -H "Accept: application/json"

Response Example

{
  "status": "healthy",
  "timestamp": "2025-08-10T09:54:52.195383Z",
  "service": "crawlshot"
}
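
A monitoring script might interpret the health payload along these lines. This is a sketch only: the function name `is_healthy` is hypothetical, and it assumes the response keeps the `status` and `service` fields shown in the example.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True when the payload reports a healthy crawlshot service."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and data.get("status") == "healthy"
        and data.get("service") == "crawlshot"
    )


sample = '{"status": "healthy", "timestamp": "2025-08-10T09:54:52.195383Z", "service": "crawlshot"}'
print(is_healthy(sample))
```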

Web Crawling APIs

POST /api/crawl

Initiate a web crawling job to extract HTML content from a URL.

Authentication: Required

Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| url | string | Yes | - | Target URL to crawl (max 2048 chars) |
| timeout | integer | No | 30 | Request timeout in seconds (5-300) |
| delay | integer | No | 0 | Wait time before capture in milliseconds (0-30000) |
| block_ads | boolean | No | true | Block ads using EasyList filters |
| block_cookie_banners | boolean | No | true | Block cookie consent banners |
| block_trackers | boolean | No | true | Block tracking scripts |
| wait_until_network_idle | boolean | No | false | Wait for network activity to cease |
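
Checking the documented limits on the client avoids a round trip that would end in a 422 response. The sketch below mirrors the ranges and defaults listed for the crawl parameters; the function name `build_crawl_payload` is hypothetical, and the server remains the authority on validation.

```python
def build_crawl_payload(url, timeout=30, delay=0, block_ads=True,
                        block_cookie_banners=True, block_trackers=True,
                        wait_until_network_idle=False):
    """Validate crawl parameters against the documented limits and return the JSON body."""
    if not url or len(url) > 2048:
        raise ValueError("url is required and must be at most 2048 characters")
    if not 5 <= timeout <= 300:
        raise ValueError("timeout must be between 5 and 300 seconds")
    if not 0 <= delay <= 30000:
        raise ValueError("delay must be between 0 and 30000 milliseconds")
    return {
        "url": url,
        "timeout": timeout,
        "delay": delay,
        "block_ads": block_ads,
        "block_cookie_banners": block_cookie_banners,
        "block_trackers": block_trackers,
        "wait_until_network_idle": wait_until_network_idle,
    }


payload = build_crawl_payload("https://example.com", delay=2000,
                              wait_until_network_idle=True)
print(payload)
```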

Request Example

curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true,
    "wait_until_network_idle": true
  }'

Response Example

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "message": "Crawl job initiated successfully"
}

GET /api/crawl/{uuid}

Check the status and retrieve results of a crawl job.

Authentication: Required

Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| uuid | string | Yes | Job UUID returned from crawl initiation |

Request Example

curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Examples

Queued Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z"
}

Processing Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z"
}

Completed Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n</head>\n<body>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}

Failed Status:

{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:00:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
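
Because jobs run asynchronously, a client typically polls this endpoint until the status is completed or failed. The sketch below shows one way to structure that loop. `fetch_status` is a hypothetical callable that performs the GET request and returns the decoded JSON, and the polling interval and attempt limit are arbitrary choices, not values prescribed by Crawlshot.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 2.0,
                 max_attempts: int = 30, sleep=time.sleep) -> dict:
    """Poll until the job reaches a terminal status, then return its payload."""
    for attempt in range(max_attempts):
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"job still pending after {max_attempts} attempts")


# Canned responses stand in for real HTTP calls in this demonstration.
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "result": "<!doctype html>"},
])
job = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(job["status"])
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP library and easy to test.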

GET /api/crawl

List all crawl jobs with pagination (optional endpoint for debugging).

Authentication: Required

Request Example

curl -X GET "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:00:42.000000Z",
      "completed_at": "2025-08-10T10:01:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 100,
    "per_page": 20
  }
}
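
The pagination block makes it possible to walk every page of jobs. The sketch below assumes a hypothetical `fetch_page(page)` callable that returns the decoded JSON for one page. How a page is requested is not specified in this document; a `page` query parameter is a common Laravel convention but is an assumption here.

```python
from typing import Callable, Iterator


def iter_jobs(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every job, following pagination.total_pages from each response."""
    page = 1
    while True:
        body = fetch_page(page)
        yield from body["jobs"]
        if page >= body["pagination"]["total_pages"]:
            break
        page += 1


# Two fake pages stand in for real responses (assumed request: GET /api/crawl?page=N).
fake_pages = {
    1: {"jobs": [{"uuid": "a"}, {"uuid": "b"}],
        "pagination": {"current_page": 1, "total_pages": 2, "total_items": 3, "per_page": 2}},
    2: {"jobs": [{"uuid": "c"}],
        "pagination": {"current_page": 2, "total_pages": 2, "total_items": 3, "per_page": 2}},
}
uuids = [job["uuid"] for job in iter_jobs(fake_pages.__getitem__)]
print(uuids)
```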

Screenshot APIs

POST /api/shot

Initiate a screenshot job to capture an image of a webpage.

Authentication: Required

Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| url | string | Yes | - | Target URL to screenshot (max 2048 chars) |
| viewport_width | integer | No | 1920 | Viewport width in pixels (320-3840) |
| viewport_height | integer | No | 1080 | Viewport height in pixels (240-2160) |
| format | string | No | "jpg" | Image format: "jpg", "png", "webp" |
| quality | integer | No | 90 | Image quality 1-100 (for JPEG/WebP) |
| timeout | integer | No | 30 | Request timeout in seconds (5-300) |
| delay | integer | No | 0 | Wait time before capture in milliseconds (0-30000) |
| block_ads | boolean | No | true | Block ads using EasyList filters |
| block_cookie_banners | boolean | No | true | Block cookie consent banners |
| block_trackers | boolean | No | true | Block tracking scripts |
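
The screenshot-specific parameters can be checked on the client in the same spirit. This sketch covers only the viewport, format, and quality limits listed for this endpoint and leaves the remaining options to their documented server defaults; `build_shot_payload` is a hypothetical helper name.

```python
ALLOWED_FORMATS = ("jpg", "png", "webp")


def build_shot_payload(url, viewport_width=1920, viewport_height=1080,
                       image_format="jpg", quality=90):
    """Validate screenshot parameters against the documented limits."""
    if not url or len(url) > 2048:
        raise ValueError("url is required and must be at most 2048 characters")
    if not 320 <= viewport_width <= 3840:
        raise ValueError("viewport_width must be between 320 and 3840")
    if not 240 <= viewport_height <= 2160:
        raise ValueError("viewport_height must be between 240 and 2160")
    if image_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    return {
        "url": url,
        "viewport_width": viewport_width,
        "viewport_height": viewport_height,
        "format": image_format,
        "quality": quality,
    }


shot = build_shot_payload("https://example.com", image_format="webp")
print(shot)
```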

Request Example

curl -X POST "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "viewport_width": 1920,
    "viewport_height": 1080,
    "format": "webp",
    "quality": 90,
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true
  }'

Response Example

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "message": "Screenshot job initiated successfully"
}

GET /api/shot/{uuid}

Check the status and retrieve results of a screenshot job. Once the job has completed, the response includes the base64-encoded image data and a download URL.

Authentication: Required

Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| uuid | string | Yes | Job UUID returned from screenshot initiation |

Request Example

curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Examples

Queued Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z"
}

Processing Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z"
}

Completed Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}

Failed Status:

{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:05:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
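
The `image_data` field of a completed response can be decoded and written to disk without calling the download endpoint. The following sketch assumes the field holds standard base64 of the image bytes, as described for this endpoint, and uses the `format` field as the file extension. `save_screenshot` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_screenshot(job: dict, directory: str = ".") -> Path:
    """Decode result.image_data from a completed job and write it to a file."""
    if job.get("status") != "completed":
        raise ValueError(f"job is not completed (status: {job.get('status')!r})")
    result = job["result"]
    image_bytes = base64.b64decode(result["image_data"])
    path = Path(directory) / f"{job['uuid']}.{result['format']}"
    path.write_bytes(image_bytes)
    return path
```

Calling `save_screenshot(job, "/tmp")` on a completed job writes a file named after the job UUID, for example `fe37d511-99cb-4295-853b-6d484900a851.webp`.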

GET /api/shot/{uuid}/download

Download the screenshot of a completed job as a raw image file rather than a JSON payload.

Authentication: Required

Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| uuid | string | Yes | Job UUID of a completed screenshot job |

Request Example

curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  --output screenshot.webp

Response

Returns the image file directly with appropriate Content-Type headers:

  • Content-Type: image/jpeg for JPEG files
  • Content-Type: image/png for PNG files
  • Content-Type: image/webp for WebP files
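
A client that saves the download can choose the file extension from the Content-Type header instead of hard-coding it. A small sketch based on the three content types listed above; `filename_for` is a hypothetical helper.

```python
EXTENSIONS = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/webp": "webp",
}


def filename_for(uuid: str, content_type: str) -> str:
    """Map a documented Content-Type value to a local file name."""
    # Ignore any parameters after ';' and normalise case before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in EXTENSIONS:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    return f"{uuid}.{EXTENSIONS[mime]}"


print(filename_for("fe37d511-99cb-4295-853b-6d484900a851", "image/webp"))
```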

GET /api/shot

List all screenshot jobs with pagination (optional endpoint for debugging).

Authentication: Required

Request Example

curl -X GET "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"

Response Example

{
  "jobs": [
    {
      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
      "type": "shot",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:05:42.000000Z",
      "completed_at": "2025-08-10T10:06:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 3,
    "total_items": 50,
    "per_page": 20
  }
}

Job Status Flow

Both crawl and screenshot jobs use the same set of statuses. A job starts as queued, moves to processing, and ends as either completed or failed:

  1. queued - Job created and waiting for processing
  2. processing - Job is currently being executed by a worker
  3. completed - Job finished successfully; results are available
  4. failed - Job encountered an error and could not complete
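
The status flow can be modelled as a small transition table, which is useful when a client caches job state. This sketch assumes only the transitions implied by the response examples (queued to processing, then processing to completed or failed). Whether a queued job can fail directly is not stated in this document.

```python
# Assumed transitions; completed and failed are terminal.
TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def is_terminal(status: str) -> bool:
    """A status is terminal when no further transition is expected."""
    return not TRANSITIONS[status]


def can_transition(current: str, new: str) -> bool:
    """Check whether moving from `current` to `new` matches the assumed flow."""
    return new in TRANSITIONS[current]


print([status for status in TRANSITIONS if is_terminal(status)])
```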

Error Responses

401 Unauthorized

{
  "message": "Unauthenticated."
}

404 Not Found

{
  "error": "Job not found"
}

422 Validation Error

{
  "message": "The given data was invalid.",
  "errors": {
    "url": [
      "The url field is required."
    ],
    "timeout": [
      "The timeout must be between 5 and 300."
    ]
  }
}
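
The three error bodies use different keys (`message`, `error`, and `errors`), so client code benefits from one function that normalises them. A sketch under the assumption that error responses always follow the shapes shown above; `describe_error` is a hypothetical helper.

```python
def describe_error(status_code: int, body: dict) -> str:
    """Turn a documented error response into a single human-readable message."""
    if status_code == 422:
        # Flatten the per-field validation messages into "field: message" pairs.
        details = [
            f"{field}: {message}"
            for field, messages in body.get("errors", {}).items()
            for message in messages
        ]
        return "; ".join(details) or body.get("message", "Validation failed")
    return body.get("message") or body.get("error") or f"HTTP {status_code}"


print(describe_error(404, {"error": "Job not found"}))
```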

Features

Ad & Tracker Blocking

  • EasyList Integration: Automatically downloads and applies EasyList filters
  • Cookie Banner Blocking: Removes cookie consent prompts
  • Tracker Blocking: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
  • Custom Domain Blocking: Blocks common advertising and tracking domains

Image Processing

  • Multiple Formats: Support for JPEG, PNG, and WebP
  • Quality Control: Adjustable compression quality (1-100)
  • Imagick Integration: High-quality image processing and format conversion
  • Responsive Sizing: Custom viewport dimensions up to 4K resolution (3840x2160)

Storage & Cleanup

  • 24-Hour TTL: All files automatically deleted after 24 hours
  • Scheduled Cleanup: Daily automated cleanup of expired files
  • Manual Cleanup: php artisan crawlshot:prune-storage command available
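
Clients that keep references to results may want to estimate when a file stops being available. The sketch below assumes the 24-hour TTL is measured from the job's `completed_at` timestamp, which this document does not state explicitly. Because cleanup is described as a daily task, the computed time is best read as the earliest point at which a file may be removed.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=24)


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps such as 2025-08-10T10:06:12.000000Z into aware datetimes."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def expires_at(completed_at: str) -> datetime:
    """Earliest removal time, assuming the TTL starts at completed_at."""
    return parse_timestamp(completed_at) + TTL


def is_expired(completed_at: str, now=None) -> bool:
    """Return True once the assumed TTL has elapsed."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at(completed_at)


print(expires_at("2025-08-10T10:06:12.000000Z").isoformat())
```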

Performance

  • Background Processing: All jobs processed asynchronously via Laravel Horizon
  • Queue Management: Built-in retry logic and failure handling
  • Caching: EasyList filters cached for optimal performance
  • Monitoring: Horizon dashboard for real-time job monitoring at /horizon

Rate Limiting

API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.
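
Since the limits are deployment-specific, a client can be written to back off whenever it is throttled. The sketch below assumes throttled requests receive HTTP 429, optionally with a `Retry-After` header giving a delay in seconds. That is the usual behaviour of Laravel's throttle middleware, but it is not confirmed for Crawlshot by this document. The name `retry_delay` and the backoff constants are illustrative.

```python
def retry_delay(status_code: int, headers: dict, attempt: int,
                base: float = 1.0, cap: float = 60.0):
    """Return seconds to wait before retrying, or None if no retry is needed."""
    if status_code != 429:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.isdigit():
        # Honour the server's hint (assumed to be whole seconds), capped for safety.
        return min(float(retry_after), cap)
    # Otherwise fall back to exponential backoff: base, 2*base, 4*base, ...
    return min(base * 2 ** attempt, cap)


print(retry_delay(429, {"Retry-After": "7"}, attempt=0))
```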

Support

For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.