Crawlshot API Documentation
Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. This comprehensive API provides endpoints for capturing web content and generating screenshots with advanced filtering capabilities, webhook notifications, and intelligent retry mechanisms.
Overview
Core Capabilities:
- HTML Crawling: Extract clean HTML content from web pages with ad/tracker blocking
- Screenshot Capture: Generate high-quality WebP screenshots with adjustable quality settings
- Webhook Notifications: Real-time status updates with event filtering and progressive retry
- Background Processing: Asynchronous job processing via Laravel Horizon
- Smart Filtering: EasyList integration for ad/tracker/cookie banner blocking
- Auto-cleanup: 24-hour file retention with automated cleanup
Perfect for:
- Content extraction and monitoring
- Website screenshot automation
- Quality assurance and testing
- Social media preview generation
- Compliance and archival systems
Base URL
https://crawlshot.test
Replace crawlshot.test with your actual Crawlshot service URL.
Quick Start
1. Authentication
All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.
Authentication Header:
Authorization: Bearer {your-api-token}
Example API Token:
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
2. Your First API Call
Simple HTML Crawl:
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
3. Check Job Status
curl -H "Authorization: Bearer YOUR_TOKEN" \
"https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb"
Health Check
GET /api/health
Check if the Crawlshot service is running and healthy.
Authentication: Not required
Request Example
curl -X GET "https://crawlshot.test/api/health" \
-H "Accept: application/json"
Response Example
{
"status": "healthy",
"timestamp": "2025-08-10T09:54:52.195383Z",
"service": "crawlshot"
}
Web Crawling APIs
POST /api/crawl
Initiate a web crawling job to extract HTML content from a URL.
Authentication: Required
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
Request Examples
Basic Crawl:
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
With Webhook Notifications:
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"webhook_url": "https://myapp.com/webhooks/crawlshot",
"webhook_events_filter": ["completed", "failed"],
"block_ads": true,
"timeout": 60
}'
Response Example
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
GET /api/crawl/{uuid}
Check the status and retrieve results of a crawl job.
Authentication: Required
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `uuid` | string | ✅ | Job UUID returned from crawl initiation |
Request Example
curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Examples
Queued Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z"
}
Processing Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z"
}
Completed Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z",
"result": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n</head>\n<body>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}
Failed Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:00:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
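Since jobs move through the statuses shown above asynchronously, a client typically polls the status endpoint until it sees `completed` or `failed`. A minimal polling sketch (the `fetch_status` callable is an assumption standing in for whatever wrapper you use around `GET /api/crawl/{uuid}`):

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def is_terminal(status):
    """True once a job can no longer change state."""
    return status in TERMINAL_STATUSES


def poll_job(fetch_status, interval=2.0, max_wait=300.0):
    """Call fetch_status() until the job is completed/failed or max_wait elapses.

    fetch_status is any zero-argument callable returning the status payload,
    e.g. a wrapper around GET /api/crawl/{uuid}.
    """
    deadline = time.monotonic() + max_wait
    while True:
        job = fetch_status()
        if is_terminal(job["status"]):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        time.sleep(interval)
```

For jobs you expect to take a while, prefer `webhook_url` over polling (see the Webhook System section).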
GET /api/crawl
List all crawl jobs with pagination (optional endpoint for debugging).
Authentication: Required
Request Example
curl -X GET "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"jobs": [
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"type": "crawl",
"url": "https://example.com",
"status": "completed",
"created_at": "2025-08-10T10:00:42.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z"
}
],
"pagination": {
"current_page": 1,
"total_pages": 5,
"total_items": 100,
"per_page": 20
}
}
Screenshot APIs
POST /api/shot
Initiate a screenshot job to capture an image of a webpage.
Authentication: Required
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (always WebP format) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
Request Examples
Basic Screenshot:
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"viewport_width": 1920,
"viewport_height": 1080,
"quality": 90,
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
With Webhook Notifications:
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"webhook_url": "https://myapp.com/webhooks/crawlshot",
"webhook_events_filter": ["completed"],
"viewport_width": 1200,
"viewport_height": 800,
"quality": 85
}'
Response Example
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"message": "Screenshot job initiated successfully"
}
GET /api/shot/{uuid}
Check the status and retrieve results of a screenshot job. When completed, returns base64 image data and download URL.
Authentication: Required
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |
Request Example
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Examples
Queued Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z"
}
Processing Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z"
}
Completed Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:06:12.000000Z",
"result": {
"image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
"download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
"mime_type": "image/webp",
"format": "webp",
"width": 1920,
"height": 1080,
"size": 45678
}
}
Failed Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:05:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
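When a screenshot job completes, the `image_data` field carries the WebP bytes base64-encoded, so you can write the file without a second request. A sketch of that decode step (`save_screenshot` is an illustrative helper name):

```python
import base64


def save_screenshot(result, path):
    """Decode the base64 image_data field from a completed shot job and
    write the bytes to path; returns the number of bytes written."""
    raw = base64.b64decode(result["image_data"])
    with open(path, "wb") as fh:
        fh.write(raw)
    return len(raw)
```

If you would rather stream the file, use the `download_url` from the same response (documented below).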
GET /api/shot/{uuid}/download
Download the screenshot file directly. Returns the actual image file with appropriate headers.
Authentication: Required
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `uuid` | string | ✅ | Job UUID of a completed screenshot job |
Request Example
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
--output screenshot.webp
Response
Returns the WebP image file directly with appropriate headers:
Content-Type: image/webp
GET /api/shot
List all screenshot jobs with pagination (optional endpoint for debugging).
Authentication: Required
Request Example
curl -X GET "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"jobs": [
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"type": "shot",
"url": "https://example.com",
"status": "completed",
"created_at": "2025-08-10T10:05:42.000000Z",
"completed_at": "2025-08-10T10:06:12.000000Z"
}
],
"pagination": {
"current_page": 1,
"total_pages": 3,
"total_items": 50,
"per_page": 20
}
}
Webhook System
Crawlshot supports real-time webhook notifications to keep your application informed about job status changes without constant polling.
How Webhooks Work
- Configure Webhook: Include `webhook_url` when creating jobs
- Filter Events: Use `webhook_events_filter` to specify which status changes trigger webhooks
- Receive Notifications: Your endpoint receives HTTP POST requests with job status data
- Automatic Retries: Failed webhooks are automatically retried with progressive backoff
Event Filtering
Control which job status changes trigger webhook calls:
{
"webhook_events_filter": ["completed", "failed"]
}
Available Events:
- `queued` - Job created and queued for processing
- `processing` - Job started processing
- `completed` - Job finished successfully
- `failed` - Job encountered an error
Special Behaviors:
- Default: `["queued", "processing", "completed", "failed"]` (all events)
- Disable: `[]` (empty array disables webhooks entirely)
- Omitted: Same as default (all events)
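The filter rules above are easy to mirror client-side, e.g. to predict which deliveries to expect. A sketch of that logic (function names are illustrative, not part of the API):

```python
ALL_EVENTS = ["queued", "processing", "completed", "failed"]


def effective_webhook_events(events_filter):
    """Resolve webhook_events_filter: omitted (None) means all events,
    an empty list disables webhooks, anything else is used as given."""
    if events_filter is None:
        return list(ALL_EVENTS)
    return list(events_filter)


def should_send_webhook(status, events_filter):
    """Would this status change trigger a webhook under the given filter?"""
    return status in effective_webhook_events(events_filter)
```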
Webhook Payload
Webhooks send the exact same payload as the status endpoints (GET /api/crawl/{uuid} or GET /api/shot/{uuid}), ensuring consistency.
Crawl Webhook Example:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z",
"result": {
"html": {
"url": "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb.html",
"raw": "<!doctype html>\n<html>..."
}
}
}
Screenshot Webhook Example:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:06:12.000000Z",
"result": {
"image": {
"url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851.webp",
"raw": "iVBORw0KGgoAAAANSUhEUgAAAHg..."
},
"mime_type": "image/webp",
"format": "webp",
"width": 1920,
"height": 1080,
"size": 45678
}
}
Progressive Retry System
Failed webhook deliveries are automatically retried with exponential backoff:
- 1st retry: 1 minute after failure
- 2nd retry: 2 minutes after failure
- 3rd retry: 4 minutes after failure
- 4th retry: 8 minutes after failure
- 5th retry: 16 minutes after failure
- 6th retry: 32 minutes after failure
- After 6 failures: Stops retrying, webhook marked as failed
Total retry window: ~63 minutes (1+2+4+8+16+32)
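The doubling schedule can be expressed as a one-line formula, which is handy when deciding how long your own systems should wait before treating a webhook as lost:

```python
def retry_delay_minutes(attempt):
    """Minutes to wait before retry number `attempt` (1-based).
    Delays double from 1 minute; Crawlshot stops after 6 failures."""
    if not 1 <= attempt <= 6:
        raise ValueError("attempt must be between 1 and 6")
    return 2 ** (attempt - 1)


schedule = [retry_delay_minutes(n) for n in range(1, 7)]
total_window = sum(schedule)  # 1 + 2 + 4 + 8 + 16 + 32 = 63 minutes
```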
Webhook Requirements
Your webhook endpoint should:
- Accept HTTP POST requests
- Return HTTP 2xx status codes for successful processing
- Respond within 5 seconds (webhook timeout)
- Handle duplicate deliveries gracefully (use job UUID for idempotency)
Example webhook handler (PHP):
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/webhooks/crawlshot', function (Request $request) {
    $jobData = $request->all();

    // Process the job status update
    if ($jobData['status'] === 'completed') {
        // Handle successful completion
        $result = $jobData['result'];
    } elseif ($jobData['status'] === 'failed') {
        // Handle failure
        $error = $jobData['error'];
    }

    // Return 2xx quickly so the delivery is not retried
    return response('OK', 200);
});
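Because retries can produce duplicate deliveries, the handler should be idempotent. One way to sketch that, in Python for illustration, is to key processed deliveries on the `(uuid, status)` pair (the `WebhookDeduper` class is hypothetical, not part of Crawlshot):

```python
class WebhookDeduper:
    """Track (uuid, status) pairs so duplicate deliveries are handled once.

    An in-memory set is enough for a sketch; a real handler would persist
    these keys in a database or cache shared across workers.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, payload):
        """Return True if this delivery is new and should be processed."""
        key = (payload["uuid"], payload["status"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```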
Webhook Error Management
When webhooks fail, you can manage them through dedicated endpoints.
GET /api/webhook-errors
List all jobs with failed webhook deliveries.
Authentication: Required
Request Example
curl -X GET "https://crawlshot.test/api/webhook-errors" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"jobs": [
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"type": "crawl",
"url": "https://example.com",
"status": "completed",
"webhook_url": "https://myapp.com/webhook",
"webhook_attempts": 6,
"webhook_last_error": "Connection timeout",
"webhook_next_retry_at": null,
"created_at": "2025-08-10T10:00:42.000000Z"
}
],
"pagination": {
"current_page": 1,
"total_pages": 1,
"total_items": 1,
"per_page": 20
}
}
POST /api/webhook-errors/{uuid}/retry
Manually retry a failed webhook immediately.
Authentication: Required
Request Example
curl -X POST "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/retry" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"message": "Webhook retry attempted"
}
DELETE /api/webhook-errors/{uuid}/clear
Clear webhook error status without retrying.
Authentication: Required
Request Example
curl -X DELETE "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/clear" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"message": "Webhook error cleared"
}
Job Status Flow
Both crawl and screenshot jobs follow the same status progression:
queued- Job created and waiting for processingprocessing- Job is currently being executed by a workercompleted- Job finished successfully, results availablefailed- Job encountered an error and could not complete
Error Responses
401 Unauthorized
{
"message": "Unauthenticated."
}
404 Not Found
{
"error": "Job not found"
}
422 Validation Error
{
"message": "The given data was invalid.",
"errors": {
"url": [
"The url field is required."
],
"timeout": [
"The timeout must be between 5 and 300."
]
}
}
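Since the 422 body nests messages per field, clients often flatten it for logging or display. A small sketch of that transformation:

```python
def flatten_validation_errors(body):
    """Turn a 422 response body into a flat list of 'field: message' strings."""
    return [
        f"{field}: {message}"
        for field, messages in body.get("errors", {}).items()
        for message in messages
    ]
```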
Features
Ad & Tracker Blocking
- EasyList Integration: Automatically downloads and applies EasyList filters
- Cookie Banner Blocking: Removes cookie consent prompts
- Tracker Blocking: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
- Custom Domain Blocking: Blocks common advertising and tracking domains
Image Processing
- WebP Format: High-quality WebP screenshots with adjustable compression
- Quality Control: Adjustable compression quality (1-100)
- Efficient Processing: Optimized WebP encoding for fast delivery
- Responsive Sizing: Custom viewport dimensions up to 4K resolution
Storage & Cleanup
- 24-Hour TTL: All files automatically deleted after 24 hours
- Scheduled Cleanup: Daily automated cleanup of expired files
- Manual Cleanup: `php artisan crawlshot:prune-storage` command available
Performance
- Background Processing: All jobs processed asynchronously via Laravel Horizon
- Queue Management: Built-in retry logic and failure handling
- Caching: EasyList filters cached for optimal performance
- Monitoring: Horizon dashboard for real-time job monitoring at `/horizon`
Rate Limiting
API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.
Support
For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.