# Crawlshot API Documentation

Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. This API provides endpoints for capturing web content and generating screenshots with advanced filtering capabilities.

## Base URL

```
https://crawlshot.test
```

## Authentication

All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.

### Authentication Header

```http
Authorization: Bearer {your-api-token}
```

### Example API Token

```
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
```

---

## Health Check

### GET `/api/health`

Check if the Crawlshot service is running and healthy.

**Authentication:** Not required

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/health" \
  -H "Accept: application/json"
```

#### Response Example

```json
{
  "status": "healthy",
  "timestamp": "2025-08-10T09:54:52.195383Z",
  "service": "crawlshot"
}
```

---

## Web Crawling APIs

### POST `/api/crawl`

Initiate a web crawling job to extract HTML content from a URL.

**Authentication:** Required

#### Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `wait_until_network_idle` | boolean | ❌ | false | Wait for network activity to cease |

#### Request Example

```bash
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true,
    "wait_until_network_idle": true
  }'
```

#### Response Example

```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "message": "Crawl job initiated successfully"
}
```

---

### GET `/api/crawl/{uuid}`

Check the status and retrieve results of a crawl job.

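
Because jobs run asynchronously, a client typically calls this endpoint repeatedly until the job reaches `completed` or `failed`. The following is a minimal polling sketch, not part of Crawlshot itself: the helper names `fetch_job` and `wait_for_job`, the 2-second interval, and the 30-attempt limit are illustrative assumptions, and the base URL is the one used throughout this document.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://crawlshot.test"  # base URL used in this documentation
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(uuid: str, token: str) -> dict:
    """GET /api/crawl/{uuid} and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/crawl/{uuid}",
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(
    fetch: Callable[[], dict],
    interval: float = 2.0,       # assumed polling interval, in seconds
    max_attempts: int = 30,      # assumed upper bound on polls
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the job is completed or failed, or attempts run out."""
    for attempt in range(max_attempts):
        job = fetch()
        if job["status"] in TERMINAL_STATUSES:
            return job
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"job still not finished after {max_attempts} attempts")
```

A call such as `wait_for_job(lambda: fetch_job(uuid, token))` returns the final job document; the caller then reads `result` when the status is `completed` or `error` when it is `failed`. Passing the fetch function in as an argument keeps the loop independent of the HTTP layer, so the same loop can serve screenshot jobs with a different fetcher.
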
**Authentication:** Required

#### Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from crawl initiation |

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Examples

**Queued Status:**

```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z"
}
```

**Processing Status:**

```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z"
}
```

**Completed Status:**

```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": "<!DOCTYPE html>\n<html>\n<head>\n <title>Example Domain</title>\n</head>\n<body>\n<h1>Example Domain</h1>\n<p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}
```

**Failed Status:**

```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:00:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
```

---

### GET `/api/crawl`

List all crawl jobs with pagination (optional endpoint for debugging).

**Authentication:** Required

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Example

```json
{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:00:42.000000Z",
      "completed_at": "2025-08-10T10:01:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 100,
    "per_page": 20
  }
}
```

---

## Screenshot APIs

### POST `/api/shot`

Initiate a screenshot job to capture an image of a webpage.

**Authentication:** Required

#### Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `format` | string | ❌ | "jpg" | Image format: "jpg", "png", "webp" |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (for JPEG/WebP) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |

#### Request Example

```bash
curl -X POST "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "viewport_width": 1920,
    "viewport_height": 1080,
    "format": "webp",
    "quality": 90,
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true
  }'
```

#### Response Example

```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "message": "Screenshot job initiated successfully"
}
```

---

### GET `/api/shot/{uuid}`

Check the status and retrieve results of a screenshot job. When completed, returns base64 image data and download URL.

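
As an illustration of consuming the completed response, the sketch below decodes the base64 `image_data` field and writes it to a file named after the job UUID. It assumes that `image_data` is plain standard base64 with no `data:` URI prefix, which this document does not state explicitly; `save_screenshot` is a hypothetical helper, not part of the API.

```python
import base64
from pathlib import Path

# File extensions for the three formats listed in the request parameters.
EXTENSIONS = {"jpg": ".jpg", "png": ".png", "webp": ".webp"}


def save_screenshot(job: dict, directory: str = ".") -> Path:
    """Decode result.image_data from a completed job and write it to disk."""
    if job.get("status") != "completed":
        raise ValueError(f"job is not completed (status: {job.get('status')})")
    result = job["result"]
    # Assumption: image_data is standard base64 without a data-URI prefix.
    image_bytes = base64.b64decode(result["image_data"])
    path = Path(directory) / f"{job['uuid']}{EXTENSIONS[result['format']]}"
    path.write_bytes(image_bytes)
    return path
```

For large images, the `download_url` in the same response avoids the base64 overhead, since that endpoint returns the image file directly.
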
**Authentication:** Required

#### Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Examples

**Queued Status:**

```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z"
}
```

**Processing Status:**

```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z"
}
```

**Completed Status:**

```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}
```

**Failed Status:**

```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:05:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
```

---

### GET `/api/shot/{uuid}/download`

Download the screenshot file directly. Returns the actual image file with appropriate headers.

**Authentication:** Required

#### Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID of a completed screenshot job |

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  --output screenshot.webp
```

#### Response

Returns the image file directly with appropriate `Content-Type` headers:

- `Content-Type: image/jpeg` for JPEG files
- `Content-Type: image/png` for PNG files
- `Content-Type: image/webp` for WebP files

---

### GET `/api/shot`

List all screenshot jobs with pagination (optional endpoint for debugging).

**Authentication:** Required

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Example

```json
{
  "jobs": [
    {
      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
      "type": "shot",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:05:42.000000Z",
      "completed_at": "2025-08-10T10:06:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 3,
    "total_items": 50,
    "per_page": 20
  }
}
```

---

## Job Status Flow

Both crawl and screenshot jobs follow the same status progression:

1. **`queued`** - Job created and waiting for processing
2. **`processing`** - Job is currently being executed by a worker
3. **`completed`** - Job finished successfully, results available
4. **`failed`** - Job encountered an error and could not complete

## Error Responses

### 401 Unauthorized

```json
{
  "message": "Unauthenticated."
}
```

### 404 Not Found

```json
{
  "error": "Job not found"
}
```

### 422 Validation Error

```json
{
  "message": "The given data was invalid.",
  "errors": {
    "url": [
      "The url field is required."
    ],
    "timeout": [
      "The timeout must be between 5 and 300."
    ]
  }
}
```

## Features

### Ad & Tracker Blocking

- **EasyList Integration**: Automatically downloads and applies EasyList filters
- **Cookie Banner Blocking**: Removes cookie consent prompts
- **Tracker Blocking**: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
- **Custom Domain Blocking**: Blocks common advertising and tracking domains

### Image Processing

- **Multiple Formats**: Support for JPEG, PNG, and WebP
- **Quality Control**: Adjustable compression quality (1-100)
- **Imagick Integration**: High-quality image processing and format conversion
- **Responsive Sizing**: Custom viewport dimensions up to 4K resolution

### Storage & Cleanup

- **24-Hour TTL**: All files automatically deleted after 24 hours
- **Scheduled Cleanup**: Daily automated cleanup of expired files
- **Manual Cleanup**: `php artisan crawlshot:prune-storage` command available

### Performance

- **Background Processing**: All jobs processed asynchronously via Laravel Horizon
- **Queue Management**: Built-in retry logic and failure handling
- **Caching**: EasyList filters cached for optimal performance
- **Monitoring**: Horizon dashboard for real-time job monitoring at `/horizon`

## Rate Limiting

API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.

## Support

For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.
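
## Appendix: Client-Side Parameter Check

To avoid a round trip that ends in a 422 response, a client can check screenshot parameters locally before calling `POST /api/shot`. The sketch below mirrors the limits listed in that endpoint's parameter table; it assumes the server enforces exactly those limits, and `validate_shot_params` is a hypothetical helper whose messages are illustrative, not the server's wording.

```python
from typing import Any

# (minimum, maximum) per integer parameter, copied from the POST /api/shot table.
INTEGER_RANGES = {
    "viewport_width": (320, 3840),
    "viewport_height": (240, 2160),
    "quality": (1, 100),
    "timeout": (5, 300),
    "delay": (0, 30000),
}
FORMATS = {"jpg", "png", "webp"}
MAX_URL_LENGTH = 2048


def validate_shot_params(params: dict[str, Any]) -> dict[str, list[str]]:
    """Return a field -> messages mapping for values outside the documented limits."""
    errors: dict[str, list[str]] = {}

    url = params.get("url")
    if not isinstance(url, str) or not url:
        errors["url"] = ["url is required"]
    elif len(url) > MAX_URL_LENGTH:
        errors["url"] = [f"url must be at most {MAX_URL_LENGTH} characters"]

    for name, (low, high) in INTEGER_RANGES.items():
        if name not in params:
            continue  # optional parameter; the server applies its default
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            errors[name] = [f"{name} must be an integer between {low} and {high}"]

    if "format" in params and params["format"] not in FORMATS:
        errors["format"] = ["format must be one of: jpg, png, webp"]

    return errors
```

An empty dictionary means the payload passes the local check. The return shape imitates the `errors` object of the 422 response so that local and server-side failures can be handled by the same code path.
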