Crawlshot API Documentation
Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. This comprehensive API provides endpoints for capturing web content and generating screenshots with advanced filtering capabilities, webhook notifications, and intelligent retry mechanisms.
Overview
Core Capabilities:
- HTML Crawling: Extract clean HTML content from web pages with ad/tracker blocking
- Screenshot Capture: Generate high-quality WebP screenshots with adjustable quality settings
- Webhook Notifications: Real-time status updates with event filtering and progressive retry
- Background Processing: Asynchronous job processing via Laravel Horizon
- Smart Filtering: EasyList integration for ad/tracker/cookie banner blocking
- Auto-cleanup: 24-hour file retention with automated cleanup
Perfect for:
- Content extraction and monitoring
- Website screenshot automation
- Quality assurance and testing
- Social media preview generation
- Compliance and archival systems
Base URL
https://crawlshot.test
Replace crawlshot.test with your actual Crawlshot service URL.
Quick Start
1. Authentication
All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.
Authentication Header:
Authorization: Bearer {your-api-token}
Example API Token:
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
2. Your First API Call
Simple HTML Crawl:
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
3. Check Job Status
curl -H "Authorization: Bearer YOUR_TOKEN" \
"https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb"
Health Check
GET /api/health
Check if the Crawlshot service is running and healthy.
Authentication: Not required
Request Example
curl -X GET "https://crawlshot.test/api/health" \
-H "Accept: application/json"
Response Example
{
"status": "healthy",
"timestamp": "2025-08-10T09:54:52.195383Z",
"service": "crawlshot"
}
Web Crawling APIs
POST /api/crawl
Initiate a web crawling job to extract HTML content from a URL.
Authentication: Required
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
Request Examples
Basic Crawl:
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
With Webhook Notifications:
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"webhook_url": "https://myapp.com/webhooks/crawlshot",
"webhook_events_filter": ["completed", "failed"],
"block_ads": true,
"timeout": 60
}'
Response Example
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
GET /api/crawl/{uuid}
Check the status and retrieve results of a crawl job.
Authentication: Required
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `uuid` | string | ✅ | Job UUID returned from crawl initiation |
Request Example
curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Examples
Queued Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z"
}
Processing Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z"
}
Completed Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z",
"result": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n</head>\n<body>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}
Failed Status:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:00:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
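Since jobs move through the statuses shown above asynchronously, a client typically polls the status endpoint until it sees `completed` or `failed`. A minimal polling sketch (the `fetch_status` callable is an assumption standing in for whatever wrapper you use around `GET /api/crawl/{uuid}`):

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def is_terminal(status):
    """True once a job can no longer change state."""
    return status in TERMINAL_STATUSES


def poll_job(fetch_status, interval=2.0, max_wait=300.0):
    """Call fetch_status() until the job is completed/failed or max_wait elapses.

    fetch_status is any zero-argument callable returning the status payload,
    e.g. a wrapper around GET /api/crawl/{uuid}.
    """
    deadline = time.monotonic() + max_wait
    while True:
        job = fetch_status()
        if is_terminal(job["status"]):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        time.sleep(interval)
```

For jobs you expect to take a while, prefer `webhook_url` over polling (see the Webhook System section).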
GET /api/crawl
List all crawl jobs with pagination (optional endpoint for debugging).
Authentication: Required
Request Example
curl -X GET "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"jobs": [
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"type": "crawl",
"url": "https://example.com",
"status": "completed",
"created_at": "2025-08-10T10:00:42.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z"
}
],
"pagination": {
"current_page": 1,
"total_pages": 5,
"total_items": 100,
"per_page": 20
}
}
Screenshot APIs
POST /api/shot
Initiate a screenshot job to capture an image of a webpage.
Authentication: Required
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (always WebP format) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
Request Examples
Basic Screenshot:
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"viewport_width": 1920,
"viewport_height": 1080,
"quality": 90,
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
With Webhook Notifications:
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"webhook_url": "https://myapp.com/webhooks/crawlshot",
"webhook_events_filter": ["completed"],
"viewport_width": 1200,
"viewport_height": 800,
"quality": 85
}'
Response Example
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"message": "Screenshot job initiated successfully"
}
GET /api/shot/{uuid}
Check the status and retrieve results of a screenshot job. When completed, returns base64 image data and download URL.
Authentication: Required
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |
Request Example
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Examples
Queued Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z"
}
Processing Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z"
}
Completed Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:06:12.000000Z",
"result": {
"image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
"download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
"mime_type": "image/webp",
"format": "webp",
"width": 1920,
"height": 1080,
"size": 45678
}
}
Failed Status:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:05:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
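When a screenshot job completes, the `image_data` field carries the WebP bytes base64-encoded, so you can write the file without a second request. A sketch of that decode step (`save_screenshot` is an illustrative helper name):

```python
import base64


def save_screenshot(result, path):
    """Decode the base64 image_data field from a completed shot job and
    write the bytes to path; returns the number of bytes written."""
    raw = base64.b64decode(result["image_data"])
    with open(path, "wb") as fh:
        fh.write(raw)
    return len(raw)
```

If you would rather stream the file, use the `download_url` from the same response (documented below).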
GET /api/shot/{uuid}/download
Download the screenshot file directly. Returns the actual image file with appropriate headers.
Authentication: Required
Path Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `uuid` | string | ✅ | Job UUID of a completed screenshot job |
Request Example
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
--output screenshot.webp
Response
Returns the WebP image file directly with appropriate headers:
Content-Type: image/webp
GET /api/shot
List all screenshot jobs with pagination (optional endpoint for debugging).
Authentication: Required
Request Example
curl -X GET "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"jobs": [
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"type": "shot",
"url": "https://example.com",
"status": "completed",
"created_at": "2025-08-10T10:05:42.000000Z",
"completed_at": "2025-08-10T10:06:12.000000Z"
}
],
"pagination": {
"current_page": 1,
"total_pages": 3,
"total_items": 50,
"per_page": 20
}
}
Webhook System
Crawlshot supports real-time webhook notifications to keep your application informed about job status changes without constant polling.
How Webhooks Work
- Configure Webhook: Include `webhook_url` when creating jobs
- Filter Events: Use `webhook_events_filter` to specify which status changes trigger webhooks
- Receive Notifications: Your endpoint receives HTTP POST requests with job status data
- Automatic Retries: Failed webhooks are automatically retried with progressive backoff
Event Filtering
Control which job status changes trigger webhook calls:
{
"webhook_events_filter": ["completed", "failed"]
}
Available Events:
- `queued` - Job created and queued for processing
- `processing` - Job started processing
- `completed` - Job finished successfully
- `failed` - Job encountered an error
Special Behaviors:
- Default: `["queued", "processing", "completed", "failed"]` (all events)
- Disable: `[]` (empty array disables webhooks entirely)
- Omitted: Same as default (all events)
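The filter rules above are easy to mirror client-side, e.g. to predict which deliveries to expect. A sketch of that logic (function names are illustrative, not part of the API):

```python
ALL_EVENTS = ["queued", "processing", "completed", "failed"]


def effective_webhook_events(events_filter):
    """Resolve webhook_events_filter: omitted (None) means all events,
    an empty list disables webhooks, anything else is used as given."""
    if events_filter is None:
        return list(ALL_EVENTS)
    return list(events_filter)


def should_send_webhook(status, events_filter):
    """Would this status change trigger a webhook under the given filter?"""
    return status in effective_webhook_events(events_filter)
```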
Webhook Payload
Webhooks send the exact same payload as the status endpoints (GET /api/crawl/{uuid} or GET /api/shot/{uuid}), ensuring consistency.
Crawl Webhook Example:
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z",
"result": {
"html": {
"url": "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb.html",
"raw": "<!doctype html>\n<html>..."
}
}
}
Screenshot Webhook Example:
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:06:12.000000Z",
"result": {
"image": {
"url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851.webp",
"raw": "iVBORw0KGgoAAAANSUhEUgAAAHg..."
},
"mime_type": "image/webp",
"format": "webp",
"width": 1920,
"height": 1080,
"size": 45678
}
}
Progressive Retry System
Failed webhook deliveries are automatically retried with exponential backoff:
- 1st retry: 1 minute after failure
- 2nd retry: 2 minutes after failure
- 3rd retry: 4 minutes after failure
- 4th retry: 8 minutes after failure
- 5th retry: 16 minutes after failure
- 6th retry: 32 minutes after failure
- After 6 failures: Stops retrying, webhook marked as failed
Total retry window: ~63 minutes (1+2+4+8+16+32)
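The doubling schedule can be expressed as a one-line formula, which is handy when deciding how long your own systems should wait before treating a webhook as lost:

```python
def retry_delay_minutes(attempt):
    """Minutes to wait before retry number `attempt` (1-based).
    Delays double from 1 minute; Crawlshot stops after 6 failures."""
    if not 1 <= attempt <= 6:
        raise ValueError("attempt must be between 1 and 6")
    return 2 ** (attempt - 1)


schedule = [retry_delay_minutes(n) for n in range(1, 7)]
total_window = sum(schedule)  # 1 + 2 + 4 + 8 + 16 + 32 = 63 minutes
```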
Webhook Requirements
Your webhook endpoint should:
- Accept HTTP POST requests
- Return HTTP 2xx status codes for successful processing
- Respond within 5 seconds (webhook timeout)
- Handle duplicate deliveries gracefully (use job UUID for idempotency)
Example webhook handler (PHP):
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/webhooks/crawlshot', function (Request $request) {
    $jobData = $request->all();

    // Process the job status update
    if ($jobData['status'] === 'completed') {
        // Handle successful completion
        $result = $jobData['result'];
    } elseif ($jobData['status'] === 'failed') {
        // Handle failure
        $error = $jobData['error'];
    }

    // Return 2xx quickly so the delivery is not retried
    return response('OK', 200);
});
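Because retries can produce duplicate deliveries, the handler should be idempotent. One way to sketch that, in Python for illustration, is to key processed deliveries on the `(uuid, status)` pair (the `WebhookDeduper` class is hypothetical, not part of Crawlshot):

```python
class WebhookDeduper:
    """Track (uuid, status) pairs so duplicate deliveries are handled once.

    An in-memory set is enough for a sketch; a real handler would persist
    these keys in a database or cache shared across workers.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, payload):
        """Return True if this delivery is new and should be processed."""
        key = (payload["uuid"], payload["status"])
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```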
Webhook Error Management
When webhooks fail, you can manage them through dedicated endpoints.
GET /api/webhook-errors
List all jobs with failed webhook deliveries.
Authentication: Required
Request Example
curl -X GET "https://crawlshot.test/api/webhook-errors" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"jobs": [
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"type": "crawl",
"url": "https://example.com",
"status": "completed",
"webhook_url": "https://myapp.com/webhook",
"webhook_attempts": 6,
"webhook_last_error": "Connection timeout",
"webhook_next_retry_at": null,
"created_at": "2025-08-10T10:00:42.000000Z"
}
],
"pagination": {
"current_page": 1,
"total_pages": 1,
"total_items": 1,
"per_page": 20
}
}
POST /api/webhook-errors/{uuid}/retry
Manually retry a failed webhook immediately.
Authentication: Required
Request Example
curl -X POST "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/retry" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"message": "Webhook retry attempted"
}
DELETE /api/webhook-errors/{uuid}/clear
Clear webhook error status without retrying.
Authentication: Required
Request Example
curl -X DELETE "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/clear" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
Response Example
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"message": "Webhook error cleared"
}
Job Status Flow
Both crawl and screenshot jobs follow the same status progression:
queued- Job created and waiting for processingprocessing- Job is currently being executed by a workercompleted- Job finished successfully, results availablefailed- Job encountered an error and could not complete
Error Responses
401 Unauthorized
{
"message": "Unauthenticated."
}
404 Not Found
{
"error": "Job not found"
}
422 Validation Error
{
"message": "The given data was invalid.",
"errors": {
"url": [
"The url field is required."
],
"timeout": [
"The timeout must be between 5 and 300."
]
}
}
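Since the 422 body nests messages per field, clients often flatten it for logging or display. A small sketch of that transformation:

```python
def flatten_validation_errors(body):
    """Turn a 422 response body into a flat list of 'field: message' strings."""
    return [
        f"{field}: {message}"
        for field, messages in body.get("errors", {}).items()
        for message in messages
    ]
```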
Features
Ad & Tracker Blocking
- EasyList Integration: Automatically downloads and applies EasyList filters
- Cookie Banner Blocking: Removes cookie consent prompts
- Tracker Blocking: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
- Custom Domain Blocking: Blocks common advertising and tracking domains
Image Processing
- WebP Format: High-quality WebP screenshots with adjustable compression
- Quality Control: Adjustable compression quality (1-100)
- Efficient Processing: Optimized WebP encoding for fast delivery
- Responsive Sizing: Custom viewport dimensions up to 4K resolution
Storage & Cleanup
- 24-Hour TTL: All files automatically deleted after 24 hours
- Scheduled Cleanup: Daily automated cleanup of expired files
- Manual Cleanup: `php artisan crawlshot:prune-storage` command available
Performance
- Background Processing: All jobs processed asynchronously via Laravel Horizon
- Queue Management: Built-in retry logic and failure handling
- Caching: EasyList filters cached for optimal performance
- Monitoring: Horizon dashboard for real-time job monitoring at `/horizon`
Rate Limiting
API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.
Support
For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.