This commit is contained in:
ct
2025-08-11 02:35:35 +08:00
parent 4a80723243
commit f3c91b9a64
24 changed files with 2035 additions and 214 deletions


@@ -1,6 +1,23 @@
# Crawlshot API Documentation
Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. This API provides endpoints for capturing web content and generating screenshots with advanced filtering capabilities.
Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. The API provides endpoints for capturing web content and generating screenshots, with ad and tracker filtering, webhook notifications, and automatic webhook retries.
## Overview
**Core Capabilities:**
- **HTML Crawling**: Extract clean HTML content from web pages with ad/tracker blocking
- **Screenshot Capture**: Generate high-quality WebP screenshots with adjustable quality settings
- **Webhook Notifications**: Real-time status updates with event filtering and progressive retry
- **Background Processing**: Asynchronous job processing via Laravel Horizon
- **Smart Filtering**: EasyList integration for ad/tracker/cookie banner blocking
- **Auto-cleanup**: 24-hour file retention with automated cleanup
**Typical use cases:**
- Content extraction and monitoring
- Website screenshot automation
- Quality assurance and testing
- Social media preview generation
- Compliance and archival systems
## Base URL
@@ -8,21 +25,50 @@ ## Base URL
https://crawlshot.test
```
## Authentication
Replace `crawlshot.test` with your actual Crawlshot service URL.
## Quick Start
### 1. Authentication
All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.
### Authentication Header
**Authentication Header:**
```http
Authorization: Bearer {your-api-token}
```
### Example API Token
**Example API Token:**
```
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
```
### 2. Your First API Call
**Simple HTML Crawl:**
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```
**Response:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "message": "Crawl job initiated successfully"
}
```
### 3. Check Job Status
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb"
```
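For clients that do not use webhooks, the status check above can be wrapped in a simple polling loop. The following is a minimal Python sketch, not part of Crawlshot itself: the helper names `fetch_status` and `wait_for_job`, the polling interval, and the choice of `completed` and `failed` as terminal statuses are illustrative assumptions.

```python
import json
import time
import urllib.request

BASE_URL = "https://crawlshot.test"  # replace with your Crawlshot service URL
TERMINAL_STATUSES = {"completed", "failed"}  # assumption: no status follows these


def fetch_status(uuid, token, base_url=BASE_URL):
    """GET /api/crawl/{uuid} and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/crawl/{uuid}",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(uuid, token, fetch=fetch_status, interval=2.0, max_polls=30,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    for attempt in range(max_polls):
        job = fetch(uuid, token)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if attempt < max_polls - 1:
            sleep(interval)  # wait before the next status check
    raise TimeoutError(f"job {uuid} did not finish after {max_polls} polls")
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running service; in normal use, `wait_for_job(uuid, token)` is enough.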
---
## Health Check
@@ -70,10 +116,12 @@ #### Request Parameters
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `wait_until_network_idle` | boolean | ❌ | false | Wait for network activity to cease |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
#### Request Example
#### Request Examples
**Basic Crawl:**
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
@@ -85,8 +133,22 @@ #### Request Example
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true,
"wait_until_network_idle": true
"block_trackers": true
}'
```
**With Webhook Notifications:**
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "webhook_url": "https://myapp.com/webhooks/crawlshot",
    "webhook_events_filter": ["completed", "failed"],
    "block_ads": true,
    "timeout": 60
  }'
```
@@ -227,16 +289,18 @@ #### Request Parameters
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `format` | string | ❌ | "jpg" | Image format: "jpg", "png", "webp" |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (for JPEG/WebP) |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (always WebP format) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
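The numeric ranges in the table can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. This is a minimal Python sketch: the function name `validate_shot_request` and the error messages are illustrative, and the server remains the authority on validation.

```python
# Ranges copied from the parameter table above.
SHOT_LIMITS = {
    "viewport_width": (320, 3840),
    "viewport_height": (240, 2160),
    "quality": (1, 100),
    "timeout": (5, 300),
    "delay": (0, 30000),
}
MAX_URL_LENGTH = 2048


def validate_shot_request(payload):
    """Return a list of problems; an empty list means the payload passed."""
    errors = []
    url = payload.get("url")
    if not isinstance(url, str) or not url:
        errors.append("url is required")
    elif len(url) > MAX_URL_LENGTH:
        errors.append(f"url must be at most {MAX_URL_LENGTH} characters")

    for name, (low, high) in SHOT_LIMITS.items():
        if name not in payload:
            continue  # optional parameter, the server default applies
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{name} must be an integer between {low} and {high}")
    return errors
```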
#### Request Example
#### Request Examples
**Basic Screenshot:**
```bash
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
@@ -246,7 +310,6 @@ #### Request Example
"url": "https://example.com",
"viewport_width": 1920,
"viewport_height": 1080,
"format": "webp",
"quality": 90,
"timeout": 30,
"delay": 2000,
@@ -256,6 +319,22 @@ #### Request Example
}'
```
**With Webhook Notifications:**
```bash
curl -X POST "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "webhook_url": "https://myapp.com/webhooks/crawlshot",
    "webhook_events_filter": ["completed"],
    "viewport_width": 1200,
    "viewport_height": 800,
    "quality": 85
  }'
```
#### Response Example
```json
@@ -369,10 +448,8 @@ #### Request Example
#### Response
Returns the image file directly with appropriate `Content-Type` headers:
- `Content-Type: image/jpeg` for JPEG files
- `Content-Type: image/png` for PNG files
- `Content-Type: image/webp` for WebP files
Returns the WebP image file directly with appropriate headers:
- `Content-Type: image/webp`
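A client that downloads the image may want to confirm it really received a WebP file before storing it. The sketch below checks the response header and the RIFF/WEBP container signature; the function names are illustrative and not part of Crawlshot.

```python
def looks_like_webp(data: bytes) -> bool:
    """Check the container signature: 'RIFF' at bytes 0-3, 'WEBP' at bytes 8-11."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP"


def check_screenshot_response(content_type: str, body: bytes) -> None:
    """Raise ValueError if the download does not look like a WebP image."""
    # Ignore any parameters after the media type, e.g. "; charset=..."
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "image/webp":
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    if not looks_like_webp(body):
        raise ValueError("body does not start with a RIFF/WEBP signature")
```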
---
@@ -415,6 +492,217 @@ #### Response Example
---
## Webhook System
Crawlshot supports real-time webhook notifications to keep your application informed about job status changes without constant polling.
### How Webhooks Work
1. **Configure Webhook**: Include `webhook_url` when creating jobs
2. **Filter Events**: Use `webhook_events_filter` to specify which status changes trigger webhooks
3. **Receive Notifications**: Your endpoint receives HTTP POST requests with job status data
4. **Automatic Retries**: Failed webhooks are automatically retried with progressive backoff
### Event Filtering
Control which job status changes trigger webhook calls:
```json
{
"webhook_events_filter": ["completed", "failed"]
}
```
**Available Events:**
- `queued` - Job created and queued for processing
- `processing` - Job started processing
- `completed` - Job finished successfully
- `failed` - Job encountered an error
**Special Behaviors:**
- **Default**: `["queued", "processing", "completed", "failed"]` (all events)
- **Disable**: `[]` (empty array disables webhooks entirely)
- **Omitted**: Same as default (all events)
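The three cases above (explicit list, empty list, omitted) can be summarised in a few lines. This Python sketch mirrors the documented rules on the client side; the function name `effective_events` is illustrative, and treating a missing `webhook_url` as "no webhooks" is an assumption based on the parameter's `null` default.

```python
ALL_EVENTS = ("queued", "processing", "completed", "failed")


def effective_events(payload):
    """Return the job statuses that would trigger a webhook for this request."""
    if not payload.get("webhook_url"):
        return ()  # assumption: without a webhook_url nothing is delivered
    if "webhook_events_filter" not in payload:
        return ALL_EVENTS  # omitted: same as the default (all events)
    requested = payload["webhook_events_filter"]
    unknown = [event for event in requested if event not in ALL_EVENTS]
    if unknown:
        raise ValueError(f"unknown webhook events: {unknown}")
    # An empty list yields an empty tuple, i.e. webhooks are disabled.
    return tuple(event for event in ALL_EVENTS if event in requested)
```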
### Webhook Payload
Webhooks send the **exact same payload** as the status endpoints (`GET /api/crawl/{uuid}` or `GET /api/shot/{uuid}`), ensuring consistency.
**Crawl Webhook Example:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": {
    "html": {
      "url": "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb.html",
      "raw": "<!doctype html>\n<html>..."
    }
  }
}
```
**Screenshot Webhook Example:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image": {
      "url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851.webp",
      "raw": "iVBORw0KGgoAAAANSUhEUgAAAHg..."
    },
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}
```
### Progressive Retry System
Failed webhook deliveries are automatically retried with exponential backoff:
- **1st retry**: 1 minute after the initial failure
- **2nd retry**: 2 minutes after the previous failed attempt
- **3rd retry**: 4 minutes after the previous failed attempt
- **4th retry**: 8 minutes after the previous failed attempt
- **5th retry**: 16 minutes after the previous failed attempt
- **6th retry**: 32 minutes after the previous failed attempt
- **After 6 failures**: Stops retrying, webhook marked as failed
**Total retry window**: ~63 minutes (1+2+4+8+16+32)
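The schedule above is a doubling sequence, which can be reproduced in a few lines to estimate when the last retry would fire. This is a sketch of the arithmetic only, not Crawlshot's implementation; it assumes each delay is measured from the previous failed attempt and that each attempt fails immediately.

```python
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 6


def retry_delay_minutes(retry_number):
    """Delay before the given retry (1-based): 1, 2, 4, 8, 16, 32 minutes."""
    if not 1 <= retry_number <= MAX_RETRIES:
        raise ValueError(f"retry_number must be between 1 and {MAX_RETRIES}")
    return 2 ** (retry_number - 1)


def retry_schedule(first_failure_at):
    """Return the time of each retry, chaining every delay onto the previous one."""
    schedule = []
    current = first_failure_at
    for retry_number in range(1, MAX_RETRIES + 1):
        current += timedelta(minutes=retry_delay_minutes(retry_number))
        schedule.append(current)
    return schedule


if __name__ == "__main__":
    start = datetime(2025, 8, 10, 10, 0, tzinfo=timezone.utc)
    for number, moment in enumerate(retry_schedule(start), start=1):
        print(f"retry {number}: {moment.isoformat()}")
```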
### Webhook Requirements
**Your webhook endpoint should:**
- Accept HTTP POST requests
- Return HTTP 2xx status codes for successful processing
- Respond within 5 seconds (webhook timeout)
- Handle duplicate deliveries gracefully (use job UUID for idempotency)
**Example webhook handler (PHP):**
```php
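use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/webhooks/crawlshot', function (Request $request) {
    // The payload matches the job status endpoints (uuid, status, result, ...)
    $jobData = $request->all();
    $status = $jobData['status'] ?? null;

    if ($status === 'completed') {
        // Handle successful completion
        $result = $jobData['result'] ?? null;
    } elseif ($status === 'failed') {
        // Handle failure; default to null in case the error field is absent
        $error = $jobData['error'] ?? null;
    }

    // A 2xx response acknowledges the delivery
    return response('OK', 200);
});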
```
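Because failed deliveries are retried, the same status update can arrive more than once. One way to handle duplicates is to key each delivery on the job UUID plus its status, as in this framework-independent Python sketch. The class name and the in-memory set are illustrative assumptions; a real deployment would likely use a database or cache so the record survives restarts.

```python
class WebhookDeduplicator:
    """Process each (uuid, status) pair at most once."""

    def __init__(self):
        self._seen = set()  # in-memory only; lost when the process restarts

    def handle(self, payload, on_completed, on_failed):
        """Dispatch the payload; return False if it was a duplicate delivery."""
        key = (payload["uuid"], payload["status"])
        if key in self._seen:
            return False  # already processed, still safe to answer with 2xx
        self._seen.add(key)

        if payload["status"] == "completed":
            on_completed(payload.get("result"))
        elif payload["status"] == "failed":
            on_failed(payload.get("error"))
        return True
```

The web handler should return a 2xx response whether `handle` returns `True` or `False`, so that a duplicate delivery is not retried again.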
---
## Webhook Error Management
When webhooks fail, you can manage them through dedicated endpoints.
### GET `/api/webhook-errors`
List all jobs with failed webhook deliveries.
**Authentication:** Required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/webhook-errors" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "webhook_url": "https://myapp.com/webhook",
      "webhook_attempts": 6,
      "webhook_last_error": "Connection timeout",
      "webhook_next_retry_at": null,
      "created_at": "2025-08-10T10:00:42.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 1,
    "total_items": 1,
    "per_page": 20
  }
}
```
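A client can use this listing to find webhooks that will not be retried automatically and decide whether to trigger a manual retry. The Python sketch below works on an already-decoded response. It assumes that a `null` `webhook_next_retry_at` together with six recorded attempts means the automatic retries are exhausted, which the example response suggests but does not state; both function names are illustrative.

```python
def exhausted_webhook_jobs(listing, max_attempts=6):
    """Return jobs from a /api/webhook-errors response with no retry scheduled."""
    return [
        job
        for job in listing.get("jobs", [])
        if job.get("webhook_next_retry_at") is None
        and job.get("webhook_attempts", 0) >= max_attempts
    ]


def retry_path(uuid):
    """Path of the manual retry endpoint for one job."""
    return f"/api/webhook-errors/{uuid}/retry"
```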
### POST `/api/webhook-errors/{uuid}/retry`
Manually retry a failed webhook immediately.
**Authentication:** Required
#### Request Example
```bash
curl -X POST "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/retry" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "message": "Webhook retry attempted"
}
```
### DELETE `/api/webhook-errors/{uuid}/clear`
Clear webhook error status without retrying.
**Authentication:** Required
#### Request Example
```bash
curl -X DELETE "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/clear" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "message": "Webhook error cleared"
}
```
---
## Job Status Flow
Both crawl and screenshot jobs follow the same status progression:
@@ -464,9 +752,9 @@ ### Ad & Tracker Blocking
- **Custom Domain Blocking**: Blocks common advertising and tracking domains
### Image Processing
- **Multiple Formats**: Support for JPEG, PNG, and WebP
- **WebP Format**: High-quality WebP screenshots with adjustable compression
- **Quality Control**: Adjustable compression quality (1-100)
- **Imagick Integration**: High-quality image processing and format conversion
- **Efficient Processing**: Optimized WebP encoding for fast delivery
- **Responsive Sizing**: Custom viewport dimensions up to 4K resolution
### Storage & Cleanup