Update

2025-08-10 21:10:33 +08:00
parent 480bd9055d
commit 583a804073
43 changed files with 7623 additions and 270 deletions
--- a/API_DOCUMENTATION.md
+++ b/API_DOCUMENTATION.md
@@ -0,0 +1,489 @@
+# Crawlshot API Documentation
+
+Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. This API provides endpoints for capturing rendered HTML and generating screenshots, with optional blocking of ads, cookie banners, and trackers.
+
+## Base URL
+
+```
+https://crawlshot.test
+```
+
+## Authentication
+
+All API endpoints except the health check require authentication using Laravel Sanctum API tokens.
+
+### Authentication Header
+
+```http
+Authorization: Bearer {your-api-token}
+```
+
+### Example API Token
+```
+1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
+```
+
+---
+
+## Health Check
+
+### GET `/api/health`
+
+Check if the Crawlshot service is running and healthy.
+
+**Authentication:** Not required
+
+#### Request Example
+
+```bash
+curl -X GET "https://crawlshot.test/api/health" \
+  -H "Accept: application/json"
+```
+
+#### Response Example
+
+```json
+{
+  "status": "healthy",
+  "timestamp": "2025-08-10T09:54:52.195383Z",
+  "service": "crawlshot"
+}
+```
+
+---
+
+## Web Crawling APIs
+
+### POST `/api/crawl`
+
+Initiate a web crawling job to extract HTML content from a URL.
+
+**Authentication:** Required
+
+#### Request Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
+| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
+| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
+| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
+| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
+| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
+| `wait_until_network_idle` | boolean | ❌ | false | Wait for network activity to cease |
+
+#### Request Example
+
+```bash
+curl -X POST "https://crawlshot.test/api/crawl" \
+  -H "Accept: application/json" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
+  -d '{
+    "url": "https://example.com",
+    "timeout": 30,
+    "delay": 2000,
+    "block_ads": true,
+    "block_cookie_banners": true,
+    "block_trackers": true,
+    "wait_until_network_idle": true
+  }'
+```
+
+#### Response Example
+
+```json
+{
+  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
+  "status": "queued",
+  "message": "Crawl job initiated successfully"
+}
+```
+
+---
+
+### GET `/api/crawl/{uuid}`
+
+Check the status and retrieve results of a crawl job.
+
+**Authentication:** Required
+
+#### Path Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `uuid` | string | ✅ | Job UUID returned from crawl initiation |
+
+#### Request Example
+
+```bash
+curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
+  -H "Accept: application/json" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
+```
+
+#### Response Examples
+
+**Queued Status:**
+```json
+{
+  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
+  "status": "queued",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:00:42.000000Z"
+}
+```
+
+**Processing Status:**
+```json
+{
+  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
+  "status": "processing",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:00:42.000000Z",
+  "started_at": "2025-08-10T10:00:45.000000Z"
+}
+```
+
+**Completed Status:**
+```json
+{
+  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
+  "status": "completed",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:00:42.000000Z",
+  "started_at": "2025-08-10T10:00:45.000000Z",
+  "completed_at": "2025-08-10T10:01:12.000000Z",
+  "result": "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n</head>\n<body>\n    <h1>Example Domain</h1>\n    <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
+}
+```
+
+**Failed Status:**
+```json
+{
+  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
+  "status": "failed",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:00:42.000000Z",
+  "started_at": "2025-08-10T10:00:45.000000Z",
+  "completed_at": "2025-08-10T10:00:50.000000Z",
+  "error": "Timeout: Navigation failed after 30 seconds"
+}
+```
+
+---
+
+### GET `/api/crawl`
+
+List all crawl jobs with pagination (optional endpoint for debugging).
+
+**Authentication:** Required
+
+#### Request Example
+
+```bash
+curl -X GET "https://crawlshot.test/api/crawl" \
+  -H "Accept: application/json" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
+```
+
+#### Response Example
+
+```json
+{
+  "jobs": [
+    {
+      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
+      "type": "crawl",
+      "url": "https://example.com",
+      "status": "completed",
+      "created_at": "2025-08-10T10:00:42.000000Z",
+      "completed_at": "2025-08-10T10:01:12.000000Z"
+    }
+  ],
+  "pagination": {
+    "current_page": 1,
+    "total_pages": 5,
+    "total_items": 100,
+    "per_page": 20
+  }
+}
+```
+
+---
+
+## Screenshot APIs
+
+### POST `/api/shot`
+
+Initiate a screenshot job to capture an image of a webpage.
+
+**Authentication:** Required
+
+#### Request Parameters
+
+| Parameter | Type | Required | Default | Description |
+|-----------|------|----------|---------|-------------|
+| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
+| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
+| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
+| `format` | string | ❌ | "jpg" | Image format: "jpg", "png", "webp" |
+| `quality` | integer | ❌ | 90 | Image quality 1-100 (for JPEG/WebP) |
+| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
+| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
+| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
+| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
+| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
+
+#### Request Example
+
+```bash
+curl -X POST "https://crawlshot.test/api/shot" \
+  -H "Accept: application/json" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
+  -d '{
+    "url": "https://example.com",
+    "viewport_width": 1920,
+    "viewport_height": 1080,
+    "format": "webp",
+    "quality": 90,
+    "timeout": 30,
+    "delay": 2000,
+    "block_ads": true,
+    "block_cookie_banners": true,
+    "block_trackers": true
+  }'
+```
+
+#### Response Example
+
+```json
+{
+  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
+  "status": "queued",
+  "message": "Screenshot job initiated successfully"
+}
+```
+
+---
+
+### GET `/api/shot/{uuid}`
+
+Check the status and retrieve the results of a screenshot job. Once the job has completed, the response includes the base64-encoded image data and a download URL.
+
+**Authentication:** Required
+
+#### Path Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |
+
+#### Request Example
+
+```bash
+curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
+  -H "Accept: application/json" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
+```
+
+#### Response Examples
+
+**Queued Status:**
+```json
+{
+  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
+  "status": "queued",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:05:42.000000Z"
+}
+```
+
+**Processing Status:**
+```json
+{
+  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
+  "status": "processing",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:05:42.000000Z",
+  "started_at": "2025-08-10T10:05:45.000000Z"
+}
+```
+
+**Completed Status:**
+```json
+{
+  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
+  "status": "completed",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:05:42.000000Z",
+  "started_at": "2025-08-10T10:05:45.000000Z",
+  "completed_at": "2025-08-10T10:06:12.000000Z",
+  "result": {
+    "image_data": "UklGR...",
+    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
+    "mime_type": "image/webp",
+    "format": "webp",
+    "width": 1920,
+    "height": 1080,
+    "size": 45678
+  }
+}
+```
+
+**Failed Status:**
+```json
+{
+  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
+  "status": "failed",
+  "url": "https://example.com",
+  "created_at": "2025-08-10T10:05:42.000000Z",
+  "started_at": "2025-08-10T10:05:45.000000Z",
+  "completed_at": "2025-08-10T10:05:50.000000Z",
+  "error": "Timeout: Navigation failed after 30 seconds"
+}
+```
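+
+The `image_data` field in a completed response is base64-encoded, so it has to be decoded before it can be opened as an image. The following is a minimal sketch of one way to do that. The helper names `extension_for_format` and `decode_image_data` are hypothetical and not part of Crawlshot, and the usage comment assumes `jq` is installed and that the response was saved to a file named `response.json`.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: these helper names are assumptions, not part of Crawlshot.
+
+# Print a file extension for one of the documented `format` values.
+# Fails (non-zero exit) for any other value.
+extension_for_format() {
+  case "$1" in
+    jpg|png|webp) printf '%s\n' "$1" ;;
+    *) return 1 ;;
+  esac
+}
+
+# Read a base64 string on stdin and write the decoded bytes to the file in $1.
+decode_image_data() {
+  base64 --decode > "$1"
+}
+
+# Usage (assumes jq and a saved completed response in response.json):
+#   fmt="$(jq -r '.result.format' response.json)"
+#   jq -r '.result.image_data' response.json \
+#     | decode_image_data "screenshot.$(extension_for_format "$fmt")"
+```
+
+For large screenshots, the download endpoint avoids the decode step by returning the image bytes directly.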
+
+---
+
+### GET `/api/shot/{uuid}/download`
+
+Download the screenshot file directly. Returns the actual image file with appropriate headers.
+
+**Authentication:** Required
+
+#### Path Parameters
+
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `uuid` | string | ✅ | Job UUID of a completed screenshot job |
+
+#### Request Example
+
+```bash
+curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
+  --output screenshot.webp
+```
+
+#### Response
+
+Returns the image file directly with appropriate `Content-Type` headers:
+- `Content-Type: image/jpeg` for JPEG files
+- `Content-Type: image/png` for PNG files
+- `Content-Type: image/webp` for WebP files
+
+---
+
+### GET `/api/shot`
+
+List all screenshot jobs with pagination (optional endpoint for debugging).
+
+**Authentication:** Required
+
+#### Request Example
+
+```bash
+curl -X GET "https://crawlshot.test/api/shot" \
+  -H "Accept: application/json" \
+  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
+```
+
+#### Response Example
+
+```json
+{
+  "jobs": [
+    {
+      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
+      "type": "shot",
+      "url": "https://example.com",
+      "status": "completed",
+      "created_at": "2025-08-10T10:05:42.000000Z",
+      "completed_at": "2025-08-10T10:06:12.000000Z"
+    }
+  ],
+  "pagination": {
+    "current_page": 1,
+    "total_pages": 3,
+    "total_items": 50,
+    "per_page": 20
+  }
+}
+```
+
+---
+
+## Job Status Flow
+
+Both crawl and screenshot jobs use the same set of statuses. A job starts as `queued`, moves to `processing`, and ends as either `completed` or `failed`:
+
+1. **`queued`** - Job created and waiting for processing
+2. **`processing`** - Job is currently being executed by a worker
+3. **`completed`** - Job finished successfully, results available
+4. **`failed`** - Job encountered an error and could not complete
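+
+Because jobs run asynchronously, a client typically polls the status endpoint until the job reaches `completed` or `failed`. The following is a minimal polling sketch, not a definitive client. The function names, the `CRAWLSHOT_URL` and `CRAWLSHOT_TOKEN` environment variables, and the default interval and attempt count are all assumptions for illustration, and `jq` is assumed to be installed.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: CRAWLSHOT_URL, CRAWLSHOT_TOKEN, and the function names
+# are assumptions for illustration; jq is assumed to be installed.
+
+# Succeeds (exit 0) when the status is one the job will not leave again.
+is_terminal_status() {
+  case "$1" in
+    completed|failed) return 0 ;;
+    *) return 1 ;;
+  esac
+}
+
+# Poll GET /api/{type}/{uuid} until the job reaches a terminal status.
+# Usage: poll_job crawl|shot <uuid> [interval_seconds] [max_attempts]
+# Prints the final status; returns 1 on giving up, 2 if a request fails.
+poll_job() {
+  local type="$1" uuid="$2" interval="${3:-2}" max_attempts="${4:-60}"
+  local attempt body status
+  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
+    body="$(curl -fsS "${CRAWLSHOT_URL}/api/${type}/${uuid}" \
+      -H "Accept: application/json" \
+      -H "Authorization: Bearer ${CRAWLSHOT_TOKEN}")" || return 2
+    status="$(jq -r '.status' <<<"$body")"
+    if is_terminal_status "$status"; then
+      printf '%s\n' "$status"
+      return 0
+    fi
+    sleep "$interval"
+  done
+  return 1  # still queued or processing after max_attempts
+}
+```
+
+For example, `poll_job shot fe37d511-99cb-4295-853b-6d484900a851 2 30` would check the screenshot job every 2 seconds for up to 30 attempts.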
+
+## Error Responses
+
+### 401 Unauthorized
+```json
+{
+  "message": "Unauthenticated."
+}
+```
+
+### 404 Not Found
+```json
+{
+  "error": "Job not found"
+}
+```
+
+### 422 Validation Error
+```json
+{
+  "message": "The given data was invalid.",
+  "errors": {
+    "url": [
+      "The url field is required."
+    ],
+    "timeout": [
+      "The timeout must be between 5 and 300."
+    ]
+  }
+}
+```
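+
+A client can avoid some 422 responses by checking values against the documented ranges before sending a request. The sketch below covers only `timeout` (5-300 seconds) and `delay` (0-30000 milliseconds). `in_range` and `validate_crawl_params` are hypothetical helpers, and the server remains the authority on what is valid.
+
+```bash
+#!/usr/bin/env bash
+# Sketch only: in_range and validate_crawl_params are hypothetical helpers.
+
+# Succeeds when $1 is a non-negative integer within [$2, $3].
+# The 10# prefix forces base 10 so a value like "08" is not read as octal.
+in_range() {
+  [[ "$1" =~ ^[0-9]+$ ]] && (( 10#$1 >= $2 && 10#$1 <= $3 ))
+}
+
+# Check `timeout` (seconds) and `delay` (milliseconds) against the
+# documented ranges. Prints one message per violation to stderr and
+# returns non-zero if any check fails.
+validate_crawl_params() {
+  local timeout="$1" delay="$2" failed=0
+  in_range "$timeout" 5 300 || { echo "timeout must be between 5 and 300" >&2; failed=1; }
+  in_range "$delay" 0 30000 || { echo "delay must be between 0 and 30000" >&2; failed=1; }
+  return "$failed"
+}
+```
+
+Calling `validate_crawl_params 30 2000` succeeds, while `validate_crawl_params 3 2000` fails because the timeout is below the minimum.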
+
+## Features
+
+### Ad & Tracker Blocking
+- **EasyList Integration**: Automatically downloads and applies EasyList filters
+- **Cookie Banner Blocking**: Removes cookie consent prompts
+- **Tracker Blocking**: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
+- **Custom Domain Blocking**: Blocks common advertising and tracking domains
+
+### Image Processing
+- **Multiple Formats**: Support for JPEG, PNG, and WebP
+- **Quality Control**: Adjustable compression quality (1-100)
+- **Imagick Integration**: High-quality image processing and format conversion
+- **Responsive Sizing**: Custom viewport dimensions from 320×240 up to 3840×2160 (4K)
+
+### Storage & Cleanup
+- **24-Hour TTL**: All stored files are automatically deleted after 24 hours
+- **Scheduled Cleanup**: A daily scheduled task removes expired files
+- **Manual Cleanup**: `php artisan crawlshot:prune-storage` command available
+
+### Performance
+- **Background Processing**: All jobs processed asynchronously via Laravel Horizon
+- **Queue Management**: Built-in retry logic and failure handling
+- **Caching**: EasyList filters cached for optimal performance
+- **Monitoring**: Horizon dashboard for real-time job monitoring at `/horizon`
+
+## Rate Limiting
+
+API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.
+
+## Support
+
+For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.