ct
2025-08-10 21:10:33 +08:00
parent 480bd9055d
commit 583a804073
43 changed files with 7623 additions and 270 deletions

API_DOCUMENTATION.md Normal file

@@ -0,0 +1,489 @@
# Crawlshot API Documentation
Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. Its API exposes endpoints for extracting HTML content from a URL and for capturing screenshots, with optional blocking of ads, cookie banners, and trackers.
## Base URL
```
https://crawlshot.test
```
## Authentication
All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.
### Authentication Header
```http
Authorization: Bearer {your-api-token}
```
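For clients that are not using curl, the same header can be attached programmatically. The following is a minimal sketch using only the Python standard library; `build_request` and `BASE_URL` are illustrative names, not part of Crawlshot.

```python
import urllib.request

# Base URL taken from this document; adjust for your own deployment.
BASE_URL = "https://crawlshot.test"


def build_request(path: str, token: str) -> urllib.request.Request:
    """Return a GET request for `path` that carries the bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={
            "Accept": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; building it separately keeps the token handling in one place.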
### Example API Token
```
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
```
---
## Health Check
### GET `/api/health`
Check if the Crawlshot service is running and healthy.
**Authentication:** Not required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/health" \
-H "Accept: application/json"
```
#### Response Example
```json
{
"status": "healthy",
"timestamp": "2025-08-10T09:54:52.195383Z",
"service": "crawlshot"
}
```
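A monitoring script might interpret this response as follows. This is a sketch that assumes the body has the shape shown above; `is_healthy` is a hypothetical helper name.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only if the body is JSON with "status" equal to "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body (for example an HTML error page) counts as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"
```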
---
## Web Crawling APIs
### POST `/api/crawl`
Initiate a web crawling job to extract HTML content from a URL.
**Authentication:** Required
#### Request Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `wait_until_network_idle` | boolean | ❌ | false | Wait for network activity to cease |
#### Request Example
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true,
"wait_until_network_idle": true
}'
```
#### Response Example
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
```
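The curl call above could be reproduced from Python roughly like this. It is a sketch using only the standard library; `build_crawl_request` and `submit` are illustrative names, and the only assumption about the server is the request and response shape documented in this section.

```python
import json
import urllib.request


def build_crawl_request(base_url: str, token: str, url: str, **options) -> urllib.request.Request:
    """Build the POST /api/crawl request; `options` holds optional parameters."""
    payload = {"url": url, **options}
    return urllib.request.Request(
        f"{base_url}/api/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The `uuid` in the returned dictionary is what the status endpoint expects. Parameters left out of `options` fall back to the defaults in the table above.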
---
### GET `/api/crawl/{uuid}`
Check the status and retrieve results of a crawl job.
**Authentication:** Required
#### Path Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from crawl initiation |
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Examples
**Queued Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z"
}
```
**Processing Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z"
}
```
**Completed Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z",
"result": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n</head>\n<body>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}
```
**Failed Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:00:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
```
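Because jobs are processed asynchronously, a client typically polls this endpoint until the status is `completed` or `failed`. The sketch below shows one way to do that; the interval and attempt limit are arbitrary choices, and `fetch_status` is a hypothetical callable that performs the GET request and returns the decoded JSON.

```python
import time
from typing import Callable

# Statuses after which the job no longer changes, per the examples above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch_status` until the job reaches a terminal status."""
    for attempt in range(max_attempts):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Do not sleep after the final attempt; fail straight away instead.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"job still not finished after {max_attempts} attempts")
```

The caller still has to inspect the returned `status`: a `failed` job carries an `error` string rather than a `result`.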
---
### GET `/api/crawl`
List all crawl jobs with pagination (optional endpoint for debugging).
**Authentication:** Required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:00:42.000000Z",
      "completed_at": "2025-08-10T10:01:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 100,
    "per_page": 20
  }
}
```
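The `pagination` block can drive a loop over all pages. This document does not say how a specific page is requested, so the sketch below leaves that to a hypothetical `fetch_page(page_number)` callable and relies only on the `jobs` and `pagination.total_pages` fields shown above.

```python
from typing import Callable, Iterator


def iter_jobs(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every job, requesting pages until `total_pages` is reached."""
    page = 1
    while True:
        listing = fetch_page(page)
        yield from listing.get("jobs", [])
        pagination = listing.get("pagination", {})
        # If `total_pages` is missing, stop after the current page.
        if page >= pagination.get("total_pages", page):
            break
        page += 1
```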
---
## Screenshot APIs
### POST `/api/shot`
Initiate a screenshot job to capture an image of a webpage.
**Authentication:** Required
#### Request Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `format` | string | ❌ | "jpg" | Image format: "jpg", "png", "webp" |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (for JPEG/WebP) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
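Checking parameters on the client before sending them avoids a round trip that would end in a 422 response. The sketch below mirrors the ranges in the table above; it is not the server's validation logic, and `validate_shot_params` is an illustrative name.

```python
# Inclusive integer ranges copied from the parameter table.
SHOT_RANGES = {
    "viewport_width": (320, 3840),
    "viewport_height": (240, 2160),
    "quality": (1, 100),
    "timeout": (5, 300),
    "delay": (0, 30000),
}
SHOT_FORMATS = {"jpg", "png", "webp"}


def validate_shot_params(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the parameters look valid."""
    errors = []
    url = params.get("url")
    if not isinstance(url, str) or not url:
        errors.append("url is required")
    elif len(url) > 2048:
        errors.append("url must be at most 2048 characters")
    if "format" in params and params["format"] not in SHOT_FORMATS:
        errors.append("format must be one of jpg, png, webp")
    for name, (low, high) in SHOT_RANGES.items():
        if name not in params:
            continue
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            errors.append(f"{name} must be an integer between {low} and {high}")
    return errors
```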
#### Request Example
```bash
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"viewport_width": 1920,
"viewport_height": 1080,
"format": "webp",
"quality": 90,
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
```
#### Response Example
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"message": "Screenshot job initiated successfully"
}
```
---
### GET `/api/shot/{uuid}`
Check the status and retrieve the results of a screenshot job. Once the job is completed, the response includes the base64-encoded image data and a download URL.
**Authentication:** Required
#### Path Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Examples
**Queued Status:**
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z"
}
```
**Processing Status:**
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z"
}
```
**Completed Status:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}
```
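The `image_data` field has to be base64-decoded before it can be written to disk. A minimal sketch, assuming the completed response has the shape shown above; `decode_screenshot` is a hypothetical helper, and naming the file after the job UUID is only a convention chosen here.

```python
import base64


def decode_screenshot(job: dict) -> tuple[str, bytes]:
    """Return a suggested filename and the raw image bytes of a completed job."""
    if job.get("status") != "completed":
        raise ValueError(f"job is not completed (status: {job.get('status')!r})")
    result = job["result"]
    # validate=True rejects input that is not well-formed base64.
    image_bytes = base64.b64decode(result["image_data"], validate=True)
    filename = f"{job['uuid']}.{result['format']}"
    return filename, image_bytes
```

The bytes can then be saved with `pathlib.Path(filename).write_bytes(image_bytes)`. The truncated `image_data` value in the example above is not valid base64 and would be rejected, which is the intended behaviour for incomplete data.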
**Failed Status:**
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:05:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
```
---
### GET `/api/shot/{uuid}/download`
Download the screenshot file directly. Returns the actual image file with appropriate headers.
**Authentication:** Required
#### Path Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID of a completed screenshot job |
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
--output screenshot.webp
```
#### Response
Returns the image file directly with appropriate `Content-Type` headers:
- `Content-Type: image/jpeg` for JPEG files
- `Content-Type: image/png` for PNG files
- `Content-Type: image/webp` for WebP files
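A client that saves the download can derive the file extension from the `Content-Type` header instead of hard-coding it. The mapping below covers only the three types listed above; `extension_for` is an illustrative name.

```python
# Media types listed in this section, mapped to the matching `format` value.
EXTENSIONS = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/webp": "webp",
}


def extension_for(content_type: str) -> str:
    """Map a Content-Type header value to a file extension."""
    # Drop any parameters such as "; charset=..." and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return EXTENSIONS[media_type]
    except KeyError:
        raise ValueError(f"unexpected Content-Type: {content_type!r}") from None
```

Raising on an unknown type guards against saving a JSON error body, such as a 401 or 404 response, under an image filename.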
---
### GET `/api/shot`
List all screenshot jobs with pagination (optional endpoint for debugging).
**Authentication:** Required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "jobs": [
    {
      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
      "type": "shot",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:05:42.000000Z",
      "completed_at": "2025-08-10T10:06:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 3,
    "total_items": 50,
    "per_page": 20
  }
}
```
---
## Job Status Flow
Both crawl and screenshot jobs use the same four statuses. A job starts as `queued`, moves to `processing`, and then ends as either `completed` or `failed`:
1. **`queued`** - Job created and waiting for processing
2. **`processing`** - Job is currently being executed by a worker
3. **`completed`** - Job finished successfully, results available
4. **`failed`** - Job encountered an error and could not complete
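The statuses above can be modelled as a small state machine on the client. The transition table in this sketch is an assumption inferred from the response examples in this document (for instance, it does not model a failed job being retried), so treat it as illustrative only.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions; the server's actual rules are not documented here.
NEXT_STATUSES = {
    JobStatus.QUEUED: {JobStatus.PROCESSING},
    JobStatus.PROCESSING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def is_terminal(status: str) -> bool:
    """Return True if no further status change is expected."""
    return not NEXT_STATUSES[JobStatus(status)]
```

`JobStatus(status)` raises `ValueError` for a value outside the four listed statuses, which surfaces unexpected server responses early.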
## Error Responses
### 401 Unauthorized
```json
{
"message": "Unauthenticated."
}
```
### 404 Not Found
```json
{
"error": "Job not found"
}
```
### 422 Validation Error
```json
{
  "message": "The given data was invalid.",
  "errors": {
    "url": [
      "The url field is required."
    ],
    "timeout": [
      "The timeout must be between 5 and 300."
    ]
  }
}
```
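The `errors` object maps each field name to a list of messages. A client can flatten it into one line per problem for logging or display. This sketch assumes the 422 body has the shape shown above; `flatten_validation_errors` is a hypothetical helper.

```python
def flatten_validation_errors(body: dict) -> list[str]:
    """Turn the `errors` mapping of a 422 body into "field: message" lines."""
    lines = []
    for field, messages in body.get("errors", {}).items():
        for message in messages:
            lines.append(f"{field}: {message}")
    return lines
```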
## Features
### Ad & Tracker Blocking
- **EasyList Integration**: Automatically downloads and applies EasyList filters
- **Cookie Banner Blocking**: Removes cookie consent prompts
- **Tracker Blocking**: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
- **Custom Domain Blocking**: Blocks common advertising and tracking domains
### Image Processing
- **Multiple Formats**: Support for JPEG, PNG, and WebP
- **Quality Control**: Adjustable compression quality (1-100)
- **Imagick Integration**: High-quality image processing and format conversion
- **Responsive Sizing**: Custom viewport dimensions up to 4K resolution
### Storage & Cleanup
- **24-Hour TTL**: All files automatically deleted after 24 hours
- **Scheduled Cleanup**: Daily automated cleanup of expired files
- **Manual Cleanup**: `php artisan crawlshot:prune-storage` command available
### Performance
- **Background Processing**: All jobs processed asynchronously via Laravel Horizon
- **Queue Management**: Built-in retry logic and failure handling
- **Caching**: EasyList filters cached for optimal performance
- **Monitoring**: Horizon dashboard for real-time job monitoring at `/horizon`
## Rate Limiting
API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.
## Support
For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.