# Crawlshot API Documentation

Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. This API provides endpoints for capturing a page's HTML and generating screenshots, with optional blocking of ads, cookie banners, and trackers.

## Base URL

```
https://crawlshot.test
```

## Authentication

All API endpoints except the health check require authentication with a Laravel Sanctum API token.

### Authentication Header

```http
Authorization: Bearer {your-api-token}
```

### Example API Token

The request examples in this document use the following sample token:

```
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
```
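
As a minimal sketch of the header described above, the following Python builds an authenticated request object using only the standard library. The helper name `build_request` and the constant `CRAWLSHOT_BASE_URL` are illustrative and not part of Crawlshot. Nothing is sent until the returned object is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Assumption: base URL copied from the "Base URL" section of this document.
CRAWLSHOT_BASE_URL = "https://crawlshot.test"


def build_request(path, token, payload=None):
    """Build (but do not send) an authenticated request for an API path."""
    headers = {
        "Accept": "application/json",
        "Authorization": f"Bearer {token}",
    }
    data = None
    if payload is not None:
        # A JSON body is only attached to POST requests in this sketch.
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(
        CRAWLSHOT_BASE_URL + path, data=data, headers=headers, method=method
    )
```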
---

## Health Check

### GET `/api/health`

Check if the Crawlshot service is running and healthy.

**Authentication:** Not required

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/health" \
  -H "Accept: application/json"
```

#### Response Example

```json
{
  "status": "healthy",
  "timestamp": "2025-08-10T09:54:52.195383Z",
  "service": "crawlshot"
}
```
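
A monitoring script might interpret the health response as sketched below. The function name `is_healthy` is hypothetical. The field names come from the response example above, and treating any other `status` value as unhealthy is an assumption.

```python
import json


def is_healthy(body):
    """Return True when a health-check response body reports a healthy service."""
    data = json.loads(body)
    # Field names follow the response example; other status values are
    # treated as unhealthy (assumption).
    return data.get("status") == "healthy" and data.get("service") == "crawlshot"
```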
---

## Web Crawling APIs

### POST `/api/crawl`

Initiate a web crawling job to extract HTML content from a URL.

**Authentication:** Required

#### Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `wait_until_network_idle` | boolean | ❌ | false | Wait for network activity to cease |

#### Request Example

```bash
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true,
    "wait_until_network_idle": true
  }'
```

#### Response Example

```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "message": "Crawl job initiated successfully"
}
```
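
A client can check the limits from the parameter table before sending the request, which avoids a round trip that would end in a validation error. The sketch below mirrors the documented defaults and ranges. The function name `build_crawl_payload` and its error messages are illustrative, and the server remains the authority on validation.

```python
def build_crawl_payload(url, timeout=30, delay=0, block_ads=True,
                        block_cookie_banners=True, block_trackers=True,
                        wait_until_network_idle=False):
    """Validate arguments against the documented limits and return the JSON body."""
    if not url or len(url) > 2048:
        raise ValueError("url is required and limited to 2048 characters")
    if not 5 <= timeout <= 300:
        raise ValueError("timeout must be between 5 and 300 seconds")
    if not 0 <= delay <= 30000:
        raise ValueError("delay must be between 0 and 30000 milliseconds")
    return {
        "url": url,
        "timeout": timeout,
        "delay": delay,
        "block_ads": block_ads,
        "block_cookie_banners": block_cookie_banners,
        "block_trackers": block_trackers,
        "wait_until_network_idle": wait_until_network_idle,
    }
```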
---

### GET `/api/crawl/{uuid}`

Check the status and retrieve results of a crawl job.

**Authentication:** Required

#### Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from crawl initiation |

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Examples

**Queued Status:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z"
}
```

**Processing Status:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z"
}
```

**Completed Status:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n</head>\n<body>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}
```

**Failed Status:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:00:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
```
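
Because jobs run asynchronously, a client usually polls this endpoint until the status is `completed` or `failed`. The loop below is one possible sketch. `wait_for_job`, the polling interval, and the attempt limit are all assumptions, and `fetch_status` is a caller-supplied function that performs the GET request and returns the decoded JSON as a dict.

```python
import time


def wait_for_job(fetch_status, uuid, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the job reaches `completed` or `failed`, then return its body.

    `fetch_status` is a hypothetical callable that requests
    GET /api/crawl/{uuid} and returns the decoded JSON response.
    """
    for _ in range(max_attempts):
        job = fetch_status(uuid)
        if job["status"] in ("completed", "failed"):
            return job
        # Still queued or processing: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"job {uuid} did not finish after {max_attempts} polls")
```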
---

### GET `/api/crawl`

List all crawl jobs with pagination (optional endpoint for debugging).

**Authentication:** Required

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/crawl" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Example

```json
{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:00:42.000000Z",
      "completed_at": "2025-08-10T10:01:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 100,
    "per_page": 20
  }
}
```
---

## Screenshot APIs

### POST `/api/shot`

Initiate a screenshot job to capture an image of a webpage.

**Authentication:** Required

#### Request Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `format` | string | ❌ | "jpg" | Image format: "jpg", "png", "webp" |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (for JPEG/WebP) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |

#### Request Example

```bash
curl -X POST "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  -d '{
    "url": "https://example.com",
    "viewport_width": 1920,
    "viewport_height": 1080,
    "format": "webp",
    "quality": 90,
    "timeout": 30,
    "delay": 2000,
    "block_ads": true,
    "block_cookie_banners": true,
    "block_trackers": true
  }'
```

#### Response Example

```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "message": "Screenshot job initiated successfully"
}
```
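
The screenshot-specific limits from the table above can be checked on the client in the same way. In this sketch, `build_shot_payload` and `ALLOWED_FORMATS` are illustrative names. Leaving out `quality` for PNG is an assumption based on the table describing it as applying to JPEG and WebP only.

```python
# Formats listed in the parameter table above.
ALLOWED_FORMATS = ("jpg", "png", "webp")


def build_shot_payload(url, viewport_width=1920, viewport_height=1080,
                       image_format="jpg", quality=90):
    """Validate screenshot options against the documented limits."""
    if not url or len(url) > 2048:
        raise ValueError("url is required and limited to 2048 characters")
    if not 320 <= viewport_width <= 3840:
        raise ValueError("viewport_width must be between 320 and 3840")
    if not 240 <= viewport_height <= 2160:
        raise ValueError("viewport_height must be between 240 and 2160")
    if image_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    payload = {
        "url": url,
        "viewport_width": viewport_width,
        "viewport_height": viewport_height,
        "format": image_format,
    }
    # quality is optional and documented for JPEG/WebP, so it is omitted for PNG.
    if image_format != "png":
        payload["quality"] = quality
    return payload
```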
---

### GET `/api/shot/{uuid}`

Check the status and retrieve results of a screenshot job. Once the job has completed, the response includes the base64-encoded image data and a download URL.

**Authentication:** Required

#### Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Examples

**Queued Status:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "queued",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z"
}
```

**Processing Status:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "processing",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z"
}
```

**Completed Status:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image_data": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAAyCAYAAACXpx/Y...",
    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}
```

**Failed Status:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "failed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:05:50.000000Z",
  "error": "Timeout: Navigation failed after 30 seconds"
}
```
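
When the image is needed in memory rather than as a separate download, the `image_data` field of a completed response can be decoded directly. The sketch below assumes the field holds standard base64, as the completed example suggests. The function name `decode_screenshot` and the `<uuid>.<format>` file name are illustrative choices. The returned bytes can be written to disk with `pathlib.Path(filename).write_bytes(image_bytes)`.

```python
import base64


def decode_screenshot(job):
    """Return (filename, image_bytes) for a completed screenshot job response."""
    if job.get("status") != "completed":
        raise ValueError(f"job is {job.get('status')!r}, not completed")
    result = job["result"]
    image_bytes = base64.b64decode(result["image_data"])
    # Suggested file name: <uuid>.<format>, e.g. "fe37d511-....webp".
    filename = f"{job['uuid']}.{result['format']}"
    return filename, image_bytes
```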
---

### GET `/api/shot/{uuid}/download`

Download the screenshot file directly.

**Authentication:** Required

#### Path Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID of a completed screenshot job |

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
  --output screenshot.webp
```

#### Response

The response body is the image file itself, with a `Content-Type` header that matches its format:
- `Content-Type: image/jpeg` for JPEG files
- `Content-Type: image/png` for PNG files
- `Content-Type: image/webp` for WebP files

---

### GET `/api/shot`

List all screenshot jobs with pagination (optional endpoint for debugging).

**Authentication:** Required

#### Request Example

```bash
curl -X GET "https://crawlshot.test/api/shot" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```

#### Response Example

```json
{
  "jobs": [
    {
      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
      "type": "shot",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:05:42.000000Z",
      "completed_at": "2025-08-10T10:06:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 3,
    "total_items": 50,
    "per_page": 20
  }
}
```
---

## Job Status Flow

Both crawl and screenshot jobs follow the same status progression:

1. **`queued`** - Job created and waiting for processing
2. **`processing`** - Job is currently being executed by a worker
3. **`completed`** - Job finished successfully, results available
4. **`failed`** - Job encountered an error and could not complete
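
Client code can model the four statuses above as an enumeration so that misspelled status strings fail early. This is a sketch: the class name `JobStatus` is hypothetical, and treating `completed` and `failed` as final states is an assumption drawn from the list above.

```python
from enum import Enum


class JobStatus(str, Enum):
    """Status values reported by the crawl and screenshot endpoints."""

    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

    @property
    def is_terminal(self):
        """True for statuses assumed to be final (completed or failed)."""
        return self in (JobStatus.COMPLETED, JobStatus.FAILED)
```

For example, `JobStatus(response["status"]).is_terminal` tells a polling loop when to stop, and an unknown status string raises `ValueError`.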

## Error Responses

### 401 Unauthorized
```json
{
  "message": "Unauthenticated."
}
```

### 404 Not Found
```json
{
  "error": "Job not found"
}
```

### 422 Validation Error
```json
{
  "message": "The given data was invalid.",
  "errors": {
    "url": [
      "The url field is required."
    ],
    "timeout": [
      "The timeout must be between 5 and 300."
    ]
  }
}
```
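
The 422 body groups messages by field, so a client that wants to log or display them typically flattens the `errors` object. A minimal sketch follows. The function name `flatten_validation_errors` is illustrative, and the body shape is taken from the example above.

```python
def flatten_validation_errors(body):
    """Turn a 422 response body into a flat list of 'field: message' strings."""
    messages = []
    # "errors" maps each field name to a list of messages, as in the example.
    for field, problems in body.get("errors", {}).items():
        for problem in problems:
            messages.append(f"{field}: {problem}")
    return messages
```

A body without an `errors` key, such as the 401 or 404 examples, yields an empty list.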

## Features

### Ad & Tracker Blocking
- **EasyList Integration**: Automatically downloads and applies EasyList filters
- **Cookie Banner Blocking**: Removes cookie consent prompts
- **Tracker Blocking**: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
- **Custom Domain Blocking**: Blocks common advertising and tracking domains

### Image Processing
- **Multiple Formats**: Support for JPEG, PNG, and WebP
- **Quality Control**: Adjustable compression quality (1-100)
- **Imagick Integration**: High-quality image processing and format conversion
- **Responsive Sizing**: Custom viewport dimensions up to 4K resolution (3840×2160)

### Storage & Cleanup
- **24-Hour TTL**: All files are automatically deleted after 24 hours
- **Scheduled Cleanup**: Daily automated cleanup of expired files
- **Manual Cleanup**: `php artisan crawlshot:prune-storage` command available
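
Because results are removed after 24 hours, a client that stores job references may want to estimate when a result stops being retrievable. The sketch below assumes the TTL is counted from `completed_at`, which this document does not state. Since cleanup runs on a daily schedule, the actual deletion time may be later. `FILE_TTL` and `estimated_expiry` are hypothetical names.

```python
from datetime import datetime, timedelta

# 24-hour retention period from the "Storage & Cleanup" list above.
FILE_TTL = timedelta(hours=24)


def estimated_expiry(completed_at):
    """Estimate when a job's files expire, assuming the TTL starts at completion."""
    # API timestamps end in "Z"; older Python versions of fromisoformat()
    # need an explicit UTC offset instead.
    finished = datetime.fromisoformat(completed_at.replace("Z", "+00:00"))
    return finished + FILE_TTL
```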

### Performance
- **Background Processing**: All jobs processed asynchronously via Laravel Horizon
- **Queue Management**: Built-in retry logic and failure handling
- **Caching**: EasyList filters cached for optimal performance
- **Monitoring**: Horizon dashboard for real-time job monitoring at `/horizon`

## Rate Limiting

API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.

## Support

For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.