# Crawlshot API Documentation
Crawlshot is a self-hosted web crawling and screenshot service built with Laravel and Spatie Browsershot. The API provides endpoints for extracting web content and generating screenshots, with ad, tracker, and cookie-banner filtering, webhook notifications, and progressive webhook retries.
## Overview
**Core Capabilities:**
- **HTML Crawling**: Extract clean HTML content from web pages with ad/tracker blocking
- **Screenshot Capture**: Generate high-quality WebP screenshots with adjustable quality settings
- **Webhook Notifications**: Real-time status updates with event filtering and progressive retry
- **Background Processing**: Asynchronous job processing via Laravel Horizon
- **Smart Filtering**: EasyList integration for ad/tracker/cookie banner blocking
- **Auto-cleanup**: 24-hour file retention with automated cleanup

**Perfect for:**
- Content extraction and monitoring
- Website screenshot automation
- Quality assurance and testing
- Social media preview generation
- Compliance and archival systems
## Base URL
```
https://crawlshot.test
```
Replace `crawlshot.test` with your actual Crawlshot service URL.
## Quick Start
### 1. Authentication
All API endpoints (except health check) require authentication using Laravel Sanctum API tokens.
**Authentication Header:**
```http
Authorization: Bearer {your-api-token}
```
**Example API Token:**
```
1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c
```
### 2. Your First API Call
**Simple HTML Crawl:**
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
```
**Response:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
```
### 3. Check Job Status
```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
"https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb"
```
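The same flow can be scripted. Below is a minimal polling sketch using Laravel's HTTP client; the base URL, token, and the 2-second interval are placeholders to adapt, and webhooks (described later) remove the need to poll at all:
```php
use Illuminate\Support\Facades\Http;

$base  = 'https://crawlshot.test'; // your Crawlshot base URL
$token = 'YOUR_TOKEN';             // your Sanctum API token

// Step 2: queue the crawl job
$uuid = Http::withToken($token)
    ->post("{$base}/api/crawl", ['url' => 'https://example.com'])
    ->json('uuid');

// Step 3: poll the status endpoint until the job reaches a terminal state
do {
    sleep(2); // arbitrary polling interval
    $job = Http::withToken($token)->get("{$base}/api/crawl/{$uuid}")->json();
} while (in_array($job['status'], ['queued', 'processing'], true));

// Completed jobs carry the HTML in `result`; failed jobs carry `error`
echo $job['status'] === 'completed' ? $job['result'] : $job['error'];
```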
---
## Health Check
### GET `/api/health`
Check if the Crawlshot service is running and healthy.
**Authentication:** Not required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/health" \
-H "Accept: application/json"
```
#### Response Example
```json
{
"status": "healthy",
"timestamp": "2025-08-10T09:54:52.195383Z",
"service": "crawlshot"
}
```
---
## Web Crawling APIs
### POST `/api/crawl`
Initiate a web crawling job to extract HTML content from a URL.
**Authentication:** Required
#### Request Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to crawl (max 2048 chars) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
#### Request Examples
**Basic Crawl:**
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
```
**With Webhook Notifications:**
```bash
curl -X POST "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"webhook_url": "https://myapp.com/webhooks/crawlshot",
"webhook_events_filter": ["completed", "failed"],
"block_ads": true,
"timeout": 60
}'
```
#### Response Example
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"message": "Crawl job initiated successfully"
}
```
---
### GET `/api/crawl/{uuid}`
Check the status and retrieve results of a crawl job.
**Authentication:** Required
#### Path Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from crawl initiation |
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Examples
**Queued Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z"
}
```
**Processing Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z"
}
```
**Completed Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "completed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:01:12.000000Z",
"result": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n</head>\n<body>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples...</p>\n</body>\n</html>"
}
```
**Failed Status:**
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:00:42.000000Z",
"started_at": "2025-08-10T10:00:45.000000Z",
"completed_at": "2025-08-10T10:00:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
```
---
### GET `/api/crawl`
List all crawl jobs with pagination (intended mainly for debugging).
**Authentication:** Required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/crawl" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:00:42.000000Z",
      "completed_at": "2025-08-10T10:01:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 5,
    "total_items": 100,
    "per_page": 20
  }
}
```
---
## Screenshot APIs
### POST `/api/shot`
Initiate a screenshot job to capture an image of a webpage.
**Authentication:** Required
#### Request Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | Target URL to screenshot (max 2048 chars) |
| `viewport_width` | integer | ❌ | 1920 | Viewport width in pixels (320-3840) |
| `viewport_height` | integer | ❌ | 1080 | Viewport height in pixels (240-2160) |
| `quality` | integer | ❌ | 90 | Image quality 1-100 (always WebP format) |
| `timeout` | integer | ❌ | 30 | Request timeout in seconds (5-300) |
| `delay` | integer | ❌ | 0 | Wait time before capture in milliseconds (0-30000) |
| `block_ads` | boolean | ❌ | true | Block ads using EasyList filters |
| `block_cookie_banners` | boolean | ❌ | true | Block cookie consent banners |
| `block_trackers` | boolean | ❌ | true | Block tracking scripts |
| `webhook_url` | string | ❌ | null | URL to receive job status webhooks (max 2048 chars) |
| `webhook_events_filter` | array | ❌ | `["queued","processing","completed","failed"]` | Which job statuses trigger webhooks. Empty array `[]` disables webhooks |
#### Request Examples
**Basic Screenshot:**
```bash
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"viewport_width": 1920,
"viewport_height": 1080,
"quality": 90,
"timeout": 30,
"delay": 2000,
"block_ads": true,
"block_cookie_banners": true,
"block_trackers": true
}'
```
**With Webhook Notifications:**
```bash
curl -X POST "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
-d '{
"url": "https://example.com",
"webhook_url": "https://myapp.com/webhooks/crawlshot",
"webhook_events_filter": ["completed"],
"viewport_width": 1200,
"viewport_height": 800,
"quality": 85
}'
```
#### Response Example
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"message": "Screenshot job initiated successfully"
}
```
---
### GET `/api/shot/{uuid}`
Check the status and retrieve results of a screenshot job. When the job has completed, the response includes base64-encoded image data and a download URL.
**Authentication:** Required
#### Path Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID returned from screenshot initiation |
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Examples
**Queued Status:**
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "queued",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z"
}
```
**Processing Status:**
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "processing",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z"
}
```
**Completed Status:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image_data": "UklGRh4AAABXRUJQVlA4...",
    "download_url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download",
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}
```
**Failed Status:**
```json
{
"uuid": "fe37d511-99cb-4295-853b-6d484900a851",
"status": "failed",
"url": "https://example.com",
"created_at": "2025-08-10T10:05:42.000000Z",
"started_at": "2025-08-10T10:05:45.000000Z",
"completed_at": "2025-08-10T10:05:50.000000Z",
"error": "Timeout: Navigation failed after 30 seconds"
}
```
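For a completed job you can decode the base64 `image_data` yourself, or fetch the binary from `download_url` (see the download endpoint below). A minimal sketch with Laravel's HTTP client, where `$token` and `$uuid` are placeholders:
```php
use Illuminate\Support\Facades\Http;

$job = Http::withToken($token)
    ->get("https://crawlshot.test/api/shot/{$uuid}")
    ->json();

if ($job['status'] === 'completed') {
    // Decode the base64 payload and save it using the reported format (webp)
    file_put_contents(
        "screenshot.{$job['result']['format']}",
        base64_decode($job['result']['image_data'])
    );
}
```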
---
### GET `/api/shot/{uuid}/download`
Download the screenshot file directly. Returns the actual image file with appropriate headers.
**Authentication:** Required
#### Path Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `uuid` | string | ✅ | Job UUID of a completed screenshot job |
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851/download" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c" \
--output screenshot.webp
```
#### Response
Returns the WebP image file directly with appropriate headers:
- `Content-Type: image/webp`
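The download can also be scripted; for example, with Laravel's HTTP client (`$token` and `$uuid` are placeholders):
```php
use Illuminate\Support\Facades\Http;

// Fetch the binary WebP and write it straight to disk
$response = Http::withToken($token)
    ->get("https://crawlshot.test/api/shot/{$uuid}/download");

file_put_contents('screenshot.webp', $response->body());
```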
---
### GET `/api/shot`
List all screenshot jobs with pagination (intended mainly for debugging).
**Authentication:** Required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/shot" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "jobs": [
    {
      "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
      "type": "shot",
      "url": "https://example.com",
      "status": "completed",
      "created_at": "2025-08-10T10:05:42.000000Z",
      "completed_at": "2025-08-10T10:06:12.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 3,
    "total_items": 50,
    "per_page": 20
  }
}
```
---
## Webhook System
Crawlshot supports real-time webhook notifications to keep your application informed about job status changes without constant polling.
### How Webhooks Work
1. **Configure Webhook**: Include `webhook_url` when creating jobs
2. **Filter Events**: Use `webhook_events_filter` to specify which status changes trigger webhooks
3. **Receive Notifications**: Your endpoint receives HTTP POST requests with job status data
4. **Automatic Retries**: Failed webhooks are automatically retried with progressive backoff
### Event Filtering
Control which job status changes trigger webhook calls:
```json
{
"webhook_events_filter": ["completed", "failed"]
}
```
**Available Events:**
- `queued` - Job created and queued for processing
- `processing` - Job started processing
- `completed` - Job finished successfully
- `failed` - Job encountered an error

**Special Behaviors:**
- **Default**: `["queued", "processing", "completed", "failed"]` (all events)
- **Disable**: `[]` (empty array disables webhooks entirely)
- **Omitted**: Same as default (all events)
### Webhook Payload
Webhook payloads carry the same job fields as the status endpoints (`GET /api/crawl/{uuid}` or `GET /api/shot/{uuid}`), so the same handling code can process both.
**Crawl Webhook Example:**
```json
{
  "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:00:42.000000Z",
  "started_at": "2025-08-10T10:00:45.000000Z",
  "completed_at": "2025-08-10T10:01:12.000000Z",
  "result": {
    "html": {
      "url": "https://crawlshot.test/api/crawl/b5dc483b-f62d-4e40-8b9e-4715324a8cbb.html",
      "raw": "<!doctype html>\n<html>..."
    }
  }
}
```
**Screenshot Webhook Example:**
```json
{
  "uuid": "fe37d511-99cb-4295-853b-6d484900a851",
  "status": "completed",
  "url": "https://example.com",
  "created_at": "2025-08-10T10:05:42.000000Z",
  "started_at": "2025-08-10T10:05:45.000000Z",
  "completed_at": "2025-08-10T10:06:12.000000Z",
  "result": {
    "image": {
      "url": "https://crawlshot.test/api/shot/fe37d511-99cb-4295-853b-6d484900a851.webp",
      "raw": "UklGRh4AAABXRUJQVlA4..."
    },
    "mime_type": "image/webp",
    "format": "webp",
    "width": 1920,
    "height": 1080,
    "size": 45678
  }
}
```
### Progressive Retry System
Failed webhook deliveries are automatically retried with exponential backoff:
- **1st retry**: 1 minute after failure
- **2nd retry**: 2 minutes after failure
- **3rd retry**: 4 minutes after failure
- **4th retry**: 8 minutes after failure
- **5th retry**: 16 minutes after failure
- **6th retry**: 32 minutes after failure
- **After 6 failures**: Stops retrying, webhook marked as failed

**Total retry window**: ~63 minutes (1+2+4+8+16+32)
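In other words, the wait doubles after each failed delivery, so the delay before retry *n* is 2^(n-1) minutes. A small illustration of the schedule (not the service's internal code):
```php
// Delay, in minutes, before retry attempt $n (1-6): 1, 2, 4, 8, 16, 32
function webhookRetryDelayMinutes(int $n): int
{
    return 2 ** ($n - 1);
}
```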
### Webhook Requirements
**Your webhook endpoint should:**
- Accept HTTP POST requests
- Return HTTP 2xx status codes for successful processing
- Respond within 5 seconds (webhook timeout)
- Handle duplicate deliveries gracefully (use job UUID for idempotency)

**Example webhook handler (PHP):**
```php
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/webhooks/crawlshot', function (Request $request) {
    $jobData = $request->all();

    // Process the job status update
    if ($jobData['status'] === 'completed') {
        // Handle successful completion
        $result = $jobData['result'];
    } elseif ($jobData['status'] === 'failed') {
        // Handle failure
        $error = $jobData['error'];
    }

    return response('OK', 200);
});
```
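Because retries mean the same status update can be delivered more than once, a simple idempotency guard keyed on the job UUID and status helps. A minimal sketch using Laravel's cache (the key format is illustrative):
```php
use Illuminate\Support\Facades\Cache;

// Inside the webhook route, before any processing:
$key = "crawlshot-webhook:{$jobData['uuid']}:{$jobData['status']}";

// Cache::add() stores the key only if it does not exist yet, so a duplicate
// delivery of the same status for the same job is acknowledged but skipped.
if (! Cache::add($key, true, now()->addDay())) {
    return response('OK', 200);
}
```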
---
## Webhook Error Management
When webhooks fail, you can manage them through dedicated endpoints.
### GET `/api/webhook-errors`
List all jobs with failed webhook deliveries.
**Authentication:** Required
#### Request Example
```bash
curl -X GET "https://crawlshot.test/api/webhook-errors" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
  "jobs": [
    {
      "uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
      "type": "crawl",
      "url": "https://example.com",
      "status": "completed",
      "webhook_url": "https://myapp.com/webhook",
      "webhook_attempts": 6,
      "webhook_last_error": "Connection timeout",
      "webhook_next_retry_at": null,
      "created_at": "2025-08-10T10:00:42.000000Z"
    }
  ],
  "pagination": {
    "current_page": 1,
    "total_pages": 1,
    "total_items": 1,
    "per_page": 20
  }
}
```
### POST `/api/webhook-errors/{uuid}/retry`
Manually retry a failed webhook immediately.
**Authentication:** Required
#### Request Example
```bash
curl -X POST "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/retry" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"message": "Webhook retry attempted"
}
```
### DELETE `/api/webhook-errors/{uuid}/clear`
Clear webhook error status without retrying.
**Authentication:** Required
#### Request Example
```bash
curl -X DELETE "https://crawlshot.test/api/webhook-errors/b5dc483b-f62d-4e40-8b9e-4715324a8cbb/clear" \
-H "Accept: application/json" \
-H "Authorization: Bearer 1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c"
```
#### Response Example
```json
{
"uuid": "b5dc483b-f62d-4e40-8b9e-4715324a8cbb",
"message": "Webhook error cleared"
}
```
---
## Job Status Flow
Both crawl and screenshot jobs follow the same status progression:
1. **`queued`** - Job created and waiting for processing
2. **`processing`** - Job is currently being executed by a worker
3. **`completed`** - Job finished successfully, results available
4. **`failed`** - Job encountered an error and could not complete
## Error Responses
### 401 Unauthorized
```json
{
"message": "Unauthenticated."
}
```
### 404 Not Found
```json
{
"error": "Job not found"
}
```
### 422 Validation Error
```json
{
  "message": "The given data was invalid.",
  "errors": {
    "url": [
      "The url field is required."
    ],
    "timeout": [
      "The timeout must be between 5 and 300."
    ]
  }
}
```
## Features
### Ad & Tracker Blocking
- **EasyList Integration**: Automatically downloads and applies EasyList filters
- **Cookie Banner Blocking**: Removes cookie consent prompts
- **Tracker Blocking**: Blocks Google Analytics, Facebook Pixel, and other tracking scripts
- **Custom Domain Blocking**: Blocks common advertising and tracking domains
### Image Processing
- **WebP Format**: High-quality WebP screenshots with adjustable compression
- **Quality Control**: Adjustable compression quality (1-100)
- **Efficient Processing**: Optimized WebP encoding for fast delivery
- **Responsive Sizing**: Custom viewport dimensions up to 4K resolution
### Storage & Cleanup
- **24-Hour TTL**: All files automatically deleted after 24 hours
- **Scheduled Cleanup**: Daily automated cleanup of expired files
- **Manual Cleanup**: `php artisan crawlshot:prune-storage` command available
### Performance
- **Background Processing**: All jobs processed asynchronously via Laravel Horizon
- **Queue Management**: Built-in retry logic and failure handling
- **Caching**: EasyList filters cached for optimal performance
- **Monitoring**: Horizon dashboard for real-time job monitoring at `/horizon`
## Rate Limiting
API endpoints include rate limiting to prevent abuse. Contact your system administrator for current rate limit settings.
## Support
For technical support or questions about the Crawlshot API, please refer to the system documentation or contact your administrator.