2025-08-11 02:56:17 +08:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Crawlshot is a self-hosted API service built on Laravel 12 that provides web crawling and screenshot capabilities using Spatie Browsershot. It exposes browser automation through a REST API with token authentication and background job processing.

Core Features

  • Web Crawling: HTML extraction using headless Chrome via Spatie Browsershot
  • Screenshots: Image capture using Imagick with customizable dimensions
  • Ad/Tracker Blocking: Built-in blocking of ads, cookie banners, and trackers
  • Authentication: Laravel Sanctum API token authentication
  • Job Processing: Laravel Horizon for background job management
  • Temporary Storage: 24-hour auto-deletion of crawl results
  • Status Tracking: UUID-based job status monitoring

Technology Stack

  • Backend: PHP 8.3+ with Laravel 12 framework
  • Browser Automation: Spatie Browsershot (Puppeteer/Chrome headless)
  • Queue System: Laravel Horizon for job processing
  • Authentication: Laravel Sanctum for API tokens
  • Testing: Pest PHP testing framework
  • Database: SQLite (development) for job tracking and API tokens

API Endpoints

Core API Routes

POST /api/crawl
- Initiates crawling/screenshot job
- Parameters: url, type (html|image), width, height, timeout
- Returns: {"uuid": "job-uuid", "status": "queued"}

GET /api/crawl/{uuid}
- Checks job status and retrieves results
- Returns: {"status": "processing|completed|failed", "result": "html content or image url"}
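
As a rough sketch of how a client might call these two routes from a shell. The base URL in `CRAWLSHOT_URL`, the token variable `CRAWLSHOT_TOKEN`, the helper names, and the 30-second default timeout are all assumptions for illustration, not part of the project:

```bash
# Hypothetical deployment settings; replace with your own values.
CRAWLSHOT_URL="${CRAWLSHOT_URL:-http://localhost:8000}"
CRAWLSHOT_TOKEN="${CRAWLSHOT_TOKEN:-}"

# Build the JSON body for POST /api/crawl.
# $1 = url, $2 = type (html|image, default html), $3 = timeout in seconds (default 30).
# No JSON escaping is done, so the URL must not contain double quotes or backslashes.
crawl_payload() {
    printf '{"url":"%s","type":"%s","timeout":%d}' "$1" "${2:-html}" "${3:-30}"
}

# Submit a job and print the raw JSON response.
submit_crawl() {
    curl -sS -X POST "$CRAWLSHOT_URL/api/crawl" \
        -H "Authorization: Bearer $CRAWLSHOT_TOKEN" \
        -H "Content-Type: application/json" \
        -H "Accept: application/json" \
        -d "$(crawl_payload "$@")"
}

# Fetch the status of a job by UUID and print the raw JSON response.
crawl_status() {
    curl -sS "$CRAWLSHOT_URL/api/crawl/$1" \
        -H "Authorization: Bearer $CRAWLSHOT_TOKEN" \
        -H "Accept: application/json"
}
```

For example, `submit_crawl https://example.com html 30` would print the queue response, and `crawl_status` with the returned UUID would print the job state.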

Supported Parameters (mapped to Browsershot capabilities)

HTML Crawling:

  • url: Target URL to crawl
  • timeout: Request timeout in seconds (via timeout() method)
  • block_ads: true/false - Uses EasyList filter (https://easylist.to/easylist/easylist.txt)
  • block_cookie_banners: true/false - Uses cookie banner blocking patterns
  • block_trackers: true/false - Uses tracker blocking patterns
  • delay: Wait time before capture in milliseconds (via setDelay())
  • Network idle waiting is always enabled for optimal rendering (no parameter needed)

Screenshot Capture:

  • url: Target URL to screenshot
  • viewport_width: Viewport width (via windowSize() method)
  • viewport_height: Viewport height (via windowSize() method)
  • quality: WebP image quality 1-100 (via setScreenshotType('webp', quality))
  • block_ads: true/false - Uses EasyList filter for ad blocking
  • block_cookie_banners: true/false - Uses cookie banner blocking patterns
  • block_trackers: true/false - Uses tracker blocking patterns
  • timeout: Request timeout in seconds (via timeout() method)
  • delay: Wait time before capture in milliseconds (via setDelay())
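
The parameter constraints above (quality from 1 to 100, true/false flags, millisecond delays) can be checked client-side before a request is sent. This is a minimal sketch with hypothetical helper names; the 0 to 30000 ms delay bound is an illustrative assumption, not a limit documented by the project:

```bash
# Succeed if $1 is a non-negative integer within the inclusive range $2..$3.
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;  # empty or contains a non-digit character
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Succeed if $1 is exactly "true" or "false", as the block_* flags expect.
is_bool() {
    [ "$1" = "true" ] || [ "$1" = "false" ]
}

# Validate a screenshot request.
# $1 = quality, $2 = block_ads, $3 = delay in milliseconds.
valid_shot_params() {
    in_range "$1" 1 100 && is_bool "$2" && in_range "$3" 0 30000
}
```

Server-side validation remains authoritative; a check like this only avoids sending requests that are obviously malformed.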

Development Commands

Starting the Development Environment

The user starts the development environment. Do not start it yourself; prompt the user to start it instead.

Queue Management with Horizon

The user starts Horizon. Do not start it yourself; prompt the user to start it instead.

Horizon dashboard available at: /horizon

Monitor job queues, failed jobs, and metrics


Individual Services

Do not start them yourself; prompt the user to start them instead.

Testing

```bash
# Run all tests using Pest
composer run test

# Run API endpoint tests
php artisan test --filter=Api

# Test Browsershot functionality
php artisan test tests/Feature/BrowsershotTest.php
```

Database Operations

Never run database migrations yourself; prompt the user to run them instead.

API Token Management

# Generate API tokens via Tinker
php artisan tinker
# User::find(1)->createToken('client-name')->plainTextToken

# Prune expired tokens
php artisan sanctum:prune-expired --hours=24

Storage Management

# Prune expired crawl results (HTML and images older than 24 hours)
php artisan crawlshot:prune-storage

# Run storage cleanup via scheduled job
php artisan schedule:run

Browsershot Setup Requirements

# Install Node.js and Puppeteer dependencies
npm install puppeteer

# For production servers, ensure Chrome/Chromium is installed
# Ubuntu/Debian: apt-get install chromium-browser
# Alpine: apk add chromium
# Or use Puppeteer's bundled Chromium
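
To confirm that a Chrome/Chromium binary is reachable before running Browsershot jobs, a check along these lines may help. The helper names and the list of candidate binary names are assumptions; actual names vary by distribution:

```bash
# Print the path of the first command in the argument list that exists on PATH.
# Returns 1 if none of the candidates is found.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    return 1
}

# Look for a system Chrome/Chromium under some commonly used binary names.
find_chrome() {
    first_available chromium chromium-browser google-chrome google-chrome-stable
}
```

A failing `find_chrome` does not necessarily mean Browsershot cannot run, since Puppeteer's bundled Chromium is not on `PATH`.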

Architecture Overview

Job Processing Flow

  1. Crawl API Request → /api/crawl with URL and parameters
  2. Screenshot API Request → /api/shot with URL and parameters
  3. Job Creation → Queue job with UUID, store in database
  4. Processing → Horizon worker uses Browsershot to capture content
  5. Storage → Save HTML/image to storage with 24h expiry
  6. Status Check → /api/crawl/{uuid} returns result when ready
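
Because jobs are processed asynchronously, a client has to poll the status endpoint until the job finishes. The sketch below shows one way to do that from a shell. The function names, the default of 30 attempts, and the 2-second interval are assumptions, and the `sed` expression is a naive match that only suits small, flat responses (a JSON tool such as `jq` would be more robust):

```bash
# Extract the "status" field from a JSON response on stdin.
# Naive: if "status" appears more than once, the last occurrence on the line wins.
json_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the job leaves the queued/processing states.
# $1 = command that prints the status JSON
# $2 = maximum attempts (default 30)
# $3 = seconds to sleep between attempts (default 2)
# Returns 0 on completed, 1 on failed, 2 if still pending after all attempts.
wait_for_job() {
    attempts=0
    while [ "$attempts" -lt "${2:-30}" ]; do
        status=$($1 | json_status)
        case "$status" in
            completed) return 0 ;;
            failed)    return 1 ;;
        esac
        attempts=$((attempts + 1))
        sleep "${3:-2}"
    done
    return 2
}
```

The first argument can be any command that prints the response body, for instance a `curl` call against `/api/crawl/{uuid}` wrapped in a small function.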

Directory Structure

app/
├── Http/Controllers/Api/
│   ├── CrawlController.php          # Crawl API endpoints (/crawl, /crawl/{uuid})
│   └── ShotController.php           # Screenshot API endpoints (/shot, /shot/{uuid})
├── Jobs/
│   ├── ProcessCrawlShotJob.php      # Browsershot integration
│   └── CleanupOldResults.php        # Auto-delete expired files
├── Models/
│   ├── CrawlShotJob.php             # Job tracking model
│   └── User.php                     # API token authentication
└── Services/
    ├── BrowsershotService.php       # Browsershot wrapper with filtering
    └── EasyListService.php          # ProtonMail php-adblock-parser wrapper

storage/app/crawlshot/               # Temporary result storage (24h TTL)
├── html/                           # HTML crawl results
└── images/                         # Screenshot files (.webp)

routes/
└── api.php                         # /crawl endpoints with Sanctum auth

Browsershot Configuration

// Basic screenshot configuration with EasyList ad blocking
// (requires: use Spatie\Browsershot\Browsershot;)
$browsershot = Browsershot::url($url)
    ->windowSize($width, $height)
    ->setScreenshotType('webp', $quality)  // Always WebP format
    ->setDelay($delayInMs)
    ->waitUntilNetworkIdle()               // Network idle waiting is always enabled
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

$tempPath = storage_path('temp_screenshot.webp');
$browsershot->save($tempPath);

// HTML crawling configuration with EasyList filtering
$browsershot = Browsershot::url($url)
    ->setDelay($delayInMs)
    ->waitUntilNetworkIdle()  // Network idle waiting is always enabled
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

$html = $browsershot->bodyHtml();

Job States

  • queued: Job created, waiting for processing
  • processing: Horizon worker running Browsershot
  • completed: Result stored, available via status endpoint
  • failed: Browsershot error, timeout, or invalid URL

Storage Strategy

  • HTML results: storage/app/crawlshot/html/{uuid}.html
  • Image results: storage/app/crawlshot/images/{uuid}.webp (WebP format only)
  • Auto-cleanup scheduled job removes files after 24 hours
  • Database tracks job metadata and file paths
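
The layout and 24-hour expiry described above can be approximated from a shell for inspection or manual cleanup. This is an illustrative sketch only and not how `crawlshot:prune-storage` is implemented; the function names are hypothetical, and it relies on file modification times rather than the job metadata in the database:

```bash
# Map a job UUID and result type to its storage path.
# $1 = storage root (e.g. storage/app/crawlshot), $2 = uuid, $3 = html|image
result_path() {
    case "$3" in
        html)  printf '%s/html/%s.html\n' "$1" "$2" ;;
        image) printf '%s/images/%s.webp\n' "$1" "$2" ;;
        *)     return 1 ;;
    esac
}

# List result files last modified more than 24 hours (1440 minutes) ago.
# -mmin is supported by GNU and BSD find but is not part of POSIX.
expired_results() {
    find "$1" -type f \( -name '*.html' -o -name '*.webp' \) -mmin +1440
}

# Delete every expired result file under the storage root.
prune_results() {
    expired_results "$1" | while IFS= read -r file; do
        rm -f -- "$file"
    done
}
```

Running `expired_results storage/app/crawlshot` first gives a dry-run view of what `prune_results` would remove.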

Authentication & Security

  • All API endpoints protected by Sanctum middleware
  • Bearer token required in Authorization header
  • Rate limiting on crawl endpoints to prevent abuse
  • Input validation for URLs and parameters
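
The project's actual validation rules are not shown here. As one illustration of what a URL check could look like, the hypothetical helper below accepts only `http` and `https` targets, which keeps schemes such as `file://` away from the headless browser:

```bash
# Succeed only if $1 starts with http:// or https:// followed by at least one character.
is_http_url() {
    case "$1" in
        http://?*|https://?*) return 0 ;;
        *)                    return 1 ;;
    esac
}
```

A scheme check like this is only a first filter; it does not verify that the host resolves or that the URL is well-formed.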

System Requirements

  • PHP 8.3+ with the gd extension (WebP output is produced by Puppeteer itself)
  • Node.js and npm for Puppeteer
  • Chrome/Chromium browser (headless)
  • Sufficient disk space for temporary file storage
  • Memory for concurrent Browsershot processes

EasyList Integration

  • Uses ProtonMail's php-adblock-parser (https://github.com/ProtonMail/php-adblock-parser)
  • Service downloads and caches EasyList filters from https://easylist.to/easylist/easylist.txt
  • php-adblock-parser handles filter parsing and URL matching
  • Filters converted to domains/URLs for blockDomains() and blockUrls() methods
  • Cache updated periodically to maintain current ad blocking effectiveness
  • Cookie banner and tracker blocking use additional filter lists (EasyList Cookie, Fanboy's Annoyance)
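
To give a feel for what converting filters to domains involves, the sketch below pulls plain domains out of the simplest EasyList rule form, `||example.com^`. It is a deliberate simplification with a hypothetical function name: the project relies on php-adblock-parser, and real filter lists contain options, exceptions, and element-hiding rules that this ignores:

```bash
# Read filter rules on stdin and print the unique domains from rules that are
# exactly of the form ||domain^ (no options, no paths, no wildcards).
easylist_domains() {
    grep -E '^\|\|[A-Za-z0-9.-]+\^$' \
        | sed -e 's/^||//' -e 's/\^$//' \
        | sort -u
}
```

Given a cached copy of the list, `easylist_domains < easylist.txt` would print one domain per line, which is the shape of input that `blockDomains()` expects as an array.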

Development Notes

  • Horizon required for proper queue processing
  • Chrome/Chromium must be accessible to PHP process
  • Consider Docker for consistent browser environment
  • Monitor disk usage due to temporary file storage
  • EasyList filters cached locally for performance using php-adblock-parser
  • Test with various websites for ad/tracker blocking effectiveness