# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Crawlshot is a self-hosted API service built on Laravel 12 that provides web crawling and screenshot capabilities using Spatie Browsershot. It offers browser automation through a REST API with authentication and background job processing.

### Core Features

- **Web Crawling**: HTML extraction using headless Chrome via Spatie Browsershot
- **Screenshots**: Image capture using Imagick with customizable dimensions
- **Ad/Tracker Blocking**: Built-in blocking of ads, cookie banners, and trackers
- **Authentication**: Laravel Sanctum API token authentication
- **Job Processing**: Laravel Horizon for background job management
- **Temporary Storage**: 24-hour auto-deletion of crawl results
- **Status Tracking**: UUID-based job status monitoring

### Technology Stack

- **Backend**: PHP 8.3+ with Laravel 12 framework
- **Browser Automation**: Spatie Browsershot (Puppeteer/Chrome headless)
- **Queue System**: Laravel Horizon for job processing
- **Authentication**: Laravel Sanctum for API tokens
- **Testing**: Pest PHP testing framework
- **Database**: SQLite (development) for job tracking and API tokens

## API Endpoints

### Core API Routes

```
POST /api/crawl
- Initiates crawling/screenshot job
- Parameters: url, type (html|image), width, height, timeout
- Returns: {"uuid": "job-uuid", "status": "queued"}

GET /api/crawl/{uuid}
- Checks job status and retrieves results
- Returns: {"status": "processing|completed|failed", "result": "html content or image url"}
```

### Supported Parameters (mapped to Browsershot capabilities)

**HTML Crawling**:

- `url`: Target URL to crawl
- `timeout`: Request timeout in seconds (via `timeout()` method)
- `block_ads`: true/false - Uses EasyList filter (https://easylist.to/easylist/easylist.txt)
- `block_cookie_banners`: true/false - Uses cookie banner blocking patterns
- `block_trackers`: true/false - Uses tracker blocking patterns
- `delay`: Wait time before capture in milliseconds (via `setDelay()`)
- Network idle waiting is always enabled for optimal rendering (no parameter needed)

**Screenshot Capture**:

- `url`: Target URL to screenshot
- `viewport_width`: Viewport width (via `windowSize()` method)
- `viewport_height`: Viewport height (via `windowSize()` method)
- `quality`: WebP image quality 1-100 (via `setScreenshotType('webp', quality)`)
- `block_ads`: true/false - Uses EasyList filter for ad blocking
- `block_cookie_banners`: true/false - Uses cookie banner blocking patterns
- `block_trackers`: true/false - Uses tracker blocking patterns
- `timeout`: Request timeout in seconds (via `timeout()` method)
- `delay`: Wait time before capture in milliseconds (via `setDelay()`)
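For reference, the sketch below shows one way a client (for example, another Laravel application) might call these endpoints using Laravel's HTTP client. The base URL and token are placeholders, the request/response shapes follow the examples above, and the specific parameter values are only illustrative.

```php
<?php

use Illuminate\Support\Facades\Http;

// Placeholder deployment values; substitute your own.
$base  = 'https://crawlshot.example.com';
$token = 'your-sanctum-api-token';

// Queue an HTML crawl with ad blocking and a 2-second pre-capture delay.
$response = Http::withToken($token)->post("{$base}/api/crawl", [
    'url'       => 'https://example.com',
    'type'      => 'html',
    'timeout'   => 30,   // seconds
    'block_ads' => true,
    'delay'     => 2000, // milliseconds
]);

// The POST response looks like {"uuid": "job-uuid", "status": "queued"}.
$uuid = $response->json('uuid');

// Poll the status endpoint until the job completes or fails.
do {
    sleep(2);
    $status = Http::withToken($token)->get("{$base}/api/crawl/{$uuid}")->json();
} while (in_array($status['status'], ['queued', 'processing'], true));

// On success, "result" holds the HTML content (or an image URL for screenshots).
$html = $status['result'] ?? null;
```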
## Development Commands

### Starting the Development Environment

The user will start the development environment. Do not start it yourself; prompt the user to start it instead.

### Queue Management with Horizon

The user will start Horizon. Do not start it yourself; prompt the user to start it instead.

The Horizon dashboard is available at `/horizon` for monitoring job queues, failed jobs, and metrics.

### Individual Services

Do not start individual services yourself; prompt the user to start them instead.

### Testing

```bash
# Run all tests using Pest
composer run test

# Run API endpoint tests
php artisan test --filter=Api

# Test browsershot functionality
php artisan test tests/Feature/BrowsershotTest.php
```

### Database Operations

Never run database migrations yourself; prompt the user to run them instead.

### API Token Management

```bash
# Generate API tokens via Tinker
php artisan tinker
# User::find(1)->createToken('client-name')->plainTextToken

# Prune expired tokens
php artisan sanctum:prune-expired --hours=24
```

### Storage Management

```bash
# Prune expired crawl results (HTML and images older than 24 hours)
php artisan crawlshot:prune-storage

# Run storage cleanup via scheduled job
php artisan schedule:run
```

### Browsershot Setup Requirements

```bash
# Install Node.js and Puppeteer dependencies
npm install puppeteer

# For production servers, ensure Chrome/Chromium is installed
# Ubuntu/Debian: apt-get install chromium-browser
# Alpine: apk add chromium
# Or use Puppeteer's bundled Chromium
```

## Architecture Overview

### Job Processing Flow

1. **Crawl API Request** → `/api/crawl` with URL and parameters
2. **Screenshot API Request** → `/api/shot` with URL and parameters
3. **Job Creation** → Queue job with UUID, store in database
4. **Processing** → Horizon worker uses Browsershot to capture content
5. **Storage** → Save HTML/image to storage with 24h expiry
6. **Status Check** → `/api/crawl/{uuid}` returns result when ready

### Directory Structure

```
app/
├── Http/Controllers/Api/
│   ├── CrawlController.php      # Main API endpoints (/crawl, /crawl/{uuid})
│   └── ShotController.php       # Main API endpoints (/shot, /shot/{uuid})
├── Jobs/
│   ├── ProcessCrawlShotJob.php  # Browsershot integration
│   └── CleanupOldResults.php    # Auto-delete expired files
├── Models/
│   ├── CrawlShotJob.php         # Job tracking model
│   └── User.php                 # API token authentication
└── Services/
    ├── BrowsershotService.php   # Browsershot wrapper with filtering
    └── EasyListService.php      # ProtonMail php-adblock-parser wrapper

storage/app/crawlshot/           # Temporary result storage (24h TTL)
├── html/                        # HTML crawl results
└── images/                      # Screenshot files (.webp)

routes/
└── api.php                      # /crawl endpoints with Sanctum auth
```

### Browsershot Configuration

```php
// Basic screenshot configuration with EasyList ad blocking
$browsershot = Browsershot::url($url)
    ->windowSize($width, $height)
    ->setScreenshotType('webp', $quality) // Always WebP format
    ->setDelay($delayInMs)                // Network idle waiting is always enabled
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

$tempPath = storage_path('temp_screenshot.webp');
$browsershot->save($tempPath);

// HTML crawling configuration with EasyList filtering
$browsershot = Browsershot::url($url)
    ->setDelay($delayInMs) // Network idle waiting is always enabled
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

$html = $browsershot->bodyHtml();
```

### Job States

- **queued**: Job created, waiting for processing
- **processing**: Horizon worker running Browsershot
- **completed**: Result stored, available via status endpoint
- **failed**: Browsershot error, timeout, or invalid URL

### Storage Strategy

- HTML results: `storage/app/crawlshot/html/{uuid}.html`
- Image results: `storage/app/crawlshot/images/{uuid}.webp` (WebP format only)
- Auto-cleanup scheduled job removes files after 24 hours
- Database tracks job metadata and file paths
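To tie the job states and storage paths above to code, here is a minimal sketch of what `ProcessCrawlShotJob` could look like. The `BrowsershotService` methods (`crawlHtml()`, `screenshot()`) and the `CrawlShotJob` columns (`status`, `result_path`, `error`, `expires_at`, `options`, `type`, `url`, `uuid`) are assumptions for illustration, not the actual implementation.

```php
<?php

namespace App\Jobs;

use App\Models\CrawlShotJob;
use App\Services\BrowsershotService;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class ProcessCrawlShotJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(public CrawlShotJob $crawlShotJob) {}

    public function handle(BrowsershotService $browsershot): void
    {
        // Mark the job as picked up by a Horizon worker.
        $this->crawlShotJob->update(['status' => 'processing']);

        try {
            if ($this->crawlShotJob->type === 'html') {
                // crawlHtml() is a hypothetical wrapper method; the real
                // BrowsershotService API may differ.
                $html = $browsershot->crawlHtml($this->crawlShotJob->url, $this->crawlShotJob->options);
                $path = "crawlshot/html/{$this->crawlShotJob->uuid}.html";
                Storage::put($path, $html);
            } else {
                // screenshot() is likewise assumed; it should return WebP bytes.
                $image = $browsershot->screenshot($this->crawlShotJob->url, $this->crawlShotJob->options);
                $path = "crawlshot/images/{$this->crawlShotJob->uuid}.webp";
                Storage::put($path, $image);
            }

            // Result becomes visible via GET /api/crawl/{uuid}; files expire after 24h.
            $this->crawlShotJob->update([
                'status' => 'completed',
                'result_path' => $path,
                'expires_at' => now()->addDay(),
            ]);
        } catch (\Throwable $e) {
            // Browsershot error, timeout, or invalid URL.
            $this->crawlShotJob->update(['status' => 'failed', 'error' => $e->getMessage()]);

            throw $e;
        }
    }
}
```

Re-throwing in the catch block lets Horizon record the failure and surface it in its dashboard, while the database row keeps the error message for the status endpoint.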
### Authentication & Security

- All API endpoints protected by Sanctum middleware
- Bearer token required in Authorization header
- Rate limiting on crawl endpoints to prevent abuse
- Input validation for URLs and parameters

### System Requirements

- PHP 8.3+ with extensions: gd (WebP support built into Puppeteer)
- Node.js and npm for Puppeteer
- Chrome/Chromium browser (headless)
- Sufficient disk space for temporary file storage
- Memory for concurrent Browsershot processes

### EasyList Integration

- Uses ProtonMail's php-adblock-parser (https://github.com/ProtonMail/php-adblock-parser)
- Service downloads and caches EasyList filters from https://easylist.to/easylist/easylist.txt
- php-adblock-parser handles filter parsing and URL matching
- Filters converted to domains/URLs for `blockDomains()` and `blockUrls()` methods
- Cache updated periodically to maintain current ad blocking effectiveness
- Cookie banner and tracker blocking use additional filter lists (EasyList Cookie, Fanboy's Annoyance)

### Development Notes

- Horizon required for proper queue processing
- Chrome/Chromium must be accessible to the PHP process
- Consider Docker for a consistent browser environment
- Monitor disk usage due to temporary file storage
- EasyList filters cached locally for performance using php-adblock-parser
- Test with various websites for ad/tracker blocking effectiveness
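The `EasyListService` internals are not documented here, so the sketch below only illustrates the caching and domain-extraction idea behind a `getBlockedDomains()`-style method: fetch the EasyList file, cache it for a day, and pull out domain-anchor rules (`||example.com^`) to feed Browsershot's `blockDomains()`. A simple regex stands in for php-adblock-parser so the sketch stays self-contained; the real service delegates matching to that library, and the cache key, TTL, and unused `$url` handling here are assumptions.

```php
<?php

namespace App\Services;

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Http;

class EasyListService
{
    private const FILTER_URL = 'https://easylist.to/easylist/easylist.txt';

    /**
     * Return ad-serving domains suitable for Browsershot's blockDomains().
     * In the real service, php-adblock-parser does the rule matching; the
     * regex below only handles plain domain-anchor rules as a stand-in,
     * and $url is ignored for simplicity.
     */
    public static function getBlockedDomains(string $url): array
    {
        $rules = Cache::remember('easylist.filters', now()->addDay(), function () {
            // Download and cache the raw EasyList filter file for 24 hours.
            return Http::timeout(30)->get(self::FILTER_URL)->body();
        });

        $domains = [];

        foreach (explode("\n", $rules) as $line) {
            // Match simple domain-anchor rules such as "||doubleclick.net^".
            if (preg_match('/^\|\|([a-z0-9.-]+)\^$/i', trim($line), $m)) {
                $domains[] = strtolower($m[1]);
            }
        }

        return array_values(array_unique($domains));
    }
}
```

A scheduled task (for example, alongside `crawlshot:prune-storage` in the scheduler) could clear this cache periodically so the filter list stays current.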