# Crawlshot

A Laravel web crawling and screenshot service with dual deployment options:

1. **Standalone API Service** - Full Laravel application with REST API endpoints
2. **Laravel Package** - HTTP client package for use in other Laravel applications

## Architecture Overview

### Standalone API Service

The main Laravel application provides a complete web crawling and screenshot service:

- **Spatie Browsershot Integration** - Uses Puppeteer for browser automation
- **EasyList Ad Blocking** - Automatic ad/tracker blocking using EasyList filters
- **Queue Processing** - Laravel Horizon for async job processing
- **24-hour Cleanup** - Automatic file and database cleanup
- **Sanctum Authentication** - API token-based authentication
- **SQLite Database** - Stores job metadata and processing status

### Laravel Package

Simple HTTP client package that provides a clean interface to the API:

- **8 Methods for 8 APIs** - Direct 1:1 mapping to REST endpoints
- **Facade Support** - Clean Laravel integration
- **Auto-discovery** - Automatic service provider registration

## Deployment Options

### Option 1: Standalone API Service

Deploy as a complete Laravel application:

```bash
git clone [repository]
cd crawlshot
composer install
npm install puppeteer
php artisan migrate
php artisan serve
```

**API Endpoints:**

- `POST /api/crawl` - Create HTML crawl job
- `GET /api/crawl/{uuid}` - Get crawl status/result
- `GET /api/crawl` - List all crawl jobs
- `POST /api/shot` - Create screenshot job
- `GET /api/shot/{uuid}` - Get screenshot status/result
- `GET /api/shot/{uuid}/download` - Download screenshot file
- `GET /api/shot` - List all screenshot jobs
- `GET /api/health` - Health check

**Example API Usage:**

```bash
# Create crawl job
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "block_ads": true}'

# Check status
curl -H "Authorization: Bearer {token}" \
  "https://crawlshot.test/api/crawl/{uuid}"
```

### Option 2: Laravel Package

Install as a package in your Laravel application:

```bash
composer require crawlshot/laravel
php artisan vendor:publish --tag=crawlshot-config
```

**Configuration:**

```env
CRAWLSHOT_BASE_URL=https://your-crawlshot-api.com
CRAWLSHOT_TOKEN=your-sanctum-token
```

**Package Usage:**

```php
use Crawlshot\Laravel\Facades\Crawlshot;

// Create crawl job
$response = Crawlshot::createCrawl('https://example.com', [
    'block_ads' => true,
    'timeout' => 30
]);

// Check status
$status = Crawlshot::getCrawlStatus($response['uuid']);

// Create screenshot
$response = Crawlshot::createShot('https://example.com', [
    'format' => 'jpg',
    'width' => 1920,
    'height' => 1080
]);

// Download screenshot
$imageData = Crawlshot::downloadShot($response['uuid']);
file_put_contents('screenshot.jpg', $imageData);
```

## API Reference

### Available Methods (Package)

| Method | API Endpoint | Description |
|--------|--------------|-------------|
| `createCrawl(string $url, array $options = [])` | `POST /api/crawl` | Create crawl job |
| `getCrawlStatus(string $uuid)` | `GET /api/crawl/{uuid}` | Get crawl status |
| `listCrawls()` | `GET /api/crawl` | List all crawl jobs |
| `createShot(string $url, array $options = [])` | `POST /api/shot` | Create screenshot job |
| `getShotStatus(string $uuid)` | `GET /api/shot/{uuid}` | Get screenshot status |
| `downloadShot(string $uuid)` | `GET /api/shot/{uuid}/download` | Download screenshot file |
| `listShots()` | `GET /api/shot` | List all screenshot jobs |
| `health()` | `GET /api/health` | Health check |

### Crawl Options

```php
[
    'block_ads' => true,            // Block ads using EasyList
    'block_trackers' => true,       // Block tracking scripts
    'timeout' => 30,                // Request timeout in seconds
    'user_agent' => 'Custom UA',    // Custom user agent
    'wait_until' => 'networkidle0'  // Wait condition
]
```

### Screenshot Options

```php
[
    'format' => 'jpg',     // jpg, png, webp
    'quality' => 90,       // 1-100 for jpg/webp
    'width' => 1920,       // Viewport width
    'height' => 1080,      // Viewport height
    'full_page' => true,   // Capture full page
    'block_ads' => true,   // Block ads
    'timeout' => 30        // Request timeout
]
```

## Features

### Core Functionality

- **HTML Crawling** - Extract clean HTML content from web pages
- **Screenshot Capture** - Generate high-quality screenshots (JPG, PNG, WebP)
- **Ad Blocking** - Built-in EasyList integration for ad/tracker blocking
- **Queue Processing** - Async job processing with Laravel Horizon
- **File Management** - Automatic cleanup after 24 hours

### Technical Features

- **Laravel 12** support with PHP 8.3+
- **Puppeteer Integration** via Spatie Browsershot
- **Sanctum Authentication** for API security
- **SQLite Database** with migrations
- **Auto-discovery** for package installation
- **Environment Configuration** via .env variables

## Development

### Requirements

- PHP 8.3+
- Laravel 12.0+
- Node.js with Puppeteer
- SQLite (or other database)
- ImageMagick extension

### Key Dependencies

- `spatie/browsershot` - Browser automation
- `protonlabs/php-adblock-parser` - EasyList parsing
- `laravel/horizon` - Queue monitoring (standalone)
- `laravel/sanctum` - API authentication (standalone)

### File Structure

```
├── app/                          # Laravel application (standalone)
│   ├── Http/Controllers/Api/     # API controllers
│   ├── Jobs/                     # Queue jobs
│   ├── Models/                   # Eloquent models
│   └── Services/                 # Core services
├── src/                          # Package source (both modes)
│   ├── CrawlshotClient.php       # HTTP client (package mode)
│   ├── CrawlshotServiceProvider.php
│   ├── Facades/Crawlshot.php
│   └── config/crawlshot.php
├── routes/api.php                # API routes (standalone)
├── database/migrations/          # Database schema
└── composer.json                 # Package definition
```

## License

MIT
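
## Appendix: Polling a Crawl Job from the Shell

The curl examples under Option 1 create a job and check its status once. Because jobs are processed asynchronously on a queue, a caller usually has to check `GET /api/crawl/{uuid}` repeatedly until the job finishes. The following is a minimal sketch of that loop, not the project's own tooling. It assumes the status response is JSON with a top-level `status` string whose terminal values are `completed` and `failed`; neither the field name nor the values are confirmed by this README, so adjust them to the real response. The function names `fetch_job`, `extract_status`, and `wait_for_crawl` are hypothetical, and `CRAWLSHOT_BASE_URL` / `CRAWLSHOT_TOKEN` are reused here as plain shell environment variables.

```bash
#!/bin/sh
# Hypothetical helper: fetch the raw status response for one crawl job.
fetch_job() {
  curl -sS -H "Authorization: Bearer ${CRAWLSHOT_TOKEN}" \
    "${CRAWLSHOT_BASE_URL}/api/crawl/$1"
}

# Hypothetical helper: read JSON on stdin and print the first "status" value.
# ASSUMPTION: the response has a "status" string field (not confirmed above).
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the job reaches a terminal state or the attempts run out.
# Usage: wait_for_crawl <uuid> [max_attempts] [delay_seconds]
# Prints completed, failed, or timeout; exit code is 0, 1, or 2 respectively.
wait_for_crawl() {
  local uuid="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  local status
  while [ "$i" -le "$attempts" ]; do
    status="$(fetch_job "$uuid" | extract_status)"
    case "$status" in
      completed) echo "completed"; return 0 ;;  # ASSUMED terminal value
      failed)    echo "failed";    return 1 ;;  # ASSUMED terminal value
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timeout"
  return 2
}
```

With the two environment variables exported, `wait_for_crawl "$uuid" 30 2` would check the job up to 30 times, two seconds apart. Capping the number of attempts keeps a stuck or lost job from blocking the caller indefinitely.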