# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Crawlshot is a self-hosted API service built on Laravel 12 that provides web crawling and screenshot capabilities using Spatie Browsershot. It exposes browser automation through a REST API with token authentication and background job processing.
### Core Features
- **Web Crawling**: HTML extraction using headless Chrome via Spatie Browsershot
- **Screenshots**: Image capture via Browsershot with customizable dimensions; Imagick handles format conversion
- **Ad/Tracker Blocking**: Built-in blocking of ads, cookie banners, and trackers
- **Authentication**: Laravel Sanctum API token authentication
- **Job Processing**: Laravel Horizon for background job management
- **Temporary Storage**: 24-hour auto-deletion of crawl results
- **Status Tracking**: UUID-based job status monitoring
### Technology Stack
- **Backend**: PHP 8.3+ with Laravel 12 framework
- **Browser Automation**: Spatie Browsershot (Puppeteer/Chrome headless)
- **Queue System**: Laravel Horizon for job processing
- **Authentication**: Laravel Sanctum for API tokens
- **Testing**: Pest PHP testing framework
- **Database**: SQLite (development) for job tracking and API tokens
## API Endpoints
### Core API Routes
```
POST /api/crawl
- Initiates crawling/screenshot job
- Parameters: url, type (html|image), width, height, timeout
- Returns: {"uuid": "job-uuid", "status": "queued"}
GET /api/crawl/{uuid}
- Checks job status and retrieves results
- Returns: {"status": "queued|processing|completed|failed", "result": "html content or image url"}
```
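A client typically submits a job and then polls the status endpoint until the job finishes. The following is a minimal sketch of that loop, not the project's own client: `pollCrawlJob` and the `$fetchStatus` callable are hypothetical names, and the callable is assumed to perform `GET /api/crawl/{uuid}` and return the decoded JSON body.

```php
<?php
// Sketch only: $fetchStatus is a hypothetical callable that performs
// GET /api/crawl/{uuid} and returns the decoded JSON body as an array.

/**
 * Poll a job until it reaches a terminal state or the attempts run out.
 *
 * @param callable(string): array $fetchStatus
 * @return array The last status payload received.
 */
function pollCrawlJob(callable $fetchStatus, string $uuid, int $maxAttempts = 30, int $sleepMs = 1000): array
{
    $payload = ['status' => 'queued'];
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $payload = $fetchStatus($uuid);
        // Treat a missing status field as a failure rather than looping forever.
        $status = $payload['status'] ?? 'failed';
        if ($status === 'completed' || $status === 'failed') {
            return $payload;
        }
        // Wait between attempts, but not after the final one.
        if ($attempt < $maxAttempts && $sleepMs > 0) {
            usleep($sleepMs * 1000);
        }
    }
    return $payload;
}
```

Injecting the transport as a callable keeps the loop independent of any HTTP client, so it can be exercised with a fake in tests.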
### Supported Parameters (mapped to Browsershot capabilities)
**HTML Crawling**:
- `url`: Target URL to crawl
- `timeout`: Request timeout in seconds (via `timeout()` method)
- `block_ads`: true/false - Uses EasyList filter (https://easylist.to/easylist/easylist.txt)
- `block_cookie_banners`: true/false - Uses cookie banner blocking patterns
- `block_trackers`: true/false - Uses tracker blocking patterns
- `delay`: Wait time before capture in milliseconds (via `setDelay()`)
- `wait_until_network_idle`: Wait for network activity to cease (via `waitUntilNetworkIdle()`)
**Screenshot Capture**:
- `url`: Target URL to screenshot
- `viewport_width`: Viewport width (via `windowSize()` method)
- `viewport_height`: Viewport height (via `windowSize()` method)
- `format`: jpg, png, webp (via Imagick post-processing)
- `quality`: Image quality 1-100 for JPEG (via `setScreenshotType('jpeg', quality)`)
- `block_ads`: true/false - Uses EasyList filter for ad blocking
- `block_cookie_banners`: true/false - Uses cookie banner blocking patterns
- `block_trackers`: true/false - Uses tracker blocking patterns
- `timeout`: Request timeout in seconds (via `timeout()` method)
- `delay`: Wait time before capture in milliseconds (via `setDelay()`)
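The screenshot parameters above could be normalized before a job is queued. The sketch below is one possible shape, assuming a hypothetical helper named `normalizeShotParams`; every default value (1920x1080 viewport, `jpg`, quality 90, 30-second timeout) is an illustrative assumption and not taken from the project.

```php
<?php
// Sketch only: the function name and all default values are assumptions.

function normalizeShotParams(array $input): array
{
    if (!isset($input['url']) || !is_string($input['url']) || trim($input['url']) === '') {
        throw new InvalidArgumentException('The url parameter is required.');
    }

    $format = strtolower((string) ($input['format'] ?? 'jpg'));
    if (!in_array($format, ['jpg', 'png', 'webp'], true)) {
        throw new InvalidArgumentException("Unsupported format: {$format}");
    }

    // Accepts true/false as well as "true"/"false" strings from form input.
    $bool = static fn (string $key): bool =>
        filter_var($input[$key] ?? false, FILTER_VALIDATE_BOOLEAN);

    return [
        'url' => trim($input['url']),
        'viewport_width' => (int) ($input['viewport_width'] ?? 1920),
        'viewport_height' => (int) ($input['viewport_height'] ?? 1080),
        'format' => $format,
        // Clamp quality into the documented 1-100 range.
        'quality' => max(1, min(100, (int) ($input['quality'] ?? 90))),
        'block_ads' => $bool('block_ads'),
        'block_cookie_banners' => $bool('block_cookie_banners'),
        'block_trackers' => $bool('block_trackers'),
        'timeout' => (int) ($input['timeout'] ?? 30),  // seconds
        'delay' => (int) ($input['delay'] ?? 0),       // milliseconds
    ];
}
```

In a Laravel application the same rules would more commonly live in a form request or validator; the plain function is used here only to keep the example self-contained.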
## Development Commands
### Starting the Development Environment
The user starts the development environment. Do not start it yourself; prompt the user to start it instead.
### Queue Management with Horizon
The user starts Horizon. Do not start it yourself; prompt the user to start it instead.
- Horizon dashboard available at: `/horizon`
- Use it to monitor job queues, failed jobs, and metrics
### Individual Services
Do not start individual services yourself; prompt the user to start them instead.
### Testing
```bash
# Run all tests using Pest
composer run test
# Run API endpoint tests
php artisan test --filter=Api
# Test Browsershot functionality
php artisan test tests/Feature/BrowsershotTest.php
```
### Database Operations
Never run database migrations yourself; prompt the user to run them instead.
### API Token Management
```bash
# Generate API tokens via Tinker
php artisan tinker
# User::find(1)->createToken('client-name')->plainTextToken
# Prune expired tokens
php artisan sanctum:prune-expired --hours=24
```
### Storage Management
```bash
# Prune expired crawl results (HTML and images older than 24 hours)
php artisan crawlshot:prune-storage
# Run storage cleanup via scheduled job
php artisan schedule:run
```
### Browsershot Setup Requirements
```bash
# Install Node.js and Puppeteer dependencies
npm install puppeteer
# For production servers, ensure Chrome/Chromium is installed
# Ubuntu/Debian: apt-get install chromium-browser
# Alpine: apk add chromium
# Or use Puppeteer's bundled Chromium
```
## Architecture Overview
### Job Processing Flow
1. **Crawl API Request** → `/api/crawl` with URL and parameters
2. **Screenshot API Request** → `/api/shot` with URL and parameters
3. **Job Creation** → Queue job with UUID, store in database
4. **Processing** → Horizon worker uses Browsershot to capture content
5. **Storage** → Save HTML/image to storage with 24h expiry
6. **Status Check** → `/api/crawl/{uuid}` or `/api/shot/{uuid}` returns result when ready
### Directory Structure
```
app/
├── Http/Controllers/Api/
│   ├── CrawlController.php      # Main API endpoints (/crawl, /crawl/{uuid})
│   └── ShotController.php       # Main API endpoints (/shot, /shot/{uuid})
├── Jobs/
│   ├── ProcessCrawlShotJob.php  # Browsershot integration
│   └── CleanupOldResults.php    # Auto-delete expired files
├── Models/
│   ├── CrawlShotJob.php         # Job tracking model
│   └── User.php                 # API token authentication
└── Services/
    ├── BrowsershotService.php   # Browsershot wrapper with filtering
    └── EasyListService.php      # ProtonMail php-adblock-parser wrapper
storage/app/crawlshot/           # Temporary result storage (24h TTL)
├── html/                        # HTML crawl results
└── images/                      # Screenshot files (JPEG/PNG/WebP)
routes/
└── api.php                      # /crawl endpoints with Sanctum auth
```
### Browsershot Configuration
```php
use App\Services\EasyListService;
use Imagick;
use Spatie\Browsershot\Browsershot;

// $url, $width, $height, $format, $blockAds, $delayInMs and $timeoutInSeconds
// come from the validated job parameters; $uuid is the job UUID.

// Basic screenshot configuration with EasyList ad blocking
$browsershot = Browsershot::url($url)
    ->windowSize($width, $height)
    ->setScreenshotType('png') // Save as PNG first for Imagick processing
    ->setDelay($delayInMs)
    ->waitUntilNetworkIdle()
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

// Use a per-job temp file so concurrent workers do not overwrite each other
$tempPath = storage_path("app/crawlshot/images/{$uuid}.tmp.png");
$browsershot->save($tempPath);

// Convert to desired format using Imagick if needed
if ($format === 'webp') {
    $finalPath = storage_path("app/crawlshot/images/{$uuid}.webp");
    $imagick = new Imagick($tempPath);
    $imagick->setImageFormat('webp');
    $imagick->writeImage($finalPath);
    $imagick->clear(); // Release Imagick resources before deleting the source
    unlink($tempPath);
}

// HTML crawling configuration with EasyList filtering
$browsershot = Browsershot::url($url)
    ->setDelay($delayInMs)
    ->waitUntilNetworkIdle()
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

$html = $browsershot->bodyHtml();
```
### Job States
- **queued**: Job created, waiting for processing
- **processing**: Horizon worker running Browsershot
- **completed**: Result stored, available via status endpoint
- **failed**: Browsershot error, timeout, or invalid URL
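The four states above can be modeled as a backed enum with an explicit transition table. This is a sketch under assumptions rather than the project's actual model: the `JobStatus` name and its methods are hypothetical, and allowing `queued` to move directly to `failed` (for example on an invalid URL) is an assumption. It requires PHP 8.1+, which the stated PHP 8.3+ requirement covers.

```php
<?php
// Sketch only: JobStatus and its transition rules are illustrative assumptions.

enum JobStatus: string
{
    case Queued = 'queued';
    case Processing = 'processing';
    case Completed = 'completed';
    case Failed = 'failed';

    /** @return list<JobStatus> States that may follow the current one. */
    public function allowedNext(): array
    {
        return match ($this) {
            self::Queued => [self::Processing, self::Failed],
            self::Processing => [self::Completed, self::Failed],
            self::Completed, self::Failed => [],
        };
    }

    public function canTransitionTo(JobStatus $next): bool
    {
        return in_array($next, $this->allowedNext(), true);
    }

    /** Terminal states have no outgoing transitions. */
    public function isTerminal(): bool
    {
        return $this->allowedNext() === [];
    }
}
```

Because the enum is string-backed, `JobStatus::from($row->status)` would map a stored database value back to a case, and `->value` gives the string to return from the status endpoint.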
### Storage Strategy
- HTML results: `storage/app/crawlshot/html/{uuid}.html`
- Image results: `storage/app/crawlshot/images/{uuid}.jpg`, `.png`, or `.webp`
- Auto-cleanup scheduled job removes files after 24 hours
- Database tracks job metadata and file paths
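The path layout and the 24-hour expiry rule can be expressed as two small pure functions. This is a sketch, assuming hypothetical helper names (`resultRelativePath`, `isExpired`) and paths relative to `storage/app`; the project's cleanup job may be implemented differently.

```php
<?php
// Sketch only: helper names are assumptions; paths are relative to storage/app.

function resultRelativePath(string $uuid, string $type, string $format = 'jpg'): string
{
    if ($type === 'html') {
        return "crawlshot/html/{$uuid}.html";
    }
    if (!in_array($format, ['jpg', 'png', 'webp'], true)) {
        throw new InvalidArgumentException("Unsupported image format: {$format}");
    }
    return "crawlshot/images/{$uuid}.{$format}";
}

/**
 * A result expires once it is at least $ttlSeconds old (24 hours by default).
 * Both timestamps are Unix times in seconds.
 */
function isExpired(int $createdAt, int $now, int $ttlSeconds = 86400): bool
{
    return ($now - $createdAt) >= $ttlSeconds;
}
```

Passing `$now` in explicitly, instead of calling `time()` inside the function, keeps the expiry rule deterministic and easy to test.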
### Authentication & Security
- All API endpoints protected by Sanctum middleware
- Bearer token required in Authorization header
- Rate limiting on crawl endpoints to prevent abuse
- Input validation for URLs and parameters
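For the URL validation point above, one plausible check is to accept only well-formed `http` and `https` URLs with a host. The sketch below assumes a hypothetical helper named `isCrawlableUrl`; the scheme allowlist is an assumption, and the project's real validation rules may be stricter (for example, rejecting private network addresses).

```php
<?php
// Sketch only: isCrawlableUrl and its scheme allowlist are assumptions.

function isCrawlableUrl(string $url): bool
{
    // Reject strings that are not syntactically valid URLs.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host = parse_url($url, PHP_URL_HOST);

    // Only web URLs with a host are handed to the headless browser.
    return in_array($scheme, ['http', 'https'], true)
        && is_string($host)
        && $host !== '';
}
```

Restricting the scheme keeps inputs such as `file://` or `javascript:` URLs away from the browser process.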
### System Requirements
- PHP 8.3+ with extensions: gd, imagick (required for WebP format)
- Node.js and npm for Puppeteer
- Chrome/Chromium browser (headless)
- Sufficient disk space for temporary file storage
- Memory for concurrent Browsershot processes
### EasyList Integration
- Uses ProtonMail's php-adblock-parser (https://github.com/ProtonMail/php-adblock-parser)
- Service downloads and caches EasyList filters from https://easylist.to/easylist/easylist.txt
- php-adblock-parser handles filter parsing and URL matching
- Filters converted to domains/URLs for `blockDomains()` and `blockUrls()` methods
- Cache updated periodically to maintain current ad blocking effectiveness
- Cookie banner and tracker blocking use additional filter lists (EasyList Cookie, Fanboy's Annoyance)
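To illustrate how filter rules might become entries for `blockDomains()`, the sketch below pulls plain domain-anchor rules (`||example.com^`) out of a filter list. It is a deliberate simplification that does not use php-adblock-parser: the function name `extractBlockedDomains` is hypothetical, and rules with options (`$third-party`), exception rules (`@@`), element-hiding rules (`##`), and path patterns are skipped.

```php
<?php
// Sketch only: a simplified extractor, not the php-adblock-parser API.

/**
 * Collect domains from rules of the exact form "||domain^".
 *
 * @return list<string> Unique, lowercased domains in first-seen order.
 */
function extractBlockedDomains(string $filterList): array
{
    $domains = [];
    foreach (preg_split('/\R/', $filterList) as $line) {
        $line = trim($line);
        // Match only pure domain-anchor rules with no options or paths.
        if (preg_match('/^\|\|([a-z0-9][a-z0-9.-]*\.[a-z]{2,})\^$/i', $line, $m) === 1) {
            $domains[strtolower($m[1])] = true; // array keys deduplicate
        }
    }
    return array_keys($domains);
}
```

Rules that cannot be reduced to a bare domain would need real filter matching, which is the role the document assigns to php-adblock-parser.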
### Development Notes
- Horizon required for proper queue processing
- Chrome/Chromium must be accessible to PHP process
- Consider Docker for consistent browser environment
- Monitor disk usage due to temporary file storage
- EasyList filters cached locally for performance using php-adblock-parser
- Test with various websites for ad/tracker blocking effectiveness