# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Crawlshot is a self-hosted API service built on Laravel 12 that provides web crawling and screenshot capabilities using Spatie Browsershot. It exposes browser automation through a REST API with token authentication and background job processing.

### Core Features

- **Web Crawling**: HTML extraction using headless Chrome via Spatie Browsershot
- **Screenshots**: Image capture via Browsershot with customizable dimensions, with Imagick used for format conversion
- **Ad/Tracker Blocking**: Built-in blocking of ads, cookie banners, and trackers
- **Authentication**: Laravel Sanctum API token authentication
- **Job Processing**: Laravel Horizon for background job management
- **Temporary Storage**: 24-hour auto-deletion of crawl results
- **Status Tracking**: UUID-based job status monitoring

### Technology Stack

- **Backend**: PHP 8.3+ with Laravel 12 framework
- **Browser Automation**: Spatie Browsershot (Puppeteer/Chrome headless)
- **Queue System**: Laravel Horizon for job processing
- **Authentication**: Laravel Sanctum for API tokens
- **Testing**: Pest PHP testing framework
- **Database**: SQLite (development) for job tracking and API tokens

## API Endpoints

### Core API Routes

```
POST /api/crawl
- Initiates crawling/screenshot job
- Parameters: url, type (html|image), width, height, timeout
- Returns: {"uuid": "job-uuid", "status": "queued"}

GET /api/crawl/{uuid}
- Checks job status and retrieves results
- Returns: {"status": "processing|completed|failed", "result": "html content or image url"}
```

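A request to these routes might look like the following shell sketch. It assumes `curl` is installed. The `CRAWLSHOT_URL` and `CRAWLSHOT_TOKEN` variables, the function names, and the default values for `type` and `timeout` are hypothetical; the JSON body simply mirrors the parameters listed above.

```bash
# Hypothetical base URL; adjust to wherever the service is running.
CRAWLSHOT_URL="${CRAWLSHOT_URL:-http://localhost:8000}"

# Build the JSON body for POST /api/crawl.
# $1 = url, $2 = type (default html), $3 = timeout in seconds (default 30)
# No JSON escaping is done, so the URL must not contain quotes or backslashes.
build_crawl_body() {
  printf '{"url":"%s","type":"%s","timeout":%d}' "$1" "${2:-html}" "${3:-30}"
}

# Submit a job and print the raw JSON response (expected to carry uuid and status).
submit_crawl() {
  curl -sS -X POST "$CRAWLSHOT_URL/api/crawl" \
    -H "Authorization: Bearer $CRAWLSHOT_TOKEN" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json" \
    -d "$(build_crawl_body "$@")"
}

# Fetch the current status of a job by UUID.
get_crawl() {
  curl -sS "$CRAWLSHOT_URL/api/crawl/$1" \
    -H "Authorization: Bearer $CRAWLSHOT_TOKEN" \
    -H "Accept: application/json"
}
```

With a token exported as `CRAWLSHOT_TOKEN`, `submit_crawl https://example.com html 30` would queue a job and `get_crawl <uuid>` would check on it.
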
### Supported Parameters (mapped to Browsershot capabilities)

**HTML Crawling**:

- `url`: Target URL to crawl
- `timeout`: Request timeout in seconds (via `timeout()` method)
- `block_ads`: true/false - Uses EasyList filter (https://easylist.to/easylist/easylist.txt)
- `block_cookie_banners`: true/false - Uses cookie banner blocking patterns
- `block_trackers`: true/false - Uses tracker blocking patterns
- `delay`: Wait time before capture in milliseconds (via `setDelay()`)
- `wait_until_network_idle`: Wait for network activity to cease (via `waitUntilNetworkIdle()`)

**Screenshot Capture**:

- `url`: Target URL to screenshot
- `viewport_width`: Viewport width (via `windowSize()` method)
- `viewport_height`: Viewport height (via `windowSize()` method)
- `format`: jpg, png, webp (via Imagick post-processing)
- `quality`: Image quality 1-100 for JPEG (via `setScreenshotType('jpeg', quality)`)
- `block_ads`: true/false - Uses EasyList filter for ad blocking
- `block_cookie_banners`: true/false - Uses cookie banner blocking patterns
- `block_trackers`: true/false - Uses tracker blocking patterns
- `timeout`: Request timeout in seconds (via `timeout()` method)
- `delay`: Wait time before capture in milliseconds (via `setDelay()`)

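The documented value ranges for `format` and `quality` can be checked on the client before a request is sent. The sketch below is a hypothetical pre-check (the function name is made up) and does not represent the service's own validation rules, which are not shown in this file.

```bash
# Return 0 when the screenshot parameters fall inside the documented ranges.
# $1 = format (jpg, png, or webp), $2 = quality (integer 1-100)
validate_shot_params() {
  case "$1" in
    jpg|png|webp) ;;
    *) echo "unsupported format: $1" >&2; return 1 ;;
  esac
  # Reject empty or non-numeric quality values before comparing them.
  case "$2" in
    ''|*[!0-9]*) echo "quality must be an integer: $2" >&2; return 1 ;;
  esac
  if [ "$2" -lt 1 ] || [ "$2" -gt 100 ]; then
    echo "quality out of range (1-100): $2" >&2
    return 1
  fi
}
```

For example, `validate_shot_params webp 80` succeeds, while `validate_shot_params gif 80` prints an error and returns a non-zero status.
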
## Development Commands

### Starting the Development Environment

The user starts the development environment. Do not start it yourself; prompt the user to start it instead.

### Queue Management with Horizon

The user starts Horizon. Do not start it yourself; prompt the user to start it instead.

- Horizon dashboard available at: `/horizon`
- Use it to monitor job queues, failed jobs, and metrics

### Individual Services

Do not start them yourself; prompt the user to start them instead.

### Testing

```bash
# Run all tests using Pest
composer run test

# Run API endpoint tests
php artisan test --filter=Api

# Test Browsershot functionality
php artisan test tests/Feature/BrowsershotTest.php
```

### Database Operations

Never run database migrations yourself; prompt the user to run them instead.

### API Token Management

```bash
# Generate API tokens via Tinker
php artisan tinker
# User::find(1)->createToken('client-name')->plainTextToken

# Prune expired tokens
php artisan sanctum:prune-expired --hours=24
```

### Storage Management

```bash
# Prune expired crawl results (HTML and images older than 24 hours)
php artisan crawlshot:prune-storage

# Run storage cleanup via scheduled job
php artisan schedule:run
```

### Browsershot Setup Requirements

```bash
# Install Node.js and Puppeteer dependencies
npm install puppeteer

# For production servers, ensure Chrome/Chromium is installed
# Ubuntu/Debian: apt-get install chromium-browser
# Alpine: apk add chromium
# Or use Puppeteer's bundled Chromium
```

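Before running jobs, it can help to confirm that a browser binary is actually reachable on `PATH`. The helper below is a hypothetical sketch; the candidate binary names in the usage note are common defaults and may differ on a given system.

```bash
# Print the path of the first candidate binary found on PATH.
# Candidate names are passed as arguments so the list is easy to adjust.
find_browser() {
  for name in "$@"; do
    if path=$(command -v "$name" 2>/dev/null); then
      printf '%s\n' "$path"
      return 0
    fi
  done
  echo "no Chrome/Chromium binary found" >&2
  return 1
}
```

A typical call would be `find_browser chromium chromium-browser google-chrome`, which prints the first match or returns a non-zero status when none of the names resolve.
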
## Architecture Overview

### Job Processing Flow

1. **Crawl API Request** → `/api/crawl` with URL and parameters
2. **Screenshot API Request** → `/api/shot` with URL and parameters
3. **Job Creation** → Queue job with UUID, store in database
4. **Processing** → Horizon worker uses Browsershot to capture content
5. **Storage** → Save HTML/image to storage with 24h expiry
6. **Status Check** → `/api/crawl/{uuid}` (or `/api/shot/{uuid}` for screenshots) returns result when ready

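From a client's point of view, the flow above ends in a polling loop on the status endpoint. The following is a minimal sketch, assuming `curl` and the hypothetical `CRAWLSHOT_URL` and `CRAWLSHOT_TOKEN` variables; the function names, retry count, and interval are illustrative defaults, not values taken from the service.

```bash
# Print the JSON for a job. Kept as a separate function so it can be swapped out.
fetch_status() {
  curl -sS "${CRAWLSHOT_URL:-http://localhost:8000}/api/crawl/$1" \
    -H "Authorization: Bearer $CRAWLSHOT_TOKEN" \
    -H "Accept: application/json"
}

# Pull the "status" value out of a flat JSON response read from stdin.
# This is a simple pattern match; use a real JSON parser such as jq for nested data.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll until the job reaches a terminal state or the attempts run out.
# $1 = uuid, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
# Returns 0 on completed, 1 on failed, 2 when attempts are exhausted.
wait_for_job() {
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    job_status=$(fetch_status "$1" | extract_status)
    case "$job_status" in
      completed) echo completed; return 0 ;;
      failed)    echo failed;    return 1 ;;
    esac
    attempts=$((attempts - 1))
    sleep "${3:-2}"
  done
  echo timeout
  return 2
}
```

Any status other than `completed` or `failed` (for example `queued` or `processing`) is treated as "still running", so the loop keeps waiting.
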
### Directory Structure

```
app/
├── Http/Controllers/Api/
│   ├── CrawlController.php      # Main API endpoints (/crawl, /crawl/{uuid})
│   └── ShotController.php       # Main API endpoints (/shot, /shot/{uuid})
├── Jobs/
│   ├── ProcessCrawlShotJob.php  # Browsershot integration
│   └── CleanupOldResults.php    # Auto-delete expired files
├── Models/
│   ├── CrawlShotJob.php         # Job tracking model
│   └── User.php                 # API token authentication
└── Services/
    ├── BrowsershotService.php   # Browsershot wrapper with filtering
    └── EasyListService.php      # ProtonMail php-adblock-parser wrapper

storage/app/crawlshot/           # Temporary result storage (24h TTL)
├── html/                        # HTML crawl results
└── images/                      # Screenshot files (JPEG/PNG/WebP)

routes/
└── api.php                      # /crawl endpoints with Sanctum auth
```

### Browsershot Configuration

```php
use App\Services\EasyListService;
use Spatie\Browsershot\Browsershot;

// Basic screenshot configuration with EasyList ad blocking
$browsershot = Browsershot::url($url)
    ->windowSize($width, $height)
    ->setScreenshotType('png') // Save as PNG first for Imagick processing
    ->setDelay($delayInMs)
    ->waitUntilNetworkIdle()
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

// NOTE: a fixed temp file name can collide when several workers run at once;
// consider including the job UUID in the name.
$tempPath = storage_path('temp_screenshot.png');
$browsershot->save($tempPath);

// Convert to desired format using Imagick if needed
// ($finalPath is the destination file, supplied by the caller)
if ($format === 'webp') {
    $imagick = new Imagick($tempPath);
    $imagick->setImageFormat('webp');
    $imagick->writeImage($finalPath);
    unlink($tempPath);
}

// HTML crawling configuration with EasyList filtering
$browsershot = Browsershot::url($url)
    ->setDelay($delayInMs)
    ->waitUntilNetworkIdle()
    ->timeout($timeoutInSeconds);

// Apply EasyList filters if block_ads is true
if ($blockAds) {
    $blockedDomains = EasyListService::getBlockedDomains($url);
    $blockedUrls = EasyListService::getBlockedUrls($url);
    $browsershot->blockDomains($blockedDomains)->blockUrls($blockedUrls);
}

$html = $browsershot->bodyHtml();
```

### Job States

- **queued**: Job created, waiting for processing
- **processing**: Horizon worker running Browsershot
- **completed**: Result stored, available via status endpoint
- **failed**: Browsershot error, timeout, or invalid URL

### Storage Strategy

- HTML results: `storage/app/crawlshot/html/{uuid}.html`
- Image results: `storage/app/crawlshot/images/{uuid}.jpg`, `.png`, or `.webp`
- Auto-cleanup scheduled job removes files after 24 hours
- Database tracks job metadata and file paths

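On disk, the 24-hour cleanup amounts to deleting result files by modification time. The sketch below is a shell approximation of that step only; it is not the implementation of `crawlshot:prune-storage` or `CleanupOldResults`, which are not shown here, and it does not touch the job records in the database. The function name is hypothetical, and `-mmin` and `-delete` are GNU/BSD `find` extensions.

```bash
# Delete result files last modified more than 24 hours (1440 minutes) ago,
# printing each path as it is removed.
# $1 = storage root, e.g. storage/app/crawlshot
prune_expired() {
  find "$1" -type f -mmin +1440 -print -delete
}
```

Running `prune_expired storage/app/crawlshot` would cover both the `html/` and `images/` subdirectories, since `find` descends recursively.
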
### Authentication & Security

- All API endpoints protected by Sanctum middleware
- Bearer token required in Authorization header
- Rate limiting on crawl endpoints to prevent abuse
- Input validation for URLs and parameters

### System Requirements

- PHP 8.3+ with extensions: gd, imagick (required for WebP format)
- Node.js and npm for Puppeteer
- Chrome/Chromium browser (headless)
- Sufficient disk space for temporary file storage
- Memory for concurrent Browsershot processes

### EasyList Integration

- Uses ProtonMail's php-adblock-parser (https://github.com/ProtonMail/php-adblock-parser)
- Service downloads and caches EasyList filters from https://easylist.to/easylist/easylist.txt
- php-adblock-parser handles filter parsing and URL matching
- Filters converted to domains/URLs for `blockDomains()` and `blockUrls()` methods
- Cache updated periodically to maintain current ad blocking effectiveness
- Cookie banner and tracker blocking use additional filter lists (EasyList Cookie, Fanboy's Annoyance)

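The download, cache, and convert-to-domains steps above can be illustrated in shell. This is a deliberately simplified sketch: in the project itself, parsing is handled by php-adblock-parser inside `EasyListService`, whose code is not shown here. The function names and the default refresh age of 1440 minutes are assumptions, and only plain `||domain^` rules are extracted.

```bash
# Return 0 when the cached list is missing or older than the given age.
# $1 = cache file, $2 = max age in minutes
needs_refresh() {
  [ ! -f "$1" ] || [ -n "$(find "$1" -mmin +"$2")" ]
}

# Download EasyList into the cache file only when a refresh is due.
# Writes to a temporary file first so a failed download keeps the old cache.
# $1 = cache file, $2 = max age in minutes (default 1440)
refresh_easylist() {
  if needs_refresh "$1" "${2:-1440}"; then
    curl -sSfL -o "$1.tmp" https://easylist.to/easylist/easylist.txt \
      && mv "$1.tmp" "$1"
  fi
}

# Print the unique hostnames from plain domain rules such as ||ads.example.com^
# Comments (!), exceptions (@@), rules with $options, and path patterns are skipped.
extract_blocked_domains() {
  sed -n 's/^||\([A-Za-z0-9.-]*\)\^$/\1/p' "$1" | sort -u
}
```

The resulting hostname list is the kind of input that `blockDomains()` expects; rules that carry options or paths need a real filter parser to be handled correctly.
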
### Development Notes

- Horizon required for proper queue processing
- Chrome/Chromium must be accessible to PHP process
- Consider Docker for consistent browser environment
- Monitor disk usage due to temporary file storage
- EasyList filters cached locally for performance using php-adblock-parser
- Test with various websites for ad/tracker blocking effectiveness