# Crawlshot

A Laravel web crawling and screenshot service with dual deployment options:

1. **Standalone API Service** - Full Laravel application with REST API endpoints
2. **Laravel Package** - HTTP client package for use in other Laravel applications

## Architecture Overview

### Standalone API Service
The main Laravel application provides a complete web crawling and screenshot service:

- **Spatie Browsershot Integration** - Uses Puppeteer for browser automation
- **EasyList Ad Blocking** - Automatic ad/tracker blocking using EasyList filters
- **Queue Processing** - Laravel Horizon for async job processing
- **24-hour Cleanup** - Automatic file and database cleanup
- **Sanctum Authentication** - API token-based authentication
- **SQLite Database** - Stores job metadata and processing status

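
The 24-hour cleanup rule listed above comes down to a timestamp comparison. The sketch below shows one way to express it in plain PHP. The function name `isExpired` and the choice to compare against a job's creation time are assumptions for illustration, not the service's actual implementation.

```php
<?php

/**
 * Decide whether a job is past the retention window.
 * The 24-hour figure comes from the README; everything else
 * (name, signature, use of the creation time) is hypothetical.
 */
function isExpired(DateTimeImmutable $createdAt, DateTimeImmutable $now, int $retentionHours = 24): bool
{
    // Anything created at or before this instant is eligible for cleanup.
    $cutoff = $now->sub(new DateInterval('PT' . $retentionHours . 'H'));

    return $createdAt <= $cutoff;
}

$now = new DateTimeImmutable('2025-01-02 12:00:00', new DateTimeZone('UTC'));

var_dump(isExpired($now->modify('-25 hours'), $now)); // bool(true)
var_dump(isExpired($now->modify('-1 hour'), $now));   // bool(false)
```

In a Laravel application a check like this would typically run from a scheduled command that deletes the matching files and database rows.
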
### Laravel Package
Simple HTTP client package that provides a clean interface to the API:

- **8 Methods for 8 APIs** - Direct 1:1 mapping to REST endpoints
- **Facade Support** - Clean Laravel integration
- **Auto-discovery** - Automatic service provider registration

## Deployment Options

### Option 1: Standalone API Service

Deploy as a complete Laravel application:

```bash
git clone [repository]
cd crawlshot
composer install        # install PHP dependencies
npm install puppeteer   # headless browser driver used by Browsershot
php artisan migrate     # create the database tables
php artisan serve       # start the development server
```

**API Endpoints:**
- `POST /api/crawl` - Create HTML crawl job
- `GET /api/crawl/{uuid}` - Get crawl status/result
- `GET /api/crawl` - List all crawl jobs
- `POST /api/shot` - Create screenshot job
- `GET /api/shot/{uuid}` - Get screenshot status/result
- `GET /api/shot/{uuid}/download` - Download screenshot file
- `GET /api/shot` - List all screenshot jobs
- `GET /api/health` - Health check

**Example API Usage:**
```bash
# Create crawl job
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "block_ads": true}'

# Check status
curl -H "Authorization: Bearer {token}" \
  "https://crawlshot.test/api/crawl/{uuid}"
```

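
For callers that talk to the API from PHP without the package, the create-crawl request above can be assembled as plain data first. This is a minimal sketch: `buildCrawlRequest` is a hypothetical helper that only builds the request parts and sends nothing, and the token value is a placeholder.

```php
<?php

/**
 * Assemble the "create crawl job" request from the curl example.
 * Hypothetical helper: returns the request parts, performs no I/O.
 */
function buildCrawlRequest(string $baseUrl, string $token, string $url, array $options = []): array
{
    return [
        'method'  => 'POST',
        'url'     => rtrim($baseUrl, '/') . '/api/crawl',
        'headers' => [
            'Authorization: Bearer ' . $token,
            'Content-Type: application/json',
        ],
        // The target URL and the options share one flat JSON object,
        // matching the payload in the curl example.
        'body'    => json_encode(
            ['url' => $url] + $options,
            JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES
        ),
    ];
}

$request = buildCrawlRequest(
    'https://crawlshot.test/',
    'example-token',
    'https://example.com',
    ['block_ads' => true]
);

echo $request['url'], PHP_EOL;  // https://crawlshot.test/api/crawl
echo $request['body'], PHP_EOL; // {"url":"https://example.com","block_ads":true}
```

The returned array can be handed to whichever HTTP client the calling application already uses.
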
### Option 2: Laravel Package

Install as a package in your Laravel application:

```bash
composer require crawlshot/laravel
php artisan vendor:publish --tag=crawlshot-config
```

**Configuration:**
```env
CRAWLSHOT_BASE_URL=https://your-crawlshot-api.com
CRAWLSHOT_TOKEN=your-sanctum-token
```

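
A published Laravel config file usually reads these variables through `env()`. The fragment below is only a guess at what `config/crawlshot.php` might contain; the key names `base_url` and `token` are assumptions, so check the published file for the real ones.

```php
<?php

// config/crawlshot.php (sketch; key names are assumptions)
return [
    // Root URL of the standalone Crawlshot API.
    'base_url' => env('CRAWLSHOT_BASE_URL'),

    // Sanctum token sent as a Bearer token with each request.
    'token' => env('CRAWLSHOT_TOKEN'),
];
```
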
**Package Usage:**
```php
use Crawlshot\Laravel\Facades\Crawlshot;

// Create crawl job
$response = Crawlshot::createCrawl('https://example.com', [
    'block_ads' => true,
    'timeout' => 30
]);

// Check status
$status = Crawlshot::getCrawlStatus($response['uuid']);

// Create screenshot
$response = Crawlshot::createShot('https://example.com', [
    'format' => 'jpg',
    'width' => 1920,
    'height' => 1080
]);

// Download screenshot
$imageData = Crawlshot::downloadShot($response['uuid']);
file_put_contents('screenshot.jpg', $imageData);
```

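
Because jobs are processed asynchronously, a caller normally has to poll the status endpoint before downloading a result. The helper below is a sketch of such a loop. `waitForJob` is a hypothetical name, and the `status` key with its `completed` and `failed` values is an assumption about the response payload, which this README does not specify.

```php
<?php

/**
 * Poll a job until it reaches a terminal state or the attempts run out.
 *
 * $fetchStatus is any callable returning the status payload as an array.
 * The 'status' key and its 'completed' / 'failed' values are assumptions.
 */
function waitForJob(callable $fetchStatus, int $maxAttempts = 30, int $delaySeconds = 2): array
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $payload = $fetchStatus();
        $state = $payload['status'] ?? null;

        if ($state === 'completed' || $state === 'failed') {
            return $payload;
        }

        // Pause between polls, but not after the final attempt.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    throw new RuntimeException("Job did not finish after {$maxAttempts} attempts");
}

// Demo with a canned sequence of responses instead of real API calls.
$responses = [['status' => 'queued'], ['status' => 'processing'], ['status' => 'completed']];
$result = waitForJob(function () use (&$responses) {
    return array_shift($responses);
}, 5, 0);

echo $result['status'], PHP_EOL; // completed
```

With the package installed, the callable could wrap the facade, for example `waitForJob(fn () => Crawlshot::getCrawlStatus($uuid))`. Passing the fetcher in as a callable keeps the loop testable without a running API.
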
## API Reference

### Available Methods (Package)

| Method | API Endpoint | Description |
|--------|--------------|-------------|
| `createCrawl(string $url, array $options = [])` | `POST /api/crawl` | Create crawl job |
| `getCrawlStatus(string $uuid)` | `GET /api/crawl/{uuid}` | Get crawl status |
| `listCrawls()` | `GET /api/crawl` | List all crawl jobs |
| `createShot(string $url, array $options = [])` | `POST /api/shot` | Create screenshot job |
| `getShotStatus(string $uuid)` | `GET /api/shot/{uuid}` | Get screenshot status |
| `downloadShot(string $uuid)` | `GET /api/shot/{uuid}/download` | Download screenshot file |
| `listShots()` | `GET /api/shot` | List all screenshot jobs |
| `health()` | `GET /api/health` | Health check |

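
The 1:1 mapping in the table can also be written down as data, which is handy for logging or for building a test double. The constant `CRAWLSHOT_ENDPOINTS` and the function `endpointFor` are hypothetical names used only for this sketch; the package itself may organize this differently.

```php
<?php

// Method-to-endpoint map copied from the table above.
const CRAWLSHOT_ENDPOINTS = [
    'createCrawl'    => ['POST', '/api/crawl'],
    'getCrawlStatus' => ['GET',  '/api/crawl/{uuid}'],
    'listCrawls'     => ['GET',  '/api/crawl'],
    'createShot'     => ['POST', '/api/shot'],
    'getShotStatus'  => ['GET',  '/api/shot/{uuid}'],
    'downloadShot'   => ['GET',  '/api/shot/{uuid}/download'],
    'listShots'      => ['GET',  '/api/shot'],
    'health'         => ['GET',  '/api/health'],
];

/**
 * Resolve a client method name to its HTTP verb and concrete path.
 */
function endpointFor(string $method, ?string $uuid = null): array
{
    if (!array_key_exists($method, CRAWLSHOT_ENDPOINTS)) {
        throw new InvalidArgumentException("Unknown method: {$method}");
    }

    [$verb, $path] = CRAWLSHOT_ENDPOINTS[$method];

    // Fill in the job UUID for the endpoints that need one.
    if (str_contains($path, '{uuid}')) {
        if ($uuid === null) {
            throw new InvalidArgumentException("{$method} requires a job UUID");
        }
        $path = str_replace('{uuid}', rawurlencode($uuid), $path);
    }

    return [$verb, $path];
}

[$verb, $path] = endpointFor('downloadShot', '123');
echo $verb, ' ', $path, PHP_EOL; // GET /api/shot/123/download
```
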
### Crawl Options

```php
[
    'block_ads' => true,            // Block ads using EasyList
    'block_trackers' => true,       // Block tracking scripts
    'timeout' => 30,                // Request timeout in seconds
    'user_agent' => 'Custom UA',    // Custom user agent
    'wait_until' => 'networkidle0'  // Wait condition
]
```

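
Catching a mistyped option before the request is sent can save a round trip. The sketch below merges caller options over defaults and validates them. The defaults simply mirror the example values above, `normalizeCrawlOptions` is a hypothetical helper, and whether the service applies the same defaults or checks is an assumption. The accepted `wait_until` values are Puppeteer's documented `waitUntil` events.

```php
<?php

/**
 * Merge caller options over defaults and reject unknown keys or bad values.
 * Hypothetical helper; the defaults are assumptions taken from the example.
 */
function normalizeCrawlOptions(array $options): array
{
    $defaults = [
        'block_ads'      => true,
        'block_trackers' => true,
        'timeout'        => 30,
        'user_agent'     => null, // null means "use the browser default"
        'wait_until'     => 'networkidle0',
    ];

    $unknown = array_diff_key($options, $defaults);
    if ($unknown !== []) {
        throw new InvalidArgumentException('Unknown option(s): ' . implode(', ', array_keys($unknown)));
    }

    $merged = array_merge($defaults, $options);

    // Puppeteer's documented waitUntil values.
    $waitValues = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];
    if (!in_array($merged['wait_until'], $waitValues, true)) {
        throw new InvalidArgumentException('Unsupported wait_until value');
    }

    if (!is_int($merged['timeout']) || $merged['timeout'] <= 0) {
        throw new InvalidArgumentException('timeout must be a positive integer (seconds)');
    }

    return $merged;
}

$options = normalizeCrawlOptions(['timeout' => 10, 'wait_until' => 'load']);
echo $options['timeout'], ' ', $options['wait_until'], PHP_EOL; // 10 load
```
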
### Screenshot Options

```php
[
    'format' => 'jpg',      // jpg, png, webp
    'quality' => 90,        // 1-100 for jpg/webp
    'width' => 1920,        // Viewport width
    'height' => 1080,       // Viewport height
    'full_page' => true,    // Capture full page
    'block_ads' => true,    // Block ads
    'timeout' => 30         // Request timeout
]
```

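
The comments above imply two constraints: `format` is one of three values, and `quality` only matters for `jpg` and `webp`. A client-side check for them might look like the sketch below. `validateShotOptions` is a hypothetical helper, and both the `jpg` default and the choice to drop `quality` for `png` are assumptions rather than documented behaviour.

```php
<?php

/**
 * Check the format/quality/dimension constraints described above.
 * Hypothetical helper; the 'jpg' default is an assumption.
 */
function validateShotOptions(array $options): array
{
    $format = $options['format'] ?? 'jpg';
    if (!in_array($format, ['jpg', 'png', 'webp'], true)) {
        throw new InvalidArgumentException('format must be jpg, png, or webp');
    }
    $options['format'] = $format;

    if (array_key_exists('quality', $options)) {
        if ($format === 'png') {
            // Quality only applies to jpg/webp, so drop it for png.
            unset($options['quality']);
        } elseif (!is_int($options['quality']) || $options['quality'] < 1 || $options['quality'] > 100) {
            throw new InvalidArgumentException('quality must be an integer from 1 to 100');
        }
    }

    foreach (['width', 'height'] as $dimension) {
        if (isset($options[$dimension]) && (!is_int($options[$dimension]) || $options[$dimension] <= 0)) {
            throw new InvalidArgumentException("{$dimension} must be a positive integer");
        }
    }

    return $options;
}

echo json_encode(validateShotOptions(['format' => 'png', 'quality' => 90])), PHP_EOL;
// {"format":"png"}
```
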
## Features

### Core Functionality
- **HTML Crawling** - Extract clean HTML content from web pages
- **Screenshot Capture** - Generate high-quality screenshots (JPG, PNG, WebP)
- **Ad Blocking** - Built-in EasyList integration for ad/tracker blocking
- **Queue Processing** - Async job processing with Laravel Horizon
- **File Management** - Automatic cleanup after 24 hours

### Technical Features
- **Laravel 12** support with PHP 8.3+
- **Puppeteer Integration** via Spatie Browsershot
- **Sanctum Authentication** for API security
- **SQLite Database** with migrations
- **Auto-discovery** for package installation
- **Environment Configuration** via `.env` variables

## Development

### Requirements
- PHP 8.3+
- Laravel 12.0+
- Node.js with Puppeteer
- SQLite (or other database)
- ImageMagick extension

### Key Dependencies
- `spatie/browsershot` - Browser automation
- `protonlabs/php-adblock-parser` - EasyList parsing
- `laravel/horizon` - Queue monitoring (standalone)
- `laravel/sanctum` - API authentication (standalone)

### File Structure

```
├── app/                          # Laravel application (standalone)
│   ├── Http/Controllers/Api/     # API controllers
│   ├── Jobs/                     # Queue jobs
│   ├── Models/                   # Eloquent models
│   └── Services/                 # Core services
├── src/                          # Package source (both modes)
│   ├── CrawlshotClient.php       # HTTP client (package mode)
│   ├── CrawlshotServiceProvider.php
│   ├── Facades/Crawlshot.php
│   └── config/crawlshot.php
├── routes/api.php                # API routes (standalone)
├── database/migrations/          # Database schema
└── composer.json                 # Package definition
```

## License

MIT