Update
327
README.md
@@ -1,198 +1,213 @@
# Crawlshot

A Laravel web crawling and screenshot service with dual deployment options:
[](https://opensource.org/licenses/MIT)
[](https://laravel.com)
[](https://php.net)

1. **Standalone API Service** - Full Laravel application with REST API endpoints
2. **Laravel Package** - HTTP client package for use in other Laravel applications
**High-performance web crawling and screenshot service** built with Laravel, featuring intelligent ad blocking, webhook notifications, and a powerful fluent PHP client.

## Architecture Overview
🎯 **Perfect for:** Content monitoring • Screenshot automation • QA testing • Social media previews • Compliance archival

### Standalone API Service
The main Laravel application provides a complete web crawling and screenshot service:
## ✨ Key Features

- **Spatie Browsershot Integration** - Uses Puppeteer for browser automation
- **EasyList Ad Blocking** - Automatic ad/tracker blocking using EasyList filters
- **Queue Processing** - Laravel Horizon for async job processing
- **24-hour Cleanup** - Automatic file and database cleanup
- **Sanctum Authentication** - API token-based authentication
- **SQLite Database** - Stores job metadata and processing status
- 🚀 **Dual Deployment**: Standalone API service or Laravel package
- 🔗 **Webhook Notifications**: Real-time updates with progressive retry
- 🎨 **Fluent Interface**: `$client->crawl($url)->webhookUrl($webhook)->create()`
- 📦 **Typed Responses**: `$result->isCompleted()`, `$shot->getDimensions()`
- 🛡️ **Smart Blocking**: EasyList ad/tracker/cookie banner filtering
- ⚡ **Background Processing**: Laravel Horizon queue management
- 🔄 **Auto-cleanup**: 24-hour file retention with scheduled cleanup
- 🔐 **Secure**: Laravel Sanctum API authentication

### Laravel Package
Simple HTTP client package that provides a clean interface to the API:
## 📚 Documentation

- **8 Methods for 8 APIs** - Direct 1:1 mapping to REST endpoints
- **Facade Support** - Clean Laravel integration
- **Auto-discovery** - Automatic service provider registration
- 📖 **[API Documentation](API_DOCUMENTATION.md)** - Complete REST API reference with webhook system
- 🔧 **[Client Documentation](CLIENT_DOCUMENTATION.md)** - PHP client library guide with fluent interface
- ⚙️ **[Setup Guide](SETUP.md)** - Detailed installation and configuration

## Deployment Options
## 🚀 Quick Start

### Option 1: Standalone API Service

Deploy as a complete Laravel application:
Deploy your own Crawlshot API server:

```bash
git clone [repository]
cd crawlshot
composer install
npm install puppeteer
php artisan migrate
php artisan serve
```

**API Endpoints:**
- `POST /api/crawl` - Create HTML crawl job
- `GET /api/crawl/{uuid}` - Get crawl status/result
- `GET /api/crawl` - List all crawl jobs
- `POST /api/shot` - Create screenshot job
- `GET /api/shot/{uuid}` - Get screenshot status/result
- `GET /api/shot/{uuid}/download` - Download screenshot file
- `GET /api/shot` - List all screenshot jobs
- `GET /api/health` - Health check

**Example API Usage:**
```bash
# Create crawl job
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "block_ads": true}'

# Check status
curl -H "Authorization: Bearer {token}" \
  "https://crawlshot.test/api/crawl/{uuid}"
composer install && npm install puppeteer
php artisan migrate && php artisan serve
```

### Option 2: Laravel Package

Install as a package in your Laravel application:
Use as a client library in your Laravel app:

```bash
composer require crawlshot/laravel
php artisan vendor:publish --tag=crawlshot-config
```

**Configuration:**
```env
CRAWLSHOT_BASE_URL=https://your-crawlshot-api.com
CRAWLSHOT_TOKEN=your-sanctum-token
```

**Package Usage:**
```php
use Crawlshot\Laravel\Facades\Crawlshot;

// Create crawl job
$response = Crawlshot::createCrawl('https://example.com', [
    'block_ads' => true,
    'timeout' => 30
]);

// Check status
$status = Crawlshot::getCrawlStatus($response['uuid']);

// Create screenshot
$response = Crawlshot::createShot('https://example.com', [
    'format' => 'jpg',
    'width' => 1920,
    'height' => 1080
]);

// Download screenshot
$imageData = Crawlshot::downloadShot($response['uuid']);
file_put_contents('screenshot.jpg', $imageData);
```

## API Reference

### Available Methods (Package)

| Method | API Endpoint | Description |
|--------|--------------|-------------|
| `createCrawl(string $url, array $options = [])` | `POST /api/crawl` | Create crawl job |
| `getCrawlStatus(string $uuid)` | `GET /api/crawl/{uuid}` | Get crawl status |
| `listCrawls()` | `GET /api/crawl` | List all crawl jobs |
| `createShot(string $url, array $options = [])` | `POST /api/shot` | Create screenshot job |
| `getShotStatus(string $uuid)` | `GET /api/shot/{uuid}` | Get screenshot status |
| `downloadShot(string $uuid)` | `GET /api/shot/{uuid}/download` | Download screenshot file |
| `listShots()` | `GET /api/shot` | List all screenshot jobs |
| `health()` | `GET /api/health` | Health check |

### Crawl Options

```php
[
    'block_ads' => true, // Block ads using EasyList
    'block_trackers' => true, // Block tracking scripts
    'timeout' => 30, // Request timeout in seconds
    'user_agent' => 'Custom UA', // Custom user agent
    'wait_until' => 'networkidle0' // Wait condition
]
$client = new CrawlshotClient('https://crawlshot.test', 'your-token');
```

### Screenshot Options
## ⚡ Modern Usage Examples

### Fluent Interface with Webhooks

```php
[
    'format' => 'jpg', // jpg, png, webp
    'quality' => 90, // 1-100 for jpg/webp
    'width' => 1920, // Viewport width
    'height' => 1080, // Viewport height
    'full_page' => true, // Capture full page
    'block_ads' => true, // Block ads
    'timeout' => 30 // Request timeout
]
use Crawlshot\Laravel\CrawlshotClient;

$client = new CrawlshotClient('https://crawlshot.test', 'your-token');

// HTML Crawling with webhook notifications
$crawl = $client->crawl('https://example.com')
    ->webhookUrl('https://myapp.com/webhook')
    ->webhookEventsFilter(['completed', 'failed'])
    ->blockAds(true)
    ->timeout(60)
    ->create();

echo "Job: {$crawl->getUuid()} - Status: {$crawl->getStatus()}";

// Screenshot with custom dimensions
$shot = $client->shot('https://dashboard.example.com')
    ->viewportSize(1920, 1080)
    ->quality(90)
    ->webhookUrl('https://myapp.com/webhook')
    ->create();

if ($shot->isCompleted()) {
    $dimensions = $shot->getDimensions(); // [1920, 1080]
    $imageData = $shot->downloadImage(); // Binary data
}
```

## Features
### Webhook Handler Example

### Core Functionality
- **HTML Crawling** - Extract clean HTML content from web pages
- **Screenshot Capture** - Generate high-quality screenshots (JPG, PNG, WebP)
- **Ad Blocking** - Built-in EasyList integration for ad/tracker blocking
- **Queue Processing** - Async job processing with Laravel Horizon
- **File Management** - Automatic cleanup after 24 hours
```php
Route::post('/webhook', function (Request $request) {
    $job = $request->all();

    if ($job['status'] === 'completed') {
        if (isset($job['result']['html'])) {
            // Process HTML crawl result
            $html = $job['result']['html']['raw'];
        } elseif (isset($job['result']['image'])) {
            // Process screenshot result
            $imageUrl = $job['result']['image']['url'];
        }
    }

    return response('OK', 200);
});
```
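
The branching in the webhook handler can also be pulled out into a plain function, which makes it easy to test without a running Laravel app. The following is only a sketch: the function name `summarizeWebhookPayload` is hypothetical, and the payload keys (`status`, `result.html.raw`, `result.image.url`) are taken from the handler in this README rather than verified against the API documentation.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: classify a decoded Crawlshot webhook payload.
 * The key names mirror the README's handler example and are assumptions.
 *
 * @param array<string, mixed> $job
 * @return array{type: string, value: ?string}
 */
function summarizeWebhookPayload(array $job): array
{
    // Anything other than a completed job carries no result to process.
    if (($job['status'] ?? null) !== 'completed') {
        return ['type' => 'ignored', 'value' => null];
    }

    // HTML crawl result.
    if (isset($job['result']['html']['raw'])) {
        return ['type' => 'html', 'value' => (string) $job['result']['html']['raw']];
    }

    // Screenshot result.
    if (isset($job['result']['image']['url'])) {
        return ['type' => 'image', 'value' => (string) $job['result']['image']['url']];
    }

    // Completed, but in a shape this sketch does not recognise.
    return ['type' => 'unknown', 'value' => null];
}
```

Inside a route closure this could be called as `summarizeWebhookPayload($request->all())`, keeping the closure itself down to dispatching on the returned `type`.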

### Technical Features
- **Laravel 12** support with PHP 8.3+
- **Puppeteer Integration** via Spatie Browsershot
- **Sanctum Authentication** for API security
- **SQLite Database** with migrations
- **Auto-discovery** for package installation
- **Environment Configuration** via .env variables
### Direct API Usage

## Development
```bash
# HTML crawl with webhook
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "webhook_url": "https://myapp.com/webhook",
    "webhook_events_filter": ["completed"],
    "block_ads": true
  }'

### Requirements
- PHP 8.3+
- Laravel 12.0+
- Node.js with Puppeteer
- SQLite (or other database)
- ImageMagick extension
# Screenshot with custom viewport
curl -X POST "https://crawlshot.test/api/shot" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "viewport_width": 1200,
    "viewport_height": 800,
    "webhook_url": "https://myapp.com/webhook"
  }'
```
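
For callers that assemble the request body in PHP instead of writing JSON by hand, a small builder along these lines may help. It is a sketch under assumptions: `buildCrawlPayload` is a hypothetical name, the field names are copied from the curl example, and rejecting event names outside `queued`, `processing`, `completed`, `failed` is a client-side choice here, not documented server behaviour.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: build the JSON body for POST /api/crawl.
 * Field names follow the curl example in this README.
 *
 * @param list<string> $events Webhook events to subscribe to (optional).
 */
function buildCrawlPayload(
    string $url,
    ?string $webhookUrl = null,
    array $events = [],
    bool $blockAds = true
): string {
    // Event names listed in this README; validated locally (assumption).
    $allowed = ['queued', 'processing', 'completed', 'failed'];
    $unknown = array_diff($events, $allowed);
    if ($unknown !== []) {
        throw new InvalidArgumentException(
            'Unknown webhook event(s): ' . implode(', ', $unknown)
        );
    }

    $payload = ['url' => $url, 'block_ads' => $blockAds];

    // Only send webhook fields when a webhook URL is given.
    if ($webhookUrl !== null) {
        $payload['webhook_url'] = $webhookUrl;
        if ($events !== []) {
            $payload['webhook_events_filter'] = array_values($events);
        }
    }

    return json_encode($payload, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
}
```

The returned string can be passed as the request body to any HTTP client, together with the `Authorization: Bearer` and `Content-Type: application/json` headers shown in the curl example.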

## 🎯 Core APIs

### HTML Crawling
- `POST /api/crawl` - Create HTML crawl job with ad blocking
- `GET /api/crawl/{uuid}` - Get crawl status and results
- `GET /api/crawl/{uuid}.html` - Download HTML file directly

### Screenshot Capture
- `POST /api/shot` - Create screenshot job (always WebP format)
- `GET /api/shot/{uuid}` - Get screenshot status and results
- `GET /api/shot/{uuid}.webp` - Download image file directly

### Webhook Management
- `GET /api/webhook-errors` - List failed webhook deliveries
- `POST /api/webhook-errors/{uuid}/retry` - Retry failed webhook
- `DELETE /api/webhook-errors/{uuid}/clear` - Clear webhook error
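
The direct-download routes listed above differ only in resource name and file extension, so the path can be derived from the job type. A minimal sketch, assuming job identifiers are canonical hyphenated UUIDs (the README only calls them `{uuid}`); the function name `directDownloadPath` is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: build the direct-download path for a finished job.
 * Paths follow the endpoint list in this README.
 */
function directDownloadPath(string $kind, string $uuid): string
{
    // Assumption: UUIDs use the canonical 8-4-4-4-12 hex layout.
    $pattern = '/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i';
    if (preg_match($pattern, $uuid) !== 1) {
        throw new InvalidArgumentException('Not a valid UUID: ' . $uuid);
    }

    // Crawl jobs download as .html, screenshot jobs as .webp.
    $extension = match ($kind) {
        'crawl' => 'html',
        'shot' => 'webp',
        default => throw new InvalidArgumentException('Unknown job kind: ' . $kind),
    };

    return sprintf('/api/%s/%s.%s', $kind, $uuid, $extension);
}
```

Validating the identifier before building the path keeps arbitrary strings out of the URL.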

### Client Library Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `$client->crawl($url)->create()` | `CrawlResponse` | Fluent crawl job creation |
| `$client->getCrawlStatus($uuid)` | `CrawlResponse` | Typed crawl status |
| `$client->shot($url)->create()` | `ShotResponse` | Fluent screenshot creation |
| `$client->getShotStatus($uuid)` | `ShotResponse` | Typed screenshot status |
| `$client->listWebhookErrors()` | `array` | Failed webhook list |

## 🔧 Architecture & Features

### Webhook System
- **Event Filtering** - Choose which status changes trigger webhooks (`queued`, `processing`, `completed`, `failed`)
- **Progressive Retry** - Automatic retry with exponential backoff (1, 2, 4, 8, 16, 32 minutes)
- **Error Management** - List, retry, and clear failed webhook deliveries
- **Consistent Payload** - Webhook data matches status API responses exactly
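
The retry schedule quoted above (1, 2, 4, 8, 16, 32 minutes) doubles on every attempt, that is, the delay before retry `n` is `2^(n-1)` minutes. A minimal sketch of that arithmetic follows; the function name is hypothetical, and giving up after the sixth retry is an assumption inferred from the length of the listed schedule, not something the README states.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: delay in minutes before the given retry attempt.
 * Returns null once the assumed retry budget is exhausted.
 */
function webhookRetryDelayMinutes(int $attempt, int $maxAttempts = 6): ?int
{
    if ($attempt < 1) {
        throw new InvalidArgumentException('Attempt numbers start at 1.');
    }

    // Assumption: no further retries after the last listed delay.
    if ($attempt > $maxAttempts) {
        return null;
    }

    // Doubling schedule: 1, 2, 4, 8, 16, 32.
    return 2 ** ($attempt - 1);
}

// Print the full schedule for attempts 1 through 6.
echo implode(', ', array_map('webhookRetryDelayMinutes', range(1, 6))), PHP_EOL;
```

With the default budget, the six delays add up to 63 minutes between the first failure and the final attempt.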

### Smart Filtering
- **EasyList Integration** - Automatic ad/tracker/cookie banner blocking
- **Custom Blocking** - Fine-grained control over content filtering
- **Performance Optimized** - Cached filter lists with 24-hour updates
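
The 24-hour refresh of cached filter lists comes down to comparing the age of the cached copy with a time-to-live. The sketch below illustrates that check only; how Crawlshot actually stores and refreshes its lists is not described here, and `isFilterListStale` is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: decide whether a cached filter list needs refreshing.
 *
 * @param int $fetchedAt  Unix timestamp of the last successful download.
 * @param int $now        Current Unix timestamp (injected for testability).
 * @param int $ttlSeconds Cache lifetime; 86400 seconds = 24 hours.
 */
function isFilterListStale(int $fetchedAt, int $now, int $ttlSeconds = 86400): bool
{
    return ($now - $fetchedAt) >= $ttlSeconds;
}
```

In practice `$fetchedAt` could come from `filemtime()` on the cached file and `$now` from `time()`.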

### Developer Experience
- **Fluent Interface** - Method chaining for clean, readable code
- **Typed Responses** - `CrawlResponse` and `ShotResponse` classes with helpful methods
- **Laravel Integration** - Service providers, facades, auto-discovery
- **Comprehensive Docs** - Complete API and client documentation

## 🛠️ Requirements & Setup

### System Requirements
- **PHP 8.3+** with ImageMagick extension
- **Laravel 12.0+** framework
- **Node.js** with Puppeteer for browser automation
- **Database** (SQLite included, MySQL/PostgreSQL supported)

### Quick Setup
```bash
# Clone and install
git clone [repository] && cd crawlshot
composer install && npm install puppeteer

# Configure and run
cp .env.example .env
php artisan key:generate
php artisan migrate
php artisan serve

# Start queue processing (separate terminal)
php artisan horizon
```

### Key Dependencies
- `spatie/browsershot` - Browser automation
- `protonlabs/php-adblock-parser` - EasyList parsing
- `laravel/horizon` - Queue monitoring (standalone)
- `laravel/sanctum` - API authentication (standalone)
- **[Spatie Browsershot](https://github.com/spatie/browsershot)** - Puppeteer wrapper for browser automation
- **[Laravel Horizon](https://laravel.com/docs/horizon)** - Queue monitoring and management
- **[Laravel Sanctum](https://laravel.com/docs/sanctum)** - API authentication
- **ProtonMail AdBlock Parser** - EasyList filter processing

### File Structure
## 📄 License

```
├── app/                          # Laravel application (standalone)
│   ├── Http/Controllers/Api/     # API controllers
│   ├── Jobs/                     # Queue jobs
│   ├── Models/                   # Eloquent models
│   └── Services/                 # Core services
├── src/                          # Package source (both modes)
│   ├── CrawlshotClient.php       # HTTP client (package mode)
│   ├── CrawlshotServiceProvider.php
│   ├── Facades/Crawlshot.php
│   └── config/crawlshot.php
├── routes/api.php                # API routes (standalone)
├── database/migrations/          # Database schema
└── composer.json                 # Package definition
```
MIT License - see [LICENSE](LICENSE) file for details.

## License
---

MIT
**[Get Started →](CLIENT_DOCUMENTATION.md)** | **[View API Docs →](API_DOCUMENTATION.md)** | **[Setup Guide →](SETUP.md)**