# Crawlshot
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Laravel](https://img.shields.io/badge/Laravel-12-red.svg)](https://laravel.com)
[![PHP](https://img.shields.io/badge/PHP-8.3+-blue.svg)](https://php.net)

**High-performance web crawling and screenshot service** built with Laravel, featuring intelligent ad blocking, webhook notifications, and a powerful fluent PHP client.

🎯 **Perfect for:** Content monitoring • Screenshot automation • QA testing • Social media previews • Compliance archival

## ✨ Key Features
- 🚀 **Dual Deployment**: Standalone API service or Laravel package
- 🔗 **Webhook Notifications**: Real-time updates with progressive retry
- 🎨 **Fluent Interface**: `$client->crawl($url)->webhookUrl($webhook)->create()`
- 📦 **Typed Responses**: `$result->isCompleted()`, `$shot->getDimensions()`
- 🛡️ **Smart Blocking**: EasyList ad/tracker/cookie banner filtering
- ⚡ **Background Processing**: Laravel Horizon queue management
- 🔄 **Auto-cleanup**: 24-hour file retention with scheduled cleanup
- 🔐 **Secure**: Laravel Sanctum API authentication
## 📚 Documentation
- 📖 **[API Documentation](API_DOCUMENTATION.md)** - Complete REST API reference with webhook system
- 🔧 **[Client Documentation](CLIENT_DOCUMENTATION.md)** - PHP client library guide with fluent interface
- ⚙️ **[Setup Guide](SETUP.md)** - Detailed installation and configuration
## 🚀 Quick Start
### Option 1: Standalone API Service
Deploy your own Crawlshot API server:
```bash
git clone [repository]
cd crawlshot
composer install && npm install puppeteer
php artisan migrate && php artisan serve
```
### Option 2: Laravel Package
Use as a client library in your Laravel app:
```bash
composer require crawlshot/laravel
```
```php
use Crawlshot\Laravel\CrawlshotClient;
$client = new CrawlshotClient('https://crawlshot.test', 'your-token');
```
## ⚡ Modern Usage Examples
### Fluent Interface with Webhooks
```php
use Crawlshot\Laravel\CrawlshotClient;

$client = new CrawlshotClient('https://crawlshot.test', 'your-token');

// HTML Crawling with webhook notifications
$crawl = $client->crawl('https://example.com')
    ->webhookUrl('https://myapp.com/webhook')
    ->webhookEventsFilter(['completed', 'failed'])
    ->blockAds(true)
    ->timeout(60)
    ->create();

echo "Job: {$crawl->getUuid()} - Status: {$crawl->getStatus()}";

// Screenshot with custom dimensions
$shot = $client->shot('https://dashboard.example.com')
    ->viewportSize(1920, 1080)
    ->quality(90)
    ->webhookUrl('https://myapp.com/webhook')
    ->create();

if ($shot->isCompleted()) {
    $dimensions = $shot->getDimensions(); // [1920, 1080]
    $imageData = $shot->downloadImage(); // Binary data
}
```
### Webhook Handler Example
```php
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::post('/webhook', function (Request $request) {
    $job = $request->all();

    if ($job['status'] === 'completed') {
        if (isset($job['result']['html'])) {
            // Process HTML crawl result
            $html = $job['result']['html']['raw'];
        } elseif (isset($job['result']['image'])) {
            // Process screenshot result
            $imageUrl = $job['result']['image']['url'];
        }
    }

    // Acknowledge receipt so the delivery is not treated as failed
    return response('OK', 200);
});
```
### Direct API Usage
```bash
# HTML crawl with webhook
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "webhook_url": "https://myapp.com/webhook",
    "webhook_events_filter": ["completed"],
    "block_ads": true
  }'

# Screenshot with custom viewport
curl -X POST "https://crawlshot.test/api/shot" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "viewport_width": 1200,
    "viewport_height": 800,
    "webhook_url": "https://myapp.com/webhook"
  }'
```
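
If you would rather assemble the crawl request body in PHP than write the JSON by hand, the sketch below shows one way to do it. The field names (`url`, `webhook_url`, `webhook_events_filter`, `block_ads`) are taken from the curl example above; the helper name `buildCrawlPayload` is hypothetical and is not part of the Crawlshot client.

```php
<?php

/**
 * Build the JSON body for POST /api/crawl.
 * Hypothetical helper for illustration; field names mirror the curl example.
 */
function buildCrawlPayload(
    string $url,
    ?string $webhookUrl = null,
    array $events = [],
    bool $blockAds = true
): string {
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        throw new InvalidArgumentException("Invalid URL: {$url}");
    }

    $payload = ['url' => $url, 'block_ads' => $blockAds];

    // Only send webhook fields when a webhook is configured.
    if ($webhookUrl !== null) {
        $payload['webhook_url'] = $webhookUrl;
        if ($events !== []) {
            $payload['webhook_events_filter'] = array_values($events);
        }
    }

    return json_encode($payload, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
}

echo buildCrawlPayload('https://example.com', 'https://myapp.com/webhook', ['completed']), PHP_EOL;
```

The resulting string can be sent with any HTTP client, using the same `Authorization` and `Content-Type` headers as the curl calls.
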
## 🎯 Core APIs
### HTML Crawling
- `POST /api/crawl` - Create HTML crawl job with ad blocking
- `GET /api/crawl/{uuid}` - Get crawl status and results
- `GET /api/crawl/{uuid}.html` - Download HTML file directly
### Screenshot Capture
- `POST /api/shot` - Create screenshot job (always WebP format)
- `GET /api/shot/{uuid}` - Get screenshot status and results
- `GET /api/shot/{uuid}.webp` - Download image file directly
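
The crawl and screenshot routes listed above share one pattern: a status path, plus the same path with a file extension for direct download. A minimal sketch of deriving both from a job UUID follows; `jobPaths` is a hypothetical helper, and the UUID in the usage line is a made-up example value.

```php
<?php

/**
 * Return the status and direct-download paths for a job.
 * Hypothetical helper; the path patterns come from the endpoint list above.
 *
 * @return array{status: string, download: string}
 */
function jobPaths(string $type, string $uuid): array
{
    // Crawl jobs download as .html, shot jobs as .webp.
    $extensions = ['crawl' => 'html', 'shot' => 'webp'];

    if (!isset($extensions[$type])) {
        throw new InvalidArgumentException("Unknown job type: {$type}");
    }

    $base = "/api/{$type}/" . rawurlencode($uuid);

    return [
        'status'   => $base,
        'download' => "{$base}.{$extensions[$type]}",
    ];
}

print_r(jobPaths('shot', '123e4567-e89b-12d3-a456-426614174000'));
```
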
### Webhook Management
- `GET /api/webhook-errors` - List failed webhook deliveries
- `POST /api/webhook-errors/{uuid}/retry` - Retry failed webhook
- `DELETE /api/webhook-errors/{uuid}/clear` - Clear webhook error
### Client Library Methods
| Method | Returns | Description |
|--------|---------|-------------|
| `$client->crawl($url)->create()` | `CrawlResponse` | Fluent crawl job creation |
| `$client->getCrawlStatus($uuid)` | `CrawlResponse` | Typed crawl status |
| `$client->shot($url)->create()` | `ShotResponse` | Fluent screenshot creation |
| `$client->getShotStatus($uuid)` | `ShotResponse` | Typed screenshot status |
| `$client->listWebhookErrors()` | `array` | Failed webhook list |
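
Because jobs are processed in the background, a caller that does not use webhooks may need to poll the status methods in the table above. The sketch below shows a generic polling loop. It takes a callable so that it runs without a Crawlshot server; in real use the callable would presumably wrap `$client->getCrawlStatus($uuid)` and return its status. `pollUntilDone` is a hypothetical helper, and treating `completed` and `failed` as the terminal states is an assumption based on the webhook event names.

```php
<?php

/**
 * Call $fetchStatus until it returns a terminal status or attempts run out.
 * Hypothetical helper, not part of the Crawlshot client.
 */
function pollUntilDone(callable $fetchStatus, int $maxAttempts = 10, int $delayMs = 500): string
{
    $status = 'queued';

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $status = $fetchStatus();

        // Assumed terminal states: stop as soon as one is seen.
        if (in_array($status, ['completed', 'failed'], true)) {
            return $status;
        }

        if ($attempt < $maxAttempts) {
            usleep($delayMs * 1000); // wait before asking again
        }
    }

    return $status; // still non-terminal after $maxAttempts checks
}

// Stand-in for a real status call: "processing" twice, then "completed".
$responses = ['processing', 'processing', 'completed'];
$fakeFetch = function () use (&$responses): string {
    return array_shift($responses) ?? 'completed';
};

echo pollUntilDone($fakeFetch, 5, 10), PHP_EOL;
```

Where a publicly reachable endpoint is available, webhooks avoid this loop entirely.
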
## 🔧 Architecture & Features
### Webhook System
- **Event Filtering** - Choose which status changes trigger webhooks (`queued`, `processing`, `completed`, `failed`)
- **Progressive Retry** - Automatic retry with exponential backoff (1, 2, 4, 8, 16, 32 minutes)
- **Error Management** - List, retry, and clear failed webhook deliveries
- **Consistent Payload** - Webhook data matches status API responses exactly
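
The retry schedule above doubles the delay on each attempt. A minimal sketch of that arithmetic follows, assuming six retries with 1-based attempt numbers; `webhookRetryDelayMinutes` is a hypothetical name, and the service's own implementation may differ.

```php
<?php

/**
 * Delay in minutes before retry number $attempt (1-based), doubling each time.
 * Assumption: six retries, matching the 1, 2, 4, 8, 16, 32 schedule above.
 */
function webhookRetryDelayMinutes(int $attempt, int $maxAttempts = 6): ?int
{
    if ($attempt < 1 || $attempt > $maxAttempts) {
        return null; // no further retries
    }

    return 2 ** ($attempt - 1);
}

$schedule = array_map('webhookRetryDelayMinutes', range(1, 6));

echo implode(', ', $schedule), PHP_EOL;          // 1, 2, 4, 8, 16, 32
echo array_sum($schedule), ' minutes', PHP_EOL;  // 63 minutes
```

Under these assumptions, the delays before a delivery exhausts its retries add up to 63 minutes.
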
### Smart Filtering
- **EasyList Integration** - Automatic ad/tracker/cookie banner blocking
- **Custom Blocking** - Fine-grained control over content filtering
- **Performance Optimized** - Cached filter lists with 24-hour updates
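
The 24-hour update cycle for cached filter lists comes down to a freshness check. The sketch below assumes the check compares the time a list was fetched against a 24-hour TTL; `isFilterListStale` is a hypothetical name, and how Crawlshot actually stores and refreshes its lists is not described here.

```php
<?php

/**
 * Decide whether a cached filter list should be re-downloaded.
 * Assumption: freshness is judged by the fetch timestamp against a 24-hour TTL.
 */
function isFilterListStale(int $fetchedAt, int $now, int $ttlSeconds = 86400): bool
{
    return ($now - $fetchedAt) >= $ttlSeconds;
}

$now = time();

var_dump(isFilterListStale($now - 3600, $now));      // bool(false): fetched an hour ago
var_dump(isFilterListStale($now - 2 * 86400, $now)); // bool(true): fetched two days ago
```
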
### Developer Experience
- **Fluent Interface** - Method chaining for clean, readable code
- **Typed Responses** - `CrawlResponse` and `ShotResponse` classes with helpful methods
- **Laravel Integration** - Service providers, facades, auto-discovery
- **Comprehensive Docs** - Complete API and client documentation
## 🛠️ Requirements & Setup
### System Requirements
- **PHP 8.3+** with ImageMagick extension
- **Laravel 12.0+** framework
- **Node.js** with Puppeteer for browser automation
- **Database** (SQLite included, MySQL/PostgreSQL supported)
### Quick Setup
```bash
# Clone and install
git clone [repository] && cd crawlshot
composer install && npm install puppeteer
# Configure and run
cp .env.example .env
php artisan key:generate
php artisan migrate
php artisan serve
# Start queue processing (separate terminal)
php artisan horizon
```
### Key Dependencies
- **[Spatie Browsershot](https://github.com/spatie/browsershot)** - Puppeteer wrapper for browser automation
- **[Laravel Horizon](https://laravel.com/docs/horizon)** - Queue monitoring and management
- **[Laravel Sanctum](https://laravel.com/docs/sanctum)** - API authentication
- **ProtonMail AdBlock Parser** - EasyList filter processing
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.
---
**[Get Started →](CLIENT_DOCUMENTATION.md)** | **[View API Docs →](API_DOCUMENTATION.md)** | **[Setup Guide →](SETUP.md)**