This commit is contained in:
ct
2025-08-10 21:10:33 +08:00
parent 480bd9055d
commit 583a804073
43 changed files with 7623 additions and 270 deletions

215
README.md

@@ -1,61 +1,198 @@
<p align="center"><a href="https://laravel.com" target="_blank"><img src="https://raw.githubusercontent.com/laravel/art/master/logo-lockup/5%20SVG/2%20CMYK/1%20Full%20Color/laravel-logolockup-cmyk-red.svg" width="400" alt="Laravel Logo"></a></p>
# Crawlshot
<p align="center">
<a href="https://github.com/laravel/framework/actions"><img src="https://github.com/laravel/framework/workflows/tests/badge.svg" alt="Build Status"></a>
<a href="https://packagist.org/packages/laravel/framework"><img src="https://img.shields.io/packagist/dt/laravel/framework" alt="Total Downloads"></a>
<a href="https://packagist.org/packages/laravel/framework"><img src="https://img.shields.io/packagist/v/laravel/framework" alt="Latest Stable Version"></a>
<a href="https://packagist.org/packages/laravel/framework"><img src="https://img.shields.io/packagist/l/laravel/framework" alt="License"></a>
</p>
A Laravel web crawling and screenshot service with dual deployment options:
## About Laravel
1. **Standalone API Service** - Full Laravel application with REST API endpoints
2. **Laravel Package** - HTTP client package for use in other Laravel applications
Laravel is a web application framework with expressive, elegant syntax. We believe development must be an enjoyable and creative experience to be truly fulfilling. Laravel takes the pain out of development by easing common tasks used in many web projects, such as:
## Architecture Overview
- [Simple, fast routing engine](https://laravel.com/docs/routing).
- [Powerful dependency injection container](https://laravel.com/docs/container).
- Multiple back-ends for [session](https://laravel.com/docs/session) and [cache](https://laravel.com/docs/cache) storage.
- Expressive, intuitive [database ORM](https://laravel.com/docs/eloquent).
- Database agnostic [schema migrations](https://laravel.com/docs/migrations).
- [Robust background job processing](https://laravel.com/docs/queues).
- [Real-time event broadcasting](https://laravel.com/docs/broadcasting).
### Standalone API Service
The main Laravel application provides a complete web crawling and screenshot service:
Laravel is accessible, powerful, and provides tools required for large, robust applications.
- **Spatie Browsershot Integration** - Uses Puppeteer for browser automation
- **EasyList Ad Blocking** - Automatic ad/tracker blocking using EasyList filters
- **Queue Processing** - Laravel Horizon for async job processing
- **24-hour Cleanup** - Files and database records are removed automatically after 24 hours
- **Sanctum Authentication** - API token-based authentication
- **SQLite Database** - Stores job metadata and processing status
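
The 24-hour cleanup rule in the list above can be illustrated with a small expiry check. This is only a sketch: the function name `crawlshotIsExpired` and its signature are hypothetical, and the only detail taken from this README is the 24-hour retention window.

```php
<?php
// Hypothetical helper (not the service's actual code): decides whether a job's
// files and database records are due for cleanup. Only the 24-hour window
// comes from the README; everything else is an illustrative assumption.
function crawlshotIsExpired(DateTimeImmutable $createdAt, DateTimeImmutable $now, int $ttlHours = 24): bool
{
    // A job expires once the full retention window has elapsed.
    $expiresAt = $createdAt->add(new DateInterval('PT' . $ttlHours . 'H'));

    return $now >= $expiresAt;
}

$created = new DateTimeImmutable('2025-08-10 12:00:00', new DateTimeZone('UTC'));

var_dump(crawlshotIsExpired($created, $created->modify('+23 hours'))); // bool(false)
var_dump(crawlshotIsExpired($created, $created->modify('+24 hours'))); // bool(true)
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test.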
## Learning Laravel
### Laravel Package
Simple HTTP client package that provides a clean interface to the API:
Laravel has the most extensive and thorough [documentation](https://laravel.com/docs) and video tutorial library of all modern web application frameworks, making it a breeze to get started with the framework.
- **8 Methods for 8 Endpoints** - Direct 1:1 mapping to REST endpoints
- **Facade Support** - Clean Laravel integration
- **Auto-discovery** - Automatic service provider registration
You may also try the [Laravel Bootcamp](https://bootcamp.laravel.com), where you will be guided through building a modern Laravel application from scratch.
## Deployment Options
If you don't feel like reading, [Laracasts](https://laracasts.com) can help. Laracasts contains thousands of video tutorials on a range of topics including Laravel, modern PHP, unit testing, and JavaScript. Boost your skills by digging into our comprehensive video library.
### Option 1: Standalone API Service
## Laravel Sponsors
Deploy as a complete Laravel application:
We would like to extend our thanks to the following sponsors for funding Laravel development. If you are interested in becoming a sponsor, please visit the [Laravel Partners program](https://partners.laravel.com).
```bash
git clone [repository]
cd crawlshot
composer install
npm install puppeteer
php artisan migrate
php artisan serve
```
### Premium Partners
**API Endpoints:**
- `POST /api/crawl` - Create HTML crawl job
- `GET /api/crawl/{uuid}` - Get crawl status/result
- `GET /api/crawl` - List all crawl jobs
- `POST /api/shot` - Create screenshot job
- `GET /api/shot/{uuid}` - Get screenshot status/result
- `GET /api/shot/{uuid}/download` - Download screenshot file
- `GET /api/shot` - List all screenshot jobs
- `GET /api/health` - Health check
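
To make the endpoint list more concrete, here is a rough sketch of how a client might assemble the `POST /api/crawl` request without sending it. The helper name `crawlshotBuildCrawlRequest` is hypothetical; the bearer-token header and the flat JSON body (`url` plus options) follow the curl example in this README.

```php
<?php
// Hypothetical helper (not part of Crawlshot): describes the HTTP request for
// POST /api/crawl as a plain array, so it can be inspected before being sent.
function crawlshotBuildCrawlRequest(string $baseUrl, string $token, string $targetUrl, array $options = []): array
{
    return [
        'method'  => 'POST',
        // Tolerate a trailing slash on the base URL.
        'url'     => rtrim($baseUrl, '/') . '/api/crawl',
        'headers' => [
            'Authorization' => 'Bearer ' . $token,
            'Content-Type'  => 'application/json',
        ],
        // The target URL and the options share one flat JSON object.
        'body'    => json_encode(
            ['url' => $targetUrl] + $options,
            JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES
        ),
    ];
}

$request = crawlshotBuildCrawlRequest(
    'https://crawlshot.test/',
    'example-token',
    'https://example.com',
    ['block_ads' => true]
);

echo $request['url'], PHP_EOL;  // https://crawlshot.test/api/crawl
echo $request['body'], PHP_EOL; // {"url":"https://example.com","block_ads":true}
```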
- **[Vehikl](https://vehikl.com)**
- **[Tighten Co.](https://tighten.co)**
- **[Kirschbaum Development Group](https://kirschbaumdevelopment.com)**
- **[64 Robots](https://64robots.com)**
- **[Curotec](https://www.curotec.com/services/technologies/laravel)**
- **[DevSquad](https://devsquad.com/hire-laravel-developers)**
- **[Redberry](https://redberry.international/laravel-development)**
- **[Active Logic](https://activelogic.com)**
**Example API Usage:**
```bash
# Create crawl job
curl -X POST "https://crawlshot.test/api/crawl" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "block_ads": true}'

# Check status
curl -H "Authorization: Bearer {token}" \
  "https://crawlshot.test/api/crawl/{uuid}"
```
## Contributing
Thank you for considering contributing to the Laravel framework! The contribution guide can be found in the [Laravel documentation](https://laravel.com/docs/contributions).
### Option 2: Laravel Package
## Code of Conduct
Install as a package in your Laravel application:
In order to ensure that the Laravel community is welcoming to all, please review and abide by the [Code of Conduct](https://laravel.com/docs/contributions#code-of-conduct).
```bash
composer require crawlshot/laravel
php artisan vendor:publish --tag=crawlshot-config
```
## Security Vulnerabilities
**Configuration:**
```env
CRAWLSHOT_BASE_URL=https://your-crawlshot-api.com
CRAWLSHOT_TOKEN=your-sanctum-token
```
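
As a sketch of how these two settings might be read and checked before use, the snippet below validates them from a plain array. The function `crawlshotConfigFromEnv` and the returned keys `base_url` and `token` are hypothetical; only the variable names `CRAWLSHOT_BASE_URL` and `CRAWLSHOT_TOKEN` come from this README, and the published `config/crawlshot.php` may be structured differently.

```php
<?php
// Hypothetical loader (assumption, not the package's real config file):
// validates the two environment values and normalizes the base URL.
function crawlshotConfigFromEnv(array $env): array
{
    $baseUrl = trim((string) ($env['CRAWLSHOT_BASE_URL'] ?? ''));
    $token   = trim((string) ($env['CRAWLSHOT_TOKEN'] ?? ''));

    if ($baseUrl === '' || filter_var($baseUrl, FILTER_VALIDATE_URL) === false) {
        throw new InvalidArgumentException('CRAWLSHOT_BASE_URL must be a valid URL.');
    }
    if ($token === '') {
        throw new InvalidArgumentException('CRAWLSHOT_TOKEN must not be empty.');
    }

    // Strip a trailing slash so endpoint paths can be appended directly.
    return ['base_url' => rtrim($baseUrl, '/'), 'token' => $token];
}

$config = crawlshotConfigFromEnv([
    'CRAWLSHOT_BASE_URL' => 'https://your-crawlshot-api.com/',
    'CRAWLSHOT_TOKEN'    => 'your-sanctum-token',
]);

echo $config['base_url'], PHP_EOL; // https://your-crawlshot-api.com
```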
If you discover a security vulnerability within Laravel, please send an e-mail to Taylor Otwell via [taylor@laravel.com](mailto:taylor@laravel.com). All security vulnerabilities will be promptly addressed.
**Package Usage:**
```php
use Crawlshot\Laravel\Facades\Crawlshot;

// Create crawl job
$response = Crawlshot::createCrawl('https://example.com', [
    'block_ads' => true,
    'timeout' => 30
]);

// Check status
$status = Crawlshot::getCrawlStatus($response['uuid']);

// Create screenshot
$response = Crawlshot::createShot('https://example.com', [
    'format' => 'jpg',
    'width' => 1920,
    'height' => 1080
]);

// Download screenshot
$imageData = Crawlshot::downloadShot($response['uuid']);
file_put_contents('screenshot.jpg', $imageData);
```
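
Because jobs are processed asynchronously, a caller usually needs to poll the status method until the job finishes. The following is a minimal sketch of such a loop. The helper `crawlshotWaitForJob`, the `status` key, and the values `completed` and `failed` are all assumptions, since this README does not document the response shape. The callable stands in for a call such as `Crawlshot::getCrawlStatus($uuid)`.

```php
<?php
// Hypothetical polling helper. $fetchStatus is any callable returning the
// decoded status response. The 'status' key and its 'completed' / 'failed'
// values are assumptions, not documented behavior.
function crawlshotWaitForJob(callable $fetchStatus, int $maxAttempts = 10, int $delaySeconds = 2): array
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $job = $fetchStatus();

        // Stop as soon as the job reaches a terminal state.
        if (in_array($job['status'] ?? null, ['completed', 'failed'], true)) {
            return $job;
        }

        // Wait between attempts, but not after the last one.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    throw new RuntimeException("Job did not finish after {$maxAttempts} attempts.");
}

// Stubbed status source so the sketch runs without a live API.
$responses = [['status' => 'queued'], ['status' => 'processing'], ['status' => 'completed']];
$job = crawlshotWaitForJob(function () use (&$responses) {
    return array_shift($responses);
}, 5, 0);

echo $job['status'], PHP_EOL; // completed
```

In a Laravel application the stub could be replaced with `fn () => Crawlshot::getCrawlStatus($uuid)`, once the real status field has been confirmed against the API.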
## API Reference
### Available Methods (Package)
| Method | API Endpoint | Description |
|--------|--------------|-------------|
| `createCrawl(string $url, array $options = [])` | `POST /api/crawl` | Create crawl job |
| `getCrawlStatus(string $uuid)` | `GET /api/crawl/{uuid}` | Get crawl status |
| `listCrawls()` | `GET /api/crawl` | List all crawl jobs |
| `createShot(string $url, array $options = [])` | `POST /api/shot` | Create screenshot job |
| `getShotStatus(string $uuid)` | `GET /api/shot/{uuid}` | Get screenshot status |
| `downloadShot(string $uuid)` | `GET /api/shot/{uuid}/download` | Download screenshot file |
| `listShots()` | `GET /api/shot` | List all screenshot jobs |
| `health()` | `GET /api/health` | Health check |
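
The 1:1 mapping in the table can also be expressed as data. The sketch below is illustrative only: the constant `CRAWLSHOT_ENDPOINTS` and the function `crawlshotResolveEndpoint` are hypothetical names, and the real `CrawlshotClient` may be organized differently. The method names, HTTP verbs, and paths are copied from the table.

```php
<?php
// The method-to-endpoint table expressed as data (names taken from the table;
// the constant and function below are hypothetical).
const CRAWLSHOT_ENDPOINTS = [
    'createCrawl'    => ['POST', '/api/crawl'],
    'getCrawlStatus' => ['GET', '/api/crawl/{uuid}'],
    'listCrawls'     => ['GET', '/api/crawl'],
    'createShot'     => ['POST', '/api/shot'],
    'getShotStatus'  => ['GET', '/api/shot/{uuid}'],
    'downloadShot'   => ['GET', '/api/shot/{uuid}/download'],
    'listShots'      => ['GET', '/api/shot'],
    'health'         => ['GET', '/api/health'],
];

// Resolves a client method name to its HTTP verb and concrete path.
function crawlshotResolveEndpoint(string $method, ?string $uuid = null): array
{
    if (!array_key_exists($method, CRAWLSHOT_ENDPOINTS)) {
        throw new InvalidArgumentException("Unknown method: {$method}");
    }

    [$verb, $path] = CRAWLSHOT_ENDPOINTS[$method];

    // Fill in the {uuid} placeholder where the endpoint requires one.
    if (str_contains($path, '{uuid}')) {
        if ($uuid === null || $uuid === '') {
            throw new InvalidArgumentException("{$method} requires a UUID.");
        }
        $path = str_replace('{uuid}', rawurlencode($uuid), $path);
    }

    return [$verb, $path];
}

[$verb, $path] = crawlshotResolveEndpoint('downloadShot', 'abc-123');
echo "{$verb} {$path}", PHP_EOL; // GET /api/shot/abc-123/download
```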
### Crawl Options
```php
[
    'block_ads' => true,            // Block ads using EasyList
    'block_trackers' => true,       // Block tracking scripts
    'timeout' => 30,                // Request timeout in seconds
    'user_agent' => 'Custom UA',    // Custom user agent
    'wait_until' => 'networkidle0'  // Wait condition
]
```
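
A caller may want to catch mistyped option names before a request is sent. The sketch below shows one way to do that, under stated assumptions: the function `crawlshotCheckCrawlOptions` is hypothetical, the type rules are guesses based on the example values above, and the `wait_until` list uses Puppeteer's `waitUntil` values, which the service may or may not accept in full.

```php
<?php
// Hypothetical client-side check for the crawl options listed above.
// Option names come from the README; the validation rules are assumptions.
function crawlshotCheckCrawlOptions(array $options): array
{
    $allowed = ['block_ads', 'block_trackers', 'timeout', 'user_agent', 'wait_until'];
    // Puppeteer's waitUntil values; support for all four is an assumption.
    $waitValues = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];
    $errors = [];

    foreach (array_keys($options) as $key) {
        if (!in_array($key, $allowed, true)) {
            $errors[] = "Unknown option: {$key}";
        }
    }
    foreach (['block_ads', 'block_trackers'] as $flag) {
        if (array_key_exists($flag, $options) && !is_bool($options[$flag])) {
            $errors[] = "{$flag} must be a boolean";
        }
    }
    if (array_key_exists('timeout', $options)
        && (!is_int($options['timeout']) || $options['timeout'] < 1)) {
        $errors[] = 'timeout must be a positive integer (seconds)';
    }
    if (array_key_exists('wait_until', $options)
        && !in_array($options['wait_until'], $waitValues, true)) {
        $errors[] = 'wait_until must be one of: ' . implode(', ', $waitValues);
    }

    return $errors; // An empty array means no problems were found.
}

// Valid options produce no errors.
print_r(crawlshotCheckCrawlOptions(['block_ads' => true, 'timeout' => 30, 'wait_until' => 'networkidle0']));

// A camelCase typo and a zero timeout are both reported.
print_r(crawlshotCheckCrawlOptions(['timeout' => 0, 'waitUntil' => 'load']));
```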
### Screenshot Options
```php
[
    'format' => 'jpg',     // jpg, png, webp
    'quality' => 90,       // 1-100 for jpg/webp
    'width' => 1920,       // Viewport width
    'height' => 1080,      // Viewport height
    'full_page' => true,   // Capture full page
    'block_ads' => true,   // Block ads
    'timeout' => 30        // Request timeout
]
```
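
The constraints noted in the comments above (three formats, quality from 1 to 100 for jpg and webp) can be checked on the client side. This is a rough sketch: `crawlshotCheckShotOptions` is a hypothetical function, the treatment of `quality` for png (ignored) is an assumption, and the server may enforce different rules.

```php
<?php
// Hypothetical client-side check for the screenshot options listed above.
// Formats and the 1-100 quality range come from the README comments; treating
// width/height as positive pixel counts is an assumption.
function crawlshotCheckShotOptions(array $options): array
{
    $errors = [];
    $format = $options['format'] ?? null;

    if ($format !== null && !in_array($format, ['jpg', 'png', 'webp'], true)) {
        $errors[] = 'format must be jpg, png, or webp';
    }

    // Quality is documented for jpg/webp only, so it is skipped for png.
    if (array_key_exists('quality', $options) && $format !== 'png') {
        $quality = $options['quality'];
        if (!is_int($quality) || $quality < 1 || $quality > 100) {
            $errors[] = 'quality must be an integer from 1 to 100';
        }
    }

    foreach (['width', 'height'] as $dimension) {
        if (array_key_exists($dimension, $options)
            && (!is_int($options[$dimension]) || $options[$dimension] < 1)) {
            $errors[] = "{$dimension} must be a positive integer (pixels)";
        }
    }

    return $errors; // An empty array means no problems were found.
}

// Valid options produce no errors.
print_r(crawlshotCheckShotOptions(['format' => 'jpg', 'quality' => 90, 'width' => 1920, 'height' => 1080]));

// An unsupported format and an out-of-range quality are both reported.
print_r(crawlshotCheckShotOptions(['format' => 'gif', 'quality' => 0]));
```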
## Features
### Core Functionality
- **HTML Crawling** - Extract clean HTML content from web pages
- **Screenshot Capture** - Generate high-quality screenshots (JPG, PNG, WebP)
- **Ad Blocking** - Built-in EasyList integration for ad/tracker blocking
- **Queue Processing** - Async job processing with Laravel Horizon
- **File Management** - Automatic cleanup after 24 hours
### Technical Features
- **Laravel 12** support with PHP 8.3+
- **Puppeteer Integration** via Spatie Browsershot
- **Sanctum Authentication** for API security
- **SQLite Database** with migrations
- **Auto-discovery** for package installation
- **Environment Configuration** via .env variables
## Development
### Requirements
- PHP 8.3+
- Laravel 12.0+
- Node.js with Puppeteer
- SQLite (or other database)
- ImageMagick extension
### Key Dependencies
- `spatie/browsershot` - Browser automation
- `protonlabs/php-adblock-parser` - EasyList parsing
- `laravel/horizon` - Queue monitoring (standalone)
- `laravel/sanctum` - API authentication (standalone)
### File Structure
```
├── app/ # Laravel application (standalone)
│ ├── Http/Controllers/Api/ # API controllers
│ ├── Jobs/ # Queue jobs
│ ├── Models/ # Eloquent models
│ └── Services/ # Core services
├── src/ # Package source (both modes)
│ ├── CrawlshotClient.php # HTTP client (package mode)
│ ├── CrawlshotServiceProvider.php
│ ├── Facades/Crawlshot.php
│ └── config/crawlshot.php
├── routes/api.php # API routes (standalone)
├── database/migrations/ # Database schema
└── composer.json # Package definition
```
## License
The Laravel framework is open-sourced software licensed under the [MIT license](https://opensource.org/licenses/MIT).
MIT