
Crawlshot

A Laravel web crawling and screenshot service with dual deployment options:

  1. Standalone API Service - Full Laravel application with REST API endpoints
  2. Laravel Package - HTTP client package for use in other Laravel applications

Architecture Overview

Standalone API Service

The main Laravel application provides a complete web crawling and screenshot service:

  • Spatie Browsershot Integration - Uses Puppeteer for browser automation
  • EasyList Ad Blocking - Automatic ad/tracker blocking using EasyList filters
  • Queue Processing - Laravel Horizon for async job processing
  • 24-hour Cleanup - Files and database records are removed automatically after 24 hours
  • Sanctum Authentication - API token-based authentication
  • SQLite Database - Stores job metadata and processing status

Laravel Package

Simple HTTP client package that provides a clean interface to the API:

  • 8 Methods for 8 Endpoints - Each method maps 1:1 to a REST endpoint
  • Facade Support - Clean Laravel integration
  • Auto-discovery - Automatic service provider registration

Deployment Options

Option 1: Standalone API Service

Deploy as a complete Laravel application:

git clone [repository]
cd crawlshot
composer install
npm install puppeteer
php artisan migrate
php artisan serve

API Endpoints:

  • POST /api/crawl - Create HTML crawl job
  • GET /api/crawl/{uuid} - Get crawl status/result
  • GET /api/crawl - List all crawl jobs
  • POST /api/shot - Create screenshot job
  • GET /api/shot/{uuid} - Get screenshot status/result
  • GET /api/shot/{uuid}/download - Download screenshot file
  • GET /api/shot - List all screenshot jobs
  • GET /api/health - Health check

Example API Usage:

# Create crawl job
curl -X POST "https://crawlshot.test/api/crawl" \
     -H "Authorization: Bearer {token}" \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com", "block_ads": true}'

# Check status  
curl -H "Authorization: Bearer {token}" \
     "https://crawlshot.test/api/crawl/{uuid}"

Option 2: Laravel Package

Install as a package in your Laravel application:

composer require crawlshot/laravel
php artisan vendor:publish --tag=crawlshot-config

Configuration:

CRAWLSHOT_BASE_URL=https://your-crawlshot-api.com
CRAWLSHOT_TOKEN=your-sanctum-token
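
The two variables above are presumably read by the published config file. As a rough sketch only, that file might look like the following; the array keys base_url and token are assumptions made for illustration, so check the file that vendor:publish actually creates:

```php
<?php

// config/crawlshot.php (illustrative sketch; the key names are assumptions,
// not taken from the package source)
return [
    // Base URL of the standalone Crawlshot API service
    'base_url' => env('CRAWLSHOT_BASE_URL'),

    // Sanctum token sent as a Bearer token on each request
    'token' => env('CRAWLSHOT_TOKEN'),
];
```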

Package Usage:

use Crawlshot\Laravel\Facades\Crawlshot;

// Create crawl job
$response = Crawlshot::createCrawl('https://example.com', [
    'block_ads' => true,
    'timeout' => 30
]);

// Check status
$status = Crawlshot::getCrawlStatus($response['uuid']);

// Create screenshot
$response = Crawlshot::createShot('https://example.com', [
    'format' => 'jpg',
    'width' => 1920,
    'height' => 1080
]);

// Download screenshot
$imageData = Crawlshot::downloadShot($response['uuid']);
file_put_contents('screenshot.jpg', $imageData);
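
Because jobs are processed asynchronously, a screenshot is usually not ready right after createShot() returns. The sketch below shows one possible polling loop. The waitForJob() helper is hypothetical and not part of the package, and the 'status' field with the values 'completed' and 'failed' are assumptions about the response shape that should be checked against real API responses:

```php
<?php

declare(strict_types=1);

/**
 * Poll a job until it reaches a terminal state or the attempts run out.
 * Hypothetical helper for illustration; not part of the Crawlshot package.
 *
 * @param callable(string): array $fetchStatus Returns the job data for a UUID
 */
function waitForJob(callable $fetchStatus, string $uuid, int $maxAttempts = 30, int $delaySeconds = 2): array
{
    // ASSUMPTION: terminal status names; adjust to what the API really returns.
    $terminal = ['completed', 'failed'];

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $job = $fetchStatus($uuid);

        if (in_array($job['status'] ?? null, $terminal, true)) {
            return $job;
        }

        // Pause between attempts, but not after the last one.
        if ($attempt < $maxAttempts && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    throw new RuntimeException("Job {$uuid} did not finish after {$maxAttempts} attempts.");
}

// Possible usage inside a Laravel application:
// $job = waitForJob(fn (string $uuid) => Crawlshot::getShotStatus($uuid), $response['uuid']);
```

Passing the status call in as a callable keeps the loop independent of the facade, so the same helper can serve both crawl and screenshot jobs.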

API Reference

Available Methods (Package)

| Method                                        | API Endpoint                  | Description              |
|-----------------------------------------------|-------------------------------|--------------------------|
| createCrawl(string $url, array $options = []) | POST /api/crawl               | Create crawl job         |
| getCrawlStatus(string $uuid)                  | GET /api/crawl/{uuid}         | Get crawl status         |
| listCrawls()                                  | GET /api/crawl                | List all crawl jobs      |
| createShot(string $url, array $options = [])  | POST /api/shot                | Create screenshot job    |
| getShotStatus(string $uuid)                   | GET /api/shot/{uuid}          | Get screenshot status    |
| downloadShot(string $uuid)                    | GET /api/shot/{uuid}/download | Download screenshot file |
| listShots()                                   | GET /api/shot                 | List all screenshot jobs |
| health()                                      | GET /api/health               | Health check             |

Crawl Options

[
    'block_ads' => true,           // Block ads using EasyList
    'block_trackers' => true,      // Block tracking scripts
    'timeout' => 30,               // Request timeout in seconds  
    'user_agent' => 'Custom UA',   // Custom user agent
    'wait_until' => 'networkidle0' // Wait condition
]
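
A mistyped option key is easy to miss, so it can help to build the options array through a small helper. The following is a minimal sketch: buildCrawlOptions() is hypothetical, the default values simply reuse the example values above (the service's real defaults are not documented here), and the wait_until list is the set of Puppeteer wait conditions, which Crawlshot may or may not accept in full:

```php
<?php

declare(strict_types=1);

/**
 * Merge caller overrides into a base set of crawl options.
 * Hypothetical helper for illustration; not part of the Crawlshot package.
 */
function buildCrawlOptions(array $overrides = []): array
{
    // ASSUMPTION: these mirror the example values above, not confirmed defaults.
    $defaults = [
        'block_ads'      => true,
        'block_trackers' => true,
        'timeout'        => 30,
        'wait_until'     => 'networkidle0',
    ];

    // user_agent is documented but has no sensible default, so it is only allowed.
    $allowed = $defaults + ['user_agent' => null];

    $unknown = array_diff_key($overrides, $allowed);
    if ($unknown !== []) {
        throw new InvalidArgumentException(
            'Unknown crawl option(s): ' . implode(', ', array_keys($unknown))
        );
    }

    $options = array_merge($defaults, $overrides);

    // Puppeteer's wait conditions; whether the service accepts all is an assumption.
    $waitConditions = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];
    if (!in_array($options['wait_until'], $waitConditions, true)) {
        throw new InvalidArgumentException('Unsupported wait_until value.');
    }

    return $options;
}

// Possible usage:
// Crawlshot::createCrawl('https://example.com', buildCrawlOptions(['timeout' => 60]));
```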

Screenshot Options

[
    'format' => 'jpg',             // jpg, png, webp
    'quality' => 90,               // 1-100 for jpg/webp
    'width' => 1920,               // Viewport width
    'height' => 1080,              // Viewport height  
    'full_page' => true,           // Capture full page
    'block_ads' => true,           // Block ads
    'timeout' => 30                // Request timeout in seconds
]
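
The ranges in the comments above can be checked on the client before a job is submitted. This is a minimal sketch of such a check: validateShotOptions() is a hypothetical helper, and the server may apply different or stricter rules:

```php
<?php

declare(strict_types=1);

/**
 * Check screenshot options against the documented values.
 * Hypothetical helper for illustration; not part of the Crawlshot package.
 *
 * @return string[] Human-readable problems; empty when the options look fine
 */
function validateShotOptions(array $options): array
{
    $errors = [];

    if (isset($options['format']) && !in_array($options['format'], ['jpg', 'png', 'webp'], true)) {
        $errors[] = 'format must be one of: jpg, png, webp';
    }

    if (isset($options['quality'])) {
        $quality = $options['quality'];
        if (!is_int($quality) || $quality < 1 || $quality > 100) {
            $errors[] = 'quality must be an integer from 1 to 100';
        }
    }

    // Viewport dimensions must be positive integers.
    foreach (['width', 'height'] as $key) {
        if (isset($options[$key]) && (!is_int($options[$key]) || $options[$key] <= 0)) {
            $errors[] = "{$key} must be a positive integer";
        }
    }

    return $errors;
}
```

Returning a list of messages rather than throwing lets the caller report every problem at once.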

Features

Core Functionality

  • HTML Crawling - Extract clean HTML content from web pages
  • Screenshot Capture - Generate high-quality screenshots (JPG, PNG, WebP)
  • Ad Blocking - Built-in EasyList integration for ad/tracker blocking
  • Queue Processing - Async job processing with Laravel Horizon
  • File Management - Automatic cleanup after 24 hours

Technical Features

  • Laravel 12 support with PHP 8.3+
  • Puppeteer Integration via Spatie Browsershot
  • Sanctum Authentication for API security
  • SQLite Database with migrations
  • Auto-discovery for package installation
  • Environment Configuration via .env variables

Development

Requirements

  • PHP 8.3+
  • Laravel 12.0+
  • Node.js with Puppeteer
  • SQLite (or other database)
  • ImageMagick extension

Key Dependencies

  • spatie/browsershot - Browser automation
  • protonlabs/php-adblock-parser - EasyList parsing
  • laravel/horizon - Queue monitoring (standalone)
  • laravel/sanctum - API authentication (standalone)

File Structure

├── app/                          # Laravel application (standalone)
│   ├── Http/Controllers/Api/     # API controllers
│   ├── Jobs/                     # Queue jobs  
│   ├── Models/                   # Eloquent models
│   └── Services/                 # Core services
├── src/                          # Package source (both modes)
│   ├── CrawlshotClient.php       # HTTP client (package mode)
│   ├── CrawlshotServiceProvider.php
│   ├── Facades/Crawlshot.php
│   └── config/crawlshot.php
├── routes/api.php                # API routes (standalone)
├── database/migrations/          # Database schema
└── composer.json                 # Package definition

License

MIT
