Crawlshot PHP Client Library Documentation

The Crawlshot PHP Client Library provides a clean, fluent interface for interacting with Crawlshot API services. Designed specifically for Laravel applications, it offers typed responses, method chaining, and comprehensive webhook support.

Installation & Setup

1. Install via Composer

composer require crawlshot/laravel

2. Configuration

Option A: Direct instantiation

use Crawlshot\Laravel\CrawlshotClient;

$client = new CrawlshotClient('https://crawlshot.test', 'your-api-token');

Option B: Environment variables (recommended)

# .env
CRAWLSHOT_BASE_URL=https://crawlshot.test
CRAWLSHOT_TOKEN=1|rrWUM5ZkmLfGipkm1oIusYX45KbukIekUwMjgB3Nd1121a5c

// In your code
$client = new CrawlshotClient(
    env('CRAWLSHOT_BASE_URL'),
    env('CRAWLSHOT_TOKEN')
);

Note: env() returns null at runtime once Laravel's configuration is cached, so in production prefer reading these values through config(), as shown in the service provider setup below.

3. Service Provider (Optional)

For application-wide configuration, create a service provider:

// app/Providers/CrawlshotServiceProvider.php
class CrawlshotServiceProvider extends ServiceProvider
{
    public function register()
    {
        $this->app->singleton(CrawlshotClient::class, function ($app) {
            return new CrawlshotClient(
                config('services.crawlshot.base_url'),
                config('services.crawlshot.token')
            );
        });
    }
}

// config/services.php
'crawlshot' => [
    'base_url' => env('CRAWLSHOT_BASE_URL'),
    'token' => env('CRAWLSHOT_TOKEN'),
],
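
With the client bound as a singleton, you can type-hint it anywhere the container injects dependencies. A minimal sketch (the ReportController class is illustrative, not part of the library):

// app/Http/Controllers/ReportController.php
use Crawlshot\Laravel\CrawlshotClient;
use Illuminate\Http\Request;

class ReportController extends Controller
{
    // The container injects the singleton configured in the service provider
    public function __construct(private CrawlshotClient $crawlshot)
    {
    }

    public function snapshot(Request $request)
    {
        // createCrawl() returns the raw array response, so the UUID is available directly
        $job = $this->crawlshot->createCrawl($request->input('url'));

        return response()->json(['uuid' => $job['uuid']]);
    }
}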

Basic Usage

Simple HTML Crawling

use Crawlshot\Laravel\CrawlshotClient;

$client = new CrawlshotClient('https://crawlshot.test', 'your-token');

// Create crawl job
$response = $client->createCrawl('https://example.com');
echo "Job UUID: " . $response['uuid']; // Raw array response

// Check status
$status = $client->getCrawlStatus($response['uuid']);
echo "Status: " . $status->getStatus(); // Typed response object

if ($status->isCompleted()) {
    $html = $status->getResultRaw();
    echo "HTML content: " . substr($html, 0, 200) . "...";
}

Simple Screenshot Capture

// Create screenshot job
$response = $client->createShot('https://example.com');

// Check status
$status = $client->getShotStatus($response['uuid']);

if ($status->isCompleted()) {
    echo "Format: " . $status->getFormat(); // webp
    echo "Size: " . implode('x', $status->getDimensions()); // [1920, 1080]
    
    // Get image data
    $imageData = $status->getImageData(); // base64
    $imageBinary = $status->downloadImage(); // binary data
}

Fluent Interface

The client provides a powerful fluent interface for building complex requests with method chaining.

Fluent HTML Crawling

$crawl = $client->crawl('https://example.com')
    ->timeout(60)
    ->delay(2000)
    ->blockAds(true)
    ->blockCookieBanners(true)
    ->blockTrackers(true)
    // Network idle waiting is always enabled for optimal rendering
    ->webhookUrl('https://myapp.com/webhooks/crawlshot')
    ->webhookEventsFilter(['completed', 'failed'])
    ->create(); // Returns CrawlResponse

echo "Job created: " . $crawl->getUuid();
echo "Status: " . $crawl->getStatus();

// Wait for completion
while ($crawl->isProcessing() || $crawl->isQueued()) {
    sleep(2);
    $crawl->refresh(); // Updates from API
}

if ($crawl->isCompleted()) {
    $html = $crawl->getResultRaw();
    file_put_contents('page.html', $html);
}
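
Polling loops like the one above run until the job leaves the queued/processing states; if a job gets stuck, they never exit. A minimal guard, sketched below (the pollUntilDone helper and its defaults are illustrative, not part of the library):

// Illustrative helper: poll a job with an upper time budget.
// Works with both CrawlResponse and ShotResponse, since both expose
// refresh(), isCompleted(), isFailed() and getUuid().
function pollUntilDone($job, int $maxSeconds = 120, int $intervalSeconds = 3)
{
    $deadline = time() + $maxSeconds;

    while (!$job->isCompleted() && !$job->isFailed()) {
        if (time() >= $deadline) {
            throw new \RuntimeException("Job {$job->getUuid()} did not finish within {$maxSeconds} seconds");
        }

        sleep($intervalSeconds);
        $job->refresh();
    }

    return $job;
}

// Usage
$crawl = pollUntilDone($client->crawl('https://example.com')->create());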

Fluent Screenshot Capture

$screenshot = $client->shot('https://example.com')
    ->viewportSize(1200, 800)
    ->quality(85)
    ->timeout(30)
    ->delay(1000)
    ->blockAds(true)
    ->webhookUrl('https://myapp.com/webhooks/crawlshot')
    ->webhookEventsFilter(['completed'])
    ->create(); // Returns ShotResponse

echo "Screenshot job: " . $screenshot->getUuid();

// Poll until complete
while (!$screenshot->isCompleted() && !$screenshot->isFailed()) {
    sleep(3);
    $screenshot->refresh();
}

if ($screenshot->isCompleted()) {
    // Save image
    $imageData = $screenshot->downloadImage();
    file_put_contents('screenshot.webp', $imageData);
    
    echo "Saved {$screenshot->getWidth()}x{$screenshot->getHeight()} image";
}

Available Fluent Methods

CrawlJobBuilder Methods

$client->crawl($url)
    ->webhookUrl(string $url)                    // Webhook notification URL
    ->webhookEventsFilter(array $events)        // ['queued', 'processing', 'completed', 'failed']
    ->timeout(int $seconds)                      // Request timeout (5-300)
    ->delay(int $milliseconds)                   // Delay before capture (0-30000)
    ->blockAds(bool $block = true)               // Block ads via EasyList
    ->blockCookieBanners(bool $block = true)     // Block cookie banners
    ->blockTrackers(bool $block = true)          // Block tracking scripts
    // waitUntilNetworkIdle is always enabled server-side for optimal rendering
    ->create();                                  // Execute and return CrawlResponse

ShotJobBuilder Methods

$client->shot($url)
    ->webhookUrl(string $url)                    // Webhook notification URL
    ->webhookEventsFilter(array $events)        // ['queued', 'processing', 'completed', 'failed']
    ->viewportSize(int $width, int $height)      // Viewport dimensions
    ->quality(int $quality)                      // Image quality 1-100
    ->timeout(int $seconds)                      // Request timeout (5-300)
    ->delay(int $milliseconds)                   // Delay before capture (0-30000)
    ->blockAds(bool $block = true)               // Block ads via EasyList
    ->blockCookieBanners(bool $block = true)     // Block cookie banners
    ->blockTrackers(bool $block = true)          // Block tracking scripts
    ->create();                                  // Execute and return ShotResponse

Response Objects

The client library provides typed response objects that make it easy to work with job results.

Common Methods (Both CrawlResponse & ShotResponse)

// Job information
$response->getUuid(): string                     // Job UUID
$response->getStatus(): string                   // queued|processing|completed|failed
$response->getUrl(): string                      // Original URL
$response->getCreatedAt(): \DateTime             // Job creation time
$response->getStartedAt(): ?\DateTime            // Processing start time (null if not started)
$response->getCompletedAt(): ?\DateTime          // Completion time (null if not completed)
$response->getError(): ?string                   // Error message (null if no error)

// Status checks
$response->isQueued(): bool                      // Job waiting to start
$response->isProcessing(): bool                  // Job currently running
$response->isCompleted(): bool                   // Job finished successfully  
$response->isFailed(): bool                      // Job encountered error

// Utility methods
$response->refresh(): static                     // Refresh from API
$response->getRawResponse(): array               // Original API response
$response->getResult(): ?array                   // Result data (null if not completed)

CrawlResponse Specific Methods

// HTML content access
$crawl->getResultRaw(): ?string                  // Raw HTML content
$crawl->getResultUrl(): ?string                  // Download URL (/api/crawl/{uuid}.html)
$crawl->downloadHtml(): ?string                  // Direct download HTML content

// Example usage
if ($crawl->isCompleted()) {
    $html = $crawl->getResultRaw();
    $downloadUrl = $crawl->getResultUrl();
    
    // Or download directly
    $htmlContent = $crawl->downloadHtml();
    file_put_contents('page.html', $htmlContent);
}

ShotResponse Specific Methods

// Image data access
$shot->getImageData(): ?string                   // Base64 encoded image
$shot->getImageUrl(): ?string                    // Download URL (/api/shot/{uuid}.webp)
$shot->downloadImage(): ?string                  // Direct download binary data

// Image metadata
$shot->getMimeType(): ?string                    // image/webp
$shot->getFormat(): ?string                      // webp
$shot->getWidth(): ?int                          // Image width in pixels
$shot->getHeight(): ?int                         // Image height in pixels
$shot->getSize(): ?int                           // File size in bytes
$shot->getDimensions(): ?array                   // [width, height] or null

// Example usage
if ($shot->isCompleted()) {
    $imageData = $shot->getImageData();          // Base64
    $imageBinary = $shot->downloadImage();       // Binary
    $dimensions = $shot->getDimensions();        // [1920, 1080]
    
    echo "Format: {$shot->getFormat()}";         // webp
    echo "Size: {$dimensions[0]}x{$dimensions[1]}"; // 1920x1080
    echo "File size: {$shot->getSize()} bytes";  // 45678 bytes
}

Webhook Integration

Webhooks provide real-time notifications when job statuses change, eliminating the need for constant polling.

Basic Webhook Setup

// Configure webhook when creating jobs
$crawl = $client->crawl('https://example.com')
    ->webhookUrl('https://myapp.com/webhooks/crawlshot')
    ->webhookEventsFilter(['completed', 'failed'])
    ->create();

// Your webhook endpoint receives the same data as status APIs

Webhook Event Filtering

Control which status changes trigger webhooks:

// Only notify on completion
->webhookEventsFilter(['completed'])

// Only notify on completion or failure
->webhookEventsFilter(['completed', 'failed'])

// Notify on all status changes (default)
->webhookEventsFilter(['queued', 'processing', 'completed', 'failed'])

// Disable webhooks entirely
->webhookEventsFilter([])

Webhook Handler Example

// routes/web.php or routes/api.php
Route::post('/webhooks/crawlshot', function (Request $request) {
    $jobData = $request->all();
    
    // The webhook payload is identical to GET /api/crawl/{uuid} response
    $uuid = $jobData['uuid'];
    $status = $jobData['status'];
    $url = $jobData['url'];
    
    switch ($status) {
        case 'completed':
            if (isset($jobData['result']['html'])) {
                // Handle crawl completion
                $html = $jobData['result']['html']['raw'];
                // Process HTML content...
            } elseif (isset($jobData['result']['image'])) {
                // Handle screenshot completion
                $imageUrl = $jobData['result']['image']['url'];
                $dimensions = [$jobData['result']['width'], $jobData['result']['height']];
                // Process screenshot...
            }
            break;
            
        case 'failed':
            $error = $jobData['error'];
            Log::error("Crawlshot job {$uuid} failed: {$error}");
            break;
            
        case 'processing':
            Log::info("Crawlshot job {$uuid} started processing");
            break;
    }
    
    return response('OK', 200);
});
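
Webhook endpoints should respond quickly; heavier work (parsing HTML, storing images) is better pushed onto the queue. A minimal sketch, assuming a hypothetical ProcessCrawlshotWebhook job class (not part of the library):

// routes/api.php: acknowledge immediately, process asynchronously
use App\Jobs\ProcessCrawlshotWebhook;

Route::post('/webhooks/crawlshot', function (Request $request) {
    ProcessCrawlshotWebhook::dispatch($request->all());

    return response('OK', 200);
});

// app/Jobs/ProcessCrawlshotWebhook.php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

class ProcessCrawlshotWebhook implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(private array $jobData)
    {
    }

    public function handle(): void
    {
        // Same payload handling as in the route closure above
        if ($this->jobData['status'] === 'completed') {
            // Process result...
        }
    }
}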

Webhook Error Management

When webhooks fail, you can manage them through the client:

// List all jobs with failed webhooks
$errors = $client->listWebhookErrors();

foreach ($errors['jobs'] as $job) {
    echo "Job {$job['uuid']} webhook failed: {$job['webhook_last_error']}\n";
    echo "Attempts: {$job['webhook_attempts']}\n";
    
    // Retry immediately
    $client->retryWebhook($job['uuid']);
    
    // Or clear the error without retrying
    // $client->clearWebhookError($job['uuid']);
}

Advanced Configuration

Custom Options

// Advanced crawling options
$crawl = $client->crawl('https://spa-website.com')
    ->timeout(120)                               // Long timeout for slow sites
    ->delay(3000)                                // Wait 3 seconds for JS
    // Network idle waiting is always enabled for AJAX/dynamic content
    ->blockAds(false)                            // Allow ads for testing
    ->blockCookieBanners(true)                   // But block cookie banners
    ->webhookUrl('https://myapp.com/webhook')
    ->create();

// High-quality screenshots
$shot = $client->shot('https://dashboard.example.com')
    ->viewportSize(2560, 1440)                  // High resolution
    ->quality(95)                               // High quality
    ->delay(5000)                               // Wait for dashboard to load
    ->blockAds(true)                            // Clean screenshot
    ->create();

Batch Processing

$urls = ['https://site1.com', 'https://site2.com', 'https://site3.com'];
$jobs = [];

// Create multiple jobs
foreach ($urls as $url) {
    $job = $client->crawl($url)
        ->webhookUrl('https://myapp.com/webhook')
        ->create();
    
    $jobs[] = $job;
    echo "Created job: {$job->getUuid()}\n";
}

// Monitor all jobs
while (true) {
    $completed = 0;
    $failed = 0;
    
    foreach ($jobs as $job) {
        $job->refresh();
        
        if ($job->isCompleted()) $completed++;
        if ($job->isFailed()) $failed++;
    }
    
    echo "Progress: {$completed} completed, {$failed} failed\n";
    
    if ($completed + $failed === count($jobs)) {
        break; // All jobs done
    }
    
    sleep(5);
}

// Process results
foreach ($jobs as $job) {
    if ($job->isCompleted()) {
        $html = $job->getResultRaw();
        // Process HTML...
    }
}

Error Handling

Exception Handling

use Crawlshot\Laravel\CrawlshotClient;

try {
    $client = new CrawlshotClient('https://crawlshot.test', 'invalid-token');
    $response = $client->createCrawl('https://example.com');
    
} catch (\Exception $e) {
    if (str_contains($e->getMessage(), 'Unauthenticated')) {
        echo "Invalid API token\n";
    } elseif (str_contains($e->getMessage(), '422')) {
        echo "Validation error: " . $e->getMessage();
    } else {
        echo "API error: " . $e->getMessage();
    }
}

Response Validation

$shot = $client->getShotStatus($uuid);

// Always check status before accessing results
if ($shot->isCompleted()) {
    $imageData = $shot->getImageData();
    
    if ($imageData) {
        file_put_contents('screenshot.webp', base64_decode($imageData));
    } else {
        echo "No image data available\n";
    }
    
} elseif ($shot->isFailed()) {
    echo "Screenshot failed: " . $shot->getError();
    
} else {
    echo "Still processing... Status: " . $shot->getStatus();
}

Common Issues & Solutions

1. Connection Timeout

// Increase timeout for slow networks
$crawl = $client->crawl($url)->timeout(300)->create(); // 5 minutes

2. Invalid URLs

// Validate URLs before sending
if (filter_var($url, FILTER_VALIDATE_URL)) {
    $crawl = $client->crawl($url)->create();
} else {
    echo "Invalid URL: {$url}";
}

3. Large Files

// Handle large responses
$shot = $client->getShotStatus($uuid);
if ($shot->isCompleted()) {
    $size = $shot->getSize();
    if ($size > 10 * 1024 * 1024) { // 10MB
        echo "Large file ({$size} bytes), downloading directly...";
        $imageData = $shot->downloadImage(); // More memory efficient
    } else {
        $imageData = $shot->getImageData(); // Base64
    }
}

Best Practices

1. Use Webhooks for Production

// ❌ Polling (inefficient)
do {
    sleep(5);
    $status = $client->getCrawlStatus($uuid);
} while ($status->isQueued() || $status->isProcessing());

// ✅ Webhooks (efficient)
$crawl = $client->crawl($url)
    ->webhookUrl('https://myapp.com/webhook')
    ->create();

2. Handle Failures Gracefully

$crawl = $client->crawl($url)
    ->timeout(60)
    ->webhookEventsFilter(['completed', 'failed']) // Include 'failed' events
    ->create();

// In webhook handler
if ($jobData['status'] === 'failed') {
    // Log error and potentially retry with different settings
    Log::error("Crawl failed for {$jobData['url']}: {$jobData['error']}");
    
    // Maybe retry with longer timeout
    $retry = $client->crawl($jobData['url'])
        ->timeout(120)
        ->create();
}

3. Use Environment-Specific Configuration

# .env.production
CRAWLSHOT_BASE_URL=https://crawlshot.production.com
CRAWLSHOT_TOKEN=prod_token_here

# .env.development
CRAWLSHOT_BASE_URL=https://crawlshot.test
CRAWLSHOT_TOKEN=dev_token_here

# .env.testing
CRAWLSHOT_BASE_URL=https://crawlshot.staging.com
CRAWLSHOT_TOKEN=test_token_here

4. Implement Proper Error Logging

try {
    $crawl = $client->crawl($url)->create();
} catch (\Exception $e) {
    Log::channel('crawlshot')->error('Crawl creation failed', [
        'url' => $url,
        'error' => $e->getMessage(),
        'trace' => $e->getTraceAsString()
    ]);
    
    throw $e; // Re-throw if needed
}

5. Monitor Webhook Failures

// Scheduled job to check webhook failures
Schedule::call(function () {
    $client = app(CrawlshotClient::class);
    $errors = $client->listWebhookErrors();
    
    if ($errors['pagination']['total_items'] > 0) {
        Log::warning('Webhook failures detected', [
            'count' => $errors['pagination']['total_items']
        ]);
        
        // Optionally retry recent failures
        foreach ($errors['jobs'] as $job) {
            if ($job['webhook_attempts'] < 3) { // Don't retry too many times
                $client->retryWebhook($job['uuid']);
            }
        }
    }
})->hourly();

Complete Examples

Content Monitoring System

class ContentMonitor
{
    private CrawlshotClient $client;
    
    public function __construct(CrawlshotClient $client)
    {
        $this->client = $client;
    }
    
    public function monitorWebsite(string $url): void
    {
        $crawl = $this->client->crawl($url)
            ->blockAds(true)
            ->blockCookieBanners(true)
            ->timeout(60)
            ->webhookUrl(route('webhook.crawlshot'))
            ->webhookEventsFilter(['completed', 'failed'])
            ->create();
        
        // Store job info for later processing
        MonitorJob::create([
            'uuid' => $crawl->getUuid(),
            'url' => $url,
            'status' => 'queued',
            'created_at' => now()
        ]);
    }
    
    public function handleWebhook(array $data): void
    {
        $monitorJob = MonitorJob::where('uuid', $data['uuid'])->first();
        
        if (!$monitorJob) return;
        
        $monitorJob->update(['status' => $data['status']]);
        
        if ($data['status'] === 'completed') {
            $html = $data['result']['html']['raw'];
            
            // Check for changes
            $previousHash = $monitorJob->content_hash;
            $currentHash = md5($html);
            
            if ($previousHash && $previousHash !== $currentHash) {
                // Content changed, send notification
                Mail::to('admin@example.com')->send(
                    new ContentChangedNotification($monitorJob->url, $html)
                );
            }
            
            $monitorJob->update(['content_hash' => $currentHash]);
        }
    }
}

Screenshot Gallery Generator

class ScreenshotGallery
{
    private CrawlshotClient $client;
    
    public function __construct(CrawlshotClient $client)
    {
        $this->client = $client;
    }
    
    public function generateGallery(array $urls): array
    {
        $jobs = [];
        
        // Create all screenshot jobs
        foreach ($urls as $url) {
            $shot = $this->client->shot($url)
                ->viewportSize(1200, 800)
                ->quality(80)
                ->blockAds(true)
                ->delay(2000)
                ->webhookUrl(route('webhook.screenshot'))
                ->create();
            
            $jobs[] = [
                'uuid' => $shot->getUuid(),
                'url' => $url,
                'response' => $shot
            ];
        }
        
        return $jobs;
    }
    
    public function handleScreenshotWebhook(array $data): void
    {
        if ($data['status'] === 'completed') {
            // Save screenshot to permanent storage
            $imageData = base64_decode($data['result']['image']['raw']);
            $filename = $data['uuid'] . '.webp';
            
            Storage::disk('public')->put("screenshots/{$filename}", $imageData);
            
            // Update database
            Screenshot::updateOrCreate(['uuid' => $data['uuid']], [
                'url' => $data['url'],
                'filename' => $filename,
                'width' => $data['result']['width'],
                'height' => $data['result']['height'],
                'size' => $data['result']['size'],
                'completed_at' => now()
            ]);
        }
    }
}
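
Both examples reference named webhook routes (route('webhook.crawlshot') and route('webhook.screenshot')). A minimal sketch of how those routes might be wired, assuming the CrawlshotClient singleton binding from the service provider section (the URIs are illustrative):

// routes/api.php
Route::post('/webhooks/crawlshot', function (Request $request) {
    app(ContentMonitor::class)->handleWebhook($request->all());

    return response('OK', 200);
})->name('webhook.crawlshot');

Route::post('/webhooks/screenshot', function (Request $request) {
    app(ScreenshotGallery::class)->handleScreenshotWebhook($request->all());

    return response('OK', 200);
})->name('webhook.screenshot');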

The Crawlshot PHP Client Library provides a comprehensive, developer-friendly interface for all your web crawling and screenshot needs. With its fluent interface, typed responses, and robust webhook support, it's designed to make integration as smooth as possible while maintaining full access to all advanced features.