⏱️ Estimated reading time: 15 min

Overview

AnyCrawl is a high-performance web crawler developed by Any4AI that transforms websites into data optimized for large language models (LLMs) and extracts structured search engine results pages (SERPs) from Google, Bing, Baidu, and other major search engines.

With 1.8k GitHub stars and an active community, AnyCrawl is setting a new standard for data collection in AI applications.

Core Value Proposition

  • LLM optimization: Converts web data into a format LLMs can process effectively
  • Multi-engine support: Cheerio, Playwright, Puppeteer, and more
  • SERP specialization: Supports Google, Bing, Baidu, and other major search engines
  • High-performance processing: Multithreading and multiprocess architecture
  • Batch processing: Efficient management of large-scale crawling jobs

What Is AnyCrawl?

An AI-Era Data Collection Platform

AnyCrawl goes beyond a simple web crawler – it is an AI-powered data collection platform:

Web content -> AnyCrawl processing -> LLM-ready data -> AI model training/inference

Modern Architecture

Built on Node.js + TypeScript:

  • Excellent performance through asynchronous processing
  • Stable operation with type safety
  • Leverages a rich ecosystem

Containerized deployment:

  • Docker and Docker Compose support
  • Microservices architecture
  • Scalability and maintainability

Four Core Features

1. SERP Crawling (Search Engine Results Pages)

# Collect Google search results
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "artificial intelligence trends 2025",
    "limit": 20,
    "engine": "google"
  }'

2. Web Scraping (Single Page Extraction)

# Extract single-page content
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article",
    "engine": "playwright"
  }'

3. Site Crawling (Full Site Crawling)

  • Intelligent link traversal
  • Duplicate content removal
  • Structured data extraction

4. Batch Processing (Batch Operations)

  • Bulk URL list processing
  • Parallel job optimization
  • Progress monitoring

System Requirements

Basic Environment

# Check Docker version (20.10+ recommended)
docker --version

# Check Docker Compose version (1.29+ recommended)
docker-compose --version

# Check Git
git --version

# Memory: minimum 4 GB, recommended 8 GB+
# Disk: minimum 10 GB free space

Docker-Based Operation

macOS installation:

# Install Docker via Homebrew
brew install --cask docker

# Complete setup after launching Docker Desktop

Linux installation:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose

# CentOS/RHEL
sudo yum install docker docker-compose

Installation and Initial Setup

Step 1: Clone the Repository

# Clone the AnyCrawl repository
git clone https://github.com/any4ai/AnyCrawl.git
cd AnyCrawl

# Confirm branch (use the main branch)
git branch -a

Step 2: Configure the Environment

Set Basic Environment Variables

# Create .env file
cp .env.example .env

# Review the key configuration items
cat .env

Key Environment Variables

Variable Default Description
NODE_ENV production Runtime environment setting
ANYCRAWL_API_PORT 8080 API server port
ANYCRAWL_HEADLESS true Headless browser mode
ANYCRAWL_AVAILABLE_ENGINES cheerio,playwright,puppeteer Available engines
ANYCRAWL_REDIS_URL redis://redis:6379 Redis connection URL

Step 3: Start Docker Containers

# Build and start containers
docker-compose up --build -d

# Check service status
docker-compose ps

# View logs
docker-compose logs -f

Step 4: Verify Installation

# API server health check
curl http://localhost:8080/health

# Access API documentation (browser)
open http://localhost:8080/docs

Core Features in Detail

Web Scraping

Cheerio Engine (Static HTML)

# Fastest static HTML parsing
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://news.ycombinator.com",
    "engine": "cheerio"
  }'

Characteristics:

  • Fastest speed
  • Low memory usage
  • No JavaScript support

Playwright Engine (JavaScript Rendering)

# Modern browser engine
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/spa-app",
    "engine": "playwright"
  }'

Characteristics:

  • Supports all browsers (Chrome, Firefox, Safari)
  • Full JavaScript rendering
  • Supports latest web standards

Puppeteer Engine (Chrome-Only)

# Chrome-based rendering
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/react-app",
    "engine": "puppeteer"
  }'

Characteristics:

  • Chrome/Chromium only
  • Reliable JavaScript handling
  • Rich debugging capabilities

SERP Crawling (Search Results)

Collect Google Search Results

# Basic search
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "machine learning tutorials",
    "engine": "google",
    "pages": 2,
    "lang": "en"
  }'

Multilingual Search Support

# Korean search results
curl -X POST http://localhost:8080/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "artificial intelligence tutorial",
    "engine": "google",
    "lang": "ko"
  }'

Advanced Search Parameters

Parameter Type Description Default
query string Search query Required
engine string Search engine (google) google
pages number Number of pages to collect 1
lang string Language code en-US
limit number Result limit 10

Proxies and Advanced Settings

Using an HTTP Proxy

curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "engine": "playwright",
    "proxy": "http://proxy.example.com:8080"
  }'

Using a SOCKS Proxy

# SOCKS5 proxy configuration
curl -X POST http://localhost:8080/v1/scrape \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "proxy": "socks5://proxy.example.com:1080"
  }'

Practical Usage Examples

Example 1: Automating News Data Collection

#!/bin/bash
# news-collector.sh

API_URL="http://localhost:8080"
OUTPUT_DIR="./news-data"

mkdir -p "$OUTPUT_DIR"

# List of major news sites
NEWS_SITES=(
    "https://news.ycombinator.com"
    "https://techcrunch.com"
    "https://www.wired.com"
)

for site in "${NEWS_SITES[@]}"; do
    echo "Starting crawl: $site"
    
    # Collect data per site
    curl -X POST "$API_URL/v1/scrape" \
      -H 'Content-Type: application/json' \
      -d "{
        \"url\": \"$site\",
        \"engine\": \"playwright\"
      }" > "$OUTPUT_DIR/$(basename $site).json"
    
    echo "Done: $site"
    sleep 2  # Rate limiting between requests
done

Example 2: Academic Paper Search and Collection

# academic_research.py
import requests
import json
import time

class AcademicCrawler:
    def __init__(self, base_url="http://localhost:8080"):
        self.base_url = base_url
        
    def search_papers(self, keywords, pages=3):
        """Search for academic papers"""
        results = []
        
        for keyword in keywords:
            response = requests.post(
                f"{self.base_url}/v1/search",
                json={
                    "query": f"{keyword} site:arxiv.org OR site:scholar.google.com",
                    "pages": pages,
                    "limit": 20
                }
            )
            
            if response.status_code == 200:
                data = response.json()
                results.extend(data.get('data', {}).get('results', []))
                
            time.sleep(1)  # Respect API rate limits
            
        return results
    
    def extract_paper_content(self, url):
        """Extract paper page content"""
        response = requests.post(
            f"{self.base_url}/v1/scrape",
            json={
                "url": url,
                "engine": "playwright"
            }
        )
        
        if response.status_code == 200:
            return response.json()
        return None

# Usage example
crawler = AcademicCrawler()

# Search for AI-related papers
keywords = [
    "transformer neural network",
    "large language model",
    "computer vision 2025"
]

papers = crawler.search_papers(keywords)
print(f"Papers collected: {len(papers)}")

# Extract details of the first paper
if papers:
    first_paper = papers[0]
    content = crawler.extract_paper_content(first_paper['url'])
    print(f"Paper title: {first_paper['title']}")

Example 3: E-commerce Price Monitoring

// price-monitor.js
const axios = require('axios');

class PriceMonitor {
    constructor(baseUrl = 'http://localhost:8080') {
        this.baseUrl = baseUrl;
    }
    
    async scrapeProduct(url) {
        try {
            const response = await axios.post(`${this.baseUrl}/v1/scrape`, {
                url: url,
                engine: 'playwright'
            });
            
            return response.data;
        } catch (error) {
            console.error('Scraping error:', error.message);
            return null;
        }
    }
    
    async monitorPrices(products) {
        const results = [];
        
        for (const product of products) {
            console.log(`Monitoring: ${product.name}`);
            
            const data = await this.scrapeProduct(product.url);
            
            if (data) {
                results.push({
                    name: product.name,
                    url: product.url,
                    timestamp: new Date().toISOString(),
                    data: data
                });
            }
            
            // Rate limiting between requests
            await new Promise(resolve => setTimeout(resolve, 2000));
        }
        
        return results;
    }
}

// Usage example
const monitor = new PriceMonitor();

const products = [
    {
        name: 'MacBook Pro',
        url: 'https://www.apple.com/macbook-pro/'
    },
    {
        name: 'iPhone 15',
        url: 'https://www.apple.com/iphone-15/'
    }
];

monitor.monitorPrices(products)
    .then(results => {
        console.log('Price monitoring complete');
        console.log(JSON.stringify(results, null, 2));
    })
    .catch(error => {
        console.error('Monitoring error:', error);
    });

Testing on macOS

Here is a script to set up and test AnyCrawl on macOS.

Automated Test Environment Setup

#!/bin/bash
# test-anycrawl-setup.sh
echo "AnyCrawl test environment setup"

# Check Docker
if command -v docker &> /dev/null; then
    echo "Docker confirmed"
else
    echo "Docker required: brew install --cask docker"
    exit 1
fi

# Create test directory
TEST_DIR="$HOME/anycrawl-test-$(date +%Y%m%d)"
mkdir -p "$TEST_DIR" && cd "$TEST_DIR"

# Clone repository
git clone https://github.com/any4ai/AnyCrawl.git
cd AnyCrawl

# Configure environment
cp .env.example .env

# Start Docker
docker-compose up --build -d
sleep 30

# Health check
if curl -s http://localhost:8080/health | grep -q "ok"; then
    echo "AnyCrawl is ready!"
    echo "API documentation: http://localhost:8080/docs"
else
    echo "Service failed to start"
fi

Conclusion

AnyCrawl is a capable platform that addresses data collection requirements for the AI era. By offering LLM-friendly data transformation, high-performance multithreaded processing, and support for multiple search engines, it enables the construction of high-quality datasets essential for AI application development.

Key Advantages

  1. LLM optimization: Provides structured data that AI models can process readily
  2. Scalability: Docker-based deployment that is straightforward to scale
  3. Versatility: Comprehensive support from web scraping to SERP crawling
  4. Performance: Multithreaded handling of large-scale data

Potential Use Cases

  • RAG systems: Building knowledge bases for retrieval-augmented generation
  • AI training data: Collecting high-quality training data across diverse domains
  • Real-time monitoring: Detecting web changes and analyzing trends
  • Automated pipelines: Automated data collection in CI/CD environments

Start exploring AI-powered data collection with Any4AI’s AnyCrawl.


Related posts: