AnyCrawl: A Comprehensive Guide to the LLM-Ready Web Crawler - A New Standard for AI Data Collection
⏱️ Estimated reading time: 15 min
Overview
AnyCrawl is a high-performance web crawler developed by Any4AI that transforms websites into data optimized for large language models (LLMs) and extracts structured search engine results pages (SERPs) from Google, Bing, Baidu, and other major search engines.
With 1.8k GitHub stars and an active community, AnyCrawl is setting a new standard for data collection in AI applications.
Core Value Proposition
- LLM optimization: Converts web data into a format LLMs can process effectively
- Multi-engine support: Cheerio, Playwright, Puppeteer, and more
- SERP specialization: Supports Google, Bing, Baidu, and other major search engines
- High-performance processing: Multithreading and multiprocess architecture
- Batch processing: Efficient management of large-scale crawling jobs
What Is AnyCrawl?
An AI-Era Data Collection Platform
AnyCrawl goes beyond a simple web crawler – it is an AI-powered data collection platform:
Web content -> AnyCrawl processing -> LLM-ready data -> AI model training/inference
Modern Architecture
Built on Node.js + TypeScript:
- Excellent performance through asynchronous processing
- Stable operation with type safety
- Leverages a rich ecosystem
Containerized deployment:
- Docker and Docker Compose support
- Microservices architecture
- Scalability and maintainability
Four Core Features
1. SERP Crawling (Search Engine Results Pages)
# Collect Google search results
curl -X POST http://localhost:8080/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "artificial intelligence trends 2025",
"limit": 20,
"engine": "google"
}'
2. Web Scraping (Single Page Extraction)
# Extract single-page content
curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/article",
"engine": "playwright"
}'
3. Site Crawling (Full Site Crawling)
- Intelligent link traversal
- Duplicate content removal
- Structured data extraction
4. Batch Processing (Batch Operations)
- Bulk URL list processing
- Parallel job optimization
- Progress monitoring
System Requirements
Basic Environment
# Check Docker version (20.10+ recommended)
docker --version
# Check Docker Compose version (1.29+ recommended)
docker-compose --version
# Check Git
git --version
# Memory: minimum 4 GB, recommended 8 GB+
# Disk: minimum 10 GB free space
Docker-Based Operation
macOS installation:
# Install Docker via Homebrew
brew install --cask docker
# Complete setup after launching Docker Desktop
Linux installation:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose
# CentOS/RHEL
sudo yum install docker docker-compose
Installation and Initial Setup
Step 1: Clone the Repository
# Clone the AnyCrawl repository
git clone https://github.com/any4ai/AnyCrawl.git
cd AnyCrawl
# Confirm branch (use the main branch)
git branch -a
Step 2: Configure the Environment
Set Basic Environment Variables
# Create .env file
cp .env.example .env
# Review the key configuration items
cat .env
Key Environment Variables
| Variable | Default | Description |
|---|---|---|
NODE_ENV |
production | Runtime environment setting |
ANYCRAWL_API_PORT |
8080 | API server port |
ANYCRAWL_HEADLESS |
true | Headless browser mode |
ANYCRAWL_AVAILABLE_ENGINES |
cheerio,playwright,puppeteer | Available engines |
ANYCRAWL_REDIS_URL |
redis://redis:6379 | Redis connection URL |
Step 3: Start Docker Containers
# Build and start containers
docker-compose up --build -d
# Check service status
docker-compose ps
# View logs
docker-compose logs -f
Step 4: Verify Installation
# API server health check
curl http://localhost:8080/health
# Access API documentation (browser)
open http://localhost:8080/docs
Core Features in Detail
Web Scraping
Cheerio Engine (Static HTML)
# Fastest static HTML parsing
curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://news.ycombinator.com",
"engine": "cheerio"
}'
Characteristics:
- Fastest speed
- Low memory usage
- No JavaScript support
Playwright Engine (JavaScript Rendering)
# Modern browser engine
curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/spa-app",
"engine": "playwright"
}'
Characteristics:
- Supports all browsers (Chrome, Firefox, Safari)
- Full JavaScript rendering
- Supports latest web standards
Puppeteer Engine (Chrome-Only)
# Chrome-based rendering
curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/react-app",
"engine": "puppeteer"
}'
Characteristics:
- Chrome/Chromium only
- Reliable JavaScript handling
- Rich debugging capabilities
SERP Crawling (Search Results)
Collect Google Search Results
# Basic search
curl -X POST http://localhost:8080/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "machine learning tutorials",
"engine": "google",
"pages": 2,
"lang": "en"
}'
Multilingual Search Support
# Korean search results
curl -X POST http://localhost:8080/v1/search \
-H 'Content-Type: application/json' \
-d '{
"query": "artificial intelligence tutorial",
"engine": "google",
"lang": "ko"
}'
Advanced Search Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
query |
string | Search query | Required |
engine |
string | Search engine (google) | |
pages |
number | Number of pages to collect | 1 |
lang |
string | Language code | en-US |
limit |
number | Result limit | 10 |
Proxies and Advanced Settings
Using an HTTP Proxy
curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"engine": "playwright",
"proxy": "http://proxy.example.com:8080"
}'
Using a SOCKS Proxy
# SOCKS5 proxy configuration
curl -X POST http://localhost:8080/v1/scrape \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"proxy": "socks5://proxy.example.com:1080"
}'
Practical Usage Examples
Example 1: Automating News Data Collection
#!/bin/bash
# news-collector.sh
API_URL="http://localhost:8080"
OUTPUT_DIR="./news-data"
mkdir -p "$OUTPUT_DIR"
# List of major news sites
NEWS_SITES=(
"https://news.ycombinator.com"
"https://techcrunch.com"
"https://www.wired.com"
)
for site in "${NEWS_SITES[@]}"; do
echo "Starting crawl: $site"
# Collect data per site
curl -X POST "$API_URL/v1/scrape" \
-H 'Content-Type: application/json' \
-d "{
\"url\": \"$site\",
\"engine\": \"playwright\"
}" > "$OUTPUT_DIR/$(basename $site).json"
echo "Done: $site"
sleep 2 # Rate limiting between requests
done
Example 2: Academic Paper Search and Collection
# academic_research.py
import requests
import json
import time
class AcademicCrawler:
def __init__(self, base_url="http://localhost:8080"):
self.base_url = base_url
def search_papers(self, keywords, pages=3):
"""Search for academic papers"""
results = []
for keyword in keywords:
response = requests.post(
f"{self.base_url}/v1/search",
json={
"query": f"{keyword} site:arxiv.org OR site:scholar.google.com",
"pages": pages,
"limit": 20
}
)
if response.status_code == 200:
data = response.json()
results.extend(data.get('data', {}).get('results', []))
time.sleep(1) # Respect API rate limits
return results
def extract_paper_content(self, url):
"""Extract paper page content"""
response = requests.post(
f"{self.base_url}/v1/scrape",
json={
"url": url,
"engine": "playwright"
}
)
if response.status_code == 200:
return response.json()
return None
# Usage example
crawler = AcademicCrawler()
# Search for AI-related papers
keywords = [
"transformer neural network",
"large language model",
"computer vision 2025"
]
papers = crawler.search_papers(keywords)
print(f"Papers collected: {len(papers)}")
# Extract details of the first paper
if papers:
first_paper = papers[0]
content = crawler.extract_paper_content(first_paper['url'])
print(f"Paper title: {first_paper['title']}")
Example 3: E-commerce Price Monitoring
// price-monitor.js
const axios = require('axios');
class PriceMonitor {
constructor(baseUrl = 'http://localhost:8080') {
this.baseUrl = baseUrl;
}
async scrapeProduct(url) {
try {
const response = await axios.post(`${this.baseUrl}/v1/scrape`, {
url: url,
engine: 'playwright'
});
return response.data;
} catch (error) {
console.error('Scraping error:', error.message);
return null;
}
}
async monitorPrices(products) {
const results = [];
for (const product of products) {
console.log(`Monitoring: ${product.name}`);
const data = await this.scrapeProduct(product.url);
if (data) {
results.push({
name: product.name,
url: product.url,
timestamp: new Date().toISOString(),
data: data
});
}
// Rate limiting between requests
await new Promise(resolve => setTimeout(resolve, 2000));
}
return results;
}
}
// Usage example
const monitor = new PriceMonitor();
const products = [
{
name: 'MacBook Pro',
url: 'https://www.apple.com/macbook-pro/'
},
{
name: 'iPhone 15',
url: 'https://www.apple.com/iphone-15/'
}
];
monitor.monitorPrices(products)
.then(results => {
console.log('Price monitoring complete');
console.log(JSON.stringify(results, null, 2));
})
.catch(error => {
console.error('Monitoring error:', error);
});
Testing on macOS
Here is a script to set up and test AnyCrawl on macOS.
Automated Test Environment Setup
#!/bin/bash
# test-anycrawl-setup.sh
echo "AnyCrawl test environment setup"
# Check Docker
if command -v docker &> /dev/null; then
echo "Docker confirmed"
else
echo "Docker required: brew install --cask docker"
exit 1
fi
# Create test directory
TEST_DIR="$HOME/anycrawl-test-$(date +%Y%m%d)"
mkdir -p "$TEST_DIR" && cd "$TEST_DIR"
# Clone repository
git clone https://github.com/any4ai/AnyCrawl.git
cd AnyCrawl
# Configure environment
cp .env.example .env
# Start Docker
docker-compose up --build -d
sleep 30
# Health check
if curl -s http://localhost:8080/health | grep -q "ok"; then
echo "AnyCrawl is ready!"
echo "API documentation: http://localhost:8080/docs"
else
echo "Service failed to start"
fi
Conclusion
AnyCrawl is a capable platform that addresses data collection requirements for the AI era. By offering LLM-friendly data transformation, high-performance multithreaded processing, and support for multiple search engines, it enables the construction of high-quality datasets essential for AI application development.
Key Advantages
- LLM optimization: Provides structured data that AI models can process readily
- Scalability: Docker-based deployment that is straightforward to scale
- Versatility: Comprehensive support from web scraping to SERP crawling
- Performance: Multithreaded handling of large-scale data
Potential Use Cases
- RAG systems: Building knowledge bases for retrieval-augmented generation
- AI training data: Collecting high-quality training data across diverse domains
- Real-time monitoring: Detecting web changes and analyzing trends
- Automated pipelines: Automated data collection in CI/CD environments
Start exploring AI-powered data collection with Any4AI’s AnyCrawl.
Related posts: