⏱️ Estimated reading time: 15 minutes

What is AutoScraper?

AutoScraper is a Python library that dramatically simplifies web scraping. While traditional scraping requires hand-written CSS selectors or XPath expressions to extract data, AutoScraper automatically learns the scraping rules from sample data you provide.

Key Features

  • Automatic Rule Learning: Simply provide sample data you want to extract, and it automatically learns the scraping rules
  • Fast and Lightweight: Optimized algorithms for fast processing
  • Flexible Usage: Supports various data types including text, URLs, and HTML tag values
  • Model Save/Load: Save and reuse learned models

Installation

AutoScraper can be easily installed via pip:

# Install from PyPI
pip install autoscraper

# Or install the latest version directly from GitHub
pip install git+https://github.com/alirezamika/autoscraper.git
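
To confirm the installation worked, a quick import check is enough:

# Quick sanity check after installation
from autoscraper import AutoScraper

print(AutoScraper)  # should print the class without raising ImportError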

Basic Usage

1. Getting Similar Results

Here’s an example of extracting related question titles from a StackOverflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# Provide sample data you want to extract
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3?'
]

Now you can use the learned scraper object to extract similar content from other StackOverflow pages:

# Extract similar results from other pages
result = scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')

2. Getting Exact Results

Here’s an example of scraping live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]  # Update with actual price

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Now you can easily get prices for other stock symbols:

# Get Microsoft stock price
result = scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

3. Extracting Multiple Data Simultaneously

Here’s an example of extracting multiple pieces of information from a GitHub repository page:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'
wanted_list = [
    'A Smart, Automatic, Fast and Lightweight Web Scraper for Python',  # Description
    '6.2k',  # Star count
    'https://github.com/alirezamika/autoscraper/issues'  # Issues link
]

scraper = AutoScraper()
scraper.build(url, wanted_list)

# Extract data in exact order
result = scraper.get_result_exact('https://github.com/another-repo')
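
Relying on the positional order of get_result_exact can be fragile when several fields are extracted at once. The library also provides a wanted_dict parameter and a group_by_alias option that return results keyed by alias; the sketch below assumes these options are available in your installed version and reuses the sample values from above.

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

# Dict keys become aliases for the learned rules; values are sample data
wanted_dict = {
    'description': ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python'],
    'issues_link': ['https://github.com/alirezamika/autoscraper/issues'],
}

scraper = AutoScraper()
scraper.build(url, wanted_dict=wanted_dict)

# Results come back keyed by alias instead of by position
result = scraper.get_result_exact(url, group_by_alias=True)
print(result)  # e.g. {'description': [...], 'issues_link': [...]}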

Advanced Usage

Using Proxies and Custom Headers

from autoscraper import AutoScraper

url = 'https://example.com'
wanted_list = ["sample data"]

# Proxy configuration
proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

# Custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

scraper = AutoScraper()
result = scraper.build(
    url, 
    wanted_list, 
    request_args=dict(proxies=proxies, headers=headers)
)

Using HTML Content Directly

You can also use HTML content directly instead of URLs:

html_content = """
<html>
<body>
    <h1>Title</h1>
    <p>Content</p>
</body>
</html>
"""

scraper = AutoScraper()
result = scraper.build(html=html_content, wanted_list=["Title"])

Model Saving and Loading

You can save learned models and reuse them later:

# Save model
scraper.save('my-scraper-model')

# Load model
scraper = AutoScraper()
scraper.load('my-scraper-model')

# Use loaded model for scraping
result = scraper.get_result_exact('https://new-url.com')

Real-World Project Examples

News Headline Collector

from autoscraper import AutoScraper
import json

def collect_news_headlines():
    scraper = AutoScraper()
    
    # Training data: the sample must be a headline that actually appears
    # on the page, otherwise no rules can be learned
    url = 'https://news.ycombinator.com/'
    wanted_list = ["Sample headline text"]  # Replace with a real headline from the page
    
    # Train model
    scraper.build(url, wanted_list)
    
    # Collect headlines from multiple sites
    # (rules learned on one site may not transfer to differently structured sites)
    news_sites = [
        'https://news.ycombinator.com/',
        'https://www.reddit.com/r/programming/',
        'https://dev.to/'
    ]
    
    all_headlines = []
    for site in news_sites:
        headlines = scraper.get_result_similar(site)
        all_headlines.extend(headlines)
    
    return all_headlines

# Execute
headlines = collect_news_headlines()
print(f"Collected {len(headlines)} headlines in total.")

Product Price Monitor

from autoscraper import AutoScraper
import time
import json

class PriceMonitor:
    def __init__(self):
        self.scraper = AutoScraper()
        self.setup_scraper()
    
    def setup_scraper(self):
        # Train with Amazon product page
        url = 'https://www.amazon.com/dp/B08N5WRWNW'  # Example product
        wanted_list = ["$299.99"]  # Update with actual price
        
        self.scraper.build(url, wanted_list)
    
    def monitor_price(self, product_url):
        try:
            result = self.scraper.get_result_exact(product_url)
            if result:
                return {
                    'url': product_url,
                    'price': result[0],
                    'timestamp': time.time()
                }
        except Exception as e:
            print(f"Price monitoring error: {e}")
            return None
    
    def save_price_data(self, price_data, filename='price_history.json'):
        try:
            with open(filename, 'r') as f:
                data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            data = []
        
        data.append(price_data)
        
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)

# Usage example
monitor = PriceMonitor()
price_data = monitor.monitor_price('https://www.amazon.com/dp/B08N5WRWNW')
if price_data:
    monitor.save_price_data(price_data)
    print(f"Current price: {price_data['price']}")

Performance Optimization Tips

1. Batch Processing

from autoscraper import AutoScraper

def batch_scrape(urls, wanted_list):
    scraper = AutoScraper()
    
    # Train with first URL
    scraper.build(urls[0], wanted_list)
    
    # Process remaining URLs in batch
    results = []
    for url in urls[1:]:
        result = scraper.get_result_similar(url)
        results.append({'url': url, 'data': result})
    
    return results

2. Error Handling

import time

from autoscraper import AutoScraper

def safe_scrape(url, wanted_list, max_retries=3):
    scraper = AutoScraper()
    
    for attempt in range(max_retries):
        try:
            result = scraper.build(url, wanted_list)
            return result
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                return None
            time.sleep(2 ** attempt)  # Exponential backoff

Best Practices and Considerations

1. Respect Website Policies

  • Check robots.txt: Always check the website’s robots.txt file and follow the rules
  • Rate Limiting: Set appropriate delays to avoid overloading servers with excessive requests
  • User-Agent Setting: Set an appropriate User-Agent header so requests look like a normal browser (see the sketch after this list)
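
A minimal sketch combining these three practices, assuming a hypothetical target at example.com and placeholder sample data; request_args is documented for build(), and current versions also accept it on the get_result_* methods:

import time
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

from autoscraper import AutoScraper

# Hypothetical target and sample data; replace with real values
BASE_URL = 'https://example.com'
HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'}

def is_allowed(url, user_agent='*'):
    """Check the site's robots.txt before fetching a URL."""
    parser = RobotFileParser()
    parser.set_url(urljoin(url, '/robots.txt'))
    parser.read()
    return parser.can_fetch(user_agent, url)

scraper = AutoScraper()
scraper.build(BASE_URL, wanted_list=['sample data'],
              request_args={'headers': HEADERS})

for url in [BASE_URL + '/page1', BASE_URL + '/page2']:  # hypothetical pages
    if not is_allowed(url):
        print(f"Skipping {url}: disallowed by robots.txt")
        continue
    print(url, scraper.get_result_similar(url, request_args={'headers': HEADERS}))
    time.sleep(2)  # simple rate limiting between requests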

2. Data Validation

def validate_scraped_data(data, expected_fields):
    """Validate that a scraped record (dict of field name to value) contains all expected, non-empty fields."""
    if not data:
        return False
    
    for field in expected_fields:
        if field not in data or not data[field]:
            return False
    
    return True
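
For example, if each scraped record is assembled into a dict of field names to values (hypothetical record below), the check would be used like this:

record = {'title': 'Example product', 'price': '$19.99'}

print(validate_scraped_data(record, ['title', 'price']))   # True
print(validate_scraped_data(record, ['title', 'rating']))  # False: 'rating' is missing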

3. Logging and Monitoring

import logging

from autoscraper import AutoScraper

# Logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)

def scrape_with_logging(url, wanted_list):
    logging.info(f"Starting scraping: {url}")
    
    try:
        scraper = AutoScraper()
        result = scraper.build(url, wanted_list)
        logging.info(f"Successfully extracted {len(result)} items")
        return result
    except Exception as e:
        logging.error(f"Scraping failed: {e}")
        return None

Advanced Integration Examples

Flask API Integration

from flask import Flask, jsonify, request
from autoscraper import AutoScraper

app = Flask(__name__)

# Global scraper instance
scraper = AutoScraper()

@app.route('/train', methods=['POST'])
def train_scraper():
    data = request.json
    url = data.get('url')
    wanted_list = data.get('wanted_list')
    
    try:
        scraper.build(url, wanted_list)
        return jsonify({'status': 'success', 'message': 'Model trained successfully'})
    except Exception as e:
        return jsonify({'status': 'error', 'message': str(e)}), 400

@app.route('/scrape', methods=['POST'])
def scrape_data():
    data = request.json
    url = data.get('url')
    mode = data.get('mode', 'similar')  # 'similar' or 'exact'
    
    try:
        if mode == 'similar':
            result = scraper.get_result_similar(url)
        else:
            result = scraper.get_result_exact(url)
        
        return jsonify({'status': 'success', 'data': result})
    except Exception as e:
        return jsonify({'status': 'error', 'message': str(e)}), 400

if __name__ == '__main__':
    app.run(debug=True)
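
With the server running locally (Flask's development server defaults to port 5000), the two endpoints can be exercised with a small client; the URLs and sample data below are placeholders:

import requests

BASE = 'http://127.0.0.1:5000'

# Train the scraper on a sample page
resp = requests.post(f'{BASE}/train', json={
    'url': 'https://example.com',
    'wanted_list': ['sample data'],
})
print(resp.json())

# Scrape another page with the learned rules
resp = requests.post(f'{BASE}/scrape', json={
    'url': 'https://example.com/other-page',
    'mode': 'similar',
})
print(resp.json())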

Database Integration

import json
import sqlite3

from autoscraper import AutoScraper

class ScrapingDatabase:
    def __init__(self, db_path='scraping_data.db'):
        self.db_path = db_path
        self.init_database()
    
    def init_database(self):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS scraped_data (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT NOT NULL,
                data TEXT NOT NULL,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        
        conn.commit()
        conn.close()
    
    def save_scraped_data(self, url, data):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        cursor.execute(
            'INSERT INTO scraped_data (url, data) VALUES (?, ?)',
            (url, json.dumps(data))
        )
        
        conn.commit()
        conn.close()
    
    def get_scraped_data(self, url=None):
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        
        if url:
            cursor.execute('SELECT * FROM scraped_data WHERE url = ?', (url,))
        else:
            cursor.execute('SELECT * FROM scraped_data ORDER BY timestamp DESC')
        
        results = cursor.fetchall()
        conn.close()
        
        return results

# Usage
db = ScrapingDatabase()
scraper = AutoScraper()

# Train and scrape
scraper.build('https://example.com', ['sample data'])
result = scraper.get_result_similar('https://another-site.com')

# Save to database
db.save_scraped_data('https://another-site.com', result)

Conclusion

AutoScraper is a powerful tool that dramatically simplifies web scraping. Because it learns scraping rules automatically from sample data, you can extract the data you want without writing complex CSS selectors or XPath expressions.

Key Advantages

  • Low Learning Curve: Usable without deep web scraping knowledge
  • Flexibility: Applicable to various websites and data types
  • Reusability: Reuse learned models across multiple sites
  • Scalability: Suitable for large-scale scraping operations

Use Cases

  • Data Collection: News, product information, price data collection
  • Monitoring: Website change tracking
  • Research: Web data analysis and research
  • Business: Competitor analysis, market research

By leveraging AutoScraper, you can significantly reduce the complexity of web scraping and focus more time on data analysis and insight generation.

Consider AutoScraper when starting your web scraping projects. You can build powerful scraping systems with just a few lines of code.

