Complete Guide to mdream: Convert Any Website to Clean Markdown for AI & LLMs
⏱️ Estimated Reading Time: 8 minutes
Introduction
In the era of AI and Large Language Models (LLMs), converting web content into clean, structured markdown has become increasingly important. mdream is a powerful Node.js library that transforms any website into clean markdown format, making it perfect for AI applications, LLM context generation, and content processing workflows.
Created by Harlan Wilton, mdream stands out with its streaming capabilities, extensive plugin system, and specialized presets for different use cases. Whether you’re building AI applications, creating documentation, or processing web content at scale, mdream provides the tools you need.
What is mdream?
mdream is a comprehensive HTML-to-markdown conversion library that offers:
- Clean Markdown Output: Produces well-structured markdown perfect for AI consumption
- Streaming Support: Efficiently handles large web pages with streaming conversion
- Plugin System: Extensible architecture for custom processing needs
- Framework Integration: Built-in support for Vite and Nuxt.js
- AI-Focused Features: Generates
llms.txt
files for improved AI discoverability
Installation and Setup
Basic Installation
# Using npm
npm install mdream
# Using yarn
yarn add mdream
# Using pnpm
pnpm add mdream
Framework-Specific Installation
For Vite projects:
pnpm install @mdream/vite
For Nuxt.js projects:
pnpm add @mdream/nuxt
Core API Usage
Converting Existing HTML
The simplest way to use mdream is with the htmlToMarkdown
function:
import { htmlToMarkdown } from 'mdream'
// Basic conversion
const html = '<h1>Hello World</h1><p>This is a paragraph.</p>'
const markdown = htmlToMarkdown(html)
console.log(markdown)
// Output:
// # Hello World
//
// This is a paragraph.
Streaming HTML Conversion
For better performance with large websites, use streamHtmlToMarkdown
:
import { streamHtmlToMarkdown } from 'mdream'
async function convertWebsite(url) {
const response = await fetch(url)
const htmlStream = response.body
const markdownGenerator = streamHtmlToMarkdown(htmlStream, {
origin: url
})
let fullMarkdown = ''
for await (const chunk of markdownGenerator) {
fullMarkdown += chunk
// Process chunks as they arrive
console.log('Received chunk:', chunk.length, 'characters')
}
return fullMarkdown
}
// Usage
const markdown = await convertWebsite('https://example.com')
Pure HTML Parsing
For applications that need DOM parsing without markdown conversion:
import { parseHtml } from 'mdream'
const html = '<div><h1>Title</h1><p>Content</p></div>'
const { events, remainingHtml } = parseHtml(html)
// Process DOM events
events.forEach((event) => {
if (event.type === 'enter' && event.node.type === 'element') {
console.log('Entering element:', event.node.tagName)
}
if (event.type === 'exit' && event.node.type === 'element') {
console.log('Exiting element:', event.node.tagName)
}
})
Plugin System Deep Dive
mdream’s plugin system is one of its most powerful features, allowing you to customize every aspect of the conversion process.
Built-in Plugins
1. Filter Plugin
Exclude or include specific elements:
import { htmlToMarkdown } from 'mdream'
import { filterPlugin } from 'mdream/plugins'
const html = `
<div>
<nav>Navigation menu</nav>
<main>
<h1>Main Content</h1>
<p>Important content here</p>
</main>
<footer>Footer content</footer>
</div>
`
const markdown = htmlToMarkdown(html, {
plugins: [
filterPlugin({
exclude: ['nav', 'footer', '.sidebar', '#advertisement']
})
]
})
// Only main content will be converted
2. Frontmatter Plugin
Generate YAML frontmatter from HTML meta tags:
import { frontmatterPlugin } from 'mdream/plugins'
const html = `
<html>
<head>
<title>My Blog Post</title>
<meta name="description" content="This is a great blog post">
<meta name="author" content="John Doe">
</head>
<body>
<h1>Content</h1>
</body>
</html>
`
const markdown = htmlToMarkdown(html, {
plugins: [frontmatterPlugin()]
})
// Output includes frontmatter:
// ---
// title: My Blog Post
// description: This is a great blog post
// author: John Doe
// ---
// # Content
3. Isolation Plugin
Extract main content automatically:
import { isolateMainPlugin } from 'mdream/plugins'
const markdown = htmlToMarkdown(html, {
plugins: [isolateMainPlugin()]
})
// Automatically finds and converts only the main content area
4. Tailwind Plugin
Convert Tailwind CSS classes to markdown formatting:
import { tailwindPlugin } from 'mdream/plugins'
const html = '<p class="font-bold text-red-500">Important text</p>'
const markdown = htmlToMarkdown(html, {
plugins: [tailwindPlugin()]
})
// Output: **Important text**
Creating Custom Plugins
Build your own plugins for specific needs:
import { createPlugin } from 'mdream/plugins'
import type { ElementNode, TextNode } from 'mdream'
// Example: Add emoji to headings
const emojiHeadingPlugin = createPlugin({
onNodeEnter(node: ElementNode) {
if (node.name === 'h1') {
return '🚀 '
}
if (node.name === 'h2') {
return '📚 '
}
if (node.name === 'h3') {
return '✨ '
}
},
processTextNode(textNode: TextNode) {
// Highlight important text
if (textNode.parent?.attributes?.class?.includes('highlight')) {
return {
content: `**${textNode.value}**`,
skip: false
}
}
}
})
// Usage
const markdown = htmlToMarkdown(html, {
plugins: [emojiHeadingPlugin]
})
Advanced Plugin Example: Content Extraction
import { extractionPlugin } from 'mdream'
const extractedData = {}
const dataExtractionPlugin = extractionPlugin({
'h1, h2, h3': (element, state) => {
if (!extractedData.headings) extractedData.headings = []
extractedData.headings.push({
level: element.tagName,
text: element.textContent,
depth: state.depth
})
},
'img[alt]': (element, state) => {
if (!extractedData.images) extractedData.images = []
extractedData.images.push({
src: element.attributes.src,
alt: element.attributes.alt,
context: state.options.origin
})
},
'a[href]': (element, state) => {
if (!extractedData.links) extractedData.links = []
extractedData.links.push({
href: element.attributes.href,
text: element.textContent,
isExternal: element.attributes.href.startsWith('http')
})
}
})
// Convert and extract data simultaneously
const markdown = htmlToMarkdown(html, {
plugins: [dataExtractionPlugin],
origin: 'https://example.com'
})
console.log('Extracted data:', extractedData)
Presets for Common Use Cases
Minimal Preset
Perfect for AI applications where you want clean, token-efficient output:
import { withMinimalPreset } from 'mdream/preset/minimal'
const options = withMinimalPreset({
origin: 'https://example.com'
})
const markdown = htmlToMarkdown(html, options)
The minimal preset includes:
isolateMainPlugin()
- Extracts main contentfrontmatterPlugin()
- Generates YAML frontmattertailwindPlugin()
- Converts Tailwind classesfilterPlugin()
- Removes non-content elements
CLI Usage with Presets
# Use minimal preset via CLI
curl -s https://example.com | npx mdream --preset minimal --origin https://example.com
# Save to file
curl -s https://example.com | npx mdream --preset minimal --origin https://example.com > output.md
Framework Integration
Vite Integration
Add automatic markdown generation to your Vite project:
// vite.config.ts
import { defineConfig } from 'vite'
import MdreamVite from '@mdream/vite'
export default defineConfig({
plugins: [
MdreamVite({
// Custom options
plugins: [
// Your custom plugins
]
})
]
})
Now you can access any route with .md
extension:
/about.html
→/about.md
/blog/post.html
→/blog/post.md
Nuxt.js Integration
Enable automatic markdown routes and LLMs.txt generation:
// nuxt.config.ts
export default defineNuxtConfig({
modules: [
'@mdream/nuxt'
],
mdream: {
// Custom configuration
preset: 'minimal',
plugins: [
// Additional plugins
]
}
})
Features:
- Automatic
.md
route variants llms.txt
andllms-full.txt
generation during static generation- SEO-friendly markdown versions of your pages
Real-World Examples
Example 1: Documentation Site Converter
import { htmlToMarkdown } from 'mdream'
import { filterPlugin, frontmatterPlugin, isolateMainPlugin } from 'mdream/plugins'
import fs from 'fs/promises'
async function convertDocumentationSite(urls) {
const results = []
for (const url of urls) {
try {
const response = await fetch(url)
const html = await response.text()
const markdown = htmlToMarkdown(html, {
origin: url,
plugins: [
isolateMainPlugin(),
frontmatterPlugin(),
filterPlugin({
exclude: [
'nav', 'footer', '.sidebar',
'.advertisement', '.social-share'
]
})
]
})
// Extract filename from URL
const filename = url.split('/').pop() || 'index'
await fs.writeFile(`docs/${filename}.md`, markdown)
results.push({ url, filename, success: true })
} catch (error) {
results.push({ url, success: false, error: error.message })
}
}
return results
}
// Usage
const urls = [
'https://docs.example.com/getting-started',
'https://docs.example.com/api-reference',
'https://docs.example.com/examples'
]
const results = await convertDocumentationSite(urls)
console.log('Conversion results:', results)
Example 2: Blog Content Aggregator
import { htmlToMarkdown } from 'mdream'
import { extractionPlugin, filterPlugin } from 'mdream/plugins'
class BlogAggregator {
constructor() {
this.articles = []
}
async processArticle(url) {
const response = await fetch(url)
const html = await response.text()
const articleData = {}
const extractionPluginInstance = extractionPlugin({
'h1': (element) => {
articleData.title = element.textContent
},
'[datetime], .published-date, .date': (element) => {
articleData.publishedDate = element.textContent || element.attributes.datetime
},
'.author, [rel="author"]': (element) => {
articleData.author = element.textContent
},
'img[src]': (element) => {
if (!articleData.images) articleData.images = []
articleData.images.push({
src: element.attributes.src,
alt: element.attributes.alt || ''
})
}
})
const markdown = htmlToMarkdown(html, {
origin: url,
plugins: [
extractionPluginInstance,
filterPlugin({
exclude: [
'nav', 'footer', '.comments',
'.social-share', '.advertisement'
]
})
]
})
return {
url,
markdown,
metadata: articleData,
wordCount: markdown.split(' ').length,
processedAt: new Date().toISOString()
}
}
async aggregateBlogs(urls) {
const results = await Promise.allSettled(
urls.map(url => this.processArticle(url))
)
return results
.filter(result => result.status === 'fulfilled')
.map(result => result.value)
}
}
// Usage
const aggregator = new BlogAggregator()
const blogUrls = [
'https://blog.example.com/post-1',
'https://blog.example.com/post-2',
'https://blog.example.com/post-3'
]
const articles = await aggregator.aggregateBlogs(blogUrls)
console.log(`Processed ${articles.length} articles`)
Example 3: AI Training Data Preparation
import { htmlToMarkdown } from 'mdream'
import { withMinimalPreset } from 'mdream/preset/minimal'
class AIDataPreparator {
constructor(options = {}) {
this.maxTokens = options.maxTokens || 4000
this.outputDir = options.outputDir || './ai-training-data'
}
estimateTokens(text) {
// Rough estimation: 1 token ≈ 4 characters
return Math.ceil(text.length / 4)
}
async prepareTrainingData(urls) {
const trainingExamples = []
for (const url of urls) {
try {
const response = await fetch(url)
const html = await response.text()
const options = withMinimalPreset({
origin: url
})
const markdown = htmlToMarkdown(html, options)
const tokenCount = this.estimateTokens(markdown)
if (tokenCount <= this.maxTokens) {
trainingExamples.push({
source: url,
content: markdown,
tokens: tokenCount,
type: 'full_page'
})
} else {
// Split into chunks if too large
const chunks = this.chunkContent(markdown, this.maxTokens)
chunks.forEach((chunk, index) => {
trainingExamples.push({
source: `${url}#chunk-${index}`,
content: chunk,
tokens: this.estimateTokens(chunk),
type: 'chunk'
})
})
}
} catch (error) {
console.error(`Failed to process ${url}:`, error.message)
}
}
return trainingExamples
}
chunkContent(content, maxTokens) {
const chunks = []
const paragraphs = content.split('\n\n')
let currentChunk = ''
for (const paragraph of paragraphs) {
const potentialChunk = currentChunk + '\n\n' + paragraph
if (this.estimateTokens(potentialChunk) > maxTokens && currentChunk) {
chunks.push(currentChunk.trim())
currentChunk = paragraph
} else {
currentChunk = potentialChunk
}
}
if (currentChunk.trim()) {
chunks.push(currentChunk.trim())
}
return chunks
}
async saveTrainingData(trainingExamples, filename = 'training_data.jsonl') {
const jsonlContent = trainingExamples
.map(example => JSON.stringify(example))
.join('\n')
await fs.writeFile(filename, jsonlContent, 'utf8')
console.log(`Saved ${trainingExamples.length} training examples to ${filename}`)
}
}
// Usage
const preparator = new AIDataPreparator({
maxTokens: 3000,
outputDir: './ai-data'
})
const urls = [
'https://docs.framework.com/guide',
'https://docs.framework.com/api',
'https://docs.framework.com/examples'
]
const trainingData = await preparator.prepareTrainingData(urls)
await preparator.saveTrainingData(trainingData)
Performance Optimization
Streaming for Large Sites
import { streamHtmlToMarkdown } from 'mdream'
async function efficientConversion(url) {
const response = await fetch(url)
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`)
}
if (!response.body) {
throw new Error('No response body available for streaming')
}
const markdownGenerator = streamHtmlToMarkdown(response.body, {
origin: url,
plugins: [
// Only essential plugins for better performance
]
})
let result = ''
let chunkCount = 0
for await (const chunk of markdownGenerator) {
result += chunk
chunkCount++
// Optional: Progress reporting
if (chunkCount % 10 === 0) {
console.log(`Processed ${chunkCount} chunks...`)
}
}
return result
}
Batch Processing
import pLimit from 'p-limit'
const limit = pLimit(5) // Process 5 URLs concurrently
async function batchConvertUrls(urls, options = {}) {
const results = await Promise.allSettled(
urls.map(url =>
limit(() => convertUrl(url, options))
)
)
const successful = results
.filter(result => result.status === 'fulfilled')
.map(result => result.value)
const failed = results
.filter(result => result.status === 'rejected')
.map((result, index) => ({
url: urls[index],
error: result.reason.message
}))
return { successful, failed }
}
Best Practices
1. Choose the Right Approach
// For small HTML snippets
const markdown = htmlToMarkdown(html)
// For large web pages
const markdown = await streamHtmlToMarkdown(htmlStream, { origin })
// For DOM analysis without markdown
const { events } = parseHtml(html)
2. Plugin Selection
// For AI/LLM applications
const options = withMinimalPreset({ origin })
// For documentation conversion
const options = {
plugins: [
isolateMainPlugin(),
frontmatterPlugin(),
filterPlugin({ exclude: ['nav', 'footer'] })
]
}
// For content analysis
const options = {
plugins: [
extractionPlugin({
'h1, h2, h3': extractHeadings,
'img': extractImages,
'a[href]': extractLinks
})
]
}
3. Error Handling
async function robustConversion(url) {
try {
const response = await fetch(url, {
timeout: 10000, // 10 second timeout
headers: {
'User-Agent': 'mdream-converter/1.0'
}
})
if (!response.ok) {
throw new Error(`HTTP ${response.status}`)
}
const contentType = response.headers.get('content-type')
if (!contentType?.includes('text/html')) {
throw new Error(`Unexpected content type: ${contentType}`)
}
return await streamHtmlToMarkdown(response.body, {
origin: url
})
} catch (error) {
console.error(`Failed to convert ${url}:`, error.message)
return null
}
}
Troubleshooting
Common Issues
- Memory Issues with Large Pages
// Use streaming instead of loading entire HTML const stream = streamHtmlToMarkdown(htmlStream, options)
- Malformed HTML
// mdream handles malformed HTML gracefully const markdown = htmlToMarkdown(malformedHtml, { // Results may vary based on HTML structure })
- Missing Content
// Check if content is in <main> tags const options = { plugins: [isolateMainPlugin()] }
- Too Much Noise in Output
// Use aggressive filtering const options = { plugins: [ filterPlugin({ exclude: [ 'nav', 'footer', 'aside', '.sidebar', '.advertisement', '.social', '.comments' ] }) ] }
Conclusion
mdream is a powerful and flexible library that makes HTML-to-markdown conversion straightforward and customizable. Its streaming capabilities, extensive plugin system, and AI-focused features make it an excellent choice for modern web content processing workflows.
Whether you’re building AI applications, creating documentation systems, or processing web content at scale, mdream provides the tools and flexibility you need to get clean, structured markdown output.
Key takeaways:
- Use streaming for large websites and better performance
- Leverage the plugin system for custom processing needs
- Choose appropriate presets for your use case
- Consider framework integration for web applications
- Implement proper error handling for production use
Start experimenting with mdream today and see how it can streamline your HTML-to-markdown conversion workflows!