Lectito
A Rust library and CLI for extracting readable content from web pages.
What is Lectito?
Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. It identifies and extracts the main article content from web pages, removing navigation, sidebars, advertisements, and other clutter.
Features
- Content Extraction: Automatically identifies the main article content
- Metadata Extraction: Pulls title, author, date, excerpt, and language
- Output Formats: HTML, Markdown, plain text, and JSON
- URL Fetching: Built-in async HTTP client with timeout support
- CLI: Simple command-line interface for quick extractions
- Site Configuration: Optional XPath-based extraction rules for difficult sites
Use Cases
- Web Scraping: Extract clean article content from web pages
- AI Agents: Feed readable text to language models
- Content Analysis: Analyze article text without HTML noise
- Archival: Save clean copies of web content
- CLI: Quick article extraction from the terminal
Quick Links
- Installation: See the Installation Guide
- CLI Usage: See the CLI Usage Guide
- Library Usage: See the Basic Usage Guide
- API Reference: See docs.rs/lectito
Quick Start
CLI
# Install
cargo install lectito-cli
# Extract from URL
lectito https://example.com/article
# Extract from local file
lectito article.html
# Pipe from stdin
curl https://example.com | lectito -
Library
use lectito_core::parse;
let html = r#"<html><body><article><h1>Title</h1><p>Content</p></article></body></html>"#;
let article = parse(html)?;
println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);
About the Name
"Lectito" is derived from the Latin legere (to read) and lectio (a reading or selection).
Lectito aims to select and present readable content from the chaos of the modern web.
Installation
Lectito provides both a CLI tool and a Rust library. Install whichever fits your needs.
CLI Installation
From crates.io
The easiest way to install the CLI is via cargo:
cargo install lectito-cli
This installs the lectito binary in your cargo bin directory (typically ~/.cargo/bin).
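If your shell can't find lectito afterwards, add that directory to your PATH. A minimal sketch for a POSIX shell (adjust for your shell's config file):
# Add cargo's bin directory to PATH for the current session
export PATH="$HOME/.cargo/bin:$PATH"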
From Source
# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito
# Build and install
cargo install --path crates/cli
Pre-built Binaries
Pre-built binaries are available on the GitHub Releases page for Linux, macOS, and Windows.
Download the appropriate binary for your platform and place it in your PATH.
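For example, on Linux (the archive name below is hypothetical; check the Releases page for the actual asset names):
# Extract the downloaded archive and move the binary onto your PATH
tar -xzf lectito-x86_64-unknown-linux-gnu.tar.gz
mv lectito ~/.local/bin/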
Verify Installation
lectito --version
You should see version information printed.
Library Installation
Add to your Cargo.toml:
[dependencies]
lectito-core = "0.1"
Then run cargo build to fetch and compile the dependency.
Feature Flags
The library has several optional features:
[dependencies]
lectito-core = { version = "0.1", features = ["fetch", "markdown"] }
| Feature | Default | Description |
|---|---|---|
| fetch | Yes | Enable URL fetching with reqwest |
| markdown | Yes | Enable Markdown output format |
| siteconfig | Yes | Enable site configuration support |
If you don't need URL fetching (e.g., you have your own HTTP client), disable the default features:
[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }
Development Build
To build from source for development:
# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito
# Build the workspace
cargo build --release
# The CLI binary will be at target/release/lectito
Next Steps
- Quick Start Guide - Get started with basic usage
- CLI Usage - Learn CLI commands and options
- Library Guide - Use Lectito as a library
Quick Start
Get started with Lectito in minutes.
CLI Quick Start
Basic Usage
Extract content from a URL:
lectito https://example.com/article
Extract from a local file:
lectito article.html
Extract from stdin:
curl https://example.com | lectito -
Save to File
lectito https://example.com/article -o article.md
Change Output Format
# JSON output
lectito https://example.com/article --format json
# Plain text output
lectito https://example.com/article --format text
Set Timeout
For slow-loading sites:
lectito https://example.com/article --timeout 60
Library Quick Start
Add Dependency
Add to Cargo.toml:
[dependencies]
lectito-core = "0.1"
Parse HTML String
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = r#"
<!DOCTYPE html>
<html>
<head><title>My Article</title></head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the article content with plenty of text.</p>
</article>
</body>
</html>
"#;
let article = parse(html)?;
println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);
Ok(())
}
Fetch and Parse URL
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let article = fetch_and_parse("https://example.com/article").await?;
println!("Title: {:?}", article.metadata.title);
println!("Word count: {}", article.word_count);
Ok(())
}
Convert to Different Formats
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<h1>Title</h1><p>Content here</p>";
let article = parse(html)?;
// Markdown with frontmatter
let markdown = article.to_markdown()?;
println!("{}", markdown);
// Plain text
let text = article.to_text();
println!("{}", text);
// Structured JSON
let json = article.to_json()?;
println!("{}", json);
Ok(())
}
Common Patterns
Handle Errors
use lectito_core::{parse, LectitoError};
match parse("<html>...</html>") {
Ok(article) => println!("Title: {:?}", article.metadata.title),
Err(LectitoError::NotReaderable { score, threshold }) => {
eprintln!("Content not readable: score {} < threshold {}", score, threshold);
}
Err(e) => eprintln!("Error: {}", e),
}
Configure Extraction
use lectito_core::{Readability, ReadabilityConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(500)
.preserve_images(true)
.build();
let reader = Readability::with_config(config);
let article = reader.parse("<html>...</html>")?;
Ok(())
}
What's Next?
- CLI Usage - Full CLI command reference
- Library Guide - In-depth library documentation
- Configuration - Advanced configuration options
- Concepts - How the algorithm works
CLI Usage
Complete reference for the lectito command-line tool.
Basic Syntax
lectito [OPTIONS] <INPUT>
The INPUT can be:
- A URL (starting with http:// or https://)
- A local file path
- A single dash (-) to read from stdin
Examples
URL Extraction
lectito https://example.com/article
Local File
lectito article.html
Stdin Pipe
curl https://example.com | lectito -
cat page.html | lectito -
wget -qO- https://example.com | lectito -
Options
-o, --output <FILE>
Write output to a file instead of stdout.
lectito https://example.com/article -o article.md
-f, --format <FORMAT>
Specify output format. Available formats:
| Format | Description |
|---|---|
| markdown or md | Markdown (default) |
| json | Structured JSON |
| text or txt | Plain text |
| html | Cleaned HTML |
lectito https://example.com/article -f json
--timeout <SECONDS>
HTTP request timeout in seconds (default: 30).
lectito https://example.com/article --timeout 60
--user-agent <USER_AGENT>
Custom User-Agent header.
lectito https://example.com/article --user-agent "MyBot/1.0"
--config <PATH>
Path to site configuration file (TOML format).
lectito https://example.com/article --config site-config.toml
-v, --verbose
Enable verbose debug logging.
lectito https://example.com/article -v
-h, --help
Display help information.
lectito --help
-V, --version
Display version information.
lectito --version
Common Workflows
Extract and Save Article
lectito https://example.com/article -o articles/article.md
Batch Processing Multiple URLs
i=0
while IFS= read -r url; do
  i=$((i+1))
  lectito "$url" -o "articles/article-$i.md"
done < urls.txt
Extract to JSON for Processing
lectito https://example.com/article --format json | jq '.metadata.title'
Extract from Multiple Files
for file in articles/*.html; do
lectito "$file" -o "processed/$(basename "$file" .html).md"
done
Custom Timeout for Slow Sites
lectito https://slow-site.com/article --timeout 120
Output Formats
Markdown (Default)
Output includes TOML frontmatter with metadata (when --frontmatter is used):
+++
title = "Article Title"
author = "John Doe"
date = "2025-01-17"
excerpt = "A brief description..."
+++
# Article Title
Article content here...
JSON
Structured output with all metadata:
{
"metadata": {
"title": "Article Title",
"author": "John Doe",
"date": "2025-01-17",
"excerpt": "A brief description..."
},
"content": "<div>...</div>",
"text_content": "Article content here...",
"word_count": 500
}
Plain Text
Just the article text without formatting:
Article Title
Article content here...
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (invalid URL, network failure, etc.) |
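Because of this convention, shell scripts can branch on the exit code directly. A small sketch:
# Fall back gracefully when extraction fails
if lectito "$url" -o article.md; then
  echo "saved article.md"
else
  echo "extraction failed for $url" >&2
fi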
Error Handling
The CLI will print error messages to stderr:
lectito https://invalid-domain-xyz.com
# Error: failed to fetch URL: dns error: failed to lookup address information
For content that isn't readable:
lectito https://example.com/page
# Error: content not readable: score 15.2 < threshold 20.0
Tips
- Use timeouts: Set appropriate timeouts to avoid hanging
- Batch operations: Process multiple URLs in parallel
- Save to file: Use -o to avoid terminal rendering overhead
- JSON for parsing: Use JSON output when processing with other tools
Next Steps
- Configuration - Advanced configuration options
- Output Formats - Detailed format documentation
- Concepts - Understanding the algorithm
Basic Usage
Learn the fundamentals of using Lectito as a library.
Simple Parsing
The easiest way to extract content is with the parse function:
use lectito_core::{parse, Article};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = r#"
<!DOCTYPE html>
<html>
<head><title>My Article</title></head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the article content.</p>
</article>
</body>
</html>
"#;
let article: Article = parse(html)?;
println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);
Ok(())
}
Fetching and Parsing
For URLs, use the fetch_and_parse function:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let url = "https://example.com/article";
let article = fetch_and_parse(url).await?;
println!("Title: {:?}", article.metadata.title);
println!("Author: {:?}", article.metadata.author);
println!("Word count: {}", article.word_count);
Ok(())
}
Working with the Article
The Article struct contains all extracted information:
Metadata
use lectito_core::parse;
let html = "<html>...</html>";
let article = parse(html)?;
// Access metadata
if let Some(title) = article.metadata.title {
println!("Title: {}", title);
}
if let Some(author) = article.metadata.author {
println!("Author: {}", author);
}
if let Some(date) = article.metadata.published_date {
println!("Published: {}", date);
}
// Get excerpt
if let Some(excerpt) = article.metadata.excerpt {
println!("Excerpt: {}", excerpt);
}
Content Access
use lectito_core::parse;
let html = "<html>...</html>";
let article = parse(html)?;
// Get cleaned HTML
let html_content = &article.content;
// Get plain text
let text = article.to_text();
// Get Markdown
let markdown = article.to_markdown()?;
// Get JSON
let json = article.to_json()?;
Readability API
For more control, use the Readability API:
use lectito_core::{Readability, ReadabilityConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
// Use default config
let reader = Readability::new();
let article = reader.parse(html)?;
// Or with custom config
let config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(500)
.build();
let reader = Readability::with_config(config);
let article = reader.parse(html)?;
Ok(())
}
Error Handling
Lectito returns Result<T, LectitoError>. Handle errors appropriately:
use lectito_core::{parse, LectitoError};
fn extract_article(html: &str) -> Result<String, String> {
match parse(html) {
Ok(article) => Ok(article.to_markdown().unwrap_or_default()),
Err(LectitoError::NotReaderable { score, threshold }) => {
Err(format!("Content not readable: score {} < threshold {}", score, threshold))
}
Err(LectitoError::InvalidUrl(msg)) => {
Err(format!("Invalid URL: {}", msg))
}
Err(e) => Err(format!("Extraction failed: {}", e)),
}
}
Common Patterns
Parse with URL Context
When you have the URL, provide it for better relative link resolution:
use lectito_core::{parse_with_url, Article};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let url = "https://example.com/article";
let article: Article = parse_with_url(html, url)?;
// Relative links are now resolved correctly
Ok(())
}
Check if Content is Readable
Before parsing, check if content meets readability thresholds:
use lectito_core::is_probably_readable;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
if is_probably_readable(html) {
println!("Content is readable");
} else {
println!("Content may not be readable");
}
Ok(())
}
Working with Documents
For lower-level DOM manipulation:
use lectito_core::{Document, Element};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html><body><p>Hello</p></body></html>";
let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("p")?;
for element in elements {
println!("Text: {}", element.text());
}
Ok(())
}
Integrations
With reqwest
use lectito_core::parse;
use reqwest::Client;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let response = client.get("https://example.com/article")
.send()
.await?;
let html = response.text().await?;
let article = parse(&html)?;
println!("Title: {:?}", article.metadata.title);
Ok(())
}
With Scraper
If you're already using scraper, you can integrate:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
// Work with the article's HTML content
println!("Cleaned HTML: {}", article.content);
Ok(())
}
Next Steps
- Configuration - Advanced configuration options
- Async vs Sync - Understanding async APIs
- Output Formats - Detailed format documentation
Configuration
Customize Lectito's extraction behavior with configuration options.
ReadabilityConfig
The ReadabilityConfig struct controls extraction parameters. Use the builder pattern:
use lectito_core::{Readability, ReadabilityConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(500)
.preserve_images(true)
.build();
let reader = Readability::with_config(config);
let article = reader.parse("<html>...</html>")?;
Ok(())
}
Configuration Options
min_score
Minimum readability score for content to be considered extractable (default: 20.0).
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score(25.0)
.build();
Higher values are more strict. If content scores below this threshold, parsing returns LectitoError::NotReaderable.
char_threshold
Minimum character count for content to be considered (default: 500).
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.char_threshold(1000)
.build();
Raise this to avoid extracting boilerplate such as navigation; lower it when targeting short pages or brief blog posts.
preserve_images
Whether to preserve images in the extracted content (default: true).
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.preserve_images(false)
.build();
min_content_length
Minimum length for text content (default: 140).
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_content_length(200)
.build();
min_score_threshold
Threshold for minimum score during scoring (default: 20.0).
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score_threshold(25.0)
.build();
FetchConfig
Configure HTTP fetching behavior:
use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let fetch_config = FetchConfig {
timeout: 60,
user_agent: "MyBot/1.0".to_string(),
..Default::default()
};
let read_config = ReadabilityConfig::builder()
.min_score(25.0)
.build();
let article = fetch_and_parse_with_config(
"https://example.com/article",
&fetch_config,
&read_config
).await?;
Ok(())
}
FetchConfig Options
| Field | Type | Default | Description |
|---|---|---|---|
| timeout | u64 | 30 | Request timeout in seconds |
| user_agent | String | "Lectito/..." | User-Agent header value |
Default Values
impl Default for ReadabilityConfig {
fn default() -> Self {
Self {
min_score: 20.0,
char_threshold: 500,
preserve_images: true,
min_content_length: 140,
min_score_threshold: 20.0,
}
}
}
Configuration Examples
Strict Extraction
For high-quality content only:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score(30.0)
.char_threshold(1000)
.min_content_length(300)
.build();
Lenient Extraction
For extracting from short pages:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score(10.0)
.char_threshold(200)
.min_content_length(50)
.build();
Text-Only Extraction
Remove images and multimedia:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.preserve_images(false)
.build();
Custom Fetch Settings
Long timeout with custom user agent:
use lectito_core::FetchConfig;
let config = FetchConfig {
timeout: 120,
user_agent: "MyBot/1.0 (+https://example.com/bot)".to_string(),
};
Site Configuration
For sites that require custom extraction rules, use the site configuration feature (requires siteconfig feature):
[dependencies]
lectito-core = { version = "0.1", features = ["siteconfig"] }
Site configuration uses the FTR (FiveFilters Full-Text RSS) site config format. See How It Works for details on site-specific extraction.
Next Steps
- Async vs Sync - Understanding async APIs
- Output Formats - Detailed format documentation
- Scoring Algorithm - How scores are calculated
Async vs Sync
Understanding Lectito's async and synchronous APIs.
Overview
Lectito provides both synchronous and asynchronous APIs:
| Function | Async/Sync | Use Case |
|---|---|---|
| parse() | Sync | Parse HTML from string |
| parse_with_url() | Sync | Parse with URL context |
| fetch_and_parse() | Async | Fetch from URL then parse |
| fetch_url() | Async | Fetch HTML from URL |
When to Use Each
Use Sync APIs When
- You already have the HTML as a string
- You're using your own HTTP client
- Performance is not critical
- You're integrating into synchronous code
Use Async APIs When
- You need to fetch from URLs
- You're already using async/await
- You want concurrent fetches
- Performance matters for network operations
Synchronous Parsing
Parse HTML that you already have:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
Ok(())
}
Asynchronous Fetching
Fetch and parse in one operation:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let url = "https://example.com/article";
let article = fetch_and_parse(url).await?;
Ok(())
}
Manual Fetch and Parse
Use your own HTTP client:
use lectito_core::parse;
use reqwest::Client;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let response = client.get("https://example.com/article")
.send()
.await?;
let html = response.text().await?;
let article = parse(&html)?;
Ok(())
}
Concurrent Fetches
Fetch multiple articles concurrently:
use lectito_core::fetch_and_parse;
use futures::future::join_all;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let urls = vec![
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3",
];
let futures: Vec<_> = urls.into_iter()
.map(|url| fetch_and_parse(url))
.collect();
let articles = join_all(futures).await;
for article in articles {
match article {
Ok(a) => println!("Got: {:?}", a.metadata.title),
Err(e) => eprintln!("Error: {}", e),
}
}
Ok(())
}
Batch Processing
Process URLs with concurrency limits:
use futures::stream::{self, StreamExt};
use lectito_core::fetch_and_parse;
async fn process_urls(urls: Vec<String>) -> Result<(), Box<dyn std::error::Error>> {
    // Turn the URL list into a stream of in-flight fetches,
    // running at most 5 requests concurrently
    let mut stream = stream::iter(urls)
        .map(|url| async move { fetch_and_parse(&url).await })
        .buffer_unordered(5);
    while let Some(article) = stream.next().await {
        println!("Processed: {:?}", article?.metadata.title);
    }
    Ok(())
}
Sync Code in Async Context
If you need to use sync parsing in async code:
use lectito_core::parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Fetch with your async HTTP client
let html = fetch_html().await?;
// Parse is sync, but that's fine in async context
let article = parse(&html)?;
Ok(())
}
async fn fetch_html() -> Result<String, Box<dyn std::error::Error>> {
// Your async fetching logic
Ok(String::from("<html>...</html>"))
}
Performance Considerations
Parsing (Sync)
Parsing is CPU-bound and runs synchronously:
use lectito_core::parse;
use std::time::Instant;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let start = Instant::now();
let article = parse(html)?;
let duration = start.elapsed();
println!("Parsed in {:?}", duration);
Ok(())
}
Fetching (Async)
Fetching is I/O-bound and benefits from async:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let start = std::time::Instant::now();
let article = fetch_and_parse("https://example.com/article").await?;
let duration = start.elapsed();
println!("Fetched and parsed in {:?}", duration);
Ok(())
}
Choosing the Right Approach
| Scenario | Recommended Approach |
|---|---|
| Have HTML string | parse() (sync) |
| Need to fetch URL | fetch_and_parse() (async) |
| Custom HTTP client | Your client + parse() (sync) |
| Batch URL processing | fetch_and_parse() with concurrent futures |
| CLI tool | Depends on your runtime setup |
| Web server | fetch_and_parse() (async) for better throughput |
Feature Flags
To disable async features and reduce dependencies:
[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }
This removes reqwest and tokio dependencies. You'll need to fetch HTML yourself.
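Any blocking HTTP client can supply the HTML instead. A minimal sketch using the ureq crate (2.x), one option among many and not a Lectito dependency:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch synchronously, then hand the HTML to the sync parser
    let html = ureq::get("https://example.com/article")
        .call()?
        .into_string()?;
    let article = parse(&html)?;
    println!("Title: {:?}", article.metadata.title);
    Ok(())
}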
Next Steps
- Output Formats - Working with different output formats
- Configuration - Advanced configuration options
- Basic Usage - Core usage patterns
Output Formats
Work with different output formats: Markdown, JSON, text, and HTML.
Overview
The Article struct provides methods for converting to different formats:
| Method | Format | Requires Feature |
|---|---|---|
| to_markdown() | Markdown with frontmatter | markdown |
| to_json() | Structured JSON | Always available |
| to_text() | Plain text | Always available |
| content field | Cleaned HTML | Always available |
Markdown
Convert article to Markdown with TOML frontmatter:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let markdown = article.to_markdown()?;
println!("{}", markdown);
Ok(())
}
Output Format
+++
title = "Article Title"
author = "John Doe"
published_date = "2025-01-17"
excerpt = "A brief description of the article"
word_count = 500
+++
# Article Title
Article content here...
Paragraph with **bold** and _italic_ text.
Customizing Markdown
Use MarkdownFormatter for more control:
use lectito_core::{parse, MarkdownFormatter, MarkdownConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let config = MarkdownConfig {
frontmatter: true,
// Add more options as available
};
let formatter = MarkdownFormatter::new(config);
let markdown = formatter.format(&article)?;
println!("{}", markdown);
Ok(())
}
JSON
Get structured JSON with all metadata:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let json = article.to_json()?;
println!("{}", json);
Ok(())
}
JSON Structure
{
"metadata": {
"title": "Article Title",
"author": "John Doe",
"published_date": "2025-01-17",
"excerpt": "A brief description",
"language": "en"
},
"content": "<div>Cleaned HTML content...</div>",
"text_content": "Plain text content...",
"word_count": 500,
"readability_score": 35.5
}
Parsing JSON
use lectito_core::parse;
use serde_json::Value;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let json = article.to_json()?;
let value: Value = serde_json::from_str(&json)?;
println!("Title: {}", value["metadata"]["title"]);
Ok(())
}
Plain Text
Extract just the text content:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let text = article.to_text();
println!("{}", text);
Ok(())
}
Output Format
Plain text includes:
- Headings as lines with # prefixes
- Paragraphs separated by blank lines
- List items with * or 1. prefixes
- No HTML tags or other Markdown syntax
HTML
Access the cleaned HTML directly:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
// Cleaned HTML is in the `content` field
let cleaned_html = &article.content;
println!("{}", cleaned_html);
Ok(())
}
HTML Characteristics
The cleaned HTML:
- Removes clutter (navigation, sidebars, ads)
- Keeps main content structure
- Preserves images (if preserve_images is true)
- Removes most scripts and styles
- Maintains heading hierarchy
Choosing a Format
| Format | Use Case |
|---|---|
| Markdown | Blog posts, documentation, static sites |
| JSON | APIs, databases, further processing |
| Text | Analysis, indexing, simple display |
| HTML | Web display, further HTML processing |
Format Conversion Examples
Markdown to File
use lectito_core::parse;
use std::fs;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let markdown = article.to_markdown()?;
fs::write("article.md", markdown)?;
Ok(())
}
JSON for API Response
use lectito_core::parse;
async fn extract_article(body: String) -> Result<impl warp::Reply, warp::Rejection> {
    // Reject rather than panic on unreadable input
    let article = parse(&body).map_err(|_| warp::reject())?;
    let json = article.to_json().map_err(|_| warp::reject())?;
    // to_json() already returns a serialized JSON string, so send it as the
    // response body instead of re-encoding it with warp::reply::json
    Ok(warp::reply::with_header(
        json,
        "content-type",
        "application/json",
    ))
}
Text for Analysis
use lectito_core::parse;
fn analyze_text(html: &str) -> Result<(), Box<dyn std::error::Error>> {
let article = parse(html)?;
let text = article.to_text();
// Analyze word frequency
let words: Vec<&str> = text.split_whitespace().collect();
println!("Word count: {}", words.len());
// Count sentences (rough heuristic; ignore empty splits)
let sentences = text
    .split(&['.', '!', '?'][..])
    .filter(|s| !s.trim().is_empty())
    .count();
println!("Sentence count: {}", sentences);
Ok(())
}
HTML for Display
use lectito_core::parse;
fn display_article(html: &str) -> Result<(), Box<dyn std::error::Error>> {
let article = parse(html)?;
// Use in a template
let rendered = format!(
r#"
<!DOCTYPE html>
<html>
<head>
<title>{}</title>
</head>
<body>
<article>{}</article>
</body>
</html>
"#,
article.metadata.title.unwrap_or_default(),
article.content
);
println!("{}", rendered);
Ok(())
}
Next Steps
- Configuration - Advanced configuration options
- Basic Usage - Core usage patterns
- Concepts - Understanding the algorithm
How It Works
Understanding the Lectito content extraction pipeline.
Overview
Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.
Extraction Pipeline
The extraction process consists of four main stages:
HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article
1. Preprocessing
Clean the HTML to improve scoring accuracy:
- Remove unlikely content: scripts, styles, iframes, and hidden nodes
- Strip elements with unlikely class/ID patterns
- Preserve structure: maintain HTML hierarchy for accurate scoring
Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.
2. Scoring
Score each element based on content characteristics:
- Tag score: Different HTML tags have different base scores
- Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
- Content density: Length and punctuation indicate content quality
- Link density: A high proportion of links suggests navigation or metadata, not content
Why: Scoring identifies which elements are most likely to contain the main article content.
3. Selection
Select the highest-scoring element as the article candidate:
- Find element with highest score (bias toward semantic containers when scores are close)
- Check if score meets minimum threshold (default: 20.0)
- Check if content length meets minimum threshold (default: 500 chars)
- Return error if content doesn't meet thresholds
Why: Selection ensures we extract actual article content, not navigation or ads.
4. Post-processing
Clean up the selected content:
- Include sibling elements: adjacent content blocks and shared-parent headers
- Remove remaining clutter: ads, comments, social widgets
- Clean up whitespace: normalize spacing and formatting
- Preserve structure: maintain headings, paragraphs, lists
Why: Post-processing improves the quality of extracted content and includes related elements.
Data Flow
Input HTML
↓
parse_to_document()
↓
preprocess_html() → Cleaned HTML
↓
build_dom_tree() → DOM Tree
↓
calculate_score() → Scored Elements
↓
extract_content() → Selected Element
↓
postprocess_html() → Cleaned Content
↓
extract_metadata() → Metadata
↓
Article
Key Components
Document and Element
The Document and Element types wrap the scraper crate's HTML parsing:
use lectito_core::{Document, Element};
let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;
These provide a convenient API for DOM manipulation and element traversal.
Scoring Algorithm
The scoring algorithm combines multiple factors:
element_score = (base_tag_score
                 + class_id_weight
                 + content_density_score
                 + container_bonus)
                × (1 - link_density)
See Scoring Algorithm for details.
Metadata Extraction
A separate process extracts metadata from the HTML:
- Title: <h1>, <title>, or Open Graph tags
- Author: meta tags, bylines, schema.org
- Date: meta tags, time elements, schema.org
- Excerpt: meta description, first paragraph
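As a sketch of one such fallback, written against the scraper crate (which the Document type wraps); this mirrors the idea, not Lectito's internal code:
use scraper::{Html, Selector};
// Pull the Open Graph title, one of the title sources listed above
fn og_title(html: &str) -> Option<String> {
    let doc = Html::parse_document(html);
    let sel = Selector::parse(r#"meta[property="og:title"]"#).ok()?;
    doc.select(&sel)
        .next()
        .and_then(|el| el.value().attr("content"))
        .map(str::to_owned)
}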
Why This Approach
Content Over Structure
Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.
Heuristic-Based
The algorithm uses heuristics (rules of thumb) derived from analyzing thousands of articles. This makes it flexible and adaptable to different site designs.
Fallback Mechanism
For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.
Limitations
Sites That May Fail
- Very short pages (tweets, status updates)
- Non-article content (product pages, search results)
- Unusual layouts (some single-column designs)
- Heavily JavaScript-dependent content
Improving Extraction
For difficult sites:
- Adjust thresholds: Lower min_score or char_threshold
- Site configuration: Provide XPath rules
- Manual curation: Use XPath or CSS selectors directly
See Configuration for options.
Comparison to Alternatives
| Approach | Pros | Cons |
|---|---|---|
| Lectito | Works across many sites, no custom rules needed | May fail on unusual layouts |
| XPath | Precise, predictable | Requires custom rules per site |
| CSS Selectors | Simple, familiar | Brittle, breaks on layout changes |
| Machine Learning | Adaptable | Complex, requires training data |
Lectito strikes a balance: works well for most sites without custom rules, with site configuration as a fallback.
Performance Considerations
- Parsing: HTML parsing is fast but not instant
- Scoring: Traverses entire DOM, O(n) complexity
- Fetching: Async for non-blocking I/O
- Memory: Entire document loaded into memory
For large-scale extraction, consider batching and concurrent fetches.
Next Steps
- Scoring Algorithm - Detailed scoring explanation
- Configuration - Customizing extraction
- Basic Usage - Using the API
Scoring Algorithm
Detailed explanation of how Lectito scores HTML elements to identify article content.
Overview
The scoring algorithm assigns a numeric score to each HTML element, indicating how likely it is to contain the main article content. Higher scores indicate better content candidates.
Score Formula
The final score for each element is calculated as:
element_score = (base_tag_score
+ class_id_weight
+ content_density_score
+ container_bonus)
× (1 - link_density)
Let's break down each component.
Base Tag Score
Different HTML tags have different inherent scores, reflecting their likelihood of containing content:
| Tag | Score | Rationale |
|---|---|---|
| <article> | +10 | Semantic article container |
| <section> | +8 | Logical content section |
| <div> | +5 | Generic container, often used for content |
| <blockquote> | +3 | Quoted content |
| <pre> | 0 | Preformatted text, neutral |
| <td> | +3 | Table cell |
| <address> | -3 | Contact info, unlikely to be main content |
| <ol>/<ul> | -3 | Lists and metadata |
| <li> | -3 | List item |
| <header> | -5 | Header, not main content |
| <footer> | -5 | Footer, not main content |
| <nav> | -5 | Navigation |
| <th> | -5 | Table header |
| <h1>-<h6> | -5 | Headings, not content themselves |
| <form> | -3 | Forms, not content |
| <main> | 0 | Container scored via bonus |
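A minimal sketch of this lookup, with values copied from the table (the real implementation may structure it differently):
// Base score by tag name, per the table above
fn base_tag_score(tag: &str) -> i32 {
    match tag {
        "article" => 10,
        "section" => 8,
        "div" => 5,
        "blockquote" | "td" => 3,
        "pre" | "main" => 0,
        "address" | "ol" | "ul" | "li" | "form" => -3,
        "header" | "footer" | "nav" | "th" => -5,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" => -5,
        _ => 0,
    }
}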
Class/ID Weight
Class and ID attributes strongly indicate element purpose:
Positive Patterns
These patterns indicate content elements:
(?i)(article|body|content|entry|hentry|h-entry|main|page|post|text|blog|story)
Weight: +25 points
Examples:
class="article-content"id="main-content"class="post-body"
Negative Patterns
These patterns indicate non-content elements:
(?i)(banner|breadcrumbs?|combx|comment|community|disqus|extra|foot|header|menu|related|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup)
Weight: -25 points
Examples:
class="sidebar"id="footer"class="navigation"
Content Density Score
Rewards elements with substantial text content:
Character Density
1 point per 100 characters, maximum 3 points.
char_score = (text_length / 100).min(3)
Punctuation Density
1 point per 5 commas/periods, maximum 3 points.
punct_score = (comma_count / 5).min(3)
Total content density:
content_density = char_score + punct_score
Rationale: Real article content has more text and punctuation than navigation or metadata.
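A direct translation of the two formulas (a sketch; integer division and comma counting follow the pseudocode above):
// Up to 3 points for text length, up to 3 for comma density
fn content_density(text: &str) -> u32 {
    let char_score = (text.chars().count() as u32 / 100).min(3);
    let punct_score = (text.matches(',').count() as u32 / 5).min(3);
    char_score + punct_score
}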
Container Bonus
Elements that are typical article containers receive a small boost:
- <article>, <section>, <main>: +2
This bias helps select semantic containers when scores are close.
Link Density Penalty
Penalizes elements with too many links:
link_density = (length of all <a> tag text) / (total text length)
final_score = raw_score × (1 - link_density)
Examples:
- Text "Click here": link density = 100% (10/10)
- Text "See the article for details": link density = 33% (7/21)
- Text "Article content with no links": link density = 0%
Rationale: Navigation menus, lists of links, and metadata have high link density. Real content has low link density.
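The penalty itself is a single multiplication. A sketch:
// Scale the raw score down by the fraction of linked text
fn apply_link_density(raw_score: f64, link_text_len: usize, total_text_len: usize) -> f64 {
    if total_text_len == 0 {
        return 0.0;
    }
    let link_density = link_text_len as f64 / total_text_len as f64;
    raw_score * (1.0 - link_density)
}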
Complete Example
Consider this HTML:
<div class="article-content">
<h1>Article Title</h1>
<p>
This is a substantial paragraph with plenty of text, including multiple
sentences, and commas, to demonstrate how content density scoring works.
</p>
<p>
Another paragraph with even more text, details, and information to
increase the character count.
</p>
</div>
Step-by-Step Scoring
1. Base Tag Score
<div>: +5
2. Class/ID Weight
class="article-content" contains "article" and "content": +25
3. Content Density
- Text length: ~220 characters
- Character score: min(220/100, 3) = 2
- Commas: 5 (count them in the two paragraphs above)
- Punctuation score: min(5/5, 3) = 1
- Total: 2 + 1 = 3 points
4. Link Density
No links: link density = 0
5. Final Score
(5 + 25 + 3) × (1 - 0) = 33
This element would score 33, well above the default threshold of 20.
Thresholds
Two thresholds determine if content is readable:
Score Threshold
Minimum score for extraction (default: 20.0).
If no element scores above this, extraction fails with LectitoError::NotReaderable.
Character Threshold
Minimum character count (default: 500).
Even with high score, content must have enough text to be meaningful.
Scoring Edge Cases
Empty Elements
Elements with no text receive a score of 0 and are ignored.
Nested Elements
Both parent and child elements are scored. The highest-scoring element at any level is selected.
Sibling Elements
Adjacent elements with similar scores may be grouped as part of the same article.
Negative Scores
Elements can have negative scores (e.g., navigation). They're excluded from selection.
Configuration Affecting Scoring
Adjust scoring behavior with ReadabilityConfig:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score(25.0) // Higher threshold
.char_threshold(1000) // Require more content
.min_content_length(200) // Longer minimum text
.build();
See Configuration for details.
Practical Implications
Why Articles Score Well
- Semantic tags (<article>)
- Descriptive classes (article-content)
- Substantial text (high character count)
- Punctuation (commas, periods)
- Few links (low link density)
Why Navigation Scores Poorly
- Generic or negative classes (sidebar, navigation)
- Little text (just link labels)
- Many links (high link density)
- Short content (fails character threshold)
Why Comments May Score Poorly
- Often inside containers with negative class names (comments)
- Short individual comments
- Many links (usernames, replies)
- Variable quality
Site Configuration
When automatic scoring fails, provide XPath rules:
# example.com.toml
[[fingerprints]]
pattern = "example.com"
[[fingerprints.extract]]
title = "//h1[@class='article-title']"
content = "//div[@class='article-body']"
See Configuration for details.
References
- Original Readability.js: Mozilla Readability
- Algorithm inspiration: Arc90 Readability
Next Steps
- How It Works - Overall extraction pipeline
- Configuration - Customizing behavior
- Basic Usage - Using the API
API Overview
Complete reference for the Lectito Rust library API.
Core Types
Article
The main result type containing extracted content and metadata.
pub struct Article {
/// Extracted metadata
pub metadata: Metadata,
/// Cleaned HTML content
pub content: String,
/// Plain text content
pub text_content: String,
/// Number of words in content
pub word_count: usize,
/// Final readability score
pub readability_score: f64,
}
Methods:
- to_markdown() -> Result<String> - Convert to Markdown with frontmatter
- to_json() -> Result<String> - Convert to JSON
- to_text() -> String - Get plain text
Metadata
Extracted article metadata.
pub struct Metadata {
/// Article title
pub title: Option<String>,
/// Author name
pub author: Option<String>,
/// Publication date
pub published_date: Option<String>,
/// Article excerpt/description
pub excerpt: Option<String>,
/// Content language
pub language: Option<String>,
}
LectitoError
Error type for all Lectito operations.
pub enum LectitoError {
/// Content not readable: score below threshold
NotReaderable { score: f64, threshold: f64 },
/// Invalid URL provided
InvalidUrl(String),
/// HTTP request timeout
Timeout { timeout: u64 },
/// HTTP error
HttpError(reqwest::Error),
/// HTML parsing error
HtmlParseError(String),
/// IO error
IoError(std::io::Error),
}
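Example (a sketch of matching on the variants declared above):
use lectito_core::{parse, LectitoError};
match parse(html) {
    Ok(article) => println!("Title: {:?}", article.metadata.title),
    Err(LectitoError::NotReaderable { score, threshold }) => {
        eprintln!("not readable: score {} < threshold {}", score, threshold);
    }
    Err(e) => eprintln!("error: {}", e),
}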
Result
Type alias for Result with LectitoError.
pub type Result<T> = std::result::Result<T, LectitoError>;
Configuration Types
ReadabilityConfig
Main configuration for content extraction.
pub struct ReadabilityConfig {
/// Minimum readability score (default: 20.0)
pub min_score: f64,
/// Minimum character count (default: 500)
pub char_threshold: usize,
/// Preserve images in output (default: true)
pub preserve_images: bool,
/// Minimum content length (default: 140)
pub min_content_length: usize,
/// Minimum score threshold (default: 20.0)
pub min_score_threshold: f64,
}
Methods:
- builder() -> ReadabilityConfigBuilder - Create a builder
- default() -> Self - Default configuration
ReadabilityConfigBuilder
Builder for ReadabilityConfig.
pub struct ReadabilityConfigBuilder {
// ...
}
Methods:
- min_score(f64) -> Self - Set minimum score
- char_threshold(usize) -> Self - Set character threshold
- preserve_images(bool) -> Self - Set image preservation
- min_content_length(usize) -> Self - Set minimum content length
- min_score_threshold(f64) -> Self - Set score threshold
- build() -> ReadabilityConfig - Build configuration
FetchConfig
Configuration for HTTP fetching.
pub struct FetchConfig {
/// Request timeout in seconds (default: 30)
pub timeout: u64,
/// User-Agent header (default: "Lectito/...")
pub user_agent: String,
}
Trait:
impl Default for FetchConfig
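Example (override one field, keep the rest of the defaults):
use lectito_core::FetchConfig;
let config = FetchConfig {
    timeout: 60,
    ..Default::default()
};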
Main API Functions
parse
Parse HTML string and extract article.
pub fn parse(html: &str) -> Result<Article>
Example:
use lectito_core::parse;
let article = parse("<html>...</html>")?;
parse_with_url
Parse HTML with URL context for relative link resolution.
pub fn parse_with_url(html: &str, url: &str) -> Result<Article>
Example:
use lectito_core::parse_with_url;
let article = parse_with_url(html, "https://example.com/article")?;
fetch_and_parse
Fetch URL and extract article.
pub async fn fetch_and_parse(url: &str) -> Result<Article>
Feature: fetch
Example:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let article = fetch_and_parse("https://example.com/article").await?;
Ok(())
}
fetch_and_parse_with_config
Fetch URL and extract with custom configuration.
pub async fn fetch_and_parse_with_config(
url: &str,
fetch_config: &FetchConfig,
readability_config: &ReadabilityConfig
) -> Result<Article>
Feature: fetch
Example:
use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let fetch_config = FetchConfig {
timeout: 60,
..Default::default()
};
let read_config = ReadabilityConfig::builder()
.min_score(25.0)
.build();
let article = fetch_and_parse_with_config(
"https://example.com/article",
&fetch_config,
&read_config
).await?;
Ok(())
}
is_probably_readable
Check if content likely meets readability thresholds.
pub fn is_probably_readable(html: &str) -> bool
Example:
use lectito_core::is_probably_readable;
if is_probably_readable(html) {
println!("Content is readable");
}
Readability Type
Main API for configured extraction.
pub struct Readability {
config: ReadabilityConfig,
}
Methods:
- new() -> Self - Create with default config
- with_config(ReadabilityConfig) -> Self - Create with custom config
- parse(&str) -> Result<Article> - Parse HTML
Example:
use lectito_core::{Readability, ReadabilityConfig};
let config = ReadabilityConfig::builder()
.min_score(25.0)
.build();
let reader = Readability::with_config(config);
let article = reader.parse(html)?;
Fetch Functions
fetch_url
Fetch HTML from URL.
pub async fn fetch_url(url: &str, config: &FetchConfig) -> Result<String>
Feature: fetch
fetch_file
Read HTML from file.
pub fn fetch_file(path: &str) -> Result<String>
fetch_stdin
Read HTML from stdin.
pub fn fetch_stdin() -> Result<String>
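Example (a sketch combining the helpers above to mimic the CLI's input handling):
use lectito_core::{fetch_file, fetch_stdin, Result};
// "-" selects stdin, anything else is treated as a file path
fn read_input(arg: &str) -> Result<String> {
    if arg == "-" {
        fetch_stdin()
    } else {
        fetch_file(arg)
    }
}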
DOM Types
Document
HTML document wrapper for parsing and selection.
pub struct Document {
// ...
}
Methods:
- parse(&str) -> Result<Self> - Parse HTML
- select(&str) -> Result<Vec<Element>> - Select elements with a CSS selector
Element
DOM element wrapper.
pub struct Element<'a> {
// ...
}
Methods:
- text() -> String - Extract text content
- html() -> String - Get inner HTML
Module Organization
.
├── article # Article and Metadata types
├── error # LectitoError and Result
├── fetch # HTTP and file fetching
├── formatters # Output formatters
├── metadata # Metadata extraction
├── parse # Document and Element types
├── readability # Main API (parse, fetch_and_parse)
└── scoring # Scoring algorithm
Feature Flags
| Feature | Default | Enables |
|---|---|---|
| fetch | Yes | URL fetching with reqwest |
| markdown | Yes | Markdown output |
| siteconfig | Yes | Site configuration support |
Re-exports
The crate re-exports commonly used types at the root:
// Core types
pub use article::{Article, OutputFormat};
pub use error::{LectitoError, Result};
// Configuration
pub use fetch::FetchConfig;
pub use readability::{
Readability, ReadabilityConfig, ReadabilityConfigBuilder,
LectitoConfig, LectitoConfigBuilder
};
// Functions
pub use readability::{
parse, parse_with_url, fetch_and_parse, fetch_and_parse_with_config,
is_probably_readable
};
// Fetching
pub use fetch::{fetch_url, fetch_file, fetch_stdin};
// Formatters
pub use formatters::{
MarkdownFormatter, TextFormatter, JsonFormatter,
convert_to_markdown, convert_to_text, convert_to_json
};
Complete Example
use lectito_core::{
Readability, ReadabilityConfig, FetchConfig,
fetch_and_parse_with_config
};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Configure
let fetch_config = FetchConfig {
timeout: 60,
user_agent: "MyBot/1.0".to_string(),
};
let read_config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(1000)
.build();
// Fetch and parse
let article = fetch_and_parse_with_config(
"https://example.com/article",
&fetch_config,
&read_config
).await?;
// Access results
println!("Title: {:?}", article.metadata.title);
println!("Word count: {}", article.word_count);
// Convert to format
let markdown = article.to_markdown()?;
println!("{}", markdown);
Ok(())
}
Further Documentation
- docs.rs/lectito - Full API documentation with rustdoc
- GitHub Repository - Source code and examples
- Basic Usage - Usage examples
- Configuration - Configuration options
Next Steps
- Getting Started - Installation and quick start
- Library Guide - In-depth usage documentation