Lectito
A Rust library and CLI for extracting readable content from web pages.
What is Lectito?
Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. It identifies and extracts the main article content from web pages, removing navigation, sidebars, advertisements, and other clutter.
Features
- Content Extraction: Automatically identifies the main article content
- Metadata Extraction: Pulls title, author, date, excerpt, site name, and language
- Output Formats: HTML, Markdown, plain text, and JSON
- URL Fetching: Built-in async HTTP client with timeout support
- CLI: Simple command-line interface for quick extractions
- Site Configuration: Optional XPath-based extraction rules for difficult sites
Use Cases
- Web Scraping: Extract clean article content from web pages
- AI Agents: Feed readable text to language models
- Content Analysis: Analyze article text without HTML noise
- Archival: Save clean copies of web content
- CLI: Quick article extraction from the terminal
Quick Links
- Installation: See the Installation Guide
- CLI Usage: See the CLI Usage Guide
- Library Usage: See the Basic Usage Guide
- API Reference: See the API Overview
Quick Start
CLI
# Install
cargo install lectito-cli
# Extract from URL
lectito https://example.com/article
# Extract from local file
lectito article.html
# Pipe from stdin
curl https://example.com | lectito -
Library
use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<html><body><article><h1>Title</h1><p>Content</p></article></body></html>"#;
    let article = parse(html)?;
    println!("Title: {:?}", article.metadata.title);
    println!("Content: {}", article.to_markdown()?);
    Ok(())
}
About the Name
"Lectito" is derived from the Latin legere (to read) and lectio (a reading or selection).
Lectito aims to select and present readable content from the chaos of the modern web.
Installation
Lectito provides both a CLI tool and a Rust library. Install whichever fits your needs.
CLI Installation
From crates.io
The easiest way to install the CLI is via cargo:
cargo install lectito-cli
This installs the lectito binary in your cargo bin directory (typically ~/.cargo/bin).
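If your shell cannot find lectito afterwards, the cargo bin directory is probably not on your PATH. A minimal sketch, assuming the default cargo home:

```shell
# Add cargo's default bin directory to PATH for the current session.
export PATH="$HOME/.cargo/bin:$PATH"
```

Add the same line to your shell profile (e.g. ~/.bashrc or ~/.zshrc) to make it permanent.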
From Source
# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito
# Build and install
cargo install --path crates/cli
Pre-built Binaries
Pre-built binaries are available on the GitHub Releases page for Linux, macOS, and Windows.
Download the appropriate binary for your platform and place it in your PATH.
Verify Installation
lectito --version
You should see version information printed.
Library Installation
Add to your Cargo.toml:
[dependencies]
lectito-core = "0.1"
Then run cargo build to fetch and compile the dependency.
Feature Flags
The library has several optional features:
[dependencies]
lectito-core = { version = "0.1", features = ["fetch", "markdown"] }
| Feature | Default | Description |
|---|---|---|
| fetch | Yes | Enable URL fetching with reqwest |
| markdown | Yes | Enable Markdown output format |
| siteconfig | Yes | Enable site configuration support |
If you don't need URL fetching, disable the default features and opt back into only what you need:
[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }
Development Build
To build from source for development:
# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito
# Build the workspace
cargo build --release
# The CLI binary will be at target/release/lectito
Next Steps
- Quick Start Guide - Get started with basic usage
- CLI Usage - Learn CLI commands and options
- Library Guide - Use Lectito as a library
Quick Start
Get started with Lectito in minutes.
CLI Quick Start
Basic Usage
Extract content from a URL:
lectito https://example.com/article
Extract from a local file:
lectito article.html
Extract from stdin:
curl https://example.com | lectito -
Save to File
lectito https://example.com/article -o article.md
Change Output Format
# JSON output
lectito https://example.com/article --format json
# Plain text output
lectito https://example.com/article --format text
Set Timeout
For slow-loading sites:
lectito https://example.com/article --timeout 60
Library Quick Start
Add Dependency
Add to Cargo.toml:
[dependencies]
lectito-core = "0.1"
Parse HTML String
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = r#"
<!DOCTYPE html>
<html>
<head><title>My Article</title></head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the article content with plenty of text.</p>
</article>
</body>
</html>
"#;
let article = parse(html)?;
println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);
Ok(())
}
Fetch and Parse URL
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let article = fetch_and_parse("https://example.com/article").await?;
println!("Title: {:?}", article.metadata.title);
println!("Word count: {}", article.word_count);
Ok(())
}
Convert to Different Formats
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<h1>Title</h1><p>Content here</p>";
let article = parse(html)?;
// Markdown with frontmatter
let markdown = article.to_markdown()?;
println!("{}", markdown);
// Plain text
let text = article.to_text();
println!("{}", text);
// Structured JSON
let json = article.to_json()?;
println!("{}", json);
Ok(())
}
Common Patterns
Handle Errors
use lectito_core::{parse, LectitoError};
match parse("<html>...</html>") {
Ok(article) => println!("Title: {:?}", article.metadata.title),
Err(LectitoError::NotReadable { score, threshold }) => {
eprintln!("Content not readable: score {} < threshold {}", score, threshold);
}
Err(e) => eprintln!("Error: {}", e),
}
Configure Extraction
use lectito_core::{Readability, ReadabilityConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(500)
.preserve_images(true)
.build();
let reader = Readability::with_config(config);
let article = reader.parse("<html>...</html>")?;
Ok(())
}
What's Next?
- CLI Usage - Full CLI command reference
- Library Guide - In-depth library documentation
- Configuration - Advanced configuration options
- Concepts - How the algorithm works
CLI Usage
Reference for the lectito command-line tool.
Basic Syntax
lectito [OPTIONS] [INPUT]
INPUT can be:
- a URL starting with http:// or https://
- a local file path
- a dash (-) to read from stdin
Common Examples
Extract from a URL
lectito https://example.com/article
Extract from a File
lectito article.html
Read from stdin
curl https://example.com | lectito -
Output Options
-o, --output <FILE>
Write output to a file instead of stdout.
lectito https://example.com/article -o article.md
-f, --format <FORMAT>
Output format. Available values:
| Format | Description |
|---|---|
| markdown or md | Markdown output |
| html | Cleaned HTML |
| text or txt | Plain text |
| json | Structured JSON |
lectito https://example.com/article --format text
--json
Force structured JSON output regardless of --format.
lectito https://example.com/article --json
--references
Include a reference table in Markdown output or a references array in JSON output.
lectito https://example.com/article --references
--frontmatter
Include TOML frontmatter in Markdown output.
lectito https://example.com/article --frontmatter
-m, --metadata-only
Output metadata only.
lectito https://example.com/article --metadata-only
--metadata-format <FORMAT>
Metadata output format for --metadata-only. Supported values: toml, json.
lectito https://example.com/article --metadata-only --metadata-format json
Extraction Options
--timeout <SECS>
HTTP timeout in seconds. Default: 30.
--user-agent <UA>
Custom User-Agent for HTTP requests.
-c, --config-dir <DIR>
Directory containing site configuration files.
--char-threshold <NUM>
Minimum character threshold for content candidates. Default: 500.
--max-elements <NUM>
Maximum number of top candidates to track. Default: 5.
--no-images
Strip images from output.
-v, --verbose
Enable verbose logging and timing output.
Shell Completions
--completions <SHELL>
Generate a completion script for bash, zsh, fish, or powershell.
lectito --completions zsh
Help and Version
lectito --help
lectito --version
Output Shapes
Markdown
With --frontmatter, Markdown output starts with TOML frontmatter and then the extracted body.
JSON
--format json and --json emit structured output with:
- metadata
- content.markdown
- content.text
- content.html
- optional references
Metadata-Only
--metadata-only emits either TOML or JSON metadata, without the extracted body.
Common Workflows
Save a Markdown export
lectito https://example.com/article --frontmatter --references -o article.md
Get JSON for downstream processing
lectito https://example.com/article --json | jq '.metadata.title'
Extract text without images
lectito https://example.com/article --format text --no-images
Basic Usage
Learn the fundamentals of using Lectito as a library.
Simple Parsing
The easiest way to extract content is with the parse function:
use lectito_core::{parse, Article};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = r#"
<!DOCTYPE html>
<html>
<head><title>My Article</title></head>
<body>
<article>
<h1>Article Title</h1>
<p>This is the article content.</p>
</article>
</body>
</html>
"#;
let article: Article = parse(html)?;
println!("Title: {:?}", article.metadata.title);
println!("Confidence: {:.2}", article.confidence);
println!("Content: {}", article.to_markdown()?);
Ok(())
}
Fetching and Parsing
For URLs, use the fetch_and_parse function:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let url = "https://example.com/article";
let article = fetch_and_parse(url).await?;
println!("Title: {:?}", article.metadata.title);
println!("Author: {:?}", article.metadata.author);
println!("Word count: {}", article.word_count);
Ok(())
}
This helper requires the fetch feature.
Working with the Article
The Article struct contains the extracted content, metadata, and derived metrics.
Metadata
use lectito_core::parse;
let html = "<html>...</html>";
let article = parse(html)?;
if let Some(title) = article.metadata.title {
println!("Title: {}", title);
}
if let Some(author) = article.metadata.author {
println!("Author: {}", author);
}
if let Some(date) = article.metadata.date {
println!("Published: {}", date);
}
if let Some(excerpt) = article.metadata.excerpt {
println!("Excerpt: {}", excerpt);
}
Content Access
use lectito_core::parse;
let html = "<html>...</html>";
let article = parse(html)?;
let html_content = &article.content;
let text = article.to_text();
let markdown = article.to_markdown()?;
let json = article.to_json()?;
to_markdown() requires the markdown feature.
Readability API
For more control, use the Readability API:
use lectito_core::{Readability, ReadabilityConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let reader = Readability::new();
let article = reader.parse(html)?;
let config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(500)
.nb_top_candidates(8)
.build();
let reader = Readability::with_config(config);
let article = reader.parse(html)?;
Ok(())
}
Error Handling
Lectito returns Result<T, LectitoError>. Handle errors appropriately:
use lectito_core::{parse, LectitoError};
fn extract_article(html: &str) -> Result<String, String> {
match parse(html) {
Ok(article) => article.to_markdown().map_err(|e| format!("Markdown conversion failed: {}", e)),
Err(LectitoError::NotReadable { score, threshold }) => {
Err(format!("Content not readable: score {} < threshold {}", score, threshold))
}
Err(LectitoError::InvalidUrl(msg)) => {
Err(format!("Invalid URL: {}", msg))
}
Err(e) => Err(format!("Extraction failed: {}", e)),
}
}
Common Patterns
Parse with URL Context
When you have the URL, provide it for better relative link resolution:
use lectito_core::{parse_with_url, Article};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let url = "https://example.com/article";
let article: Article = parse_with_url(html, url)?;
assert_eq!(article.source_url.as_deref(), Some(url));
Ok(())
}
Check if Content is Probably Readable
For a quick pre-check:
use lectito_core::is_probably_readable;
fn main() {
let html = "<html>...</html>";
if is_probably_readable(html) {
println!("Content looks readable");
} else {
println!("Content may not be readable");
}
}
Working with Documents
For lower-level DOM manipulation:
use lectito_core::{Document, Element};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html><body><p>Hello</p></body></html>";
let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("p")?;
for element in elements {
println!("Text: {}", element.text());
}
Ok(())
}
Configuration
Customize Lectito's extraction behavior with configuration options.
ReadabilityConfig
The ReadabilityConfig struct controls extraction parameters. Use the builder pattern:
use lectito_core::{Readability, ReadabilityConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let config = ReadabilityConfig::builder()
.min_score(25.0)
.char_threshold(500)
.nb_top_candidates(8)
.preserve_images(true)
.preserve_video_embeds(true)
.build();
let reader = Readability::with_config(config);
let article = reader.parse("<html>...</html>")?;
Ok(())
}
Readability Options
| Field | Default | Description |
|---|---|---|
| min_score | 20.0 | Minimum score required for extraction |
| char_threshold | 500 | Minimum character count for strong candidates |
| nb_top_candidates | 5 | Number of top candidates to keep during scoring |
| max_elems_to_parse | 0 | Maximum number of elements to score; 0 means unlimited |
| remove_unlikely | true | Remove obvious chrome before scoring |
| keep_classes | false | Preserve class attributes in output HTML |
| preserve_images | true | Keep images in extracted content |
| preserve_video_embeds | true | Keep supported video embeds |
Strict Extraction
For high-quality content only:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score(30.0)
.char_threshold(1000)
.build();
Lenient Extraction
For short pages or difficult layouts:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.min_score(10.0)
.char_threshold(200)
.remove_unlikely(false)
.build();
Text-Only Extraction
Remove images and embeds:
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::builder()
.preserve_images(false)
.preserve_video_embeds(false)
.build();
FetchConfig
Configure HTTP fetching behavior:
use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};
use std::collections::HashMap;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let fetch_config = FetchConfig {
timeout: 60,
user_agent: "MyBot/1.0".to_string(),
headers: HashMap::new(),
};
let read_config = ReadabilityConfig::builder()
.min_score(25.0)
.build();
let article = fetch_and_parse_with_config(
"https://example.com/article",
&read_config,
&fetch_config,
).await?;
Ok(())
}
Fetch Options
| Field | Type | Default | Description |
|---|---|---|---|
| timeout | u64 | 30 | Request timeout in seconds |
| user_agent | String | Browser-like Lectito UA | User-Agent header value |
| headers | HashMap<String, String> | empty | Extra request headers |
Default Values
use lectito_core::ReadabilityConfig;
let config = ReadabilityConfig::default();
assert_eq!(config.min_score, 20.0);
assert_eq!(config.char_threshold, 500);
assert_eq!(config.nb_top_candidates, 5);
assert_eq!(config.max_elems_to_parse, 0);
assert!(config.remove_unlikely);
assert!(!config.keep_classes);
assert!(config.preserve_images);
assert!(config.preserve_video_embeds);
Site Configuration
For sites that require custom extraction rules, use the site configuration feature:
[dependencies]
lectito-core = { version = "0.1", features = ["siteconfig"] }
Site configuration uses the FTR-style ruleset and the ConfigLoader APIs to apply per-site extraction rules.
Next Steps
- Async vs Sync - Understanding async APIs
- Output Formats - Detailed format documentation
- Scoring Algorithm - How scores are calculated
Async vs Sync
Understanding Lectito's async and synchronous APIs.
Overview
Lectito provides both synchronous and asynchronous APIs:
| Function | Async/Sync | Use Case |
|---|---|---|
| parse() | Sync | Parse HTML from string |
| parse_with_url() | Sync | Parse with URL context |
| fetch_and_parse() | Async | Fetch from URL then parse |
| fetch_url() | Async | Fetch HTML from URL |
The async fetch helpers require the fetch feature.
When to Use Each
Use Sync APIs When
- You already have the HTML as a string
- You're using your own HTTP client
- Performance is not critical
- You're integrating into synchronous code
Use Async APIs When
- You need to fetch from URLs
- You're already using async/await
- You want concurrent fetches
- Performance matters for network operations
Synchronous Parsing
Parse HTML that you already have:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
Ok(())
}
Asynchronous Fetching
Fetch and parse in one operation:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let url = "https://example.com/article";
let article = fetch_and_parse(url).await?;
Ok(())
}
Manual Fetch and Parse
Use your own HTTP client:
use lectito_core::parse;
use reqwest::Client;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let response = client.get("https://example.com/article")
.send()
.await?;
let html = response.text().await?;
let article = parse(&html)?;
Ok(())
}
Concurrent Fetches
Fetch multiple articles concurrently:
use lectito_core::fetch_and_parse;
use futures::future::join_all;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let urls = vec![
"https://example.com/article1",
"https://example.com/article2",
"https://example.com/article3",
];
let futures: Vec<_> = urls.into_iter()
.map(|url| fetch_and_parse(url))
.collect();
let articles = join_all(futures).await;
for article in articles {
match article {
Ok(a) => println!("Got: {:?}", a.metadata.title),
Err(e) => eprintln!("Error: {}", e),
}
}
Ok(())
}
Batch Processing
Process URLs with concurrency limits:
use lectito_core::fetch_and_parse;
use futures::stream::{self, StreamExt};
async fn process_urls(urls: Vec<String>) -> Result<(), Box<dyn std::error::Error>> {
    // Map each URL to a fetch future and keep at most 5 in flight.
    let mut results = stream::iter(urls)
        .map(|url| async move { fetch_and_parse(&url).await })
        .buffer_unordered(5);
    while let Some(article) = results.next().await {
        println!("Processed: {:?}", article?.metadata.title);
    }
    Ok(())
}
Sync Code in Async Context
If you need to use sync parsing in async code:
use lectito_core::parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Fetch with your async HTTP client
let html = fetch_html().await?;
// Parse is sync, but that's fine in async context
let article = parse(&html)?;
Ok(())
}
async fn fetch_html() -> Result<String, Box<dyn std::error::Error>> {
// Your async fetching logic
Ok(String::from("<html>...</html>"))
}
Performance Considerations
Parsing (Sync)
Parsing is CPU-bound and runs synchronously:
use lectito_core::parse;
use std::time::Instant;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let start = Instant::now();
let article = parse(html)?;
let duration = start.elapsed();
println!("Parsed in {:?}", duration);
Ok(())
}
Fetching (Async)
Fetching is I/O-bound and benefits from async:
use lectito_core::fetch_and_parse;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let start = std::time::Instant::now();
let article = fetch_and_parse("https://example.com/article").await?;
let duration = start.elapsed();
println!("Fetched and parsed in {:?}", duration);
Ok(())
}
Choosing the Right Approach
| Scenario | Recommended Approach |
|---|---|
| Have HTML string | parse() (sync) |
| Need to fetch URL | fetch_and_parse() (async) |
| Custom HTTP client | Your client + parse() (sync) |
| Batch URL processing | fetch_and_parse() with concurrent futures |
| CLI tool | Depends on your runtime setup |
| Web server | fetch_and_parse() (async) for throughput |
Output Formats
Work with different output formats: Markdown, JSON, text, and HTML.
Overview
The Article struct provides several ways to render extracted content:
| Method | Format | Requires Feature |
|---|---|---|
| to_markdown() | Markdown | markdown |
| to_markdown_with_config() | Markdown with custom options | markdown |
| to_json() | Serialized Article JSON | Always available |
| to_text() | Plain text | Always available |
| content field | Cleaned HTML | Always available |
Markdown
Convert an article to Markdown:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let markdown = article.to_markdown()?;
println!("{}", markdown);
Ok(())
}
Markdown Configuration
Use MarkdownConfig for frontmatter, references, and image handling:
use lectito_core::{parse, MarkdownConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let config = MarkdownConfig {
include_frontmatter: true,
include_references: true,
strip_images: false,
include_title_heading: true,
};
let markdown = article.to_markdown_with_config(&config)?;
println!("{}", markdown);
Ok(())
}
Frontmatter Fields
When include_frontmatter is enabled, Lectito can emit fields such as:
+++
title = "Article Title"
author = "John Doe"
date = "2025-01-17"
site = "Example"
image = "https://example.com/image.jpg"
favicon = "https://example.com/favicon.ico"
excerpt = "A brief description of the article"
word_count = 500
reading_time_minutes = 2.5
+++
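The word_count and reading_time_minutes values in this sample are consistent with a 200 words-per-minute convention (500 / 200 = 2.5). A quick sketch of that arithmetic; the 200 wpm rate is an assumption inferred from the sample numbers, not a documented constant:

```rust
// Reading time in minutes at a fixed words-per-minute rate.
fn reading_time_minutes(word_count: usize, wpm: f64) -> f64 {
    word_count as f64 / wpm
}

fn main() {
    // Matches the frontmatter sample: 500 words at 200 wpm is 2.5 minutes.
    assert_eq!(reading_time_minutes(500, 200.0), 2.5);
}
```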
JSON
Article::to_json() returns a serialized view of the article itself:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let json = article.to_json()?;
println!("{}", json);
Ok(())
}
JSON Structure
{
"content": "<div>Cleaned HTML content...</div>",
"text_content": "Plain text content...",
"metadata": {
"title": "Article Title",
"author": "John Doe",
"date": "2025-01-17",
"excerpt": "A brief description",
"site_name": "Example",
"language": "en"
},
"length": 1234,
"word_count": 500,
"reading_time": 2.5,
"source_url": "https://example.com/article",
"confidence": 0.92
}
Plain Text
Extract just the text content:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let text = article.to_text();
println!("{}", text);
Ok(())
}
Plain text preserves the readable text content without HTML tags.
HTML
Access the cleaned HTML directly:
use lectito_core::parse;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = "<html>...</html>";
let article = parse(html)?;
let cleaned_html = &article.content;
println!("{}", cleaned_html);
Ok(())
}
The cleaned HTML:
- removes clutter such as navigation and ads
- keeps the main content structure
- preserves images when preserve_images is enabled
- preserves supported embeds when preserve_video_embeds is enabled
Choosing a Format
| Format | Use Case |
|---|---|
| Markdown | Blog posts, docs, static publishing |
| JSON | APIs, storage, downstream processing |
| Text | Analysis, indexing, search |
| HTML | Web display or further HTML processing |
How It Works
Understanding the Lectito content extraction pipeline.
Overview
Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.
Extraction Pipeline
The extraction process still follows the same core shape:
HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article
1. Preprocessing
Clean the HTML to improve scoring accuracy:
- Remove unlikely content: scripts, styles, iframes, and hidden nodes
- Strip elements with unlikely class/ID patterns
- Preserve structure: maintain HTML hierarchy for accurate scoring
Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.
2. Scoring
Score each element based on content characteristics:
- Tag score: Different HTML tags have different base scores
- Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
- Content density: Length and punctuation indicate content quality
- Link density: A high proportion of links suggests navigation or metadata, not content
Why: Scoring identifies which elements are most likely to contain the main article content.
3. Selection
Select the highest-scoring element as the article candidate:
- Find element with highest score
- Bias toward semantic containers when scores are close
- Check if score meets the minimum threshold
- Check if content length meets the minimum threshold
- Return an error if content doesn't meet thresholds
Why: Selection ensures we extract actual article content, not navigation or ads.
4. Post-processing
Clean up the selected content:
- Include sibling elements: adjacent content blocks and shared-parent headers
- Remove remaining clutter: ads, comments, social widgets
- Clean up whitespace: normalize spacing and formatting
- Preserve structure: maintain headings, paragraphs, and lists
Why: Post-processing improves the quality of extracted content and includes related elements.
Branch Additions
The current branch layers a few extra passes around that core flow:
- Retry strategy: if the first pass comes back short, Lectito retries with progressively looser settings before giving up
- Site-specific extraction: built-in extractors and optional site configs can override the generic scorer for difficult sites
- Confidence and diagnostics: successful extractions carry a confidence score and can include pass-by-pass diagnostics
Those additions sit around the original pipeline. They do not replace it.
Data Flow
Input HTML
↓
parse_to_document()
↓
preprocess_html() → Cleaned HTML
↓
build_dom_tree() → DOM Tree
↓
calculate_score() → Scored Elements
↓
extract_content() → Selected Element
↓
postprocess_html() → Cleaned Content
↓
extract_metadata() → Metadata
↓
Article
Key Components
Document and Element
The Document and Element types wrap the scraper crate's HTML parsing:
use lectito_core::{Document, Element};
let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;
These provide a convenient API for DOM manipulation and element traversal.
Scoring Algorithm
The scoring algorithm combines multiple factors:
element_score = (base_tag_score
                 + class_id_weight
                 + content_density_score)
                × (1 - link_density)
See Scoring Algorithm for details.
Metadata Extraction
Separate process extracts metadata from the HTML:
- Title: <h1>, <title>, or Open Graph tags
- Author: meta tags, bylines, schema.org
- Date: meta tags, time elements, schema.org
- Excerpt: meta description, first paragraph
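The fallback behavior in the list above can be sketched with plain Option chaining. pick_title below is hypothetical, not part of the lectito API; it only illustrates first-match-wins selection over the candidate sources:

```rust
// Hypothetical sketch: take the first available title candidate,
// trying <h1>, then <title>, then Open Graph.
fn pick_title(h1: Option<&str>, title_tag: Option<&str>, og_title: Option<&str>) -> Option<String> {
    h1.or(title_tag).or(og_title).map(|s| s.to_string())
}

fn main() {
    // The <h1> is missing, so the <title> candidate wins.
    let title = pick_title(None, Some("Doc Title"), Some("OG Title"));
    assert_eq!(title.as_deref(), Some("Doc Title"));
}
```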
Why This Approach
Content Over Structure
Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.
Heuristic-Based
The algorithm uses heuristics derived from analyzing a large number of article pages, which keeps it flexible across different site designs.
Fallback Mechanism
For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.
Limitations
Sites That May Fail
- Very short pages (tweets, status updates)
- Non-article content (product pages, search results)
- Unusual layouts
- Heavily JavaScript-dependent content
Improving Extraction
For difficult sites:
- Adjust thresholds such as min_score or char_threshold
- Provide a site configuration
- Add a site-specific extractor when generic scoring is not enough
See Configuration for options.
Comparison to Alternatives
| Approach | Pros | Cons |
|---|---|---|
| Lectito | Works across many sites, no custom rules needed | May fail on unusual layouts |
| Defuddle | Strong HTML and Markdown output, forgiving cleanup, richer metadata extraction | JavaScript and DOM-oriented, not a Rust-native library or CLI stack |
| XPath | Precise, predictable | Requires custom rules per site |
| CSS Selectors | Simple, familiar | Brittle, breaks on layout changes |
| Machine Learning | Adaptable | Complex, requires training data |
Lectito strikes a balance by working well for most sites without custom rules, with site configuration as a fallback.
Scoring Algorithm
Detailed explanation of how Lectito scores HTML elements to identify article content.
Overview
The scoring algorithm assigns a numeric score to each HTML element, indicating how likely it is to contain the main article content. Higher scores indicate better content candidates.
The exact weights evolve as the extractor improves, so treat this page as a guide to the scoring logic, not a frozen ABI.
Score Formula
At a high level, the score still looks like this:
element_score = (base_tag_score
+ class_id_weight
+ content_density_score
+ container_bonus)
× (1 - link_density)
Base Tag Score
Different HTML tags have different inherent scores, reflecting their likelihood of containing content:
| Tag | Typical Bias | Rationale |
|---|---|---|
| <article> | Positive | Semantic article container |
| <section> | Positive | Logical content section |
| <div> | Positive | Generic container, often used for content |
| <blockquote> | Slightly positive | Quoted content |
| <pre> | Neutral | Preformatted text |
| <header> | Negative | Header, not main content |
| <footer> | Negative | Footer, not main content |
| <nav> | Negative | Navigation |
| <form> | Negative | Forms, not content |
Class/ID Weight
Class and ID attributes strongly indicate element purpose.
Positive patterns bias the scorer toward article-like containers. Negative patterns bias it away from sidebars, menus, comments, related-story blocks, and similar chrome.
Examples:
- Positive: class="article-content", id="main-content"
- Negative: class="sidebar", id="footer", class="navigation"
Content Density Score
The scorer rewards elements with substantial text content:
- more readable text
- more punctuation and sentence structure
- less boilerplate
Real article content tends to have more continuous prose than navigation or metadata.
Link Density Penalty
Nodes packed with links are usually navigation, metadata, or related-story rails, not the article body.
link_density = linked_text / total_text
Higher link density reduces the final score.
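As a toy illustration (with made-up weights, not Lectito's actual constants), the penalty can be sketched like this:

```rust
// Toy version of the combined score: additive components scaled down
// by link density, as in the formula above. All weights are illustrative.
fn link_density(linked_chars: usize, total_chars: usize) -> f64 {
    if total_chars == 0 {
        return 0.0;
    }
    linked_chars as f64 / total_chars as f64
}

fn element_score(base: f64, class_id_weight: f64, density: f64, link_density: f64) -> f64 {
    (base + class_id_weight + density) * (1.0 - link_density)
}

fn main() {
    // A nav-like block: half its text is links, so its score is halved.
    let ld = link_density(500, 1000);
    assert_eq!(ld, 0.5);
    assert_eq!(element_score(5.0, -25.0, 10.0, ld), -5.0);

    // An article-like block with almost no links keeps nearly its full score.
    let ld = link_density(20, 2000);
    assert!(element_score(5.0, 25.0, 10.0, ld) > 39.0);
}
```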
Branch-Specific Heuristics
The current branch adds a few important refinements on top of the classic Readability-style score:
Entry-Point Bias
Common article containers such as article, main, and well-known content wrappers get an early structural advantage before raw text density decides the winner.
Sibling Aggregation
When several nearby candidates score well, Lectito can walk upward and treat them as one article body instead of picking only a single subtree.
Table Handling
Layout tables and data tables are treated differently. Data tables should survive extraction. Layout tables should not dominate it.
Retry Strategy
If the first pass extracts too little text, Lectito retries with progressively looser settings before it gives up.
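A self-contained sketch of that idea, where try_extract is a hypothetical stand-in for a parse attempt and the (min_score, char_threshold) pairs are illustrative, not Lectito's internal schedule:

```rust
// Sketch of a progressive-loosening retry: try strict settings first,
// then relax min_score/char_threshold until an attempt succeeds.
fn extract_with_retries(try_extract: impl Fn(f64, usize) -> Option<String>) -> Option<String> {
    let passes = [(25.0, 500), (15.0, 250), (5.0, 100)];
    for (min_score, char_threshold) in passes {
        if let Some(body) = try_extract(min_score, char_threshold) {
            return Some(body);
        }
    }
    None
}

fn main() {
    // Stub that only "succeeds" once the threshold drops to 250 chars.
    let stub = |_score: f64, threshold: usize| {
        if threshold <= 250 { Some("short article".to_string()) } else { None }
    };
    assert_eq!(extract_with_retries(stub).as_deref(), Some("short article"));
}
```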
Thresholds
Two thresholds still matter most:
Score Threshold
Minimum score for extraction.
If no element scores high enough, extraction fails with LectitoError::NotReadable.
Character Threshold
Minimum character count for meaningful content.
Even with a strong score, content must still be large enough to count as readable.
Scoring Edge Cases
Empty Elements
Elements with no text receive a negligible score and are ignored.
Nested Elements
Both parent and child elements are scored. The best candidate can appear at any level of the tree.
Sibling Elements
Adjacent elements with similar scores may be grouped as part of the same article.
Negative Scores
Elements that look like navigation or chrome can end up with negative scores and fall out of contention.
Configuration Affecting Scoring
Adjust scoring behavior with ReadabilityConfig:
- `min_score`
- `char_threshold`
- `nb_top_candidates`
- `max_elems_to_parse`
- `remove_unlikely`
See Configuration for details.
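As a hedged sketch of tuning these knobs: the setter names below are assumed to mirror the `ReadabilityConfig` field names (verify against the generated API docs before relying on them):

```rust
use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical builder calls; setter names assumed to match the fields.
    let config = ReadabilityConfig::builder()
        .min_score(15.0)       // accept lower-scoring candidates...
        .char_threshold(250)   // ...but demand at least 250 characters
        .nb_top_candidates(5)  // compare the five best-scoring elements
        .remove_unlikely(true) // strip nav/sidebar-like elements up front
        .build();

    let readability = Readability::with_config(config);
    let article = readability.parse("<html><body>...</body></html>")?;
    println!("{:?}", article.metadata.title);
    Ok(())
}
```

Loosening `min_score` while raising `char_threshold` trades precision in scoring for a harder floor on extracted length, which is one way to handle short but well-structured pages.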
API Overview
Reference for the Lectito Rust library API.
Core Types
Article
The main result type containing extracted content, metadata, and derived metrics.
pub struct Article {
pub content: String,
pub text_content: String,
pub metadata: Metadata,
pub length: usize,
pub word_count: usize,
pub reading_time: f64,
pub source_url: Option<String>,
pub confidence: f64,
pub diagnostics: Option<ExtractionDiagnostics>,
}
Common methods:
- `to_markdown() -> Result<String>`
- `to_markdown_with_config(&MarkdownConfig) -> Result<String>`
- `to_json() -> Result<serde_json::Value>`
- `to_text() -> String`
- `to_format(OutputFormat) -> Result<String>`
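A brief sketch of working with an `Article`, combining the `parse` entry point from the Quick Start with the fields and methods listed above:

```rust
use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"<html><body><article>
        <h1>Title</h1><p>Some readable content.</p>
    </article></body></html>"#;

    let article = parse(html)?;

    // Derived metrics come precomputed on the Article.
    println!(
        "{} words, ~{:.1} min read, confidence {:.2}",
        article.word_count, article.reading_time, article.confidence
    );

    // Convert to whichever output format you need.
    let markdown = article.to_markdown()?;
    println!("{markdown}");
    Ok(())
}
```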
Metadata
Extracted article metadata.
pub struct Metadata {
pub title: Option<String>,
pub author: Option<String>,
pub date: Option<String>,
pub excerpt: Option<String>,
pub site_name: Option<String>,
pub image: Option<String>,
pub favicon: Option<String>,
pub word_count: Option<usize>,
pub reading_time_minutes: Option<f64>,
pub language: Option<String>,
}
LectitoError
Main error type for extraction, parsing, and fetch failures.
Notable variants:
- `NotReadable { score, threshold }`
- `InvalidUrl(String)`
- `Timeout { timeout }`
- `HtmlParseError(String)`
- `NoContent`
- `FileNotFound(PathBuf)`
- `ConfigError(String)`
- `SiteConfigError(String)`

`HttpError(reqwest::Error)` is available when the `fetch` feature is enabled.
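A sketch of matching on these variants (assuming `LectitoError` implements `Display`, as error enums typically do):

```rust
use lectito_core::{parse, LectitoError};

/// Extract plain text, logging why extraction failed instead of bubbling
/// the error up. Variant shapes follow the list above.
fn extract_text(html: &str) -> Option<String> {
    match parse(html) {
        Ok(article) => Some(article.text_content),
        Err(LectitoError::NotReadable { score, threshold }) => {
            eprintln!("not readable: scored {score}, needed {threshold}");
            None
        }
        Err(LectitoError::HtmlParseError(msg)) => {
            eprintln!("bad HTML: {msg}");
            None
        }
        Err(other) => {
            eprintln!("extraction failed: {other}");
            None
        }
    }
}
```

Distinguishing `NotReadable` from the other variants matters in practice: it is the one failure you can often recover from by retrying with a looser `ReadabilityConfig`.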
Configuration Types
ReadabilityConfig
Main configuration for content extraction.
pub struct ReadabilityConfig {
pub min_score: f64,
pub char_threshold: usize,
pub nb_top_candidates: usize,
pub max_elems_to_parse: usize,
pub remove_unlikely: bool,
pub keep_classes: bool,
pub preserve_images: bool,
pub preserve_video_embeds: bool,
}
Build with `ReadabilityConfig::builder()`.
FetchConfig
Configuration for HTTP fetching.
pub struct FetchConfig {
pub timeout: u64,
pub user_agent: String,
pub headers: HashMap<String, String>,
}
Main API Functions
parse
Parse an HTML string and extract an Article.
pub fn parse(html: &str) -> Result<Article>
parse_with_url
Parse HTML with URL context for relative link resolution.
pub fn parse_with_url(html: &str, url: &str) -> Result<Article>
is_probably_readable
Cheap pre-check for likely article pages.
pub fn is_probably_readable(html: &str) -> bool
fetch_url
Fetch raw HTML from a URL.
pub async fn fetch_url(url: &str, config: &FetchConfig) -> Result<String>
Requires the fetch feature.
fetch_and_parse
Fetch a URL and extract an article with default configuration.
pub async fn fetch_and_parse(url: &str) -> Result<Article>
Requires the fetch feature.
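A minimal async sketch, assuming the `fetch` feature is enabled and tokio is available as the runtime (tokio is an assumption of this sketch, not a requirement stated by the library):

```rust
use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch and extract in one call, using default configuration.
    let article = fetch_and_parse("https://example.com/article").await?;

    println!("Title: {:?}", article.metadata.title);
    println!("{}", article.to_markdown()?);
    Ok(())
}
```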
fetch_and_parse_with_config
Fetch a URL and extract an article with custom readability and fetch settings.
pub async fn fetch_and_parse_with_config(
url: &str,
readability_config: &ReadabilityConfig,
fetch_config: &FetchConfig,
) -> Result<Article>
Requires the fetch feature.
Readability Type
Readability is the main stateful API:
pub struct Readability { /* ... */ }
Common constructors and methods:
- `Readability::new()`
- `Readability::with_config(ReadabilityConfig)`
- `Readability::with_config_and_loader(ReadabilityConfig, ConfigLoader)`
- `parse(&self, html: &str) -> Result<Article>`
- `parse_with_url(&self, html: &str, url: &str) -> Result<Article>`
- `is_probably_readable(&self, html: &str) -> bool`
- `fetch_and_parse(&self, url: &str) -> Result<Article>`
- `fetch_and_parse_with_config(&self, url: &str, fetch_config: &FetchConfig) -> Result<Article>`
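The stateful API pays off when processing many documents: configure once, then reuse. A sketch (combining the constructors and methods listed above):

```rust
use lectito_core::Readability;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build one instance with default settings and reuse it.
    let readability = Readability::new();

    let pages = [
        "<html><body><article><p>First page.</p></article></body></html>",
        "<html><body><nav>Only navigation here.</nav></body></html>",
    ];

    for html in pages {
        // Cheap pre-check before paying for full extraction.
        if readability.is_probably_readable(html) {
            let article = readability.parse(html)?;
            println!("{:?}", article.metadata.title);
        }
    }
    Ok(())
}
```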
Lower-Level Types
For callers that need more control, Lectito also exposes:
- `Document` and `Element` for DOM access
- `ConfigLoader` and `ConfigLoaderBuilder` for site configuration loading
- `MarkdownConfig`, `JsonConfig`, and formatter types for output control
Feature Flags
| Feature | Default | Purpose |
|---|---|---|
| `fetch` | Yes | Async URL fetching with reqwest |
| `markdown` | Yes | Markdown conversion support |
| `siteconfig` | Yes | Site configuration support |