Lectito

A Rust library and CLI for extracting readable content from web pages.

What is Lectito?

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. It identifies and extracts the main article content from web pages, removing navigation, sidebars, advertisements, and other clutter.

Features

  • Content Extraction: Automatically identifies the main article content
  • Metadata Extraction: Pulls title, author, date, excerpt, site name, and language
  • Output Formats: HTML, Markdown, plain text, and JSON
  • URL Fetching: Built-in async HTTP client with timeout support
  • CLI: Simple command-line interface for quick extractions
  • Site Configuration: Optional XPath-based extraction rules for difficult sites

Use Cases

  • Web Scraping: Extract clean article content from web pages
  • AI Agents: Feed readable text to language models
  • Content Analysis: Analyze article text without HTML noise
  • Archival: Save clean copies of web content
  • CLI: Quick article extraction from the terminal

Quick Start

CLI

# Install
cargo install lectito-cli

# Extract from URL
lectito https://example.com/article

# Extract from local file
lectito article.html

# Pipe from stdin
curl https://example.com | lectito -

Library

use lectito_core::parse;

let html = r#"<html><body><article><h1>Title</h1><p>Content</p></article></body></html>"#;
let article = parse(html)?;

println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);

About the Name

"Lectito" is derived from the Latin legere (to read) and lectio (a reading or selection).

Lectito aims to select and present readable content from the chaos of the modern web.

Installation

Lectito provides both a CLI tool and a Rust library. Install whichever fits your needs.

CLI Installation

From crates.io

The easiest way to install the CLI is via cargo:

cargo install lectito-cli

This installs the lectito binary in your cargo bin directory (typically ~/.cargo/bin).

From Source

# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito

# Build and install
cargo install --path crates/cli

Pre-built Binaries

Pre-built binaries are available on the GitHub Releases page for Linux, macOS, and Windows.

Download the appropriate binary for your platform and place it in your PATH.

Verify Installation

lectito --version

You should see version information printed.

Library Installation

Add to your Cargo.toml:

[dependencies]
lectito-core = "0.1"

Then run cargo build to fetch and compile the dependency.

Feature Flags

The library has several optional features:

[dependencies]
lectito-core = { version = "0.1", features = ["fetch", "markdown"] }

Feature      Default  Description
fetch        Yes      Enable URL fetching with reqwest
markdown     Yes      Enable Markdown output format
siteconfig   Yes      Enable site configuration support

If you don't need URL fetching, disable the default features and opt back into only what you need:

[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }

Development Build

To build from source for development:

# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito

# Build the workspace
cargo build --release

# The CLI binary will be at target/release/lectito

Next Steps

Quick Start

Get started with Lectito in minutes.

CLI Quick Start

Basic Usage

Extract content from a URL:

lectito https://example.com/article

Extract from a local file:

lectito article.html

Extract from stdin:

curl https://example.com | lectito -

Save to File

lectito https://example.com/article -o article.md

Change Output Format

# JSON output
lectito https://example.com/article --format json

# Plain text output
lectito https://example.com/article --format text

Set Timeout

For slow-loading sites:

lectito https://example.com/article --timeout 60

Library Quick Start

Add Dependency

Add to Cargo.toml:

[dependencies]
lectito-core = "0.1"

Parse HTML String

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <!DOCTYPE html>
        <html>
            <head><title>My Article</title></head>
            <body>
                <article>
                    <h1>Article Title</h1>
                    <p>This is the article content with plenty of text.</p>
                </article>
            </body>
        </html>
    "#;

    let article = parse(html)?;

    println!("Title: {:?}", article.metadata.title);
    println!("Content: {}", article.to_markdown()?);

    Ok(())
}

Fetch and Parse URL

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let article = fetch_and_parse("https://example.com/article").await?;

    println!("Title: {:?}", article.metadata.title);
    println!("Word count: {}", article.word_count);

    Ok(())
}

Convert to Different Formats

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<h1>Title</h1><p>Content here</p>";
    let article = parse(html)?;

    // Markdown with frontmatter
    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    // Plain text
    let text = article.to_text();
    println!("{}", text);

    // Structured JSON
    let json = article.to_json()?;
    println!("{}", json);

    Ok(())
}

Common Patterns

Handle Errors

use lectito_core::{parse, LectitoError};

match parse("<html>...</html>") {
    Ok(article) => println!("Title: {:?}", article.metadata.title),
    Err(LectitoError::NotReadable { score, threshold }) => {
        eprintln!("Content not readable: score {} < threshold {}", score, threshold);
    }
    Err(e) => eprintln!("Error: {}", e),
}

Configure Extraction

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .preserve_images(true)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse("<html>...</html>")?;

    Ok(())
}

What's Next?

CLI Usage

Reference for the lectito command-line tool.

Basic Syntax

lectito [OPTIONS] [INPUT]

INPUT can be:

  • a URL starting with http:// or https://
  • a local file path
  • - to read from stdin
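
The dispatch rule above can be sketched in a few lines. This is a hypothetical helper written for illustration, not the CLI's actual code:

```rust
/// Classify a CLI INPUT argument following the rules above.
/// Hypothetical helper for illustration only.
fn classify_input(input: &str) -> &'static str {
    if input == "-" {
        "stdin"
    } else if input.starts_with("http://") || input.starts_with("https://") {
        "url"
    } else {
        "file"
    }
}

fn main() {
    assert_eq!(classify_input("https://example.com/article"), "url");
    assert_eq!(classify_input("article.html"), "file");
    assert_eq!(classify_input("-"), "stdin");
}
```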

Common Examples

Extract from a URL

lectito https://example.com/article

Extract from a File

lectito article.html

Read from stdin

curl https://example.com | lectito -

Output Options

-o, --output <FILE>

Write output to a file instead of stdout.

lectito https://example.com/article -o article.md

-f, --format <FORMAT>

Output format. Available values:

Format          Description
markdown or md  Markdown output
html            Cleaned HTML
text or txt     Plain text
json            Structured JSON

lectito https://example.com/article --format text

--json

Force structured JSON output regardless of --format.

lectito https://example.com/article --json

--references

Include a reference table in Markdown output or a references array in JSON output.

lectito https://example.com/article --references

--frontmatter

Include TOML frontmatter in Markdown output.

lectito https://example.com/article --frontmatter

-m, --metadata-only

Output metadata only.

lectito https://example.com/article --metadata-only

--metadata-format <FORMAT>

Metadata output format for --metadata-only. Supported values: toml, json.

lectito https://example.com/article --metadata-only --metadata-format json

Extraction Options

--timeout <SECS>

HTTP timeout in seconds. Default: 30.

--user-agent <UA>

Custom User-Agent for HTTP requests.

-c, --config-dir <DIR>

Directory containing site configuration files.

--char-threshold <NUM>

Minimum character threshold for content candidates. Default: 500.

--max-elements <NUM>

Maximum number of top candidates to track. Default: 5.

--no-images

Strip images from output.

-v, --verbose

Enable verbose logging and timing output.

Shell Completions

--completions <SHELL>

Generate a completion script for bash, zsh, fish, or powershell.

lectito --completions zsh

Help and Version

lectito --help
lectito --version

Output Shapes

Markdown

With --frontmatter, Markdown output starts with TOML frontmatter and then the extracted body.

JSON

--format json and --json emit structured output with:

  • metadata
  • content.markdown
  • content.text
  • content.html
  • optional references

Metadata-Only

--metadata-only emits either:

  • TOML metadata
  • JSON metadata

without the extracted body.

Common Workflows

Save a Markdown export

lectito https://example.com/article --frontmatter --references -o article.md

Get JSON for downstream processing

lectito https://example.com/article --json | jq '.metadata.title'

Extract text without images

lectito https://example.com/article --format text --no-images

Basic Usage

Learn the fundamentals of using Lectito as a library.

Simple Parsing

The easiest way to extract content is with the parse function:

use lectito_core::{parse, Article};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <!DOCTYPE html>
        <html>
            <head><title>My Article</title></head>
            <body>
                <article>
                    <h1>Article Title</h1>
                    <p>This is the article content.</p>
                </article>
            </body>
        </html>
    "#;

    let article: Article = parse(html)?;

    println!("Title: {:?}", article.metadata.title);
    println!("Confidence: {:.2}", article.confidence);
    println!("Content: {}", article.to_markdown()?);

    Ok(())
}

Fetching and Parsing

For URLs, use the fetch_and_parse function:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/article";
    let article = fetch_and_parse(url).await?;

    println!("Title: {:?}", article.metadata.title);
    println!("Author: {:?}", article.metadata.author);
    println!("Word count: {}", article.word_count);

    Ok(())
}

This helper requires the fetch feature.

Working with the Article

The Article struct contains the extracted content, metadata, and derived metrics.

Metadata

use lectito_core::parse;

let html = "<html>...</html>";
let article = parse(html)?;

if let Some(title) = article.metadata.title {
    println!("Title: {}", title);
}

if let Some(author) = article.metadata.author {
    println!("Author: {}", author);
}

if let Some(date) = article.metadata.date {
    println!("Published: {}", date);
}

if let Some(excerpt) = article.metadata.excerpt {
    println!("Excerpt: {}", excerpt);
}

Content Access

use lectito_core::parse;

let html = "<html>...</html>";
let article = parse(html)?;

let html_content = &article.content;
let text = article.to_text();
let markdown = article.to_markdown()?;
let json = article.to_json()?;

to_markdown() requires the markdown feature.

Readability API

For more control, use the Readability API:

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    // Parse with the default configuration
    let reader = Readability::new();
    let article = reader.parse(html)?;

    // Or tune the extraction with a custom configuration
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .nb_top_candidates(8)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse(html)?;

    Ok(())
}

Error Handling

Lectito returns Result<T, LectitoError>. Handle errors appropriately:

use lectito_core::{parse, LectitoError};

fn extract_article(html: &str) -> Result<String, String> {
    match parse(html) {
        Ok(article) => article.to_markdown().map_err(|e| format!("Markdown conversion failed: {}", e)),
        Err(LectitoError::NotReadable { score, threshold }) => {
            Err(format!("Content not readable: score {} < threshold {}", score, threshold))
        }
        Err(LectitoError::InvalidUrl(msg)) => {
            Err(format!("Invalid URL: {}", msg))
        }
        Err(e) => Err(format!("Extraction failed: {}", e)),
    }
}

Common Patterns

Parse with URL Context

When you have the URL, provide it for better relative link resolution:

use lectito_core::{parse_with_url, Article};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let url = "https://example.com/article";

    let article: Article = parse_with_url(html, url)?;

    assert_eq!(article.source_url.as_deref(), Some(url));
    Ok(())
}

Check if Content is Probably Readable

For a quick pre-check:

use lectito_core::is_probably_readable;

fn main() {
    let html = "<html>...</html>";

    if is_probably_readable(html) {
        println!("Content looks readable");
    } else {
        println!("Content may not be readable");
    }
}

Working with Documents

For lower-level DOM manipulation:

use lectito_core::{Document, Element};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html><body><p>Hello</p></body></html>";

    let doc = Document::parse(html)?;
    let elements: Vec<Element> = doc.select("p")?;

    for element in elements {
        println!("Text: {}", element.text());
    }

    Ok(())
}

Configuration

Customize Lectito's extraction behavior with configuration options.

ReadabilityConfig

The ReadabilityConfig struct controls extraction parameters. Use the builder pattern:

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .nb_top_candidates(8)
        .preserve_images(true)
        .preserve_video_embeds(true)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse("<html>...</html>")?;

    Ok(())
}

Readability Options

Field                  Default  Description
min_score              20.0     Minimum score required for extraction
char_threshold         500      Minimum character count for strong candidates
nb_top_candidates      5        Number of top candidates to keep during scoring
max_elems_to_parse     0        Maximum number of elements to score (0 means unlimited)
remove_unlikely        true     Remove obvious chrome before scoring
keep_classes           false    Preserve class attributes in output HTML
preserve_images        true     Keep images in extracted content
preserve_video_embeds  true     Keep supported video embeds

Strict Extraction

For high-quality content only:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(30.0)
    .char_threshold(1000)
    .build();

Lenient Extraction

For short pages or difficult layouts:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(10.0)
    .char_threshold(200)
    .remove_unlikely(false)
    .build();

Text-Only Extraction

Remove images and embeds:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .preserve_images(false)
    .preserve_video_embeds(false)
    .build();

FetchConfig

Configure HTTP fetching behavior:

use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetch_config = FetchConfig {
        timeout: 60,
        user_agent: "MyBot/1.0".to_string(),
        headers: HashMap::new(),
    };

    let read_config = ReadabilityConfig::builder()
        .min_score(25.0)
        .build();

    let article = fetch_and_parse_with_config(
        "https://example.com/article",
        &read_config,
        &fetch_config,
    ).await?;

    Ok(())
}

Fetch Options

Field       Type                     Default                  Description
timeout     u64                      30                       Request timeout in seconds
user_agent  String                   browser-like Lectito UA  User-Agent header value
headers     HashMap<String, String>  empty                    Extra request headers

Default Values

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::default();

assert_eq!(config.min_score, 20.0);
assert_eq!(config.char_threshold, 500);
assert_eq!(config.nb_top_candidates, 5);
assert_eq!(config.max_elems_to_parse, 0);
assert!(config.remove_unlikely);
assert!(!config.keep_classes);
assert!(config.preserve_images);
assert!(config.preserve_video_embeds);

Site Configuration

For sites that require custom extraction rules, use the site configuration feature:

[dependencies]
lectito-core = { version = "0.1", features = ["siteconfig"] }

Site configuration uses the FTR-style ruleset and the ConfigLoader APIs to apply per-site extraction rules.

Next Steps

Async vs Sync

Understanding Lectito's async and synchronous APIs.

Overview

Lectito provides both synchronous and asynchronous APIs:

Function           Async/Sync  Use Case
parse()            Sync        Parse HTML from string
parse_with_url()   Sync        Parse with URL context
fetch_and_parse()  Async       Fetch from URL, then parse
fetch_url()        Async       Fetch HTML from URL

The async fetch helpers require the fetch feature.

When to Use Each

Use Sync APIs When

  • You already have the HTML as a string
  • You're using your own HTTP client
  • Performance is not critical
  • You're integrating into synchronous code

Use Async APIs When

  • You need to fetch from URLs
  • You're already using async/await
  • You want concurrent fetches
  • Performance matters for network operations

Synchronous Parsing

Parse HTML that you already have:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;
    Ok(())
}

Asynchronous Fetching

Fetch and parse in one operation:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/article";
    let article = fetch_and_parse(url).await?;
    Ok(())
}

Manual Fetch and Parse

Use your own HTTP client:

use lectito_core::parse;
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let response = client.get("https://example.com/article")
        .send()
        .await?;

    let html = response.text().await?;
    let article = parse(&html)?;

    Ok(())
}

Concurrent Fetches

Fetch multiple articles concurrently:

use lectito_core::fetch_and_parse;
use futures::future::join_all;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let urls = vec![
        "https://example.com/article1",
        "https://example.com/article2",
        "https://example.com/article3",
    ];

    let futures: Vec<_> = urls.into_iter()
        .map(|url| fetch_and_parse(url))
        .collect();

    let articles = join_all(futures).await;

    for article in articles {
        match article {
            Ok(a) => println!("Got: {:?}", a.metadata.title),
            Err(e) => eprintln!("Error: {}", e),
        }
    }

    Ok(())
}

Batch Processing

Process URLs with concurrency limits:

use lectito_core::fetch_and_parse;
use futures::stream::{self, StreamExt};

async fn process_urls(urls: Vec<String>) -> Result<(), Box<dyn std::error::Error>> {
    // Build a stream of fetch futures and run at most 5 at a time.
    let mut articles = stream::iter(urls)
        .map(|url| async move { fetch_and_parse(&url).await })
        .buffer_unordered(5); // 5 concurrent requests

    while let Some(article) = articles.next().await {
        println!("Processed: {:?}", article?.metadata.title);
    }

    Ok(())
}

Sync Code in Async Context

If you need to use sync parsing in async code:

use lectito_core::parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch with your async HTTP client
    let html = fetch_html().await?;

    // Parse is sync, but that's fine in async context
    let article = parse(&html)?;

    Ok(())
}

async fn fetch_html() -> Result<String, Box<dyn std::error::Error>> {
    // Your async fetching logic
    Ok(String::from("<html>...</html>"))
}

Performance Considerations

Parsing (Sync)

Parsing is CPU-bound and runs synchronously:

use lectito_core::parse;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    let start = Instant::now();
    let article = parse(html)?;
    let duration = start.elapsed();

    println!("Parsed in {:?}", duration);

    Ok(())
}

Fetching (Async)

Fetching is I/O-bound and benefits from async:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let start = std::time::Instant::now();
    let article = fetch_and_parse("https://example.com/article").await?;
    let duration = start.elapsed();

    println!("Fetched and parsed in {:?}", duration);

    Ok(())
}

Choosing the Right Approach

Scenario              Recommended Approach
Have HTML string      parse() (sync)
Need to fetch URL     fetch_and_parse() (async)
Custom HTTP client    Your client + parse() (sync)
Batch URL processing  fetch_and_parse() with concurrent futures
CLI tool              Depends on your runtime setup
Web server            fetch_and_parse() (async) for throughput

Output Formats

Work with different output formats: Markdown, JSON, text, and HTML.

Overview

The Article struct provides several ways to render extracted content:

Method                     Format                        Requires Feature
to_markdown()              Markdown                      markdown
to_markdown_with_config()  Markdown with custom options  markdown
to_json()                  Serialized Article JSON       always available
to_text()                  Plain text                    always available
content field              Cleaned HTML                  always available

Markdown

Convert an article to Markdown:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    Ok(())
}

Markdown Configuration

Use MarkdownConfig for frontmatter, references, and image handling:

use lectito_core::{parse, MarkdownConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let config = MarkdownConfig {
        include_frontmatter: true,
        include_references: true,
        strip_images: false,
        include_title_heading: true,
    };

    let markdown = article.to_markdown_with_config(&config)?;
    println!("{}", markdown);

    Ok(())
}

Frontmatter Fields

When include_frontmatter is enabled, Lectito can emit fields such as:

+++
title = "Article Title"
author = "John Doe"
date = "2025-01-17"
site = "Example"
image = "https://example.com/image.jpg"
favicon = "https://example.com/favicon.ico"
excerpt = "A brief description of the article"
word_count = 500
reading_time_minutes = 2.5
+++
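
The reading_time_minutes value is consistent with the common ~200 words-per-minute convention; that constant is our assumption here, not a documented guarantee of the library:

```rust
/// Estimate reading time in minutes, assuming ~200 words per minute.
/// Illustrative sketch; the library's exact constant may differ.
fn reading_time_minutes(word_count: usize) -> f64 {
    const WORDS_PER_MINUTE: f64 = 200.0;
    word_count as f64 / WORDS_PER_MINUTE
}

fn main() {
    // Matches the frontmatter above: 500 words gives 2.5 minutes.
    assert!((reading_time_minutes(500) - 2.5).abs() < 1e-9);
}
```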

JSON

Article::to_json() returns a serialized view of the article itself:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let json = article.to_json()?;
    println!("{}", json);

    Ok(())
}

JSON Structure

{
  "content": "<div>Cleaned HTML content...</div>",
  "text_content": "Plain text content...",
  "metadata": {
    "title": "Article Title",
    "author": "John Doe",
    "date": "2025-01-17",
    "excerpt": "A brief description",
    "site_name": "Example",
    "language": "en"
  },
  "length": 1234,
  "word_count": 500,
  "reading_time": 2.5,
  "source_url": "https://example.com/article",
  "confidence": 0.92
}

Plain Text

Extract just the text content:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let text = article.to_text();
    println!("{}", text);

    Ok(())
}

Plain text preserves the readable text content without HTML tags.

HTML

Access the cleaned HTML directly:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let cleaned_html = &article.content;
    println!("{}", cleaned_html);

    Ok(())
}

The cleaned HTML:

  • removes clutter such as navigation and ads
  • keeps the main content structure
  • preserves images when preserve_images is enabled
  • preserves supported embeds when preserve_video_embeds is enabled

Choosing a Format

Format    Use Case
Markdown  Blog posts, docs, static publishing
JSON      APIs, storage, downstream processing
Text      Analysis, indexing, search
HTML      Web display or further HTML processing

How It Works

Understanding the Lectito content extraction pipeline.

Overview

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.

Extraction Pipeline

The extraction process still follows the same core shape:

HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article

1. Preprocessing

Clean the HTML to improve scoring accuracy:

  • Remove unlikely content: scripts, styles, iframes, and hidden nodes
  • Strip elements with unlikely class/ID patterns
  • Preserve structure: maintain HTML hierarchy for accurate scoring

Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.

2. Scoring

Score each element based on content characteristics:

  • Tag score: Different HTML tags have different base scores
  • Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
  • Content density: Length and punctuation indicate content quality
  • Link density: Too many links suggest navigation or metadata, not content

Why: Scoring identifies which elements are most likely to contain the main article content.

3. Selection

Select the highest-scoring element as the article candidate:

  • Find element with highest score
  • Bias toward semantic containers when scores are close
  • Check if score meets the minimum threshold
  • Check if content length meets the minimum threshold
  • Return an error if content doesn't meet thresholds

Why: Selection ensures we extract actual article content, not navigation or ads.

4. Post-processing

Clean up the selected content:

  • Include sibling elements: adjacent content blocks and shared-parent headers
  • Remove remaining clutter: ads, comments, social widgets
  • Clean up whitespace: normalize spacing and formatting
  • Preserve structure: maintain headings, paragraphs, and lists

Why: Post-processing improves the quality of extracted content and includes related elements.

Branch Additions

The current branch layers a few extra passes around that core flow:

  • Retry strategy: if the first pass comes back short, Lectito retries with progressively looser settings before giving up
  • Site-specific extraction: built-in extractors and optional site configs can override the generic scorer for difficult sites
  • Confidence and diagnostics: successful extractions carry a confidence score and can include pass-by-pass diagnostics

Those additions sit around the original pipeline. They do not replace it.

Data Flow

Input HTML
    ↓
parse_to_document()
    ↓
preprocess_html() → Cleaned HTML
    ↓
build_dom_tree() → DOM Tree
    ↓
calculate_score() → Scored Elements
    ↓
extract_content() → Selected Element
    ↓
postprocess_html() → Cleaned Content
    ↓
extract_metadata() → Metadata
    ↓
Article

Key Components

Document and Element

The Document and Element types wrap the scraper crate's HTML parsing:

use lectito_core::{Document, Element};

let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;

These provide a convenient API for DOM manipulation and element traversal.

Scoring Algorithm

The scoring algorithm combines multiple factors:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score)
               × (1 - link_density)

See Scoring Algorithm for details.

Metadata Extraction

Separate process extracts metadata from the HTML:

  • Title: <h1>, <title>, or Open Graph tags
  • Author: meta tags, bylines, schema.org
  • Date: meta tags, time elements, schema.org
  • Excerpt: meta description, first paragraph

Why This Approach

Content Over Structure

Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.

Heuristic-Based

The algorithm uses heuristics derived from analyzing lots of article pages. That keeps it flexible across different site designs.

Fallback Mechanism

For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.

Limitations

Sites That May Fail

  • Very short pages (tweets, status updates)
  • Non-article content (product pages, search results)
  • Unusual layouts
  • Heavily JavaScript-dependent content

Improving Extraction

For difficult sites:

  1. Adjust thresholds such as min_score or char_threshold
  2. Provide a site configuration
  3. Add a site-specific extractor when generic scoring is not enough

See Configuration for options.

Comparison to Alternatives

Approach          Pros                                                                  Cons
Lectito           Works across many sites, no custom rules needed                       May fail on unusual layouts
Defuddle          Strong HTML and Markdown output, forgiving cleanup, richer metadata   JavaScript and DOM-oriented, not a Rust-native library or CLI stack
XPath             Precise, predictable                                                  Requires custom rules per site
CSS Selectors     Simple, familiar                                                      Brittle, breaks on layout changes
Machine Learning  Adaptable                                                             Complex, requires training data

Lectito strikes a balance by working well for most sites without custom rules, with site configuration as a fallback.

Scoring Algorithm

Detailed explanation of how Lectito scores HTML elements to identify article content.

Overview

The scoring algorithm assigns a numeric score to each HTML element, indicating how likely it is to contain the main article content. Higher scores indicate better content candidates.

The exact weights evolve as the extractor improves, so treat this page as a guide to the scoring logic, not a frozen ABI.

Score Formula

At a high level, the score still looks like this:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score
               + container_bonus)
               × (1 - link_density)
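
To make the formula concrete, here is a worked example with invented weights. The numbers are purely illustrative; the real values live inside the extractor and evolve over time:

```rust
/// Combine the factors above into a single score, as in the formula.
/// All weights here are made up for illustration.
fn element_score(base: f64, class_id: f64, density: f64, container_bonus: f64, link_density: f64) -> f64 {
    (base + class_id + density + container_bonus) * (1.0 - link_density)
}

fn main() {
    // e.g. an article-like container with a modest link density of 0.1:
    // (5 + 25 + 12 + 3) × 0.9 = 40.5
    let score = element_score(5.0, 25.0, 12.0, 3.0, 0.1);
    assert!((score - 40.5).abs() < 1e-9);
}
```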

Base Tag Score

Different HTML tags have different inherent scores, reflecting their likelihood of containing content:

Tag           Typical Bias       Rationale
<article>     Positive           Semantic article container
<section>     Positive           Logical content section
<div>         Positive           Generic container, often used for content
<blockquote>  Slightly positive  Quoted content
<pre>         Neutral            Preformatted text
<header>      Negative           Header, not main content
<footer>      Negative           Footer, not main content
<nav>         Negative           Navigation
<form>        Negative           Forms, not content

Class/ID Weight

Class and ID attributes strongly indicate element purpose.

Positive patterns bias the scorer toward article-like containers. Negative patterns bias it away from sidebars, menus, comments, related-story blocks, and similar chrome.

Examples:

  • Positive: class="article-content", id="main-content"
  • Negative: class="sidebar", id="footer", class="navigation"

Content Density Score

The scorer rewards elements with substantial text content:

  • more readable text
  • more punctuation and sentence structure
  • less boilerplate

Real article content tends to have more continuous prose than navigation or metadata.

Link Density

Nodes packed with links are usually navigation, metadata, or related-story rails, not the article body.

link_density = linked_text / total_text

Higher link density reduces the final score.
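
The ratio above is straightforward to compute. A minimal sketch, using character counts as the unit (one reasonable choice; the library's actual accounting may differ):

```rust
/// linked_text / total_text, as defined above.
/// Sketch for illustration, measured in characters.
fn link_density(linked_chars: usize, total_chars: usize) -> f64 {
    if total_chars == 0 {
        return 0.0; // avoid dividing by zero on empty nodes
    }
    linked_chars as f64 / total_chars as f64
}

fn main() {
    // A nav-like node: most of its text sits inside links.
    assert!((link_density(900, 1000) - 0.9).abs() < 1e-9);
    // An article-like node: mostly prose, few links.
    assert!((link_density(50, 1000) - 0.05).abs() < 1e-9);
}
```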

Branch-Specific Heuristics

The current branch adds a few important refinements on top of the classic Readability-style score:

Entry-Point Bias

Common article containers such as article, main, and well-known content wrappers get an early structural advantage before raw text density decides the winner.

Sibling Aggregation

When several nearby candidates score well, Lectito can walk upward and treat them as one article body instead of picking only a single subtree.

Table Handling

Layout tables and data tables are treated differently. Data tables should survive extraction. Layout tables should not dominate it.

Retry Strategy

If the first pass extracts too little text, Lectito retries with progressively looser settings before it gives up.
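The retry idea can be sketched as a loop over progressively looser (score, character) thresholds. The concrete relaxation schedule below is an assumption, not Lectito's actual settings:

```rust
// Sketch of the retry strategy: relax thresholds step by step until an
// attempt yields content. The schedule values are illustrative.
fn extract_with_retries(
    extract: impl Fn(f64, usize) -> Option<String>,
) -> Option<String> {
    // (min_score, char_threshold) pairs, from strict to loose.
    let attempts = [(20.0, 500), (10.0, 250), (5.0, 100)];
    for (min_score, char_threshold) in attempts {
        if let Some(text) = extract(min_score, char_threshold) {
            return Some(text);
        }
    }
    None // every pass came up short
}
```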

Thresholds

Two thresholds still matter most:

Score Threshold

Minimum score for extraction.

If no element scores high enough, extraction fails with LectitoError::NotReadable.

Character Threshold

Minimum character count for meaningful content.

Even with a strong score, content must still be large enough to count as readable.
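The two gates compose as a simple check. This sketch uses a local error enum whose `NotReadable` shape mirrors the documented `LectitoError` variant; the actual control flow is internal to the library:

```rust
// Hedged sketch of the two-threshold gate. The error shapes mirror the
// documented LectitoError variants but are redefined locally here.
#[derive(Debug, PartialEq)]
enum ExtractError {
    NotReadable { score: f64, threshold: f64 },
    NoContent,
}

fn check_candidate(
    score: f64,
    min_score: f64,
    chars: usize,
    char_threshold: usize,
) -> Result<(), ExtractError> {
    if score < min_score {
        // No element scored high enough.
        return Err(ExtractError::NotReadable { score, threshold: min_score });
    }
    if chars < char_threshold {
        // Strong score, but too little text to count as readable.
        return Err(ExtractError::NoContent);
    }
    Ok(())
}
```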

Scoring Edge Cases

Empty Elements

Elements with no text receive a negligible score and are ignored.

Nested Elements

Both parent and child elements are scored. The best candidate can appear at any level of the tree.

Sibling Elements

Adjacent elements with similar scores may be grouped as part of the same article.

Negative Scores

Elements that look like navigation or chrome can end up with negative scores and fall out of contention.

Configuration Affecting Scoring

Adjust scoring behavior with ReadabilityConfig:

  • min_score
  • char_threshold
  • nb_top_candidates
  • max_elems_to_parse
  • remove_unlikely

See Configuration for details.

API Overview

Reference for the Lectito Rust library API.

Core Types

Article

The main result type containing extracted content, metadata, and derived metrics.

pub struct Article {
    pub content: String,
    pub text_content: String,
    pub metadata: Metadata,
    pub length: usize,
    pub word_count: usize,
    pub reading_time: f64,
    pub source_url: Option<String>,
    pub confidence: f64,
    pub diagnostics: Option<ExtractionDiagnostics>,
}

Common methods:

  • to_markdown() -> Result<String>
  • to_markdown_with_config(&MarkdownConfig) -> Result<String>
  • to_json() -> Result<serde_json::Value>
  • to_text() -> String
  • to_format(OutputFormat) -> Result<String>
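The derived metrics are simple functions of the extracted text. For instance, reading_time follows from word_count divided by a words-per-minute rate; the 200 wpm constant here is an illustrative assumption, not Lectito's actual value:

```rust
// Illustrative only: the words-per-minute rate Lectito uses internally
// is an implementation detail; 200 wpm is assumed here for the sketch.
fn reading_time_minutes(word_count: usize) -> f64 {
    word_count as f64 / 200.0
}
```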

Metadata

Extracted article metadata.

pub struct Metadata {
    pub title: Option<String>,
    pub author: Option<String>,
    pub date: Option<String>,
    pub excerpt: Option<String>,
    pub site_name: Option<String>,
    pub image: Option<String>,
    pub favicon: Option<String>,
    pub word_count: Option<usize>,
    pub reading_time_minutes: Option<f64>,
    pub language: Option<String>,
}

LectitoError

Main error type for extraction, parsing, and fetch failures.

Notable variants:

  • NotReadable { score, threshold }
  • InvalidUrl(String)
  • Timeout { timeout }
  • HtmlParseError(String)
  • NoContent
  • FileNotFound(PathBuf)
  • ConfigError(String)
  • SiteConfigError(String)

HttpError(reqwest::Error) is available when the fetch feature is enabled.

Configuration Types

ReadabilityConfig

Main configuration for content extraction.

pub struct ReadabilityConfig {
    pub min_score: f64,
    pub char_threshold: usize,
    pub nb_top_candidates: usize,
    pub max_elems_to_parse: usize,
    pub remove_unlikely: bool,
    pub keep_classes: bool,
    pub preserve_images: bool,
    pub preserve_video_embeds: bool,
}

Build with ReadabilityConfig::builder().
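A hedged sketch of builder usage, assuming setter methods named after the public fields (check the generated crate docs for the exact builder API):

// Assumed: setters mirror the field names; the import path follows the
// crate layout shown in Quick Start.
use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(15.0)
    .char_threshold(250)
    .remove_unlikely(true)
    .build();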

FetchConfig

Configuration for HTTP fetching.

pub struct FetchConfig {
    pub timeout: u64,
    pub user_agent: String,
    pub headers: HashMap<String, String>,
}

Main API Functions

parse

Parse an HTML string and extract an Article.

pub fn parse(html: &str) -> Result<Article>

parse_with_url

Parse HTML with URL context for relative link resolution.

pub fn parse_with_url(html: &str, url: &str) -> Result<Article>

is_probably_readable

Cheap pre-check for likely article pages.

pub fn is_probably_readable(html: &str) -> bool

fetch_url

Fetch raw HTML from a URL.

pub async fn fetch_url(url: &str, config: &FetchConfig) -> Result<String>

Requires the fetch feature.

fetch_and_parse

Fetch a URL and extract an article with default configuration.

pub async fn fetch_and_parse(url: &str) -> Result<Article>

Requires the fetch feature.

fetch_and_parse_with_config

Fetch a URL and extract an article with custom readability and fetch settings.

pub async fn fetch_and_parse_with_config(
    url: &str,
    readability_config: &ReadabilityConfig,
    fetch_config: &FetchConfig,
) -> Result<Article>

Requires the fetch feature.

Readability Type

Readability is the main stateful API:

pub struct Readability { /* ... */ }

Common constructors and methods:

  • Readability::new()
  • Readability::with_config(ReadabilityConfig)
  • Readability::with_config_and_loader(ReadabilityConfig, ConfigLoader)
  • parse(&self, html: &str) -> Result<Article>
  • parse_with_url(&self, html: &str, url: &str) -> Result<Article>
  • is_probably_readable(&self, html: &str) -> bool
  • fetch_and_parse(&self, url: &str) -> Result<Article>
  • fetch_and_parse_with_config(&self, url: &str, fetch_config: &FetchConfig) -> Result<Article>

Lower-Level Types

For callers that need more control, Lectito also exposes:

  • Document and Element for DOM access
  • ConfigLoader and ConfigLoaderBuilder for site configuration loading
  • MarkdownConfig, JsonConfig, and formatter types for output control

Feature Flags

Feature      Default   Purpose
fetch        Yes       Async URL fetching with reqwest
markdown     Yes       Markdown conversion support
siteconfig   Yes       Site configuration support