Lectito

A Rust library and CLI for extracting readable content from web pages.

What is Lectito?

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. It identifies and extracts the main article content from web pages, removing navigation, sidebars, advertisements, and other clutter.

Features

  • Content Extraction: Automatically identifies the main article content
  • Metadata Extraction: Pulls title, author, date, excerpt, and language
  • Output Formats: HTML, Markdown, plain text, and JSON
  • URL Fetching: Built-in async HTTP client with timeout support
  • CLI: Simple command-line interface for quick extractions
  • Site Configuration: Optional XPath-based extraction rules for difficult sites

Use Cases

  • Web Scraping: Extract clean article content from web pages
  • AI Agents: Feed readable text to language models
  • Content Analysis: Analyze article text without HTML noise
  • Archival: Save clean copies of web content
  • CLI: Quick article extraction from the terminal

Quick Start

CLI

# Install
cargo install lectito-cli

# Extract from URL
lectito https://example.com/article

# Extract from local file
lectito article.html

# Pipe from stdin
curl https://example.com | lectito -

Library

use lectito_core::parse;

let html = r#"<html><body><article><h1>Title</h1><p>Content</p></article></body></html>"#;
let article = parse(html)?;

println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);

About the Name

"Lectito" is derived from the Latin legere (to read) and lectio (a reading or selection).

Lectito aims to select and present readable content from the chaos of the modern web.

Installation

Lectito provides both a CLI tool and a Rust library. Install whichever fits your needs.

CLI Installation

From crates.io

The easiest way to install the CLI is via cargo:

cargo install lectito-cli

This installs the lectito binary in your cargo bin directory (typically ~/.cargo/bin).

From Source

# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito

# Build and install
cargo install --path crates/cli

Pre-built Binaries

Pre-built binaries are available on the GitHub Releases page for Linux, macOS, and Windows.

Download the appropriate binary for your platform and place it in your PATH.

Verify Installation

lectito --version

You should see version information printed.

Library Installation

Add to your Cargo.toml:

[dependencies]
lectito-core = "0.1"

Then run cargo build to fetch and compile the dependency.

Feature Flags

The library has several optional features:

[dependencies]
lectito-core = { version = "0.1", features = ["fetch", "markdown"] }

Feature      Default   Description
fetch        Yes       Enable URL fetching with reqwest
markdown     Yes       Enable Markdown output format
siteconfig   Yes       Enable site configuration support

If you don't need URL fetching (e.g., you have your own HTTP client), disable the default features:

[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }

Development Build

To build from source for development:

# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito

# Build the workspace
cargo build --release

# The CLI binary will be at target/release/lectito

Quick Start

Get started with Lectito in minutes.

CLI Quick Start

Basic Usage

Extract content from a URL:

lectito https://example.com/article

Extract from a local file:

lectito article.html

Extract from stdin:

curl https://example.com | lectito -

Save to File

lectito https://example.com/article -o article.md

Change Output Format

# JSON output
lectito https://example.com/article --format json

# Plain text output
lectito https://example.com/article --format text

Set Timeout

For slow-loading sites:

lectito https://example.com/article --timeout 60

Library Quick Start

Add Dependency

Add to Cargo.toml:

[dependencies]
lectito-core = "0.1"

Parse HTML String

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <!DOCTYPE html>
        <html>
            <head><title>My Article</title></head>
            <body>
                <article>
                    <h1>Article Title</h1>
                    <p>This is the article content with plenty of text.</p>
                </article>
            </body>
        </html>
    "#;

    let article = parse(html)?;

    println!("Title: {:?}", article.metadata.title);
    println!("Content: {}", article.to_markdown()?);

    Ok(())
}

Fetch and Parse URL

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let article = fetch_and_parse("https://example.com/article").await?;

    println!("Title: {:?}", article.metadata.title);
    println!("Word count: {}", article.word_count);

    Ok(())
}

Convert to Different Formats

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<h1>Title</h1><p>Content here</p>";
    let article = parse(html)?;

    // Markdown with frontmatter
    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    // Plain text
    let text = article.to_text();
    println!("{}", text);

    // Structured JSON
    let json = article.to_json()?;
    println!("{}", json);

    Ok(())
}

Common Patterns

Handle Errors

use lectito_core::{parse, LectitoError};

match parse("<html>...</html>") {
    Ok(article) => println!("Title: {:?}", article.metadata.title),
    Err(LectitoError::NotReaderable { score, threshold }) => {
        eprintln!("Content not readable: score {} < threshold {}", score, threshold);
    }
    Err(e) => eprintln!("Error: {}", e),
}

Configure Extraction

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .preserve_images(true)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse("<html>...</html>")?;

    Ok(())
}

CLI Usage

Complete reference for the lectito command-line tool.

Basic Syntax

lectito [OPTIONS] <INPUT>

The INPUT can be:

  • A URL (starts with http:// or https://)
  • A local file path
  • - to read from stdin

Examples

URL Extraction

lectito https://example.com/article

Local File

lectito article.html

Stdin Pipe

curl https://example.com | lectito -
cat page.html | lectito -
wget -qO- https://example.com | lectito -

Options

-o, --output <FILE>

Write output to a file instead of stdout.

lectito https://example.com/article -o article.md

-f, --format <FORMAT>

Specify output format. Available formats:

Format           Description
markdown or md   Markdown (default)
json             Structured JSON
text or txt      Plain text
html             Cleaned HTML

lectito https://example.com/article -f json

--timeout <SECONDS>

HTTP request timeout in seconds (default: 30).

lectito https://example.com/article --timeout 60

--user-agent <USER_AGENT>

Custom User-Agent header.

lectito https://example.com/article --user-agent "MyBot/1.0"

--config <PATH>

Path to site configuration file (TOML format).

lectito https://example.com/article --config site-config.toml

-v, --verbose

Enable verbose debug logging.

lectito https://example.com/article -v

-h, --help

Display help information.

lectito --help

-V, --version

Display version information.

lectito --version

Common Workflows

Extract and Save Article

lectito https://example.com/article -o articles/article.md

Batch Processing Multiple URLs

while read url; do
    lectito "$url" -o "articles/$(date +%s).md"
done < urls.txt

Extract to JSON for Processing

lectito https://example.com/article --format json | jq '.metadata.title'

Extract from Multiple Files

for file in articles/*.html; do
    lectito "$file" -o "processed/$(basename "$file" .html).md"
done

Custom Timeout for Slow Sites

lectito https://slow-site.com/article --timeout 120

Output Formats

Markdown (Default)

Output includes TOML frontmatter with metadata (when --frontmatter is used):

+++
title = "Article Title"
author = "John Doe"
date = "2025-01-17"
excerpt = "A brief description..."
+++

# Article Title

Article content here...

JSON

Structured output with all metadata:

{
    "metadata": {
        "title": "Article Title",
        "author": "John Doe",
        "date": "2025-01-17",
        "excerpt": "A brief description..."
    },
    "content": "<div>...</div>",
    "text_content": "Article content here...",
    "word_count": 500
}

Plain Text

Just the article text without formatting:

Article Title

Article content here...

Exit Codes

Code   Meaning
0      Success
1      Error (invalid URL, network failure, etc.)

Error Handling

The CLI will print error messages to stderr:

lectito https://invalid-domain-xyz.com
# Error: failed to fetch URL: dns error: failed to lookup address information

For content that isn't readable:

lectito https://example.com/page
# Error: content not readable: score 15.2 < threshold 20.0

Tips

  1. Use timeouts: Set appropriate timeouts to avoid hanging
  2. Batch operations: Process multiple URLs in parallel
  3. Save to file: Use -o to avoid terminal rendering overhead
  4. JSON for parsing: Use JSON output when processing with other tools

Basic Usage

Learn the fundamentals of using Lectito as a library.

Simple Parsing

The easiest way to extract content is with the parse function:

use lectito_core::{parse, Article};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <!DOCTYPE html>
        <html>
            <head><title>My Article</title></head>
            <body>
                <article>
                    <h1>Article Title</h1>
                    <p>This is the article content.</p>
                </article>
            </body>
        </html>
    "#;

    let article: Article = parse(html)?;

    println!("Title: {:?}", article.metadata.title);
    println!("Content: {}", article.to_markdown()?);

    Ok(())
}

Fetching and Parsing

For URLs, use the fetch_and_parse function:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/article";
    let article = fetch_and_parse(url).await?;

    println!("Title: {:?}", article.metadata.title);
    println!("Author: {:?}", article.metadata.author);
    println!("Word count: {}", article.word_count);

    Ok(())
}

Working with the Article

The Article struct contains all extracted information:

Metadata

use lectito_core::parse;

let html = "<html>...</html>";
let article = parse(html)?;

// Access metadata
if let Some(title) = article.metadata.title {
    println!("Title: {}", title);
}

if let Some(author) = article.metadata.author {
    println!("Author: {}", author);
}

if let Some(date) = article.metadata.published_date {
    println!("Published: {}", date);
}

// Get excerpt
if let Some(excerpt) = article.metadata.excerpt {
    println!("Excerpt: {}", excerpt);
}

Content Access

use lectito_core::parse;

let html = "<html>...</html>";
let article = parse(html)?;

// Get cleaned HTML
let html_content = &article.content;

// Get plain text
let text = article.to_text();

// Get Markdown
let markdown = article.to_markdown()?;

// Get JSON
let json = article.to_json()?;

Readability API

For more control, use the Readability API:

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    // Use default config
    let reader = Readability::new();
    let article = reader.parse(html)?;

    // Or with custom config
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse(html)?;

    Ok(())
}

Error Handling

Lectito returns Result<T, LectitoError>. Handle errors appropriately:

use lectito_core::{parse, LectitoError};

fn extract_article(html: &str) -> Result<String, String> {
    match parse(html) {
        Ok(article) => Ok(article.to_markdown().unwrap_or_default()),
        Err(LectitoError::NotReaderable { score, threshold }) => {
            Err(format!("Content not readable: score {} < threshold {}", score, threshold))
        }
        Err(LectitoError::InvalidUrl(msg)) => {
            Err(format!("Invalid URL: {}", msg))
        }
        Err(e) => Err(format!("Extraction failed: {}", e)),
    }
}

Common Patterns

Parse with URL Context

When you have the URL, provide it for better relative link resolution:

use lectito_core::{parse_with_url, Article};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let url = "https://example.com/article";

    let article: Article = parse_with_url(html, url)?;

    // Relative links are now resolved correctly
    Ok(())
}

Check if Content is Readable

Before parsing, check if content meets readability thresholds:

use lectito_core::is_probably_readable;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    if is_probably_readable(html) {
        println!("Content is readable");
    } else {
        println!("Content may not be readable");
    }

    Ok(())
}

Working with Documents

For lower-level DOM manipulation:

use lectito_core::{Document, Element};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html><body><p>Hello</p></body></html>";

    let doc = Document::parse(html)?;
    let elements: Vec<Element> = doc.select("p")?;

    for element in elements {
        println!("Text: {}", element.text());
    }

    Ok(())
}

Integrations

With reqwest

use lectito_core::parse;
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let response = client.get("https://example.com/article")
        .send()
        .await?;

    let html = response.text().await?;
    let article = parse(&html)?;

    println!("Title: {:?}", article.metadata.title);

    Ok(())
}

With Scraper

If you're already using scraper, you can integrate:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    // Work with the article's HTML content
    println!("Cleaned HTML: {}", article.content);

    Ok(())
}

Configuration

Customize Lectito's extraction behavior with configuration options.

ReadabilityConfig

The ReadabilityConfig struct controls extraction parameters. Use the builder pattern:

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .preserve_images(true)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse("<html>...</html>")?;

    Ok(())
}

Configuration Options

min_score

Minimum readability score for content to be considered extractable (default: 20.0).

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(25.0)
    .build();

Higher values are more strict. If content scores below this threshold, parsing returns LectitoError::NotReaderable.

char_threshold

Minimum character count for content to be considered (default: 500).

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .char_threshold(1000)
    .build();

Raise this to reduce the chance of mistaking navigation blocks for content; lower it when you need to extract genuinely short pages or blog posts.

preserve_images

Whether to preserve images in the extracted content (default: true).

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .preserve_images(false)
    .build();

min_content_length

Minimum length for text content (default: 140).

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_content_length(200)
    .build();

min_score_threshold

Threshold for minimum score during scoring (default: 20.0).

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score_threshold(25.0)
    .build();

FetchConfig

Configure HTTP fetching behavior:

use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetch_config = FetchConfig {
        timeout: 60,
        user_agent: "MyBot/1.0".to_string(),
        ..Default::default()
    };

    let read_config = ReadabilityConfig::builder()
        .min_score(25.0)
        .build();

    let article = fetch_and_parse_with_config(
        "https://example.com/article",
        &fetch_config,
        &read_config
    ).await?;

    Ok(())
}

FetchConfig Options

Field        Type     Default         Description
timeout      u64      30              Request timeout in seconds
user_agent   String   "Lectito/..."   User-Agent header value

Default Values

impl Default for ReadabilityConfig {
    fn default() -> Self {
        Self {
            min_score: 20.0,
            char_threshold: 500,
            preserve_images: true,
            min_content_length: 140,
            min_score_threshold: 20.0,
        }
    }
}

Configuration Examples

Strict Extraction

For high-quality content only:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(30.0)
    .char_threshold(1000)
    .min_content_length(300)
    .build();

Lenient Extraction

For extracting from short pages:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(10.0)
    .char_threshold(200)
    .min_content_length(50)
    .build();

Text-Only Extraction

Remove images and multimedia:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .preserve_images(false)
    .build();

Custom Fetch Settings

Long timeout with custom user agent:

use lectito_core::FetchConfig;

let config = FetchConfig {
    timeout: 120,
    user_agent: "MyBot/1.0 (+https://example.com/bot)".to_string(),
};

Site Configuration

For sites that require custom extraction rules, use the site configuration feature (requires siteconfig feature):

[dependencies]
lectito-core = { version = "0.1", features = ["siteconfig"] }

Site configuration uses the FTR (FiveFilters Full-Text RSS) site-config format. See How It Works for details on site-specific extraction.

Async vs Sync

Understanding Lectito's async and synchronous APIs.

Overview

Lectito provides both synchronous and asynchronous APIs:

Function            Async/Sync   Use Case
parse()             Sync         Parse HTML from string
parse_with_url()    Sync         Parse with URL context
fetch_and_parse()   Async        Fetch from URL then parse
fetch_url()         Async        Fetch HTML from URL

When to Use Each

Use Sync APIs When

  • You already have the HTML as a string
  • You're using your own HTTP client
  • Performance is not critical
  • You're integrating into synchronous code

Use Async APIs When

  • You need to fetch from URLs
  • You're already using async/await
  • You want concurrent fetches
  • Performance matters for network operations

Synchronous Parsing

Parse HTML that you already have:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;
    Ok(())
}

Asynchronous Fetching

Fetch and parse in one operation:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/article";
    let article = fetch_and_parse(url).await?;
    Ok(())
}

Manual Fetch and Parse

Use your own HTTP client:

use lectito_core::parse;
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let response = client.get("https://example.com/article")
        .send()
        .await?;

    let html = response.text().await?;
    let article = parse(&html)?;

    Ok(())
}

Concurrent Fetches

Fetch multiple articles concurrently:

use lectito_core::fetch_and_parse;
use futures::future::join_all;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let urls = vec![
        "https://example.com/article1",
        "https://example.com/article2",
        "https://example.com/article3",
    ];

    let futures: Vec<_> = urls.into_iter()
        .map(|url| fetch_and_parse(url))
        .collect();

    let articles = join_all(futures).await;

    for article in articles {
        match article {
            Ok(a) => println!("Got: {:?}", a.metadata.title),
            Err(e) => eprintln!("Error: {}", e),
        }
    }

    Ok(())
}

Batch Processing

Process URLs with concurrency limits:

use lectito_core::fetch_and_parse;
use futures::stream::{self, StreamExt};

async fn process_urls(urls: Vec<String>) -> Result<(), Box<dyn std::error::Error>> {
    // Turn the URL list into a stream of in-flight fetches,
    // polling at most 5 requests concurrently.
    let mut results = stream::iter(urls)
        .map(|url| async move { fetch_and_parse(&url).await })
        .buffer_unordered(5);

    while let Some(article) = results.next().await {
        println!("Processed: {:?}", article?.metadata.title);
    }

    Ok(())
}

Sync Code in Async Context

If you need to use sync parsing in async code:

use lectito_core::parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch with your async HTTP client
    let html = fetch_html().await?;

    // Parse is sync, but that's fine in async context
    let article = parse(&html)?;

    Ok(())
}

async fn fetch_html() -> Result<String, Box<dyn std::error::Error>> {
    // Your async fetching logic
    Ok(String::from("<html>...</html>"))
}

Performance Considerations

Parsing (Sync)

Parsing is CPU-bound and runs synchronously:

use lectito_core::parse;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    let start = Instant::now();
    let article = parse(html)?;
    let duration = start.elapsed();

    println!("Parsed in {:?}", duration);

    Ok(())
}

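Since parsing is CPU-bound, long parses inside an async application can be moved off the async executor. A minimal sketch using Tokio's spawn_blocking (this assumes a Tokio runtime and is not a Lectito API):

use lectito_core::parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = String::from("<html>...</html>");

    // Run the CPU-bound parse on the blocking thread pool so it does not
    // stall other async tasks. The first ? handles the JoinError from the
    // spawned task, the second ? the parse error.
    let article = tokio::task::spawn_blocking(move || parse(&html)).await??;

    println!("Title: {:?}", article.metadata.title);
    Ok(())
}
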
Fetching (Async)

Fetching is I/O-bound and benefits from async:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let start = std::time::Instant::now();
    let article = fetch_and_parse("https://example.com/article").await?;
    let duration = start.elapsed();

    println!("Fetched and parsed in {:?}", duration);

    Ok(())
}

Choosing the Right Approach

Scenario               Recommended Approach
Have HTML string       parse() (sync)
Need to fetch URL      fetch_and_parse() (async)
Custom HTTP client     Your client + parse() (sync)
Batch URL processing   fetch_and_parse() with concurrent futures
CLI tool               Depends on your runtime setup
Web server             fetch_and_parse() (async) for better throughput

Feature Flags

To disable async features and reduce dependencies:

[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }

This removes reqwest and tokio dependencies. You'll need to fetch HTML yourself.

Output Formats

Work with different output formats: Markdown, JSON, text, and HTML.

Overview

The Article struct provides methods for converting to different formats:

Method          Format                      Requires Feature
to_markdown()   Markdown with frontmatter   markdown
to_json()       Structured JSON             Always available
to_text()       Plain text                  Always available
content field   Cleaned HTML                Always available

Markdown

Convert article to Markdown with TOML frontmatter:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    Ok(())
}

Output Format

+++
title = "Article Title"
author = "John Doe"
published_date = "2025-01-17"
excerpt = "A brief description of the article"
word_count = 500
+++

# Article Title

Article content here...

Paragraph with **bold** and _italic_ text.

Customizing Markdown

Use MarkdownFormatter for more control:

use lectito_core::{parse, MarkdownFormatter, MarkdownConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let config = MarkdownConfig {
        frontmatter: true,
        // Add more options as available
    };

    let formatter = MarkdownFormatter::new(config);
    let markdown = formatter.format(&article)?;

    println!("{}", markdown);

    Ok(())
}

JSON

Get structured JSON with all metadata:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let json = article.to_json()?;
    println!("{}", json);

    Ok(())
}

JSON Structure

{
    "metadata": {
        "title": "Article Title",
        "author": "John Doe",
        "published_date": "2025-01-17",
        "excerpt": "A brief description",
        "language": "en"
    },
    "content": "<div>Cleaned HTML content...</div>",
    "text_content": "Plain text content...",
    "word_count": 500,
    "readability_score": 35.5
}

Parsing JSON

use lectito_core::parse;
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let json = article.to_json()?;
    let value: Value = serde_json::from_str(&json)?;

    println!("Title: {}", value["metadata"]["title"]);

    Ok(())
}

Plain Text

Extract just the text content:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let text = article.to_text();
    println!("{}", text);

    Ok(())
}

Output Format

Plain text includes:

  • Headings as lines with # prefixes
  • Paragraphs separated by blank lines
  • List items with * or 1. prefixes
  • No HTML tags or markdown syntax

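For illustration, a short article might come out roughly like this (exact spacing may vary):

# Article Title

First paragraph of the article.

* First list item
* Second list item
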
HTML

Access the cleaned HTML directly:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    // Cleaned HTML is in the `content` field
    let cleaned_html = &article.content;
    println!("{}", cleaned_html);

    Ok(())
}

HTML Characteristics

The cleaned HTML:

  • Removes clutter (navigation, sidebars, ads)
  • Keeps main content structure
  • Preserves images (if preserve_images is true)
  • Removes most scripts and styles
  • Maintains heading hierarchy

Choosing a Format

Format     Use Case
Markdown   Blog posts, documentation, static sites
JSON       APIs, databases, further processing
Text       Analysis, indexing, simple display
HTML       Web display, further HTML processing

Format Conversion Examples

Markdown to File

use lectito_core::parse;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let markdown = article.to_markdown()?;
    fs::write("article.md", markdown)?;

    Ok(())
}

JSON for API Response

use lectito_core::parse;
use serde_json::Value;

async fn extract_article(body: String) -> Result<impl warp::Reply, warp::Rejection> {
    let article = parse(&body).map_err(|_| warp::reject())?;
    let json = article.to_json().map_err(|_| warp::reject())?;

    // Re-parse into a Value so warp serializes a JSON object,
    // not a double-encoded string.
    let value: Value = serde_json::from_str(&json).map_err(|_| warp::reject())?;
    Ok(warp::reply::json(&value))
}

Text for Analysis

use lectito_core::parse;

fn analyze_text(html: &str) -> Result<(), Box<dyn std::error::Error>> {
    let article = parse(html)?;
    let text = article.to_text();

    // Analyze word frequency
    let words: Vec<&str> = text.split_whitespace().collect();
    println!("Word count: {}", words.len());

    // Count sentences
    let sentences = text.split(&['.', '!', '?'][..]).count();
    println!("Sentence count: {}", sentences);

    Ok(())
}

HTML for Display

use lectito_core::parse;

fn display_article(html: &str) -> Result<(), Box<dyn std::error::Error>> {
    let article = parse(html)?;

    // Use in a template
    let rendered = format!(
        r#"
        <!DOCTYPE html>
        <html>
        <head>
            <title>{}</title>
        </head>
        <body>
            <article>{}</article>
        </body>
        </html>
        "#,
        article.metadata.title.unwrap_or_default(),
        article.content
    );

    println!("{}", rendered);

    Ok(())
}

How It Works

Understanding the Lectito content extraction pipeline.

Overview

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.

Extraction Pipeline

The extraction process consists of four main stages:

HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article

1. Preprocessing

Clean the HTML to improve scoring accuracy:

  • Remove unlikely content: scripts, styles, iframes, and hidden nodes
  • Strip elements with unlikely class/ID patterns
  • Preserve structure: maintain HTML hierarchy for accurate scoring

Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.

2. Scoring

Score each element based on content characteristics:

  • Tag score: Different HTML tags have different base scores
  • Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
  • Content density: Length and punctuation indicate content quality
  • Link density: Too many links suggest navigation/metadata, not content

Why: Scoring identifies which elements are most likely to contain the main article content.

3. Selection

Select the highest-scoring element as the article candidate:

  • Find element with highest score (bias toward semantic containers when scores are close)
  • Check if score meets minimum threshold (default: 20.0)
  • Check if content length meets minimum threshold (default: 500 chars)
  • Return error if content doesn't meet thresholds

Why: Selection ensures we extract actual article content, not navigation or ads.

4. Post-processing

Clean up the selected content:

  • Include sibling elements: adjacent content blocks and shared-parent headers
  • Remove remaining clutter: ads, comments, social widgets
  • Clean up whitespace: normalize spacing and formatting
  • Preserve structure: maintain headings, paragraphs, lists

Why: Post-processing improves the quality of extracted content and includes related elements.

Data Flow

Input HTML
    ↓
parse_to_document()
    ↓
preprocess_html() → Cleaned HTML
    ↓
build_dom_tree() → DOM Tree
    ↓
calculate_score() → Scored Elements
    ↓
extract_content() → Selected Element
    ↓
postprocess_html() → Cleaned Content
    ↓
extract_metadata() → Metadata
    ↓
Article

Key Components

Document and Element

The Document and Element types wrap the scraper crate's HTML parsing:

use lectito_core::{Document, Element};

let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;

These provide a convenient API for DOM manipulation and element traversal.

Scoring Algorithm

The scoring algorithm combines multiple factors:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score
               + container_bonus)
               × (1 - link_density)

See Scoring Algorithm for details.

Metadata Extraction

Separate process extracts metadata from the HTML:

  • Title: <h1>, <title>, or Open Graph tags
  • Author: meta tags, bylines, schema.org
  • Date: meta tags, time elements, schema.org
  • Excerpt: meta description, first paragraph

Why This Approach

Content Over Structure

Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.

Heuristic-Based

The algorithm uses heuristics (rules of thumb) derived from analyzing thousands of articles. This makes it flexible and adaptable to different site designs.

Fallback Mechanism

For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.

Limitations

Sites That May Fail

  • Very short pages (tweets, status updates)
  • Non-article content (product pages, search results)
  • Unusual layouts (some single-column designs)
  • Heavily JavaScript-dependent content

Improving Extraction

For difficult sites:

  1. Adjust thresholds: Lower min_score or char_threshold
  2. Site configuration: Provide XPath rules
  3. Manual curation: Use XPath or CSS selectors directly

See Configuration for options.

Comparison to Alternatives

Approach           Pros                                              Cons
Lectito            Works across many sites, no custom rules needed   May fail on unusual layouts
XPath              Precise, predictable                              Requires custom rules per site
CSS Selectors      Simple, familiar                                  Brittle, breaks on layout changes
Machine Learning   Adaptable                                         Complex, requires training data

Lectito strikes a balance: works well for most sites without custom rules, with site configuration as a fallback.

Performance Considerations

  • Parsing: HTML parsing is fast but not instant
  • Scoring: Traverses entire DOM, O(n) complexity
  • Fetching: Async for non-blocking I/O
  • Memory: Entire document loaded into memory

For large-scale extraction, consider batching and concurrent fetches.

Scoring Algorithm

Detailed explanation of how Lectito scores HTML elements to identify article content.

Overview

The scoring algorithm assigns a numeric score to each HTML element, indicating how likely it is to contain the main article content. Higher scores indicate better content candidates.

Score Formula

The final score for each element is calculated as:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score
               + container_bonus)
               × (1 - link_density)

Let's break down each component.

Base Tag Score

Different HTML tags have different inherent scores, reflecting their likelihood of containing content:

Tag            Score   Rationale
<article>      +10     Semantic article container
<section>      +8      Logical content section
<div>          +5      Generic container, often used for content
<blockquote>   +3      Quoted content
<pre>          0       Preformatted text, neutral
<td>           +3      Table cell
<address>      -3      Contact info, unlikely to be main content
<ol>/<ul>      -3      Lists and metadata
<li>           -3      List item
<header>       -5      Header, not main content
<footer>       -5      Footer, not main content
<nav>          -5      Navigation
<th>           -5      Table header
<h1>-<h6>      -5      Headings, not content themselves
<form>         -3      Forms, not content
<main>         0       Container scored via bonus

Class/ID Weight

Class and ID attributes strongly indicate element purpose:

Positive Patterns

These patterns indicate content elements:

(?i)(article|body|content|entry|hentry|h-entry|main|page|post|text|blog|story)

Weight: +25 points

Examples:

  • class="article-content"
  • id="main-content"
  • class="post-body"

Negative Patterns

These patterns indicate non-content elements:

(?i)(banner|breadcrumbs?|combx|comment|community|disqus|extra|foot|header|menu|related|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup)

Weight: -25 points

Examples:

  • class="sidebar"
  • id="footer"
  • class="navigation"

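To make the weighting concrete, here is a sketch of how such patterns could be applied with the regex crate. The patterns are abbreviated and the helper is illustrative, not the library's internal code:

use regex::Regex;

/// Illustrative helper: weight an element by its class and id strings.
fn class_id_weight(class_and_id: &str) -> i32 {
    // Abbreviated versions of the positive/negative patterns above
    let positive = Regex::new(r"(?i)(article|content|entry|main|post|text|blog|story)").unwrap();
    let negative = Regex::new(r"(?i)(banner|comment|foot|header|menu|related|sidebar|sponsor|popup)").unwrap();

    let mut weight = 0;
    if positive.is_match(class_and_id) {
        weight += 25; // looks like content
    }
    if negative.is_match(class_and_id) {
        weight -= 25; // looks like page chrome
    }
    weight
}
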
Content Density Score

Rewards elements with substantial text content:

Character Density

1 point per 100 characters, maximum 3 points.

char_score = (text_length / 100).min(3)

Punctuation Density

1 point per 5 commas/periods, maximum 3 points.

punct_score = (comma_count / 5).min(3)

Total content density:

content_density = char_score + punct_score

Rationale: Real article content has more text and punctuation than navigation or metadata.

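As a minimal sketch of this arithmetic (a hypothetical helper mirroring the formulas above, not the crate's internal function):

/// Hypothetical sketch of the content-density arithmetic above.
fn content_density(text: &str) -> u32 {
    // 1 point per 100 characters, capped at 3
    let char_score = (text.chars().count() as u32 / 100).min(3);
    // 1 point per 5 commas, capped at 3
    let punct_score = (text.matches(',').count() as u32 / 5).min(3);
    char_score + punct_score
}
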
Container Bonus

Elements that are typical article containers receive a small boost:

  • <article>, <section>, <main>: +2

This bias helps select semantic containers when scores are close.

Link Density

Penalizes elements with too many links:

link_density = (length of all <a> tag text) / (total text length)
final_score = raw_score × (1 - link_density)

Examples:

  • Text "Click here": link density = 100% (10/10)
  • Text "See the article for details": link density = 33% (7/21)
  • Text "Article content with no links": link density = 0%

Rationale: Navigation menus, lists of links, and metadata have high link density. Real content has low link density.

Complete Example

Consider this HTML:

<div class="article-content">
    <h1>Article Title</h1>
    <p>
        This is a substantial paragraph with plenty of text, including multiple
        sentences, and commas, to demonstrate how content density scoring works.
    </p>
    <p>
        Another paragraph with even more text, details, and information to
        increase the character count.
    </p>
</div>

Step-by-Step Scoring

1 Base Tag Score

<div>: +5

2 Class/ID Weight

class="article-content" contains "article" and "content": +25

3 Content Density

  • Text length: ~220 characters
  • Character score: min(220/100, 3) = 2
  • Commas: 5
  • Punctuation score: min(5/5, 3) = 1
  • Total: 3 points

4 Link Density

No links: link density = 0

5 Final Score

(5 + 25 + 3) × (1 - 0) = 33

This element would score 33, well above the default threshold of 20.

Thresholds

Two thresholds determine if content is readable:

Score Threshold

Minimum score for extraction (default: 20.0).

If no element scores above this, extraction fails with LectitoError::NotReaderable.

Character Threshold

Minimum character count (default: 500).

Even with high score, content must have enough text to be meaningful.

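A sketch of the combined check using the defaults above (a hypothetical helper, not the library's code):

/// Hypothetical sketch: extraction succeeds only if both checks pass.
fn passes_thresholds(score: f64, char_count: usize) -> bool {
    let min_score = 20.0;      // default score threshold
    let char_threshold = 500;  // default character threshold
    score >= min_score && char_count >= char_threshold
}
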
Scoring Edge Cases

Empty Elements

Elements with no text receive score of 0 and are ignored.

Nested Elements

Both parent and child elements are scored. The highest-scoring element at any level is selected.

Sibling Elements

Adjacent elements with similar scores may be grouped as part of the same article.

Negative Scores

Elements can have negative scores (e.g., navigation). They're excluded from selection.

Configuration Affecting Scoring

Adjust scoring behavior with ReadabilityConfig:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(25.0)           // Higher threshold
    .char_threshold(1000)      // Require more content
    .min_content_length(200)   // Longer minimum text
    .build();

See Configuration for details.

Practical Implications

Why Articles Score Well

  • Semantic tags (<article>)
  • Descriptive classes (article-content)
  • Substantial text (high character count)
  • Punctuation (commas, periods)
  • Few links (low link density)

Why Navigation Scores Poorly

  • Generic or negative classes (sidebar, navigation)
  • Little text (just link labels)
  • Many links (high link density)
  • Short content (fails character threshold)

Why Comments May Score Poorly

  • Often in negative classed containers (comments)
  • Short individual comments
  • Many links (usernames, replies)
  • Variable quality

Site Configuration

When automatic scoring fails, provide XPath rules:

# example.com.toml
[[fingerprints]]
pattern = "example.com"

[[fingerprints.extract]]
title = "//h1[@class='article-title']"
content = "//div[@class='article-body']"

See Configuration for details.

API Overview

Complete reference for the Lectito Rust library API.

Core Types

Article

The main result type containing extracted content and metadata.

pub struct Article {
    /// Extracted metadata
    pub metadata: Metadata,

    /// Cleaned HTML content
    pub content: String,

    /// Plain text content
    pub text_content: String,

    /// Number of words in content
    pub word_count: usize,

    /// Final readability score
    pub readability_score: f64,
}

Methods:

  • to_markdown() -> Result<String> - Convert to Markdown with frontmatter
  • to_json() -> Result<String> - Convert to JSON
  • to_text() -> String - Get plain text

Metadata

Extracted article metadata.

pub struct Metadata {
    /// Article title
    pub title: Option<String>,

    /// Author name
    pub author: Option<String>,

    /// Publication date
    pub published_date: Option<String>,

    /// Article excerpt/description
    pub excerpt: Option<String>,

    /// Content language
    pub language: Option<String>,
}

LectitoError

Error type for all Lectito operations.

pub enum LectitoError {
    /// Content not readable: score below threshold
    NotReaderable { score: f64, threshold: f64 },

    /// Invalid URL provided
    InvalidUrl(String),

    /// HTTP request timeout
    Timeout { timeout: u64 },

    /// HTTP error
    HttpError(reqwest::Error),

    /// HTML parsing error
    HtmlParseError(String),

    /// IO error
    IoError(std::io::Error),
}

Result

Type alias for Result with LectitoError.

pub type Result<T> = std::result::Result<T, LectitoError>;

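For example, library code can be written against this alias (a small sketch):

use lectito_core::{Article, Result};

fn extract(html: &str) -> Result<Article> {
    lectito_core::parse(html)
}
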
Configuration Types

ReadabilityConfig

Main configuration for content extraction.

pub struct ReadabilityConfig {
    /// Minimum readability score (default: 20.0)
    pub min_score: f64,

    /// Minimum character count (default: 500)
    pub char_threshold: usize,

    /// Preserve images in output (default: true)
    pub preserve_images: bool,

    /// Minimum content length (default: 140)
    pub min_content_length: usize,

    /// Minimum score threshold (default: 20.0)
    pub min_score_threshold: f64,
}

Methods:

  • builder() -> ReadabilityConfigBuilder - Create a builder
  • default() -> Self - Default configuration

ReadabilityConfigBuilder

Builder for ReadabilityConfig.

pub struct ReadabilityConfigBuilder {
    // ...
}

Methods:

  • min_score(f64) -> Self - Set minimum score
  • char_threshold(usize) -> Self - Set character threshold
  • preserve_images(bool) -> Self - Set image preservation
  • min_content_length(usize) -> Self - Set minimum content length
  • min_score_threshold(f64) -> Self - Set score threshold
  • build() -> ReadabilityConfig - Build configuration

FetchConfig

Configuration for HTTP fetching.

pub struct FetchConfig {
    /// Request timeout in seconds (default: 30)
    pub timeout: u64,

    /// User-Agent header (default: "Lectito/...")
    pub user_agent: String,
}

Trait:

  • impl Default for FetchConfig

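For example, you can override one field and keep the default for the rest (assuming the Default impl shown in Configuration):

use lectito_core::FetchConfig;

let config = FetchConfig {
    timeout: 60,
    ..FetchConfig::default() // keep the default User-Agent
};
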
Main API Functions

parse

Parse HTML string and extract article.

pub fn parse(html: &str) -> Result<Article>

Example:

use lectito_core::parse;

let article = parse("<html>...</html>")?;

parse_with_url

Parse HTML with URL context for relative link resolution.

pub fn parse_with_url(html: &str, url: &str) -> Result<Article>

Example:

use lectito_core::parse_with_url;

let article = parse_with_url(html, "https://example.com/article")?;

fetch_and_parse

Fetch URL and extract article.

pub async fn fetch_and_parse(url: &str) -> Result<Article>

Feature: fetch

Example:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let article = fetch_and_parse("https://example.com/article").await?;
    Ok(())
}

fetch_and_parse_with_config

Fetch URL and extract with custom configuration.

pub async fn fetch_and_parse_with_config(
    url: &str,
    fetch_config: &FetchConfig,
    readability_config: &ReadabilityConfig
) -> Result<Article>

Feature: fetch

Example:

use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetch_config = FetchConfig {
        timeout: 60,
        ..Default::default()
    };

    let read_config = ReadabilityConfig::builder()
        .min_score(25.0)
        .build();

    let article = fetch_and_parse_with_config(
        "https://example.com/article",
        &fetch_config,
        &read_config
    ).await?;

    Ok(())
}

is_probably_readable

Check if content likely meets readability thresholds.

pub fn is_probably_readable(html: &str) -> bool

Example:

use lectito_core::is_probably_readable;

if is_probably_readable(html) {
    println!("Content is readable");
}

Readability Type

Main API for configured extraction.

pub struct Readability {
    config: ReadabilityConfig,
}

Methods:

  • new() -> Self - Create with default config
  • with_config(ReadabilityConfig) -> Self - Create with custom config
  • parse(&str) -> Result<Article> - Parse HTML

Example:

use lectito_core::{Readability, ReadabilityConfig};

let config = ReadabilityConfig::builder()
    .min_score(25.0)
    .build();

let reader = Readability::with_config(config);
let article = reader.parse(html)?;

Fetch Functions

fetch_url

Fetch HTML from URL.

pub async fn fetch_url(url: &str, config: &FetchConfig) -> Result<String>

Feature: fetch

fetch_file

Read HTML from file.

pub fn fetch_file(path: &str) -> Result<String>

fetch_stdin

Read HTML from stdin.

pub fn fetch_stdin() -> Result<String>

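A short sketch combining these input helpers with parse (error handling simplified; fetch_url requires the fetch feature):

use lectito_core::{fetch_file, fetch_url, parse, FetchConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From a URL
    let from_url = fetch_url("https://example.com/article", &FetchConfig::default()).await?;

    // From a local file
    let from_file = fetch_file("article.html")?;

    // Either source is parsed the same way
    for html in [&from_url, &from_file] {
        let article = parse(html)?;
        println!("Title: {:?}", article.metadata.title);
    }

    Ok(())
}
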
DOM Types

Document

HTML document wrapper for parsing and selection.

pub struct Document {
    // ...
}

Methods:

  • parse(&str) -> Result<Self> - Parse HTML
  • select(&str) -> Result<Vec<Element>> - CSS selector

Element

DOM element wrapper.

pub struct Element<'a> {
    // ...
}

Methods:

  • text() -> String - Extract text content
  • html() -> String - Get inner HTML

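For example, a short sketch using the documented methods:

use lectito_core::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::parse("<html><body><p>Hello <b>world</b></p></body></html>")?;

    for element in doc.select("p")? {
        println!("text: {}", element.text()); // text content only
        println!("html: {}", element.html()); // inner HTML, tags included
    }

    Ok(())
}
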
Module Organization

.
├── article          # Article and Metadata types
├── error            # LectitoError and Result
├── fetch            # HTTP and file fetching
├── formatters       # Output formatters
├── metadata         # Metadata extraction
├── parse            # Document and Element types
├── readability      # Main API (parse, fetch_and_parse)
└── scoring          # Scoring algorithm

Feature Flags

Feature      Default   Enables
fetch        Yes       URL fetching with reqwest
markdown     Yes       Markdown output
siteconfig   Yes       Site configuration support

Re-exports

The crate re-exports commonly used types at the root:

// Core types
pub use article::{Article, OutputFormat};
pub use error::{LectitoError, Result};

// Configuration
pub use fetch::FetchConfig;
pub use readability::{
    Readability, ReadabilityConfig, ReadabilityConfigBuilder,
    LectitoConfig, LectitoConfigBuilder
};

// Functions
pub use readability::{
    parse, parse_with_url, fetch_and_parse, fetch_and_parse_with_config,
    is_probably_readable
};

// Fetching
pub use fetch::{fetch_url, fetch_file, fetch_stdin};

// Formatters
pub use formatters::{
    MarkdownFormatter, TextFormatter, JsonFormatter,
    convert_to_markdown, convert_to_text, convert_to_json
};

Complete Example

use lectito_core::{
    Readability, ReadabilityConfig, FetchConfig,
    fetch_and_parse_with_config
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure
    let fetch_config = FetchConfig {
        timeout: 60,
        user_agent: "MyBot/1.0".to_string(),
    };

    let read_config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(1000)
        .build();

    // Fetch and parse
    let article = fetch_and_parse_with_config(
        "https://example.com/article",
        &fetch_config,
        &read_config
    ).await?;

    // Access results
    println!("Title: {:?}", article.metadata.title);
    println!("Word count: {}", article.word_count);

    // Convert to format
    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    Ok(())
}
