Lectito

A Rust library and CLI for extracting readable content from web pages.

What is Lectito?

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. It identifies and extracts the main article content from web pages, removing navigation, sidebars, advertisements, and other clutter.

Features

  • Content Extraction: Automatically identifies the main article content
  • Metadata Extraction: Pulls title, author, date, excerpt, site name, and language
  • Output Formats: HTML, Markdown, plain text, and JSON
  • URL Fetching: Built-in async HTTP client with timeout support
  • CLI: Simple command-line interface for quick extractions
  • Site Configuration: Optional XPath-based extraction rules for difficult sites

Use Cases

  • Web Scraping: Extract clean article content from web pages
  • AI Agents: Feed readable text to language models
  • Content Analysis: Analyze article text without HTML noise
  • Archival: Save clean copies of web content
  • CLI: Quick article extraction from the terminal

Quick Start

CLI

# Install
cargo install lectito-cli

# Extract from URL
lectito https://example.com/article

# Extract from local file
lectito article.html

# Pipe from stdin
curl https://example.com | lectito -

Library

use lectito_core::parse;

let html = r#"<html><body><article><h1>Title</h1><p>Content</p></article></body></html>"#;
let article = parse(html)?;

println!("Title: {:?}", article.metadata.title);
println!("Content: {}", article.to_markdown()?);

About the Name

"Lectito" is derived from the Latin legere (to read) and lectio (a reading or selection).

Lectito aims to select and present readable content from the chaos of the modern web.

Installation

Lectito provides both a CLI tool and a Rust library. Install whichever fits your needs.

CLI Installation

From crates.io

The easiest way to install the CLI is via cargo:

cargo install lectito-cli

This installs the lectito binary in your cargo bin directory (typically ~/.cargo/bin).

From Source

# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito

# Build and install
cargo install --path crates/cli

Pre-built Binaries

Pre-built binaries are available on the GitHub Releases page for Linux, macOS, and Windows.

Download the appropriate binary for your platform and place it in your PATH.

Verify Installation

lectito --version

You should see version information printed.

Library Installation

Add to your Cargo.toml:

[dependencies]
lectito-core = "0.1"

Then run cargo build to fetch and compile the dependency.

Feature Flags

The library has several optional features:

[dependencies]
lectito-core = { version = "0.1", features = ["fetch", "markdown"] }

Feature      Default  Description
fetch        Yes      Enable URL fetching with reqwest
markdown     Yes      Enable Markdown output format
siteconfig   Yes      Enable site configuration support

If you don't need URL fetching, disable the default features and opt back into only what you need:

[dependencies]
lectito-core = { version = "0.1", default-features = false, features = ["markdown"] }

Development Build

To build from source for development:

# Clone the repository
git clone https://github.com/stormlightlabs/lectito.git
cd lectito

# Build the workspace
cargo build --release

# The CLI binary will be at target/release/lectito

Next Steps

Quick Start

Get started with Lectito in minutes.

CLI Quick Start

Basic Usage

Extract content from a URL:

lectito https://example.com/article

Extract from a local file:

lectito article.html

Extract from stdin:

curl https://example.com | lectito -

Save to File

lectito https://example.com/article -o article.md

Change Output Format

# JSON output
lectito https://example.com/article --format json

# Plain text output
lectito https://example.com/article --format text

Set Timeout

For slow-loading sites:

lectito https://example.com/article --timeout 60

Library Quick Start

Add Dependency

Add to Cargo.toml:

[dependencies]
lectito-core = "0.1"

Parse HTML String

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <!DOCTYPE html>
        <html>
            <head><title>My Article</title></head>
            <body>
                <article>
                    <h1>Article Title</h1>
                    <p>This is the article content with plenty of text.</p>
                </article>
            </body>
        </html>
    "#;

    let article = parse(html)?;

    println!("Title: {:?}", article.metadata.title);
    println!("Content: {}", article.to_markdown()?);

    Ok(())
}

Fetch and Parse URL

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let article = fetch_and_parse("https://example.com/article").await?;

    println!("Title: {:?}", article.metadata.title);
    println!("Word count: {}", article.word_count);

    Ok(())
}

Convert to Different Formats

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<h1>Title</h1><p>Content here</p>";
    let article = parse(html)?;

    // Markdown with frontmatter
    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    // Plain text
    let text = article.to_text();
    println!("{}", text);

    // Structured JSON
    let json = article.to_json()?;
    println!("{}", json);

    Ok(())
}

Common Patterns

Handle Errors

use lectito_core::{parse, LectitoError};

match parse("<html>...</html>") {
    Ok(article) => println!("Title: {:?}", article.metadata.title),
    Err(LectitoError::NotReadable { score, threshold }) => {
        eprintln!("Content not readable: score {} < threshold {}", score, threshold);
    }
    Err(e) => eprintln!("Error: {}", e),
}

Configure Extraction

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .preserve_images(true)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse("<html>...</html>")?;

    Ok(())
}

What's Next?

CLI Usage

Reference for the lectito command-line tool.

Basic Syntax

lectito [OPTIONS] [INPUT]

INPUT can be:

  • a URL starting with http:// or https://
  • a local file path
  • - to read from stdin
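
The dispatch rule above can be sketched in a few lines. This is a hypothetical helper written for illustration, not the CLI's actual code:

```rust
/// Classify a CLI INPUT argument following the rules above.
/// Hypothetical helper for illustration only.
fn classify_input(input: &str) -> &'static str {
    if input == "-" {
        "stdin"
    } else if input.starts_with("http://") || input.starts_with("https://") {
        "url"
    } else {
        "file"
    }
}

fn main() {
    assert_eq!(classify_input("https://example.com/article"), "url");
    assert_eq!(classify_input("article.html"), "file");
    assert_eq!(classify_input("-"), "stdin");
}
```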

Common Examples

Extract from a URL

lectito https://example.com/article

Extract from a File

lectito article.html

Read from stdin

curl https://example.com | lectito -

Output Options

-o, --output <FILE>

Write output to a file instead of stdout.

lectito https://example.com/article -o article.md

-f, --format <FORMAT>

Output format. Available values:

Format          Description
markdown or md  Markdown output
html            Cleaned HTML
text or txt     Plain text
json            Structured JSON

lectito https://example.com/article --format text

--json

Force structured JSON output regardless of --format.

lectito https://example.com/article --json

--references

Include a reference table in Markdown output or a references array in JSON output.

lectito https://example.com/article --references

--frontmatter

Include TOML frontmatter in Markdown output.

lectito https://example.com/article --frontmatter

-m, --metadata-only

Output metadata only.

lectito https://example.com/article --metadata-only

--metadata-format <FORMAT>

Metadata output format for --metadata-only. Supported values: toml, json.

lectito https://example.com/article --metadata-only --metadata-format json

Extraction Options

--timeout <SECS>

HTTP timeout in seconds. Default: 30.

--user-agent <UA>

Custom User-Agent for HTTP requests.

-c, --config-dir <DIR>

Directory containing site configuration files.

--char-threshold <NUM>

Minimum character threshold for content candidates. Default: 500.

--max-elements <NUM>

Maximum number of top candidates to track. Default: 5.

--no-images

Strip images from output.

-v, --verbose

Enable verbose logging and timing output.

Shell Completions

--completions <SHELL>

Generate a completion script for bash, zsh, fish, or powershell.

lectito --completions zsh

Help and Version

lectito --help
lectito --version

Output Shapes

Markdown

With --frontmatter, Markdown output starts with TOML frontmatter and then the extracted body.

JSON

--format json and --json emit structured output with:

  • metadata
  • content.markdown
  • content.text
  • content.html
  • optional references

Metadata-Only

--metadata-only emits either:

  • TOML metadata
  • JSON metadata

without the extracted body.

Common Workflows

Save a Markdown export

lectito https://example.com/article --frontmatter --references -o article.md

Get JSON for downstream processing

lectito https://example.com/article --json | jq '.metadata.title'

Extract text without images

lectito https://example.com/article --format text --no-images

Basic Usage

Learn the fundamentals of using Lectito as a library.

Simple Parsing

The easiest way to extract content is with the parse function:

use lectito_core::{parse, Article};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = r#"
        <!DOCTYPE html>
        <html>
            <head><title>My Article</title></head>
            <body>
                <article>
                    <h1>Article Title</h1>
                    <p>This is the article content.</p>
                </article>
            </body>
        </html>
    "#;

    let article: Article = parse(html)?;

    println!("Title: {:?}", article.metadata.title);
    println!("Confidence: {:.2}", article.confidence);
    println!("Content: {}", article.to_markdown()?);

    Ok(())
}

Fetching and Parsing

For URLs, use the fetch_and_parse function:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/article";
    let article = fetch_and_parse(url).await?;

    println!("Title: {:?}", article.metadata.title);
    println!("Author: {:?}", article.metadata.author);
    println!("Word count: {}", article.word_count);

    Ok(())
}

This helper requires the fetch feature.

Working with the Article

The Article struct contains the extracted content, metadata, and derived metrics.

Metadata

use lectito_core::parse;

let html = "<html>...</html>";
let article = parse(html)?;

if let Some(title) = article.metadata.title {
    println!("Title: {}", title);
}

if let Some(author) = article.metadata.author {
    println!("Author: {}", author);
}

if let Some(date) = article.metadata.date {
    println!("Published: {}", date);
}

if let Some(excerpt) = article.metadata.excerpt {
    println!("Excerpt: {}", excerpt);
}

Content Access

use lectito_core::parse;

let html = "<html>...</html>";
let article = parse(html)?;

let html_content = &article.content;
let text = article.to_text();
let markdown = article.to_markdown()?;
let json = article.to_json()?;

to_markdown() requires the markdown feature.

Readability API

For more control, use the Readability API:

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    // Parse with the default configuration
    let reader = Readability::new();
    let article = reader.parse(html)?;

    // Or tune the extraction with a custom configuration
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .nb_top_candidates(8)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse(html)?;

    Ok(())
}

Error Handling

Lectito returns Result<T, LectitoError>. Handle errors appropriately:

use lectito_core::{parse, LectitoError};

fn extract_article(html: &str) -> Result<String, String> {
    match parse(html) {
        Ok(article) => article.to_markdown().map_err(|e| format!("Markdown conversion failed: {}", e)),
        Err(LectitoError::NotReadable { score, threshold }) => {
            Err(format!("Content not readable: score {} < threshold {}", score, threshold))
        }
        Err(LectitoError::InvalidUrl(msg)) => {
            Err(format!("Invalid URL: {}", msg))
        }
        Err(e) => Err(format!("Extraction failed: {}", e)),
    }
}

Common Patterns

Parse with URL Context

When you have the URL, provide it for better relative link resolution:

use lectito_core::{parse_with_url, Article};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let url = "https://example.com/article";

    let article: Article = parse_with_url(html, url)?;

    assert_eq!(article.source_url.as_deref(), Some(url));
    Ok(())
}

Check if Content is Probably Readable

For a quick pre-check:

use lectito_core::is_probably_readable;

fn main() {
    let html = "<html>...</html>";

    if is_probably_readable(html) {
        println!("Content looks readable");
    } else {
        println!("Content may not be readable");
    }
}

Working with Documents

For lower-level DOM manipulation:

use lectito_core::{Document, Element};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html><body><p>Hello</p></body></html>";

    let doc = Document::parse(html)?;
    let elements: Vec<Element> = doc.select("p")?;

    for element in elements {
        println!("Text: {}", element.text());
    }

    Ok(())
}

Configuration

Customize Lectito's extraction behavior with configuration options.

ReadabilityConfig

The ReadabilityConfig struct controls extraction parameters. Use the builder pattern:

use lectito_core::{Readability, ReadabilityConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ReadabilityConfig::builder()
        .min_score(25.0)
        .char_threshold(500)
        .nb_top_candidates(8)
        .preserve_images(true)
        .preserve_video_embeds(true)
        .build();

    let reader = Readability::with_config(config);
    let article = reader.parse("<html>...</html>")?;

    Ok(())
}

Readability Options

Field                  Default  Description
min_score              20.0     Minimum score required for extraction
char_threshold         500      Minimum character count for strong candidates
nb_top_candidates      5        Number of top candidates to keep during scoring
max_elems_to_parse     0        Maximum number of elements to score (0 means unlimited)
remove_unlikely        true     Remove obvious chrome before scoring
keep_classes           false    Preserve class attributes in output HTML
preserve_images        true     Keep images in extracted content
preserve_video_embeds  true     Keep supported video embeds

Strict Extraction

For high-quality content only:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(30.0)
    .char_threshold(1000)
    .build();

Lenient Extraction

For short pages or difficult layouts:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(10.0)
    .char_threshold(200)
    .remove_unlikely(false)
    .build();

Text-Only Extraction

Remove images and embeds:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .preserve_images(false)
    .preserve_video_embeds(false)
    .build();

FetchConfig

Configure HTTP fetching behavior:

use lectito_core::{fetch_and_parse_with_config, FetchConfig, ReadabilityConfig};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetch_config = FetchConfig {
        timeout: 60,
        user_agent: "MyBot/1.0".to_string(),
        headers: HashMap::new(),
    };

    let read_config = ReadabilityConfig::builder()
        .min_score(25.0)
        .build();

    let article = fetch_and_parse_with_config(
        "https://example.com/article",
        &read_config,
        &fetch_config,
    ).await?;

    Ok(())
}

Fetch Options

Field       Type                     Default                  Description
timeout     u64                      30                       Request timeout in seconds
user_agent  String                   browser-like Lectito UA  User-Agent header value
headers     HashMap<String, String>  empty                    Extra request headers

Default Values

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::default();

assert_eq!(config.min_score, 20.0);
assert_eq!(config.char_threshold, 500);
assert_eq!(config.nb_top_candidates, 5);
assert_eq!(config.max_elems_to_parse, 0);
assert!(config.remove_unlikely);
assert!(!config.keep_classes);
assert!(config.preserve_images);
assert!(config.preserve_video_embeds);

Site Configuration

For sites that require custom extraction rules, use the site configuration feature:

[dependencies]
lectito-core = { version = "0.1", features = ["siteconfig"] }

Site configuration uses the FTR-style ruleset and the ConfigLoader APIs to apply per-site extraction rules.

Next Steps

Async vs Sync

Understanding Lectito's async and synchronous APIs.

Overview

Lectito provides both synchronous and asynchronous APIs:

Function           Async/Sync  Use Case
parse()            Sync        Parse HTML from string
parse_with_url()   Sync        Parse with URL context
fetch_and_parse()  Async       Fetch from URL, then parse
fetch_url()        Async       Fetch HTML from URL

The async fetch helpers require the fetch feature.

When to Use Each

Use Sync APIs When

  • You already have the HTML as a string
  • You're using your own HTTP client
  • Performance is not critical
  • You're integrating into synchronous code

Use Async APIs When

  • You need to fetch from URLs
  • You're already using async/await
  • You want concurrent fetches
  • Performance matters for network operations

Synchronous Parsing

Parse HTML that you already have:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;
    Ok(())
}

Asynchronous Fetching

Fetch and parse in one operation:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://example.com/article";
    let article = fetch_and_parse(url).await?;
    Ok(())
}

Manual Fetch and Parse

Use your own HTTP client:

use lectito_core::parse;
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let response = client.get("https://example.com/article")
        .send()
        .await?;

    let html = response.text().await?;
    let article = parse(&html)?;

    Ok(())
}

Concurrent Fetches

Fetch multiple articles concurrently:

use lectito_core::fetch_and_parse;
use futures::future::join_all;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let urls = vec![
        "https://example.com/article1",
        "https://example.com/article2",
        "https://example.com/article3",
    ];

    let futures: Vec<_> = urls.into_iter()
        .map(|url| fetch_and_parse(url))
        .collect();

    let articles = join_all(futures).await;

    for article in articles {
        match article {
            Ok(a) => println!("Got: {:?}", a.metadata.title),
            Err(e) => eprintln!("Error: {}", e),
        }
    }

    Ok(())
}

Batch Processing

Process URLs with concurrency limits:

use lectito_core::fetch_and_parse;
use futures::stream::{self, StreamExt};

async fn process_urls(urls: Vec<String>) -> Result<(), Box<dyn std::error::Error>> {
    // Build a stream of fetch futures and run at most 5 at a time.
    let mut articles = stream::iter(urls)
        .map(|url| async move { fetch_and_parse(&url).await })
        .buffer_unordered(5); // 5 concurrent requests

    while let Some(article) = articles.next().await {
        println!("Processed: {:?}", article?.metadata.title);
    }

    Ok(())
}

Sync Code in Async Context

If you need to use sync parsing in async code:

use lectito_core::parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fetch with your async HTTP client
    let html = fetch_html().await?;

    // Parse is sync, but that's fine in async context
    let article = parse(&html)?;

    Ok(())
}

async fn fetch_html() -> Result<String, Box<dyn std::error::Error>> {
    // Your async fetching logic
    Ok(String::from("<html>...</html>"))
}

Performance Considerations

Parsing (Sync)

Parsing is CPU-bound and runs synchronously:

use lectito_core::parse;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";

    let start = Instant::now();
    let article = parse(html)?;
    let duration = start.elapsed();

    println!("Parsed in {:?}", duration);

    Ok(())
}

Fetching (Async)

Fetching is I/O-bound and benefits from async:

use lectito_core::fetch_and_parse;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let start = std::time::Instant::now();
    let article = fetch_and_parse("https://example.com/article").await?;
    let duration = start.elapsed();

    println!("Fetched and parsed in {:?}", duration);

    Ok(())
}

Choosing the Right Approach

Scenario              Recommended Approach
Have HTML string      parse() (sync)
Need to fetch URL     fetch_and_parse() (async)
Custom HTTP client    Your client + parse() (sync)
Batch URL processing  fetch_and_parse() with concurrent futures
CLI tool              Depends on your runtime setup
Web server            fetch_and_parse() (async) for throughput

Output Formats

Work with different output formats: Markdown, JSON, text, and HTML.

Overview

The Article struct provides several ways to render extracted content:

Method                     Format                        Requires Feature
to_markdown()              Markdown                      markdown
to_markdown_with_config()  Markdown with custom options  markdown
to_json()                  Serialized Article JSON       always available
to_text()                  Plain text                    always available
content field              Cleaned HTML                  always available

Markdown

Convert an article to Markdown:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let markdown = article.to_markdown()?;
    println!("{}", markdown);

    Ok(())
}

Markdown Configuration

Use MarkdownConfig for frontmatter, references, and image handling:

use lectito_core::{parse, MarkdownConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let config = MarkdownConfig {
        include_frontmatter: true,
        include_references: true,
        strip_images: false,
        include_title_heading: true,
    };

    let markdown = article.to_markdown_with_config(&config)?;
    println!("{}", markdown);

    Ok(())
}

Frontmatter Fields

When include_frontmatter is enabled, Lectito can emit fields such as:

+++
title = "Article Title"
author = "John Doe"
date = "2025-01-17"
site = "Example"
image = "https://example.com/image.jpg"
favicon = "https://example.com/favicon.ico"
excerpt = "A brief description of the article"
word_count = 500
reading_time_minutes = 2.5
+++
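
The reading_time_minutes value is consistent with the common ~200 words-per-minute convention; that constant is our assumption here, not a documented guarantee of the library:

```rust
/// Estimate reading time in minutes, assuming ~200 words per minute.
/// Illustrative sketch; the library's exact constant may differ.
fn reading_time_minutes(word_count: usize) -> f64 {
    const WORDS_PER_MINUTE: f64 = 200.0;
    word_count as f64 / WORDS_PER_MINUTE
}

fn main() {
    // Matches the frontmatter above: 500 words gives 2.5 minutes.
    assert!((reading_time_minutes(500) - 2.5).abs() < 1e-9);
}
```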

JSON

Article::to_json() returns a serialized view of the article itself:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let json = article.to_json()?;
    println!("{}", json);

    Ok(())
}

JSON Structure

{
  "content": "<div>Cleaned HTML content...</div>",
  "text_content": "Plain text content...",
  "metadata": {
    "title": "Article Title",
    "author": "John Doe",
    "date": "2025-01-17",
    "excerpt": "A brief description",
    "site_name": "Example",
    "language": "en"
  },
  "length": 1234,
  "word_count": 500,
  "reading_time": 2.5,
  "source_url": "https://example.com/article",
  "confidence": 0.92
}

Plain Text

Extract just the text content:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let text = article.to_text();
    println!("{}", text);

    Ok(())
}

Plain text preserves the readable text content without HTML tags.

HTML

Access the cleaned HTML directly:

use lectito_core::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let html = "<html>...</html>";
    let article = parse(html)?;

    let cleaned_html = &article.content;
    println!("{}", cleaned_html);

    Ok(())
}

The cleaned HTML:

  • removes clutter such as navigation and ads
  • keeps the main content structure
  • preserves images when preserve_images is enabled
  • preserves supported embeds when preserve_video_embeds is enabled

Choosing a Format

Format    Use Case
Markdown  Blog posts, docs, static publishing
JSON      APIs, storage, downstream processing
Text      Analysis, indexing, search
HTML      Web display or further HTML processing

How It Works

Understanding the Lectito content extraction pipeline.

Overview

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.

Extraction Pipeline

The extraction process still follows the same core shape:

HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article

1. Preprocessing

Clean the HTML to improve scoring accuracy:

  • Remove unlikely content: scripts, styles, iframes, and hidden nodes
  • Strip elements with unlikely class/ID patterns
  • Preserve structure: maintain HTML hierarchy for accurate scoring

Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.

2. Scoring

Score each element based on content characteristics:

  • Tag score: Different HTML tags have different base scores
  • Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
  • Content density: Length and punctuation indicate content quality
  • Link density: Too many links suggest navigation or metadata, not content

Why: Scoring identifies which elements are most likely to contain the main article content.

3. Selection

Select the highest-scoring element as the article candidate:

  • Find element with highest score
  • Bias toward semantic containers when scores are close
  • Check if score meets the minimum threshold
  • Check if content length meets the minimum threshold
  • Return an error if content doesn't meet thresholds

Why: Selection ensures we extract actual article content, not navigation or ads.

4. Post-processing

Clean up the selected content:

  • Include sibling elements: adjacent content blocks and shared-parent headers
  • Remove remaining clutter: ads, comments, social widgets
  • Clean up whitespace: normalize spacing and formatting
  • Preserve structure: maintain headings, paragraphs, and lists

Why: Post-processing improves the quality of extracted content and includes related elements.

Branch Additions

The current branch layers a few extra passes around that core flow:

  • Retry strategy: if the first pass comes back short, Lectito retries with progressively looser settings before giving up
  • Site-specific extraction: built-in extractors and optional site configs can override the generic scorer for difficult sites
  • Confidence and diagnostics: successful extractions carry a confidence score and can include pass-by-pass diagnostics

Those additions sit around the original pipeline. They do not replace it.

Data Flow

Input HTML
    ↓
parse_to_document()
    ↓
preprocess_html() → Cleaned HTML
    ↓
build_dom_tree() → DOM Tree
    ↓
calculate_score() → Scored Elements
    ↓
extract_content() → Selected Element
    ↓
postprocess_html() → Cleaned Content
    ↓
extract_metadata() → Metadata
    ↓
Article

Key Components

Document and Element

The Document and Element types wrap the scraper crate's HTML parsing:

use lectito_core::{Document, Element};

let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;

These provide a convenient API for DOM manipulation and element traversal.

Scoring Algorithm

The scoring algorithm combines multiple factors:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score)
               × (1 - link_density)

See Scoring Algorithm for details.

Metadata Extraction

Separate process extracts metadata from the HTML:

  • Title: <h1>, <title>, or Open Graph tags
  • Author: meta tags, bylines, schema.org
  • Date: meta tags, time elements, schema.org
  • Excerpt: meta description, first paragraph

Why This Approach

Content Over Structure

Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.

Heuristic-Based

The algorithm uses heuristics derived from analyzing lots of article pages. That keeps it flexible across different site designs.

Fallback Mechanism

For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.

Limitations

Sites That May Fail

  • Very short pages (tweets, status updates)
  • Non-article content (product pages, search results)
  • Unusual layouts
  • Heavily JavaScript-dependent content

Improving Extraction

For difficult sites:

  1. Adjust thresholds such as min_score or char_threshold
  2. Provide a site configuration
  3. Add a site-specific extractor when generic scoring is not enough

See Configuration for options.

Comparison to Alternatives

Approach          Pros                                                                  Cons
Lectito           Works across many sites, no custom rules needed                       May fail on unusual layouts
Defuddle          Strong HTML and Markdown output, forgiving cleanup, richer metadata   JavaScript and DOM-oriented, not a Rust-native library or CLI stack
XPath             Precise, predictable                                                  Requires custom rules per site
CSS Selectors     Simple, familiar                                                      Brittle, breaks on layout changes
Machine Learning  Adaptable                                                             Complex, requires training data

Lectito strikes a balance by working well for most sites without custom rules, with site configuration as a fallback.

Scoring Algorithm

Detailed explanation of how Lectito scores HTML elements to identify article content.

Overview

The scoring algorithm assigns a numeric score to each HTML element, indicating how likely it is to contain the main article content. Higher scores indicate better content candidates.

The exact weights evolve as the extractor improves, so treat this page as a guide to the scoring logic, not a frozen ABI.

Score Formula

At a high level, the score still looks like this:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score
               + container_bonus)
               × (1 - link_density)
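
To make the formula concrete, here is a worked example with invented weights. The numbers are purely illustrative; the real values live inside the extractor and evolve over time:

```rust
/// Combine the factors above into a single score, as in the formula.
/// All weights here are made up for illustration.
fn element_score(base: f64, class_id: f64, density: f64, container_bonus: f64, link_density: f64) -> f64 {
    (base + class_id + density + container_bonus) * (1.0 - link_density)
}

fn main() {
    // e.g. an article-like container with a modest link density of 0.1:
    // (5 + 25 + 12 + 3) × 0.9 = 40.5
    let score = element_score(5.0, 25.0, 12.0, 3.0, 0.1);
    assert!((score - 40.5).abs() < 1e-9);
}
```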

Base Tag Score

Different HTML tags have different inherent scores, reflecting their likelihood of containing content:

Tag           Typical Bias       Rationale
<article>     Positive           Semantic article container
<section>     Positive           Logical content section
<div>         Positive           Generic container, often used for content
<blockquote>  Slightly positive  Quoted content
<pre>         Neutral            Preformatted text
<header>      Negative           Header, not main content
<footer>      Negative           Footer, not main content
<nav>         Negative           Navigation
<form>        Negative           Forms, not content

Class/ID Weight

Class and ID attributes strongly indicate element purpose.

Positive patterns bias the scorer toward article-like containers. Negative patterns bias it away from sidebars, menus, comments, related-story blocks, and similar chrome.

Examples:

  • Positive: class="article-content", id="main-content"
  • Negative: class="sidebar", id="footer", class="navigation"

Content Density Score

The scorer rewards elements with substantial text content:

  • more readable text
  • more punctuation and sentence structure
  • less boilerplate

Real article content tends to have more continuous prose than navigation or metadata.

Link Density

Nodes packed with links are usually navigation, metadata, or related-story rails, not the article body.

link_density = linked_text / total_text

Higher link density reduces the final score.
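
The ratio above is straightforward to compute. A minimal sketch, using character counts as the unit (one reasonable choice; the library's actual accounting may differ):

```rust
/// linked_text / total_text, as defined above.
/// Sketch for illustration, measured in characters.
fn link_density(linked_chars: usize, total_chars: usize) -> f64 {
    if total_chars == 0 {
        return 0.0; // avoid dividing by zero on empty nodes
    }
    linked_chars as f64 / total_chars as f64
}

fn main() {
    // A nav-like node: most of its text sits inside links.
    assert!((link_density(900, 1000) - 0.9).abs() < 1e-9);
    // An article-like node: mostly prose, few links.
    assert!((link_density(50, 1000) - 0.05).abs() < 1e-9);
}
```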

Branch-Specific Heuristics

The current branch adds a few important refinements on top of the classic Readability-style score:

Entry-Point Bias

Common article containers such as article, main, and well-known content wrappers get an early structural advantage before raw text density decides the winner.

Sibling Aggregation

When several nearby candidates score well, Lectito can walk upward and treat them as one article body instead of picking only a single subtree.

Table Handling

Layout tables and data tables are treated differently. Data tables should survive extraction. Layout tables should not dominate it.

Retry Strategy

If the first pass extracts too little text, Lectito retries with progressively looser settings before it gives up.
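The retry idea can be sketched as a loop over progressively looser (score, character) thresholds. The concrete relaxation schedule below is an assumption, not Lectito's actual settings:

```rust
// Sketch of the retry strategy: relax thresholds step by step until an
// attempt yields content. The schedule values are illustrative.
fn extract_with_retries(
    extract: impl Fn(f64, usize) -> Option<String>,
) -> Option<String> {
    // (min_score, char_threshold) pairs, from strict to loose.
    let attempts = [(20.0, 500), (10.0, 250), (5.0, 100)];
    for (min_score, char_threshold) in attempts {
        if let Some(text) = extract(min_score, char_threshold) {
            return Some(text);
        }
    }
    None // every pass came up short
}
```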

Thresholds

Two thresholds still matter most:

Score Threshold

Minimum score for extraction.

If no element scores high enough, extraction fails with LectitoError::NotReadable.

Character Threshold

Minimum character count for meaningful content.

Even with a strong score, content must still be large enough to count as readable.
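The two gates compose as a simple check. This sketch uses a local error enum whose `NotReadable` shape mirrors the documented `LectitoError` variant; the actual control flow is internal to the library:

```rust
// Hedged sketch of the two-threshold gate. The error shapes mirror the
// documented LectitoError variants but are redefined locally here.
#[derive(Debug, PartialEq)]
enum ExtractError {
    NotReadable { score: f64, threshold: f64 },
    NoContent,
}

fn check_candidate(
    score: f64,
    min_score: f64,
    chars: usize,
    char_threshold: usize,
) -> Result<(), ExtractError> {
    if score < min_score {
        // No element scored high enough.
        return Err(ExtractError::NotReadable { score, threshold: min_score });
    }
    if chars < char_threshold {
        // Strong score, but too little text to count as readable.
        return Err(ExtractError::NoContent);
    }
    Ok(())
}
```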

Scoring Edge Cases

Empty Elements

Elements with no text receive a negligible score and are ignored.

Nested Elements

Both parent and child elements are scored. The best candidate can appear at any level of the tree.

Sibling Elements

Adjacent elements with similar scores may be grouped as part of the same article.

Negative Scores

Elements that look like navigation or chrome can end up with negative scores and fall out of contention.

Configuration Affecting Scoring

Adjust scoring behavior with ReadabilityConfig:

  • min_score
  • char_threshold
  • nb_top_candidates
  • max_elems_to_parse
  • remove_unlikely

See Configuration for details.

API Overview

Reference for the Lectito Rust library API.

Core Types

Article

The main result type containing extracted content, metadata, and derived metrics.

pub struct Article {
    pub content: String,
    pub text_content: String,
    pub metadata: Metadata,
    pub length: usize,
    pub word_count: usize,
    pub reading_time: f64,
    pub source_url: Option<String>,
    pub confidence: f64,
    pub diagnostics: Option<ExtractionDiagnostics>,
}

Common methods:

  • to_markdown() -> Result<String>
  • to_markdown_with_config(&MarkdownConfig) -> Result<String>
  • to_json() -> Result<serde_json::Value>
  • to_text() -> String
  • to_format(OutputFormat) -> Result<String>
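The derived metrics are simple functions of the extracted text. For instance, reading_time follows from word_count divided by a words-per-minute rate; the 200 wpm constant here is an illustrative assumption, not Lectito's actual value:

```rust
// Illustrative only: the words-per-minute rate Lectito uses internally
// is an implementation detail; 200 wpm is assumed here for the sketch.
fn reading_time_minutes(word_count: usize) -> f64 {
    word_count as f64 / 200.0
}
```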

Metadata

Extracted article metadata.

pub struct Metadata {
    pub title: Option<String>,
    pub author: Option<String>,
    pub date: Option<String>,
    pub excerpt: Option<String>,
    pub site_name: Option<String>,
    pub image: Option<String>,
    pub favicon: Option<String>,
    pub word_count: Option<usize>,
    pub reading_time_minutes: Option<f64>,
    pub language: Option<String>,
}

LectitoError

Main error type for extraction, parsing, and fetch failures.

Notable variants:

  • NotReadable { score, threshold }
  • InvalidUrl(String)
  • Timeout { timeout }
  • HtmlParseError(String)
  • NoContent
  • FileNotFound(PathBuf)
  • ConfigError(String)
  • SiteConfigError(String)

HttpError(reqwest::Error) is available when the fetch feature is enabled.

Configuration Types

ReadabilityConfig

Main configuration for content extraction.

pub struct ReadabilityConfig {
    pub min_score: f64,
    pub char_threshold: usize,
    pub nb_top_candidates: usize,
    pub max_elems_to_parse: usize,
    pub remove_unlikely: bool,
    pub keep_classes: bool,
    pub preserve_images: bool,
    pub preserve_video_embeds: bool,
}

Build with ReadabilityConfig::builder().
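A hedged sketch of builder usage, assuming setter methods named after the public fields (check the generated crate docs for the exact builder API):

// Assumed: setters mirror the field names; the import path follows the
// crate layout shown in Quick Start.
use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(15.0)
    .char_threshold(250)
    .remove_unlikely(true)
    .build();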

FetchConfig

Configuration for HTTP fetching.

pub struct FetchConfig {
    pub timeout: u64,
    pub user_agent: String,
    pub headers: HashMap<String, String>,
}

Main API Functions

parse

Parse an HTML string and extract an Article.

pub fn parse(html: &str) -> Result<Article>

parse_with_url

Parse HTML with URL context for relative link resolution.

pub fn parse_with_url(html: &str, url: &str) -> Result<Article>

is_probably_readable

Cheap pre-check for likely article pages.

pub fn is_probably_readable(html: &str) -> bool

fetch_url

Fetch raw HTML from a URL.

pub async fn fetch_url(url: &str, config: &FetchConfig) -> Result<String>

Requires the fetch feature.

fetch_and_parse

Fetch a URL and extract an article with default configuration.

pub async fn fetch_and_parse(url: &str) -> Result<Article>

Requires the fetch feature.

fetch_and_parse_with_config

Fetch a URL and extract an article with custom readability and fetch settings.

pub async fn fetch_and_parse_with_config(
    url: &str,
    readability_config: &ReadabilityConfig,
    fetch_config: &FetchConfig,
) -> Result<Article>

Requires the fetch feature.

Readability Type

Readability is the main stateful API:

pub struct Readability { /* ... */ }

Common constructors and methods:

  • Readability::new()
  • Readability::with_config(ReadabilityConfig)
  • Readability::with_config_and_loader(ReadabilityConfig, ConfigLoader)
  • parse(&self, html: &str) -> Result<Article>
  • parse_with_url(&self, html: &str, url: &str) -> Result<Article>
  • is_probably_readable(&self, html: &str) -> bool
  • fetch_and_parse(&self, url: &str) -> Result<Article>
  • fetch_and_parse_with_config(&self, url: &str, fetch_config: &FetchConfig) -> Result<Article>

Lower-Level Types

For callers that need more control, Lectito also exposes:

  • Document and Element for DOM access
  • ConfigLoader and ConfigLoaderBuilder for site configuration loading
  • MarkdownConfig, JsonConfig, and formatter types for output control

Feature Flags

Feature      Default   Purpose
fetch        Yes       Async URL fetching with reqwest
markdown     Yes       Markdown conversion support
siteconfig   Yes       Site configuration support