Lectito

Lectito is a Rust library and CLI tool for extracting readable article content from HTML.

Most web pages contain far more than the text a reader came for: ads, navigation, related links, comment areas, tracking markup, hidden elements, and presentation wrappers. Lectito tries to identify the main content root and return a smaller document that is useful for reading, storage, search, and conversion.

It returns:

  • cleaned article HTML
  • Markdown
  • plain text
  • page metadata
  • extraction diagnostics

Lectito is parser-first. The core API accepts HTML and an optional base URL. URL fetching exists in the CLI for convenience, but the library does not require network access.

This keeps the library usable in environments that already have HTML available: crawlers, browser extensions, desktop apps, mobile apps, tests, and offline archives.

Main APIs

use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"<article><h1>Title</h1><p>Article text.</p></article>"#;
    let article = extract(html, Some("https://example.com/post"), &ReadabilityOptions::default())?;

    // extract returns Ok(None) when no article content is found.
    if let Some(article) = article {
        println!("{}", article.markdown);
    }
    Ok(())
}

Use extract_with_diagnostics when tuning extraction or debugging a bad page. Use is_probably_readable before extraction when you only need a quick yes/no answer.

Project Scope

The public API is intentionally small. Callers should depend on the article result, options, diagnostics, and Markdown helpers rather than internal scoring or cleanup modules.

Installation

Lectito is split into a core library and a CLI. Use the library when your application already has HTML. Use the CLI for local inspection, fixtures, shell scripts, and quick conversions.

Library

Add lectito to your Rust project:

[dependencies]
lectito = "0.1"

For local development against this workspace:

[dependencies]
lectito = { path = "crates/core" }

The core crate has no runtime service requirement. It parses the string you pass in and returns an article result.

CLI

Install the CLI from this workspace:

cargo install --path crates/cli

The binary is named lectito.

lectito --help

The CLI can read from a file, stdin, or a URL. URL support is a command-line convenience; it is not part of the core library contract.

License

Lectito is licensed under MPL-2.0.

Quick Start

Extract From HTML

Start with extract for normal use. It takes the source HTML, an optional base URL, and ReadabilityOptions. The base URL lets Lectito resolve relative links, images, and metadata URLs in the extracted output.

use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"
        <html>
          <head><title>Example</title></head>
          <body>
            <article>
              <h1>Example</h1>
              <p>This is the article body.</p>
            </article>
          </body>
        </html>
    "#;

    let article = extract(html, Some("https://example.com/article"), &ReadabilityOptions::default())?;

    if let Some(article) = article {
        println!("{:?}", article.title);
        println!("{}", article.markdown);
    }

    Ok(())
}

extract returns Ok(None) when no useful article content is found. That is different from an error. An empty or navigation-only page can be parsed successfully and still have no article.
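
The three cases can be kept apart with a single match. The sketch below is generic over the success and error types so that it stands alone; describe_outcome is a hypothetical helper, not part of Lectito, and in application code the argument would be the value returned by extract.

```rust
use std::fmt::Display;

/// Hypothetical helper: turn an `extract`-shaped result into a log line.
/// `T` stands in for `Article` and `E` for `lectito::Error`.
fn describe_outcome<T, E: Display>(result: &Result<Option<T>, E>) -> String {
    match result {
        // The page parsed and an article was found.
        Ok(Some(_)) => "article found".to_string(),
        // The page parsed, but nothing article-like was in it. Not an error.
        Ok(None) => "no article content".to_string(),
        // Extraction itself failed, for example on an invalid base URL.
        Err(err) => format!("extraction failed: {err}"),
    }
}
```

Treating Ok(None) as its own branch keeps navigation-only pages out of error logs and retry queues.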

Check Readability

Use is_probably_readable when you only need to decide whether a page is worth running through full extraction. It is faster than full extraction and returns only a boolean.

use lectito::{is_probably_readable, ReadableOptions};

let readable = is_probably_readable(html, &ReadableOptions::default())?;

CLI

The CLI mirrors the library. parse extracts content, and readable performs the quick readability check.

lectito parse article.html --format markdown
lectito parse --url https://example.com/article --format json --pretty
lectito readable article.html

CLI Usage

The CLI is designed for inspecting extraction behavior and converting documents from the terminal. It is also useful for building fixtures because the same binary can print article output and diagnostics.

The CLI has three commands:

  • parse: extract article content
  • readable: check whether a document looks readable
  • fixture: inspect bundled fixtures

Parse

parse accepts one input source. Use a positional file path, --input, --stdin, or --url.

lectito parse article.html
lectito parse --input article.html
lectito parse --stdin < article.html
lectito parse --url https://example.com/article

Output formats:

JSON is the default because it preserves the whole article structure. Use Markdown or text when piping into another tool.

lectito parse article.html --format json --pretty
lectito parse article.html --format html
lectito parse article.html --format markdown
lectito parse article.html --format text

Useful options:

The defaults work for most article pages. Tune these flags when a page is too short, too broad, or has a known content container.

lectito parse article.html --char-threshold 800
lectito parse article.html --nb-top-candidates 8
lectito parse article.html --content-selector article
lectito parse article.html --url https://example.com/post --site-profile example.com.toml
lectito parse article.html --max-elems-to-parse 10000
lectito parse article.html --keep-classes --classes-to-preserve language-rust

--site-profile can be repeated. Each file must be a TOML site profile. User profiles take precedence over bundled profiles for the same host.

Diagnostics are written to stderr after the main output:

This keeps stdout usable for the extracted article while still showing debug information in the terminal.

lectito parse article.html --format markdown --diagnostic-format pretty
lectito parse article.html --diagnostic-format json

Readable

readable checks whether the document appears to contain enough article-like text. It does not return extracted content.

lectito readable article.html
lectito readable --stdin < article.html
lectito readable --url https://example.com/article
lectito readable article.html --json --pretty

Thresholds:

lectito readable article.html --min-content-length 140 --min-score 20

Basic Usage

Use extract when you want article content.

The function does not fetch the page. Pass it the HTML you want parsed. This is usually cleaner in applications because networking, caching, cookies, and browser rendering are application concerns.

use lectito::{extract, ReadabilityOptions};

let options = ReadabilityOptions::default();
let article = extract(html, Some("https://example.com/post"), &options)?;

match article {
    Some(article) => println!("{}", article.text_content),
    None => eprintln!("no article content found"),
}

The base URL is optional. Pass it when the document contains relative links, images, or metadata URLs.

When extraction succeeds, Lectito returns Some(Article). When the page parses but does not contain a useful article, it returns None. Reserve error handling for invalid base URLs, configured size limits, and serialization failures.

Article Output

Article contains the extracted content in several forms:

if let Some(article) = article {
    println!("{}", article.content);
    println!("{}", article.markdown);
    println!("{}", article.text_content);
}

Use extract_with_diagnostics when you need to see how extraction chose a root. Diagnostics are meant for development and regression work. Most application code should call extract.

use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;

if let Some(article) = report.article {
    println!("{}", article.markdown);
}

eprintln!("{:?}", report.diagnostics.outcome);

Configuration

ReadabilityOptions controls extraction.

The defaults are conservative and favor article pages with enough text to be useful. The struct exposes internal scoring knobs only where they affect common integration cases.

use lectito::ReadabilityOptions;

let options = ReadabilityOptions {
    char_threshold: 800,
    nb_top_candidates: 8,
    content_selector: Some("article".to_string()),
    site_profiles: Vec::new(),
    ..ReadabilityOptions::default()
};

Fields:

Field                  Default    Meaning
max_elems_to_parse     None       Reject documents above this element count.
nb_top_candidates      5          Number of high-scoring candidates to consider.
char_threshold         500        Minimum extracted text length for an accepted attempt.
content_selector       None       CSS selector to prefer as the content root.
site_profiles          []         TOML site profiles for host-scoped extraction hints.
mobile_viewport_width  Some(480)  Width used by recovery rules for mobile snapshots.
classes_to_preserve    []         Class names kept during cleanup.
keep_classes           false      Keep all class attributes.
disable_json_ld        false      Skip JSON-LD metadata extraction.
link_density_modifier  0.0        Adjust link-density cleanup tolerance.
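
The two class-related fields in the table can be pictured as a filter over each class attribute. The following is a minimal sketch of that idea, not Lectito's cleanup code; clean_class_attr is a hypothetical name, and dropping the attribute when no token survives is an assumption borrowed from Mozilla Readability.

```rust
/// Hypothetical sketch of class cleanup, not Lectito's actual code.
/// Returns the new value for a `class` attribute, or `None` when the
/// attribute should be dropped entirely.
fn clean_class_attr(
    class_attr: &str,
    keep_classes: bool,
    classes_to_preserve: &[&str],
) -> Option<String> {
    if keep_classes {
        // keep_classes = true leaves every class attribute untouched.
        return Some(class_attr.to_string());
    }
    // Otherwise keep only the listed tokens, in their original order.
    let kept: Vec<&str> = class_attr
        .split_whitespace()
        .filter(|token| classes_to_preserve.contains(token))
        .collect();
    if kept.is_empty() {
        None
    } else {
        Some(kept.join(" "))
    }
}
```

Under these assumptions, preserving language-rust turns class="hljs language-rust wide" into class="language-rust", which is usually what a syntax highlighter needs downstream.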

Prefer content_selector when you already know the page shape. It is clearer than trying to tune scores around a stable document layout.

Use site_profiles when you want the same kind of override to apply by URL host, or when you need removal selectors and metadata hints alongside content roots. Profiles are attempted before generic scoring, but weak profile output falls back to the generic extractor.

Use max_elems_to_parse as a guardrail for untrusted input. It rejects very large documents before extraction work continues.

ReadableOptions controls is_probably_readable.

Lower min_content_length for short posts or documentation pages. Raise min_score when you want the quick check to reject borderline pages.

use lectito::ReadableOptions;

let options = ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
};
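
To see how the two thresholds interact, it can help to look at the quick check in Mozilla Readability, which Lectito broadly follows. The sketch below is modeled on that published check and is only an approximation of what is_probably_readable may do; looks_readable and its slice-of-lengths input are assumptions made so the example stands alone.

```rust
/// Sketch modeled on Mozilla Readability's quick check; Lectito may differ.
/// `node_text_lengths` holds the trimmed text length of each candidate node
/// (for example `<p>`, `<pre>`, `<article>`), in document order.
fn looks_readable(node_text_lengths: &[usize], min_content_length: usize, min_score: f32) -> bool {
    let mut score = 0.0_f32;
    for &len in node_text_lengths {
        // Nodes shorter than min_content_length contribute nothing.
        if len < min_content_length {
            continue;
        }
        // Longer nodes add the square root of their surplus length.
        score += ((len - min_content_length) as f32).sqrt();
        // Stop as soon as the running score clears min_score.
        if score > min_score {
            return true;
        }
    }
    false
}
```

In this model, three 100-character paragraphs fail the default check because each is below 140 characters, but pass once min_content_length is lowered to 50. That is the effect described above for short posts.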

Output Formats

Lectito produces all output formats during extraction.

The formats come from the same cleaned article root. That means callers can store HTML for fidelity, use Markdown for display or editing, and use plain text for search without running extraction multiple times.

// unwrap() is for brevity; real code should handle the None case.
let article = extract(html, base_url, &ReadabilityOptions::default())?.unwrap();

let html = article.content;
let markdown = article.markdown;
let text = article.text_content;

HTML

content is cleaned article HTML. Scripts, styles, navigation, sidebars, and other page chrome are removed where possible. Relative URLs are resolved when a base URL is provided.

Use HTML when you need the closest representation of the extracted article. It keeps images, links, tables, inline markup, and other structure that can be lost in plain text.

Markdown

markdown is generated from the cleaned article HTML. It preserves common reader content:

  • headings
  • paragraphs
  • links and images
  • lists
  • blockquotes
  • code blocks
  • tables
  • math
  • footnotes

The CLI Markdown output includes TOML frontmatter:

lectito parse article.html --format markdown

Markdown is useful when the next step is a reader view, note-taking system, static archive, or editor. It is also easier to diff in tests than HTML.

Plain Text

text_content is normalized article text. Use it for indexing, previews, and readability checks.

Plain text should not be treated as a rendering format. It discards links, images, and most document structure.

JSON

The CLI can serialize the article:

lectito parse article.html --format json --pretty

JSON is the best CLI format when another program needs metadata and content together.

How It Works

Lectito follows the same broad approach as Mozilla Readability.

The extractor starts with a full HTML document and tries to find the subtree that behaves like an article. It uses signals that tend to survive across sites: text length, paragraph density, semantic tags, class and id names, and the ratio of links to readable text.

  1. Parse the document.
  2. Recover useful content from common snapshots, including selected mobile and shadow-root cases.
  3. Extract metadata.
  4. Try a matching site profile or code extractor when one applies.
  5. Remove scripts, styles, hidden nodes, and unlikely content.
  6. Score candidate content roots by text length, tag type, class/id hints, and link density.
  7. Select the best root and include useful siblings.
  8. Clean the selected content.
  9. Apply schema text fallback when structured data is clearly better.
  10. Return HTML, Markdown, text, and diagnostics.
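
The link-density signal in step 6 can be sketched in a few lines. The formula below is the one used by Mozilla Readability and is assumed here for illustration; Lectito's internal scoring is private and may weigh things differently. Both function names are hypothetical.

```rust
/// Share of a node's text that sits inside links, from 0.0 to 1.0.
fn link_density(text_len: usize, link_text_len: usize) -> f32 {
    if text_len == 0 {
        // An empty node has no links to speak of; avoid dividing by zero.
        return 0.0;
    }
    link_text_len as f32 / text_len as f32
}

/// Scale a raw candidate score down by its link density (assumed formula,
/// borrowed from Mozilla Readability).
fn adjusted_score(raw_score: f32, text_len: usize, link_text_len: usize) -> f32 {
    raw_score * (1.0 - link_density(text_len, link_text_len))
}
```

A candidate with 1000 characters of text, 250 of them inside links, keeps three quarters of its raw score, while a block that is nothing but links drops to zero. This is why navigation lists and related-article widgets rarely win root selection.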

Extraction runs several attempts. Later attempts relax cleanup rules when the first pass produces too little text. The first attempt that reaches char_threshold is accepted. If no attempt reaches the threshold, Lectito may return the best non-empty attempt.

This retry model matters because pages fail in different ways. Some pages hide the useful content behind classes that look like chrome. Others include enough related links or widgets to pull the score away from the main text. Relaxed attempts give Lectito another chance without making the first pass too loose.
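
The acceptance rule can be written down as a small selection function. This is a sketch under stated assumptions: the outcome names match the diagnostics, but pick_attempt is hypothetical, "best" is taken to mean the longest non-empty attempt, and all attempts are given up front, whereas a real extractor would only run a later attempt when the earlier ones fall short.

```rust
/// Outcome names follow the diagnostics; the selection logic is a sketch.
#[derive(Debug, PartialEq)]
enum Outcome {
    Accepted(usize),    // index of the first attempt that met the threshold
    BestAttempt(usize), // index of the longest non-empty attempt
    NoContent,
}

/// `attempt_text_lens[i]` is the extracted text length of attempt `i`,
/// ordered from strictest cleanup to most relaxed.
fn pick_attempt(attempt_text_lens: &[usize], char_threshold: usize) -> Outcome {
    // The first attempt that reaches the threshold wins immediately.
    if let Some(i) = attempt_text_lens.iter().position(|&len| len >= char_threshold) {
        return Outcome::Accepted(i);
    }
    // Otherwise fall back to the longest non-empty attempt (an assumption
    // about what "best" means), preferring earlier attempts on ties.
    let mut best: Option<usize> = None;
    for (i, &len) in attempt_text_lens.iter().enumerate() {
        if len > 0 && best.map_or(true, |b| len > attempt_text_lens[b]) {
            best = Some(i);
        }
    }
    best.map_or(Outcome::NoContent, Outcome::BestAttempt)
}
```

With attempts of 120, 480, and 900 characters and a threshold of 500, the sketch returns Accepted(2). If the last attempt only reached 300 characters, it would return BestAttempt(1).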

content_selector can short-circuit root selection for known documents:

let options = ReadabilityOptions {
    content_selector: Some("main article".to_string()),
    ..ReadabilityOptions::default()
};

Site profiles provide URL-scoped hints without disabling generic extraction:

let options = ReadabilityOptions {
    site_profiles: vec![r#"
        name = "example"
        hosts = ["example.com"]
        content_roots = ["article"]
        remove = [".ad", "nav"]
    "#.to_string()],
    ..ReadabilityOptions::default()
};

If a profile produces content below char_threshold, Lectito records the profile decision in diagnostics and continues with generic readability attempts.

After the root is selected, cleanup removes empty nodes, normalizes links and media, preserves selected classes, and prepares the HTML for Markdown and text conversion.

Diagnostics

Use diagnostics to inspect extraction decisions.

Diagnostics are for development, fixture work, and bug reports. They explain which candidates were considered, which root was selected, and why an extraction was accepted or downgraded to a best attempt.

use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;
println!("{:?}", report.diagnostics.outcome);

ExtractionReport contains:

  • article: the extracted article, if found
  • diagnostics: details about attempts and candidate selection

Outcomes:

Outcome      Meaning
Accepted     An attempt met char_threshold.
BestAttempt  No attempt met the threshold, but non-empty content was found.
NoContent    No useful content was found.

Each attempt records:

  • cleanup flags
  • candidate count
  • top candidates
  • entry points
  • selected root
  • cleanup counts
  • recovery counts
  • extracted text length

When a site profile or code extractor matches, diagnostics include site_rule. That record reports the matched profile or extractor, whether it was bundled, which roots were selected, how many removals ran, whether the result met char_threshold, and any fallback reason.

Start with outcome, selected_root, and text_len. If the selected root is wrong, inspect the candidate list. If the root is right but output is noisy, inspect cleanup counts and preserved classes.

CLI diagnostics:

lectito parse article.html --diagnostic-format pretty
lectito parse article.html --diagnostic-format json

API Overview

Lectito has two public API targets:

  • Rust Crate API for native Rust applications, CLIs, and server integrations.
  • WASM API for browser, web worker, bundler, and Node.js integrations.

Both targets use the same core extractor and Markdown conversion logic. The Rust crate is the source of truth; the WASM crate maps that API into JavaScript types and camelCase option names.

Rust Crate API

Public exports from lectito:

The crate exposes the extraction API, output structs, diagnostics, errors, and Markdown helpers. Internal parser, scoring, cleanup, and recovery modules remain private.

pub use config::{Article, MarkdownOptions, ReadabilityOptions, ReadableOptions};
pub use diagnostics::{
    AttemptDiagnostic, CandidateDiagnostic, CandidateSelection,
    CleanupDiagnostic, ContentSelectorDiagnostic, ExtractionDiagnostics,
    ExtractionOutcome, ExtractionReport, FlagDiagnostic, NodeDiagnostic,
    RecoveryDiagnostic,
};
pub use error::Error;
pub use extract::{clean_article_html, extract, extract_with_diagnostics};
pub use markdown::{html_to_markdown, markdown_to_html, markdown_with_toml_frontmatter};
pub use readable::is_probably_readable;

Extraction

Use extract for normal application code.

pub fn extract(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<Article>, Error>

Returns Ok(Some(article)) when content is found, Ok(None) when the document has no useful article content, and Err for failures such as an invalid base URL, an exceeded size limit, or a serialization error.

Use extract_with_diagnostics when you need extraction details in addition to the article.

pub fn extract_with_diagnostics(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<ExtractionReport, Error>

Returns an ExtractionReport that holds the same article result together with the extraction diagnostics.

Use clean_article_html when you only need the cleaned article HTML.

pub fn clean_article_html(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<String>, Error>

Readability Check

Use is_probably_readable before full extraction when you are filtering many documents.

pub fn is_probably_readable(
    html: &str,
    options: &ReadableOptions,
) -> Result<bool, Error>

Returns a quick readability estimate without full extraction.

Markdown

The Markdown helpers are available separately for callers that already have a clean HTML fragment, want to render Markdown as HTML, or want CLI-style frontmatter.

pub fn html_to_markdown(html: &str) -> String

Converts HTML fragments to Markdown.

pub fn markdown_to_html(markdown: &str, options: &MarkdownOptions) -> String

Converts Markdown to HTML using CommonMark/GFM options.

pub fn markdown_with_toml_frontmatter(
    article: &Article,
    source: Option<&str>,
) -> Result<String, Error>

Formats an article as Markdown with TOML frontmatter.

WASM API

The lectito-wasm crate exposes the core lectito APIs to JavaScript through wasm-bindgen.

Build targets:

wasm-pack build crates/wasm --target bundler
wasm-pack build crates/wasm --target web
wasm-pack build crates/wasm --target nodejs

Functions

export function extract(
  html: string,
  baseUrl?: string | null,
  options?: ReadabilityOptions,
): Article | null;

export function extractWithDiagnostics(
  html: string,
  baseUrl?: string | null,
  options?: ReadabilityOptions,
): unknown;

export function isProbablyReadable(
  html: string,
  options?: ReadableOptions,
): boolean;

export function cleanHtml(
  html: string,
  baseUrl?: string | null,
  options?: CleanHtmlOptions,
): string | null;

export function htmlToMarkdown(html: string): string;

export function markdownToHtml(
  markdown: string,
  options?: MarkdownOptions,
): string;

Options

The JavaScript API uses camelCase fields and maps them to the Rust options internally.

export interface ReadabilityOptions {
  maxElemsToParse?: number | null;
  nbTopCandidates?: number;
  charThreshold?: number;
  contentSelector?: string | null;
  siteProfiles?: string[];
  mobileViewportWidth?: number | null;
  classesToPreserve?: string[];
  keepClasses?: boolean;
  disableJsonLd?: boolean;
  linkDensityModifier?: number;
}

export interface ReadableOptions {
  minContentLength?: number;
  minScore?: number;
}

export interface MarkdownOptions {
  gfm?: boolean;
  footnotes?: boolean;
  math?: boolean;
  allowRawHtml?: boolean;
}

export type CleanHtmlOptions = ReadabilityOptions;

Sanitization

cleanHtml performs Lectito article cleanup. It is not a complete untrusted-HTML security policy.

Browser integrations that accept arbitrary HTML should run a dedicated sanitizer such as DOMPurify before passing content into Lectito, and should sanitize again before rendering returned HTML when the original input is untrusted.

Errors

The WASM functions throw JavaScript Error objects for invalid base URLs, oversized documents, serialization failures, and option conversion failures.

Article

Article is the extraction result.

The struct is serializable and contains both content and metadata. The content fields are generated from the selected article root; metadata can come from document metadata, JSON-LD, Open Graph tags, or the extracted content itself.

pub struct Article {
    pub title: Option<String>,
    pub byline: Option<String>,
    pub dir: Option<String>,
    pub lang: Option<String>,
    pub content: String,
    pub markdown: String,
    pub text_content: String,
    pub length: usize,
    pub excerpt: Option<String>,
    pub site_name: Option<String>,
    pub published_time: Option<String>,
    pub image: Option<String>,
    pub domain: Option<String>,
    pub favicon: Option<String>,
}

Fields:

Field           Meaning
title           Best title from metadata or document content.
byline          Author/byline when detected.
dir             Text direction, such as ltr or rtl.
lang            Document language when detected.
content         Cleaned article HTML.
markdown        Markdown generated from content.
text_content    Plain text generated from content.
length          Character length of extracted text.
excerpt         Short summary or first useful paragraph.
site_name       Publisher or site name.
published_time  Publication timestamp when detected.
image           Lead image URL when detected.
domain          Source domain when available.
favicon         Favicon URL when detected.

content, markdown, and text_content are different views of the same extracted article. Prefer content when structure matters, markdown when the article will be displayed or edited as text, and text_content when indexing or summarizing.

Options

ReadabilityOptions

ReadabilityOptions changes extraction behavior. Most callers should start with ReadabilityOptions::default() and only set fields that solve a specific problem.

pub struct ReadabilityOptions {
    pub max_elems_to_parse: Option<usize>,
    pub nb_top_candidates: usize,
    pub char_threshold: usize,
    pub content_selector: Option<String>,
    pub site_profiles: Vec<String>,
    pub mobile_viewport_width: Option<usize>,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
    pub disable_json_ld: bool,
    pub link_density_modifier: f32,
}

Defaults:

ReadabilityOptions {
    max_elems_to_parse: None,
    nb_top_candidates: 5,
    char_threshold: 500,
    content_selector: None,
    site_profiles: Vec::new(),
    mobile_viewport_width: Some(480),
    classes_to_preserve: Vec::new(),
    keep_classes: false,
    disable_json_ld: false,
    link_density_modifier: 0.0,
}

content_selector is the most direct override. Use it when the caller knows where the article lives in the document. site_profiles accepts TOML profile strings that provide host-scoped content roots, removal selectors, metadata hints, cleanup settings, and fallback behavior. char_threshold controls when an attempt is accepted. nb_top_candidates controls how many candidates remain in play during selection.

ReadableOptions

ReadableOptions only affects is_probably_readable. It does not change full article extraction.

pub struct ReadableOptions {
    pub min_content_length: usize,
    pub min_score: f32,
}

Use lower thresholds for short-form content. Use higher thresholds when false positives are more expensive than missed articles.

Defaults:

ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
}

Site Profiles

Site profiles are TOML extraction hints scoped by URL host. They are useful when a site has a stable content container or predictable clutter, but still returns ordinary article-shaped HTML.

Profiles run before generic readability scoring. If a profile produces text below char_threshold, Lectito records the profile decision in diagnostics and continues with generic extraction.
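
The profile-then-generic order can be sketched as a single decision. This is an illustration only: profile_or_generic is a hypothetical function, the closure stands in for the generic readability attempts, and what Lectito returns when both paths produce weak output is not specified here, so the sketch simply hands back the generic result.

```rust
/// Sketch of the profile-then-generic order described above.
/// `generic` stands in for the generic readability attempts and only runs
/// when the profile result is missing or too short.
fn profile_or_generic(
    profile_text: Option<String>,
    char_threshold: usize,
    generic: impl FnOnce() -> Option<String>,
) -> Option<String> {
    match profile_text {
        // A profile result that meets the threshold is used directly.
        Some(text) if text.chars().count() >= char_threshold => Some(text),
        // A weak or missing profile result hands over to generic extraction.
        _ => generic(),
    }
}
```

Passing the generic path as a closure mirrors the cost model: a profile that matches well avoids the scoring work entirely.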

Example

name = "example"
hosts = ["example.com"]
subdomains = true
path_prefixes = ["/blog"]
exclude_path_prefixes = ["/blog/comments"]
content_roots = ["article", "#content"]
remove = [".ad", "nav", "footer"]
remove_id_or_class = ["sidebar"]

[metadata]
title = ["h1"]
author = [".byline"]
date = ["time/@datetime"]
image = ["meta[property='og:image']/@content"]
site_name = "Example"
title_suffixes = [" - Example"]

[cleanup]
enabled = true
prune = true

[fallback]
generic_on_empty = true

Fields

Field                  Meaning
name                   Human-readable profile name used in diagnostics.
hosts                  Hosts matched by the profile. www. is ignored during matching.
subdomains             When true, subdomains of each host also match.
path_prefixes          Optional path prefixes. Omit to match every path on the host.
exclude_path_prefixes  Optional path prefixes that suppress the profile after host matching.
content_roots          CSS selectors or supported XPath selectors for article roots.
remove                 CSS selectors or supported XPath selectors to remove before extraction.
remove_id_or_class     Exact id or class tokens to remove.
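
The matching fields in the table combine roughly as follows. This is a minimal sketch of the rules as described, not Lectito's matcher; ProfileMatch, strip_www, and applies_to are hypothetical names, and details such as case handling are left out.

```rust
/// Subset of profile fields that decide whether a profile applies.
struct ProfileMatch<'a> {
    hosts: &'a [&'a str],
    subdomains: bool,
    path_prefixes: &'a [&'a str],
    exclude_path_prefixes: &'a [&'a str],
}

/// Drop a leading "www." so "www.example.com" and "example.com" compare equal.
fn strip_www(host: &str) -> &str {
    host.strip_prefix("www.").unwrap_or(host)
}

impl ProfileMatch<'_> {
    fn applies_to(&self, host: &str, path: &str) -> bool {
        let host = strip_www(host);
        let host_ok = self.hosts.iter().any(|h| {
            let h = strip_www(h);
            // Exact host match, or a dot-separated subdomain when enabled.
            host == h || (self.subdomains && host.ends_with(format!(".{h}").as_str()))
        });
        // An empty path_prefixes list matches every path on the host.
        let path_ok = self.path_prefixes.is_empty()
            || self.path_prefixes.iter().any(|p| path.starts_with(*p));
        // Exclusions are checked last and veto an otherwise matching profile.
        let excluded = self
            .exclude_path_prefixes
            .iter()
            .any(|p| path.starts_with(*p));
        host_ok && path_ok && !excluded
    }
}
```

With the values from the example profile, /blog/post-1 on www.example.com or news.example.com matches, while /about and /blog/comments/42 do not.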

Metadata fields are optional selector lists, except site_name, which is a constant. Selectors may target attributes with the supported XPath .../@attr form.

Cleanup defaults to enabled. prune controls conditional cleanup. Disabling cleanup should be reserved for sites where the profile root is already clean and generic cleanup removes useful structure.

Selector Support

Profiles accept CSS selectors directly. They also accept a focused XPath subset for compatibility with rule corpuses and older bundled rules:

  • //tag
  • //*[@id='value']
  • //tag[@class='a b']
  • //tag[contains(@class, 'value')]
  • /text() suffixes
  • /@attribute suffixes for metadata selectors

Unsupported XPath expressions are ignored by selector matching, so bundled profiles should have tests that prove their roots match representative pages.
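
One way to read the subset is as a direct translation into CSS. The sketch below covers the four element forms from the list and returns None for everything else, which mirrors the "ignored" behavior. It is a hypothetical illustration of the semantics, not how Lectito evaluates selectors, and it does not handle the /text() and /@attribute suffixes.

```rust
/// True for simple tag or attribute names such as "div" or "data-id".
fn is_name(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
}

/// Hypothetical translation of the element forms listed above into CSS.
/// Returns `None` for anything outside the subset.
fn xpath_to_css(xpath: &str) -> Option<String> {
    let rest = xpath.trim().strip_prefix("//")?;
    // Split "tag[predicate]" into its two parts.
    let (tag, predicate) = match rest.find('[') {
        Some(open) => (&rest[..open], Some(rest[open + 1..].strip_suffix(']')?)),
        None => (rest, None),
    };
    if tag != "*" && !is_name(tag) {
        return None;
    }
    let Some(predicate) = predicate else {
        return Some(tag.to_string());
    };
    // contains(@attr, 'value') becomes a substring attribute selector;
    // @attr='value' becomes an exact attribute selector.
    let (inner, op) = match predicate
        .strip_prefix("contains(")
        .and_then(|p| p.strip_suffix(')'))
    {
        Some(args) => (args.split_once(',')?, "*="),
        None => (predicate.split_once('=')?, "="),
    };
    let attr = inner.0.trim().strip_prefix('@')?;
    let value = inner.1.trim().strip_prefix('\'')?.strip_suffix('\'')?;
    if !is_name(attr) || value.contains('\'') {
        return None;
    }
    Some(format!("{tag}[{attr}{op}'{value}']"))
}
```

Under this reading, //div[contains(@class, 'post')] corresponds to div[class*='post'], and an axis expression such as //div/following-sibling::p yields None. A None result is the silent failure mode that profile tests are meant to catch.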

User Profiles

Rust callers pass profile TOML strings through ReadabilityOptions:

let options = ReadabilityOptions {
    site_profiles: vec![std::fs::read_to_string("example.com.toml")?],
    ..ReadabilityOptions::default()
};

The CLI accepts repeatable profile paths:

lectito parse article.html --url https://example.com/post --site-profile example.com.toml

User profiles take precedence over bundled profiles. More specific host and path matches win within each source group.
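
The precedence rule can be sketched as a ranking over profiles that already matched the URL. Everything here is hypothetical: Matched and pick_profile are illustration names, and measuring specificity by the length of the matched host and path prefix is an assumption, since the exact tie-breaking rules are not spelled out above.

```rust
/// A profile that already matched the request URL.
#[derive(Debug, Clone, PartialEq)]
struct Matched {
    name: &'static str,
    user_supplied: bool,     // true for user profiles, false for bundled
    matched_host_len: usize, // length of the host pattern that matched
    matched_path_len: usize, // length of the path prefix that matched, 0 if none
}

/// Pick the winning profile: user before bundled, then the longest host,
/// then the longest path prefix. Measuring specificity by length is an
/// assumption made for this sketch.
fn pick_profile(candidates: &[Matched]) -> Option<&Matched> {
    candidates
        .iter()
        .max_by_key(|m| (m.user_supplied, m.matched_host_len, m.matched_path_len))
}
```

In this model a user profile for the whole host beats a bundled profile scoped to /blog, while between two bundled profiles the /blog one wins.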