Lectito
Lectito is a Rust library and CLI tool for extracting readable article content from HTML.
Most web pages contain far more than the text a reader came for: ads, navigation, related links, comment areas, tracking markup, hidden elements, and presentation wrappers. Lectito tries to identify the main content root and return a smaller document that is useful for reading, storage, search, and conversion.
It returns:
- cleaned article HTML
- Markdown
- plain text
- page metadata
- extraction diagnostics
Lectito is parser-first. The core API accepts HTML and an optional base URL. URL fetching exists in the CLI for convenience, but the library does not require network access.
This keeps the library usable in environments that already have HTML available: crawlers, browser extensions, desktop apps, mobile apps, tests, and offline archives.
Main APIs
use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"<article><h1>Title</h1><p>Article text.</p></article>"#;
    let article = extract(html, Some("https://example.com/post"), &ReadabilityOptions::default())?;

    // `extract` returns Ok(None) when no article content is found.
    if let Some(article) = article {
        println!("{}", article.markdown);
    }
    Ok(())
}
Use extract_with_diagnostics when tuning extraction or debugging a bad page.
Use is_probably_readable before extraction when you only need a quick yes/no
answer.
Project Scope
The public API is intentionally small. Callers should depend on the article result, options, diagnostics, and Markdown helpers rather than internal scoring or cleanup modules.
Installation
Lectito is split into a core library and a CLI. Use the library when your application already has HTML. Use the CLI for local inspection, fixtures, shell scripts, and quick conversions.
Library
Add lectito to your Rust project:
[dependencies]
lectito = "0.1"
For local development against this workspace:
[dependencies]
lectito = { path = "crates/core" }
The core crate has no runtime service requirement. It parses the string you pass in and returns an article result.
CLI
Install the CLI from this workspace:
cargo install --path crates/cli
The binary is named lectito.
lectito --help
The CLI can read from a file, stdin, or a URL. URL support is a command-line convenience; it is not part of the core library contract.
License
Lectito is licensed under MPL-2.0.
Quick Start
Extract From HTML
Start with extract for normal use. It takes the source HTML, an optional base
URL, and ReadabilityOptions. The base URL lets Lectito resolve relative links,
images, and metadata URLs in the extracted output.
use lectito::{extract, ReadabilityOptions};

fn main() -> Result<(), lectito::Error> {
    let html = r#"
        <html>
          <head><title>Example</title></head>
          <body>
            <article>
              <h1>Example</h1>
              <p>This is the article body.</p>
            </article>
          </body>
        </html>
    "#;

    let article = extract(html, Some("https://example.com/article"), &ReadabilityOptions::default())?;
    if let Some(article) = article {
        println!("{:?}", article.title);
        println!("{}", article.markdown);
    }
    Ok(())
}
extract returns Ok(None) when no useful article content is found.
That is different from an error. An empty or navigation-only page can be parsed
successfully and still have no article.
Check Readability
Use is_probably_readable when you only need to decide whether a page is worth
running through full extraction. It is cheaper than extract and returns a
boolean.
use lectito::{is_probably_readable, ReadableOptions};

// `html` is the page source as a &str.
let readable = is_probably_readable(html, &ReadableOptions::default())?;
CLI
The CLI mirrors the library. parse extracts content, and readable performs
the quick readability check.
lectito parse article.html --format markdown
lectito parse --url https://example.com/article --format json --pretty
lectito readable article.html
CLI Usage
The CLI is designed for inspecting extraction behavior and converting documents from the terminal. It is also useful for building fixtures because the same binary can print article output and diagnostics.
The CLI has three commands:
- parse: extract article content
- readable: check whether a document looks readable
- fixture: inspect bundled fixtures
Parse
parse accepts one input source. Use a positional file path, --input,
--stdin, or --url.
lectito parse article.html
lectito parse --input article.html
lectito parse --stdin < article.html
lectito parse --url https://example.com/article
Output formats:
JSON is the default because it preserves the whole article structure. Use Markdown or text when piping into another tool.
lectito parse article.html --format json --pretty
lectito parse article.html --format html
lectito parse article.html --format markdown
lectito parse article.html --format text
Useful options:
The defaults work for most article pages. Tune these flags when a page is too short, too broad, or has a known content container.
lectito parse article.html --char-threshold 800
lectito parse article.html --nb-top-candidates 8
lectito parse article.html --content-selector article
lectito parse article.html --url https://example.com/post --site-profile example.com.toml
lectito parse article.html --max-elems-to-parse 10000
lectito parse article.html --keep-classes --classes-to-preserve language-rust
--site-profile can be repeated. Each file must be a TOML site profile. User
profiles take precedence over bundled profiles for the same host.
Diagnostics are written to stderr after the main output:
This keeps stdout usable for the extracted article while still showing debug information in the terminal.
lectito parse article.html --format markdown --diagnostic-format pretty
lectito parse article.html --diagnostic-format json
Readable
readable checks whether the document appears to contain enough article-like
text. It does not return extracted content.
lectito readable article.html
lectito readable --stdin < article.html
lectito readable --url https://example.com/article
lectito readable article.html --json --pretty
Thresholds:
lectito readable article.html --min-content-length 140 --min-score 20
Basic Usage
Use extract when you want article content.
The function does not fetch the page. Pass it the HTML you want parsed. This is usually cleaner in applications because networking, caching, cookies, and browser rendering are application concerns.
use lectito::{extract, ReadabilityOptions};

let options = ReadabilityOptions::default();
let article = extract(html, Some("https://example.com/post"), &options)?;

match article {
    Some(article) => println!("{}", article.text_content),
    None => eprintln!("no article content found"),
}
The base URL is optional. Pass it when the document contains relative links, images, or metadata URLs.
When extraction succeeds, Lectito returns Some(Article). When the page parses
but does not contain a useful article, it returns None. Reserve error handling
for invalid base URLs, configured size limits, and serialization failures.
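The three outcomes described above can be handled in a single match. The sketch below is standard-library only: extract_stub and ExtractError are hypothetical stand-ins that mimic the Result<Option<_>, _> shape, not Lectito's real function or error type.

```rust
// Hypothetical stand-in for the crate's error type (illustration only).
#[derive(Debug, PartialEq)]
enum ExtractError {
    InvalidBaseUrl,
}

// Hypothetical stub with the same result shape as `extract`:
// Err for bad input, Ok(None) for "parsed, but no article", Ok(Some) otherwise.
fn extract_stub(html: &str, base_url: Option<&str>) -> Result<Option<String>, ExtractError> {
    if let Some(url) = base_url {
        if !url.starts_with("http://") && !url.starts_with("https://") {
            return Err(ExtractError::InvalidBaseUrl);
        }
    }
    let text = html.trim();
    if text.is_empty() {
        Ok(None)
    } else {
        Ok(Some(text.to_string()))
    }
}

// Keep "no article" separate from "error" so callers can log them differently.
fn describe(result: &Result<Option<String>, ExtractError>) -> &'static str {
    match result {
        Ok(Some(_)) => "article",
        Ok(None) => "no article",
        Err(_) => "error",
    }
}

fn main() {
    println!("{}", describe(&extract_stub("<p>Hello</p>", None)));
    println!("{}", describe(&extract_stub("   ", None)));
    println!("{}", describe(&extract_stub("<p>Hello</p>", Some("not a url"))));
}
```

Treating Ok(None) as a normal branch keeps error logs limited to real failures.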
Article Output
Article contains the extracted content in several forms:
if let Some(article) = article {
    println!("{}", article.content);
    println!("{}", article.markdown);
    println!("{}", article.text_content);
}
Use extract_with_diagnostics when you need to see how extraction chose a root.
Diagnostics are meant for development and regression work. Most application code
should call extract.
use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;

if let Some(article) = report.article {
    println!("{}", article.markdown);
}
eprintln!("{:?}", report.diagnostics.outcome);
Configuration
ReadabilityOptions controls extraction.
The defaults are conservative. They favor article pages with enough text to be useful and avoid exposing internal scoring knobs unless they affect common integration cases.
use lectito::ReadabilityOptions;

let options = ReadabilityOptions {
    char_threshold: 800,
    nb_top_candidates: 8,
    content_selector: Some("article".to_string()),
    site_profiles: Vec::new(),
    ..ReadabilityOptions::default()
};
Fields:
| Field | Default | Meaning |
|---|---|---|
| max_elems_to_parse | None | Reject documents above this element count. |
| nb_top_candidates | 5 | Number of high-scoring candidates to consider. |
| char_threshold | 500 | Minimum extracted text length for an accepted attempt. |
| content_selector | None | CSS selector to prefer as the content root. |
| site_profiles | [] | TOML site profiles for host-scoped extraction hints. |
| mobile_viewport_width | Some(480) | Width used by recovery rules for mobile snapshots. |
| classes_to_preserve | [] | Class names kept during cleanup. |
| keep_classes | false | Keep all class attributes. |
| disable_json_ld | false | Skip JSON-LD metadata extraction. |
| link_density_modifier | 0.0 | Adjust link-density cleanup tolerance. |
Prefer content_selector when you already know the page shape. It is clearer
than trying to tune scores around a stable document layout.
Use site_profiles when you want the same kind of override to apply by URL
host, or when you need removal selectors and metadata hints alongside content
roots. Profiles are attempted before generic scoring, but weak profile output
falls back to the generic extractor.
Use max_elems_to_parse as a guardrail for untrusted input. It rejects very
large documents before extraction work continues.
ReadableOptions controls is_probably_readable.
Lower min_content_length for short posts or documentation pages. Raise
min_score when you want the quick check to reject borderline pages.
use lectito::ReadableOptions;

let options = ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
};
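To build intuition for how the two thresholds interact, here is a minimal sketch modeled on the heuristic used by Mozilla Readability's isProbablyReaderable. Whether Lectito scores pages the same way is an assumption; Thresholds and looks_readable are hypothetical names, and the input is simply a list of text-block lengths.

```rust
/// Hypothetical mirror of the two documented thresholds.
struct Thresholds {
    min_content_length: usize,
    min_score: f32,
}

/// Assumed heuristic: blocks shorter than `min_content_length` are ignored,
/// each remaining block adds sqrt(len - min_content_length) to the score,
/// and the page counts as readable once the score exceeds `min_score`.
fn looks_readable(block_lengths: &[usize], t: &Thresholds) -> bool {
    let mut score = 0.0_f32;
    for &len in block_lengths {
        if len < t.min_content_length {
            continue;
        }
        score += ((len - t.min_content_length) as f32).sqrt();
        if score > t.min_score {
            return true;
        }
    }
    false
}

fn main() {
    let defaults = Thresholds { min_content_length: 140, min_score: 20.0 };
    // One 600-character paragraph: sqrt(460) is about 21.4, above 20.
    println!("long paragraph: {}", looks_readable(&[600], &defaults));
    // Two 200-character paragraphs: 2 * sqrt(60) is about 15.5, below 20.
    println!("two short paragraphs: {}", looks_readable(&[200, 200], &defaults));

    // Lowering min_content_length lets short-form content accumulate score.
    let short_form = Thresholds { min_content_length: 50, min_score: 20.0 };
    println!("short posts: {}", looks_readable(&[100, 100, 100, 100], &short_form));
}
```

Under this model, lowering min_content_length admits more blocks, while raising min_score demands more total text before the check passes.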
Output Formats
Lectito produces all output formats during extraction.
The formats come from the same cleaned article root. That means callers can store HTML for fidelity, use Markdown for display or editing, and use plain text for search without running extraction multiple times.
// unwrap() panics on Ok(None); match on the Option in production code.
let article = extract(html, base_url, &ReadabilityOptions::default())?.unwrap();

let html = article.content;
let markdown = article.markdown;
let text = article.text_content;
HTML
content is cleaned article HTML. Scripts, styles, navigation, sidebars, and
other page chrome are removed where possible. Relative URLs are resolved when a
base URL is provided.
Use HTML when you need the closest representation of the extracted article. It keeps images, links, tables, inline markup, and other structure that can be lost in plain text.
Markdown
markdown is generated from the cleaned article HTML. It preserves common
reader content:
- headings
- paragraphs
- links and images
- lists
- blockquotes
- code blocks
- tables
- math
- footnotes
The CLI Markdown output includes TOML frontmatter:
lectito parse article.html --format markdown
Markdown is useful when the next step is a reader view, note-taking system, static archive, or editor. It is also easier to diff in tests than HTML.
Plain Text
text_content is normalized article text. Use it for indexing, previews, and
readability checks.
Plain text should not be treated as a rendering format. It discards links, images, and most document structure.
JSON
The CLI can serialize the article:
lectito parse article.html --format json --pretty
JSON is the best CLI format when another program needs metadata and content together.
How It Works
Lectito follows the same broad approach as Mozilla Readability.
The extractor starts with a full HTML document and tries to find the subtree that behaves like an article. It uses signals that tend to survive across sites: text length, paragraph density, semantic tags, class and id names, and the ratio of links to readable text.
- Parse the document.
- Recover useful content from common snapshots, including selected mobile and shadow-root cases.
- Extract metadata.
- Try a matching site profile or code extractor when one applies.
- Remove scripts, styles, hidden nodes, and unlikely content.
- Score candidate content roots by text length, tag type, class/id hints, and link density.
- Select the best root and include useful siblings.
- Clean the selected content.
- Apply schema text fallback when structured data is clearly better.
- Return HTML, Markdown, text, and diagnostics.
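The link-density signal used in the scoring step can be illustrated with a short sketch. It follows the Readability-style formula (score scaled by one minus link density); the function names are hypothetical, and Lectito's exact weighting, including how link_density_modifier enters, is not shown here.

```rust
/// Share of a candidate's text that sits inside links, from 0.0 to 1.0.
fn link_density(text_len: usize, link_text_len: usize) -> f32 {
    if text_len == 0 {
        return 0.0;
    }
    link_text_len as f32 / text_len as f32
}

/// Readability-style adjustment: the more of a node's text is link text,
/// the more its base score is scaled down.
fn adjusted_score(base_score: f32, text_len: usize, link_text_len: usize) -> f32 {
    base_score * (1.0 - link_density(text_len, link_text_len))
}

fn main() {
    // An article body: 1000 characters, 100 of them inside links.
    println!("article: {:.1}", adjusted_score(40.0, 1000, 100));
    // A navigation block: 200 characters, 180 of them inside links.
    println!("nav: {:.1}", adjusted_score(40.0, 200, 180));
}
```

With the same base score, the navigation block ends up far below the article body, which is why link-heavy chrome rarely wins root selection.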
Extraction runs several attempts. Later attempts relax cleanup rules when the
first pass produces too little text. The first attempt that reaches
char_threshold is accepted. If no attempt reaches the threshold, Lectito may
return the best non-empty attempt.
This retry model matters because pages fail in different ways. Some pages hide the useful content behind classes that look like chrome. Others include enough related links or widgets to pull the score away from the main text. Relaxed attempts give Lectito another chance without making the first pass too loose.
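The acceptance rule for attempts can be sketched in a few lines. This is an illustration under stated assumptions, not Lectito's code: pick_attempt and the payload-carrying Outcome enum are hypothetical, each attempt is reduced to the text it produced, and "best" is assumed to mean the longest non-empty attempt.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Accepted(String),
    BestAttempt(String),
    NoContent,
}

/// `attempts` holds the text each pass produced, strictest pass first.
/// The first attempt that reaches `char_threshold` wins outright; otherwise
/// the longest non-empty attempt is returned as a best effort.
fn pick_attempt(attempts: &[&str], char_threshold: usize) -> Outcome {
    let mut best: Option<&str> = None;
    for &text in attempts {
        let len = text.chars().count();
        if len >= char_threshold {
            return Outcome::Accepted(text.to_string());
        }
        if len > 0 && best.map_or(true, |b| len > b.chars().count()) {
            best = Some(text);
        }
    }
    match best {
        Some(text) => Outcome::BestAttempt(text.to_string()),
        None => Outcome::NoContent,
    }
}

fn main() {
    // The strict pass is too short, the relaxed pass reaches the threshold.
    println!("{:?}", pick_attempt(&["short", "a much longer attempt"], 10));
    // Nothing reaches the threshold, so the longest non-empty text is kept.
    println!("{:?}", pick_attempt(&["", "abc", "ab"], 10));
    // Every pass came back empty.
    println!("{:?}", pick_attempt(&["", ""], 10));
}
```

Because the strictest pass runs first, relaxed cleanup only affects pages where strict cleanup already fell short.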
content_selector can short-circuit root selection for known documents:
let options = ReadabilityOptions {
    content_selector: Some("main article".to_string()),
    ..ReadabilityOptions::default()
};
Site profiles provide URL-scoped hints without disabling generic extraction:
let options = ReadabilityOptions {
    site_profiles: vec![r#"
name = "example"
hosts = ["example.com"]
content_roots = ["article"]
remove = [".ad", "nav"]
"#.to_string()],
    ..ReadabilityOptions::default()
};
If a profile produces content below char_threshold, Lectito records the
profile decision in diagnostics and continues with generic readability attempts.
After the root is selected, cleanup removes empty nodes, normalizes links and media, preserves selected classes, and prepares the HTML for Markdown and text conversion.
Diagnostics
Use diagnostics to inspect extraction decisions.
Diagnostics are for development, fixture work, and bug reports. They explain which candidates were considered, which root was selected, and why an extraction was accepted or downgraded to a best attempt.
use lectito::{extract_with_diagnostics, ReadabilityOptions};

let report = extract_with_diagnostics(html, base_url, &ReadabilityOptions::default())?;
println!("{:?}", report.diagnostics.outcome);
ExtractionReport contains:
- article: the extracted article, if found
- diagnostics: details about attempts and candidate selection
Outcomes:
| Outcome | Meaning |
|---|---|
| Accepted | An attempt met char_threshold. |
| BestAttempt | No attempt met the threshold, but non-empty content was found. |
| NoContent | No useful content was found. |
Each attempt records:
- cleanup flags
- candidate count
- top candidates
- entry points
- selected root
- cleanup counts
- recovery counts
- extracted text length
When a site profile or code extractor matches, diagnostics include site_rule.
That record reports the matched profile or extractor, whether it was bundled,
which roots were selected, how many removals ran, whether the result met
char_threshold, and any fallback reason.
Start with outcome, selected_root, and text_len. If the selected root is
wrong, inspect the candidate list. If the root is right but output is noisy,
inspect cleanup counts and preserved classes.
CLI diagnostics:
lectito parse article.html --diagnostic-format pretty
lectito parse article.html --diagnostic-format json
API Overview
Lectito has two public API targets:
- Rust Crate API for native Rust applications, CLIs, and server integrations.
- WASM API for browser, web worker, bundler, and Node.js integrations.
Both targets use the same core extractor and Markdown conversion logic. The Rust crate is the source of truth; the WASM crate maps that API into JavaScript types and camelCase option names.
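The naming relationship between the two targets is mechanical: each snake_case Rust field corresponds to a camelCase JavaScript field. The helper below only illustrates that mapping; snake_to_camel is a hypothetical function, and how the WASM crate actually performs the conversion is not stated here.

```rust
/// Convert a snake_case field name to camelCase.
fn snake_to_camel(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut upper_next = false;
    for ch in name.chars() {
        if ch == '_' {
            // Drop the underscore and capitalize the following character.
            upper_next = true;
        } else if upper_next {
            out.extend(ch.to_uppercase());
            upper_next = false;
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    for field in ["max_elems_to_parse", "char_threshold", "disable_json_ld"] {
        println!("{} -> {}", field, snake_to_camel(field));
    }
}
```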
Rust Crate API
Public exports from lectito:
The crate exposes the extraction API, output structs, diagnostics, errors, and Markdown helpers. Internal parser, scoring, cleanup, and recovery modules remain private.
pub use config::{Article, MarkdownOptions, ReadabilityOptions, ReadableOptions};
pub use diagnostics::{
    AttemptDiagnostic, CandidateDiagnostic, CandidateSelection, CleanupDiagnostic,
    ContentSelectorDiagnostic, ExtractionDiagnostics, ExtractionOutcome, ExtractionReport,
    FlagDiagnostic, NodeDiagnostic, RecoveryDiagnostic,
};
pub use error::Error;
pub use extract::{clean_article_html, extract, extract_with_diagnostics};
pub use markdown::{html_to_markdown, markdown_to_html, markdown_with_toml_frontmatter};
pub use readable::is_probably_readable;
Extraction
Use extract for normal application code.
pub fn extract(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<Article>, Error>
Returns Ok(Some(article)) when content is found, Ok(None) when the document
has no useful article content, and Err for invalid input or processing
failures.
Use extract_with_diagnostics when you need extraction details in addition to
the article.
pub fn extract_with_diagnostics(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<ExtractionReport, Error>
Returns the same article result with extraction diagnostics.
Use clean_article_html when you only need the cleaned article HTML.
pub fn clean_article_html(
    html: &str,
    base_url: Option<&str>,
    options: &ReadabilityOptions,
) -> Result<Option<String>, Error>
Readability Check
Use is_probably_readable before full extraction when you are filtering many
documents.
pub fn is_probably_readable(
    html: &str,
    options: &ReadableOptions,
) -> Result<bool, Error>
Returns a quick readability estimate without full extraction.
Markdown
The Markdown helpers are available separately for callers that already have a clean HTML fragment, want to render Markdown as HTML, or want CLI-style frontmatter.
pub fn html_to_markdown(html: &str) -> String
Converts HTML fragments to Markdown.
pub fn markdown_to_html(markdown: &str, options: &MarkdownOptions) -> String
Converts Markdown to HTML using CommonMark/GFM options.
pub fn markdown_with_toml_frontmatter(
    article: &Article,
    source: Option<&str>,
) -> Result<String, Error>
Formats an article as Markdown with TOML frontmatter.
WASM API
The lectito-wasm crate exposes the core lectito APIs to JavaScript through
wasm-bindgen.
Build targets:
wasm-pack build crates/wasm --target bundler
wasm-pack build crates/wasm --target web
wasm-pack build crates/wasm --target nodejs
Functions
export function extract(
html: string,
baseUrl?: string | null,
options?: ReadabilityOptions,
): Article | null;
export function extractWithDiagnostics(
html: string,
baseUrl?: string | null,
options?: ReadabilityOptions,
): unknown;
export function isProbablyReadable(
html: string,
options?: ReadableOptions,
): boolean;
export function cleanHtml(
html: string,
baseUrl?: string | null,
options?: CleanHtmlOptions,
): string | null;
export function htmlToMarkdown(html: string): string;
export function markdownToHtml(
markdown: string,
options?: MarkdownOptions,
): string;
Options
The JavaScript API uses camelCase fields and maps them to the Rust options internally.
export interface ReadabilityOptions {
maxElemsToParse?: number | null;
nbTopCandidates?: number;
charThreshold?: number;
contentSelector?: string | null;
siteProfiles?: string[];
mobileViewportWidth?: number | null;
classesToPreserve?: string[];
keepClasses?: boolean;
disableJsonLd?: boolean;
linkDensityModifier?: number;
}
export interface ReadableOptions {
minContentLength?: number;
minScore?: number;
}
export interface MarkdownOptions {
gfm?: boolean;
footnotes?: boolean;
math?: boolean;
allowRawHtml?: boolean;
}
export type CleanHtmlOptions = ReadabilityOptions;
Sanitization
cleanHtml performs Lectito article cleanup. It is not a complete
untrusted-HTML security policy.
Browser integrations that accept arbitrary HTML should run a dedicated sanitizer such as DOMPurify before passing content into Lectito, and should sanitize again before rendering returned HTML when the original input is untrusted.
Errors
The WASM functions throw JavaScript Error objects for invalid base URLs,
oversized documents, serialization failures, and option conversion failures.
Article
Article is the extraction result.
The struct is serializable and contains both content and metadata. The content fields are generated from the selected article root; metadata can come from document metadata, JSON-LD, Open Graph tags, or the extracted content itself.
pub struct Article {
    pub title: Option<String>,
    pub byline: Option<String>,
    pub dir: Option<String>,
    pub lang: Option<String>,
    pub content: String,
    pub markdown: String,
    pub text_content: String,
    pub length: usize,
    pub excerpt: Option<String>,
    pub site_name: Option<String>,
    pub published_time: Option<String>,
    pub image: Option<String>,
    pub domain: Option<String>,
    pub favicon: Option<String>,
}
Fields:
| Field | Meaning |
|---|---|
| title | Best title from metadata or document content. |
| byline | Author/byline when detected. |
| dir | Text direction, such as ltr or rtl. |
| lang | Document language when detected. |
| content | Cleaned article HTML. |
| markdown | Markdown generated from content. |
| text_content | Plain text generated from content. |
| length | Character length of extracted text. |
| excerpt | Short summary or first useful paragraph. |
| site_name | Publisher or site name. |
| published_time | Publication timestamp when detected. |
| image | Lead image URL when detected. |
| domain | Source domain when available. |
| favicon | Favicon URL when detected. |
content, markdown, and text_content are different views of the same
extracted article. Prefer content when structure matters, markdown when the
article will be displayed or edited as text, and text_content when indexing or
summarizing.
Options
ReadabilityOptions
ReadabilityOptions changes extraction behavior. Most callers should start
with ReadabilityOptions::default() and only set fields that solve a specific
problem.
pub struct ReadabilityOptions {
    pub max_elems_to_parse: Option<usize>,
    pub nb_top_candidates: usize,
    pub char_threshold: usize,
    pub content_selector: Option<String>,
    pub site_profiles: Vec<String>,
    pub mobile_viewport_width: Option<usize>,
    pub classes_to_preserve: Vec<String>,
    pub keep_classes: bool,
    pub disable_json_ld: bool,
    pub link_density_modifier: f32,
}
Defaults:
ReadabilityOptions {
    max_elems_to_parse: None,
    nb_top_candidates: 5,
    char_threshold: 500,
    content_selector: None,
    site_profiles: Vec::new(),
    mobile_viewport_width: Some(480),
    classes_to_preserve: Vec::new(),
    keep_classes: false,
    disable_json_ld: false,
    link_density_modifier: 0.0,
}
content_selector is the most direct override. Use it when the caller knows
where the article lives in the document. site_profiles accepts TOML profile
strings that provide host-scoped content roots, removal selectors, metadata
hints, cleanup settings, and fallback behavior. char_threshold controls when
an attempt is accepted. nb_top_candidates controls how many candidates remain
in play during selection.
ReadableOptions
ReadableOptions only affects is_probably_readable. It does not change full
article extraction.
pub struct ReadableOptions {
    pub min_content_length: usize,
    pub min_score: f32,
}
Use lower thresholds for short-form content. Use higher thresholds when false positives are more expensive than missed articles.
Defaults:
ReadableOptions {
    min_content_length: 140,
    min_score: 20.0,
}
Site Profiles
Site profiles are TOML extraction hints scoped by URL host. They are useful when a site has a stable content container or predictable clutter, but still returns ordinary article-shaped HTML.
Profiles run before generic readability scoring. If a profile produces text
below char_threshold, Lectito records the profile decision in diagnostics and
continues with generic extraction.
Example
name = "example"
hosts = ["example.com"]
subdomains = true
path_prefixes = ["/blog"]
exclude_path_prefixes = ["/blog/comments"]
content_roots = ["article", "#content"]
remove = [".ad", "nav", "footer"]
remove_id_or_class = ["sidebar"]
[metadata]
title = ["h1"]
author = [".byline"]
date = ["time/@datetime"]
image = ["meta[property='og:image']/@content"]
site_name = "Example"
title_suffixes = [" - Example"]
[cleanup]
enabled = true
prune = true
[fallback]
generic_on_empty = true
Fields
| Field | Meaning |
|---|---|
| name | Human-readable profile name used in diagnostics. |
| hosts | Hosts matched by the profile. www. is ignored during matching. |
| subdomains | When true, subdomains of each host also match. |
| path_prefixes | Optional path prefixes. Omit to match every path on the host. |
| exclude_path_prefixes | Optional path prefixes that suppress the profile after host matching. |
| content_roots | CSS selectors or supported XPath selectors for article roots. |
| remove | CSS selectors or supported XPath selectors to remove before extraction. |
| remove_id_or_class | Exact id or class tokens to remove. |
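The host and path rules in the table can be read as a small predicate. The sketch below is one plausible interpretation using only the standard library; ProfileMatch, matches, and example_profile are hypothetical names, and details such as case handling and plain string-prefix comparison are assumptions rather than documented behavior.

```rust
/// Hypothetical, simplified view of a profile's matching fields.
struct ProfileMatch {
    hosts: Vec<&'static str>,
    subdomains: bool,
    path_prefixes: Vec<&'static str>,
    exclude_path_prefixes: Vec<&'static str>,
}

/// A leading `www.` is ignored on both sides of the comparison.
fn strip_www(host: &str) -> &str {
    host.strip_prefix("www.").unwrap_or(host)
}

fn host_matches(profile: &ProfileMatch, host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    let host = strip_www(&host);
    profile.hosts.iter().any(|candidate| {
        let candidate = strip_www(candidate);
        host == candidate || (profile.subdomains && host.ends_with(&format!(".{candidate}")))
    })
}

fn path_matches(profile: &ProfileMatch, path: &str) -> bool {
    // Exclusions suppress the profile even when the host matched.
    if profile.exclude_path_prefixes.iter().any(|p| path.starts_with(*p)) {
        return false;
    }
    // An empty prefix list means every path on the host matches.
    profile.path_prefixes.is_empty() || profile.path_prefixes.iter().any(|p| path.starts_with(*p))
}

fn matches(profile: &ProfileMatch, host: &str, path: &str) -> bool {
    host_matches(profile, host) && path_matches(profile, path)
}

/// Matching fields taken from the example profile in this section.
fn example_profile() -> ProfileMatch {
    ProfileMatch {
        hosts: vec!["example.com"],
        subdomains: true,
        path_prefixes: vec!["/blog"],
        exclude_path_prefixes: vec!["/blog/comments"],
    }
}

fn main() {
    let profile = example_profile();
    println!("{}", matches(&profile, "www.example.com", "/blog/post"));
    println!("{}", matches(&profile, "example.com", "/blog/comments/1"));
}
```

Note that a plain prefix test would also accept a path such as /blogroll for the prefix /blog; whether Lectito matches on path-segment boundaries is not covered here.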
Metadata fields are optional selector lists, except site_name, which is a
constant. Selectors may target attributes with the supported XPath .../@attr
form.
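An attribute-targeting selector such as time/@datetime has two parts: the element selector and the attribute to read. A minimal sketch of that split follows; split_attribute is a hypothetical helper, not a function exported by Lectito.

```rust
/// Split `selector/@attr` into the element selector and the attribute name.
/// Selectors without the suffix are returned unchanged with `None`.
fn split_attribute(selector: &str) -> (&str, Option<&str>) {
    match selector.rsplit_once("/@") {
        Some((element, attr)) if !element.is_empty() && !attr.is_empty() => {
            (element, Some(attr))
        }
        _ => (selector, None),
    }
}

fn main() {
    // Selectors taken from the [metadata] table of the example profile.
    for selector in ["h1", "time/@datetime", "meta[property='og:image']/@content"] {
        println!("{:?}", split_attribute(selector));
    }
}
```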
Cleanup defaults to enabled. prune controls conditional cleanup. Disabling
cleanup should be reserved for sites where the profile root is already clean and
generic cleanup removes useful structure.
Selector Support
Profiles accept CSS selectors directly. They also accept a focused XPath subset for compatibility with rule corpuses and older bundled rules:
- //tag
- //*[@id='value']
- //tag[@class='a b']
- //tag[contains(@class, 'value')]
- /text() suffixes
- /@attribute suffixes for metadata selectors
Unsupported XPath expressions are ignored by selector matching, so bundled profiles should have tests that prove their roots match representative pages.
User Profiles
Rust callers pass profile TOML strings through ReadabilityOptions:
let options = ReadabilityOptions {
    site_profiles: vec![std::fs::read_to_string("example.com.toml")?],
    ..ReadabilityOptions::default()
};
The CLI accepts repeatable profile paths:
lectito parse article.html --url https://example.com/post --site-profile example.com.toml
User profiles take precedence over bundled profiles. More specific host and path matches win within each source group.
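The precedence rule can be pictured as picking the maximum over a sort key. The sketch below assumes that "more specific" means an exact host match beats a subdomain match and a longer path prefix beats a shorter one; that definition, along with the Source, Candidate, and pick names, is hypothetical.

```rust
/// Derive order matters: `User` sorts above `Bundled`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Source {
    Bundled,
    User,
}

#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: &'static str,
    source: Source,
    exact_host: bool,       // assumed: exact host beats a subdomain match
    path_prefix_len: usize, // assumed: longer matched prefix is more specific
}

/// Source group first, then host specificity, then path specificity.
fn pick(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by_key(|c| (c.source, c.exact_host, c.path_prefix_len))
}

fn sample() -> Vec<Candidate> {
    vec![
        Candidate { name: "bundled-blog", source: Source::Bundled, exact_host: true, path_prefix_len: 5 },
        Candidate { name: "user-wide", source: Source::User, exact_host: false, path_prefix_len: 0 },
        Candidate { name: "user-exact", source: Source::User, exact_host: true, path_prefix_len: 0 },
    ]
}

fn main() {
    // A user profile wins even though a bundled profile has a longer prefix.
    println!("{:?}", pick(&sample()).map(|c| c.name));
}
```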