How It Works
Understanding the Lectito content extraction pipeline.
Overview
Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.
Extraction Pipeline
The extraction process still follows the same core shape:
HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article
1. Preprocessing
Clean the HTML to improve scoring accuracy:
- Remove unlikely content: scripts, styles, iframes, and hidden nodes
- Strip elements with unlikely class/ID patterns
- Preserve structure: maintain HTML hierarchy for accurate scoring
Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.
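The unlikely-pattern check can be sketched as a simple substring test. This is a sketch only: the pattern lists and the rescue rule below are illustrative, not Lectito's actual configuration.

```rust
/// Illustrative unlikely-candidate check: drop an element when its combined
/// class/id string matches a negative pattern and no positive pattern
/// rescues it. Pattern lists are examples, not Lectito's real lists.
const UNLIKELY: &[&str] = &["sidebar", "footer", "comment", "banner", "share", "promo"];
const POSITIVE: &[&str] = &["article", "content", "main", "body"];

fn is_unlikely_candidate(class_and_id: &str) -> bool {
    let s = class_and_id.to_lowercase();
    let negative = UNLIKELY.iter().any(|p| s.contains(*p));
    let positive = POSITIVE.iter().any(|p| s.contains(*p));
    negative && !positive
}
```

Note the rescue rule: an element whose class matches both a negative and a positive pattern is kept, so a container like `article-footer` survives preprocessing and gets scored normally.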
2. Scoring
Score each element based on content characteristics:
- Tag score: Different HTML tags have different base scores
- Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
- Content density: Length and punctuation indicate content quality
- Link density: a high proportion of link text suggests navigation or metadata, not content
Why: Scoring identifies which elements are most likely to contain the main article content.
3. Selection
Select the highest-scoring element as the article candidate:
- Find element with highest score
- Bias toward semantic containers when scores are close
- Check if score meets the minimum threshold
- Check if content length meets the minimum threshold
- Return an error if content doesn't meet thresholds
Why: Selection ensures we extract actual article content, not navigation or ads.
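A minimal sketch of the selection step, assuming a small additive bonus for semantic containers on near-ties (the bonus size, field names, and threshold handling are assumptions, not Lectito's internals):

```rust
/// A candidate element after scoring. `semantic` marks containers
/// such as <article> or <main>.
struct Candidate {
    score: f64,
    text_len: usize,
    semantic: bool,
}

/// Pick the best candidate, biasing toward semantic containers on
/// near-ties, then enforce score and content-length thresholds.
fn select(cands: &[Candidate], min_score: f64, char_threshold: usize)
    -> Result<&Candidate, &'static str>
{
    let best = cands
        .iter()
        .max_by(|a, b| {
            let bias = |c: &Candidate| c.score + if c.semantic { 1.0 } else { 0.0 };
            bias(a).partial_cmp(&bias(b)).unwrap()
        })
        .ok_or("no candidates")?;
    if best.score < min_score {
        return Err("score below threshold");
    }
    if best.text_len < char_threshold {
        return Err("content too short");
    }
    Ok(best)
}
```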
4. Post-processing
Clean up the selected content:
- Include sibling elements: adjacent content blocks and shared-parent headers
- Remove remaining clutter: ads, comments, social widgets
- Clean up whitespace: normalize spacing and formatting
- Preserve structure: maintain headings, paragraphs, and lists
Why: Post-processing improves the quality of extracted content and includes related elements.
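The whitespace cleanup, at its simplest, collapses runs of whitespace into single spaces. This is a simplified stand-in for the fuller cleanup the post-processing pass performs:

```rust
/// Simplified whitespace normalization: collapse runs of whitespace
/// (including newlines and tabs) into single spaces and trim the ends.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```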
Branch Additions
The current branch layers a few extra passes around that core flow:
- Retry strategy: if the first pass comes back short, Lectito retries with progressively looser settings before giving up
- Site-specific extraction: built-in extractors and optional site configs can override the generic scorer for difficult sites
- Confidence and diagnostics: successful extractions carry a confidence score and can include pass-by-pass diagnostics
Those additions sit around the original pipeline. They do not replace it.
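The retry strategy can be sketched as a loop over progressively looser settings. The field names echo the thresholds documented elsewhere on this page, but the values and the toy scoring stand-in are purely illustrative:

```rust
/// Hypothetical pass settings; values are illustrative, not defaults.
struct Settings {
    min_score: f64,
    char_threshold: usize,
}

/// Stand-in for a full extraction pass: succeed only when the toy
/// score and content length clear this pass's thresholds.
fn extract_once(s: &Settings, text: &str) -> Option<String> {
    let score = text.len() as f64 / 10.0; // toy score stand-in
    (score >= s.min_score && text.len() >= s.char_threshold).then(|| text.to_string())
}

/// Retry with progressively looser settings before giving up.
fn extract_with_retries(text: &str) -> Option<String> {
    let passes = [
        Settings { min_score: 20.0, char_threshold: 500 }, // strict first pass
        Settings { min_score: 10.0, char_threshold: 250 }, // looser
        Settings { min_score: 5.0, char_threshold: 100 },  // last resort
    ];
    passes.iter().find_map(|s| extract_once(s, text))
}
```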
Data Flow
```
Input HTML
    ↓
parse_to_document()
    ↓
preprocess_html()   → Cleaned HTML
    ↓
build_dom_tree()    → DOM Tree
    ↓
calculate_score()   → Scored Elements
    ↓
extract_content()   → Selected Element
    ↓
postprocess_html()  → Cleaned Content
    ↓
extract_metadata()  → Metadata
    ↓
Article
```
Key Components
Document and Element
The Document and Element types wrap the scraper crate's HTML parsing:
```rust
use lectito_core::{Document, Element};

let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;
```
These provide a convenient API for DOM manipulation and element traversal.
Scoring Algorithm
The scoring algorithm combines multiple factors:
```
element_score = (base_tag_score + class_id_weight + content_density_score)
                × (1 - link_density)
```
See Scoring Algorithm for details.
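Read as a single expression, with the link-density multiplier applying to the whole sum, the formula translates directly into code (a sketch, not Lectito's internal signature):

```rust
/// Direct translation of the scoring formula: the combined content score
/// is scaled down by the element's link density.
fn element_score(
    base_tag_score: f64,
    class_id_weight: f64,
    content_density_score: f64,
    link_density: f64,
) -> f64 {
    (base_tag_score + class_id_weight + content_density_score) * (1.0 - link_density)
}
```

An element full of links (link density near 1.0) is scaled toward zero no matter how well its tag and class would otherwise score.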
Metadata Extraction
A separate pass extracts metadata from the HTML:
- Title: `<h1>`, `<title>`, or Open Graph tags
- Author: meta tags, bylines, schema.org
- Date: meta tags, time elements, schema.org
- Excerpt: meta description, first paragraph
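The title fallback chain can be sketched with naive string matching. The `between` helper and the assumed attribute ordering in the meta tag are simplifications for illustration; the real extractor works on the parsed DOM rather than raw strings:

```rust
/// Return the trimmed text between `start` and `end`, if both occur in order.
fn between<'a>(html: &'a str, start: &str, end: &str) -> Option<&'a str> {
    let i = html.find(start)? + start.len();
    let j = html[i..].find(end)? + i;
    Some(html[i..j].trim())
}

/// Naive title fallback chain: Open Graph tag, then <title>, then the
/// first <h1>. Assumes `property` precedes `content` in the meta tag.
fn extract_title(html: &str) -> Option<String> {
    between(html, r#"property="og:title" content=""#, "\"")
        .or_else(|| between(html, "<title>", "</title>"))
        .or_else(|| between(html, "<h1>", "</h1>"))
        .map(String::from)
}
```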
Why This Approach
Content Over Structure
Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.
Heuristic-Based
The algorithm uses heuristics derived from analyzing a large number of real article pages, which keeps it flexible across different site designs.
Fallback Mechanism
For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.
Limitations
Sites That May Fail
- Very short pages (tweets, status updates)
- Non-article content (product pages, search results)
- Unusual layouts
- Heavily JavaScript-dependent content
Improving Extraction
For difficult sites:
- Adjust thresholds such as `min_score` or `char_threshold`
- Provide a site configuration
- Add a site-specific extractor when generic scoring is not enough
See Configuration for options.
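For orientation, a site configuration might look something like the following. The file format and every key name here are assumptions for illustration only; the Configuration page documents the actual schema:

```toml
# Hypothetical site config; key names and format are illustrative,
# not Lectito's actual schema.
[sites."example.com"]
content_xpath  = "//div[@id='article-body']"
title_xpath    = "//h1"
min_score      = 5.0
char_threshold = 140
```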
Comparison to Alternatives
| Approach | Pros | Cons |
|---|---|---|
| Lectito | Works across many sites, no custom rules needed | May fail on unusual layouts |
| Defuddle | Strong HTML and Markdown output, forgiving cleanup, richer metadata extraction | JavaScript and DOM-oriented, not a Rust-native library or CLI stack |
| XPath | Precise, predictable | Requires custom rules per site |
| CSS Selectors | Simple, familiar | Brittle, breaks on layout changes |
| Machine Learning | Adaptable | Complex, requires training data |
Lectito strikes a balance by working well for most sites without custom rules, with site configuration as a fallback.