How It Works

Understanding the Lectito content extraction pipeline.

Overview

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.

Extraction Pipeline

The extraction process consists of four main stages:

HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article

1. Preprocessing

Clean the HTML to improve scoring accuracy:

  • Remove unlikely content: scripts, styles, iframes, and hidden nodes
  • Strip elements with unlikely class/ID patterns
  • Preserve structure: maintain HTML hierarchy for accurate scoring

Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.
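
As a minimal sketch, the unlikely-pattern check can be expressed as a regex filter in the style of Readability.js. The pattern lists below are illustrative assumptions, not Lectito's actual values:

use regex::Regex;

// Hypothetical pattern lists; real extractors tune these heavily.
fn is_unlikely(class_and_id: &str) -> bool {
    let unlikely =
        Regex::new(r"(?i)banner|comment|disqus|footer|menu|nav|share|sidebar|sponsor").unwrap();
    let positive = Regex::new(r"(?i)article|body|content|main").unwrap();
    // Strip only if an unlikely pattern matches and no positive pattern rescues it.
    unlikely.is_match(class_and_id) && !positive.is_match(class_and_id)
}

fn main() {
    assert!(is_unlikely("sidebar-widget"));
    assert!(!is_unlikely("main-content sidebar")); // rescued by "content"
}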

2. Scoring

Score each element based on content characteristics:

  • Tag score: Different HTML tags have different base scores
  • Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
  • Content density: Length and punctuation indicate content quality
  • Link density: A high proportion of link text suggests navigation or metadata, not article content

Why: Scoring identifies which elements are most likely to contain the main article content.
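
The sketch below shows one way these factors could combine. The tag scores, weights, and density cap are assumptions for illustration, not Lectito's actual constants:

struct ElementStats {
    tag: &'static str,
    class_id_weight: f32, // e.g. +25 for "article", -25 for "sidebar"
    text_len: usize,      // characters of text content
    comma_count: usize,
    link_text_len: usize, // characters inside <a> descendants
}

fn base_tag_score(tag: &str) -> f32 {
    match tag {
        "article" | "main" => 10.0,
        "div" | "section" => 5.0,
        "p" | "pre" | "td" => 3.0,
        "ul" | "ol" | "blockquote" => -3.0,
        _ => 0.0,
    }
}

fn element_score(s: &ElementStats) -> f32 {
    // Reward long, punctuated text; cap the length bonus.
    let density = s.comma_count as f32 + (s.text_len as f32 / 100.0).min(3.0);
    // Fraction of the text that sits inside links.
    let link_density = if s.text_len == 0 {
        0.0
    } else {
        s.link_text_len as f32 / s.text_len as f32
    };
    (base_tag_score(s.tag) + s.class_id_weight + density) * (1.0 - link_density)
}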

3. Selection

Select the highest-scoring element as the article candidate:

  • Find element with highest score (bias toward semantic containers when scores are close)
  • Check if score meets minimum threshold (default: 20.0)
  • Check if content length meets minimum threshold (default: 500 chars)
  • Return an error if the content fails either threshold

Why: Selection ensures we extract actual article content, not navigation or ads.
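
The threshold check itself is simple; here it is sketched with the defaults named above (the surrounding selection logic is Lectito's, and the error type here is illustrative):

const MIN_SCORE: f32 = 20.0;       // default minimum score
const CHAR_THRESHOLD: usize = 500; // default minimum content length

fn accept(best_score: f32, text_len: usize) -> Result<(), String> {
    if best_score < MIN_SCORE {
        return Err(format!("best score {best_score} is below {MIN_SCORE}"));
    }
    if text_len < CHAR_THRESHOLD {
        return Err(format!("content length {text_len} is below {CHAR_THRESHOLD} chars"));
    }
    Ok(())
}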

4. Post-processing

Clean up the selected content:

  • Include sibling elements: adjacent content blocks and shared-parent headers
  • Remove remaining clutter: ads, comments, social widgets
  • Clean up whitespace: normalize spacing and formatting
  • Preserve structure: maintain headings, paragraphs, lists

Why: Post-processing improves the quality of extracted content and includes related elements.
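
One of these steps, whitespace normalization, sketched in isolation (illustrative; Lectito's actual cleanup covers more cases):

// Collapse any run of whitespace (spaces, tabs, newlines) to a single space.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    assert_eq!(normalize_whitespace("a  b\n\n c"), "a b c");
}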

Data Flow

Input HTML
    ↓
parse_to_document()
    ↓
preprocess_html() → Cleaned HTML
    ↓
build_dom_tree() → DOM Tree
    ↓
calculate_score() → Scored Elements
    ↓
extract_content() → Selected Element
    ↓
postprocess_html() → Cleaned Content
    ↓
extract_metadata() → Metadata
    ↓
Article

Key Components

Document and Element

The Document and Element types wrap the scraper crate's HTML parsing:

use lectito_core::{Document, Element};

// `html` is the raw page source; parse it, then query with a CSS selector.
let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;

These provide a convenient API for DOM manipulation and element traversal.

Scoring Algorithm

The scoring algorithm combines multiple factors:

element_score = (base_tag_score
              + class_id_weight
              + content_density_score)
              × (1 - link_density)
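
For example, a container with base tag score 5, class/ID weight +25, content density 10, and link density 0.2 scores (5 + 25 + 10) × 0.8 = 32, comfortably above the default minimum score of 20.0.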

See Scoring Algorithm for details.

Metadata Extraction

A separate pass extracts metadata from the HTML:

  • Title: <h1>, <title>, or Open Graph tags
  • Author: meta tags, bylines, schema.org
  • Date: meta tags, time elements, schema.org
  • Excerpt: meta description, first paragraph
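
As a sketch, title extraction can be written directly against the scraper crate, trying the sources above in order; Lectito's internal helpers and priority order may differ:

use scraper::{Html, Selector};

fn extract_title(html: &str) -> Option<String> {
    let doc = Html::parse_document(html);
    // Try each source in turn; the priority order here is an assumption.
    for css in ["h1", "title", r#"meta[property="og:title"]"#] {
        let sel = Selector::parse(css).ok()?;
        if let Some(el) = doc.select(&sel).next() {
            let text = el
                .value()
                .attr("content") // <meta> carries the title in an attribute
                .map(str::to_string)
                .unwrap_or_else(|| el.text().collect());
            let trimmed = text.trim();
            if !trimmed.is_empty() {
                return Some(trimmed.to_string());
            }
        }
    }
    None
}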

Why This Approach

Content Over Structure

Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.

Heuristic-Based

The algorithm uses heuristics (rules of thumb) derived from analyzing thousands of articles. This makes it flexible and adaptable to different site designs.

Fallback Mechanism

For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.

Limitations

Sites That May Fail

  • Very short pages (tweets, status updates)
  • Non-article content (product pages, search results)
  • Unusual layouts (some single-column designs)
  • Heavily JavaScript-dependent content

Improving Extraction

For difficult sites:

  1. Adjust thresholds: Lower min_score or char_threshold (sketched after this list)
  2. Site configuration: Provide XPath rules
  3. Manual curation: Use XPath or CSS selectors directly
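
For option 1, a threshold override might look like the following; ExtractionOptions is a hypothetical struct for illustration, and the real API is covered in Configuration:

// Hypothetical options struct; Lectito's real configuration surface may differ.
#[derive(Debug)]
struct ExtractionOptions {
    min_score: f32,        // default 20.0
    char_threshold: usize, // default 500
}

impl Default for ExtractionOptions {
    fn default() -> Self {
        Self { min_score: 20.0, char_threshold: 500 }
    }
}

fn main() {
    // Relax the score threshold for a site with short articles.
    let opts = ExtractionOptions { min_score: 10.0, ..Default::default() };
    println!("{opts:?}");
}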

See Configuration for options.

Comparison to Alternatives

Approach          | Pros                                             | Cons
------------------|--------------------------------------------------|------------------------------------
Lectito           | Works across many sites, no custom rules needed  | May fail on unusual layouts
XPath             | Precise, predictable                             | Requires custom rules per site
CSS Selectors     | Simple, familiar                                 | Brittle, breaks on layout changes
Machine Learning  | Adaptable                                        | Complex, requires training data

Lectito strikes a balance: works well for most sites without custom rules, with site configuration as a fallback.

Performance Considerations

  • Parsing: HTML parsing is fast, but time and memory scale with document size
  • Scoring: Traverses the entire DOM; O(n) in the number of nodes
  • Fetching: Async for non-blocking I/O
  • Memory: Entire document loaded into memory

For large-scale extraction, consider batching and concurrent fetches.
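
A sketch of bounded-concurrency fetching, assuming reqwest and futures as the async stack (the crate choices are illustrative; Lectito only requires that fetching be async):

use futures::stream::{self, StreamExt};

async fn fetch_all(urls: Vec<String>) -> Vec<Result<String, reqwest::Error>> {
    stream::iter(urls)
        .map(|url| async move { reqwest::get(url).await?.text().await })
        .buffer_unordered(8) // cap in-flight requests at 8
        .collect()
        .await
}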

Next Steps