How It Works

Understanding the Lectito content extraction pipeline.

Overview

Lectito implements a content extraction algorithm inspired by Mozilla's Readability.js. The algorithm identifies the main article content by analyzing the HTML structure, scoring elements based on various heuristics, and selecting the highest-scoring content.

Extraction Pipeline

The extraction process still follows the same core shape:

HTML Input → Preprocessing → Scoring → Selection → Post-processing → Article

1. Preprocessing

Clean the HTML to improve scoring accuracy:

  • Remove unlikely content: scripts, styles, iframes, and hidden nodes
  • Strip elements with unlikely class/ID patterns
  • Preserve structure: maintain HTML hierarchy for accurate scoring

Why: Preprocessing removes elements that could confuse the scoring algorithm or contain non-article content.
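The class/ID stripping step can be sketched as a simple pattern check. The pattern lists below are invented for illustration, not Lectito's actual lists:

```rust
/// Decide whether an element's combined class/ID string matches an
/// "unlikely content" pattern. Pattern lists are illustrative only.
fn looks_unlikely(class_and_id: &str) -> bool {
    const UNLIKELY: &[&str] = &["sidebar", "footer", "comment", "banner", "nav"];
    const MAYBE_CONTENT: &[&str] = &["article", "content", "main", "body"];
    let s = class_and_id.to_lowercase();
    let unlikely = UNLIKELY.iter().any(|p| s.contains(p));
    let maybe_content = MAYBE_CONTENT.iter().any(|p| s.contains(p));
    // A content-like pattern rescues an otherwise unlikely element.
    unlikely && !maybe_content
}
```

An element classed `site-footer` would be stripped, while `article-body` survives even if it also matches a negative pattern.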

2. Scoring

Score each element based on content characteristics:

  • Tag score: Different HTML tags have different base scores
  • Class/ID weight: Positive patterns (article, content) vs negative (sidebar, footer)
  • Content density: Length and punctuation indicate content quality
  • Link density: A high proportion of link text suggests navigation or metadata, not article content

Why: Scoring identifies which elements are most likely to contain the main article content.
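The density factors can be sketched as plain functions. The constants here are assumptions for illustration, not Lectito's tuned values:

```rust
/// Rough content-density score: longer text with more punctuation
/// scores higher. Constants are illustrative.
fn content_density_score(text: &str) -> f32 {
    let commas = text.matches(',').count() as f32;
    let length_bonus = (text.len() as f32 / 100.0).min(3.0);
    commas + length_bonus
}

/// Fraction of an element's text that sits inside links.
fn link_density(text_len: usize, link_text_len: usize) -> f32 {
    if text_len == 0 {
        return 0.0;
    }
    link_text_len as f32 / text_len as f32
}
```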

3. Selection

Select the highest-scoring element as the article candidate:

  • Find element with highest score
  • Bias toward semantic containers when scores are close
  • Check if score meets the minimum threshold
  • Check if content length meets the minimum threshold
  • Return an error if content doesn't meet thresholds

Why: Selection ensures we extract actual article content, not navigation or ads.
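The threshold checks can be sketched like this; the `Candidate` struct, error messages, and limits in the example are hypothetical:

```rust
struct Candidate {
    score: f32,
    text_len: usize,
}

/// Reject candidates that fail either minimum, mirroring the
/// selection checks described above.
fn check_thresholds(
    c: &Candidate,
    min_score: f32,
    char_threshold: usize,
) -> Result<(), &'static str> {
    if c.score < min_score {
        return Err("best candidate scored below min_score");
    }
    if c.text_len < char_threshold {
        return Err("extracted text shorter than char_threshold");
    }
    Ok(())
}
```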

4. Post-processing

Clean up the selected content:

  • Include sibling elements: adjacent content blocks and shared-parent headers
  • Remove remaining clutter: ads, comments, social widgets
  • Clean up whitespace: normalize spacing and formatting
  • Preserve structure: maintain headings, paragraphs, and lists

Why: Post-processing improves the quality of extracted content and includes related elements.
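The whitespace cleanup, for instance, can be as simple as collapsing runs of spaces and newlines; this is a sketch, not Lectito's actual pass:

```rust
/// Collapse all runs of whitespace (spaces, tabs, newlines) into
/// single spaces and trim both ends.
fn normalize_whitespace(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ")
}
```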

Branch Additions

The current branch layers a few extra passes around that core flow:

  • Retry strategy: if the first pass comes back short, Lectito retries with progressively looser settings before giving up
  • Site-specific extraction: built-in extractors and optional site configs can override the generic scorer for difficult sites
  • Confidence and diagnostics: successful extractions carry a confidence score and can include pass-by-pass diagnostics

Those additions sit around the original pipeline. They do not replace it.
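The retry strategy can be pictured as a loop over progressively looser settings; the extractor closure and the pass list below are placeholders, not Lectito's API:

```rust
/// Run `extract` once per (min_score, min_len) pair, in order, and
/// return the first result that is long enough. Each later pass uses
/// a looser min_score than the one before.
fn extract_with_retry(
    extract: impl Fn(f32) -> String,
    passes: &[(f32, usize)],
) -> Option<String> {
    for &(min_score, min_len) in passes {
        let candidate = extract(min_score);
        if candidate.len() >= min_len {
            return Some(candidate);
        }
    }
    None
}
```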

Data Flow

Input HTML
    ↓
parse_to_document()
    ↓
preprocess_html() → Cleaned HTML
    ↓
build_dom_tree() → DOM Tree
    ↓
calculate_score() → Scored Elements
    ↓
extract_content() → Selected Element
    ↓
postprocess_html() → Cleaned Content
    ↓
extract_metadata() → Metadata
    ↓
Article

Key Components

Document and Element

The Document and Element types wrap the scraper crate's HTML parsing:

use lectito_core::{Document, Element};

let doc = Document::parse(html)?;
let elements: Vec<Element> = doc.select("article p")?;

These provide a convenient API for DOM manipulation and element traversal.

Scoring Algorithm

The scoring algorithm combines multiple factors:

element_score = base_tag_score
              + class_id_weight
              + content_density_score
              × (1 - link_density)

See Scoring Algorithm for details.
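As code, following the formula's literal operator precedence (whether the (1 − link_density) factor scales only the content-density term or the whole sum is an implementation detail; Readability.js, for comparison, scales the whole candidate score):

```rust
/// Combine the scoring factors from the formula above, reading the
/// multiplication as applying to the content-density term.
fn element_score(
    base_tag_score: f32,
    class_id_weight: f32,
    content_density_score: f32,
    link_density: f32,
) -> f32 {
    base_tag_score + class_id_weight + content_density_score * (1.0 - link_density)
}
```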

Metadata Extraction

A separate process extracts metadata from the HTML:

  • Title: <h1>, <title>, or Open Graph tags
  • Author: meta tags, bylines, schema.org
  • Date: meta tags, time elements, schema.org
  • Excerpt: meta description, first paragraph
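Each field follows the same pattern: try sources in priority order and keep the first non-empty hit. A generic sketch (the helper name is made up):

```rust
/// Return the first candidate that is present and non-blank, in
/// priority order, e.g. <h1> → <title> → og:title for the title.
fn first_non_empty<'a>(candidates: &[Option<&'a str>]) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .flatten()
        .map(str::trim)
        .find(|s| !s.is_empty())
}
```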

Why This Approach

Content Over Structure

Unlike XPath-based extraction, Lectito doesn't rely on fixed HTML structures. It analyzes content characteristics, making it work across many sites without custom rules.

Heuristic-Based

The algorithm uses heuristics derived from analyzing a large corpus of article pages, which keeps it flexible across different site designs.

Fallback Mechanism

For sites where the algorithm fails, Lectito supports site-specific configuration files with XPath expressions. See Configuration for details.

Limitations

Sites That May Fail

  • Very short pages (tweets, status updates)
  • Non-article content (product pages, search results)
  • Unusual layouts
  • Heavily JavaScript-dependent content

Improving Extraction

For difficult sites:

  1. Adjust thresholds such as min_score or char_threshold
  2. Provide a site configuration
  3. Add a site-specific extractor when generic scoring is not enough

See Configuration for options.

Comparison to Alternatives

  • Lectito: works across many sites with no custom rules needed, but may fail on unusual layouts
  • Defuddle: strong HTML and Markdown output, forgiving cleanup, and richer metadata extraction, but JavaScript and DOM-oriented rather than a Rust-native library or CLI stack
  • XPath: precise and predictable, but requires custom rules per site
  • CSS Selectors: simple and familiar, but brittle; breaks on layout changes
  • Machine Learning: adaptable, but complex and requires training data

Lectito strikes a balance by working well for most sites without custom rules, with site configuration as a fallback.