Scoring Algorithm

Detailed explanation of how Lectito scores HTML elements to identify article content.

Overview

The scoring algorithm assigns a numeric score to each HTML element, indicating how likely it is to contain the main article content. Higher scores indicate better content candidates.

Score Formula

The final score for each element is calculated as:

element_score = (base_tag_score
               + class_id_weight
               + content_density_score
               + container_bonus)
               × (1 - link_density)
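The formula can be sketched in Rust as follows. This is a minimal illustration rather than Lectito's actual implementation: the `ScoreParts` struct, its field names, and the `element_score` function are hypothetical names chosen to mirror the formula, and clamping the link density to the range 0 to 1 is an added assumption.

```rust
/// Inputs to the score formula. The struct and its field names are
/// hypothetical; they simply mirror the terms of the formula.
#[derive(Debug, Clone, Copy)]
pub struct ScoreParts {
    pub base_tag_score: f64,
    pub class_id_weight: f64,
    pub content_density_score: f64,
    pub container_bonus: f64,
    /// Fraction of the element's text that sits inside links (0.0 to 1.0).
    pub link_density: f64,
}

/// Sum the additive components, then scale by the share of non-link text.
pub fn element_score(p: ScoreParts) -> f64 {
    let raw = p.base_tag_score
        + p.class_id_weight
        + p.content_density_score
        + p.container_bonus;
    // Assumption: keep the density within 0..=1 so the multiplier stays sane.
    raw * (1.0 - p.link_density.clamp(0.0, 1.0))
}
```

For instance, components of 10, 25, 3, and 2 with a link density of 0.25 give 40 × 0.75 = 30.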

Let's break down each component.

Base Tag Score

Different HTML tags have different inherent scores, reflecting their likelihood of containing content:

Tag             Score   Rationale
<article>       +10     Semantic article container
<section>       +8      Logical content section
<div>           +5      Generic container, often used for content
<blockquote>    +3      Quoted content
<pre>           0       Preformatted text, neutral
<td>            +3      Table cell
<address>       -3      Contact info, unlikely to be main content
<ol>/<ul>       -3      Lists and metadata
<li>            -3      List item
<header>        -5      Header, not main content
<footer>        -5      Footer, not main content
<nav>           -5      Navigation
<th>            -5      Table header
<h1> to <h6>    -5      Headings, not content themselves
<form>          -3      Forms, not content
<main>          0       Container scored via bonus
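The table maps naturally onto a lookup function. The sketch below is illustrative only: `base_tag_score` is a hypothetical name, and returning 0 for tags that the table does not list is an assumption, not documented behavior.

```rust
/// Look up the base score for a tag name, using the values from the table.
/// Assumption: tags not listed in the table are treated as neutral (0).
pub fn base_tag_score(tag: &str) -> i32 {
    match tag.to_ascii_lowercase().as_str() {
        "article" => 10,
        "section" => 8,
        "div" => 5,
        "blockquote" | "td" => 3,
        "pre" | "main" => 0,
        "address" | "ol" | "ul" | "li" | "form" => -3,
        "header" | "footer" | "nav" | "th" => -5,
        "h1" | "h2" | "h3" | "h4" | "h5" | "h6" => -5,
        _ => 0,
    }
}
```

Lowercasing the tag name first means `base_tag_score("ARTICLE")` and `base_tag_score("article")` give the same result.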

Class/ID Weight

Class and ID attributes strongly indicate element purpose:

Positive Patterns

These patterns indicate content elements:

(?i)(article|body|content|entry|hentry|h-entry|main|page|post|text|blog|story)

Weight: +25 points

Examples:

  • class="article-content"
  • id="main-content"
  • class="post-body"

Negative Patterns

These patterns indicate non-content elements:

(?i)(banner|breadcrumbs?|combx|comment|community|disqus|extra|foot|header|menu|related|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup)

Weight: -25 points

Examples:

  • class="sidebar"
  • id="footer"
  • class="menu"
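A rough sketch of the weighting, using only the standard library: because both patterns are unanchored and case-insensitive, lowercased substring checks behave the same as the regular expressions above (with "breadcrumb" standing in for `breadcrumbs?`). The function name `class_id_weight` and the choice to apply both weights when a value matches both patterns are assumptions; the documentation does not say how overlapping matches are resolved.

```rust
/// Substrings from the positive pattern.
const POSITIVE: &[&str] = &[
    "article", "body", "content", "entry", "hentry", "h-entry",
    "main", "page", "post", "text", "blog", "story",
];

/// Substrings from the negative pattern ("breadcrumb" covers `breadcrumbs?`).
const NEGATIVE: &[&str] = &[
    "banner", "breadcrumb", "combx", "comment", "community", "disqus",
    "extra", "foot", "header", "menu", "related", "remark", "rss",
    "shoutbox", "sidebar", "sponsor", "ad-break", "agegate",
    "pagination", "pager", "popup",
];

/// Weight for a single class or id attribute value.
/// Assumption: a value matching both patterns gets +25 and -25 (net 0).
pub fn class_id_weight(attr_value: &str) -> i32 {
    let value = attr_value.to_lowercase();
    let mut weight = 0;
    if POSITIVE.iter().any(|p| value.contains(*p)) {
        weight += 25;
    }
    if NEGATIVE.iter().any(|p| value.contains(*p)) {
        weight -= 25;
    }
    weight
}
```

Under these assumptions, `class_id_weight("article-content")` yields 25 and `class_id_weight("sidebar")` yields -25.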

Content Density Score

Rewards elements with substantial text content:

Character Density

1 point per 100 characters, maximum 3 points.

char_score = (text_length / 100).min(3)

Punctuation Density

1 point per 5 commas/periods, maximum 3 points.

punct_score = (comma_count / 5).min(3)

Total content density:

content_density = char_score + punct_score

Rationale: Real article content has more text and punctuation than navigation or metadata.
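The two density formulas can be combined in a small function. This is a sketch under stated assumptions: `content_density` is a hypothetical name, text length is measured in Unicode scalar values, integer division rounds down, and only commas are counted, following the `comma_count` term in the formula.

```rust
/// Content density as described above: one point per 100 characters and
/// one point per 5 commas, each capped at 3.
/// Assumptions: length is counted in `char`s and only commas are counted.
pub fn content_density(text: &str) -> u32 {
    let text_length = text.chars().count() as u32;
    let comma_count = text.chars().filter(|c| *c == ',').count() as u32;

    // Integer division rounds down, so 250 characters earn 2 points.
    let char_score = (text_length / 100).min(3);
    let punct_score = (comma_count / 5).min(3);

    char_score + punct_score
}
```

Because each part is capped at 3, the function never returns more than 6, however long the text is.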

Container Bonus

Elements that are typical article containers receive a small boost:

  • <article>, <section>, <main>: +2

This bias helps select semantic containers when scores are close.

Link Density

Penalizes elements with too many links:

link_density = (length of all <a> tag text) / (total text length)
final_score = raw_score × (1 - link_density)

Examples:

  • Text "Click here", all of it inside a link: link density = 100% (10/10)
  • Text "See the article for details", with only "article" linked: link density ≈ 26% (7/27)
  • Text "Article content with no links": link density = 0%

Rationale: Navigation menus, lists of links, and metadata have high link density. Real content has low link density.
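The penalty is a simple ratio followed by a multiplication. In the sketch below, the function names `link_density` and `apply_link_penalty` are hypothetical, and treating an element with no text as having a density of 0 is an assumption made to avoid dividing by zero.

```rust
/// Ratio of link text to total text, in the range 0.0 to 1.0.
/// Assumption: an element with no text has a link density of 0.
pub fn link_density(link_text_len: usize, total_text_len: usize) -> f64 {
    if total_text_len == 0 {
        return 0.0;
    }
    (link_text_len as f64 / total_text_len as f64).min(1.0)
}

/// Scale a raw score by the share of text that is not inside links.
pub fn apply_link_penalty(raw_score: f64, link_density: f64) -> f64 {
    raw_score * (1.0 - link_density)
}
```

A raw score of 32 with a quarter of the text inside links would drop to 32 × 0.75 = 24.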

Complete Example

Consider this HTML:

<div class="article-content">
    <h1>Article Title</h1>
    <p>
        This is a substantial paragraph with plenty of text, including multiple
        sentences, and commas, to demonstrate how content density scoring works.
    </p>
    <p>
        Another paragraph with even more text, details, and information to
        increase the character count.
    </p>
</div>

Step-by-Step Scoring

1 Base Tag Score

<div>: +5

2 Class/ID Weight

class="article-content" contains "article" and "content": +25

3 Content Density

  • Text length: ~250 characters (title plus both paragraphs)
  • Character score: min(250/100, 3) = 2
  • Commas: 5
  • Punctuation score: min(5/5, 3) = 1
  • Total: 3 points

4 Link Density

No links: link density = 0

5 Final Score

A <div> receives no container bonus, so:

(5 + 25 + 3 + 0) × (1 - 0) = 33

This element scores 33, well above the default threshold of 20.

Thresholds

Two thresholds determine if content is readable:

Score Threshold

Minimum score for extraction (default: 20.0).

If no element scores above this, extraction fails with LectitoError::NotReaderable.

Character Threshold

Minimum character count (default: 500).

Even with a high score, content must have enough text to be meaningful.
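The two checks can be pictured as a single predicate. This sketch does not use Lectito's own types: `Thresholds` and `passes_thresholds` are hypothetical names, and treating both thresholds as inclusive minimums is an assumption.

```rust
/// The two readability thresholds, with the defaults quoted above.
/// This struct is illustrative and not part of Lectito's API.
pub struct Thresholds {
    pub min_score: f64,
    pub char_threshold: usize,
}

impl Default for Thresholds {
    fn default() -> Self {
        Self { min_score: 20.0, char_threshold: 500 }
    }
}

/// Content is readable only if it clears both thresholds.
/// Assumption: both comparisons are inclusive.
pub fn passes_thresholds(score: f64, char_count: usize, t: &Thresholds) -> bool {
    score >= t.min_score && char_count >= t.char_threshold
}
```

With the defaults, a candidate scoring 32 with 600 characters passes, while the same score with only 250 characters clears the score check but not the character check.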

Scoring Edge Cases

Empty Elements

Elements with no text receive a score of 0 and are ignored.

Nested Elements

Both parent and child elements are scored. The highest-scoring element at any level is selected.

Sibling Elements

Adjacent elements with similar scores may be grouped as part of the same article.

Negative Scores

Elements can have negative scores (e.g., navigation). They're excluded from selection.
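Selection under these rules can be sketched as a filter followed by a maximum. The `Candidate` struct and `select_best` function are hypothetical; the sketch assumes parents and children have already been scored into one flat list, and it leaves out sibling grouping.

```rust
/// A scored element. The struct is illustrative, not Lectito's own type.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub name: &'static str,
    pub score: f64,
    pub text_len: usize,
}

/// Pick the highest-scoring candidate, skipping empty elements and
/// anything whose score is zero or negative.
pub fn select_best(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.text_len > 0 && c.score > 0.0)
        .max_by(|a, b| a.score.total_cmp(&b.score))
}
```

If every candidate is empty or scores at or below zero, the function returns `None`, which corresponds to the case where no usable content is found.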

Configuration Affecting Scoring

Adjust scoring behavior with ReadabilityConfig:

use lectito_core::ReadabilityConfig;

let config = ReadabilityConfig::builder()
    .min_score(25.0)           // Higher threshold
    .char_threshold(1000)      // Require more content
    .min_content_length(200)   // Longer minimum text
    .build();

See Configuration for details.

Practical Implications

Why Articles Score Well

  • Semantic tags (<article>)
  • Descriptive classes (article-content)
  • Substantial text (high character count)
  • Punctuation (commas, periods)
  • Few links (low link density)

Why Navigation Scores Poorly

  • Generic or negative classes (sidebar, navigation)
  • Little text (just link labels)
  • Many links (high link density)
  • Short content (fails character threshold)

Why Comments May Score Poorly

  • Often in negatively classed containers (comments)
  • Short individual comments
  • Many links (usernames, replies)
  • Variable quality

Site Configuration

When automatic scoring fails, provide XPath rules:

# example.com.toml
[[fingerprints]]
pattern = "example.com"

[[fingerprints.extract]]
title = "//h1[@class='article-title']"
content = "//div[@class='article-body']"

See Configuration for details.
