CLI Usage

The CLI is designed for inspecting extraction behavior and converting documents from the terminal. It is also useful for building fixtures because the same binary can print article output and diagnostics.

The CLI has three commands:

  • parse: extract article content
  • readable: check whether a document looks readable
  • fixture: inspect bundled fixtures

Parse

parse accepts one input source. Use a positional file path, --input, --stdin, or --url.

lectito parse article.html
lectito parse --input article.html
lectito parse --stdin < article.html
lectito parse --url https://example.com/article

Output formats:

JSON is the default because it preserves the whole article structure. Use Markdown or text when piping into another tool.

lectito parse article.html --format json --pretty
lectito parse article.html --format html
lectito parse article.html --format markdown
lectito parse article.html --format text
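
As a minimal sketch of piping text output into another tool, the helper below counts the words in the extracted article. The function name article_word_count is an illustrative assumption and not part of the CLI; only the parse command and the --format text flag come from the examples above.

```shell
# Hypothetical helper (not part of lectito): count the words in the extracted text.
# Usage: article_word_count article.html
article_word_count() {
  # The extracted article is written to stdout, so wc can read it from the pipe.
  lectito parse "$1" --format text | wc -w
}
```

The same pattern applies to --format markdown when the downstream tool expects Markdown.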

Useful options:

The defaults work for most article pages. Tune these flags when the extracted content is too short or too broad, or when the page has a known content container.

lectito parse article.html --char-threshold 800
lectito parse article.html --nb-top-candidates 8
lectito parse article.html --content-selector article
lectito parse article.html --url https://example.com/post --site-profile example.com.toml
lectito parse article.html --max-elems-to-parse 10000
lectito parse article.html --keep-classes --classes-to-preserve language-rust

--site-profile can be repeated. Each file must be a TOML site profile. User profiles take precedence over bundled profiles for the same host.
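
Because the flag can be repeated, a small wrapper can turn a list of profile files into one --site-profile flag per file. This is only a sketch: the function name parse_with_profiles and the profile file names in the usage line are hypothetical, and the command shape mirrors the --url and --site-profile example above.

```shell
# Hypothetical helper (not part of lectito): pass each profile file as its own
# --site-profile flag.
# Usage: parse_with_profiles article.html https://example.com/post first.toml second.toml
parse_with_profiles() {
  input=$1
  url=$2
  shift 2

  # Rewrite the remaining arguments in place: each FILE becomes "--site-profile FILE".
  for profile in "$@"; do
    set -- "$@" --site-profile "$profile"
    shift
  done

  lectito parse "$input" --url "$url" "$@"
}
```

Called with no profile files, the wrapper runs the plain parse command with only the --url flag.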

Diagnostics are written to stderr after the main output:

This keeps stdout usable for the extracted article while still showing debug information in the terminal.

lectito parse article.html --format markdown --diagnostic-format pretty
lectito parse article.html --diagnostic-format json
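
Since the article goes to stdout and the diagnostics go to stderr, ordinary shell redirection can capture each stream in its own file. The sketch below shows one way to do that. The function name extract_with_diagnostics and the output file names are illustrative assumptions. Anything else the CLI writes to stderr would end up in the same diagnostics file.

```shell
# Hypothetical helper (not part of lectito): save the article and its
# diagnostics to separate files.
# Usage: extract_with_diagnostics article.html article.md diagnostics.json
extract_with_diagnostics() {
  input=$1
  article_out=$2
  diagnostics_out=$3

  # ">" captures stdout (the extracted article); "2>" captures stderr (the diagnostics).
  lectito parse "$input" --format markdown --diagnostic-format json \
    > "$article_out" 2> "$diagnostics_out"
}
```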

Readable

readable checks whether the document appears to contain enough article-like text. It does not return extracted content.

lectito readable article.html
lectito readable --stdin < article.html
lectito readable --url https://example.com/article
lectito readable article.html --json --pretty

Thresholds:

lectito readable article.html --min-content-length 140 --min-score 20