Automatic visiting

Verbalist automatically visits each URL in the top 10 results. It simulates a real browser, so it retrieves the content as a user would see it.

HTML extraction

The complete HTML of the page is extracted, including all elements: text, headings, lists, tables, and image alt text.
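As an illustration of what an extracted page can yield, the sketch below collects headings and image alt text with Python's standard html.parser module. It is a minimal example, not Verbalist's actual extractor; the names ElementInventory and inventory are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ElementInventory(HTMLParser):
    """Hypothetical helper: records headings and image alt text."""

    def __init__(self):
        super().__init__()
        self.headings = []    # (level, text) pairs in document order
        self.alt_texts = []   # non-empty alt attributes of <img> tags
        self._level = None    # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._buffer = []
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_texts.append(alt)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None

    def handle_data(self, data):
        # Only text inside an open heading is kept here.
        if self._level is not None:
            self._buffer.append(data)


def inventory(html):
    parser = ElementInventory()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.alt_texts
```

Calling inventory on a page's HTML returns the headings with their levels and the list of alt texts; lists and tables could be collected the same way.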

Content cleaning

The system removes elements that are not relevant to the analysis: navigation, site header and footer, sidebar, widgets, ads, and popups. Only the main content remains.
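A minimal sketch of this kind of cleaning, using only the Python standard library. The set of tags treated as boilerplate is an assumption made for the example; the rules Verbalist actually applies are not documented here, and the names BoilerplateStripper and clean_text are hypothetical.

```python
from html.parser import HTMLParser

# Assumed list of page-chrome tags; a real cleaner would also need
# class/id heuristics to catch widgets, ads and popups built from <div>s.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def clean_text(html):
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

One limitation of this simple rule: a header element used inside an article would be dropped along with the site header, so a production cleaner would need to be more selective.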

Main content identification

Content extraction algorithms identify the "main content" of the page, distinguishing it from accessory elements. This ensures the analysis focuses on actual content.
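One common family of heuristics scores each container by the amount of paragraph text it holds and treats link-heavy blocks as accessory. The sketch below shows that idea in simplified form; it is an assumption about how such an algorithm can work, not a description of the algorithms Verbalist uses, and the names ParagraphScorer and find_main_block are hypothetical.

```python
from html.parser import HTMLParser

CONTAINER_TAGS = {"article", "div", "main", "section"}


class ParagraphScorer(HTMLParser):
    """Credits each <p>'s non-link text to its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # one {"label", "score"} dict per container
        self.open_blocks = []   # stack of indices into self.blocks
        self.in_paragraph = False
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            attributes = dict(attrs)
            label = attributes.get("id") or attributes.get("class") or tag
            self.blocks.append({"label": label, "score": 0})
            self.open_blocks.append(len(self.blocks) - 1)
        elif tag == "p":
            self.in_paragraph = True
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.open_blocks:
            self.open_blocks.pop()
        elif tag == "p":
            self.in_paragraph = False
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Link text is ignored so that menus and link lists score low.
        if self.in_paragraph and self.link_depth == 0 and self.open_blocks:
            self.blocks[self.open_blocks[-1]]["score"] += len(data.strip())


def find_main_block(html):
    """Returns the label of the highest-scoring container, or None."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    best = max(scorer.blocks, key=lambda block: block["score"], default=None)
    if best is None or best["score"] == 0:
        return None
    return best["label"]
```

On a page with a sidebar made mostly of links and a content block with full paragraphs, the content block receives the higher score and is returned.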

Markdown conversion

The cleaned HTML is converted to structured Markdown, preserving the heading hierarchy (H1-H6), formatting (bold, italic), lists, links, and tables.
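The conversion can be sketched with the standard library as follows. This is a simplified illustration, not Verbalist's converter: it covers headings, bold, italic, links and list items, renders ordered lists as plain bullets, and leaves tables out for brevity. The names MarkdownConverter and html_to_markdown are hypothetical.

```python
import re
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Emits Markdown fragments for a small subset of HTML tags."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None   # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("ul", "ol"):
            self.out.append("\n")

    def handle_data(self, data):
        # Collapse whitespace runs but keep single spaces between inline tags.
        self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    text = "".join(converter.out)
    text = re.sub(r"[ \t]+\n", "\n", text)   # drop trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()
```

Because heading levels map directly to the number of # characters, the H1-H6 hierarchy of the page carries over unchanged.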

Error handling

If a page is inaccessible (404 error, timeout, access protection), Verbalist skips it and continues with the others. The analysis proceeds with the content that is available.
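The skip-and-continue behavior can be sketched like this. The function names fetch_html and collect_pages are hypothetical, and the plain HTTP request stands in for whatever browser automation the tool really uses; only the control flow is the point of the example.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Plain HTTP fetch, used here as a stand-in for a real browser."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def collect_pages(urls, fetch=fetch_html):
    """Fetches every URL, skipping the ones that fail.

    Returns (pages, skipped): pages maps URL to HTML, skipped maps
    URL to a short description of the error.
    """
    pages, skipped = {}, {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as exc:
            # OSError covers HTTP errors such as 404 (urllib.error.HTTPError),
            # timeouts and connection failures; ValueError covers malformed URLs.
            skipped[url] = repr(exc)
    return pages, skipped
```

A failing URL ends up in skipped with its error, while the remaining URLs are still fetched, so later steps can run on whatever was collected.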