How Scraping Works
The process Verbalist uses to extract content from competitor pages.
Automatic visiting
Verbalist automatically visits each of the top 10 URLs. It simulates a real browser so that it retrieves the content as a user would see it.
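As a rough illustration of the idea, the sketch below builds a request that carries browser-like headers using only the Python standard library. The names `BROWSER_HEADERS`, `build_request`, and `fetch_html` are hypothetical, and the header values are assumptions; this is not Verbalist's actual code. A plain HTTP client also does not execute JavaScript, so a full browser simulation would need a headless browser instead.

```python
import urllib.request

# Hypothetical header set: the values Verbalist actually sends are not documented here.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it, falling back to UTF-8."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html` once per ranked URL would give the raw HTML for the later steps.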
HTML extraction
Verbalist extracts the page's complete HTML, including all elements: text, headings, lists, tables, and image alt text.
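To make the list of extracted elements concrete, here is a minimal sketch that walks an HTML string and collects headings, paragraphs, list items, table cells, and image alt text. It uses the standard-library `html.parser`; the `ElementCollector` class and `collect_elements` function are hypothetical names, and nested tracked tags (for example a paragraph inside a list item) are handled only approximately.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect headings, paragraphs, list items, table cells, and image alt text."""

    TRACKED = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th"}

    def __init__(self):
        super().__init__()
        self.items = []    # (tag, text) pairs in document order
        self._tag = None   # tracked tag currently open, if any
        self._buf = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.items.append(("img", alt))
        elif tag in self.TRACKED:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if text:
                self.items.append((tag, text))
            self._tag = None


def collect_elements(html: str):
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.items
```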
Content cleaning
The system removes elements that are not relevant to the analysis: navigation, site header and footer, sidebars, widgets, ads, and popups. Only the main content remains.
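One simple way to approximate this cleaning step is to drop everything inside tags that usually hold page chrome. The sketch below assumes the noise can be recognized by tag name alone; `NOISE_TAGS`, `NoiseStripper`, and `strip_noise` are hypothetical names, and the rules Verbalist really applies are not documented here. Ads and popups are often marked only by class or id attributes, which this sketch does not inspect.

```python
from html.parser import HTMLParser

# Assumed tag list: elements that typically hold navigation and other page chrome.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class NoiseStripper(HTMLParser):
    """Keep only text that is not inside a noise element."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # greater than 0 while inside a noise element
        self.chunks = []   # text fragments that survive cleaning

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_noise(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```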
Main content identification
Content extraction algorithms identify the "main content" of the page, distinguishing it from peripheral elements. This ensures the analysis focuses on actual content.
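The source does not say which extraction algorithm is used. A common heuristic, shown here purely as an assumption, is to score each container by how much text it holds and to penalize containers whose text is mostly links, since menus and sidebars are link-heavy. `CONTAINERS`, `BlockScorer`, and `main_content` are hypothetical names, and the sketch expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as candidate content blocks.
CONTAINERS = {"div", "section", "article", "main", "nav", "aside", "header", "footer"}


class BlockScorer(HTMLParser):
    """Record text and link-text length for each container element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open containers, innermost last
        self.blocks = []   # containers that have been closed
        self.in_link = 0   # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append({"text": [], "link_chars": 0})
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in CONTAINERS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        block = self.stack[-1]   # text counts toward the innermost container
        block["text"].append(text)
        if self.in_link:
            block["link_chars"] += len(text)


def main_content(html: str) -> str:
    parser = BlockScorer()
    parser.feed(html)
    parser.close()

    def score(block):
        total = sum(len(t) for t in block["text"])
        return total - 2 * block["link_chars"]   # penalize link-heavy blocks

    if not parser.blocks:
        return ""
    return " ".join(max(parser.blocks, key=score)["text"])
```

The factor of 2 in the penalty is arbitrary; it only needs to be large enough that a block made entirely of links scores below a block of ordinary prose.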
Markdown conversion
The cleaned HTML is converted to structured Markdown, preserving the heading hierarchy (H1-H6), formatting (bold, italic), lists, links, and tables.
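The conversion can be pictured with a small sketch that maps a subset of HTML tags to their Markdown equivalents. It covers headings, bold, italic, lists, and links; tables are left out to keep the example short. `MarkdownConverter` and `html_to_markdown` are hypothetical names, not part of Verbalist.

```python
import re
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Convert a small HTML subset to Markdown (tables are not handled)."""

    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None   # target of the link currently open, if any
        self.lists = []    # stack of [tag, item_count] for open lists

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in ("ul", "ol"):
            if not self.lists:
                self.out.append("\n")
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            current = self.lists[-1]
            current[1] += 1
            marker = f"{current[1]}. " if current[0] == "ol" else "- "
            # Naive nesting: two spaces of indent per list level.
            self.out.append("\n" + "  " * (len(self.lists) - 1) + marker)
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.HEADINGS or tag == "p":
            self.out.append("\n\n")
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None
        elif tag in ("ul", "ol") and self.lists:
            self.lists.pop()
            if not self.lists:
                self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownConverter()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]+\n", "\n", text)   # drop trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse extra blank lines
    return text.strip()
```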
Error handling
If a page is inaccessible (for example because of a 404 error, a timeout, or access protections), Verbalist skips it and continues with the others. The analysis proceeds with the content that is available.
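The skip-and-continue behavior can be sketched as a loop that catches per-page failures instead of aborting the whole run. The functions `fetch` and `scrape_all` are hypothetical; the `fetcher` parameter exists only so the loop can be exercised without network access.

```python
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download one page; raises on HTTP errors, timeouts, and network failures."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetcher=fetch):
    """Fetch every URL, skipping the ones that fail.

    Returns (pages, skipped): pages maps URL to HTML, and skipped maps
    URL to a short description of why it was skipped.
    """
    pages, skipped = {}, {}
    for url in urls:
        try:
            pages[url] = fetcher(url)
        except (OSError, ValueError) as exc:
            # OSError covers urllib's URLError and HTTPError as well as
            # timeouts and connection errors; ValueError covers malformed URLs.
            skipped[url] = f"{type(exc).__name__}: {exc}"
    return pages, skipped
```

Keeping the skipped URLs alongside the successful pages makes it possible to report which competitors were left out of the analysis.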