
Scraping Overview

How Thamizhi collects news from multiple sources.

Thamizhi uses two complementary scraping approaches:

| Approach | Platform | Time Limit | Best For |
| --- | --- | --- | --- |
| Batch Scraping | GitHub Actions | Unlimited (public repo) | Scheduled bulk collection |
| On-Demand Scraping | Cloudflare Browser Run | 10 min/day | Real-time URL fetch |

  • Batch scraping handles the heavy lifting: it collects hundreds of articles daily from RSS feeds and known news sites.
  • On-demand scraping handles real-time requests: when a user submits a link, Cloudflare Browser Run fetches it immediately for verification.

| Source Type | Method | Speed | Effort |
| --- | --- | --- | --- |
| RSS feeds | Python + Feedparser | Fast | Minimal |
| Static HTML sites | Python + requests + BeautifulSoup | Fast | Low |
| JS-rendered sites | Python + Playwright | Medium | Medium |
| YouTube channels | YouTube Data API | Fast | Low |
| Social media | Public APIs / scrapers | Varies | Medium |
| Government data | RSS + Playwright | Varies | Medium |
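
To illustrate why RSS feeds are the lowest-effort source, here is a minimal sketch of extracting article records from a feed. It is not Thamizhi's actual collector: it swaps in the standard library's xml.etree.ElementTree for Feedparser so that it runs without extra dependencies, and the sample feed, the function name parse_rss_items, and the output field names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 payload standing in for a real news feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Headline one</title>
      <link>https://example.com/a</link>
      <pubDate>Mon, 01 Jan 2024 06:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Headline two</title>
      <link>https://example.com/b</link>
      <pubDate>Mon, 01 Jan 2024 07:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when a child is missing, hence the fallback.
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["published"], entry["title"], entry["url"])
```

A feed already exposes title, link, and date as structured fields, so no HTML parsing or browser rendering is needed. Static and JS-rendered sites require per-site selectors on top of the fetch step, which is where the extra effort in the table comes from.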

Scraped articles are deduplicated before processing:

  1. URL hash — exact URL match
  2. Title similarity — fuzzy matching (Levenshtein distance)
  3. Content overlap — N-gram comparison for same story across sources

Duplicate articles are linked via the cross_refs JSON field.
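
The three deduplication stages can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's implementation: the Article record shape, the function names, the thresholds (0.85 and 0.5), the choice of word trigrams with Jaccard overlap, and storing duplicate URLs as a JSON list in cross_refs are all hypothetical. Only the three stages and the cross_refs field name come from the description above.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Article:
    # Hypothetical record shape; only the cross_refs name comes from the docs.
    url: str
    title: str
    content: str
    cross_refs: list = field(default_factory=list)


def url_hash(url):
    """Stage 1: identical URLs produce identical SHA-256 keys."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Stage 2: Levenshtein distance normalized to a 0..1 score."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # empty titles carry no evidence of a match
    return 1.0 - levenshtein(a, b) / longest


def word_ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def content_overlap(a, b, n=3):
    """Stage 3: Jaccard overlap of word n-grams (assumed metric)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def find_duplicate(candidate, seen, title_threshold=0.85, overlap_threshold=0.5):
    """Return (existing_article, matching_stage), or None if no stage matches."""
    for existing in seen:
        # Cheapest check first, most expensive last.
        if url_hash(candidate.url) == url_hash(existing.url):
            return existing, "url"
        if title_similarity(candidate.title, existing.title) >= title_threshold:
            return existing, "title"
        if content_overlap(candidate.content, existing.content) >= overlap_threshold:
            return existing, "content"
    return None


def link_duplicate(original, duplicate):
    """Record the duplicate's URL on the original; return cross_refs as JSON."""
    if duplicate.url not in original.cross_refs:
        original.cross_refs.append(duplicate.url)
    return json.dumps(original.cross_refs)
```

Ordering the checks from cheapest to most expensive means most duplicates are caught by a hash comparison, and the N-gram comparison only runs on articles whose URL and title both differ. At larger volumes the stage 1 hashes would normally be kept in a set or an indexed column rather than recomputed inside the loop.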