Batch Scraping (GitHub Actions)
Scheduled scraping using GitHub Actions cron jobs with Python and Playwright.
How It Works
GitHub Actions runs a scheduled workflow that:
- Checks out the scraper code
- Installs Python dependencies + Playwright browser
- Runs scraping scripts against all configured sources
- Writes results directly to the database
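
The "runs scraping scripts against all configured sources" step can be sketched as a small orchestrator loop. This is a minimal sketch, not the project's actual `run_scrapers.py`: the `run_all` and `exit_code` names, and the assumption that each scraper is a callable returning the number of stored articles, are hypothetical. The idea is that one failing source should not stop the others, and the job fails only when every source fails.

```python
# Hypothetical orchestrator sketch; names and return conventions are assumptions.
import logging
from typing import Callable, Mapping

logger = logging.getLogger("scrapers")


def run_all(scrapers: Mapping[str, Callable[[], int]]) -> dict[str, str]:
    """Run every scraper, recording 'ok' or an error message per source."""
    results: dict[str, str] = {}
    for name, scrape in scrapers.items():
        try:
            count = scrape()
            logger.info("%s: stored %d articles", name, count)
            results[name] = "ok"
        except Exception as exc:  # one broken source must not stop the rest
            logger.exception("%s failed", name)
            results[name] = f"error: {exc}"
    return results


def exit_code(results: Mapping[str, str]) -> int:
    """Return 1 only when every scraper failed, so partial runs still pass."""
    if results and all(status != "ok" for status in results.values()):
        return 1
    return 0
```

In a workflow run, the script could end with `sys.exit(exit_code(run_all(...)))` so that GitHub Actions marks the job as failed only when nothing was scraped.
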
Workflow Configuration

    name: Scrape News
    on:
      schedule:
        - cron: '*/30 * * * *'  # Every 30 minutes
      workflow_dispatch:        # Manual trigger

    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: '3.12'
          - name: Install dependencies
            run: |
              pip install -r requirements.txt
              playwright install chromium
          - name: Run scrapers
            run: python run_scrapers.py
            env:
              GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}

Scraper Script Structure

    scrapers/
    ├── run_scrapers.py        # Orchestrator
    ├── rss_scraper.py         # RSS feed parser
    ├── playwright_scraper.py  # JS-rendered site scraper
    ├── youtube_scraper.py     # YouTube channel scraper
    ├── dedup.py               # Deduplication logic
    ├── requirements.txt
    └── utils/
        └── extractors.py      # HTML parsing helpers

RSS Scraper (Lightweight)

    # rss_scraper.py (simplified)
    from datetime import datetime, timezone

    import feedparser

    # insert_raw_article is the project's database helper, defined elsewhere.

    SOURCES = [
        {"name": "The Hindu", "url": "https://www.thehindu.com/rss/"},
        {"name": "Dinamani", "url": "https://www.dinamani.com/rss/"},
        {"name": "Times of India", "url": "https://timesofindia.indiatimes.com/rssfeeds/"},
    ]

    for source in SOURCES:
        feed = feedparser.parse(source["url"])
        for entry in feed.entries:
            insert_raw_article(
                source=source["name"],
                url=entry.link,
                title=entry.title,
                # Some feeds omit the description; fall back to an empty body.
                body_html=entry.get("description", ""),
                # Timezone-aware UTC; datetime.utcnow() is deprecated in Python 3.12.
                fetched_at=datetime.now(timezone.utc),
            )

Playwright Scraper (JS Sites)

    # playwright_scraper.py (simplified)
    from playwright.sync_api import sync_playwright

    # insert_raw_article is the project's database helper, defined elsewhere.

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://www.dinamalar.com/")
        page.wait_for_load_state("networkidle")

        # Collect title, link, and summary from every <article> element.
        articles = page.eval_on_selector_all(
            "article",
            """els => els.map(el => ({
                title: el.querySelector('h2')?.innerText,
                url: el.querySelector('a')?.href,
                summary: el.querySelector('p')?.innerText
            }))""",
        )
        browser.close()

    for article in articles:
        # Optional chaining in the page script can yield empty values,
        # so skip entries without a link or title.
        if not article.get("url") or not article.get("title"):
            continue
        insert_raw_article(
            source="Dinamalar",
            url=article["url"],
            title=article["title"],
            body_html=article.get("summary") or "",
        )

Limits
| Limit | Value |
|---|---|
| Max run time per job | 6 hours (generous) |
| Concurrent jobs | 20 (public repos) |
| Storage | 500 MB per job |
| Frequency | Every 30 min recommended (respect sites) |
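
Because the workflow fires every 30 minutes, a run that takes longer than that can overlap with the next one, long before the 6-hour job limit applies. One way to guard against this is a time budget inside the orchestrator. The sketch below is illustrative only: `run_within_budget`, the `(name, callable)` task shape, and the 25-minute figure in the usage note are assumptions, not part of the project.

```python
# Hypothetical time-budget guard; the function name and task shape are assumptions.
import time
from typing import Callable, Iterable


def run_within_budget(
    tasks: Iterable[tuple[str, Callable[[], None]]],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[list[str], list[str]]:
    """Run tasks in order; skip whatever remains once the budget is spent."""
    deadline = clock() + budget_seconds
    completed: list[str] = []
    skipped: list[str] = []
    for name, task in tasks:
        # Check the clock before starting each task, not during it.
        if clock() >= deadline:
            skipped.append(name)
            continue
        task()
        completed.append(name)
    return completed, skipped
```

A call such as `run_within_budget(tasks, budget_seconds=25 * 60)` would leave a few minutes of slack before the next scheduled run. The check happens only between tasks, so a single slow scraper can still overrun; setting `timeout-minutes` on the job in the workflow file is a complementary hard stop.
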