India Daily Brief — fault tolerance patterns from 60 days of broken RSS feeds

Tags: python, rss, opensource, india

I run a news pipeline that has to work every morning even when specific feeds are throwing exceptions, rate-limiting me, or serving stale cached XML. After 60+ days of daily runs, here's the resilience pattern that made the difference between a script that runs once and a script you can actually schedule.

The honest problem

Most "RSS reader" tutorials assume a friendly world where every feed returns 200 OK and parses cleanly. In reality, here is how Indian publisher RSS feeds behave in 2026:

TOI returns a Cloudflare challenge to ~30% of requests from non-browser UAs
Moneycontrol serves XML but with Content-Type: text/html (yes, really)
The Wire silently goes down for 4 hours at a time
NDTV changes their feed URL twice a year
Scroll.in returns valid RSS but with timestamps in two different formats depending on the day
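
Mixed timestamp formats like the Scroll.in case can be absorbed with a two-step parser. The following is a minimal sketch, not the code from the repo: the name parse_timestamp is hypothetical, and it assumes the two formats are RFC 822 (the RSS default) and ISO 8601, which the list above doesn't actually specify.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_timestamp(value):
    """Return an aware UTC datetime, or None if the value can't be parsed."""
    value = (value or "").strip()
    if not value:
        return None
    try:
        # RFC 822 style, e.g. "Mon, 05 Jan 2026 07:30:00 +0530"
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        try:
            # ISO 8601 style, e.g. "2026-01-05T07:30:00+05:30"
            dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```

Returning None instead of raising keeps an unparseable date from taking down the whole feed; the caller can decide whether to drop the item or rank it as old.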

A naive "for url in feeds: parse(url)" script will either crash on the first failure or return garbled results from feeds that "succeed" but actually returned an error page.
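
One way to catch a feed that "succeeds" with an error page is to sniff the body instead of trusting the status code or the Content-Type header (which, as the Moneycontrol case shows, can be wrong even for valid XML). This is a rough heuristic sketch; looks_like_feed is a hypothetical helper name, not something taken from the repo.

```python
import re

# Searches for an RSS, Atom or RDF root tag near the start of the body.
_FEED_ROOT = re.compile(r"<(rss|feed|rdf:RDF)[\s>]", re.IGNORECASE)


def looks_like_feed(raw):
    """Heuristic: True if the body opens like a feed rather than an HTML page."""
    # Only inspect the first 2 KB, after dropping a BOM and leading whitespace.
    head = raw[:2048].lstrip("\ufeff \t\r\n")
    if head.lower().startswith(("<!doctype html", "<html")):
        return False  # e.g. a challenge or error page served with 200 OK
    return _FEED_ROOT.search(head) is not None
```

A body that fails this check can be treated like any other fetch failure: log it and return an empty list.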

The four-layer fault tolerance

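import ssl
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

# Layer 1: per-feed try/except with empty list fallback
def fetch_one(name, url):
    try:
        return parse_feed(url)
    except (urllib.error.URLError, ET.ParseError, TimeoutError, ssl.SSLError) as e:
        log_failure(name, e)
        return []   # return, don't raise: partial success is the goal

# Layer 2: timeout per feed (8s) so one slow feed can't block the whole run
with urllib.request.urlopen(req, timeout=8, context=ctx) as r:
    raw = r.read().decode('utf-8', errors='ignore')

# Layer 3: tolerant XML parsing (junk stripping, CDATA handling and the regex fallback are omitted from this excerpt)
root = ET.fromstring(raw)
# Atom <entry> elements are namespaced, so a bare './/entry' never matches them; {*} matches any namespace (Python 3.8+)
items = root.findall('.//item') or root.findall('.//{*}entry')

# Layer 4: confidence scoring on each article; drop anything with a missing URL or a title of 10 characters or fewer
# (excerpt: assumes each element of items has already been converted to a dict with 'title' and 'url' keys)
arts = [a for a in items if len(a['title']) > 10 and a['url']]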
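
The "fall back to regex" part of Layer 3 isn't shown in the excerpt above. Here is a minimal sketch of what such a fallback could look like, assuming RSS-style item elements only; parse_items and the regexes are illustrative names and patterns, not the repo's actual implementation.

```python
import re
import xml.etree.ElementTree as ET

_ITEM_RE = re.compile(r"<item\b.*?</item>", re.DOTALL | re.IGNORECASE)
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)
_LINK_RE = re.compile(r"<link[^>]*>(.*?)</link>", re.DOTALL | re.IGNORECASE)
_CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def _clean(text):
    """Unwrap CDATA sections and trim whitespace."""
    return _CDATA_RE.sub(r"\1", text or "").strip()


def parse_items(raw):
    """Return [{'title': ..., 'url': ...}], trying ElementTree first, regex second."""
    start = raw.find("<")
    if start > 0:
        raw = raw[start:]  # strip a BOM or other junk before the first tag
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        # Fallback for feeds that are not well-formed XML (e.g. a bare "&").
        out = []
        for block in _ITEM_RE.findall(raw):
            title_m = _TITLE_RE.search(block)
            link_m = _LINK_RE.search(block)
            out.append({
                "title": _clean(title_m.group(1)) if title_m else "",
                "url": _clean(link_m.group(1)) if link_m else "",
            })
        return out
    return [
        {
            "title": (el.findtext("title") or "").strip(),
            "url": (el.findtext("link") or "").strip(),
        }
        for el in root.iter("item")
    ]
```

The regex path is deliberately crude. It only has to rescue a title and a link from a feed that a strict parser rejects, and Layer 4 then filters out anything that came back empty.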

The key insight: never let one feed's failure cascade. The brief still goes out with 12 feeds instead of 17. The user gets some news, the script logs which feed died, and tomorrow the script retries. No pager, no 3 AM wakeup.
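
Since the stack includes concurrent.futures, the "no cascade" rule can be sketched as a parallel fetch where each feed's result is isolated from the others. This is an assumption-laden illustration: fetch_all and its parameters are hypothetical, and fetch_one is passed in so the sketch stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(feeds, fetch_one, max_workers=8):
    """Fetch every feed in parallel; a failing feed contributes an empty list."""
    def safe(name, url):
        try:
            return fetch_one(name, url)
        except Exception:
            # Last-resort guard so an unexpected error can't sink the run.
            return []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(safe, name, url) for name, url in feeds.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

The feeds that died on a given morning are then just the keys with empty lists, for example `[n for n, items in results.items() if not items]`.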

Source quality scoring — the unsung hero

Even after fetching succeeds, you have a quality problem. An "RBI rate decision" story from a Google News wrapper is worth less than the same story from FT or The Hindu. I score each article like this:

def url_score(article):
    score = 0
    if 'news.google.com' not in article['url']:    score += 10
    if article['src'] in QUALITY_PUBLISHERS:      score += 5
    if len(article['url']) < 120:                 score += 2   # short = direct
    if '?' in article['url']:                     score -= 1   # tracking params
    return score

When dedupe finds 5 copies of the same story (one from NDTV, one from a Google News wrapper pointing to NDTV, one from The Hindu, one from a republicworld repost, one from an obscure aggregator), this score picks the best version. The user sees the direct publisher link, not the broken aggregator link with 7 UTM params.
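
The "pick the best copy" step can be sketched as a group-by on a normalized title followed by a max on the score. The post doesn't say how duplicates are matched, so the title_key normalization here is purely an assumption, and both function names are hypothetical.

```python
import re


def title_key(title):
    """Crude normalization: lowercase, alphanumerics only, first 8 words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(words[:8])


def dedupe(articles, score):
    """Keep the highest-scoring article for each normalized title."""
    best = {}
    for art in articles:
        key = title_key(art["title"])
        if key not in best or score(art) > score(best[key]):
            best[key] = art
    return list(best.values())
```

Calling `dedupe(articles, url_score)` would plug in the scoring function above. Exact-ish title matching misses rewritten headlines, so a real pipeline may need something fuzzier.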

What's actually in the brief

Once the 17 feeds are parsed, dedupe collapses ~600 raw items into ~45 unique stories. Categorization then buckets them into 8 sections (Politics, Economy, World, Business, Tech, Defence, Sports, Science), and the brief renders as:

Top 7 stories (highest combined recency + quality score)
Categorized sections with 3-8 stories each
Word count: ~600 words, 3-minute read
Lands in inbox at 7:00 AM IST, 7 days a week
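
The post doesn't give the formula behind "combined recency + quality score", so the following is only one plausible shape for it: a quality score plus a recency bonus that decays linearly over 24 hours. The weights, the 24-hour window, the 'published' key, and both function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def rank_score(article, now, quality):
    """Quality score plus a recency bonus that decays linearly to 0 over 24 hours."""
    age_hours = max((now - article["published"]).total_seconds() / 3600, 0)
    recency = max(0.0, 10.0 * (1 - age_hours / 24))
    return quality(article) + recency


def top_stories(articles, now, quality, n=7):
    """Return the n highest-ranked articles, best first."""
    return sorted(articles, key=lambda a: rank_score(a, now, quality), reverse=True)[:n]
```

Here `quality` would be a function such as url_score, and `now` is passed in explicitly so the ranking is reproducible in tests.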

What I learned

Timeouts matter more than retries. One 8s timeout is better than 3 retries with no timeout. Slow feeds are usually permanently slow, not transiently slow.
Empty list on failure > crash on failure. A brief with 12 feeds is fine. No brief is not fine.
Score sources, don't trust them. FT > aggregator > Google News wrapper. Always.
Log feed failures to a separate file so you can spot the feeds that fail repeatedly and replace them.
Stdlib is enough. No feedparser, no requests, no httpx. urllib.request + xml.etree.ElementTree handles everything. 293 lines, zero deps, ships anywhere Python 3.10+ runs.
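
For the separate failure log, one simple option is a JSON Lines file with one record per failed fetch, plus a small helper to surface repeat offenders. The real log format lives in the README and isn't reproduced here, so the file name, the field names, and repeat_offenders are all assumptions.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def log_failure(name, exc, path="feed_failures.jsonl"):
    """Append one JSON line per failed feed fetch."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "feed": name,
        "error": type(exc).__name__,
        "detail": str(exc),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def repeat_offenders(path="feed_failures.jsonl", threshold=3):
    """Return the feeds that have failed at least `threshold` times, sorted by name."""
    with open(path, encoding="utf-8") as fh:
        counts = Counter(json.loads(line)["feed"] for line in fh if line.strip())
    return sorted(name for name, n in counts.items() if n >= threshold)
```

Appending one line per failure keeps the log safe to write from a crashing run and easy to grep later.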

Stack

Python 3.10+ stdlib only (urllib, xml.etree, re, email.utils, concurrent.futures)
Runs as a scheduled agent on Zo Computer (cron at 7 AM IST)
Sends via Gmail SMTP
~293 LOC main file + 200 LOC optional PDF renderer

GitHub: https://github.com/AmSach/india-daily

If you want to see the source-quality list or the failure log format, both are in the README. Drop a comment if you've built something similar for a different region — I'd love to see how it handles regional feed quirks.