Building a Resilient Meta Ads Scraper: What Breaks (and What I Learned Fixing It)

When I set out to build a tool for pulling ad data from Meta's platforms, the brief I gave myself was deceptively simple: let someone search for ads by keyword and country, and get clean, structured data out the other end. The actual problem turned out to be everything in between — Meta's official API doesn't always cover what you need, the alternative (scraping the ad library directly) breaks every time the frontend changes, and "ad data" coming out of either path is messier than it looks. Here's how I approached it, and the decisions that mattered most.

The core problem: pick one access method, and you've already lost

My first instinct was to build against the Meta Graph API and stop there — it's official, structured, and well-documented. But the Graph API has real limits: certain queries need access tiers you don't always have, and once you hit those walls, there's no fallback. So instead of committing to one approach, I built the extraction layer around a Strategy Pattern, with two interchangeable backends: the Graph API for high-volume structured access, and a Playwright-based browser path for everything the API won't give you. The caller doesn't need to know which one is running underneath — it just asks for ads, and the tool picks the right strategy.
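
A minimal sketch of that shape, not the tool's actual code. The names AdQuery, ExtractionStrategy, GraphApiStrategy, BrowserStrategy, and AdExtractor are hypothetical. The selection rule (use the first backend that reports it can handle the query) is an assumption, and both backends are placeholders that yield a dummy record instead of calling Meta.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol, Sequence


@dataclass(frozen=True)
class AdQuery:
    keyword: str
    country: str


class ExtractionStrategy(Protocol):
    """Anything that can say whether it applies and then yield raw ad dicts."""

    def can_handle(self, query: AdQuery) -> bool: ...

    def fetch(self, query: AdQuery) -> Iterator[dict]: ...


class GraphApiStrategy:
    """Placeholder API backend; assumed usable only when a token is configured."""

    def __init__(self, access_token: Optional[str]) -> None:
        self._access_token = access_token

    def can_handle(self, query: AdQuery) -> bool:
        return self._access_token is not None

    def fetch(self, query: AdQuery) -> Iterator[dict]:
        # A real implementation would call the Graph API here.
        yield {"source": "graph_api", "keyword": query.keyword, "country": query.country}


class BrowserStrategy:
    """Placeholder browser backend; assumed to always be available."""

    def can_handle(self, query: AdQuery) -> bool:
        return True

    def fetch(self, query: AdQuery) -> Iterator[dict]:
        # A real implementation would drive a browser here.
        yield {"source": "browser", "keyword": query.keyword, "country": query.country}


class AdExtractor:
    """Caller-facing entry point that hides which backend runs."""

    def __init__(self, strategies: Sequence[ExtractionStrategy]) -> None:
        self._strategies = list(strategies)

    def search(self, query: AdQuery) -> Iterator[dict]:
        # Strategies are tried in priority order; the first applicable one wins.
        for strategy in self._strategies:
            if strategy.can_handle(query):
                return strategy.fetch(query)
        raise RuntimeError("no extraction strategy can handle this query")
```

The caller only ever touches AdExtractor.search(), so a third backend could be added to the list without changing any calling code.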

Why I scrape JSON, not HTML

The browser-based path was the harder design decision. Most scrapers I'd seen parse rendered HTML, which means every time Meta tweaks a class name or restructures a component, the scraper breaks. Instead, I had the browser engine intercept the raw XHR traffic — the JSON responses the frontend itself depends on to render the page. Meta's design team can change the layout as often as they want; the underlying data contract the frontend consumes is far more stable, because breaking it would break their own product. That one decision made the scraper meaningfully more durable against UI changes than a typical HTML-parsing approach.
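
Response interception can be sketched with Playwright's sync API roughly as follows. This is an illustration under assumptions, not the tool's implementation: the function names, the url_filter parameter, and the fixed settle delay are hypothetical. The page URL and the substring that identifies the data endpoint are left to the caller, because the exact endpoints are not something this sketch can vouch for.

```python
def looks_like_data_response(resource_type: str, content_type: str) -> bool:
    """True for XHR/fetch responses whose body is declared as JSON."""
    return resource_type in {"xhr", "fetch"} and "json" in content_type.lower()


def capture_json_responses(url: str, url_filter: str, settle_ms: int = 3000) -> list:
    """Load `url` and return the parsed JSON bodies of matching XHR responses."""
    # Imported lazily so the helper above stays usable without a browser installed.
    from playwright.sync_api import sync_playwright

    payloads = []

    def on_response(response):
        if url_filter not in response.url:
            return
        content_type = response.headers.get("content-type", "")
        if not looks_like_data_response(response.request.resource_type, content_type):
            return
        try:
            payloads.append(response.json())
        except Exception:
            # Body unavailable or not valid JSON: skip it rather than abort the run.
            pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", on_response)  # fires for every response the page receives
        page.goto(url)
        page.wait_for_timeout(settle_ms)  # crude wait for late XHR traffic (assumption)
        browser.close()

    return payloads
```

Nothing here touches the DOM, so a renamed CSS class or a restructured component has no effect on what gets captured.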

Treating "scraped data" as untrusted input

Whether the data comes from the Graph API or the browser path, it arrives messy: inconsistent currency formatting, locale quirks, occasional malformed records. I didn't want any of that reaching the user silently, so I put a validation layer in front of everything using Pydantic v2 models. Every record gets normalized and checked at the boundary — if something doesn't conform, it's filtered out rather than passed through to quietly corrupt someone's downstream analysis. It's a small architectural choice, but it's the difference between a tool people can trust and one they have to double-check by hand.
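
A validation boundary along these lines might look like the sketch below, assuming Pydantic v2. The AdRecord fields and the validate_records helper are hypothetical, and the spend normalization only handles comma thousands separators, which is a deliberate simplification of real locale handling.

```python
from decimal import Decimal
from typing import Iterable, Iterator, Optional

from pydantic import BaseModel, ValidationError, field_validator


class AdRecord(BaseModel):
    # Hypothetical field set; the real schema is not described in this post.
    ad_id: str
    page_name: str
    country: str
    currency: str
    spend: Optional[Decimal] = None

    @field_validator("country", "currency", mode="before")
    @classmethod
    def normalize_code(cls, value):
        # " us " -> "US"; non-strings fall through to normal type validation.
        return value.strip().upper() if isinstance(value, str) else value

    @field_validator("spend", mode="before")
    @classmethod
    def normalize_spend(cls, value):
        # "1,234.50" -> "1234.50"; an empty string becomes None.
        if isinstance(value, str):
            return value.replace(",", "").strip() or None
        return value


def validate_records(raw_records: Iterable[dict]) -> Iterator[AdRecord]:
    """Yield only the records that conform; drop the rest at the boundary."""
    for raw in raw_records:
        try:
            yield AdRecord.model_validate(raw)
        except ValidationError:
            # A fuller version would probably count or log the rejects here.
            continue
```

Because validate_records is a generator, it can sit between either extraction backend and the exporter without buffering anything.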

Memory matters more than you think at scale

A search across multiple countries and keywords can return a lot of records, and an early version of this tool held everything in memory before writing it out, which works fine until a result set outgrows available RAM. I rebuilt the persistence layer as a streaming exporter for both CSV and JSON, writing each record as it arrives instead of batching everything first. Memory usage stays roughly flat regardless of how many ads come back, which matters a lot more once a search returns thousands of results instead of a handful.
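
The streaming idea can be sketched with only the standard library. The function names are hypothetical, and writing JSON as an incrementally emitted array (rather than, say, one object per line) is an assumption about the output format.

```python
import csv
import json
from typing import IO, Iterable, Sequence


def stream_csv(records: Iterable[dict], out: IO[str], fieldnames: Sequence[str]) -> int:
    """Write each record as it arrives; return how many rows were written."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for record in records:
        writer.writerow(record)
        count += 1
    return count


def stream_json_array(records: Iterable[dict], out: IO[str]) -> int:
    """Emit a valid JSON array one element at a time, never holding the full list."""
    out.write("[")
    count = 0
    for record in records:
        # Separator goes before every element except the first.
        out.write(",\n" if count else "\n")
        json.dump(record, out, ensure_ascii=False)
        count += 1
    out.write("\n]\n")
    return count
```

Both functions accept any iterable, so they can consume a generator directly and only one record is in flight at a time.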

Rate limits aren't an edge case, they're the default

Anyone who's worked with Meta's APIs knows HTTP 429s aren't rare — they're expected behavior once you're making any real volume of requests. A scraper that crashes on the first rate limit isn't really a scraper, it's a demo. I integrated tenacity for exponential backoff retries on both the API and browser paths, so transient errors and rate-limiting get absorbed instead of taking the whole run down.

The CLI is part of the product, not an afterthought

It's easy to treat the command line as a throwaway wrapper around the "real" logic. But I didn't want to run a tool that gave me no feedback during a multi-minute scrape, so I built the CLI with click and rich — progress spinners, formatted logs, and a search summary at the end. None of that changes what the tool does under the hood, but it changes whether you actually want to run it.
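
A stripped-down version of that kind of CLI, assuming click and rich, could look like the sketch below. The option names and the placeholder_search function are hypothetical stand-ins; a real command would call the extraction layer instead.

```python
import click
from rich.console import Console
from rich.table import Table


def placeholder_search(keyword: str, country: str):
    """Stand-in for the real extraction layer; yields three dummy records."""
    for i in range(3):
        yield {"ad_id": f"{country}-{i}", "keyword": keyword}


@click.command()
@click.option("--keyword", required=True, help="Search term.")
@click.option("--country", "countries", multiple=True, required=True,
              help="Country code; repeat the option for several countries.")
def search(keyword, countries):
    """Search for ads and print a per-country summary."""
    console = Console()
    totals = {}
    # The spinner is only drawn on an interactive terminal.
    with console.status("Searching...") as status:
        for country in countries:
            status.update(f"Searching {country}...")
            totals[country] = sum(1 for _ in placeholder_search(keyword, country))
            console.log(f"{country}: {totals[country]} ads")

    table = Table(title="Search summary")
    table.add_column("Country")
    table.add_column("Ads", justify="right")
    for country, count in totals.items():
        table.add_row(country, str(count))
    console.print(table)
```

Saved as a hypothetical cli.py that calls search() under a main guard, this would be run as: python cli.py --keyword shoes --country US --country DE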

What I'd do differently

If I rebuilt this today, I'd push validation even earlier — catching malformed records right at the point of interception rather than after they've already been parsed into intermediate objects. I'd also want a caching layer so repeated searches for the same keyword/country pair don't re-hit Meta unnecessarily. Scraping tools age fast; the parts that survive UI redesigns and API changes are usually the parts you over-invested in early — validation, retries, and a clean separation between "how we get the data" and "what we do with it."

That's roughly the shape of it: two interchangeable extraction strategies, a validation boundary that doesn't trust anything coming in, streaming I/O so memory never becomes the bottleneck, and resilience built in from the start rather than bolted on after the first crash. If you're building anything that talks to a platform you don't control, that combination — multiple access paths, strict validation, and assuming failure is normal — will save you more time than any single clever trick.