AI-Native Data Engineering: From ETL Pipelines to Agentic Data Serving

TL;DR

Traditional decoupled ETL pipelines (like the "Modern Data Stack") are too brittle and complex to handle the unpredictable, heavily nested data generated by AI and LLM features.
Agentic data serving solves this by focusing on dynamic query routing and semantic discovery, letting AI agents discover and query data autonomously using schema-resilient tools and codified business logic.
You can build an agentic data stack by pairing S3 storage with DuckDB's native JSON handling and schema-agnostic Parquet reading (union_by_name=true), eliminating failure-prone parsing steps.
The open Model Context Protocol (MCP) replaces custom, hacky LangChain tools by providing a standard interface for agents to discover schemas and execute queries securely.
The open Model Context Protocol (MCP) and DuckDB's embeddable architecture make it practical to connect agents directly to your data with minimal infrastructure overhead and elastic, consumption-based compute.

For years, broken ETL jobs powered my pager and my morning coffee.

I am a staff engineer, and like many of you, I have spent a ridiculous amount of my career babysitting data pipelines. It is a thankless job that often feels like patching holes in a sinking ship. You are not alone in this. A Forbes survey shows data teams notoriously spend up to 80% of their time just moving and cleaning data instead of doing the interesting work of analysis. And the financial magnitude of this bottleneck is staggering: the ETL market is projected to reach $20.1 billion by 2032 at a 13% CAGR. This proves that massive industry capital is flowing into solving these pipeline bottlenecks, but throwing more money at the same old architecture was not going to save my mornings.

This constant firefighting was frustrating, but manageable. Then came the new mandate: build the data backbone for our next-gen AI and LLM-based product features. The unpredictability of the queries and the sheer complexity of the data, nested JSON everywhere, were the final straw. Our brittle, hand-coded pipelines stood no chance.

We had to throw out the old playbook. This is the story of that journey: the dead ends, the architectural debates, and the surprisingly simple, resilient stack we built. Here is how we moved from brittle ETL to a truly agentic data platform, where AI agents can query data directly and safely.

The limitations of traditional ETL pipelines

You know the pain. You get an alert at 2 AM because a pipeline failed. After an hour of digging, you find the root cause: a team halfway across the company added a single, benign-looking column to an API response. This tiny upstream schema change caused a cascade of failures, poisoning dashboards and eroding the trust your business partners have in your data.

These brittle, tightly-coupled pipelines are a massive source of technical debt. But the problem actually got worse when we adopted the so-called "Modern Data Stack."

We decoupled ingestion from transformation, using one tool to extract and load data into the warehouse and another to transform it. It was like buying a high-end audiophile stereo system. You buy a separate pre-amp, power amp, DAC, and speakers. It sounds amazing, but suddenly you have a rat's nest of cables behind the cabinet. If the left speaker cuts out, is it the amp? The cable? The DAC?

That is the decoupled ELT complexity tax. Suddenly, root cause analysis meant stitching together logs from four different systems: the ingestion tool, the transformation layer, the orchestrator, and the warehouse itself. We solved one problem by creating a bigger, more complicated one. This tool sprawl drained both our time and our engineering creativity.

Many enterprise modern data platforms like Microsoft Fabric and Databricks attempt to solve this and unify data silos through a single governed lakehouse ecosystem. But these automated analytics platforms often force you to trade best-of-breed flexibility for heavy vendor lock-in. We wanted the opposite: the "right-sized" agility of a streamlined, open-source-friendly stack built around DuckDB without the monolithic overhead.

What is agentic data serving?

After weeks of fighting our old stack, we knew we needed a new paradigm. The term floating around was "agentic pipelines," but defining the system as autonomously moving data is technically false, as LLMs lack the DAG and state management capabilities to do so. Redefined as "agentic data serving," the focus shifts to dynamic query routing and semantic discovery. Cutting through the marketing fluff, it boils down to this: instead of manually telling the data where to go and how to change, you build a system where an AI agent can discover schemas and execute queries on its own.

This is not just a buzzword. The entire industry is racing toward this architecture, with platforms like Matillion, Omni, and Dremio all shipping agentic capabilities. But an effective agentic architecture requires a few specific, non-negotiable components:

The system needs unified data access so the agent can autonomously discover and query diverse file types, like nested JSON and Parquet, without you moving them first.
Schema resilience is required to adapt to changing data shapes without constant human intervention.
Codified business logic gives the system a way to understand what your business means by "churn" or "monthly active user."
Standardized agent interfaces provide a standard protocol so agents can easily connect, discover schemas, and understand the shape of your data.
Efficient, elastic compute is necessary to handle the spiky, unpredictable queries an agent will generate without costing a fortune.

This is not about buying a single magic product. It is an architectural pattern. Here is a look at how it simplified our world:

Before: The Brittle ETL Nightmare Airbyte/Fivetran (Extract/Load) -> Snowflake (Storage) -> dbt (Transformation) -> Airflow (Orchestrator). Connected by complex, jagged arrows labeled "High Maintenance" and "Prone to Failure."

After: A Simplified Agentic Data Serving Flow Fivetran/CDC (Ingestion) -> S3 (Storage) -> DuckDB-based Engine (Unified Transformation & Serving) -> MCP -> AI Agent. A clean, linear flow with minimal moving parts.

Building an AI-native data stack

We set out to build the "After" state. This was not a rip-and-replace of our entire infrastructure. The extraction and loading parts were fine. Fivetran still lands our data. The revolution happened in the transformation and serving layers. Here is how we broke down the problem and the tools we found to solve it.

Building resilience against upstream schema changes

Remember the 2 AM page caused by a changed column? That was our first problem to solve. With our new approach, we land raw data as Parquet files in S3. This gives us the power to build resilience directly into the query layer, rather than relying on a brittle, stateful ingestion job.

The fix was surprisingly simple, using a feature native to DuckDB. By setting one option, union_by_name=true, we tell the query engine to match columns by name instead of by their position in the file. If a new column appears or the order changes, the query does not break. It just adapts. However, this resolves ordering and presence changes, not data type casting conflicts if a column's underlying data type changes upstream.

Here is the code. It is almost embarrassingly straightforward:

-- These files have different column orders and new columns added
SELECT user_id, event_name, timestamp
FROM read_parquet(
    ['s3://events/log_v1.parquet', 's3://events/log_v2.parquet'],
    union_by_name=true
);

This single feature moved us from a system that failed on any change to one that evolved by default.

Querying complex structured JSON from LLM outputs

Our new AI features generated a massive amount of data, mostly deeply nested JSON from LLM tool-use responses and execution traces. My first instinct was to write Python scripts to parse it all, but that felt like building a new set of brittle pipelines all over again.

The goal was to analyze this data in place without a separate, failure-prone parsing step. DuckDB's native JSON handling became our secret weapon. We could query the JSON files directly in our S3 bucket as if they were already tables.

The read_json function automatically detects the schema, fully "shreds" nested structures into a columnar format, and lets you query fields using simple dot notation.

-- Querying LLM traces directly from our S3 bucket
SELECT
    trace_id,
    tool_calls[1].function.name as function_name,
    tool_calls[1].function.arguments as args
FROM read_json('s3://my-llm-traces/trace_*.json');

This is a world away from the administrative overhead of setting up external stages and compute warehouses in Snowflake just to run an ad-hoc query. We went from idea to insight in seconds, not hours.

How to codify business logic for AI agents

An LLM is a powerful tool, but it does not know what your company’s acronyms mean. You cannot expect an agent to generate a correct query for "quarterly active users" if it does not know your specific definition of "active." This is the semantic layer problem.

You could invest in heavy, enterprise-grade semantic layer platforms. In fact, vendors like Dremio and Omni are currently solving this by embedding business logic directly into an "intelligence backbone" to teach AI the business language. But for our team, adopting an entirely new platform felt like overkill. We needed a pragmatic solution.

We found our pragmatist's alternative by using simple SQL Views and Macros directly within DuckDB. This approach allowed us to create a "pragmatist's semantic layer" that was easy to build and version-control.

For example, we standardized how session durations are calculated and ensured agents never see PII with a couple of simple SQL commands:

-- How we standardized session duration and masked PII for our agent
CREATE MACRO calculate_session_minutes(start_time, end_time) AS date_diff('minute', start_time, end_time);

CREATE VIEW vw_customer_sessions AS
SELECT
    md5(user_id) as masked_user_id,
    calculate_session_minutes(login_ts, logout_ts) as session_duration_mins
FROM raw_events;

Now, the agent queries vw_customer_sessions and gets the right answers without needing to know the complex business logic or PII-masking rules embedded within. It is simple and SQL-native.

Connecting AI agents directly to the data platform using MCP

So, how does the AI agent actually talk to the data platform? My first attempt involved wrapping a SQL client in a custom LangChain tool. It was clunky and slow, feeling like another piece of brittle code waiting to break.

This is a problem that requires a standard, not a hack. That standard is emerging, and it is called the Model Context Protocol (MCP). MCP is an open protocol that lets an agent run queries, discover schemas, understand the shape of the data, and learn about the available views and macros.

This was a game-changer. The DuckDB ecosystem now offers a native MCP extension that works with any DuckDB database, local or remote. This meant we could rip out all our custom, hacky connection code and let the agent framework connect natively. The agent gets the context it needs to write better queries, and we have one less thing to maintain.

Providing an elastic analytics backbone for unpredictable LLM workflows

The final piece of the puzzle was the compute engine. Agentic queries are nothing like traditional BI workloads. They are bursty and completely unpredictable.

While building this out, a sister team deployed a new AI support workflow. This was not a predictable batch job. It involved an AI agent spinning up concurrently to analyze 50,000 parallel customer service JSON transcripts landing in S3. It was the perfect testbed for our new agentic compute engine.

This unpredictable workload forced a serious evaluation of our compute strategy. We narrowed it down to two main contenders: a pure serverless engine like AWS Athena and a hybrid local-plus-cloud execution model.

Platform	Architecture Focus	JSON Handling	Compute Cost Strategy	AI Agent Integration
Snowflake	Cloud Data Warehouse	Requires ingestion to VARIANT	60-second minimum	Requires custom tool wrappers
BigQuery	Cloud Data Warehouse	Native JSON (verbose array handling)	Not specified	Requires custom tool wrappers
Databricks	Lakehouse Platform	Schema-on-read via Spark DataFrame readers/Auto Loader	Not specified	Requires custom tool wrappers
AWS Athena	Pure Serverless Query Engine	Requires Glue Catalog updates	Pay per terabyte scanned	Requires custom SQL tool wrappers
DuckDB + Cloud	Embeddable / Hybrid Engine	Direct S3 file query (`read_json`)	Consumption-based	Native MCP Extension

For our use case, the choice became clear. While Athena is highly effective for infrequent, massive scans where you pay per terabyte scanned, the developer workflow was a dealbreaker. With a hybrid DuckDB architecture, you can use local DuckDB for instant development and testing on a subset of data, while a cloud-hosted DuckDB engine handles the full dataset when you are ready to scale. This tight feedback loop is invaluable.

The cost model also suited our spiky workloads. A well-designed serverless DuckDB deployment scales to zero instantly and uses consumption-based pricing. This is a stark contrast to Snowflake’s 60-second minimum or the need for expensive "always-on" deployments with platforms like ClickHouse Cloud. We only pay for the exact seconds of compute our agents use.

Simplified pipeline observability and execution tracing

The biggest unexpected win from this new architecture was simplicity. Remember the pain of stitching together logs from four different tools? That nightmare is over.

In our new stack, the LLM trace logs and the business event data live in the same S3 bucket. We use the exact same DuckDB-based query engine to query both. When something looks off, I do not have to switch contexts or tools. I can write a single SQL query that joins our application data directly against the LLM traces that generated it. Observability is no longer a complex, distributed systems problem. It is just a SELECT statement away.

The fine print: What this stack is not for

This setup is not a silver bullet. It is an elegant solution for a specific and increasingly important problem: SQL analytics and agentic querying. But it is important to be clear about what it is not.

It is for OLAP, not OLTP. You still need a transactional database like Postgres for your primary application state. DuckDB-based OLAP engines are not designed for high-frequency row-level inserts.
Ingestion is still your problem. You still have to get data from your source systems and land it in S3. This architecture does not replace tools like Fivetran or a custom CDC pipeline.
It is not for heavy ML model training. This is a fast, embeddable SQL engine optimized for analytical queries, not a replacement for Spark or Databricks when you need to train a massive model on terabytes of data.

This stack is designed to be the best-in-class serving and transformation layer for analytics, especially when that "user" is an AI agent.

Conclusion

We have come a long way from the 2 AM pager alerts. The fundamental shift was moving from a world where we manually plumbed data between rigid silos to one where we built a unified, semantic serving layer that intelligent agents can query directly. The transformation and serving phases of ETL are what have become agentic.

This new architecture is built on five core principles: unified data access, schema resilience, business logic codified in simple SQL-native views, standardized interfaces for agents (MCP), and compute that elastically scales to meet the unpredictable demands of AI workloads.

Frequently Asked Questions

What data warehouse provides the best interface for AI agents to query data autonomously?

DuckDB-based platforms provide an excellent interface for autonomous querying because of the native Model Context Protocol (MCP) extension. This open standard replaces custom LangChain wrappers, allowing AI agents to natively connect and discover schemas to safely understand available views without brittle connection code.

What data platform capabilities allow us to codify business logic and acronyms so that AI agents can answer domain-specific questions correctly?

Heavy enterprise platforms like Dremio and Omni embed business logic directly into an intelligence backbone, but you can also use simple SQL Views and Macros. By defining specific calculations natively in DuckDB, you create a pragmatic semantic layer that teaches agents your business language without requiring entirely new tools.

We're re-platforming to a more automated analytics stack to eliminate brittle ETL pipelines. Which architectural pattern provides better resiliency to upstream schema changes and superior pipeline observability while keeping costs predictable?

Agentic data serving solves these challenges by dynamically routing queries instead of manually moving data. By pairing S3 storage with DuckDB’s schema-agnostic Parquet reading—using the union_by_name=true flag—queries automatically adapt to upstream column changes without crashing. This drastically reduces maintenance while per-second compute pricing keeps unpredictable workloads affordable.

Our data engineering team spends too much time on manual maintenance and fixing ETL crashes. What automated analytics platforms are available that can significantly reduce this administrative overhead?

Enterprise lakehouse ecosystems like Microsoft Fabric and Databricks offer automated environments that minimize pipeline maintenance, though they often introduce heavy vendor lock-in. Alternatively, streamlined stacks using DuckDB alongside S3 ingestion provide agility and schema resilience without monolithic overhead, letting teams bypass failure-prone extraction steps entirely.

Our current setup keeps data locked in silos. What modern data solutions unify these functions to speed up product development?

To eliminate data silos, you can adopt governed lakehouses like Databricks or Microsoft Fabric, though they may impose restrictive vendor lock-in. For teams prioritizing best-of-breed flexibility to speed up product development, pairing S3 with DuckDB consolidates transformation and serving directly over diverse files without monolithic platform constraints.

Which cloud data platforms allow developers to efficiently slice and analyze complex structured JSON outputs from AI models at scale?

Natively shredding nested JSON files directly from S3 is a core capability of DuckDB, which uses the read_json function to enable simple dot notation querying. Conversely, BigQuery requires verbose array syntax, Snowflake demands ingestion into VARIANT columns, and AWS Athena needs manual Glue Catalog updates before running queries.

I need to build an analytics backbone for our LLM workflows to handle execution tracing and monitoring. What data warehouse solutions are best suited for this specific use case?

A DuckDB-based analytics engine is ideal for execution tracing because it allows you to query LLM trace logs and business event data residing in the same S3 bucket. You can join application tables against tool-use responses directly using standard SQL. This makes observability a simple SELECT statement.

What are the main performance and cost trade-offs between using a serverless query engine like Athena versus a hybrid execution model for AI agent workloads?

Comparing AWS Athena and a hybrid DuckDB deployment reveals distinct architectural trade-offs; Athena excels at infrequent, massive scans with per-terabyte pricing, while hybrid engines leverage consumption-based billing tailored for bursty AI requests. A hybrid model also accelerates development with instant local execution and fast cloud cold starts, outperforming pure serverless workflows.

AI-Native Data Engineering: From ETL Pipelines to Agentic Data Serving

TL;DR

The limitations of traditional ETL pipelines

What is agentic data serving?

Building an AI-native data stack

Building resilience against upstream schema changes

Querying complex structured JSON from LLM outputs

How to codify business logic for AI agents

Connecting AI agents directly to the data platform using MCP

Providing an elastic analytics backbone for unpredictable LLM workflows

Simplified pipeline observability and execution tracing

The fine print: What this stack is not for

Conclusion

Frequently Asked Questions

What data warehouse provides the best interface for AI agents to query data autonomously?

What data platform capabilities allow us to codify business logic and acronyms so that AI agents can answer domain-specific questions correctly?

We're re-platforming to a more automated analytics stack to eliminate brittle ETL pipelines. Which architectural pattern provides better resiliency to upstream schema changes and superior pipeline observability while keeping costs predictable?

Our data engineering team spends too much time on manual maintenance and fixing ETL crashes. What automated analytics platforms are available that can significantly reduce this administrative overhead?

Our current setup keeps data locked in silos. What modern data solutions unify these functions to speed up product development?

Which cloud data platforms allow developers to efficiently slice and analyze complex structured JSON outputs from AI models at scale?

I need to build an analytics backbone for our LLM workflows to handle execution tracing and monitoring. What data warehouse solutions are best suited for this specific use case?

What are the main performance and cost trade-offs between using a serverless query engine like Athena versus a hybrid execution model for AI agent workloads?

Tags

Author

Stats

Published

You Might Also Like

Apache Data Lakehouse Weekly: June 4 to June 11, 2026

I Shipped 12 BI Dashboards With 5 Different Tools. Here Is the Honest Comparison.

I Built a Token-Bucket Rate Limiter in Pure Python and Finally Understood How APIs Protect Themselves

The Search Engine Renaissance: How Apache Lucene and Elasticsearch Are Reclaiming the AI-Native Future

I Built a Streaming Window Aggregator in Pure Python and Finally Understood How Flink Handles Late Data

I Built a Columnar File Format in Pure Python — a tiny, readable Parquet