The Spreadsheet is Not Good Enough: Building an Autonomous Bioinformatics Quality Analyst

LangGraph state machine with cycles · Claude Sonnet · PubMed integration · Real concordance data · Zero human intervention per step

Most AI demos show you a chatbot answering questions. This post shows you something different: an agent that decides what to do, calls the right tools in the right order, handles failures gracefully, searches scientific literature for context, and produces a structured clinical-grade quality report -- without you telling it what to do at each step.

That distinction is the entire point of agentic AI. And it is harder to build correctly than most tutorials suggest.

What BioAgent actually does

BioAgent is an autonomous bioinformatics quality analyst. You give it a sample ID. It does the rest.

Specifically, it:

Queries a live germline variant calling pipeline API for concordance and reproducibility metrics
Reasons about the findings against GIAB HG001 v4.2.1 benchmarks
Constructs a targeted PubMed search query from the actual metric values
Searches NCBI for relevant literature and retrieves abstracts
Synthesises everything into a structured, evidence-based quality report

Here is a real excerpt from a report BioAgent generated autonomously on live data:

Sample HG001 passes all primary quality thresholds across three replicate runs completed on 2026-05-28, with SNV F1 (0.9928) and Indel F1 (0.9656) both exceeding their respective GIAB HG001 v4.2.1 benchmark minima, and reproducibility metrics indicating excellent run-to-run consistency (ICC = 0.9847, median CV = 4.2%). No active alerts or Westgard violations were detected.

That was written by the agent, not by me. The agent fetched the data, interpreted it, and wrote that paragraph.

Why LangGraph and not a plain agent

This is the question worth spending time on, because the answer reveals the core architectural decision.

A plain LangChain agent with tools works like this: the LLM sees your message, decides which tool to call, calls it, sees the result, and decides what to do next. It is a single unstructured loop that stops when the LLM decides it is done.

That is fine for simple tasks. It is not fine for an agent that needs to:

Retry a data fetch if the API returns an error
Reformulate a PubMed search query if the first one returns nothing relevant
Degrade gracefully if the upstream API is completely unreachable
Maintain typed state across multiple steps so each node can see what previous nodes found

LangGraph models the agent as a directed graph where nodes are reasoning or action steps and edges are the transitions between them. Crucially, edges can be conditional -- the agent routes to different nodes based on what it finds. And the graph can have cycles -- nodes can loop back to previous nodes when needed.
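
The typed state that nodes share could look something like the minimal sketch below. Only `failed_tools` and `fetch_retries` appear in the routing code shown in this post; every other field name, and the `initial_state` helper, is an assumption made for illustration rather than BioAgent's actual schema.

```python
from typing import Any, Dict, List, TypedDict


class AgentState(TypedDict):
    """Shared state passed from node to node (illustrative field names)."""
    sample_id: str
    tool_results: Dict[str, Any]      # assumed: raw output keyed by tool name
    failed_tools: List[str]           # read by the routing function in the post
    fetch_retries: int                # read by the routing function in the post
    pubmed_query: str                 # assumed: written by the analyse node
    citations: List[Dict[str, Any]]   # assumed: written by search_literature
    report: str                       # assumed: written by synthesise_report


def initial_state(sample_id: str) -> AgentState:
    """Build an empty state for one run (hypothetical helper)."""
    return AgentState(
        sample_id=sample_id,
        tool_results={},
        failed_tools=[],
        fetch_retries=0,
        pubmed_query="",
        citations=[],
        report="",
    )


state = initial_state("HG001")
print(state["sample_id"], state["fetch_retries"])
```

A `TypedDict` is an ordinary dict at runtime, so each node can return a partial update while type checkers still catch misspelled keys.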

Here is what BioAgent's graph looks like:

START
  |
  v
[fetch_data]  -- calls 5 pipeline API tools
  |
  +-- critical tools failed, retries exhausted --> [graceful_degradation] --> END
  |
  +-- critical tools failed, retries remain -----> [fetch_data]  (cycle)
  |
  +-- data collected
        |
        v
   [analyse]  -- LLM constructs PubMed query from metric values
        |
        v
[search_literature]  -- PubMed search, retry with broader query if empty
        |
        v
[synthesise_report]  -- LLM writes the full report
        |
        v
       END

The cycles are bounded -- maximum 1 fetch retry, maximum 2 PubMed retries. Without bounds, a cycle can consume your entire API credit. Deterministic bounds are a non-negotiable property for any agent running on paid infrastructure.
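
To make the bound concrete, here is a dependency-free sketch of the fetch cycle. It is plain Python rather than LangGraph's API, and `run_fetch_cycle` is a hypothetical name; the only value taken from the post is the limit of one fetch retry.

```python
from typing import Callable, Tuple

MAX_FETCH_RETRIES = 1  # bound stated in the post


def run_fetch_cycle(fetch: Callable[[], bool]) -> Tuple[str, int]:
    """Call `fetch` until it succeeds or the retry budget is spent.

    Returns the name of the next node and the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        if fetch():
            return "analyse", attempts
        # One initial attempt plus MAX_FETCH_RETRIES retries, then give up.
        if attempts > MAX_FETCH_RETRIES:
            return "graceful_degradation", attempts


always_down = run_fetch_cycle(lambda: False)

outcomes = iter([False, True])  # fails once, then recovers
recovers = run_fetch_cycle(lambda: next(outcomes))

print(always_down, recovers)
```

However the upstream API behaves, the loop makes at most two calls, so the worst-case cost of a run is known in advance.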

The tool design that most tutorials get wrong

Many LangGraph tutorials put "generate report" in the tools list alongside "call API" and "search database". This is architecturally wrong.

Tools are for deterministic actions -- things that call external systems, query databases, or perform calculations. The results are factual and external to the LLM.

Report generation is what the LLM does with the data it has collected. It is not a tool call. It is reasoning.

Mixing these two categories produces an agent that is harder to debug, harder to test, and easier to hallucinate from. When report generation is a tool, the agent can call it before collecting enough data. When it is a node in the graph that only executes after all data nodes have completed, that cannot happen.

BioAgent has six tools and zero "generate" tools:

Tool	What it does
`get_pipeline_runs`	Calls `GET /api/v1/runs`
`get_concordance_summary`	Calls `GET /api/v1/concordance/summary/{id}`
`get_concordance_results`	Calls `GET /api/v1/concordance`
`get_reproducibility`	Calls `GET /api/v1/reproducibility/{id}/latest`
`get_active_alerts`	Calls `GET /api/v1/alerts`
`search_pubmed`	Queries NCBI Entrez for relevant papers

Every tool returns a structured dict with a success flag. On failure it returns a structured error -- not an exception. The agent uses these flags for routing decisions. The LLM never sees a Python traceback.
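
One way to get that behaviour is a small decorator, sketched below. The `structured_tool` name and the exact keys of the returned dict are assumptions, and the tool body is a stand-in that simulates an unreachable API rather than making a real HTTP call.

```python
import functools
from typing import Any, Callable, Dict


def structured_tool(fn: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so it always returns a dict with a `success` flag."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        try:
            return {"success": True, "tool": fn.__name__, "data": fn(*args, **kwargs)}
        except Exception as exc:  # deliberately broad: no traceback reaches the LLM
            return {
                "success": False,
                "tool": fn.__name__,
                "error_type": type(exc).__name__,
                "error": str(exc),
            }
    return wrapper


@structured_tool
def get_concordance_summary(sample_id: str) -> Dict[str, float]:
    # Stand-in for the real HTTP call; simulates an unreachable API.
    raise ConnectionError("pipeline API unreachable")


result = get_concordance_summary("HG001")
print(result["success"], result["error_type"])
```

Because failure is ordinary data, the routing function can test `result["success"]` without any try/except of its own.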

The PubMed keyword strategy that actually works

Vague PubMed queries return vague results. "bioinformatics quality" returns thousands of papers, none of which are specifically relevant to what your agent found.

BioAgent constructs queries from the actual metric values:

Finding	Query constructed
SNV F1 below 0.98	`germline variant calling sensitivity specificity GIAB`
Indel F1 below 0.95	`indel calling accuracy short read sequencing`
ICC below 0.90	`intraclass correlation coefficient sequencing reproducibility`
VAF CV above 15%	`variant allele frequency technical variation replicate`
All metrics passing	`germline variant calling quality validation clinical`

This is implemented in the analyse node. The LLM receives the actual metric values and constructs the most specific query it can from them. The result is citations that are directly relevant to the findings -- not generic literature padding.
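
In BioAgent the LLM builds the query, but the mapping in the table can be sketched as a rule-based function, which is also a useful test fixture. The metric key names and the first-match-wins ordering are assumptions; the thresholds and query strings come from the table above.

```python
from typing import Dict

# (metric key, "is this a problem?" predicate, query), in table order.
QUERY_RULES = [
    ("snv_f1", lambda v: v < 0.98, "germline variant calling sensitivity specificity GIAB"),
    ("indel_f1", lambda v: v < 0.95, "indel calling accuracy short read sequencing"),
    ("icc", lambda v: v < 0.90, "intraclass correlation coefficient sequencing reproducibility"),
    ("vaf_cv_pct", lambda v: v > 15.0, "variant allele frequency technical variation replicate"),
]
ALL_PASSING_QUERY = "germline variant calling quality validation clinical"


def build_pubmed_query(metrics: Dict[str, float]) -> str:
    """Return the query for the first flagged metric, else the all-passing query."""
    for key, is_flagged, query in QUERY_RULES:
        if key in metrics and is_flagged(metrics[key]):
            return query
    return ALL_PASSING_QUERY


# Values from the report excerpt earlier in the post: everything passes.
passing = build_pubmed_query({"snv_f1": 0.9928, "indel_f1": 0.9656, "icc": 0.9847})

# Hypothetical run with a weak indel score, for illustration only.
weak_indel = build_pubmed_query({"snv_f1": 0.9928, "indel_f1": 0.93})

print(passing)
print(weak_indel)
```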

If the first query returns nothing, the agent retries with a broader fallback query. If that also returns nothing, it proceeds without citations rather than citing irrelevant papers. A system that cites whatever it can find is worse than a system that cites nothing.
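
The retry-then-give-up behaviour can be sketched as follows. The search function is injected so the example runs without network access; `search_with_fallback` and the placeholder result are hypothetical, and only the limit of two PubMed retries comes from the post.

```python
from typing import Callable, List

MAX_PUBMED_RETRIES = 2  # bound stated in the post


def search_with_fallback(
    search: Callable[[str], List[str]],
    queries: List[str],
) -> List[str]:
    """Try each query in order, most specific first; return [] if all are empty."""
    attempts_allowed = 1 + MAX_PUBMED_RETRIES
    for query in queries[:attempts_allowed]:
        hits = search(query)
        if hits:
            return hits
    return []  # proceed without citations rather than pad the report


# Fake index standing in for PubMed: only the broad query has a match.
fake_index = {"broad query": ["paper-A"]}


def fake_search(query: str) -> List[str]:
    return fake_index.get(query, [])


found = search_with_fallback(fake_search, ["specific query", "broad query"])
nothing = search_with_fallback(fake_search, ["specific query", "another miss"])
print(found, nothing)
```

Returning an empty list is a deliberate outcome here, not an error: the report node can then state that no relevant literature was found.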

Graceful degradation -- the feature that is often skipped

Most tutorial agents assume their tools work. In production, tools eventually fail.

BioAgent's most important feature might be what it does when the pipeline API is unreachable. Instead of hallucinating data, it enters a graceful_degradation node that:

Lists exactly which tools failed and why
Tells the user precisely how to fix the problem
Exits without generating a report

I was unable to complete the analysis for sample HG001.

The following tools failed to return data:
- get_pipeline_runs
- get_concordance_summary

To start the API, run this in your terminal:

    cd ~/biomarker-concordance-pipeline
    source .venv/bin/activate
    uvicorn api.main:app --host 0.0.0.0 --port 8000

A system that generates clinical-looking reports based on no data is dangerous. Graceful degradation is not a nice-to-have. It is a safety property.
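
A degradation node mostly amounts to formatting what failed. The sketch below assumes failures arrive as a mapping from tool name to error text; `degradation_message` is a hypothetical helper, and its wording follows the example output above.

```python
from typing import Dict


def degradation_message(sample_id: str, failures: Dict[str, str]) -> str:
    """Render a no-report message listing each failed tool and its error."""
    lines = [
        f"I was unable to complete the analysis for sample {sample_id}.",
        "",
        "The following tools failed to return data:",
    ]
    lines += [f"- {tool}: {reason}" for tool, reason in sorted(failures.items())]
    return "\n".join(lines)


message = degradation_message(
    "HG001",
    {
        "get_pipeline_runs": "connection refused",
        "get_concordance_summary": "connection refused",
    },
)
print(message)
```

The function takes no metric values at all, so there is nothing for a report to be fabricated from on this path.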

The routing logic that decides whether to degrade is a simple conditional edge function:

def route_after_fetch(state: AgentState) -> str:
    critical = {"get_concordance_summary", "get_pipeline_runs"}
    critical_failed = critical.intersection(set(state["failed_tools"]))

    if critical_failed and state["fetch_retries"] >= MAX_FETCH_RETRIES:
        return "graceful_degradation"
    if critical_failed and state["fetch_retries"] < MAX_FETCH_RETRIES:
        return "fetch_data"  # retry
    return "analyse"

Under ten lines of code. Critical behaviour.

The async/Streamlit problem and how to solve it

Streamlit reruns the entire script on every user interaction. Running an async LangGraph stream inside Streamlit immediately hits RuntimeError: This event loop is already running.

The solution is two parts. First, apply nest_asyncio to patch the running event loop:

import nest_asyncio
nest_asyncio.apply()

Second, run the agent in a separate thread so the main Streamlit thread can animate a progress indicator while the agent works:

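import threading
import time

# _run_agent, steps and status_placeholder are defined elsewhere in the app
thread = threading.Thread(target=_run_agent, daemon=True)
thread.start()

step_idx = 0
while thread.is_alive():
    # Rotate through the step labels while the agent is still working
    status_placeholder.info(f"Running: {steps[step_idx % len(steps)]}...")
    time.sleep(1.5)
    step_idx += 1

thread.join()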

The result is a chat interface where the user sees the agent working in real time -- "Fetching concordance data...", "Searching PubMed...", "Generating report..." -- rather than a 25-second freeze followed by a wall of text. For a portfolio demo where you will be screen-sharing with an interviewer, this matters enormously.
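
The snippet above does not show how the report gets back from the worker thread. One common approach, sketched here without Streamlit so it runs anywhere, is a shared holder dict that the thread writes to. `run_with_progress` and `slow_agent` are hypothetical names, and the short sleeps stand in for a real agent run.

```python
import threading
import time
from typing import Any, Callable, Dict, List


def run_with_progress(
    work: Callable[[], Any],
    steps: List[str],
    poll_seconds: float = 0.05,
) -> Dict[str, Any]:
    """Run `work` in a thread and cycle through step labels until it finishes."""
    holder: Dict[str, Any] = {}

    def _target() -> None:
        try:
            holder["result"] = work()
        except Exception as exc:  # hand errors back to the main thread
            holder["error"] = exc

    thread = threading.Thread(target=_target, daemon=True)
    thread.start()

    step_idx = 0
    shown: List[str] = []
    while thread.is_alive():
        shown.append(steps[step_idx % len(steps)])  # a UI would render this label
        time.sleep(poll_seconds)
        step_idx += 1

    thread.join()
    holder["labels_shown"] = shown
    return holder


def slow_agent() -> str:
    time.sleep(0.2)  # stand-in for a multi-second agent run
    return "report text"


outcome = run_with_progress(slow_agent, ["Fetching concordance data", "Searching PubMed"])
print(outcome["result"], len(outcome["labels_shown"]))
```

Catching the exception inside the thread matters: an uncaught error in a worker thread would otherwise leave the UI with no result and no explanation.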

The FastAPI timeout problem and the correct solution

A full BioAgent run takes 15 to 30 seconds. Many HTTP clients, proxies, and load balancers time out at or before 30 seconds. If a run takes slightly longer, the client gets a timeout error even though the agent completed successfully.

The correct solution is the background task pattern. The API endpoint returns a job ID immediately (HTTP 202 Accepted) and the agent runs asynchronously. The client polls for results:

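from fastapi import HTTPException

@router.post("/analyse", status_code=202)
def analyse(payload: AnalyseRequest):
    job_id = str(uuid.uuid4())
    # Register the job before returning so an immediate poll finds it
    _jobs[job_id] = {"status": "queued"}
    executor.submit(_run_agent_job, job_id, payload.sample_id, payload.task)
    return {"job_id": job_id, "status": "queued"}

@router.get("/results/{job_id}")
def get_results(job_id: str):
    # Unknown IDs get a 404 instead of an unhandled KeyError (HTTP 500)
    if job_id not in _jobs:
        raise HTTPException(status_code=404, detail="Unknown job ID")
    return _jobs[job_id]  # returns current status and report when complete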

This is a standard pattern for long-running API operations. Returning a job ID immediately and polling for results is a sound architecture for anything that takes more than a few seconds.
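
The worker and the polling client are not shown in the post, so here is a minimal in-process sketch of both halves. Reading `_jobs` directly stands in for the HTTP GET on the results endpoint. The status values, the `submit_job` and `poll_until_done` helpers, and the simplified `_run_agent_job` signature are all assumptions.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

_jobs: Dict[str, Dict[str, Any]] = {}
executor = ThreadPoolExecutor(max_workers=2)


def _run_agent_job(job_id: str, run_agent: Callable[[], str]) -> None:
    """Worker: mark the job running, then store the report or the error."""
    _jobs[job_id] = {"status": "running"}
    try:
        report = run_agent()
        # Replace the whole entry so a poller never sees "complete" without a report.
        _jobs[job_id] = {"status": "complete", "report": report}
    except Exception as exc:
        _jobs[job_id] = {"status": "failed", "error": str(exc)}


def submit_job(run_agent: Callable[[], str]) -> str:
    """Register the job, hand it to the pool, and return its ID immediately."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"status": "queued"}
    executor.submit(_run_agent_job, job_id, run_agent)
    return job_id


def poll_until_done(job_id: str, interval: float = 0.05, timeout: float = 5.0) -> Dict[str, Any]:
    """Client side: poll until the job finishes or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = _jobs[job_id]
        if job["status"] in ("complete", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout} s")


def fake_agent() -> str:
    time.sleep(0.1)  # stand-in for a 15 to 30 second run
    return "report text"


finished = poll_until_done(submit_job(fake_agent))
print(finished["status"])
```

An in-memory dict is lost on restart and is not shared between worker processes, so a deployment with more than one process would need an external store.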

What the agent actually produced

Here is what BioAgent generated autonomously on real HG001 data from the companion pipeline project:

SNV Performance (from the report):

Metric	Observed	Threshold	Status
F1 mean	0.9928	>= 0.98	EXCELLENT
F1 min across runs	0.9925	>= 0.98	EXCELLENT
Precision	0.9921	>= 0.98	PASS
Recall	0.9934	>= 0.98	EXCELLENT

The agent also noted -- unprompted -- that SNV Precision at 0.9921, although above the 0.98 threshold floor, is the one SNV metric rated PASS rather than EXCELLENT, and explained that the small precision-recall gap is consistent with a slight false-positive tendency typical of short-read callers in low-complexity flanking regions. That is real clinical bioinformatics reasoning, not a template.

The indel analysis correctly applied the lower threshold (0.95 for indels vs 0.98 for SNVs) and explained the precision-recall asymmetry in terms of homopolymer and tandem-repeat calling difficulty.
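
The threshold logic itself is simple enough to check by hand. The sketch below uses only the floors and observed values quoted in this post; the `meets_floor` helper and the dict layout are hypothetical. It does not reproduce the PASS versus EXCELLENT distinction, because the cut-off for EXCELLENT is not stated here.

```python
from typing import Dict, Tuple

THRESHOLDS = {"snv": 0.98, "indel": 0.95}  # floors quoted in the post


def meets_floor(variant_type: str, value: float) -> bool:
    """True when a metric is at or above the floor for its variant type."""
    return value >= THRESHOLDS[variant_type]


observed: Dict[Tuple[str, str], float] = {
    ("snv", "f1_mean"): 0.9928,
    ("snv", "f1_min"): 0.9925,
    ("snv", "precision"): 0.9921,
    ("snv", "recall"): 0.9934,
    ("indel", "f1"): 0.9656,
}

results = {key: meets_floor(key[0], value) for key, value in observed.items()}
for (variant_type, metric), ok in results.items():
    print(variant_type, metric, "meets floor" if ok else "below floor")
```

Keeping this check deterministic and outside the LLM means the pass or fail status in the report can be verified independently of the prose around it.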

None of this was scripted. The agent reasoned from the data.

Cost reality check

One full BioAgent run (fetch data, analyse, search PubMed, generate report) uses approximately 7,000 tokens with Claude Sonnet 4.6. At current API pricing, that is roughly $0.04 to $0.06 per run.

The dashboard tracks runs per session and warns after 20 analyses. At $0.05 per run, 20 runs cost $1.00. The entire development of this project -- building, testing, debugging, and running the agent dozens of times -- cost less than $5.00 in API credit.
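
A session counter of this kind needs very little code. The sketch below uses the per-run cost and warning threshold from this post; the `SessionUsage` class is hypothetical, and warning once the twentieth run is reached (rather than on the twenty-first) is an assumption.

```python
from dataclasses import dataclass

COST_PER_RUN_USD = 0.05  # approximate figure from the post
WARN_AFTER_RUNS = 20     # warning threshold from the post


@dataclass
class SessionUsage:
    """Counts agent runs in one session and estimates their cost."""
    runs: int = 0

    def record_run(self) -> None:
        self.runs += 1

    @property
    def estimated_cost_usd(self) -> float:
        return round(self.runs * COST_PER_RUN_USD, 2)

    @property
    def should_warn(self) -> bool:
        return self.runs >= WARN_AFTER_RUNS


usage = SessionUsage()
for _ in range(19):
    usage.record_run()
warned_at_19 = usage.should_warn

usage.record_run()  # twentieth run
print(usage.runs, usage.estimated_cost_usd, usage.should_warn)
```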

Agentic AI is not expensive to build with. The cost concern is about uncontrolled production usage, not development.

What this project is built on

BioAgent is the companion project to the Biomarker Concordance Pipeline -- a production Nextflow DSL2 germline variant calling system with GATK 4.5, hap.py concordance benchmarking against GIAB, ICC/Bland-Altman reproducibility analysis, FastAPI REST API on AWS RDS, and Terraform infrastructure. BioAgent queries that pipeline's API as its primary data source.

The two projects together demonstrate:

Production bioinformatics pipeline engineering (Nextflow, GATK, hap.py, GIAB)
Cloud infrastructure (AWS Batch, RDS, S3, ECR, Terraform)
Data engineering (concordance metrics, ICC, Bland-Altman, Westgard rules)
Agentic AI systems (LangGraph, tool design, graceful degradation, streaming)
Production API design (FastAPI, async SQLAlchemy, background tasks)

The numbers

Metric	Value
Tests passing	27/27
CI pipeline	Green on every push
Tools in agent	6
Graph nodes	5 (fetch, analyse, search, synthesise, degrade)
Max fetch retries	1
Max PubMed retries	2
Cost per agent run	~$0.05
Report generation time	15-25 seconds

Repository

Everything in this post is open source:

github.com/gbadedata/bioagent

The companion pipeline project:

github.com/gbadedata/biomarker-concordance-pipeline

The README covers the full graph architecture, tool design decisions, the PubMed keyword strategy, graceful degradation behaviour, and instructions for running the agent locally against the companion API.

Building agentic AI systems for bioinformatics, genomics, and life sciences data engineering. If you are working on similar problems, connect on GitHub or LinkedIn.