I kept avoiding clipping my own content.
Not because I didn't want short clips. I did. But the process was genuinely painful — scrub through a long video, find a good moment, trim it, crop for vertical, add captions, export. Repeat three times. Two hours gone.
So I built a tool that does the whole thing automatically.
Here's how it works under the hood.
The Problem With a Simple Script
My first instinct was a single Python script — call Whisper, parse the transcript, run FFmpeg. Done.
It worked. Until it didn't.
When the LLM returned a bad clip selection, I had to re-run transcription. When FFmpeg failed on a weird video format, I lost the focus detection results. Debugging meant re-running everything from scratch every single time.
I needed each step to be isolated. That's where LangGraph came in.
Why LangGraph
LangGraph lets you model a pipeline as a graph of discrete nodes that read from and write to a shared state. Instead of one big sequential script, the workflow looks like this:
transcription → clip_selection → focus_detection → rendering
Each node:
- Reads only the parts of the state it needs
- Writes its output back to shared state
- Can be retried independently if it fails
- Can be tested in isolation without running the full graph
That last point alone saved me hours of debugging. When clip selection was returning poor moments, I could feed it test transcripts directly without touching Whisper or FFmpeg.
Conditional edges also let me add error handling cleanly — if focus detection fails, route to a fallback center-crop instead of crashing the whole pipeline.
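Here is a minimal sketch of how that graph could be wired, assuming LangGraph's `StateGraph` API. The state keys, the stub node bodies, and the `route_after_focus` router are placeholders for illustration, not the tool's actual code. The LangGraph import sits inside `build_graph` so the node functions can be exercised on their own.

```python
from typing import TypedDict


class ClipState(TypedDict, total=False):
    # Shared pipeline state. Key names are illustrative, not the tool's real schema.
    video_path: str
    transcript: list
    clips: list
    crops: list
    outputs: list
    focus_failed: bool


# Each node reads the keys it needs and returns only the keys it updates.
def transcribe(state: ClipState) -> dict:
    return {"transcript": [{"word": " hello", "start": 0.0, "end": 0.4}]}  # stub


def select_clips(state: ClipState) -> dict:
    return {"clips": [{"start": 0.0, "end": 0.4, "reason": "stub"}]}  # stub


def detect_focus(state: ClipState) -> dict:
    # Stub detector: report failure through state instead of raising,
    # so the router below can choose the fallback branch.
    clips = state.get("clips", [])
    if not clips:
        return {"focus_failed": True}
    return {"crops": [{"mode": "subject"} for _ in clips], "focus_failed": False}


def center_crop(state: ClipState) -> dict:
    # Fallback: crop around the middle of the frame for every clip.
    return {"crops": [{"mode": "center"} for _ in state.get("clips", [])]}


def render(state: ClipState) -> dict:
    return {"outputs": [f"clip_{i}.mp4" for i, _ in enumerate(state.get("crops", []))]}


def route_after_focus(state: ClipState) -> str:
    # Conditional edge: pick the next node based on what focus detection reported.
    return "center_crop" if state.get("focus_failed") else "rendering"


def build_graph():
    from langgraph.graph import StateGraph, START, END  # needs `pip install langgraph`

    builder = StateGraph(ClipState)
    builder.add_node("transcription", transcribe)
    builder.add_node("clip_selection", select_clips)
    builder.add_node("focus_detection", detect_focus)
    builder.add_node("center_crop", center_crop)
    builder.add_node("rendering", render)

    builder.add_edge(START, "transcription")
    builder.add_edge("transcription", "clip_selection")
    builder.add_edge("clip_selection", "focus_detection")
    builder.add_conditional_edges(
        "focus_detection",
        route_after_focus,
        {"center_crop": "center_crop", "rendering": "rendering"},
    )
    builder.add_edge("center_crop", "rendering")
    builder.add_edge("rendering", END)
    return builder.compile()
```

With LangGraph installed, `build_graph().invoke({"video_path": "talk.mp4"})` would run the whole chain. Calling `select_clips` directly with a hand-written transcript is the kind of isolated test described above.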
The Full Pipeline
Node 1 — Transcription
Pulls audio from the video (or from a YouTube URL via yt-dlp) and runs it through OpenAI Whisper locally. Output is a full transcript with word-level timestamps.
Word-level timestamps are important — they let you map a selected text moment back to exact video timecodes for cutting.
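To make that mapping concrete, here is a small sketch that finds a phrase in a word-level transcript and returns its timecodes. It assumes each word is a dict with `word`, `start`, and `end` keys (the shape I'd expect from Whisper with word timestamps enabled). The `find_timecodes` helper is hypothetical.

```python
def _norm(token: str) -> str:
    # Whisper-style words often carry a leading space and trailing punctuation.
    return token.strip().lower().strip(".,!?;:\"'")


def find_timecodes(words, phrase):
    """Return (start, end) in seconds for the first match of phrase, or None."""
    target = [_norm(t) for t in phrase.split()]
    tokens = [_norm(w["word"]) for w in words]
    n = len(target)
    if n == 0:
        return None
    # Slide a window of n words across the transcript and compare.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return words[i]["start"], words[i + n - 1]["end"]
    return None
```

The start of the first matched word and the end of the last one become the cut points handed to the later nodes.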
Node 2 — Clip Selection
Sends the transcript to an LLM with a prompt asking it to identify the 3 most engaging moments. The model returns start/end timestamps and a brief reason for each selection.
The prompt explicitly asks for moments that:
- Have a clear beginning and end
- Make sense without surrounding context
- Would stop someone mid-scroll
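A rough sketch of what the prompt and the response handling could look like. The prompt wording, the JSON reply format, and the `build_prompt` / `parse_clips` helpers are my assumptions for illustration. The point is that the model's output gets validated before anything downstream trusts it.

```python
import json

PROMPT_TEMPLATE = (
    "From the transcript below, pick the {n} most engaging moments.\n"
    "Each moment must have a clear beginning and end, make sense without "
    "surrounding context, and be likely to stop someone mid-scroll.\n"
    'Reply with a JSON array of objects with keys "start" and "end" '
    '(in seconds) and "reason".\n\n'
    "Transcript:\n{transcript}"
)


def build_prompt(transcript_text: str, n_clips: int = 3) -> str:
    return PROMPT_TEMPLATE.format(n=n_clips, transcript=transcript_text)


def parse_clips(raw: str, duration: float, n_clips: int = 3) -> list:
    """Keep only well-formed clips that fit inside the video."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    clips = []
    for item in items:
        try:
            start, end = float(item["start"]), float(item["end"])
        except (KeyError, TypeError, ValueError):
            continue  # skip entries with missing or non-numeric timestamps
        if 0 <= start < end <= duration:
            clips.append({"start": start, "end": end,
                          "reason": str(item.get("reason", ""))})
    return clips[:n_clips]
```

An empty list from `parse_clips` is a natural signal to retry just this node, without touching transcription.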
Node 3 — Focus Detection
For each selected clip, runs face/subject detection to find where the main subject is in the frame. This determines the crop position for the 9:16 vertical output.
For single-speaker content this works well. Multi-person framing is still something I'm working on.
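The crop arithmetic itself is simple once a subject position is known. A minimal sketch, assuming the detector yields a horizontal subject centre in pixels (the `vertical_crop` function and its parameters are hypothetical):

```python
def vertical_crop(frame_w: int, frame_h: int, subject_cx=None):
    """Return (crop_w, crop_h, x, y) for a full-height 9:16 window."""
    crop_h = frame_h
    crop_w = int(frame_h * 9 / 16)
    crop_w -= crop_w % 2           # keep the width even for yuv420p encoding
    crop_w = min(crop_w, frame_w)  # source already narrower than 9:16
    if subject_cx is None:
        subject_cx = frame_w / 2   # fallback: centre crop
    x = int(round(subject_cx - crop_w / 2))
    x = max(0, min(x, frame_w - crop_w))  # clamp the window inside the frame
    return crop_w, crop_h, x, 0
```

For a 1920x1080 source with no detected subject, `vertical_crop(1920, 1080)` gives `(606, 1080, 657, 0)`, a 606-pixel-wide window centred in the frame.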
Node 4 — Rendering
FFmpeg renders each clip with:
- 9:16 crop based on focus detection output
- Auto-generated captions burned into the video
- Output optimised for TikTok / Reels / Shorts
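As a sketch of the render step, this is roughly how one clip's FFmpeg command could be assembled. The exact flags the tool uses aren't shown here, so treat the filter chain and encoder settings as assumptions. `ffmpeg_clip_command` is a hypothetical helper, and the subtitle file is assumed to be timed relative to the clip start and to sit at a simple path that needs no filter escaping.

```python
def ffmpeg_clip_command(src, dst, start, end, crop, srt_path):
    """Build the argument list for cutting, cropping and captioning one clip."""
    if end <= start:
        raise ValueError("end must be after start")
    crop_w, crop_h, x, y = crop
    # crop to the 9:16 window, scale to 1080x1920, then burn in the captions
    vf = f"crop={crop_w}:{crop_h}:{x}:{y},scale=1080:1920,subtitles={srt_path}"
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",        # seek on the input, so output time starts at 0
        "-t", f"{end - start:.3f}",   # read only the clip's duration
        "-i", src,
        "-vf", vf,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-movflags", "+faststart",
        dst,
    ]
```

Returning a list instead of a shell string means it can be passed straight to `subprocess.run(cmd, check=True)` without quoting issues, and the command can be inspected in tests without running FFmpeg.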
Real-Time Progress in the UI
One nice side effect of the LangGraph architecture: real-time progress updates came almost for free.
As state moves through each node, the backend emits an event. The frontend listens and updates a progress indicator — so instead of staring at a loading spinner for 3 minutes, you watch the pipeline move:
✓ Transcription complete
✓ Clip moments identified
✓ Focus detection done
⏳ Rendering clips...
Users told me this was the most reassuring part of the UX. Knowing something is actually happening makes the wait feel shorter.
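One way this could be wired up, as a sketch: if the graph is run with `graph.stream(inputs, stream_mode="updates")`, LangGraph yields one `{node_name: update}` dict as each node finishes, and those can be mapped to messages for the frontend. The Server-Sent Events format and the `progress_events` helper are assumptions, since the transport isn't specified above.

```python
import json

PROGRESS_MESSAGES = {
    "transcription": "Transcription complete",
    "clip_selection": "Clip moments identified",
    "focus_detection": "Focus detection done",
    "rendering": "Clips rendered",
}


def progress_events(updates):
    """Turn per-node updates into Server-Sent Events payloads."""
    for update in updates:
        for node_name in update:
            message = PROGRESS_MESSAGES.get(node_name)
            if message is None:
                continue  # nodes without a user-facing message stay silent
            payload = json.dumps({"node": node_name, "message": message})
            yield f"data: {payload}\n\n"
```

In FastAPI, a generator like this could be returned through `StreamingResponse(..., media_type="text/event-stream")` and consumed in the browser with `EventSource`.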
Stack
| Layer | Tech |
|---|---|
| Backend | FastAPI |
| Frontend | Next.js 14 |
| Transcription | OpenAI Whisper |
| Video Processing | FFmpeg |
| Pipeline Orchestration | LangGraph |
| Storage & Auth | Supabase |
| YouTube Ingestion | yt-dlp |
What Works Well and What Doesn't
Works well:
Talk-heavy content — podcasts, interviews, conference talks, lectures. The transcript is rich and the LLM picks genuinely good moments.
Still needs work:
B-roll-heavy videos where the visual tells the story more than the words. The transcript alone doesn't capture what makes a moment visually compelling. This is the next problem I want to solve — probably with frame-level visual analysis alongside the transcript.
Multi-person framing for focus detection is also rough. Single speaker is solid.
Try It
It's free right now, no signup needed: https://video-generator-six-coral.vercel.app/
If you're curious about the LangGraph architecture or any part of the pipeline, ask in the comments — happy to go deeper on any of it.
And if you try it on your own content, I'd genuinely love to know if the clip selection actually picks good moments.













