AI Video Enhancement with Storyboards, Captions & SFX
Modern AI tools don’t just translate or upscale videos—they can understand scenes, generate storyboards, write meme-style captions, and add perfectly timed sound effects.
This post walks through how Curify AI builds an automated pipeline for scene-based video enhancement using: scene detection, GPT-4o Vision, storyboard JSON generation, captioning, and SFX layering.

Before & After: Enhanced Clips
Below are examples showing the transformation from raw footage to captioned, storyboard-driven, sound-enhanced clips.
Original
Enhanced
1. Scene Detection → Storyboard JSON
Curify uses scene detection (PySceneDetect) to extract only visually important beats. These frames are sent to GPT-4o Vision, which produces an editable storyboard JSON:
- Scene timestamps
- Meme-style captions
- SFX selection
- Text timing & duration
[
{
"start": 0,
"end": 14,
"text": "European leaders sanction Russia and shut off the oil faucet.",
"sfx_key": "dun",
"bg_sfx_key": "water_flow",
"bg_start": 0,
"bg_end": 14,
"text_offset": 0.5,
"text_duration": 5
},
{
"start": 14,
"end": 27,
"text": "The U.S. leader arrives smiling, carrying a bucket, making a deal with the Russian leader and handing over a bag of cash.",
"sfx_key": "cash",
"text_offset": 0.5,
"text_duration": 5
},
{
"start": 27,
"end": 44,
"text": "The U.S. leader resells the oil to European leaders who arrive with cash—while the U.S. leader laughs.",
"sfx_key": "clown",
"bg_sfx_key": "evil_laugh",
"bg_start": 41,
"bg_end": 44,
"text_offset": 0.5,
"text_duration": 5
}
]2. Auto-Generated Meme Captions
Captions are short, punchy hooks written by the LLM. They are synced to scene boundaries and rendered with bold, high-contrast styling.
- White text + black stroke
- Bounce / pop entrance animation
- Emotionally aligned with visual content
3. Sound Effects & Timing
The enhancement pipeline uses a small but expressive SFX library:
- cash – deal making / money bag
- whoosh – transitions / fast movement
- dun – dramatic emphasis
- clown – comedic beats
- news – broadcast intro sting
- water_flow – oil/water ambience
- evil_laugh – humorous villain ending
cash
whoosh
dun
clown
news
water_flow
evil_laugh
4. Putting It All Together
- Scene detection isolates visual beats
- Frames → GPT-4o Vision
- LLM generates storyboard JSON
- User optionally edits captions or timing
- MoviePy assembles text + SFX + transitions