TwelveLabs is Scaling the Video Cognition System, the Foundation for Video Superintelligence
- Karan Bhatia

- 58 minutes ago

TwelveLabs, a video intelligence platform and API led by Jae Lee, Aiden Lee, Dave Chung, Soyoung Lee, and Sungjun Kim, has raised $100 million in a round co-led by NEA and NAVER Ventures, with participation from Amazon, Radical Ventures, Korea Investment Partners, Index Ventures, Quadrille Capital, and Red Bull Ventures.
The Bet.
Five years ago, a simple observation shaped the company's direction:
The world does not happen in text. It happens in motion.
Language compresses reality after the fact. It is powerful, but inherently lossy. Before words exist, there is sensory evidence: shape, motion, sound, and sequence.
While most AI systems learn from compressed representations, TwelveLabs took a different approach and focused on the signal itself. The underlying belief: understanding the physical world requires a native representation of video. Not captions or sampled frames, but a system that can perceive, index, retrieve, and reason over reality as it unfolds.
That conviction has defined TwelveLabs from the very beginning.
What Was Built?
The platform was built around three technical principles: perception, memory, and reasoning.
Perception enables raw video to be transformed into meaning without reducing it to text too early. Marengo unifies visual, audio, speech, and on-screen text into a searchable representation, while Pegasus generates grounded descriptions, answers, and summaries.
Memory ensures every video is understood once, converted into a durable representation, and indexed down to the exact moment each event occurs. The archive becomes machine-readable memory rather than passive storage.
Reasoning makes it possible to connect evidence across thousands of videos, identify patterns over time, compare events, and answer complex questions grounded in source footage.
Together, perception, memory, and reasoning form a Video Cognition System, an architecture that makes video computational.
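As a rough illustration of how the three principles could fit together, here is a minimal Python sketch. It is not the TwelveLabs API and makes no claim about how Marengo or Pegasus work internally. The names Moment, VideoMemory, and evidence_by_video are hypothetical, and the small hand-written vectors stand in for embeddings that a real perception model would produce.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Moment:
    """One indexed span of video. All fields are illustrative assumptions."""
    video_id: str
    start_s: float
    end_s: float
    embedding: tuple  # stand-in for a vector from a perception model


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VideoMemory:
    """Memory: each moment is embedded once, stored, and addressable by time."""

    def __init__(self):
        self._moments = []

    def add(self, moment):
        self._moments.append(moment)

    def search(self, query, top_k=3):
        """Return up to top_k (score, moment) pairs, best match first."""
        scored = [(cosine(query, m.embedding), m) for m in self._moments]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

    def __len__(self):
        return len(self._moments)


def evidence_by_video(memory, query, threshold=0.8):
    """Toy reasoning step: group matching time spans by source video."""
    evidence = defaultdict(list)
    for score, m in memory.search(query, top_k=len(memory)):
        if score >= threshold:
            evidence[m.video_id].append((m.start_s, m.end_s))
    return dict(evidence)


# Hypothetical data: three moments from two made-up camera feeds.
memory = VideoMemory()
memory.add(Moment("cam_a", 0.0, 5.0, (1.0, 0.0, 0.0)))
memory.add(Moment("cam_a", 5.0, 10.0, (0.0, 1.0, 0.0)))
memory.add(Moment("cam_b", 12.0, 18.0, (0.9, 0.1, 0.0)))

query = (1.0, 0.0, 0.0)
best_score, best = memory.search(query, top_k=1)[0]
print(best.video_id, best.start_s, best.end_s)
print(evidence_by_video(memory, query))
```

In this toy version, perception is reduced to the precomputed embedding on each Moment, memory is the time-addressed store, and reasoning is a simple grouping of matching spans across videos. A production system would presumably replace the linear scan with an approximate nearest-neighbor index.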
Why Now?
The last decade of AI made text programmable. Language models transformed words into semantic data, enabling agents to reason over documents, conversations, and code.
Video has yet to undergo the same transformation.
Today, vast amounts of video remain trapped in archives, cameras, broadcasts, factories, hospitals, and satellites, rich with visual and contextual information, yet accessible primarily through filenames, transcripts, or human memory.
The next frontier is making every second of video addressable, searchable, and usable by AI agents.
That is the path from video understanding to Video Superintelligence.


