Why Aptibit built a native C++20 streaming engine for Visylix instead of wrapping FFmpeg. The architectural decisions behind 5,000+ streams per node at 22% CPU.
When embarking on an ambitious video streaming project, particularly one demanding high-performance live streaming, a foundational question inevitably arises: "Why not simply use FFmpeg?" It is a fair question, given FFmpeg's status as a cornerstone of digital media. This open-source multimedia framework is the backbone of countless applications, from global video platforms to desktop media players, and many commercial streaming solutions and video management systems are built on its foundation.
However, at Aptibit Technologies, when we began developing Visylix, our enterprise AI video management platform, we opted for a different path. We chose to build our streaming engine from the ground up rather than relying solely on FFmpeg's extensive but general-purpose capabilities. This decision was not made lightly; it represented a significant investment in time and engineering effort. The reasons behind this choice stem from the very nature of what we aimed to achieve: a platform capable of handling massive concurrency, sub-second latency, and integrated AI processing on every live stream.
Visylix is engineered for enterprise-grade AI video management. Our clients typically connect thousands of cameras to a single deployment, requiring live video with ultra-low latency, ideally sub-second, and the simultaneous execution of multiple AI models on each stream. This demanding environment operates 24/7, necessitating unwavering stability and minimal resource degradation.
During our evaluation of existing video management systems built around FFmpeg, we encountered a critical bottleneck. These platforms began to falter around the 300 to 500 camera mark per server. CPU usage would spike to an unsustainable 94%, live stream latency would balloon to several seconds, and stream drops became commonplace. This was not a limitation of hardware; it was a fundamental architectural constraint imposed by the way FFmpeg was integrated. The sheer volume of video processing required for so many concurrent feeds overwhelmed the traditional approach.
FFmpeg is, without question, a phenomenal tool. Its core strength lies in its ability to transcode media files, convert container formats, and manage individual streams, with unparalleled breadth of support for video codecs and audio encoders. It excels at tasks like converting file formats, handling containers, and performing single-instance video recording. In essence, FFmpeg is a powerful toolkit: a collection of highly optimized libraries and a versatile command-line tool for media manipulation.
However, using FFmpeg for live video surveillance at the scale we envisioned means asking it to perform tasks far beyond its original design parameters. A typical FFmpeg-based live stream pipeline for such scenarios involves multiple decode-encode cycles: decoding the incoming stream, re-encoding it for storage, decoding it again for live viewing, and finally re-encoding it once more for browser-based delivery. Each of these operations is CPU-intensive. When multiplied by thousands of cameras, the cumulative CPU load becomes the limiting factor. Most video management platforms built on FFmpeg accept this limitation and advise customers to deploy more servers. Our objective was to create a solution that broke through this ceiling, not one that simply scaled out. The global video streaming market, estimated at USD 137.9 billion in 2024 and projected to grow at a 22.3% compound annual growth rate to USD 843.0 billion by 2033, underscores the need for solutions that scale.
The decision to engineer a custom streaming engine, particularly in C++20, represented a significant deviation from the easier path of wrapping FFmpeg. This approach added months to our development timeline and required us to re-solve problems that FFmpeg has refined over two decades, such as handling the full range of video codecs and container formats.
However, this intensive undertaking granted us the unparalleled ability to design every layer of our system with a single, definitive purpose: to efficiently manage massive concurrent video streams, integrate sophisticated AI video processing, and minimize resource consumption. This deep-level control allowed us to architect a platform that could scale linearly. Three critical architectural decisions set our custom engine apart from traditional FFmpeg-based solutions.
A pervasive bottleneck in many traditional video streaming platforms, including those using FFmpeg, is the overhead associated with Input/Output operations. Each system call made for every I/O event adds up significantly. When managing thousands of streams, each generating dozens of I/O operations per second, this cumulative overhead can cripple performance.
To combat this, we developed a proprietary asynchronous I/O engine. This engine is designed to process operations with near-zero overhead per individual operation. By drastically reducing the systemic cost of I/O, we achieved roughly three times the throughput of traditional approaches on identical hardware. This forms the bedrock of our ability to take on high-volume live streams and recording workloads without performance degradation.
In a typical FFmpeg-based pipeline, video data is subjected to multiple memory copies as it navigates through various stages, such as decoding, AI analysis, and storage. For a scenario involving 5,000 streams at 1080p resolution running at 30 frames per second, this can consume nearly 900 GB per second of memory bandwidth, solely dedicated to moving data around. This is inefficient and fundamentally limits scalability.
Our custom architecture fundamentally eliminates these unnecessary data copies. Video data flows smoothly through the entire pipeline without redundant duplication. This optimization significantly reduces memory bandwidth consumption by over 80% and, crucially, liberates substantial CPU cycles. These freed-up cycles can then be dedicated to actual video processing and AI inference, rather than being consumed by mere data wrangling.
Standard memory allocators are designed for general-purpose applications, offering a balance of features for diverse workloads. However, video processing at scale is anything but general-purpose. Frames arrive at variable rates, AI models require dynamic memory allocations, and recording buffers must expand and contract fluidly based on event triggers.
We engineered a specialized memory management system precisely optimized for these dynamic patterns. In rigorous 30-day continuous operation tests, Visylix maintained consistent, high performance throughout the entire period. In stark contrast, platforms based on FFmpeg exhibited a noticeable performance degradation of 15 to 20% over the same duration. This specialized memory management is vital for maintaining predictable performance, whether handling live streams or complex video workflows.
The tangible outcomes of our architectural decisions are profound and speak volumes about the power of custom engineering. A single Visylix node can effortlessly handle over 5,000 concurrent streams while maintaining a CPU usage of just 22%. To put this into perspective, the same hardware running a traditional FFmpeg-based video management system typically manages only 300 to 500 streams at a crippling 94% CPU usage.
Live view latency is reduced to below 500ms, delivered smoothly via WebRTC. This stands in sharp contrast to the 2 to 5 seconds of latency typically experienced with FFmpeg-based platforms relying on HLS delivery. Our architecture also supports over a million concurrent connections across a full deployment, a capability that few competitors in the video management space can claim or even market, as their underlying architectures simply cannot sustain such demand. This matters in today's rapidly growing live streaming market, which is projected to reach $345 billion by 2030.
Undeniably, constructing a custom streaming engine from the ground up demanded significantly more time and resources than simply wrapping FFmpeg. We had to meticulously address challenges in video codecs, container format handling, streaming protocol negotiation, buffer management, and many other complex problems that FFmpeg solves out of the box.
However, the reward was the creation of a platform that delivers capabilities unattainable by any FFmpeg wrapper: a system that handles tenfold more streams on the same hardware, consumes five times less CPU, and achieves sub-second latency. For our clients, this translates directly into fewer servers, reduced operational costs, and truly real-time video streaming that enhances their operations and user experiences.
Every startup in the video management and streaming space faces the fundamental question: build or wrap? Many opt to wrap FFmpeg for a faster time to market. Our strategic imperative, however, was not speed to market, but long-term performance and scalability. We optimized for the critical moment when a customer connects their 1,000th camera, ensuring that performance does not just hold, but excels.
There is no substitute for building technology with architecture tailored to perform at scale. FFmpeg remains an incredible tool, indispensable for its intended purposes in video processing and format conversion, supporting a large array of video codecs and audio encoders. However, when a product requires handling thousands of concurrent live streams, integrating complex AI video processing, and achieving sub-second latency, an architecture explicitly designed for these challenges is paramount.
We built Visylix from the ground up in Kolkata. Today, it demonstrably handles more concurrent streams than any other video management platform on the market. That would not have been possible by merely wrapping FFmpeg. Sometimes the hardest engineering decision is the right one.
Why couldn't you just wrap FFmpeg in a server process?
FFmpeg is brilliant as a transcoding library. It's not a streaming server. Wrapping it in a server process works fine until you hit a few hundred concurrent streams; then memory fragmentation, process-per-stream overhead, and the lack of async I/O start destroying performance. For 5,000+ streams per node we needed ownership of the entire pipeline.
What are the concrete performance gains?
On identical hardware we handle roughly 10x more concurrent streams at 22% CPU utilization versus an FFmpeg-wrapped server pushed to its limits. Memory footprint dropped by about 80% thanks to jemalloc and shared buffer pools. Latency under load stayed under 500ms where the legacy setup drifted past 3 seconds.
Why C++ rather than Go or Rust?
Go's garbage collector introduces latency spikes we couldn't afford. Rust was a serious candidate, but the video codec ecosystem still leans heavily on C and C++ libraries, and our team had deep C++ expertise. C++20 with coroutines, io_uring, and jemalloc gave us deterministic memory behavior and zero-cost abstractions where they matter.
How long did it take to build?
About two and a half years of focused work by a small senior team. That's an investment most companies can't justify, which is why so many VMS and streaming products ship on FFmpeg wrappers. For us, it was the difference between competing on features and competing on the underlying architecture.
When is wrapping FFmpeg still the right call?
Small deployments, batch transcoding jobs, recording pipelines, and any workload where latency and concurrency aren't the binding constraints. For those use cases FFmpeg is unbeatable. The moment you need thousands of concurrent live streams with AI analytics in the hot path, you're probably building custom.
Is the engine open source?
No. It's proprietary to Visylix, though we draw on open standards (WebRTC, RTSP, HLS, SRT) and open-source libraries where they're best in class. The engine itself is our differentiator, not a commodity.