Broadcast digital services company Cognacq-Jay Image has deployed scale-out NAS storage from Qumulo. A key attraction over competitors was fine-grained monitoring and control over settings, especially for applications that handle large numbers of files to tight timescales dictated by customers.
“Every day we receive several TB of video that we must process and return, with deadlines dictated by channel schedules,” said Michel Desconnets, head of IT at Cognacq-Jay Image. “We have to maintain throughput, but we depend as much on performance as on the accuracy of the process.”
Cognacq-Jay Image’s work consists of post-production on TV programmes, such as adding credits, advertising or subtitles. But with the bulk of TV now delivered via digital channels, most of the work is IT-related, and each video must be transcoded into a variety of formats for multiple set-top boxes and applications.
“For TV news, for example, we receive recently shot footage and send it back correctly formatted within 10 minutes,” said Desconnets. “But for a high-resolution film, there can be several hours of conversion processing. Some customers send us their video at the last minute; others weeks in advance.
“The number of formats varies by client. Some videos need the addition of digital rights management [DRM], for example. We have to take all these things into consideration, and manage priorities for numerous jobs at any given time on our systems. It’s a very complex process.”
Customers range from independent small channels to large media groups. Some clients carry out part of the processing internally, while others don’t.
Some demand that Cognacq-Jay Image retains dedicated infrastructure for their work. It is for that reason that the company has seen platforms multiply in its datacentre, with scale-out NAS from Isilon (Dell EMC) and object storage from Scality.
The challenge of tight timescales
In 2020, an unnamed customer wanted to add to its production jobs, but the Scality array used didn’t offer the required workload characteristics. “It was a 300TB array and supported throughput of 2.5GBps,” said Desconnets. “Capacity wasn’t a problem because 60TB was dedicated to production, with the rest handling archiving as it was sent back to the client.
“Our main concern was throughput. We needed 3GBps for writes plus 1GBps to export the final files.”
Desconnets added: “The servers that execute transcoding support large amounts of bandwidth and write a large quantity of files in parallel. But if their write times are 20% slower than their processing speed, that holds up other processes. The problem is that we don’t know which ones slow the whole thing down.
“In other words, beyond a simple technical bottleneck, we didn’t know how to react to problems quickly. And yet, problems like these – an error in transcoding, a bad file, etc – are very frequent and require extreme vigilance on our part.”
In the middle of 2020, Desconnets and his team started to look for a new storage setup. “Across its range, Scality was better able to deliver capacity than speed of access,” he said. “In other words, its solutions meant we would have to buy lots of servers to compensate for latency.
“With Isilon, bandwidth was less of a problem. But it is very difficult to monitor activity on an Isilon array, in particular as you try to diagnose problems posed by small files, large files, etc.”
Qumulo storage software on HPE hardware
During the research process, Desconnets came across Qumulo. “They suggested we test some machines for a couple of months,” he said. “We were able to validate that their solution contained very rich APIs [application programming interfaces] that would allow us to write extensive scripts and had ready-to-use test processes.”
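As a rough illustration of the kind of scripting such an API enables, the sketch below aggregates per-path activity data into per-client throughput figures. The response shape, field names and paths here are assumptions invented for illustration, not Qumulo’s actual API schema.

```python
from collections import defaultdict

# Illustrative sample modelled on the kind of per-path activity data an
# analytics API might return; the field names and paths are assumptions,
# not the real Qumulo API schema.
SAMPLE_ACTIVITY = [
    {"path": "/clients/channel-a/in/clip01.mxf", "type": "write", "rate": 410_000_000},
    {"path": "/clients/channel-a/in/clip02.mxf", "type": "write", "rate": 380_000_000},
    {"path": "/clients/channel-b/out/master.mov", "type": "read", "rate": 150_000_000},
]

def throughput_by_client(entries, depth=2):
    """Sum reported rates (bytes/s) per client directory prefix."""
    totals = defaultdict(int)
    for entry in entries:
        parts = entry["path"].strip("/").split("/")
        client = "/".join(parts[:depth])
        totals[client] += entry["rate"]
    return dict(totals)

if __name__ == "__main__":
    print(throughput_by_client(SAMPLE_ACTIVITY))
```

In practice, a script like this would poll the cluster’s REST endpoint on a schedule and feed the aggregates into a dashboard, so that a per-client throughput drop shows up within minutes.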
The order for Qumulo went in during the final quarter of 2020. Qumulo is a software product and was bought through HPE, which supplied pre-configured hardware comprising six 2U Apollo servers, each with 36TB of storage capacity.
Qumulo is part of a new wave of scale-out NAS and distributed storage products that seek to address the growing need to store unstructured data, often in the cloud as well as the customer datacentre.
The order was completed with two 1U switches. Besides connecting the Qumulo nodes, the switches provided four 10Gbps connections to the transcoding servers, which comprised about 30 Windows machines.
“The transcoding servers are dedicated to the same client, and that posed the question of whether to opt for hyper-converged infrastructure [HCI], with compute and storage in the same node,” said Desconnets. “But HCI isn’t suited to our needs, where compute is independent of storage capacity. We want to be able to add to one without necessarily adding to the other.
“Our processes also pass through our export servers, which are not dedicated to specific clients and so require separate infrastructure.”
The components were in place by the end of 2020, said Desconnets. “We needed to get it into production from the start of 2021, but a customer added to their workload just before Christmas. So, we decided to accelerate the migration. In the end, we completed testing for production in two days.”
Then things went awry. At first, everything ran as Cognacq-Jay Image imagined it would. But two months later, it hit a snag.
“In February 2021, we suddenly noticed queues building,” said Desconnets. “A file that would have been sent in an hour took two, or even three, hours when transcoding to some formats. Qumulo’s monitoring tools revealed that latencies had increased 100-fold. But that didn’t tell us whether the problem lay with the disks, the software or our tools.
“So we took advantage of the functionality in the API that enables us to get real-time monitoring. As a result of that, I realised that if I turned off some transcoders, everything went faster, and that meant that – paradoxically – parallel working was counter-productive.”
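A simple way to turn real-time latency data like this into an alert is to compare each observed latency against a per-operation baseline and flag anything above a multiple of it. The sketch below uses the 100x figure from the incident as the threshold; the baselines and sample values are invented for illustration.

```python
def latency_alerts(samples, baseline, factor=100.0):
    """Flag operations whose observed latency exceeds `factor` times
    the baseline latency recorded for that operation type."""
    alerts = []
    for op, latency_ms in samples:
        if op in baseline and latency_ms > factor * baseline[op]:
            alerts.append(op)
    return alerts

# Illustrative numbers only: normal latencies in milliseconds per operation.
baseline = {"write": 0.4, "read": 0.2}
samples = [("write", 55.0), ("read", 0.3), ("write", 12.0)]

if __name__ == "__main__":
    print(latency_alerts(samples, baseline))
```

Run against a rolling baseline rather than fixed constants, a check like this distinguishes a genuine 100x latency spike from ordinary load variation.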
Desconnets soon understood that the problem was to do with the way processing was organised. “We had decided to transcode all files into an initial format, then put them all into a second format, and so on,” he said. “But by doing this, we had to load and unload files in the cache with each transcoding run.”
He explained that the cache comprised 1TB on each node (6TB in total), and so was not big enough to hold all the files while they were being processed.
“Best practice is to transcode a file in all possible formats, then go to the next file,” said Desconnets. “What we needed to do was to transcode a file and get it out as quickly as possible, rather than do lots at the same time.”
Opportunity for granular monitoring
Desconnets is proud of the monitoring system he has built for the company’s Qumulo deployment. It comprises Zabbix to gather metrics, Kibana to analyse logs and Grafana, which creates graphical visualisations.
“I deployed a console that allowed us to drill down into the provenance of each operation,” said Desconnets. “This monitoring system allowed us to resolve all the problems in less than a week. Within two weeks, we had optimised all the settings, and even discovered long-standing bugs in our processes and managed to iron them out.”
Since then, the team has added two more Apollo nodes. Raw capacity has increased to 288TB (210TB usable), with the rest given over to redundancy. “On average, we use 100TB a day, but that’s sometimes 180TB one day and 85TB the next,” said Desconnets. “This isn’t storage that grows gradually, but fills and empties all the time.
“Nevertheless, our Qumulo cluster has run like clockwork. The metrics continue to let us monitor client activity. For example, we have seen where operations have not completed quickly enough, and that has allowed us to resolve bottlenecks.”