CraftStory developers expand & extend AI video generation services

Companies, both small and large, are starting to adopt AI videos not only for ads, but for marketing content, comms and education. As production costs drop thanks to AI, it becomes much easier for organisations to communicate with their customers.

While there are issues surrounding ethics, the speed of back-end AI processing and the quality of AI video itself, most of the resulting “product” in this space is short in length and sometimes short on quality.

Aiming to address both of these traditional shortfalls is CraftStory.

This new startup says its technology can create AI videos up to five minutes long. Why does that matter? Because firms can use video material of this kind for training purposes, product demos and other serious, non-trivial use cases, although users will naturally (probably) experiment with creating videos of themselves pictured as medieval knights slaying dragons too.

Other firms in this space produce shorter end results. OpenAI, for example, is known for its Sora 2 application, which is capable of producing video clips of 25 seconds in length; other tools in this space produce even shorter videos.

According to the Sora 2 homepage, the service can do things that are exceptionally difficult (and in some instances outright impossible) for prior video generation models… but its clips are limited in length.

CraftStory says its technology can create “continuous and coherent” video content… and at five minutes in length, its output could be said to be as long as a typical YouTube video. 

The system only works with static camera backgrounds for now, but the team promises to move to dynamic backgrounds in the near future. That makes the technology arguably most useful for smaller businesses that want to showcase their services with demonstration videos for marketing and other commercial use cases. CraftStory’s founder and CEO Victor Erukhimov has said that other services in this space often ignore part of a user’s instructions, especially when it comes to intended length.

So how does it work?

Parallelised diffusion architecture

Erukhimov explains that his company’s platform uses a technique known as parallelised diffusion architecture, which takes a different approach to the sequential processing methods typically used by video generation models. These traditional video models run “diffusion algorithms” on large three-dimensional volumes, where time is the third axis.

For learning here, let’s thank IBM for the explanation: “Diffusion algorithms, particularly Diffusion Models, are advanced AI models that create high-quality data (like images, text, audio) by learning to reverse a gradual noising process, inspired by physics.”
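For readers who prefer to see the idea in code, here is a minimal, purely illustrative Python sketch of that reverse-noising loop: start from random noise and repeatedly strip away a little of the noise a model predicts. The predict_noise function is a hypothetical stand-in for a trained neural network, and none of this reflects CraftStory’s (or anyone else’s) actual implementation.

```python
import numpy as np

def predict_noise(noisy_sample, step):
    """Hypothetical stand-in for a trained network's noise estimate."""
    # A real diffusion model learns this mapping from data; here we simply
    # treat the whole sample as noise so the loop has something to remove.
    return noisy_sample

def reverse_diffusion(shape, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    sample = rng.standard_normal(shape)               # begin with pure noise
    for step in reversed(range(num_steps)):
        noise_estimate = predict_noise(sample, step)  # estimate the noise present
        sample = sample - noise_estimate / num_steps  # strip away a small fraction of it
    return sample                                     # the "denoised" output

frame = reverse_diffusion((64, 64, 3))                # e.g. one 64x64 RGB frame
print(frame.shape)
```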

To create longer videos using diffusion algorithms, any given use case will demand a larger network, a wider pool of training data and more processing power. With its differentiated approach, CraftStory simultaneously runs a number of smaller diffusion algorithms across the duration of the video being produced, using what are known as “bidirectional constraints” to form the connection points between the algorithms.

How does that work? Put simply, bidirectional constraints connect the algorithms so that two simultaneous processes work interdependently to satisfy a common set of conditions. This technique is especially useful when the software system in question is tasked with complex problem-solving scenarios that would normally require a wide “search space” and considerable computation time.

Erukhimov explains that, with this bidirectional constraints approach, the end of any given piece of video content can influence the way the starting section is created and rendered.
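As a rough, hypothetical illustration of that principle (and not a description of CraftStory’s code), the Python sketch below refines two neighbouring halves of a one-dimensional signal at the same time; a condition pinned to the very end propagates back through the seam the halves share, so the finished start is shaped by the end.

```python
import numpy as np

# Toy example of bidirectional constraints: two processes refine neighbouring
# halves of a signal simultaneously, and a condition on the very end of the
# signal propagates back through the shared seam to influence the start.
rng = np.random.default_rng(1)
first_half = rng.standard_normal(10)
second_half = rng.standard_normal(10)
TARGET_END_VALUE = 5.0                  # the condition imposed on the final sample

for _ in range(5000):
    # each half smooths itself -- a stand-in for its own generation process
    first_half[1:-1] = 0.5 * (first_half[:-2] + first_half[2:])
    second_half[1:-1] = 0.5 * (second_half[:-2] + second_half[2:])

    first_half[0] = first_half[1]       # the start is free, shaped by what follows
    second_half[-1] = TARGET_END_VALUE  # the end is pinned to the required value

    # bidirectional constraint: both halves must agree at the seam they share
    seam = 0.5 * (first_half[-1] + second_half[0])
    first_half[-1] = seam
    second_half[0] = seam

# the condition imposed on the end has pulled the start of the signal towards it
print(first_half[0], second_half[-1])   # both end up at (or very close to) 5.0
```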

“A diffusion algorithm is imperfect and adds small artefacts in objects and poses,” said Erukhimov. “So when you generate a video one time interval after another, the next diffusion algorithm will inherit all artefacts from the previous one and add its own.”

Avoiding artefact accumulation

In order to avoid artefact accumulation, the team runs multiple diffusion algorithms for different time intervals at the same time. They connect them so that the diffusion algorithms working on neighbouring time intervals are aware of each other’s results and generate a consistent video.
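In outline, such a scheme could look something like the hedged Python sketch below, where each time interval is denoised by its own small model and neighbouring intervals are reconciled over a handful of shared frames. Again, denoise_chunk is a hypothetical stand-in, and this is an illustration of the general approach rather than CraftStory’s implementation.

```python
import numpy as np

FRAMES_PER_CHUNK = 16
OVERLAP = 4           # frames shared between neighbouring time intervals
NUM_CHUNKS = 5

def denoise_chunk(chunk, strength=0.1):
    """Stand-in for one diffusion pass over a single time interval."""
    return chunk - strength * chunk    # a real model predicts and removes noise

rng = np.random.default_rng(0)
# each chunk: (frames, height, width) of pure noise to start from
chunks = [rng.standard_normal((FRAMES_PER_CHUNK, 32, 32)) for _ in range(NUM_CHUNKS)]

for _ in range(50):                    # iterative refinement
    # step 1: every interval is denoised independently ("in parallel")
    chunks = [denoise_chunk(c) for c in chunks]

    # step 2: bidirectional constraint between neighbours -- the frames each
    # pair of adjacent chunks shares are averaged so both agree on them
    for left, right in zip(chunks[:-1], chunks[1:]):
        shared = 0.5 * (left[-OVERLAP:] + right[:OVERLAP])
        left[-OVERLAP:] = shared
        right[:OVERLAP] = shared

# stitch the chunks into one video, dropping the duplicated overlap frames
video = np.concatenate([chunks[0]] + [c[OVERLAP:] for c in chunks[1:]], axis=0)
print(video.shape)                     # (64, 32, 32): 16 + 4 * (16 - 4) frames
```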

“We do use some video downloaded from the Internet, but because we’re building human-centric content, the hardest problem is depicting people convincingly – natural gestures, facial expression and believable motion. By training on our own high-frequency footage, we’ve achieved video quality with people that’s comparable to much larger foundation models. That dataset also enables precise gesture control: if you want someone to point at an object, snap their fingers, or make small hand motions, the model can do it reliably because it learned from clear, non-blurred finger and hand detail,” said Erukhimov.

The company filmed professional studio actors with high-frame-rate camera systems to create its pool of training data for the Model 2.0 service. How does this work?

“We hired actors and filmed them in both Europe and the US using our own equipment in professional studios. People’s hands and fingers can move really fast… and they are blurred on standard videos from the Internet. High frame rate footage allowed us to obtain clear images of fingers and played an important role in the model training. Also, we have created ‘AI actors’ from this footage, so that you can create a video with a very distinct personality portrayed by a professional actor by just giving us a script. Actors that collaborated with us on this project receive a revenue share when their likeness is used in customer videos,” explained Erukhimov.

The team says it is “definitely thinking” more about business-to-business use cases than consumers.

Application scenarios

“We want to help companies amplify their stories with videos. Founders, marketers and content creators can record a video of themselves in a home or an office and then transform the scene into a conference, a city square, or even a tourist attraction – places that are otherwise difficult to film in. Agencies can shoot videos with fine-level emotion and gesture control. And this is just the beginning,” said Erukhimov.

CraftStory’s roadmap leads towards what the company says will be a text-to-video model that would allow users to generate long-form content directly from scripts.

“We are about to release an image-to-video model that will generate movement automatically”, said Erukhimov. “A user will need to supply just an image and a script. This release will also include ‘walk-and-talk’ videos that make the delivery way more entertaining. We are also working on an interview format; we posted an interview video created by Model 2.0 to our YouTube channel. Text-to-video mode is coming to our customers in January.”

Thanks to the Internet and social media, Erukhimov thinks that video is “steadily overtaking text” as the primary form of information delivery. He suggests that a sharp decline in video-generation costs will accelerate this trend. 

Image: Google Gemini