Clockwork VP: Neocloud revolution, what AI/ML engineers need to know
This is a guest post for the Computer Weekly Developer Network written by Anita Pandey, vice president of growth at Clockwork.
Clockwork is known for its software-driven fabric service designed to maximise GPU utilisation and make AI workloads resilient to failure. The company's technology runs anywhere and supports any Ethernet, RDMA over Converged Ethernet (RoCE) or InfiniBand fabric.
Pandey’s full title for this piece is “The neocloud revolution, what AI/ML engineers must know for economically viable production” and she writes in full as follows…
The gap between a successful AI proof-of-concept and economically viable, production-scale deployment is a chasm that continues to swallow projects. While big tech invests billions in infrastructure, the promised revenue often lags, a disconnect largely driven by the operational challenges of massive GPU clusters. For cloud-native AI and ML engineers, this landscape shift is giving rise to a new infrastructure class: the neocloud.
What is a neocloud?
A neocloud is emerging as an AI-specialised cloud offering that strikes a balance between the vast, general-purpose scale of hyperscalers (AWS, Azure, GCP) and the tight control of on-premises deployments.
For the AI engineer, the key benefit is one word: economics.
Hyperscalers are often criticised for their unpredictable, token-based inference pricing and inconsistent GPU availability, which cause costs to balloon in production. Neoclouds offer an alternative – in some cases they are presented as potentially 20 times cheaper for AI workloads, pushing towards the outcome-based pricing that CFOs and engineering leadership demand. They also add a credible third option to the familiar deployment dilemma: on-premises vs. hyperscaler vs. neocloud.
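Any "20 times cheaper" figure depends entirely on workload shape and contract terms, but engineers can sanity-check the claim against their own numbers. Here is a back-of-the-envelope sketch in Python; every figure in it (token price, GPU-hour price, throughput) is an illustrative assumption, not a quoted rate.

```python
# Back-of-the-envelope: token-based inference pricing vs. flat GPU-hour
# pricing. Every figure below is an illustrative assumption, not a quote.

TOKENS_PER_MONTH = 5_000_000_000    # assumed monthly inference volume
PRICE_PER_1K_TOKENS = 0.002         # assumed hyperscaler token price ($)
GPU_HOUR_PRICE = 2.50               # assumed neocloud GPU-hour price ($)
TOKENS_PER_GPU_SECOND = 2_000       # assumed sustained serving throughput

token_cost = TOKENS_PER_MONTH / 1_000 * PRICE_PER_1K_TOKENS

gpu_hours = TOKENS_PER_MONTH / TOKENS_PER_GPU_SECOND / 3_600
gpu_cost = gpu_hours * GPU_HOUR_PRICE

print(f"Token-based : ${token_cost:10,.0f} / month")
print(f"GPU-hour    : ${gpu_cost:10,.0f} / month")
print(f"Multiple    : {token_cost / gpu_cost:.1f}x")
```

With these placeholder numbers the multiple comes out around 6x, not 20x – the point is that the gap is real but sensitive to throughput and volume, so plug in your own figures before committing.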
From FLOPs to fault tolerance
Modern AI factories, scaling to dense systems like NVL72 and NVL144, are limited by how often jobs crash. Their density dramatically increases the blast radius of failures, where a single GPU fault can stall an entire training run, leading to billions in wasted capacity.
Cloud-native developers must look for solutions that address this within the compute platform. The missing component is a vendor-neutral software layer that optimises large-scale GPU clusters for real-time observability, fault tolerance and deterministic performance. Such a layer tackles one of the most expensive failure modes in large-scale AI training – job restarts caused by infrastructure faults – by allowing jobs to continue running instead of restarting.
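The baseline technique such a layer improves on is familiar to most ML engineers: periodic checkpointing, which bounds the work lost to any one fault. Below is a minimal, framework-agnostic sketch; the train_step function, checkpoint path and interval are all hypothetical placeholders, not any vendor's API.

```python
import os
import pickle

CKPT_PATH = "checkpoint.pkl"   # hypothetical checkpoint location
TOTAL_STEPS = 10_000
CKPT_EVERY = 500               # trade-off: checkpoint I/O vs. work lost per fault

def train_step(state):
    # Stand-in for one real optimiser step over a batch.
    state["loss"] = 1.0 / (state["step"] + 1)
    return state

def save_checkpoint(state):
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_PATH)   # atomic rename: a fault mid-write cannot corrupt the checkpoint

def load_checkpoint():
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)  # resume from the last saved step
    return {"step": 0, "loss": None}

state = load_checkpoint()   # after a fault, this resumes rather than starting over
while state["step"] < TOTAL_STEPS:
    state = train_step(state)
    state["step"] += 1
    if state["step"] % CKPT_EVERY == 0:
        save_checkpoint(state)
```

Even this baseline still pays a re-load and warm-up penalty on every restart – precisely the cost that a fabric layer keeping jobs alive through faults is meant to eliminate.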
This capability is non-negotiable for production AI, directly translating to higher GPU utilisation and materially improved per-GPU ROI, moving utilisation beyond the current 30–50% average.
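The ROI arithmetic is worth making explicit. At a fixed (assumed) GPU-hour cost, every point of utilisation recovered directly cheapens each useful GPU-hour:

```python
# Illustrative link between utilisation and the cost of useful compute.
# The GPU-hour price is an assumption; the relationship is what matters.

GPU_HOUR_PRICE = 2.50   # assumed all-in cost per GPU-hour ($)

for utilisation in (0.30, 0.50, 0.80):
    cost_per_useful_hour = GPU_HOUR_PRICE / utilisation
    print(f"{utilisation:.0%} utilisation -> ${cost_per_useful_hour:.2f} per useful GPU-hour")
```

Moving a cluster from 30% to 80% utilisation cuts the effective cost of useful compute by well over half, without buying a single extra GPU.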
High model uptime SLAs
Unlike mature CPU clouds with six-nines availability, GPU clouds are notoriously fragile. High-density systems running long-lived distributed training jobs can be crashed by a single link flap or GPU fault, leaving hundreds of GPUs idle. Alibaba's SIGCOMM 2024 study found that approximately 60% of large-scale training jobs experience some form of slowness. At this scale, failure is the norm, not the exception.
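A short sketch shows why. Even with an optimistically low assumed per-GPU fault probability, the chance that a synchronous job sees at least one fault somewhere grows rapidly with cluster size (both the probability and the cluster sizes below are illustrative assumptions, not measured figures):

```python
# Why failure becomes the norm at scale: reliable parts still fail
# often in aggregate.

p = 0.0005   # assumed chance one GPU (or its link) faults during a given job

for n_gpus in (8, 512, 16_384):
    p_any_fault = 1 - (1 - p) ** n_gpus   # P(at least one fault in the job)
    print(f"{n_gpus:>6} GPUs -> {p_any_fault:6.1%} chance the job hits a fault")
```

At 8 GPUs a fault is a rounding error; at tens of thousands, it is a near certainty on every run.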
NOTE: A link flap occurs when a network connection rapidly toggles between up and down, causing instability and packet loss.
Working within a neocloud, especially in bare-metal or hybrid environments, demands a new level of observability. When dealing with turbulent, error-prone fleet clusters, developers need deep, real-time visibility.
A cloud-native monitoring module should not only provide high-level dashboards but also offer truly real-time telemetry, critical for granular job-to-GPU cluster health profiling and for correlating network telemetry with performance metrics such as local timeouts and retransmits.
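As an illustration of the kind of correlation being described, the sketch below pairs synthetic per-step retransmit counts with synthetic step latencies; in a real pipeline these series would come from NIC counters and the training loop's own timers, and the linear degradation model here is purely an assumption.

```python
import random
import statistics   # statistics.correlation requires Python 3.10+

random.seed(0)

# Synthetic stand-ins for telemetry a real pipeline would export:
# per-step NIC retransmit counts and per-step training latencies.
retransmits = [random.randint(0, 20) for _ in range(200)]
step_time_ms = [120 + 3.5 * r + random.gauss(0, 5) for r in retransmits]
# ^ assumed model: step time degrades with retransmits, plus noise

r = statistics.correlation(retransmits, step_time_ms)
print(f"Pearson correlation (retransmits vs. step time): {r:.2f}")
```

A strong positive correlation like this points the on-call engineer at the network fabric rather than the model or the data pipeline – exactly the triage that dashboard-level summaries cannot support.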
Integration flexibility
The future is hybrid, perhaps even more so than our previous notion of what constitutes a hybrid cloud.
Neoclouds facilitate the reality that AI adoption will integrate with trillions of dollars in existing enterprise IT investments. Cloud-native developers should look for deployment models that prioritise security and confidentiality, often deployed on the customer's own infrastructure, allowing for seamless integration with on-premises structured data, IAM and observability stacks without forklifting the existing environment.

