GreenOps - Starburst: Abstracting the complexity of (legacy) distributed data estates
This is a guest post for the Computer Weekly Developer Network written by Jitender Aswani in his role as senior vice president of engineering at Starburst.
Aswani writes in full as follows…
There is a quiet crisis unfolding inside enterprise data architecture. As organisations race to deploy AI that is predictive, generative and increasingly agentic, many are discovering that the infrastructure underpinning their data estate is actively working against them. The symptoms are climbing costs and a growing carbon footprint; the root cause is largely architectural.
Let me explain the challenge.
Legacy BI hangover
We could call the predicament something of a legacy Business Intelligence (BI) hangover. The dominant model – centralise your data estate, then analyse the information at its core – was built for BI, not for AI at scale. This approach drives unnecessary data movement, duplicated storage and inefficient compute usage across hybrid and multi-cloud environments. For BI workloads, the inefficiencies were tolerable; for AI at scale, they are not.
Two key symptoms of this outdated model are poor FinOps discipline and an inability to meet GreenOps goals – and they share the same root cause.
A federated, model-agnostic architecture, where data is accessed in place and governed at query time, fundamentally changes the equation. By abstracting the complexity of distributed data estates, organisations can reduce compute waste, minimise data duplication and align resource consumption with real business demand. This is GreenOps and FinOps done right: not a tacked-on compliance exercise, but baked into the architecture model from the outset, in the same way AI must now be if it is to deliver tangible business value.
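To make the query-in-place idea concrete, here is a minimal sketch using the open source Trino Python client (pip install trino). The coordinator address, catalogs and table names (hive.sales.orders, postgresql.crm.customers) are hypothetical stand-ins for whatever a real estate exposes; the point is that a single query joins both systems where the data sits, with no ETL copy in between.

```python
# Minimal sketch of query-in-place federation via the Trino Python
# client. All hostnames, catalogs and tables are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # assumed coordinator address
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One statement joins a data-lake table (hive catalog) with an
# operational database (postgresql catalog), each queried in place;
# no pipeline copies either dataset first.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM hive.sales.orders o
    JOIN postgresql.crm.customers c ON o.customer_id = c.id
    GROUP BY c.region
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```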
Architectural minimalism
Bridging the gap between legacy data estates and today’s AI imperative requires us to embrace architectural minimalism. A redundant ETL process is not just a waste of time, energy and money; it’s a major contributor to a firm’s carbon footprint and cloud bill.
Using a high-performance query engine that interrogates data where it lives reduces cloud billing headaches. When we do this at scale (which we typically will), we can process petabytes of data far more efficiently and treat the distributed data estate as a single virtualised source of truth.
Decoupling storage from compute might seem like a leap for some teams, but it’s not. This is a fundamental shift that pays off very quickly.
Scaling data analytics used to mean large step-changes in storage cost, and data warehouses soon became bloated, lumpy and difficult to navigate; not to mention expensive. Now, with data held in lower-cost object storage, we can spin up compute resources only for the duration of a query. It’s an on-demand model that speaks to the heart of GreenOps in terms of energy consumption and operational efficiency.
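As a rough sketch of that separation (all names, including the iceberg catalog, the web.events table and the cluster endpoint, are illustrative assumptions): the data files sit in low-cost object storage, while the compute that scans them is sized and billed independently and exists only while queries run.

```python
# Sketch: storage and compute decoupled. The table's files live in
# object storage; the Trino cluster that scans them can be spun up
# for the query and torn down afterwards. Names are illustrative.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="analyst",
    catalog="iceberg", schema="web",  # assumed Iceberg catalog over S3
)
cur = conn.cursor()

# The engine reads the table's files straight from object storage for
# the duration of the query; no warehouse copy is loaded or kept warm.
cur.execute("""
    SELECT date_trunc('day', event_time) AS day, count(*) AS events
    FROM events
    WHERE event_time >= current_date - INTERVAL '7' DAY
    GROUP BY 1
    ORDER BY 1
""")
print(cur.fetchall())
```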
Defying data gravity
Just to reinforce the point, those legacy systems will not survive the AI era intact. Every time a model training job needs to pull petabytes of data across different workload pipelines in different locations, we start to hear the gears creaking. Gartner analysts appear to have enjoyed defining this as “data gravity”: the propensity for both applications and compute to cluster around data, rather than the other way around.
If we accept Gartner’s definition and apply it in an AI context, the gravitational pull carries a steep price, both in terms of cloud egress costs and in terms of the carbon footprint required to shuttle data across a distributed infrastructure at massive scale.
One plug for how we’re enabling this, if I may. Trino was created to solve a big data problem: querying and analysing petabyte-scale data across the data lake and disparate data sources. The creators of Trino later founded Starburst to help organisations extract the most value from their Trino investments and we’re now enhancing Trino’s core functionality with enterprise-grade features that improve performance, scalability, security and usability.
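For context on how those disparate sources get attached: Trino exposes each one as a catalog, defined by a small properties file on the coordinator. The connector.name keys below are real Trino configuration; the hostnames and catalog choices are illustrative assumptions that match the earlier sketch.

```properties
# etc/catalog/hive.properties: a metastore-backed data lake
connector.name=hive
hive.metastore.uri=thrift://metastore.example.com:9083

# etc/catalog/postgresql.properties: an operational database
connector.name=postgresql
connection-url=jdbc:postgresql://pg.example.com:5432/crm
connection-user=trino
```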
Ultimately, we’ll be able (or at least more able) to talk about baked-in GreenOps and FinOps.
Baked-in GreenOps & FinOps
With the responsibility to comply with the Corporate Sustainability Reporting Directive (CSRD) now facing every organisation (even before a single workload is deployed), adopting a federated, query-in-place model is a sensible route to making sure GreenOps and FinOps are baked into an organisation’s architectural DNA.
In the AI era, cost efficiency and sustainability are no longer optimisation problems – they are architecture decisions.
Starburst provides a high-performance data lakehouse platform powered by the above-mentioned Trino (a fast, distributed SQL query engine). It allows data science and software application development teams to run distributed SQL queries across disparate data sources, spanning clouds and traditional databases, without moving or copying data.

