Astronomer: Why data pipelines control the flow of AI
AI ‘eats’ data. We all know the garbage-in, garbage-out maxim and we understand why AI intelligence is inherently artificial: it only knows what we tell it, notwithstanding the new creative agentic strains now making all the headlines.
But this is not a shiny surface-level AI story; this is an examination of that which lies beneath.
What needs to be told next is a data pipeline story, a back-office systems story, a data provenance and management story… and a story that explains why, despite all the front-end hype, the infrastructure layer is now becoming the intelligence layer.
The agentic rush factor
Keen to know more, the Computer Weekly Developer Network (CWDN) sat down with Julian LaNeve, CTO of open source orchestration software company Astronomer, to discuss the ramifications of everyone rushing to deploy AI agents for data work. In a world where software engineers now task agents with writing SQL, debugging pipelines and performing unit tests, many agents are failing because they don’t understand how enterprise data platforms actually work. Why is this so?
“The type of agents we are talking about here don’t know what data tables are reliable, they don’t know which data pipelines are breaking, who owns what, or how changes ripple through the system,” said LaNeve. “Unsurprisingly, the orchestration layer turns out to be where all this crucial context naturally lives. Every time a pipeline runs, fails, or succeeds, it’s creating a rich trail of metadata. Every data transformation, every quality check, every usage pattern gets logged. It’s like a comprehensive flight recorder for the entire data platform.”
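The “flight recorder” idea can be sketched in a few lines of plain Python. The class and field names below are illustrative, not Astronomer’s API: the point is simply that every task run appends a structured metadata event, which agents or humans can later query for failures and lineage.

```python
from dataclasses import dataclass, field


@dataclass
class RunEvent:
    """One metadata record: which task ran, in which pipeline, and how it ended."""
    pipeline: str
    task: str
    status: str          # "success" or "failed"
    duration_s: float
    inputs: list[str] = field(default_factory=list)   # upstream data assets read
    outputs: list[str] = field(default_factory=list)  # downstream data assets written


class FlightRecorder:
    """Append-only log of run events, queryable by pipeline or by asset."""

    def __init__(self):
        self.events: list[RunEvent] = []

    def record(self, event: RunEvent) -> None:
        self.events.append(event)

    def failures(self, pipeline: str) -> list[RunEvent]:
        return [e for e in self.events
                if e.pipeline == pipeline and e.status == "failed"]

    def consumers_of(self, asset: str) -> set[str]:
        """Which tasks read a given asset -- a crude lineage lookup."""
        return {e.task for e in self.events if asset in e.inputs}


recorder = FlightRecorder()
recorder.record(RunEvent("daily_sales", "extract_orders", "success", 12.4,
                         outputs=["raw.orders"]))
recorder.record(RunEvent("daily_sales", "build_report", "failed", 3.1,
                         inputs=["raw.orders"]))

print(recorder.consumers_of("raw.orders"))  # tasks that depend on raw.orders
print(len(recorder.failures("daily_sales")))  # failed runs in this pipeline
```

Because the recorder sits where every run already passes through, the metadata accumulates as a side effect of normal operation rather than as an extra instrumentation step.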
Looking at his company’s own platform play, LaNeve reminds us that Astronomer launched its Astro Observe service at the start of this year, a move many saw as an affirmation of the company’s intention to move into the wider data operations platform market. It’s all about organisations productively “operationalising AI” initiatives so that they progress from mere prototype testbeds into a profitable part of the value chain feeding the modern IT stack.
Complete view of data lineage
Astro Observe works by enabling data science teams, developers and operations staff to monitor and troubleshoot data. The company calls it a “complete view of the lineage and health of data” at every point along the supply chain, i.e. it is not limited to data lakes or warehouses. It allows users to zero in, down to the asset and task level within data workflows, to understand and remedy bottlenecks.
“Our team at Astronomer is realising we’re sitting on a goldmine of context that nobody else has access to. While everyone else is trying to bolt AI onto their data stack after the fact, the orchestration companies can build ‘AI-native’ platforms from the ground up – because they already have the context layer that makes AI actually useful,” said LaNeve.
He suggests that the timing couldn’t be better. Citing the company’s own research, he says “nearly 90%” of enterprises are already being asked to productionise AI for data work, but most are struggling because their AI agents are flying blind. The companies that crack the context problem first will have a massive advantage in the AI-powered data platform race.
Delving deeper
CWDN: Is there any reverse engineering or retrofitting needed to apply this kind of technology? How complex is deployment and what extra factors should be taken into account before going all-in on Astronomer?
Astronomer’s LaNeve: It’s all about data ‘in-context’ and don’t forget the orchestration layer.
LaNeve: It’s one thing to have access to this interesting set of metadata… and it’s another thing to make it easily searchable and retrievable by agents and humans. We’re very focused on making it usable for both agents and humans, with a focus on having it work “out of the box” – just by nature of someone running their pipelines / Airflow with us.
CWDN: Is this kind of approach to data pipeline lineage applicable to all environments? What about highly regulated data, extreme remote IoT sensor data or air gapped environments?
LaNeve: It’s generally even more applicable in highly regulated environments. In my experience, highly regulated environments require even better lineage coverage and observability metadata because of how critical the use cases can be. Understanding where your data comes from, how it’s transformed and how it’s used is extremely critical.
CWDN: When AI becomes more fully embedded and integrated into working IT stacks, how do we ensure that data monitoring and troubleshooting follow suit and are ingrained from the start?
LaNeve: This is where relying on an open source community to help do the work can be very helpful. Apache Airflow is the standard for data orchestration – and because of that, it comes with a lot of prebuilt integrations to interact with data platform tools and extract lineage and other observability metadata. Using something like Airflow means you’ll have the right things baked in from the start.
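The “baked in from the start” point can be illustrated with a toy version of the pattern Airflow uses: tasks declare up front which data assets they read and write, and a lineage graph falls out of those declarations automatically. The code below is a plain-Python sketch of that declare-your-inputs-and-outputs principle, not Airflow’s actual API.

```python
# Toy dataset-aware pipeline: each task declares the assets it reads (inlets)
# and writes (outlets), and lineage edges are derived from the declarations.
# Illustrative only -- not Airflow's API, though the principle is the same.

class Task:
    def __init__(self, name, inlets=(), outlets=()):
        self.name = name
        self.inlets = list(inlets)    # data assets this task reads
        self.outlets = list(outlets)  # data assets this task writes


def lineage_edges(tasks):
    """Connect any task that writes an asset to every task that reads it."""
    edges = set()
    for producer in tasks:
        for asset in producer.outlets:
            for consumer in tasks:
                if asset in consumer.inlets:
                    edges.add((producer.name, consumer.name, asset))
    return edges


pipeline = [
    Task("extract_orders", outlets=["raw.orders"]),
    Task("clean_orders", inlets=["raw.orders"], outlets=["staging.orders"]),
    Task("build_report", inlets=["staging.orders"]),
]

for producer, consumer, asset in sorted(lineage_edges(pipeline)):
    print(f"{producer} -> {consumer}  (via {asset})")
```

Because the declarations live in the pipeline definition itself, lineage coverage does not depend on anyone remembering to instrument the platform afterwards, which is the “right things baked in from the start” argument in miniature.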
CWDN: What aspects of this marketplace do you see shaping developments over the next 18 months, or even the next five years?
LaNeve: I think there’s going to be a big fight to “own the context” in the immediate future. Ultimately, context is a great source of differentiation for companies. In the same way data (and how you use it) has historically been a moat, context is just another form of data. As frontier models continue getting even better, having a rich set of context positions you well to build around the models so everyone has a strong incentive to capture as much of it as possible.
For data platforms specifically, LaNeve says we need to start capturing context from the orchestration layer, because that means “starting from the finish line”: orchestration is already hooked into every part of an organisation’s data platform.
Astronomer’s LaNeve ‘tweets’ on X here.