AWS data & analytics VP: Charting new oceans, the evolution of data lakes

Data lakes represent an ocean of information resources… and much like our blue planet’s waters, many of the depths remain uncharted and occasionally mysterious.

Our notion of data lakes has traditionally centred on typically unstructured data resources (with some semi-structured data in the mix, depending on the information stream in question) that traditional databases, application data repositories and core analytics and processing systems find too tumultuous to handle at once, or in real time. Data lakes give us a water tank to hold the ocean and its tides until we’re ready to set sail.

But the structure and use of data lakes is changing.

Just as Magellan, Drake, Columbus and da Gama must have felt when they realised they could harness a new level of control over our seas, the oceanic qualities of the data lake are now streaming (water-based pun intended, sorry) into new channels of data innovation due to a fundamental shift in the way we can now work with these information resources.

AWS vice president of technology (data and analytics) Mai-Lan Tomson Bukovec spoke to the Computer Weekly Developer Network team at this year’s AWS re:Invent 2025 in Las Vegas to explain how she sees an evolution happening in the data lake domain.

She reminds us that data lakes have evolved into general-purpose storage for the modern business and notes that, according to AWS, over a million data lakes now run on its S3 storage services.

New forms & data structures

But change is afoot. Her team is seeing data lakes expand from unstructured data (such as images, video and PDF files) to handle more tabular data – S3 now holds exabytes of Parquet files (Apache Parquet is a columnar file format for big data analytics that stores data by columns rather than rows) – and now vectors.
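The row-versus-column distinction behind Parquet can be illustrated in a few lines of plain Python. This is a toy sketch of the storage idea only, not the Parquet format itself; the field names and figures are invented for illustration.

```python
# Row layout: each record is stored together -- good for fetching whole records.
rows = [
    {"id": 1, "city": "Las Vegas", "temp_c": 31.0},
    {"id": 2, "city": "Seattle",   "temp_c": 18.5},
    {"id": 3, "city": "London",    "temp_c": 14.2},
]

# Column layout: each field is stored contiguously -- good for analytics,
# because scanning one column never touches the others.
columns = {
    "id":     [1, 2, 3],
    "city":   ["Las Vegas", "Seattle", "London"],
    "temp_c": [31.0, 18.5, 14.2],
}

def avg_temp_rowwise(rows):
    # Must walk every record even though only one field is needed.
    return sum(r["temp_c"] for r in rows) / len(rows)

def avg_temp_columnar(columns):
    # Reads exactly one column.
    col = columns["temp_c"]
    return sum(col) / len(col)

assert avg_temp_rowwise(rows) == avg_temp_columnar(columns)
```

The answer is the same either way; the point is that an analytical query over one field of a billion-record dataset only has to read that field’s column, which is why columnar formats dominate big data analytics.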

“Vectors is the language of AI and it provides a window into the meaning of an organisation’s data in the data lake, or in its knowledge base. AI embeddings models [see this Computer Weekly story for more on embeddings] are evolving at a rapid pace and so any business now can capture the meaning of data in vectors. It means you can ask questions of your data without having to already understand what is IN your data. In a world where data keeps growing incredibly fast (IDC says that data grows at 27% year-on-year, but many of our customers grow much faster than that), it’s incredibly important to be able to find what you need in an ever-increasing data set – that’s what you get with vectors,” explained Tomson Bukovec, with her usual relaxed air of compelling enthusiasm for data science.

The AWS team tells us that the price point and scale of S3 Vectors is going to make it possible to use vectors ubiquitously for AI chat and other applications.
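The mechanics of asking questions of your data via vectors can be sketched in pure Python. The three-dimensional “embeddings” and file names below are hand-written stand-ins – a real system would use an embeddings model producing hundreds or thousands of dimensions and a vector store such as S3 Vectors – but the similarity-ranking step is the same idea.

```python
import math

def cosine(a, b):
    """Cosine similarity: values near 1.0 mean 'pointing the same way' (similar meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 3-dimensional "embeddings" for three objects in a data lake.
documents = {
    "invoice_2024.pdf":  [0.9, 0.1, 0.0],   # finance-ish direction
    "holiday_photo.jpg": [0.0, 0.2, 0.9],   # imagery-ish direction
    "q3_budget.xlsx":    [0.8, 0.3, 0.1],   # finance-ish direction
}

def search(query_vec, docs, k=2):
    """Return the k documents whose vectors are closest to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

# A "what do we have on spending?" style query lands near the finance documents,
# with no need to know in advance what the data set contains.
top_hits = search([0.85, 0.2, 0.05], documents)
```

Here `top_hits` comes back as the two finance-flavoured files, illustrating the point in the quote above: the query finds conceptually related data without the user already knowing what is in the lake.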

Vectors boost agentic memory & context

Looking more closely, Tomson Bukovec tells us that with the latest AWS service launches, users get up to two billion vectors per index and support for up to 20 trillion vectors in a single vector bucket. With S3 Vectors, users can get up to 90% lower costs for uploading, storing and querying vectors compared to alternatives. In addition, a latency of 100 milliseconds or less for warm queries means that the service can be used for many types of applications and agentic infrastructures.

AWS vice president of technology (data and analytics) Mai-Lan Tomson Bukovec.

She says that what many AI-based companies and applications are discovering is that use of vectors helps extend AI agent memory and context. By storing more context about a user’s questions and behaviour in vectors, AI agents can be far more personalised and “human” in their responses.
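The agent-memory pattern described here can be sketched as a store of (text, vector) pairs that the agent queries by similarity before answering. Everything below is illustrative – the class name, the hand-written two-dimensional vectors and the remembered facts are invented; a production system would persist vectors in a service such as S3 Vectors and embed text with a real model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class AgentMemory:
    """Toy long-term memory: store (text, vector) pairs, recall by similarity."""

    def __init__(self):
        self.items = []                      # list of (text, vector) pairs

    def remember(self, text, vector):
        self.items.append((text, vector))

    def recall(self, query_vector, k=1):
        """Return the k stored texts whose vectors best match the query."""
        ranked = sorted(self.items,
                        key=lambda item: cosine(query_vector, item[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

memory = AgentMemory()
memory.remember("User prefers metric units", [0.9, 0.1])
memory.remember("User asked about flight prices to Lisbon", [0.1, 0.9])

# Before answering a new travel question, the agent pulls the most relevant
# piece of past context rather than replaying the entire history.
context = memory.recall([0.2, 0.8], k=1)
```

Because only the nearest memories are retrieved, the store can grow without limit while each response stays focused – which is the “virtually unlimited context” point made below.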

Given S3’s price point and capabilities, vectors create a virtually unlimited way to store more context for use by AI – far more, far faster than any human could manage. There are already some interesting examples of how people are using vectors right now.

Semantic similarity & structured SQL

“BMW Group uses Amazon S3 Vectors as a core component of its hybrid search solution, combining semantic similarity with structured SQL filtering for users to find conceptually related records while applying precise business logic constraints,” clarified Tomson Bukovec. 

This approach is particularly effective for queries like “find corrosion issues in F09 vehicles from the last quarter,” where both semantic understanding and structured filtering are essential, allowing BMW employees to accurately locate information and extract insights from 20 petabytes of structured and unstructured data using natural language.
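The hybrid pattern in that example – a precise SQL filter narrowing candidates, then semantic similarity ranking what survives – can be sketched with the standard-library sqlite3 module. The schema, report texts and two-dimensional toy embeddings below are invented for illustration and are not BMW’s actual data model.

```python
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Illustrative schema only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reports (id INTEGER, model TEXT, quarter TEXT, text TEXT)")
db.executemany("INSERT INTO reports VALUES (?, ?, ?, ?)", [
    (1, "F09", "2025-Q3", "rust spots found under door seals"),
    (2, "F09", "2025-Q3", "infotainment screen flickers at start-up"),
    (3, "F09", "2024-Q1", "paint bubbling near wheel arches"),
    (4, "G20", "2025-Q3", "corrosion on exhaust mounting bracket"),
])

# Toy embeddings keyed by report id; a real system would embed the report text.
embeddings = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.8, 0.2], 4: [0.9, 0.2]}
query_vec = [1.0, 0.0]   # stands in for an embedding of "corrosion issues"

# Step 1: the structured SQL filter applies the precise business constraints.
candidates = db.execute(
    "SELECT id, text FROM reports WHERE model = ? AND quarter = ?",
    ("F09", "2025-Q3"),
).fetchall()

# Step 2: semantic similarity ranks whatever survives the filter.
ranked = sorted(candidates,
                key=lambda row: cosine(query_vec, embeddings[row[0]]),
                reverse=True)
best = ranked[0][1]
```

Note that report 4 is semantically the closest match for “corrosion” but is correctly excluded because it concerns a G20, not an F09 – exactly the combination of conceptual relatedness and hard business-logic constraints the quote describes.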

“The solution generates vector embeddings using Amazon Bedrock’s Titan Text Embeddings V2 model and stores them in S3 Vectors for scalable, low-cost vector search across millions of records,” said Tomson Bukovec. “By combining these semantic embeddings with Amazon Athena and Amazon Bedrock–powered reasoning, BMW delivers a unified search experience that blends semantic similarity, SQL filtering and exhaustive AI-powered analysis. This architecture removes traditional data discovery barriers, i.e. no schema knowledge or complex queries required, so users of any technical level can reduce the time spent finding the right data and surface product quality insights faster and more intuitively across engineering, manufacturing and customer experience in BMW’s global operations.”

MIXI, a consumer tech company (photo sharing/social media), is adopting Amazon S3 Vectors to build flexible, metadata-aware semantic search capabilities that scale to serve its FamilyAlbum photo-sharing community of more than 27 million users.

“The fully managed infrastructure in use here greatly simplifies operations compared to self-managed search systems, allowing the team to focus on delivering new AI-powered features. With plans to index roughly 400 million vectors across 100 indexes, S3 Vectors provides MIXI with the performance and cost efficiency needed to expand semantic search, powering future experiences like personalised photo print recommendations for every user,” explained Tomson Bukovec.

In another example, Spice AI integrated S3 Vectors with open table formats to provide hybrid SQL and vector search directly on S3 so that semantic and lexical queries can run side-by-side. That meant it turned raw S3 objects into immediately searchable knowledge without complex data movement. 

A new epicentre of intelligence

With Spice’s in-memory acceleration on top of AWS data warehouses and S3 data lakes, the team got millisecond response times for latency-sensitive workloads, so that AI agents could surface relevant context instantly. Together, says Tomson Bukovec, Spice AI and S3 Vectors give teams a practical way to operationalise data lakes for enterprise AI applications; developers can build applications that are both more responsive and more reliable, since S3 Vector indexes inherit S3’s durability, elasticity and economics. 

“Customers have seen 100x query performance improvement and 2x data redundancy leveraging Spice with S3 Vectors. Spice AI is a great example of how to run open-source AI workloads, combining open data access with native vector capabilities and trusted infrastructure to simplify retrieval-augmented AI at scale,” concluded Tomson Bukovec.   

Love your lake

All of this discussion, then, perhaps leads us to a point where we sit back and think about how “newly important” the poor old murky morass of the data lake could now be. This oft-unloved “bung it all there” storage tank has variously evolved into data warehouses and data lakehouses over the last decade and a half, since the term data lake was first coined in 2010 – and it almost leads us to ask: who knew?

AWS did, it appears.

Tomson Bukovec’s parting words of advocacy and evangelism in this space centre on the fact that her team is seeing the S3 data lake evolve from multi-modal storage to becoming the epicentre of data-driven AI intelligence. There are new opportunities here for data science evolution, but perhaps… still wear a wetsuit for now.