Data engineering - Domino Data Lab: Adapting engineering practices for the AI era
This is a guest post for the Computer Weekly Developer Network written by Jarrod Vawdrey, field chief data scientist at Domino Data Lab.
Vawdrey writes in full as follows…
In 2023, data scientists at a major financial services industry (FSI) firm built sophisticated ML models to predict market trends. The models failed spectacularly, resulting in massive missed trading opportunities and flawed investment decisions. The culprit wasn’t the algorithms or the scientists’ expertise – it was inconsistent underlying data.
What should have been the same value differed wildly across systems: $10 in one database, $100 in another and $1000 in a third.
Meanwhile, as enterprises race to adopt AI, technology providers training large language models (LLMs) face a challenge on a different scale: ingesting and transforming petabytes of internet data while ensuring quality, provenance and compliance. These contrasting scenarios illustrate how data engineering has evolved from traditional database management into a critical strategic function.
Garbage in, catastrophe out
The axiom “garbage in, garbage out” has never been more relevant.
As organisations rush to implement AI, they’re discovering that data engineering – long viewed as mere “plumbing” – is actually the foundation of AI success. Data engineers are no longer simply writing ETL scripts and managing cron jobs. They’re architecting complex data pipelines that must bridge legacy operational systems with modern AI requirements.
Consider the FSI example.
Traditional banking systems served specific operational needs – payment processing, account management or regulatory reporting. Each system maintained its own data structures, optimised for its particular function. This worked until ML applications demanded a holistic view of customer behaviour. Suddenly, data engineers faced the challenge not just of moving data between systems, but of reconciling discrepancies, establishing golden records and transforming normalised operational data into the flattened, feature-rich datasets that ML models require.
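As a simplified illustration, the sketch below reconciles conflicting balances into a golden record and flattens event-level data into the one-row-per-customer feature table an ML model would consume. The table names, column names and survivorship rule are hypothetical, not a real bank schema.

```python
# A minimal sketch, assuming pandas; all table/column names are illustrative.
import pandas as pd

# Normalised operational extracts: one row per event or per source system.
payments = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [120.0, 80.0, 45.0],
})
balances = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "balance": [10.0, 1000.0, 250.0],           # same customer, conflicting values
    "source": ["core_banking", "reporting_db", "core_banking"],
})

# Golden record: resolve conflicts with an agreed survivorship rule
# (here, simply trust the core banking system ahead of the reporting copy).
priority = {"core_banking": 0, "reporting_db": 1}
golden = (balances
          .assign(rank=balances["source"].map(priority))
          .sort_values("rank")
          .drop_duplicates("customer_id")
          .drop(columns=["rank", "source"]))

# Flatten into the feature-rich, one-row-per-customer shape ML models expect.
features = (payments.groupby("customer_id")
            .agg(payment_count=("amount", "size"),
                 payment_total=("amount", "sum"))
            .reset_index()
            .merge(golden, on="customer_id", how="left"))
print(features)
```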
Real-time data engineering for AI
Data engineering’s technical evolution illustrates this dramatic shift.
Traditional approaches relied on nightly batch ETL jobs moving data between databases, with basic validation checks and error logging. Modern data engineering involves real-time streaming pipelines with automated quality gates, sophisticated data validation and automated feature engineering – all while maintaining clear lineage for compliance. These pipelines must handle structured and unstructured data, processing millions of records per second while ensuring data quality and regulatory compliance.
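The sketch below shows, in a deliberately framework-agnostic way, what such an automated quality gate might look like: events are validated in flight and failures are routed to a quarantine step instead of flowing downstream. The field names, business rules and quarantine sink are all assumptions; in production this logic would typically run inside a stream processor such as Kafka Streams, Flink or Spark Structured Streaming.

```python
# A simplified quality-gate sketch; field names and rules are illustrative.
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Record:
    customer_id: int
    amount: float
    currency: str

def quarantine(event: dict, reason: str) -> None:
    # In a real pipeline this would publish to a dead-letter queue with
    # lineage metadata (source, offset, timestamp) for later inspection.
    print(f"quarantined {event}: {reason}")

def quality_gate(stream: Iterable[dict]) -> Iterator[Record]:
    """Validate each event in flight; route failures to quarantine."""
    for event in stream:
        try:
            record = Record(**event)              # schema check
            if record.amount < 0 or record.currency not in {"USD", "GBP", "EUR"}:
                raise ValueError("failed business-rule check")
            yield record                          # clean events flow downstream
        except (TypeError, ValueError) as err:
            quarantine(event, reason=str(err))

# Usage: downstream feature engineering only ever sees validated records.
events = [{"customer_id": 1, "amount": 120.0, "currency": "USD"},
          {"customer_id": 2, "amount": -5.0, "currency": "USD"}]
for rec in quality_gate(events):
    print(rec)
```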
The challenge of training LLMs marks a further departure from traditional data engineering. Engineers must design systems to scrape, process and validate massive amounts of unstructured internet data. This requires a deep understanding of distributed systems, hardware capabilities and software architectures. Engineers must implement sophisticated validation pipelines, track data provenance and ensure training data quality at a new scale.
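A minimal sketch of provenance tracking for such a pipeline might look like the following. The clean_text and is_high_quality helpers are hypothetical placeholders for far more elaborate filtering, deduplication and policy checks; the point is that every surviving document carries its source, fetch time and content hash.

```python
# Provenance-tracking sketch for scraped training data; helpers are placeholders.
import hashlib
import json
from datetime import datetime, timezone

def clean_text(raw: str) -> str:
    return " ".join(raw.split())                  # placeholder normalisation

def is_high_quality(text: str) -> bool:
    return len(text) > 200                        # placeholder quality heuristic

def process_document(raw: str, source_url: str) -> dict | None:
    text = clean_text(raw)
    if not is_high_quality(text):
        return None                               # drop low-quality documents
    return {
        "text": text,
        "provenance": {                           # retained for audit and compliance
            "source_url": source_url,
            "fetched_at": datetime.now(timezone.utc).isoformat(),
            "content_sha256": hashlib.sha256(text.encode()).hexdigest(),
        },
    }

doc = process_document("Example page body ... " * 50, "https://example.com/page")
if doc:
    print(json.dumps(doc["provenance"], indent=2))
```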
AI’s emergence has transformed how data engineers collaborate. Close partnership with data scientists and ML engineers is essential to move models into production and implement feature engineering pipelines. Data engineers must work with business stakeholders to understand data lineage and establish golden records. They must coordinate with IT teams to ensure infrastructure supports intensive, data-driven AI workloads.
This cross-functional collaboration has elevated these backend specialists to key strategic partners in AI initiatives.
The ‘business’ of data engineering
Success in these modern contexts requires data engineers to understand technical architecture and business context.
In FSI, they must grasp how different systems interact, what regulators require and the business logic behind data reconciliation. In our earlier example, the data engineer took corrective action by mapping how the data flowed across three disparate databases, weighing Consumer Financial Protection Bureau regulatory requirements and, working with the business, consolidating the conflicting data points into a single corrected value. For LLM training, they must architect systems that handle diverse data formats while maintaining clear audit trails of data sources and transformations, as the complexities of handling scraped internet data, including text, code and files, now stretch data pipelines to their limits.
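Applied to the three-database example, the consolidation step might look something like the sketch below. The source names and trust order are assumptions; what matters is that the conflicting values and the reconciliation decision are recorded alongside the surviving value, preserving an audit trail.

```python
# Illustrative consolidation with an audit trail; sources and rule are hypothetical.
from datetime import datetime, timezone

observations = [
    {"source": "payments_db", "value": 10.0},
    {"source": "crm_db", "value": 100.0},
    {"source": "ledger_db", "value": 1000.0},
]

# Business-agreed survivorship rule: the ledger is treated as the system of record.
TRUST_ORDER = ["ledger_db", "payments_db", "crm_db"]

def consolidate(obs: list[dict]) -> dict:
    by_source = {o["source"]: o["value"] for o in obs}
    for source in TRUST_ORDER:
        if source in by_source:
            return {
                "value": by_source[source],
                "chosen_source": source,
                "rule": "trust-order survivorship",
                "conflicting_values": by_source,  # kept for the audit trail
                "decided_at": datetime.now(timezone.utc).isoformat(),
            }
    raise ValueError("no trusted source available")

print(consolidate(observations))
```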
The days when data engineers were “plumbing experts” are over.
Organisations know that successful AI implementation requires elevating today’s data engineers from a support role to a strategic function. They must upskill to leverage modern enterprise AI platforms that provide automated workflows, a system of record, environment reproducibility and collaboration with key stakeholders, while navigating governance and regulatory requirements, including Model Risk Management. This evolution creates strategic partners who enable robust, compliant and scalable AI implementations through expertise in modern data architectures, automated governance frameworks and the delivery of accurate, trustworthy data.
As data engineers become a critical bridge between traditional IT systems and modern AI capabilities, they must tackle five emerging challenges:
- Increased regulatory and governance requirements
- Added complexity in handling diverse unstructured data owned by different stakeholders
- Ever-growing numbers of platforms, frameworks and tools
- Increasing pressure to standardise on one data and infrastructure provider in today’s hybrid and multi-cloud reality
- Differing needs of data engineers, IT and AI practitioners
Whether reconciling financial data or training the next generation of AI models, data engineers are the guardians of the most crucial resource in modern business: high-quality, reliable data that powers the future of AI innovation.
Just ask the financial firm that learned this lesson the hard way.