This is a guest post for the Computer Weekly Developer Network written by Zuzanna Stamirowska, co-founder and CEO at Pathway.
Pathway bills itself as the ultimate data processing framework for the AI era.
Stamirowska writes in full as follows…
Exploring transformative use cases of Large Language Models (LLMs) has become a key priority in organisations across many industries.
However, one of the greatest challenges many developers subsequently face is moving pilot LLM applications into production. The difficulties arise with the data pipelines underpinning these models and the engineering investment that is typically needed to maintain the applications at scale, and fixing them retrospectively is difficult.
The crux of the issue lies in how LLMs are trained. Most applications are built on batch data uploads, which means the output of the model is only as accurate as the last batch upload. ChatGPT is a good example of this issue: prompt the model with a particular query and chances are you’ll be presented with information that’s a little out of date.
The accuracy and quality of responses generated by LLMs are radically improved when based on fresh data, and fresh data is of course fundamental to any use case that involves status alerting. However, such updates can only be offered in the moment when the data is continually updated. If the application waits for the next batch data upload, the opportunity to respond may have passed. For example, an LLM used to monitor changes in regulations, contracts and similar documents is effectively useless without a timely data feed; being alerted to an issue or delay a few hours after the fact is of limited use to an enterprise.
The further advantage of freshly updated LLMs relates to the ability to self-correct.
When an LLM is trained on static data uploads, it is not in a continuous state of learning, unlike a human, and therefore cannot iteratively ‘unlearn’ information it was previously taught when that information is later found to be inaccurate or false, or becomes outdated.
Ensuring valuable impact
So, where does the problem lie for businesses in moving LLM applications from pilot to fully operational production? Let’s look first at the case where an application has been built using static data uploads. Where applications rely on up-to-date data, even if not real-time data specifically, the engineering requirements are massive. At least one data engineer, if not a small team, will be required to manage the continuous need to process batch data uploads. This is a rare and expensive skillset to tether to a single task, not to mention that there are undoubtedly more interesting and fulfilling roles that could entice such team members away.
The second option is to build LLM applications from the pilot stage so that they can be continuously updated without requiring a full-time data engineer to maintain them. The key to overcoming this challenge is building an LLM data pipeline that can combine different data workflows: batch uploads and live connector or API-based feeds.
Combining the two essentially takes on the hard job of plumbing the data pipeline with integrated, transformed data that can be used to feed the LLM. Yet doing so is no mean feat.
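To make the idea of combining batch and live feeds concrete, here is a minimal Python sketch. It assumes both sources can be normalised into one record shape and applied to a single index, with the newest version of each document winning. The `Record` type, the `apply_records` helper and the sample data are all hypothetical illustrations, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Record:
    """Common shape for documents from any source (hypothetical)."""
    doc_id: str
    text: str
    updated_at: int  # assumed to be a comparable timestamp, e.g. Unix seconds


def apply_records(index: Dict[str, Record], records: Iterable[Record]) -> None:
    """Upsert records into the index, keeping the newest version per doc_id."""
    for rec in records:
        current = index.get(rec.doc_id)
        if current is None or rec.updated_at >= current.updated_at:
            index[rec.doc_id] = rec


# Hypothetical batch upload, e.g. a nightly export.
batch_snapshot = [
    Record("contract-1", "Payment due in 30 days.", 100),
    Record("contract-2", "Delivery by Q3.", 100),
]

# Hypothetical live feed, e.g. events from a connector or API.
live_events = [
    Record("contract-1", "Payment due in 14 days.", 150),
    Record("contract-3", "New supplier onboarded.", 160),
]

index: Dict[str, Record] = {}
apply_records(index, batch_snapshot)  # initial load
apply_records(index, live_events)     # incremental updates on top
```

Because both workflows pass through the same function, a late-arriving batch file cannot overwrite a newer live update, and whatever the LLM application retrieves from `index` reflects the latest known state of each document.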
Designing the streaming document workflows that underpin real-time data integration is a complex task that needs specialist skill sets. These skills are typically more advanced than those of the broader data team, which tends to lead to workstream silos. There are other considerations too, namely data preparation and cleaning. Different pipeline stages, including efficient classification and metadata addition, are essential to ensure that data can be properly cleaned and de-duplicated. Introducing semi-structured or structured data to help with the final output then brings the challenge of context mismatch. Creating this context alignment across different systems requires a repeatable pipeline. Monitoring this as part of LLMOps gives technical leads a reasonable indication that the quality of data being fed to the LLM is good enough to generate an accurate and useful output.
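As a rough illustration of the preparation stages mentioned above, the following Python sketch classifies each document, attaches metadata and drops duplicates by hashing normalised text. The keyword-based `classify` function is a toy stand-in for a real classifier, and the field names and sample documents are assumptions made for the example.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def normalise(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def classify(text: str) -> str:
    """Toy keyword rule; a real pipeline would use a proper classifier."""
    lowered = text.lower()
    if "contract" in lowered:
        return "contract"
    if "regulation" in lowered:
        return "regulation"
    return "other"


def clean(docs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Add category and content-hash metadata, keeping the first copy of each text."""
    seen = set()
    cleaned = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate content from another source
        seen.add(digest)
        cleaned.append({**doc, "category": classify(doc["text"]), "content_hash": digest})
    return cleaned


# Hypothetical input: the same sentence arrives from two systems.
raw_docs = [
    {"source": "sharepoint", "text": "Contract renewal is due in March."},
    {"source": "email", "text": "contract   renewal is due in March. "},
    {"source": "feed", "text": "New regulation on data retention."},
]
cleaned_docs = clean(raw_docs)
```

Exact-hash matching only catches duplicates that differ in case or whitespace; near-duplicates would need a fuzzier technique, which is outside the scope of this sketch.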
Blending the data sources
So why aren’t people just combining unstructured data for LLMs with other types of data (e.g. structured and semi-structured)?
These are some of the specific challenges:
- Context mismatch
Since enterprise data stores have very specific schemas and terminology that LLMs are unaware of, getting this information into the context of LLMs becomes very hard.
LLMs lean towards a holistic view, whereas database queries often require hierarchical, compositional views. Managing similarity search within a hierarchical view becomes difficult, both in terms of system design and in sustaining this structure over time.
- Business data updates
In the event that a business process changes or a business event happens, pertinent data is typically kept in the enterprise data store. The LLM data pipeline needs to be supplied with this information. Currently, this is done in batches and the workflow needs to be modified to ensure that the LLM application isn’t serving up old results.
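On the context-mismatch point in the list above, one common mitigation is to spell out the schema and internal terminology in the text handed to the LLM. The Python sketch below shows the idea under simple assumptions; the table, column names and glossary entries are invented for illustration and do not come from any real system.

```python
from typing import Dict


def render_schema_context(
    table: str, columns: Dict[str, str], glossary: Dict[str, str]
) -> str:
    """Turn a table schema and business glossary into plain text for a prompt."""
    lines = [f"Table: {table}", "Columns:"]
    for name, description in columns.items():
        lines.append(f"- {name}: {description}")
    lines.append("Glossary:")
    for term, meaning in glossary.items():
        lines.append(f"- {term}: {meaning}")
    return "\n".join(lines)


# Hypothetical enterprise schema with terse column names.
context = render_schema_context(
    table="orders",
    columns={
        "ord_sts": "order status code",
        "dlv_eta": "estimated delivery date",
    },
    glossary={"OTIF": "on time, in full"},
)

# The rendered context is placed ahead of the user's question.
prompt = f"{context}\n\nQuestion: Which orders are late?"
```

Keeping this rendering step inside the pipeline, rather than hand-writing it per prompt, means the description can be regenerated whenever the schema or terminology changes.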
The enterprise’s biggest challenge is taking these models from the ideation and pilot stage to full production, even though the exploration of LLMs in different use cases has shown promise. To advance enterprise LLM applications to the next level and beyond, it will be essential to modify the data pipeline’s plumbing to integrate batch and streaming workflows.