Why businesses need to fix (document) data foundations

CTOs, CIOs, CEOs, CFOs, CISOs (insert C-suite enterprise management role here) and their software application development team leaders are now faced with reassessing their data foundations, largely (and this will come as no surprise) due to the increased focus on the data estates that serve the new breed of AI applications and services.

With AI-driven chatbots, co-pilots and extended intelligence services now penetrating every use case, it’s an imperative being played out across all verticals, including manufacturing; engineering & construction; energy; healthcare & life sciences; professional services (legal and other); banking and insurance; and more.

During this “gold rush” era of data-centricity, technology strategists are looking to use AI to streamline workflows (initially mostly human, but let’s include machine workflow and agentic input here too) and increase corporate productivity.

But… as the saying goes, you can’t put the cart before the horse. This whole process starts with an introspective analysis of where information provenance sits, how an organisation ingests its data and what steps it takes to make sure nobody deploys too fast (yes, Agile is still a thing, but let’s hang on a moment) before data quality has been assessed at a foundational level.

Passionate about this subject is Stéphane Donzé in his capacity as founder and CEO of AODocs, a company known for its capability to centralise process automation and document management at scale.

What AI gets… and doesn’t

Donzé reminds us that AI chatbots are remarkably good at using language models (large, small, other) and algorithmic logic to interpret context, meaning and nuance from data… but, crucially, they don’t inherently know the difference between good quality data and bad.

The term bad data is something of a misnomer and one we don’t use very often; clearly, it’s more useful to talk about stale, outdated, invalid, obsolete or corrupt data as opposed to the good data that we know we need. Because AI fed on poor data has the capability to introduce business risk, this is not what we want developers and data scientists to be working with.

Looking at where the problems come from, Donzé suggests that many of the core data issues faced by businesses today stem from the document layer itself, i.e. the place where the lion’s share of enterprise information is exchanged, stored, analysed and processed.

But (rather like climate change, going to the dentist or any other core responsibility that we know matters, but is all too easy to postpone), the suggestion here is that implementing new processes and tighter data controls often feels like work that can be put off… even when there is significant pressure from security and compliance leaders not to delay.

Don’t kick the compliance bucket

The AODocs approach calls for a different way of thinking about budgets and priorities so that – as Donzé puts it – document platforms, classification, version control and access governance do not sit in a separate, detached or disconnected compliance bucket that can be too easily poured out and discarded when budgets don’t initially allow.

When it comes to the strategy used to formulate a new AI initiative, Donzé says that document data compliance should be right up there with the choice of AI model, the selection of key data integration tools and the implementation of core change management capabilities.

This is a tough process, says Donzé: most companies can’t just plug their data estate into a new AI strategy with the confidence that everything is clean and shipshape at the back end. Because document data typically exists inside shared drives, on old file servers, across a disparate set of disconnected collaboration tools, on unmanaged or poorly serviced cloud storage repositories, or as plain and simple email attachments… pinning everything down and making sure the data team knows which version of a policy, contract or procedure is current is tough.

Clearly keen to proffer his company’s centralised process automation and document management skills, Donzé insists that content management itself can be a minefield unless firms take a formalised approach to the discipline.

“Content is often unclassified, duplicated and poorly labeled. Permissions grow over time but rarely shrink. Whole groups can see documents they should never see. Former employees still have access to shared folders long after they have left.

“When AI tools and agents are plugged into this environment, they do exactly what they are asked to do. They search everything. They surface whatever matches the prompt, whether or not it is accurate, current or appropriate for that user,” wrote Donzé, in a briefing document.
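To make that failure mode concrete, here is a minimal, purely illustrative sketch (it is not AODocs’ product or API; the Document fields such as allowed_groups, last_reviewed and is_latest_version are hypothetical) of what a governed retrieval step might look like, i.e. checking access, version and currency before anything reaches the prompt:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Document:
    doc_id: str
    title: str
    body: str
    allowed_groups: set[str]      # hypothetical: who may see this document
    last_reviewed: datetime       # hypothetical: when the content was last confirmed current
    is_latest_version: bool       # hypothetical: superseded drafts should never be surfaced

def governed_search(query: str, user_groups: set[str],
                    corpus: list[Document],
                    max_age_days: int = 365) -> list[Document]:
    """Return only documents this user may see, that are the latest version
    and have been reviewed recently enough to be trusted."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    hits = []
    for doc in corpus:
        if not (doc.allowed_groups & user_groups):
            continue  # access governance: filter before retrieval, not after generation
        if not doc.is_latest_version or doc.last_reviewed < cutoff:
            continue  # version control: superseded or stale content never reaches the prompt
        if query.lower() in doc.body.lower():
            hits.append(doc)
    return hits
```

The ungoverned alternative, indexing every shared drive and filtering nothing, is exactly the behaviour Donzé describes above.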

Manifestation frustrations

How does bad document data manifest itself inside real-world enterprise operational workflows? It’s simple enough, says Donzé: it could be an erroneous answer in an AI (or other software service) chat window, it could be a poor instruction in a business intelligence platform, or it could be a serious financial miscalculation. All of these are really bad issues in construction, energy or manufacturing… but we might argue that they’re even worse when they happen in healthcare and life sciences.

Talking about the rise of AI agents and how they will be affected by these realities, Donzé says that AI is now amplifying the need to drill down on document governance, largely because business actions that used to be comparatively slow human-driven processes are now becoming instant, automated ones… so we need to get this stuff right.

“With AI in the mix, document governance debt is turning into concrete incidents. A sales team answers a customer using an outdated pricing sheet that the chatbot surfaced from an old folder. A support agent sees a confidential internal report that should have been restricted. A regulator asks how an automated decision was made and the organisation cannot reconstruct which document the system relied on.

“None of these situations are theoretical. They are the natural result of pointing powerful retrieval and generation tools at ungoverned content,” wrote Donzé, in a previous technical blog.
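The last of those incidents, the inability to reconstruct which document an automated decision relied on, is arguably the easiest to pre-empt: log the document identifiers and versions behind every generated answer. The following is a hedged sketch only (the function, field names and log format are illustrative, not any particular vendor’s API):

```python
import json
from datetime import datetime, timezone

def log_grounding(answer_id: str, user: str, question: str,
                  source_docs: list[dict],
                  logfile: str = "ai_grounding_log.jsonl") -> None:
    """Append one audit record per AI answer: who asked what, and which
    document IDs and versions were retrieved to produce the response."""
    record = {
        "answer_id": answer_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "question": question,
        "sources": [{"doc_id": d["doc_id"], "version": d["version"]} for d in source_docs],
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example: the pricing scenario above becomes reconstructable after the fact.
log_grounding(
    answer_id="ans-0042",
    user="sales.rep@example.com",
    question="What is the current list price for product X?",
    source_docs=[{"doc_id": "pricing-sheet", "version": "2024-Q3"}],
)
```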

He says that, going forward, businesses need to think about whether their new AI implementations are auditable, compliant, governed and, above all, safe… and that process of course starts with (document) data foundations.

How do we start then? Donzé recommends beginning by identifying the possibly quite small number of document sets that really matter to a business. Once we know where those are and – again, crucially – which users have access to them, we can start to think about how these contracts, operating procedures, key technical design specifications, pricing sheets and product documentation should be used through the new AI fabric that organisations seek to build.
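As a first pass, that identification exercise can be surprisingly mundane. Here is a purely illustrative sketch (the share paths and set names are hypothetical, and a real audit would query ACLs or the document platform’s own API rather than file ownership) of building a basic inventory of those document sets:

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical locations of the document sets that really matter to the business.
CRITICAL_SETS = {
    "contracts": Path("/shares/legal/contracts"),
    "operating_procedures": Path("/shares/ops/procedures"),
    "pricing_sheets": Path("/shares/sales/pricing"),
}

with open("document_inventory.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["document_set", "path", "owner_uid", "last_modified"])
    for set_name, root in CRITICAL_SETS.items():
        if not root.exists():
            continue  # skip sets that are not mounted on this machine
        for path in root.rglob("*"):
            if path.is_file():
                stat = path.stat()
                writer.writerow([
                    set_name,
                    str(path),
                    stat.st_uid,  # a stand-in for "who owns this"; real access rights need an ACL query
                    datetime.fromtimestamp(stat.st_mtime).isoformat(),
                ])
```

Even this trivial level of visibility, i.e. what exists, who nominally owns it and how stale it is, gives the governance conversation described below something concrete to work from.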

Sit down, let’s talk first

This process sounds tough, but Donzé recommends sitting down with the business and domain owners inside an organisation to assess, audit and then leverage the document data assets from which it can gain AI advantage.

“Once you know where the important documents are, you can start to bring basic order to them. That means clear ownership, so someone is on the hook for keeping a library current. It means simple, consistent names so people are not guessing which file to open. It means a decision about what happens to older versions so they stop floating around as if they were still valid. None of this requires new technology. It requires attention and follow-through,” explained Donzé.

A renewed focus on document data and how it is used inside AI tools may be a tangible trend for the months ahead. Perhaps the industry will even start talking about large document models… either way, let’s document this process from here on in.