ABBYY, IBM & Red Hat announce DocLang, open source universal document format
ABBYY (hereafter written as Abbyy) used its annual event for users, practitioners, partners and customers this year to lay down a weighty open source development.
The company’s Abbyy Ascend convention saw Abbyy – along with partners IBM and Red Hat – announce the formation of the DocLang working group under the Linux Foundation’s LF AI & Data Foundation.
The development of DocLang also brings the creation of a universal AI document format.
Always regarded as a worthy sign of accreditation and technology validation, being part of the LF AI & Data Foundation means this technology will benefit from access to a collaborative ecosystem, neutral governance, technical resources and global networking to accelerate open-source AI innovation.
The founding members note that they have designed the DocLang AI-native standard in order to “revolutionise enterprise document processing” (no less), by providing a unified, AI-readable format to represent documents for language model and agentic AI consumption.
PDFs & JPEGs in AI era
Why the move now? Well, many of us will have struggled with PDFs at one level or another over the years (attempting to open them on devices without a supporting app before the Chrome era, attempting to extract information from them without formatting corruptions, attempting to fill in PDFs with personal data and successfully save or export them afterwards)… and also experienced some exasperations with JPEGs.
All of which leads us (hopefully) to agree that these unstructured formats were designed for human consumption, not AI interpretation… and this has created a “fundamental disconnect” between enterprise data and AI systems intended to process them.
According to Abbyy and its partners here, a standard document structure for AI is needed to address the cacophony of digital document formats that enterprises operate on, such as PDFs and JPEGs.
The companies say that DocLang addresses these critical gaps by creating a reliable abstraction layer between unstructured documents and intelligent AI systems.
Semantic meaning & geometric layout
The standard explicitly preserves both semantic meaning and geometric layout in a single AI-native format and encodes structural elements like headings, paragraphs and tables alongside their exact position on the page.
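The announcement does not publish a DocLang schema, so the following Python is purely an illustrative sketch of the general idea of storing a structural role next to page geometry. Every name here (`BBox`, `Element`, `reading_order`, `outline`) is hypothetical and not taken from the DocLang specification, and the coordinate convention (origin at the top-left, y growing downwards) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box; assumes a top-left origin, y grows downwards."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Element:
    """Hypothetical element: a semantic role plus where it sits on the page."""
    kind: str   # e.g. "heading", "paragraph", "table"
    text: str
    bbox: BBox


def reading_order(elements: List[Element]) -> List[Element]:
    # Sort by page, then top-to-bottom, then left-to-right.
    return sorted(elements, key=lambda e: (e.bbox.page, e.bbox.y0, e.bbox.x0))


def outline(elements: List[Element]) -> List[str]:
    # The semantic role lets a consumer pull out headings without guessing
    # from font size or position.
    return [e.text for e in reading_order(elements) if e.kind == "heading"]


# Elements deliberately listed out of order to show geometry-based sorting.
doc = [
    Element("paragraph", "Payment is due in 30 days.", BBox(1, 72, 140, 540, 170)),
    Element("heading", "Invoice terms", BBox(1, 72, 100, 540, 125)),
    Element("heading", "Contacts", BBox(2, 72, 80, 540, 105)),
]
```

Calling `outline(doc)` on this toy document returns the two headings in page order, which hints at why carrying both meaning and layout in one record could be useful to a downstream model or agent.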
“DocLang is specifically engineered to address industry challenges with a minimal, standardised, and AI-native method for representing document structure, meaning, layout, and governance,” commented Maxime Vermeir, vice president, AI strategy at Abbyy. “Being designed for efficient machine processing provides a predictable structure optimised for modern AI tokenisation and modelling techniques. Organisations will see a significant difference with more reliable interpretation, reduced hallucinations, and lower computational costs.”
DocLang embeds enforceable governance controls directly into the document. Downstream systems can automatically enforce compliance rules regarding privacy limits, extraction scopes, and model training permissions.
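How those controls are actually encoded has not been detailed in the announcement. As a rough sketch only, a downstream system might check a policy carried with the document before acting on it. The `Governance` class, its field names and the `may_use` function below are all hypothetical illustrations, not part of DocLang.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Governance:
    """Hypothetical policy carried with a document (not the DocLang schema)."""
    allow_training: bool                  # may the content be used to train models?
    allowed_extraction: FrozenSet[str]    # field names that may be extracted


def may_use(policy: Governance, purpose: str, fields: Iterable[str] = ()) -> bool:
    """Return True only if the requested use stays inside the policy."""
    if purpose == "training":
        return policy.allow_training
    if purpose == "extraction":
        # Every requested field must be inside the permitted extraction scope.
        return set(fields) <= policy.allowed_extraction
    # Unknown purposes are denied by default.
    return False


policy = Governance(
    allow_training=False,
    allowed_extraction=frozenset({"invoice_total", "due_date"}),
)
```

With this toy policy, a request to extract `invoice_total` would pass, while a request to extract a field outside the scope, or to use the document for model training, would be refused. Denying unknown purposes by default is a design choice made for this sketch.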
Linux Foundation affirmation
LF AI and Data’s mission is to build and support an open artificial intelligence (AI) and data community and drive open-source innovation in the AI and data domains by enabling collaboration and the creation of new opportunities for all members of the community.
Mark Collier, GM of AI & Infrastructure (and ED of LF AI & Data and PyTorch), spoke to the Computer Weekly Developer Network on this news.
“Standards matter most when the technology landscape is moving fastest,” said Collier. “They create the common language that allows innovation to scale without increasing fragmentation. Open, vendor-neutral specifications are the backbone of Kubernetes, cloud-native platforms, AI systems, and increasingly complex data workflows; they have enabled a high-value ecosystem while supporting interoperability, performance, and security.”
Collier says that efforts like DocLang are important because they help bring structure, interoperability, and trust to a part of the stack that has become critical for AI, but remains highly inconsistent across tools and environments.
The DocLang founding members invite other technology providers and enterprise organisations to join the DocLang working group.
DocLang in motion, ABBYY FineReader
Abbyy used Ascend 2026 to give more detail on the DocLang standard and explain how it contributes to the future of reliable AI data pipelines; the company also demonstrated ABBYY FineReader beta with DocLang.
ABBYY FineReader beta is a pre-release version of the AI-powered OCR software, allowing users to test upcoming document conversion, PDF editing and enhanced text recognition features before the stable launch.
Image credit: ABBYY

