Date: 2022-09-02

Structured vs unstructured data – it’s a common way of categorising things. But it’s not quite that simple.

Although structured data is easy to grasp, the world of unstructured data, and its transformation into more easily understood, usable and analysable semi-structured data, is less simple.

In this article, we look at structured data, unstructured data, and how semi-structured data brings some order from potential chaos, along with benefits for organisations that want to gain value from often very large stores of documents, images, sound files, video, social media posts, and so on.

Structured data has... structure

Business information is mostly generated by systems or people. Data from systems is the more likely to be structured. In its traditional form, this is typified by data in relational databases that use SQL (structured query language).

In these, structure is everything. Columns that represent variables are set up in advance and populated by rows of data, with a value sitting at the intersection of each row and column. It’s something we can all visualise. It’s like what we see in a spreadsheet – though whether spreadsheets are structured data is up for debate – but complex SQL database schemas involve the equivalent of numerous spreadsheets (tables, in database-speak) that relate (whence “relational”) to each other and can be filtered, joined and manipulated in many ways because they have common elements (keys).

Despite the prevalence of unstructured data and the rise of formats that are better described as semi-structured, structured databases are important and won’t go away soon. They are easy to use, by everything from large-scale enterprise applications to machine learning tools, but can be limited in how they are accessed and used, and can be relatively onerous to maintain and to change once initially configured.
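To make the idea of tables related by keys more concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table names, column names and sample rows are hypothetical and chosen purely for illustration; they are not taken from any particular system.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: columns are defined in advance, and the two
# tables relate to each other through the shared key customer_id.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")

# Rows populate the predefined columns (sample data is made up).
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ben")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)])

# Join the tables on the common key and total the orders per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
""").fetchall()

print(rows)
```

The join only works because both tables carry the same key, which is also why changing a schema after the fact can be onerous: every table and query that depends on that key may need to change with it.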

The mass of unstructured data

Unstructured data is often generated by people – although not solely – and includes media such as images and sound recordings, social media posts, agent notes, websites and emails. It holds to no predefined data model, and files and objects come in a wide range of sizes, from a few kilobytes for a social media post to potentially terabytes for uncompressed video footage.

Estimates often suggest that the vast bulk of data is unstructured – up to 80% or 90% of data held by organisations. If that is the case – and we can safely assume it often is – then this presents huge challenges for organisations. Unstructured data is, to a greater or lesser extent, undefined and opaque to search and classification. That means organisations may not know what is actually there, and that can be a security and compliance risk. At the same time, it means missing out on opportunities to interrogate that data to gain insights and value from it.

No such thing as unstructured data?

But in fact, it is arguable that no data is truly unstructured. The most unstructured data you can think of – image and sound files, for example – comes with metadata headers that provide high-level information on file contents that can be searched and queried.

And it is increasingly possible to use artificial intelligence/machine learning techniques to examine and categorise the contents of such files, including sound and video. YouTube does this to ensure copyright on music is not contravened when you upload a video, for instance. So these types of data can be tagged with new metadata derived from algorithm-based interrogation, should an organisation wish to throw compute at it.
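As a small illustration of the metadata headers mentioned above, the sketch below uses Python's standard wave module to read the header of a WAV sound file. To stay self-contained it first generates a short silent clip in memory; the helper names make_silent_wav and describe_wav are hypothetical, and a real pipeline would more likely read existing files and handle many more formats.

```python
import io
import wave


def make_silent_wav(seconds=1, rate=8000):
    """Build a mono, 16-bit silent WAV clip in memory (illustrative data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 2 bytes per sample = 16-bit audio
        w.setframerate(rate)   # samples per second
        w.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf


def describe_wav(fileobj):
    """Return header-level metadata without interpreting the audio itself."""
    with wave.open(fileobj, "rb") as r:
        frames = r.getnframes()
        rate = r.getframerate()
        return {
            "channels": r.getnchannels(),
            "sample_width_bytes": r.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }


info = describe_wav(make_silent_wav(seconds=2))
print(info)
```

Even though the audio payload is opaque, the header yields fields that could be indexed, searched and filtered. Tagging what a recording actually contains would need the heavier machine learning analysis the article describes, which this sketch does not attempt.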