We are in the midst of an unstructured data revolution – with 80% of data likely to be unstructured by 2025, according to IDC. The rise of advanced analytics and in particular artificial intelligence/machine learning (AI/ML) has allowed intelligence to be gained from the vast reservoir of previously opaque free text, images, voice recordings, and so on, held in large volumes by organisations.
Of course, some structure is present in the metadata that accompanies any file. But with AI/ML processing, additional structured fields of data can be derived from a dataset.
So, for example, an organisation might run AI/ML against a set of apartment rental ad images and seek to map visual patterns to the number of clicks. That entails analysis of unstructured image data to create structured fields that can be joined to existing data for analysis and, ultimately, to drive editorial decision-making.
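The pattern described above — deriving structured fields from images, then joining them to existing click data — can be sketched as follows. This is a minimal illustration, not a real pipeline: the ad IDs, field names and values are invented, and in practice the image features would come from an ML model rather than being hard-coded.

```python
import sqlite3

# Hypothetical structured fields an image model might emit per ad photo.
# In a real pipeline these would be produced by an AI/ML inference step.
derived_fields = [
    ("ad-101", "kitchen", 1),   # (ad_id, dominant_room, natural_light)
    ("ad-102", "bedroom", 0),
    ("ad-103", "kitchen", 0),
]

# Existing structured data: click counts per ad.
clicks = [("ad-101", 540), ("ad-102", 210), ("ad-103", 330)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE image_features "
            "(ad_id TEXT, dominant_room TEXT, natural_light INTEGER)")
con.execute("CREATE TABLE ad_clicks (ad_id TEXT, clicks INTEGER)")
con.executemany("INSERT INTO image_features VALUES (?, ?, ?)", derived_fields)
con.executemany("INSERT INTO ad_clicks VALUES (?, ?)", clicks)

# Join the derived fields to click data and aggregate by visual pattern.
rows = con.execute("""
    SELECT f.dominant_room, AVG(c.clicks)
    FROM image_features f JOIN ad_clicks c ON f.ad_id = c.ad_id
    GROUP BY f.dominant_room
    ORDER BY f.dominant_room
""").fetchall()
print(rows)  # [('bedroom', 210.0), ('kitchen', 435.0)]
```

Once the unstructured images have been reduced to rows like these, they are subject to ordinary relational analysis.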
In this article, we will look at some of the ways unstructured data can be accessed as if it were structured, or given structure, by tools such as S3 Select, Snowflake and Google BigQuery, and by database products such as MongoDB.
SQL: a structured recap
First, we should remind ourselves what structured data looks like. It typically lives in SQL databases, which are configured with a table-based schema and hold data in rows and columns.
This pre-set format allows for extremely rapid querying, with strong guarantees of transaction integrity. SQL databases therefore remain at the heart of the most performant and mission-critical applications in use.
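Transaction integrity is worth a concrete illustration. The sketch below uses Python's built-in SQLite engine and an invented two-account transfer: if anything fails mid-transaction, a rollback restores the pre-transaction state, which is the guarantee relational databases are trusted for.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
con.commit()

# A transfer either completes entirely or not at all.
try:
    con.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
    # Simulate a failure before the matching credit is applied.
    raise RuntimeError("simulated crash before crediting bob")
except RuntimeError:
    con.rollback()  # the debit above is undone

balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- unchanged
```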
For data to exist in a structured format, it will most likely have been created that way by machines, or with the help of humans in fairly rigid and machine-dependent processes.
Here we look at ways in which unstructured data is incorporated into environments that allow for more structured approaches to analysis.
S3 Select – S3 meets SQL
Object storage is perhaps as unstructured as you can get. For a start, it lacks the rigid hierarchy of file system storage. Everything is held in a flat structure with a unique identifier to locate it. And an object could be of almost any type of data.
S3 object storage from AWS also typifies the affinity of object storage and the cloud era we are in.
But it is possible to filter S3 data using SQL query statements to select the subset of data you want from very large object stores. S3 Select can query objects in CSV, JSON and Apache Parquet formats, returning results as CSV or JSON, and queries can be run from the AWS console, the command line or via application programming interfaces (APIs).
Key use cases are those where you want to select data from S3 for analysis on faster, local compute, while cutting down on the larger egress charges that may come from downloading a bigger dataset.
But it’s not really a database. And with data coming in formats such as CSV and JSON, it will need some coding to wrangle it into your analysis tool of choice.
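What S3 Select does server-side can be sketched locally: evaluate a SQL-style predicate over CSV records so only the matching subset crosses the network. The bucket, key and column names in the commented-out API call below are hypothetical, and the real call requires boto3 and AWS credentials; the runnable part just emulates the filtering step.

```python
import csv
import io

# The real request shape, via boto3 (hypothetical bucket/key, needs credentials):
#   s3.select_object_content(
#       Bucket="my-bucket", Key="listings.csv", ExpressionType="SQL",
#       Expression="SELECT s.city, s.rent FROM s3object s WHERE s.rent < '1500'",
#       InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
#       OutputSerialization={"CSV": {}})

# Local emulation of the same filter over a small CSV dataset.
raw = "city,rent\nLondon,1800\nLeeds,950\nYork,1200\n"
reader = csv.DictReader(io.StringIO(raw))
matches = [row for row in reader if int(row["rent"]) < 1500]
print(matches)  # only the sub-1500 rows survive the filter
```

The results still come back as CSV or JSON records, which is the wrangling step the paragraph above refers to.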
Snowflake: Adding order with directory tables
Snowflake’s approach is to use analytics and AI/ML to derive structure from elements within unstructured data.
You can use APIs to store data in a Snowflake “stage” from external clouds (AWS, Azure, GCP) or bring your own S3 bucket containing unstructured files and mount it to Snowflake. Then you can use URL-based REST APIs to access the data and catalogue it using “directory tables”.
Directory tables allow data to be searched, sorted, and so on, via a file catalogue built out of file/object metadata and contents, where possible, and organised into a structured format. It is then possible to add tags or calculate columns and join them to the directory table.
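A directory table is, in essence, a queryable catalogue of file metadata with extra columns joined on. The sketch below builds such a catalogue locally with SQLite over a temporary directory of invented files; Snowflake constructs the equivalent over stage files automatically, so this is an analogy rather than Snowflake's own mechanism.

```python
import os
import sqlite3
import tempfile

# Create a throwaway "stage" containing two small files.
stage = tempfile.mkdtemp()
for name, body in [("invoice.pdf", b"%PDF..."), ("photo.jpg", b"\xff\xd8...")]:
    with open(os.path.join(stage, name), "wb") as f:
        f.write(body)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE directory_table (relative_path TEXT, size INTEGER)")
for entry in os.scandir(stage):
    con.execute("INSERT INTO directory_table VALUES (?, ?)",
                (entry.name, entry.stat().st_size))

# Tags live in a separate table joined to the catalogue, as the text describes.
con.execute("CREATE TABLE tags (relative_path TEXT, tag TEXT)")
con.execute("INSERT INTO tags VALUES ('invoice.pdf', 'finance')")

rows = con.execute("""
    SELECT d.relative_path, d.size, t.tag
    FROM directory_table d JOIN tags t ON d.relative_path = t.relative_path
""").fetchall()
print(rows)  # the tagged file, with its size from the metadata catalogue
```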
Analytics can then be run against those files to derive structure from the unstructured data, via user-written functions in Java, for example, with access control and sharing also possible.
Workloads targeted by Snowflake include optical character recognition of documents, extraction of data from PDFs, AI/ML processing of files – with signature capture and product image analysis in retail and sentiment analytics on call centre recordings as examples.
Snowflake can also support semi-structured data types such as JSON, Avro and XML alongside relational data by flattening nested formats into relational tables and then querying using SQL.
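The flattening step described above — turning a nested document into relational rows — looks like this in miniature. The order document is invented; Snowflake performs the equivalent with its own SQL functions over JSON, Avro and XML.

```python
import json

# A nested JSON document: one order containing a list of line items.
doc = json.loads("""
{"order_id": 42, "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}]}
""")

# Flatten: repeat the parent key alongside each nested element,
# yielding rows a SQL engine can query and join.
rows = [(doc["order_id"], item["sku"], item["qty"]) for item in doc["items"]]
print(rows)  # [(42, 'A1', 2), (42, 'B7', 1)]
```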
Google BigQuery: AI/ML builds object tables
Google’s BigQuery data warehouse service allows for unstructured data to be queried and joined to structured data via its Object Tables functionality.
Object Tables provides a structured record interface for unstructured data stored in Google Cloud Storage. Here you can run analytics and machine learning on images, audio, documents, and so on, using SQL and remote functions in BigQuery, with security, sharing and governance built in.
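An analogous pattern to BigQuery's remote functions — calling out to code from within SQL, row by row — can be shown locally by registering a Python function with SQLite. The `label_image` function here is a hypothetical stand-in for a real ML model, and the `gs://` URIs are invented; this is not BigQuery's own mechanism, just the same shape of query.

```python
import sqlite3

def label_image(uri):
    # Stand-in for an ML inference call; a real model would inspect the object.
    return "floorplan" if uri.endswith(".png") else "photo"

con = sqlite3.connect(":memory:")
con.create_function("label_image", 1, label_image)

# A miniature "object table": one record per stored object.
con.execute("CREATE TABLE object_table (uri TEXT, size INTEGER)")
con.executemany("INSERT INTO object_table VALUES (?, ?)",
                [("gs://bucket/flat1.png", 20480),
                 ("gs://bucket/flat1.jpg", 55296)])

# The function is applied per row from inside SQL, producing structured labels.
labels = con.execute(
    "SELECT uri, label_image(uri) FROM object_table ORDER BY uri").fetchall()
print(labels)
```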
BigQuery supports a standard SQL dialect that is ANSI-compliant. To apply AI/ML to unstructured data and then create structured records from its findings, Google announced AI Vision at its recent Next event, with speech recognition, visual analysis, translation, and more.
Microsoft offers similar functionality in its Azure Synapse products.
MongoDB: Schema on read for JSON documents
In MongoDB – a so-called NoSQL database – text files and other unstructured assets are stored as JSON-formatted documents.
In this way, no rigid structure is applied during the initial save procedure. That leaves the data virtually untouched, so it can be structured to suit the way the customer chooses to access it, without changing the underlying JSON. That means it is all available for future operations that may be different to current needs.
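That “structure at read time” idea can be sketched with plain Python and the standard `json` module — the documents below are invented, and a real deployment would use MongoDB via a driver such as PyMongo, but the principle is the same: heterogeneous documents are stored as-is, and shape is imposed only when a particular access pattern needs it.

```python
import json

# A schema-less "collection": documents of different shapes stored untouched.
collection = [
    json.loads('{"title": "Flat in Leeds", "rent": 950}'),
    json.loads('{"title": "Voice note", "transcript": "call me back", '
               '"duration_s": 12}'),
]

# One access pattern among many: treat rental listings as structured records.
# A future workload could project the same documents entirely differently.
listings = [{"title": d["title"], "rent": d["rent"]}
            for d in collection if "rent" in d]
print(listings)  # [{'title': 'Flat in Leeds', 'rent': 950}]
```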
Images, audio files, movies, and so on, can be stored in MongoDB using the GridFS specification, which stores files within MongoDB collections, along with their metadata.
Read more on unstructured data
- Unstructured vs semi-structured data: Order from chaos. We look at alternatives to relational databases that have emerged to help bring some structure to unstructured data and gain valuable insight by making it semi-structured.
- Is object storage good for databases? We look at object storage vs block access SAN storage and ask if object storage can be used for database workloads, or is it just good for bulk storage of analytics datasets?