Cowboy 'wranglers' & (big) data preparation

So let’s get this straight from the start: you enjoy tracking the rise of big data and the analytics that we now apply to it to derive new insights in everything from retail to the Internet of Things – but you’re not familiar with the term data preparation?

It’s a crying shame, but this piece of terminology does not get the kudos it deserves.

Data preparation is sometimes called data pre-processing. Still no clues?

It is the manipulation and transformation of data from its raw state into a form suitable for analysis and processing.
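For readers who prefer to see that in code, here is a minimal sketch of what 'raw to analysis-ready' can look like in plain Python. It is illustrative only: the records, the field names and the prepare() helper are all invented for this example, and it does not represent how any particular data preparation product works.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_rows = [
    {"customer": "  Alice ", "spend": "120.50", "region": "north"},
    {"customer": "BOB", "spend": "n/a", "region": "North"},
    {"customer": "  Alice ", "spend": "120.50", "region": "north"},  # duplicate row
    {"customer": "carol", "spend": "75", "region": ""},
]


def prepare(rows):
    """Cleanse, transform and de-duplicate raw rows into analysis-ready records."""
    prepared = []
    seen = set()
    for row in rows:
        # Cleanse: trim stray whitespace and normalise capitalisation.
        name = row["customer"].strip().title()
        region = row["region"].strip().lower() or "unknown"
        # Transform: turn the text amount into a number where possible.
        try:
            spend = float(row["spend"])
        except ValueError:
            spend = None  # keep the row, but mark the unusable value as missing
        # De-duplicate on the cleaned values rather than the raw text.
        key = (name, region, spend)
        if key in seen:
            continue
        seen.add(key)
        prepared.append({"customer": name, "region": region, "spend": spend})
    return prepared


clean_rows = prepare(raw_rows)
for record in clean_rows:
    print(record)
```

Even in a toy like this, somebody has to decide what counts as a duplicate and what to do with a value such as "n/a", and those are judgment calls rather than mechanical steps.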

Closely connected to (and often found within) the field of data mining, data preparation happens because its processes CANNOT be completely automated – hence its very existence.

“The key ingredient of data preparation platforms is their ability to provide self-service capabilities that allow knowledgeable users, who are not IT experts, to combine, transform and cleanse relevant data prior to analysis,” said Philip Howard, research director for data management at Bloor Research.


Howard explains that data preparation is provided by a field of vendors that includes veterans and relatively new start-ups – and that a company called Paxata has attained the ‘champion’ position today.

Paxata itself offers a purpose-built Adaptive Data Preparation application and platform.

The four kinds of big data tools

1. Tools designed to be used by end users (such as dashboards)

2. Tools for data scientists and developers (such as big data analytics engines)

3. Tools for big data orchestration and management (such as those used by DBAs)

4. Tools for data ‘wrangling’ (such as data preparation tools)

NOTE: Wrangling here is meant in the cowboy horse-handling sense.

Paxata was developed from the ground up to be an enterprise-class data preparation tool set and is currently used by over 45 on-premises and cloud customers with stringent data quality and security requirements.

For further clarification:

• Adaptive, self-service data preparation solutions simplify, automate and reduce the manual steps of getting the data into a useable form. This is accomplished without risking loss of control over who uses the data, for what analytics, and how users prepare it for their own consumption.

• Self-service data preparation toolsets enable analysts within the business to collaborate and dynamically govern the data integration, data quality and enrichment processes at scale from their Hadoop-based data lake store.

• Self-service data preparation solutions can also offer a data library: a secure environment where business analysts and IT can share data sets with the business, and which can become the one-stop shop for all completed and in-process data prep projects.