Big data analytics made easy with SQL and MapReduce
With the growth of unstructured big data, RDBMS alone is inadequate for big data analytics. Learn how SQL and MapReduce can be used together for big data analytics instead.
In the vast universe of IT, data is broadly categorized as either structured or unstructured. Unstructured data is generated at a rate orders of magnitude higher than structured data, which poses major challenges for storage, processing, and analytics. Such large volumes of data collectively form what is known as big data, the handling of which is usually beyond the capability of traditional relational database management systems (RDBMS).
As RDBMS has been the preferred method for storing, warehousing, and analyzing structured data, the industry has matured mainly in the analysis of structured data. Ignoring unstructured data is inadvisable, however, as effective analytics is today a key business differentiator. Organizations are exploring all possible sources of data to develop intelligent big data analytics systems that can provide deeper insights for informed decision making.
Technologies such as Hadoop and specialized non-relational databases such as columnar databases, graph databases and document databases are being widely implemented to store and process unstructured big data for analytics. MapReduce is the distributed data processing and querying engine used to extract data from big datasets hosted on compute clusters in a typical Hadoop implementation. Structured Query Language (SQL) has been the de facto standard for querying data from RDBMS systems.
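To make the division of labor concrete, here is a minimal, single-process Python sketch of the MapReduce programming model using the classic word-count example. It is illustrative only: a real Hadoop job distributes the map, shuffle, and reduce phases across a compute cluster, whereas this sketch runs them in memory.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: aggregate all counts emitted for a single key
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key, then reduce each group
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data analytics", "big data platforms"])
print(counts["big"])  # 2
```

The same map/shuffle/reduce structure underlies distributed engines; the framework, not the programmer, handles partitioning the input and moving intermediate pairs between nodes.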
RDBMS systems generally hold data spanning terabytes in typical warehousing environments. But when systems hosting unstructured data come into the picture, data sizes can scale to hundreds of petabytes, thus qualifying as big data. Currently, systems holding structured and unstructured data operate in mutually exclusive silos, without any interoperability. Organizations need to explore and exploit the intelligence hidden in unstructured big data with suitable big data analytics. Nevertheless, dependence on RDBMS for existing lines of business applications will continue. The challenge is to implement a big data analytics solution that can analyze structured as well as unstructured big data through a common interface.
Integrated heterogeneous data processing using SQL and MapReduce in parallel
In a consolidated view of big organizational data, relational data carries no more weight than a modest-sized source system when compared with unstructured data. To address this growing need, vendors are enhancing MapReduce data processing engines with interface extensions that access structured data from relational databases using SQL. RDBMS vendors, in turn, are rolling out drivers for interoperability with Hadoop environments, bridging the connectivity between MapReduce and SQL.
Due to conscious interoperability efforts made by MapReduce/Hadoop vendors as well as RDBMS vendors, big data analytics over heterogeneous data is becoming a reality. Greenplum MapReduce is one example of the potential of big data analytics. A data flow engine driven by Greenplum’s MapReduce can query big sets of unstructured data (petabyte-scale) using parallel computing, as well as query structured data from relational databases through JDBC / ODBC drivers. In principle, then, any relational database that supports JDBC / ODBC can be queried using this parallel data flow engine.
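The combination can be sketched in miniature: a map phase over raw, unstructured input feeding into the same per-key view as an SQL aggregate pulled from a relational source. This is not Greenplum's API; it is a hedged illustration using Python's built-in sqlite3 module as a stand-in for a JDBC/ODBC-connected RDBMS, with a hypothetical log format on the unstructured side.

```python
import sqlite3
from collections import defaultdict

# Stand-in relational source (a production engine would reach the RDBMS
# over JDBC/ODBC; sqlite3 in memory is used here purely for illustration)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("soap", 120), ("soap", 80), ("shampoo", 50)])

# Unstructured side: raw log lines mentioning products (hypothetical format)
log_lines = ["view soap", "view shampoo", "view soap"]

def map_logs(lines):
    # Map phase over the unstructured input: emit (product, 1) per mention
    for line in lines:
        yield (line.split()[-1], 1)

# Merge the SQL aggregate with the log-derived counts, keyed by product
combined = defaultdict(dict)
for product, total in db.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product"):
    combined[product]["revenue"] = total
views = defaultdict(int)
for product, one in map_logs(log_lines):
    views[product] += one
for product, n in views.items():
    combined[product]["views"] = n

print(combined["soap"])  # {'revenue': 200, 'views': 2}
```

A parallel data flow engine applies the same idea at scale: the SQL side is pushed down to the database, while the map phase fans out across the cluster, and the results meet in a common keyed view.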
Applications of analytics over big heterogeneous data
As the nature of data in structured and unstructured formats is different, the kind of analysis that can be done over this data is also different. With integrated big data analytics capable of sourcing data from any big data sources, applications can extract the deepest level of intelligence from the entire organizational data.
For example, consider a publicly listed manufacturing company in the FMCG segment. In the regular course of business, it would generate big data from procurement of commodities; product manufacturing; brand promotion and product marketing; direct and indirect sales; customer care; sales support centers, and so on. Sales, procurements, stock inventory, incidents, and service requests would all be in structured data formats. Quotation negotiations, detailed readings from manufacturing instruments, online logs of user clickstreams on product advertisements, feedback and grievances recorded with support centers, and daily stock feeds from stock exchanges would all be stored in unstructured formats. Together, all this amounts to big data and requires big data analytics techniques.
Pattern recognition and gap analysis are the immediate value additions that big data analytics can extract from heterogeneous data. With all these data sources analyzable through a single big data analytics solution, complex data analytics becomes possible. For instance, one can trace how a particular resource has a cascading impact on manufacturing performance, sales performance, customer reactions, CRM costs, and finally fluctuations in stock value. Such pattern recognition using time series analysis can help identify gaps in processes. Direct and indirect associations among the key parameters of different business operations can likewise be analyzed once all the sources that make up the organizational data are accessible to big data analytics. SQL and MapReduce are likely to become the preferred querying mechanism for such big data analytics.
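One elementary building block for the association analysis described above is correlating time series drawn from different sources. The sketch below uses entirely hypothetical weekly figures: one series standing in for instrument-log-derived downtime (unstructured side) and one for shipment delays from an order RDBMS (structured side), with a plain-Python Pearson correlation.

```python
# Hypothetical weekly series, each derived from a different kind of source
machine_downtime = [1, 4, 2, 6, 3]   # hours, from instrument logs (unstructured)
late_shipments = [2, 7, 3, 10, 5]    # count, from the order database (structured)

def pearson(xs, ys):
    # Pearson correlation: linear association between two equal-length series
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(machine_downtime, late_shipments)
print(r)  # strongly positive, close to 1 for these illustrative numbers
```

In practice such correlations would be computed across many parameter pairs and lags, which is exactly the kind of workload that parallelizes naturally under MapReduce, with each pair reduced independently.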
About the Author: Siddharth Mehta works as an associate manager and a technical architect for BI software projects at Accenture Services. He is a recipient of Microsoft’s Most Valuable Professional award, and has written extensively on Microsoft BI software on his blog. Prior to Accenture, Mehta was with Capgemini.