Big data has been a hot buzzword for the past year or two. It refers to high volumes of data being analysed, often in near real time, to produce actionable results.
What makes big data different from good old data mining is the huge amount of data involved – often petabytes – together with a high rate of change in the dataset and the wide variety of data types analysed.
Typically, big data sets include unstructured data such as emails, log files and social media activity. For that reason, big data goes beyond simply accessing the data held in a company’s databases through data warehousing. Through big data analytics, businesses hope to obtain insight by cross-referencing multiple data points.
For businesses, accommodating large volumes of data safely is a challenge in its own right. Running very rapid analytics on those datasets brings further challenges, so new approaches to storing and analysing data have recently emerged.
Big data storage
In addition to the very large volumes of data to be stored, big data applications also demand real-time or near real-time responses. Big data storage therefore needs high input/output operations per second (IOPS) performance.
While big data pioneers such as Google, Facebook and Amazon employ hyperscale computing environments, there are more modest options for smaller businesses that want to take advantage of big data.
Unlike traditional data warehousing, which mines relatively homogeneous datasets, contemporary web analytics demands low-latency access to very large numbers of small files. Scale-out storage, which consists of a number of compute and storage elements to which capacity and performance can be added incrementally, is therefore among the most appropriate choices.
When it comes to scalable analytics platforms, Hadoop is one of the most commonly used open-source platforms.
In Hadoop, data is broken into smaller blocks and processed across nodes within the cluster. Its scale-out architecture makes it possible to run data analysis on a massive scale across many low-cost physical servers.
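The idea of splitting data into blocks, processing each block independently and then combining the results can be illustrated with a small sketch. This is not Hadoop code: the function names and the sample log lines are hypothetical, and local threads merely stand in for cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Split a list of records into consecutive blocks of at most block_size."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def count_words(block):
    """The 'map' step: a worker counts words in its own block only."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """The 'reduce' step: combine the per-block counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def distributed_word_count(records, block_size=2, workers=3):
    blocks = split_into_blocks(records, block_size)
    # Threads stand in for cluster nodes in this toy example.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, blocks))
    return merge_counts(partials)


# Hypothetical sample data, invented for illustration.
LOG_LINES = [
    "error disk full",
    "info job started",
    "error network timeout",
    "info job finished",
    "error disk full",
]

if __name__ == "__main__":
    print(distributed_word_count(LOG_LINES))
```

Because each block is counted without reference to any other, adding more workers (or, in a real cluster, more servers) increases throughput without changing the result.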
Hadoop storage needs to support processing with the minimum latency possible. To do that, the Hadoop Distributed File System (HDFS) is used as a storage layer. Although its name suggests otherwise, this is not a traditional file system but instead can be thought of as a data store.
HDFS also has built-in fault tolerance. Data is divided into blocks, and by default each block is stored as three copies on different nodes in the Hadoop cluster to provide resiliency. Duplicating data in this way to protect against server failure in huge clusters is one of the key attributes of HDFS.
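To see why keeping three copies of each block protects against server failure, consider the following simplified sketch. It assumes a plain round-robin placement across five hypothetical nodes; real HDFS uses its own placement policy, so the functions and node names here are illustrative only.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplifying assumption: real HDFS placement also considers racks.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live replica."""
    failed = set(failed_nodes)
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


# Hypothetical five-node cluster holding six blocks.
NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_replicas(num_blocks=6, nodes=NODES)

if __name__ == "__main__":
    # Any two simultaneous failures leave every block readable.
    print(readable_blocks(placement, ["node-a", "node-b"]))
    # Losing all three holders of a block makes that block unavailable.
    print(readable_blocks(placement, ["node-a", "node-b", "node-c"]))
```

With three replicas, a block becomes unavailable only if every node holding a copy fails at the same time, which is why any two failures in this sketch cost no data.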
Big data in Turkey
Finding the right experts to help carry out analytics projects is a key challenge for big data in any part of the world, but more so in Turkey. In addition, because Turkey lags in implementing cloud architecture and related innovations, big data analytics has not really taken off to any significant extent.
Even in the telecommunication and finance sectors, where state-of-the-art IT adoption in Turkey is at its most rapid, big data applications are yet to be implemented. That said, many institutions have deployed advanced enterprise data warehousing due to regulations that enforce detailed reporting. Fast-growing volumes of data and competitive business environments are expected to result in big data applications being in more demand soon.
In Turkey's limited market for big data tools, Apache Hadoop is the leading product, with Cloudera and Hortonworks its closest competitors. Oracle is also striving to increase its share with its Big Data Appliance product.
Etiya introduces social media analytics with Turkish content
Etiya was founded in 2004 in Istanbul and is well known as a CRM software provider, but it is also a pioneer in data warehousing and big data analytics in Turkey.
Etiya’s Somemto social media management tool is, according to Abdulkerim Mizrak, data warehouse business intelligence manager at Etiya, the first big data analytics project to be developed in Turkey.
Somemto has grown to become the pre-eminent data warehouse of social media content in Turkey. The country’s citizens are big users of Facebook and Twitter, and information collected through social media analytics is highly desirable to many companies.
Etiya retrieves information for its customers through semantic big data analytics on social media. The company uses Cloudera Hadoop, with HDFS for data storage.
Since partnering with Oracle two years ago, it has used an Oracle Big Data Appliance X4-2, a high-performance platform that can run diverse workloads on Hadoop and NoSQL systems.
Mizrak says the fact that Hadoop is an open-source product was an important factor in the company's decision.
He says: “HDFS enables ordinary machines and disks to be used in big data analysis. Because Hadoop doesn’t require any particular machine and is able to operate on commodity hardware, it was a choice by default. On top of that, at the moment, there is no solution with lower costs in the market.”
To achieve high performance, says Mizrak, it is important to use a variety of complementary technologies that can easily be integrated into the Hadoop ecosystem.
In addition to using Hadoop for big data analytics, Etiya also uses Hive, Impala, Elasticsearch, SQLR, HBase and Flume. For real-time or near real-time analytics, it mainly uses Storm, Spark, Shark and Redis.
This was first published in July 2014