Sentiment analysis with Hadoop: 5 steps towards becoming a mind reader


This is a guest blogpost by Andy Leaver, vice president of international operations, Hortonworks

Mass advertising and campaign marketing are like the dodgy lettuce found somewhere at the back of the fridge – insufferably bland and way past its prime. With the explosion of blogs, fora and various other types of digital and social media, consumers have unprecedented power to share their brand experiences and opinions with each other on a massive scale.

Aside from the hashtag addiction affecting youngsters, this digital evolution opens up a huge opportunity for businesses, which can now collect data from its origin, identify relevant keywords and score them to predict an outcome and ultimately upsell.

According to Ofcom, 56% of us in the UK actively consult online reviews before we purchase and Google’s consumer barometer reported that 64% of all purchases in 2015 were done online. One of the alpha resources for information and advice on purchases that most of us increasingly turn to is Twitter. A survey conducted by Millward Brown showed that nearly half (49%) of female Twitter shoppers say Twitter content has influenced their purchase decisions. Of course, this can create a big data beast that’s difficult to manage!

This is where Apache Hadoop can come in, helping to predict trends, gauge consumer opinion and make real-time assessments based on unstructured data. Here is how this works with a Twitter stream…

Collect data

One of the easiest ways to collect data is, in our view, Apache NiFi,* a service for efficiently collecting, aggregating and moving large amounts of streaming event data. NiFi enables applications to collect data from its origin and send it to a resting location such as HDFS for later analysis. In the case of tweets, Twitter provides a free Streaming API, which allows NiFi to retrieve content and forward it to HDFS.

Here is how it works, which is simpler than it might sound: a flow in NiFi starts from the Twitter Client, which transmits a single unit of data to a Source (the entity through which data enters NiFi) operating within the Agent (the Java virtual machine running NiFi). The Source receiving this “Event” then delivers it to one or more Channels (the conduits between the Source and the Sink). One or more Sinks (the entities that deliver the data to the destination) operating within the same Agent drain these Channels.

Label your data

This is the most “business specific” part of the process. You will need to identify the words that are relevant to your business in order to build a kind of data dictionary, and assign each word or expression either a polarity (positive, neutral or negative) or a score (from 0 to 10, with 5 being neutral). Hadoop embeds customizable catalogues and dictionary tables to help you in this task.
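As a rough illustration of what such a data dictionary might look like, here is a minimal Python sketch. The words, their scores and the names LEXICON and polarity are all made up for the example; a real dictionary would come from your own business vocabulary.

```python
# Hypothetical business-specific dictionary: each term gets a score from
# 0 to 10, with 5 being neutral. Terms and scores are illustrative only.
LEXICON = {
    "love": 9,
    "great": 8,
    "refund": 3,
    "broken": 1,
    "terrible": 0,
}

NEUTRAL = 5


def polarity(term: str) -> str:
    """Turn a term's 0-10 score into a polarity label.

    Terms that are not in the dictionary are treated as neutral.
    """
    score = LEXICON.get(term.lower(), NEUTRAL)
    if score > NEUTRAL:
        return "positive"
    if score < NEUTRAL:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for word in ("Love", "broken", "table"):
        print(word, "->", polarity(word))
```

Keeping a numeric score alongside the label means you can later move the neutral threshold without relabelling every word.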

Apache HCatalog, a table management layer that exposes Hive metadata to other Hadoop applications, is especially useful as it presents a relational view of data. It renders unstructured tweets in a tabular format for easier management.
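To make the idea of a tabular view concrete, the sketch below flattens one raw tweet (a JSON document) into a flat row. It is plain Python rather than HCatalog itself, and the chosen columns and the function name tweet_to_row are assumptions made for the example.

```python
import json


def tweet_to_row(raw_json: str) -> tuple:
    """Flatten one raw tweet into an (id, created_at, screen_name, text) row.

    The four columns are an assumed layout for illustration; missing
    fields come back as None (or an empty string for the text).
    """
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}
    return (
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("screen_name"),
        tweet.get("text", ""),
    )


if __name__ == "__main__":
    # Made-up sample tweet, not real data.
    sample = json.dumps({
        "id": 1,
        "created_at": "Mon Jan 04 10:00:00 +0000 2016",
        "user": {"screen_name": "example_user"},
        "text": "Love my new phone",
    })
    print(tweet_to_row(sample))
```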

Run the analytics

Now that you have the data in HDFS, you can create tables in Hive. Then, with the help of Hadoop, score the sentiment of each tweet by comparing the number of positive words with the number of negative words it contains.
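The scoring rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of the counting logic, not how Hive runs it; the word lists and the function names score_tweet and label are invented for the example.

```python
import re

# Illustrative word lists; a real run would use your own data dictionary.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "broken", "terrible"}


def score_tweet(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives


def label(score: int) -> str:
    """Map a numeric score to a sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in (
        "I love this phone, great battery",
        "Screen arrived broken, terrible service",
        "Just bought a phone",
    ):
        print(label(score_tweet(tweet)), "|", tweet)
```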

Train, adapt and update your model

At this point, you will get your first results and be able to proceed to fine-tuning. Remember that analytic tools that just look for positive or negative words can be entirely misleading if they miss important context. Typos, intentional misspellings, emoticons and jargon are just a few of the additional obstacles in this task.

Computers also don’t understand sarcasm and irony and, as a general rule, are yet to develop a sense of humour. Too many sarcastic or ironic tweets and you will lose accuracy. It’s probably best to try to address this point by fine-tuning your model.
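One small example of the kind of context a plain word count misses is negation: "not good" contains a positive word but is hardly praise. The Python sketch below flips the polarity of a word that directly follows a negator. It is a deliberately naive assumption-laden illustration (the word lists and the name score_with_negation are invented here), and it does nothing for sarcasm or irony.

```python
import re

# Illustrative word lists, not a real dictionary.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "broken", "hate"}
NEGATORS = {"not", "no", "never"}


def score_with_negation(text: str) -> int:
    """Count positive minus negative words, flipping the polarity of the
    single word that immediately follows a negator."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for word in words:
        if word in NEGATORS:
            negate = True
            continue
        value = int(word in POSITIVE) - int(word in NEGATIVE)
        if negate:
            value = -value
        score += value
        # The negator only affects the very next word, sentiment-bearing or not.
        negate = False
    return score


if __name__ == "__main__":
    for tweet in ("This is not good", "Not bad at all", "I love it"):
        print(score_with_negation(tweet), "|", tweet)
```

Because the flip only reaches one word ahead, a phrase such as "not very good" still scores as positive, which is exactly the sort of gap that further tuning has to close.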

Get insights!

When done, simply run some interactive queries in Hive to refine the data and enjoy visualization of data via a BI tool (Microsoft Excel will do the trick if you want).

Depending on your business, Hadoop can enable you to take urgent marketing decisions and actions. This is just one of many ways to collect and analyse social data using Hadoop, and there are myriad other options open to be explored. It’s all about what is right for you!

* Hortonworks has a product, Hortonworks DataFlow, based on Apache NiFi.