Toward the end of 2012, Quocirca met with an interesting company called DataSift. DataSift is a social data platform company - it takes feeds of data from the majority of social media sites and can then mine social conversations for content, trends and insights. This is of obvious interest for organisations that are tracking sentiment towards their brand in the market - but it may have other uses as well.
The one obvious target for DataSift is Twitter - the vast majority of Twitter data is available in the public domain (only direct messages (DMs) are hidden from general view). However, DataSift can also track activity around an organisation's Facebook page, content from blogs and forums - including other semi-private information the organisation accesses via social networks established between itself and the public.
The platform is cloud-based, with prices based on a combination of "complexity", hours and hourly cost, along with a data cost. The hourly element is the simplest to explain: it is based on the length of the period being analysed - for a week, this would be 168 hours; for a month, (nominally) 720 hours. Complexity is more difficult, as it is based on a calculation that can only be completed once the query has been created. However, the business model does mean that you only pay for what you get: there are no on-going subscriptions that have to be paid no matter what - everything is on a per-use basis. The data cost is based on a small charge per Tweet analysed. DataSift recommends using a 10% sample rate, which retains statistical validity while lowering the price significantly.
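To make the pricing model a little more concrete, here is a minimal sketch in Python. It is not DataSift's actual formula, which is not given here: it simply assumes that the processing charge is complexity multiplied by hours and an hourly rate, that the data charge is a flat amount per Tweet analysed, and that sampling only reduces the data charge. The function name, parameter names and the rates used in the example are all hypothetical.

```python
HOURS_PER_WEEK = 7 * 24    # 168 hours, as in the text
HOURS_PER_MONTH = 30 * 24  # nominal 720-hour month, as in the text


def estimate_cost(complexity, hours, hourly_rate,
                  tweets, per_tweet_rate, sample_rate=1.0):
    """Rough per-use cost estimate under ASSUMED pricing rules.

    Assumption: processing charge = complexity * hours * hourly_rate.
    Assumption: data charge = Tweets actually analysed * per_tweet_rate,
    where sampling reduces the number of Tweets analysed.
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in the range (0, 1]")
    processing_charge = complexity * hours * hourly_rate
    data_charge = tweets * sample_rate * per_tweet_rate
    return processing_charge + data_charge


# Hypothetical rates, chosen only to show the effect of sampling.
full = estimate_cost(2.0, HOURS_PER_WEEK, 1.0, 1_000_000, 0.0001)
sampled = estimate_cost(2.0, HOURS_PER_WEEK, 1.0, 1_000_000, 0.0001,
                        sample_rate=0.10)
print(f"unsampled: {full:.2f}, 10% sample: {sampled:.2f}")
```

Under these assumptions, the sample rate only scales the per-Tweet part of the bill, so the saving from sampling grows with the volume of Tweets in the period being analysed.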
As a test, Quocirca asked DataSift to run a Twitter-only analysis of 2012 Twitter activity for a named set of vendors who are often mentioned in the same breath as big data. The query required just 10 lines of code to be written, and gave a complexity score of 2.1. Without the 10% filter in place, 2.23 million Tweets were analysed.
We selected an interesting topic as the basis for our test and Quocirca will be writing a more detailed piece on the findings, but the highlights below illustrate the potential power of the system:
- Twitter activity around big data grew by 64% over the year. This is not surprising - big data was still an emerging topic back at the beginning of the year, but was being pushed harder and harder by the vendors and the media as the year progressed.
- Nearly three quarters of Tweets contained an active link. People were not just dropping Twitter comments about big data - they were referring people to other content outside of Twitter.
- Apache had the biggest footprint, accounting for 9.4% of vendor mentions in Tweets. Apache, with its Hadoop parallel processing engine and Cassandra database, is unsurprisingly the big player here.
- In second place was 10gen, the commercial entity that looks after MongoDB, with 6.24% of vendor mentions.
- Of the "big guys", IBM gained a creditable third place with 3.25%, with HP in fourth with 2.38%.
- There were geographic differences - IBM's strongest country was France; Cloudera's was Japan. SAP was (unsurprisingly) strong in Germany; DataSift itself was very strong in the UK.
- At a domain level - that is, the sites that people linked to most from their Tweets - Forbes.com was a surprise winner. Behind that, GigaOM.com and TechCrunch.com were the next biggest content sources.
As a single point of interest, Quocirca took a look at HP at a sentiment analysis level. Through the first part of the year, people's views of HP remained fairly level, with a net sentiment score (positive comments minus negative comments) of 0 - not good news in itself, but it could have been worse. However, between 14th November and 10th December, a lot of sentiment activity took place.
On the 21st November, HP's sentiment score plunged close to -10,000. It recovered back to zero by the 24th, and then went back down to -5,000 on the 28th, rose again and then crashed down to -7,000 on the 1st December.
Why? On 20th November, HP's CEO Meg Whitman told Wall Street analysts that HP had massively overpaid for software firm Autonomy, and accused former executives at Autonomy of cooking the books. Financial and technical analysts - the very people who use social networking the most to get information out as quickly as possible - went into a frenzy. The ongoing fall-out was what caused the triple-dip in sentiment scores over the following weeks.
This shows that, although HP took fourth place in mentions around big data, those mentions were not necessarily positive for HP's brand. This is why a company such as DataSift is important - it can not only remove the grunt work of analysing the massive firehose of data that comes from social networks, but also apply solid analytics to that data, so that the results a customer sees are placed in context.
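For readers who want to see how the net sentiment score used in the HP example works, the following is a minimal sketch in Python of the calculation as described above: positive comments minus negative comments, totalled per day. It assumes each Tweet has already been labelled as positive, negative or neutral by some upstream classifier; the function name and the toy data are invented for illustration and are not taken from DataSift's platform or from the HP analysis.

```python
from collections import defaultdict
from datetime import date


def daily_net_sentiment(labelled_tweets):
    """Return {day: positive count minus negative count}.

    labelled_tweets is an iterable of (day, label) pairs, where label is
    'positive', 'negative' or anything else (treated as neutral).
    """
    net = defaultdict(int)
    for day, label in labelled_tweets:
        if label == "positive":
            net[day] += 1
        elif label == "negative":
            net[day] -= 1
        else:
            net[day] += 0  # neutral: record the day without moving the score
    return dict(net)


# Invented toy data, purely to exercise the function.
toy = [
    (date(2012, 1, 1), "negative"),
    (date(2012, 1, 1), "negative"),
    (date(2012, 1, 1), "positive"),
    (date(2012, 1, 2), "positive"),
    (date(2012, 1, 2), "negative"),
    (date(2012, 1, 2), "neutral"),
]
scores = daily_net_sentiment(toy)
for day in sorted(scores):
    print(day.isoformat(), scores[day])
```

A score of 0 can therefore mean either that nobody expressed a view or that positive and negative comments cancelled each other out, which is why a sudden swing, rather than the level itself, is the more useful signal.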