Mike Ferguson on a decade of Hadoop


This is a guest blogpost by analyst Mike Ferguson on the 10th anniversary of Hadoop's becoming a separate Apache subproject.

In the 10 years since Hadoop became an Apache project the momentum behind it as a key component platform in big data analytics has been nothing short of enormous.

In that time we have seen huge strides in the technology, with new 'component' Hadoop applications being contributed by vendors and organisations alike. Hive, Pig, Flume and Sqoop, to name a few, have all become part of the Hadoop landscape, accessing data in the Hadoop Distributed File System (HDFS).

However, it was perhaps the emergence of Apache Hadoop YARN in 2013 that opened the floodgates by breaking the dependency on MapReduce. Today we have Hadoop distributions from Cloudera, Hortonworks, MapR, IBM and Microsoft, as well as cloud offerings such as Amazon EMR, Altiscale and Qubole.

Yet the darling technology today is Spark, with its scalable, massively parallel in-memory processing. It can run on Hadoop or on its own cluster and can access HDFS, cloud storage, RDBMSs and NoSQL DBMSs. It is a key technology, combining streaming analytics, machine learning, graph analytics and SQL data access in the same execution environment, even in the same application.

AMPLab and then Databricks have progressed Spark functionality to the point where even vendors the size of IBM have strategically committed to its future.

From a developer perspective, we have progressed way beyond just Java, with languages like R, Scala and Python now in regular use. Interactive workbenches like Apache Zeppelin have also taken hold in the development and data science communities, speeding up analysis.

Today we are entering a new era: the era of automation, which lowers the skills barrier and converts self-service business analysts into so-called 'citizen data scientists'. Data mining tools like KNIME, IBM SPSS and RapidMiner are already supporting in-memory analytics by leveraging the analytic algorithms in Spark. SAS is also running at scale in the cluster, but with its own in-memory LASR server.

There is also a flood of analytic libraries emerging like ADAM and GeoTrellis with IBM also open sourcing SystemML.

The ETL vendors have all moved over to run data cleansing and integration jobs natively on Hadoop (e.g. Informatica Blaze, IBM BigIntegrate and BigQuality) or on top of Spark (e.g. Talend).

Also, Spark-based self-service data preparation startups have emerged, such as Paxata, Trifacta, Tamr and ClearStory Data. On the analytical tools front, too, we have seen enormous strides. Search-based vendors like Attivio, Lucidworks, Splunk and Connexica all crawl and index big data in Hadoop and relational data warehouses. New analytical tools like Datameer and Platfora were born on Hadoop, while the mainstream BI vendors (e.g. Tableau, Qlik, MicroStrategy, IBM, Oracle, SAP, Microsoft, Information Builders and many more) have all built connectors to Hive and other SQL on Hadoop engines.

If that is not enough, check out the cloud. Amazon, Microsoft, IBM, Oracle, Google all offer Hadoop as a Service. Spark is available as a service and there are analytics clouds everywhere.

If you think we are done you must be kidding. Apache Flink is emerging; security is still being built out with Apache Sentry, Apache Ranger, Zettaset, IBM Guardium and more. Oh, and data governance is finally getting attention, though it remains work in progress, with the emergence of the information catalogue (Alation, Waterline Data, Semanta, IBM) together with reservoir management, data refineries and more. Exhausting, isn't it?

Without a doubt, Hadoop, along with Spark, has transformed and is still transforming the analytical landscape. It has pushed analytics into the boardroom. It has extended the analytical environment way beyond the data warehouse, but it is not replacing it. ETL offload is a common use case, taking staging areas off data warehouses so that CIOs can avoid data warehouse upgrades. And yet more and more data continues to pour into the enterprise to be processed and analysed. There is an explosion of data sources, with a tsunami of them coming over the horizon from the Internet of Things.

SQL on Hadoop

But strangely, here we are with increasingly fractured, distributed data, and yet business demands more agility. Thank goodness SQL prevails. Like it or loathe it, the most popular API on the planet is coming over the top of all of it. I'm tracking 23 SQL on Hadoop engines right now, and that excludes the data virtualisation vendors! Thank goodness for data virtualisation and external tables in relational DBMSs. If you want to create the logical data warehouse, this is where it is going to happen. Who said relational is dead? Federated SQL queries and optimisers are here to stay. So... are you ready for all this? Do you have a big data and analytics strategy, a business case, a maturity model and a reference architecture? Are you organised for success? If you want to be disruptive in business you'll need all of this. If you are still trying to figure it all out, you can get help.
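Federated SQL is easier to picture with a toy example. This hypothetical sketch uses Python's built-in sqlite3 module to attach two separate databases and join across them in one query, a miniature of what data virtualisation and external tables do across Hadoop and relational stores (all table and column names are invented):

```python
import sqlite3

# Two independent "stores": a warehouse DB and a separate landing DB.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
wh.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")

# Simulate a second physical store by attaching another database.
wh.execute("ATTACH DATABASE ':memory:' AS landing")
wh.execute("CREATE TABLE landing.clicks (customer_id INTEGER, page TEXT)")
wh.execute(
    "INSERT INTO landing.clicks VALUES (1, '/home'), (1, '/pricing'), (2, '/home')"
)

# One federated-style query spanning both stores.
rows = wh.execute("""
    SELECT c.name, COUNT(*) AS clicks
    FROM customers c JOIN landing.clicks k ON k.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 2), ('Globex', 1)]
```

A real logical data warehouse does the same thing at scale, pushing each sub-query down to the engine that owns the data.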

Happy Birthday Hadoop!

Honey, I split the universe -- quantum physics and BI


This is a guest blogpost by James Richardson, business analytics strategist, Qlik

I find solace in the many worlds interpretation (MWI) of reality. In layman's terms, the idea, first proposed by physicist Hugh Everett* in 1957, means that 'every possible outcome of every event defines or exists in its own "history" or "world"'.  In other words, every time an event happens the universe splits.

Admittedly this is a controversial hypothesis, dividing physicists - some of whom find the idea abhorrent and little more than wishful thinking. But as I've said, I find the idea of infinitely branching universes comforting, as it means that in at least one other of these probable multiverses, one of me is a rock star or Olympic gold medalist! It also means that in another universe my parallel self and his parallel wife have a daughter who can talk to us, and who will live an independent life free of disability and difficulty.

So how does this relate to business intelligence (BI)?

When I was at Gartner we used to run an annual survey on the buying drivers of BI - I think the last one was in 2010 - which was consistent year-on-year in finding that the main reason organizations invest in BI is to "speed up and improve decision making". A decision is an event. As such, in an MWI reality every possible decision outcome runs in parallel, with branching universes created at the point of every decision. If MWI is real it makes what people actually do with BI very different to how it's usually considered.

Here's my logic:

1. People use BI as a driver for decisions.

2. Decision events split universes.

3. Therefore BI is a tool for switching between parallel universes.

To push the logic further, if the likely decision outcome is known, BI is a tool for consciously navigating parallel universes. Here lies the problem though. For most business decisions, beyond using rule-of-thumb experience and other heuristics, the likely result of decisions is not known in advance. This came up in a recent Qlik survey, where 36% of those questioned cited "lack of clear outcome from choices" as an inhibitor to decision making. As such, any decision making, and therefore universe switching, is uncertain, maybe even random in some cases. This problem is why probabilistic methods, and in particular Monte Carlo simulations, become very useful indeed. By calculating the statistical probability of alternatives - and remember the QIX engine supports Monte Carlo methods - the decision can be made and the universes navigated with less uncertainty (but still some uncertainty; we can't forget the chaotic nature of complex systems, but that's for another blog post).
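A minimal Monte Carlo sketch of weighing two decision options in plain Python (the payoff distributions are entirely invented; nothing here is specific to the QIX engine): simulate each option many times and compare the distribution of outcomes rather than a single guess.

```python
import random
import statistics

random.seed(42)

def simulate(option, trials=10_000):
    """Monte Carlo: draw many possible outcomes for one decision option."""
    outcomes = []
    for _ in range(trials):
        if option == "launch":   # uncertain payoff: high mean, high variance
            outcomes.append(random.gauss(100, 60))
        else:                    # "wait": modest payoff, low variance
            outcomes.append(random.gauss(40, 10))
    return outcomes

for option in ("launch", "wait"):
    o = simulate(option)
    print(option, round(statistics.mean(o), 1), round(statistics.stdev(o), 1))
```

The mean tells you which universe you are steering toward on average; the standard deviation tells you how much the branches fan out.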

So, next time someone asks you what a BI system is for, tell them it's to navigate to a parallel universe. They may think you're crazy, but at least you'll have intrigued them and hopefully made them think about what decisions really do and why they're so important. Or not... remember, if you get it wrong in this universe, you got it right in another.

*For the music fans - Hugh Everett was the father of 'E' in the indie band Eels. You can see him present a BBC documentary on his father's life and work on YouTube.

New Year's Data Resolutions 2016


This is a guest post by Matt Davies, technical evangelist, Splunk

As is the way, it is time to look forward to 2016 technology, with predictions as to how IT and business will evolve over the next 12 months.

Don't make a data lake, make a data reservoir

2016 is going to see a lot of talk about data lakes. This will probably be tied into conversations about Hadoop, data ingest, Spark, frameworks, etc. The first resolution is to make a data reservoir, not a data lake. No doubt (data) lakes are (architecturally) beautiful, but they get created over a long period of time, are uncontrolled, aren't managed and are prone to flooding. Reservoirs are either managed lakes that evolve or are purpose-built with the right boundaries, water inputs and outputs. Reservoirs also tend to have multiple purposes: they can be used to generate power, provide water, for recreation, and so on. In 2016, don't make a data lake, make a data reservoir.

Use data as the schoolwork for your machine learning

Machine learning is going to continue to be a hot topic. Expect more talk about algorithms, the threat/promise of artificial intelligence emerging from machine learning, and one too many Terminator and SkyNet references. On a more practical note, we'll start to use data more effectively with machine learning. With my children ready to go back to school after the holidays, I got to thinking that data scientists are the teachers, data is the syllabus, textbooks and homework, and machine learning acts as the brain. To make the most of machine learning in 2016 you really need all three: a great teacher, the right training data and a half-decent machine learning brain.
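The analogy in miniature, as a hypothetical pure-Python sketch: the hand-labelled examples are the syllabus marked by the teacher, and a 1-nearest-neighbour rule is a (very) half-decent brain. All points and labels below are invented.

```python
# "Syllabus": labelled training examples provided by the teacher.
training = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.8), "cat"),
    ((5.0, 5.0), "dog"),
    ((4.8, 5.2), "dog"),
]

def predict(point):
    """The "brain": classify a new point by its closest labelled example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda ex: dist(ex[0], point))
    return label

print(predict((1.1, 0.9)))  # cat
print(predict((5.1, 4.9)))  # dog
```

Swap in bad training data and the same brain confidently gives bad answers, which is the point of the analogy: the teacher and the syllabus matter as much as the algorithm.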

Make sure the glass (table) is completely full and make your IT analytics hybrid

You've got cloud, you've got on-premise (and you've got some Christmas socks or shower gel... again!). The next New Year's data resolution is to make your IT Operational Analytics hybrid. You might have a decent view of how your on-premise IT is performing but have you got the same for the IaaS, PaaS or SaaS you have in the cloud (be it public or private)? Secondly, do you have a view of both together? With most modern IT and application landscapes, APIs, containers and microservices, you've got a complete mix or blend with a "bit of everything". The common currency from all these IT deployment models is the machine data they generate.  In 2016, make sure you've got visibility and insight into everything, be it cloud or on-premise with hybrid ITOA.

Get enterprise security right and unlock all kinds of possibilities from data

Who knows what 2016 holds for security? The types of attacks will change, becoming more sophisticated or larger in scale, and the tools and data required to combat them will need to keep pace. Security analytics will need to be real-time, predictive and cater for more types of data than ever before. On a positive note, we saw in 2015 that getting security right can unlock all kinds of benefits across an organisation in areas you'd never have expected. As well as spotting insider threats and advanced attacks and helping ensure breach defence, we've also seen the data collected for security used to deliver value in other business units and departments. IKEA is a great example of this. On its journey to ensuring enterprise security, it improved its eCommerce monitoring capability and also the analytics it delivered to different parts of the business. In 2016, getting your enterprise security strategy right will enable you to do more with your data and potentially use security as an unlikely place to start innovating.

Connected gadgets are cool but industrial data will deliver on the promise of IoT in 2016

You may well have had your first IoT [Internet of Things] Christmas and been given some form of wearable or connected "thing". No doubt that electric toothbrush that talks to my connected thermostat and tells me I need to turn the heating up while ordering more toothpaste is useful - but industrial data has the biggest potential to change traditional industries in 2016. The energy, manufacturing, construction, automotive, building, travel and logistics industries are already in the middle of disruption, with initiatives like Industry 4.0 and sensor-driven production, monitoring, repair and maintenance.

The industrial data (such as SCADA data) from these sensors will become an important and appreciated resource in 2016 and will help to create efficiencies and value added services. This data will also help drive the IoT security agenda this year. In 2016, look at where you can maximise the benefit of industrial data and think about the value it can bring to traditional industry. 

Smart fraud detection needs a fresh data approach


In a guest blogpost, Emil Eifrem of Neo Technology says graph databases detect sophisticated scams and fraud rings in real time

Before PayPal came along, online commerce was fraught with security problems. But how did PayPal and its like solve the problem?

By being able to mount a real-time view of its entire payment network, thanks to a new way of visualising data and the complex networks of connections between users. More precisely, it did this by using an emerging alternative to SQL database technology: graph technology. Years later, it's time to take graphs a step further and embed them into fraud detection systems.

Unlike most other ways of looking at data, graph databases are designed to exploit relationships in data, which means they can uncover patterns that are difficult to detect using traditional representations such as tables. And although this technology was developed in-house by the big social web giants (Google, for instance, exploited the connections between Web documents to rank search results - the 'Google algorithm'), what took many engineer-hours to construct is now available to the wider market. Forrester says over a quarter of enterprises will be using such databases by 2017, for instance, while Gartner believes that over 70% of leading companies will be piloting a graph database by 2018.

As a result, an increasing number of enterprises, from banks to ecommerce firms, are using graphs to solve a variety of complicated data problems in real time, including the speedy detection of fraudulent activity.

Varieties of online hoodwinking

There are various types of fraud - first-party, insurance, and e-commerce fraud, etc. But what they all have in common are layers of deceit. Traditional technologies, while still suitable for certain types of prevention, are simply not designed to detect these layers, which are only really visible by spotting patterns in data and relationships. Graph databases, in contrast, through connected analysis, provide a unique ability to uncover a variety of important fraud patterns, and in real time.

First party fraud is a good example of how graph technology can make a difference, as the complexity of the relationships is what makes these schemes so damaging. Banks lose tens of billions of pounds annually from this form of deception; experts suggest as much as 20% of unsecured bad debt at leading US and European banks is due to this form of opportunistic crime.

However, it's the network of relationships powering this that makes the fraud ring vulnerable to graph-based methods of detection. First-party fraud involves the fraudsters opening bank accounts and taking out loans, credit cards and overdrafts. They initially behave like legitimate customers until the moment they clean out all their accounts and disappear. Collections processes kick in, but these account thieves are long gone, repeating the process elsewhere.

A fraud ring like this usually involves two or more people sharing a subset of legitimate contact information to create a series of false identities. In the case of two individuals, sharing only a phone number and address (two pieces of data), they can create four false identities with fake names, each with four to five accounts - a total of 18 accounts. Assuming an average of £4,000 in credit exposure per account, the bank's loss could be £72,000, perhaps more. The potential loss in a ten-person fraud is no less than £1.5m, assuming 100 false identities and three financial instruments per identity with a £5,000 credit limit, and so on.
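A hypothetical, pure-Python sketch of the idea: link identities that share a phone number or address, then walk the connected component to surface the ring. All names and the per-identity exposure figure are invented for illustration; a graph database does this declaratively and at scale.

```python
from collections import defaultdict, deque

# Synthetic identities: a small ring sharing a phone and an address
# behind several fake names, plus one unrelated legitimate customer.
identities = {
    "I1": {"phone": "555-0100", "address": "1 High St"},
    "I2": {"phone": "555-0100", "address": "2 Low Rd"},
    "I3": {"phone": "555-0199", "address": "1 High St"},
    "I4": {"phone": "555-0100", "address": "1 High St"},
    "I5": {"phone": "555-0777", "address": "9 Elm Ave"},  # legitimate
}

# Index identities by each piece of contact data they use.
by_attr = defaultdict(list)
for ident, attrs in identities.items():
    for attr in attrs.values():
        by_attr[attr].append(ident)

# Build identity -> identity edges through shared attributes.
graph = defaultdict(set)
for linked in by_attr.values():
    for a in linked:
        for b in linked:
            if a != b:
                graph[a].add(b)

def ring_containing(start):
    """Breadth-first walk: every identity reachable via shared data."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)

ring = ring_containing("I1")
print(ring)              # ['I1', 'I2', 'I3', 'I4'] - I5 stays outside
print(len(ring) * 4500)  # rough exposure at an assumed ~£4,500 per identity
```

The legitimate customer I5 shares nothing and never enters the component, which is exactly the false-positive behaviour you want.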

To meet the challenge, Gartner has proposed a layered model for fraud prevention that starts with simple discrete methods but which progresses to more elaborate types of analysis, specifically, Entity Link Analysis that leverages connected data. This is another way of saying, look at the relationship patterns - which by definition, is a form of analysis graph databases excel at.

Discrete data is hard to work with

Banks' standard instruments for dealing with fraud, such as monitoring for deviation from normal purchasing patterns, are all about discrete data rather than the bigger network of relationships. Discrete data picks up sole fraudsters well enough, but it can't as easily detect the shared characteristics that typify fraud rings (collectives working often cross-border, even cross-continent). What's more, such methods tend to issue false positives, which harm customer relationships.

The problem bedevils traditional relational database approaches because they can only really model data as a set of tables and columns, and carrying out complex joins and self-joins when the dataset becomes more inter-related is messy and painful. Such queries are technically tricky to construct and expensive to run, and making them work in real time is problematic, with performance faltering as the total data set size increases.

Graphs are your stepping stone to 'in-flight' fraud blocking

Graph databases, by contrast, have none of these issues. Used with modern data query languages like Cypher, they offer a simple semantic for detecting fraud rings and navigating the data connections in-memory and in real time. That makes spotting the connections between fraudsters and their activities far more straightforward, potentially before anything untoward takes place. And as business processes become faster and increasingly more automated, the window we have to detect fraud is shrinking too.

That makes the need for real-time, in-flight fraud blockage all the more important. Graph databases provide a unique ability to uncover a variety of important fraud patterns, in real time, and are a major step in the right direction. The verdict has to be: take a leaf out of the social web giants' book and look at this great data infrastructure alternative for working better with complex data.

2016: What's next?


This is a guest blogpost by Larry Augustin, CEO, SugarCRM

It's that time again: prediction season.

Many predictions have been made by the global research and intelligence companies. Gartner has predicted that by 2017, 50% of product investment projects will be redirected to customer experience innovations.

Walker Info suggests that by 2020, customer experience will overtake price and product as the key brand differentiator, highlighting the impact IT and more specifically Customer Relationship Management systems will have on a business's success. But what is in store for 2016?

What will determine the success of a CRM system in the coming year as the evolution of customer service excellence continues? These are my predictions for the year ahead:

Personalised Analytics - This is the next big data trend. CRM is moving toward "systems of engagement" that use predictive analytics to cut through the big data noise to uncover actionable customer insights. Soon salespeople and marketers will use predictive analytics to forecast the impact of their activity and provide more personalised pitches or content to individual customers. By offering greater analytics for the individual user through flexible and usable tools, modern CRM systems will provide sales and marketing teams with all the relevant customer information they need to deliver personalised customer excellence.

Data Privacy Concerns will Affect SaaS CRM Deployments - 2015 saw cyberattack after cyberattack indiscriminately targeting businesses and their customer data. These attacks have raised real online privacy concerns. Therefore, a well-designed and tightly-integrated CRM system is now more than ever imperative to any organisation. In 2016, more companies will opt to deploy CRM with cloud agility, meaning they can maintain security and control of customer data, choose the best public, private or hybrid cloud deployment model (as well as on-premise) for their business, and ensure regulatory compliance.

Mobile CRM will get even better - Mobile is, and will continue to be, a rising focus for the CRM space. One of the great benefits of CRM is that it allows businesses to organise themselves more effectively. However, as the workforce disperses and people spend time out of the office, a mobile CRM app is crucial so that valuable interactions on the road aren't left behind. In the past, many mobile CRM apps have had limited functionality. As we move forward, mobile platforms will become more powerful. You'll see smartphones display the latest analytics and dashlets via their CRM. In addition, users will be able to better customise their mobile experience to get the data they want and transform that data into actionable tasks that address customer needs in real time.

UX will be bigger and better - In 2016, companies that focus on differentiating themselves by providing a positive customer experience will thrive. What drives this? Customer-facing employees having the right information and tools to best serve the customer at exactly the right time. We will now see CRM users having access to enhanced, modern interfaces that incorporate social and mobile customer data to empower the employee to drive extraordinary customer relationships. A fulfilling user experience will mean more intelligent CRM practices, which will make it much easier to execute a seamless customer journey from awareness and purchase to retention and advocacy.

CRM and the Internet of Things (IoT) will become intertwined - Predictions can't be predictions without mentioning the IoT. 2016 will see CRM and the IoT become heavily integrated for the first time. The potential of harnessing the data of billions of connected devices and integrating that data within the CRM to create extraordinary customer relationships is very exciting. This year, CRM platforms will begin to evolve to work with the data that is being generated, make sense of that data and communicate to the people who can benefit from the analysis so they can perform real actions to help the customer.

Four signs your business needs a data lake


This is a guest blogpost by Dr. Thore Rabe, vice-president Europe, Middle East, Africa - Isilon Division at EMC

It is now well known that the digital universe, which comprises most businesses' data needs, is growing exponentially.

In this environment, it is critical that businesses use data analytics to enhance competitiveness and meet the needs of the 'information generation': millennials and more born into the digital era. From helping to predict buying behaviours, to driving innovation projects that will enhance customer service or improve business productivity, data lakes that can collate, store and analyse vast amounts of data have great power to transform a business for the better. Analytics should no longer be an aspiration, but a necessity.

However, many organisations get stuck early on in the journey. One of the main reasons is that IT and the rest of the business aren't always aligned on the best use cases and business goals of big data projects. While some businesses might be experimenting with basic data analysis (and some haven't even started), many just aren't prepared for the next level, which is far more complex and in-depth. In fact, we estimate that only a minority of businesses currently have the capacity to be always on and operate in real time across the organisation, and almost a third haven't even started doing this.

So, how do businesses know when they need to scale-up and invest in a data lake?

There are four tell-tale signs:

1. Operational complexity: In a pre-data lake environment, if a business is trying to scale its infrastructure but doesn't have the option of adding FTE (full-time equivalent) management support, there's a good chance that its data requirements will outstrip its ability to manage them. Traditional tier 1 data resources aren't always pooled virtually, limiting the amount of storage an individual manager can cope with and making a clear case for a more flexible common storage resource, i.e. a data lake.

2. Operational cost: When a company finds that business demands on IT keep growing even while it is trying to reduce OpEx, it is time to look at a new approach. The same operational overheads that limit the addition of FTEs also result in growing OpEx for managing IT resources. To address these requirements, businesses either need more FTEs or to invest in additional third-party support to monitor, manage, deploy and improve their systems. The latter approach scales an order of magnitude better - or more - than simply adding headcount.

3. Production strain: Another key indicator of the need for a data lake is when existing analytics applications are putting a strain on the production systems of a business. Real-time analytics can be extremely resource-intensive, whether deriving insights through video analytics from dozens of HD video streams or poring through a vast waterfall of social content; dedicated resources are needed so that people using the production systems don't see a drop-off in performance. Data lakes are key to ensuring that real-time analytics can run at optimum performance.

4. Multiprotocol analytics: A final key indicator that a business needs a data lake is when data scientists are running apps on a variety of different Hadoop distributions and need to hook their data up to them. Businesses will need multiprotocol support in the future as analytics experimentation carries on, and they need to plan for this with a data lake strategy.

Departments like marketing have led the way in analytics adoption, gathering insights to better understand their customers and tailor their communications accordingly, but other business areas are now interested in the benefits it can bring to them, from HR to IT to operations and beyond.

Across the industries, from finance to retail, manufacturing to media companies, each thinks that their problems, challenges and opportunities are unique. But, when you abstract the specifics you'll always come back to the same universal challenges I've mentioned in this piece. What unifies and characterises all of these is the transformation brought about by information technology and the potential of big data.

Not every business will be ready to deploy data analytics yet, but most will, at the very least, need to start planning for it or risk losing out to competitors that embrace the technology. Because, eventually all businesses will need to embrace data analytics, and those that don't will fade into obscurity. 

Panning for gold - the data we know and use and the data we don't


This is a guest blogpost by Matt Davies, technical evangelist, Splunk

They say you don't know what you don't know. It's the same with your data. Most organisations have data they know about, collect and use. This data is typically structured, neat and tidy and probably in some form of database or data warehouse. However there is also a wealth of data that they don't know they have and aren't using. This is most likely machine data. It comes from every technology interaction be it machine-to-machine, person-to-machine or person-to-person (via technology). According to IDC, the digital universe is growing at 40%  a year and most of the data generated is machine data. It is coming from core IT, customer-facing applications, cloud computing, mobile devices, social media and the Internet of Things.

This machine data has certain characteristics: it is in motion (often very fast motion), there's a lot of it and it is time series data. It is also messy data (it is unstructured) and it is lazy (every company generates it but it typically gets left unused). But, to steal from a Wild West cliché, "there is gold in them thar data". The challenge has always been how do I get to it, how do I make it useable and how do I find value from it?
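As a small, hypothetical illustration of "making it usable": turning one messy, time-stamped machine event into structured fields. The log format and every field name below are invented.

```python
import re
from datetime import datetime

# One raw, "messy" machine-generated log line (invented format).
line = '2016-01-12T09:41:07Z host=web-03 status=503 ms=2174 path="/checkout"'

PATTERN = re.compile(
    r'(?P<ts>\S+) host=(?P<host>\S+) status=(?P<status>\d+) '
    r'ms=(?P<ms>\d+) path="(?P<path>[^"]*)"'
)

def parse(raw):
    """Extract structured, typed fields from a raw log line."""
    event = PATTERN.match(raw).groupdict()
    event["ts"] = datetime.strptime(event["ts"], "%Y-%m-%dT%H:%M:%SZ")
    event["status"] = int(event["status"])
    event["ms"] = int(event["ms"])
    return event

event = parse(line)
print(event["host"], event["status"], event["ms"])  # web-03 503 2174
```

Once the time, host and status are typed fields rather than text, the same event can be queried through the security lens, the IT-operations lens or the customer-service lens.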

Increasingly, the ability to use the same information for multiple purposes is one of the secrets to making the most of any kind of data. Think of your real-time machine data as a stream of light: you need some form of prism to be able to look at it through a different "lens" or colour. The same data has value for security, IT, customer service, and so on. The term data silo isn't new, and the barriers preventing people from using data are often what hinder data-centric initiatives. A lot of time is typically spent collecting and preparing data before you ever start to ask questions and get value from it.

 So if there's data we're not making use of, that has benefit for multiple audiences and it takes a lot of time to get to the value from it - how do you start? Technology and modern data platforms can help but this must go hand in hand with building a culture of exploration around your data and using analytics as a way of democratising it for everyone.

I have seen a great example from Deutsche Bahn, which ran a 24-hour hackathon where it provided a data set from its rail infrastructure and challenged all-comers to "show us what's in the data". By exploring this previously untapped source of data, Deutsche Bahn identified potential train delays, saw how journey time compared on wooden versus concrete sleepers, and learned where outages are more likely to occur. I thought it was an interesting example of largely unused data, a culture of exploration, and valuable new insight and analytics.

To further illustrate, I was fortunate enough to be in a presentation with UniCredit, a European bank, which is managing multiple terabytes of this machine data every day. It takes data from over 180 different sources: 8 billion events per day and 400,000 events per second at peak. It uses this data to improve its banking operations, create real-time alerts, search for patterns of behaviour and deliver real-time data visualisations. This capability is delivered to various parts of the bank for business analytics, security intelligence, ITOA, internet banking service monitoring, mobile banking insight and improved accounting. The value from the data includes improved SLAs, faster issue resolution, a real-time, data-centric approach to decision making and the chance to improve customer experience.

Think about the data you know and use, then think about the data you don't know you have and don't use. Try it, explore it, share what you find and see if "there is gold in them thar data". 

Sentiment analysis with Hadoop: 5 steps towards becoming a mind reader


This is a guest blogpost by Andy Leaver, vice president of international operations, Hortonworks

Mass advertising and campaign marketing are like the dodgy lettuce found somewhere at the back of the fridge - insufferably bland and way past its prime. With the explosion of blogs, fora and various other types of digital and social media, consumers have unprecedented power to share their brand experiences and opinions with each other on a massive scale.

Aside from the hashtag addiction affecting youngsters, this digital evolution opens up a huge opportunity for businesses, which can now collect data at its origin, identify relevant keywords and score them to predict an outcome - and ultimately to upsell.

According to Ofcom, 56% of us in the UK actively consult online reviews before we purchase and Google's consumer barometer reported that 64% of all purchases in 2015 were done online. One of the alpha resources for information and advice on purchases that most of us increasingly turn to is Twitter. A survey conducted by Millward Brown showed that nearly half (49%) of female Twitter shoppers say Twitter content has influenced their purchase decisions. Of course, this can create a big data beast that's difficult to manage!

This is where Apache Hadoop can come in: helping to predict trends, gauge consumer opinion and make real-time assessments based on unstructured data. Here is how this works with a Twitter stream...

Collect data

One of the easiest ways to collect data is, in our view, Apache NiFi,* a service for efficiently collecting, aggregating and moving large amounts of streaming event data. NiFi enables applications to collect data at its origin and send it to a resting location such as HDFS for later analysis. In the case of tweets, Twitter provides a free Streaming API from which NiFi can retrieve content and forward it to HDFS.

Here is how it works, which is simpler than it might sound: a flow starts from the Twitter client, which transmits a single unit of data to a Source (the entity through which data enters the pipeline) operating within the Agent (the Java virtual machine running the flow). The Source receiving this "Event" then delivers it to one or more Channels (the conduits between the Source and the Sink). One or more Sinks (the entities that deliver the data to its destination) operating within the same Agent drain these Channels.

Label your data

This is the most business-specific part of the process. You will need to identify words that are relevant to your business, build a kind of data dictionary, and assign each word or expression a polarity (positive, neutral or negative) or a score (from 0 to 10, with 5 being neutral). Hadoop embeds customisable catalogues and dictionary tables to help you in this task.

Apache HCatalog, a table management layer that exposes Hive metadata to other Hadoop applications, is especially useful as it presents a relational view of data. It renders unstructured tweets in a tabular format for easier management.

Run the analytics

With the help of Hadoop, score the sentiment of the tweets by comparing the number of positive words to the number of negative words in each tweet. Now that the data is in HDFS, you can create tables in Hive.
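To make the scoring step concrete, here is a minimal sketch of a dictionary-based scorer in Python. The word lists and tweets are invented for illustration; a real dictionary would be business-specific, as described in the labelling step above.

```python
# Minimal dictionary-based sentiment scorer: score = positive hits - negative hits.
# Word lists and tweets below are invented examples, not a real data dictionary.

POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "awful", "rude"}

def score_tweet(text):
    # Lower-case, split, and strip common punctuation before dictionary lookup.
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos - neg  # > 0 positive, < 0 negative, 0 neutral

tweets = [
    "Love the new app, really fast and helpful",
    "Support was rude and the site is broken",
]
scores = [score_tweet(t) for t in tweets]
```

At cluster scale the same logic would run as a Hive query joining tweets against the dictionary table, but the arithmetic is identical.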

Train and adapt and update your model

At this point you will have your first results and can proceed to fine-tuning. Remember that analytic tools that just look for positive or negative words can be entirely misleading if they miss important context. Typos, intentional misspellings, emoticons and jargon are just a few of the additional obstacles in the task.

Computers also don't understand sarcasm and irony, and as a general rule have yet to develop a sense of humour. Too many of these and you will lose accuracy, so it's best to address them by fine-tuning your model.

Get insights!

When done, simply run some interactive queries in Hive to refine the data and enjoy visualising it via a BI tool (Microsoft Excel will do the trick if you want).
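For readers without a cluster to hand, here is a rough Python sketch of the kind of refinement such an interactive query performs: bucketing scored tweets by sentiment and counting each bucket. The scored input pairs are invented for the example; in Hive this would be a GROUP BY over the scored table.

```python
from collections import Counter

# Invented example input: (tweet_id, sentiment_score) pairs, as produced
# by the scoring step above.
scored = [(1, 3), (2, -2), (3, 0), (4, 1), (5, -1)]

def bucket(score):
    # Map a numeric score onto the polarity labels from the data dictionary.
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

counts = Counter(bucket(s) for _, s in scored)
```

The resulting counts are exactly the sort of small, refined result set you would hand to a BI tool for visualisation.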

Depending on your business, Hadoop can enable you to take urgent marketing decisions and actions. This is just one of many ways to collect and analyse social data using Hadoop, and there are myriad other options open to be explored - it's all about what is right for you!

* Hortonworks has a product, Hortonworks DataFlow, based on Apache NiFi

Five years: top ten speculations for the future of BI

| No Comments
| More

This is a guest blog post by James Richardson, Business Analytics Strategist at Qlik. In it, he speculates on what the analytics future may look like by 2021 (just don't hold him to it!)

At this time of year there are a lot of 'BI trends for next year' pieces around - I know as I've been asked to write enough of them.  Most of them look to the year ahead and offer little more than a series of assertions. Worse than that, they're boring. So when I was asked to consider the future (always something I'm wary of - given predictors' tendency to get things spectacularly wrong) I thought why not go big, and look five years ahead and make some educated guesses based on evidence we've gathered and partner projects Qlik is running today?

So here are ten speculations.  By 2021:

1. Analytics of new data sources will have undermined some long-standing business models. Take drivers' insurance: the widening use of telematics could mean the demise of the actuarial-table-based shared-risk model, as cohorts of drivers get removed from the population and charged based on analysis of actual driving behaviours. Health insurance can't be far behind. Perhaps this becomes true of public health systems too, which refocus on proactive health promotion rather than reactive illness treatment. Further, classic white-collar roles like that of the auditor look increasingly ripe for analytic automation. This is a logical continuation of the mechanisation of intellectual work - we've already forgotten that 'computer' and 'calculator' were people's job titles not too long ago.


2. Decision makers will be making wide use of shared, immersive analytic experiences. BI development has been focussed on small form-factor devices, but the locus will now shift to very large (think wall-size) touch devices. These will enable teams of colleagues to work towards decisions through side-by-side exploration of data in real thought time. In 2015, disagreement with peers was the third most common reason for not making a decision, cited in 39% of cases; this kind of collaborative data experience will mean that by 2021 we'll be working in the data together.


3. BI will support a wider gestalt and a fuller range of human learning styles. The visual representation of data is dominant in 2015, but not all people who need to use data are equally visually oriented. Humans use an individual mixture of sensory inputs to learn - often defined as three learning styles: auditory/reading, visual and kinaesthetic. By 2021, business intelligence will be using delivery media that serve all of these learning styles - for example, for auditory learners, auto-generated narratives in written or spoken form that describe the shape of the data selected or the contents of a chart. Similarly, 3D printing may play a role in creating charts that kinaesthetic learners, who work best when they can physically get their hands on something, can feel. Of course, for visually-led learners the options will grow, taking advantage of massive, higher-resolution displays to enable the rendering of massive data sets, and perhaps virtual reality experiences.


4. The data literacy gap will have narrowed. Inevitably people will become familiar and comfortable with more forms of data visualization over the next five years and will learn to read and use the insights in charts more readily. (An analogy here is how people have become familiar with the 'language' of film over time, to the point where interpreting a movie is second nature.) Perhaps more importantly, the education system will have reacted by adding more analytics to business and other courses. Leading organizations will be mandating training on data literacy, as they recognize that data literacy among staff is a driver of competitive advantage. Of course, the implication of more data-literate people is more demand for, and of, data.


5. Personal analytics become baseline behaviour. The behaviour we see in the quantified-self movement may be uber-geeky now, but as more data comes on stream from services and devices it will rapidly become the norm, as individuals analyse the "data of me" for self-improvement. Not only that, they'll use analytics more and more as part of family life and in their communities (whether geographic or of shared interest). The interesting implication for software vendors is that this is another consumerization trend, driven by personal preference, with implications for an eventual BYOAT (bring your own analytic tool) model.


6. More people will (finally) be making use of predictive analysis. This has been a long time coming - today, although most organizations have a few people doing more sophisticated statistical forecasting, it's not widespread; data from industry analysts has shown for years that fewer than 20% use predictive analytics broadly (i.e., as part of their BI projects). Two drivers will be critical to overcoming this barrier. The first is using technology to 'nudge' non-statisticians by automatically showing them likely trends - for example, line charts that use a best-fit model to forecast three periods out, telling users in narrative form that a KPI will fall out of the acceptable range by a certain date, or using Monte Carlo simulations in analytic apps. The second driver is simply the broad availability of tools to support predictive modelling. In the past the technology and know-how were asymmetrically distributed - in the hands of a few mavens. By 2021, the fact that this has not been the case for more than 20 years (thanks notably to the open source R language) will have done for statistical and probabilistic analytics what Gutenberg did for writing.
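The 'nudge' described above - a best-fit model projecting a KPI three periods out and warning of a breach - can be sketched with a toy ordinary-least-squares fit. The KPI series and acceptable-range floor here are invented for illustration:

```python
# Fit y = a + b*t by ordinary least squares, then forecast three periods ahead
# and check each forecast against an acceptable-range floor.
# The KPI values and floor are invented example numbers.

kpi = [100.0, 96.0, 93.0, 89.0, 86.0]  # last five periods, trending down
floor = 80.0                            # lower bound of the acceptable range

n = len(kpi)
t = list(range(n))
t_mean = sum(t) / n
y_mean = sum(kpi) / n

# Closed-form least-squares slope and intercept.
b = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, kpi)) / \
    sum((ti - t_mean) ** 2 for ti in t)
a = y_mean - b * t_mean

forecast = [a + b * (n + k) for k in range(3)]  # periods n, n+1, n+2
breach = [f < floor for f in forecast]          # which future periods breach?
```

On this series the fitted trend drops roughly 3.5 per period, so the tool could tell the user, in narrative form, that the KPI is projected to fall below the acceptable range in the second forecast period.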


7. There will be much easier analysis of the 'long past'. The dramatic fall in the cost of data storage will mean that by 2021 organizations will have data in accessible, readable form (i.e., not on tape back-ups) going back further in time. This will enable the algorithmic recognition and analysis of deep patterns across the 'long' past, which could prove useful as the analytic period stretches beyond that of economic cycles, helping organizations not to repeat history. Take the example of the last recession: organizations couldn't learn from what happened because the data was effectively gone; this will not be the case by 2021.


8. Intelligent Decision Automation (IDA) will take on more business decisions as machines get smarter. In 2016 IDA handles only simple tactical (i.e., single customer/situation) decisions, but as AI is applied more widely to model and learn, IDA will touch a wider range of choices - not just those that can be expressed as a decision tree. Initiatives like Google open-sourcing its machine learning software (TensorFlow) can only accelerate the use of AI in decision making. However, this has limits...


9. More organizations will be doing decision reviews. According to Qlik-gathered data, only 23% of organizations routinely checked the outcome of business decisions in 2015. Given that the oft-cited main reason for investing in BI is to 'improve decision making', this is a problem. By 2021, more organizations will be modelling more decisions. 'Decision' will therefore become a BI metadata type and so be analysable, letting us start to see whether our organizations are making good decisions, what the inputs and outputs were, and perhaps which teams and people are making optimal choices.


10. Hybridised heuristic/algorithmic management and decision making will be emerging in some organizations. The ideal management team draws together the positive aspects of human experiential learning, expressed through heuristic decision making, with the power of algorithmic computing, giving each a voice at the meeting table. A hybrid of the subjective and the objective - think Captain Kirk and Mr Spock - making decisions informed both by the data and by the intangible. By 2021 this hybrid may just take the form of auto-generated data stories that enlighten people and extend the range of their perspective; beyond that, who knows? A computer-generated avatar representing the data and offering input verbally may not be too far-fetched.


So, remember what I said at the start. These are speculations! In all likelihood at least half will be wrong - either too optimistic or not ambitious enough. Or perhaps something will come along that changes everything - an outside context event (aka a 'black swan').  So if I get any right, remember where you read them first. If not, no matter.


How data has fueled the evolution of enterprise software

| No Comments
| More

This is a guest blogpost by Dave Elkington, CEO, Insidesales.com 

The phrase "predictive analytics" has become a trendy buzzword that seems to show up in every investor pitch in order to elicit premium valuations. Even legacy software giants, like IBM, Microsoft and HP, are investing or reinvesting in the space and jumping on the bandwagon.

Interestingly, predictive analytics is not as new as these companies would lead you to believe. In fact, it's just a rebrand of a branch of computer science that has been around for more than 50 years: machine learning. What's even more important to understand is that machine-learning algorithms are not the driving force behind the big data revolution.

The real hero of this story is data. The explosion of data began with the advent of the mainframe in the 1950s and has grown in significance thanks to today's massive cloud-computing platforms coupled with big data storage systems.

Machine learning and mainframes

To understand the convergence that led to this revolution, it's important to note two major developments that occurred in the 1950s: Machine learning emerged from attempts to make computers act like humans, and companies began to use mainframes to collect and analyse data. 

In the 1950s, Arthur Samuel (at IBM) developed the first machine-learning game system to simulate a person playing checkers. As early as 1959, advanced machine-learning algorithms were being used to solve real-world problems like when an artificial neural network was used to remove echo from phone conversations.

In the mid-'50s, the IBM mainframe was born. The mainframe pulled data out of filing cabinets, creating a central repository. During this phase of enterprise software, the amount of data remained relatively small and access to this data was extremely limited.

Client-server platforms

Fast forward to the 1980s, when client-server platforms emerged. These platforms, developed by organisations such as Sun Microsystems and HP, decentralised business applications and distributed them within each enterprise that used the system. The amount of data exploded because it could now be collected from multiple sources throughout the company.

While client-server platforms improved the aggregation of data, they still faced significant limitations: access to the data remained constrained within a company's networks. People pushed these limits through extensions of internal networks using secure value-added networks. Using general-purpose electronic data interchange (EDI), these networks introduced inter-enterprise communications and data sharing.

EDI was also the beginning of the important parallel process of normalising data sets and classifying data communications between enterprises. The challenge was that each company had to build a custom value-added network with each major customer, partner or vendor.

Cloud computing and SaaS

Hosted solutions acted as a precursor to cloud computing, making servers available at colocation sites and providing open access through the Internet.

In the 2000s, cloud computing brought yet another phase of application delivery and access to the data stored within the applications, as companies like Salesforce.com, Omniture and Workday began to provide software-as-a-service (SaaS). The cloud completely centralised data and offered ubiquitous access.

Cloud computing also made multi-tenancy possible. One way to understand multi-tenancy is to think of it as renting an apartment. Renting an apartment is cheaper than renting a house, which is akin to the client-server model, because you are sharing core infrastructure, like plumbing and electrical wiring, with other tenants.

The advantage of multi-tenancy in enterprise software is that it not only centralises an individual company's data but also consolidates data across multiple companies. This creates the need for massive databases capable of storing data from thousands and even millions of companies - hence the name, big data.

Big data

The big data phase brought MapReduce and document data stores to enterprise cloud-computing vendors. Companies like Cloudera, MongoDB, Couchbase, Hortonworks and MapR commoditised databases that could accommodate billions of records with complex, non-standard relationships.

This new method of storing massive amounts of data was just what machine learning needed. Enterprise software vendors have employed data scientists to figure out what to do with all of the data they are collecting.

This development is now coined the "predictive analytics" phase of enterprise software. It materialised because cloud computing enabled mass consolidation of, and universal access to, enterprise applications, and because big database vendors made it possible to store massive amounts of data in a centralised way.

What's next?

Multi-tenancy not only brings together data across multiple companies, but it also spans domains and industries. This opens exciting new opportunities and will usher in the next phase of enterprise software.

A large number of applications are already consolidating data within specific domains, such as healthcare, dating, travel, consumer goods and anything else you can imagine. However, this data only represents a small slice of life and fails to capture the full picture.

You can't accurately predict who somebody will want to date if all you have is car-buying data, and you can't predict how many pizzas they'll eat this year - and when they'll eat them - if all you have is bowling shoe data, although it would be pretty cool if you could. You need to collect data across domains and put it into the appropriate context.

That's why predictive platforms represent the next generation of enterprise software. Predictive platforms will assemble data from CRM (customer relationship management), ERP (enterprise resource planning), the Internet of Things and other domains and systems to make real-time predictions based on a complete view of the real world.

The futuristic world depicted in Sci-Fi movies isn't as far off as some people think.

Dave Elkington is CEO and founder of InsideSales.com, a cloud-based sales acceleration technology company. 

Machine learning is the new SQL

| No Comments
| More

This is a guest blogpost by Oleg Rogynskyy, VP of Marketing & Growth, H2O.ai

Before the emergence of today's massive data sets, organisations primarily stored their data in relational databases produced by the likes of Oracle, Teradata, IBM, etc. Following its emergence in the latter half of the 1980s, SQL quickly became the de facto standard for working with those databases. While there are differences between various vendor flavors of SQL, the language itself follows the same general pattern, allowing business analysts without a developer background to quickly pick it up and leverage the insights from the data stored in their relational databases. Today, I think machine learning is democratising the big data era of Hadoop and Spark in much the same way that SQL did for relational databases.

The problem that SQL solved for relational databases was accessibility. Before SQL, business analysts without an engineering background could not work with their data directly. Analysts were dependent on database admins, much as developers and business analysts are dependent on data scientists today. This leads to a data "traffic jam" in which developers and business analysts cannot work with their data without direct access to a data scientist. The promise of machine learning is that it allows business analysts and developers to run analyses and discover insights on their own.

SQL allowed lay business analysts to quickly comb through large data sets for answers via queries. However, the answer would have to be an exact match for the query, requiring that both your data and query be organised perfectly. Machine learning can comb through even larger data sets and reduce those to insights without the same need. The principle is the same - both SQL and machine learning reduce datasets into answers, but SQL is more of "I know what I'm looking for and here is how I find it," while machine learning is more about "hey, show me what's interesting in this data and I'll decide what's important." In other words, SQL requires business analysts to know exactly what they're looking for while machine learning does not. With machine learning, an analyst can use all their data to diagnose the common themes in the data, predict what will happen, and (eventually) prescribe the optimal course of action. I actually believe that SQL will become as obsolete as typewriters for business analysts, as machine learning takes its place.
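The contrast above can be sketched with a toy example. The monthly sales figures here are invented, and the "ML-style" pass is deliberately the crudest possible stand-in (a standard-deviation outlier check) for what real machine learning does at scale - the point is only the difference in posture: the query names its condition up front, while the discovery pass surfaces what is unusual without being told what to look for.

```python
import statistics

# Invented monthly sales figures; one month is anomalous.
sales = [100, 102, 98, 101, 99, 180, 103, 97]

# SQL-style: "I know what I'm looking for" - an exact condition named up front,
# like SELECT * FROM sales WHERE amount > 150.
over_150 = [x for x in sales if x > 150]

# ML-style: "show me what's interesting" - flag points far from the norm
# without hand-picking a business threshold (here, > 2 standard deviations).
mean = statistics.mean(sales)
stdev = statistics.stdev(sales)
outliers = [x for x in sales if abs(x - mean) > 2 * stdev]
```

Both passes find the same anomalous month in this tiny example, but only the first required the analyst to already know that 150 was the interesting boundary.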

Today's business analysts and developers are more than capable of building and using applications that sit on top of their data by powering them with machine learning. Importing machine learning algorithms into applications can be a seamless process, but the organisational will has to be there. Too many organisations cling to the antiquated notion that they can't do machine learning, either because it is too compute-intensive or because it requires in-house data science expertise that they can't afford. No one expects the vast majority of organisations to develop an artificial intelligence programme on the scale of Facebook or Google, but they don't need to. Many machine learning platforms are open source and free; all it takes is someone smart and curious enough to begin a pilot test!

Thinking about SaaS risks - data security

| No Comments
| More

This is a guest blog by Larry Augustin, CEO, SugarCRM

The recent cyber-attack on broadband company TalkTalk proved that while not securing your own data can be embarrassing, failing to secure the data of your customers is far more serious.

Headlines about cyber security, database breaches and hacking are becoming commonplace. In the last year PlayStation Network and Microsoft's Xbox Live were hacked and taken offline for long periods of time over Christmas. More recently, British Gas had the email addresses and passwords of 2,200 customers leaked online. Then there were dozens of attacks that targeted high-profile companies and banks in North America. This included Sony having its confidential data released, and telecommunications giant AT&T falling victim to an attack in which more than 68,000 accounts were accessed without authorisation. The latter was fined $25 million for data security and privacy violations.

Even more painful than the costly implications are the remediation and communication efforts with affected customers, and lost business that results when breaches are disclosed.

However, there are ways to effectively protect data from hackers. Deploying your customer relationship management technology through a Software as a Service model means being reinforced by multiple layers of protection and security. It's important to ensure that it's hosted in Tier 1 data centre facilities, no matter where it is in the world. The data centres hosting such applications are therefore protected not just by powerful physical security mechanisms, such as 24/7 secured access with motion sensors, video surveillance and security breach alarms, but also by security and infrastructure components including firewalls, robust encryption and sophisticated user authentication layers.

Data is a critical component of daily business, and it's essential to ensure the privacy and protection of data regardless of where it resides. We make a point of taking a holistic, layered and systematic approach to safeguarding that data, constantly evaluating, evolving and improving the privacy and security measures we have in place. We also offer the option to deploy the technology on premise, as well as in hosted and hybrid configurations, flexing to meet the broadest range of security and regulatory requirements.

Gathering and storing good quality data is now a business critical activity whether that data is being used to highlight customer trends or telling you how valuable a customer is to your business - the benefits are clear. As it grows in importance, IT professionals are now under greater pressure than ever to spare a business the embarrassment of data breaches through ensuring the best IT practices and systems are in place to keep their customer information out of reach.

Big data is sword of truth for disruptor brands

| No Comments
| More

This is a guest blogpost by Bill Wilkins, CIO/CTO, First Utility

Despite the energy industry being data-rich, the quality of its data has always been extremely poor and its systems archaic. Customers have been left in the dark, with little engagement and no real choices.

For challenger companies with change and disruption in their DNA, effective use of data is what sets them apart from conservative incumbents.

Energy makes up one of our biggest household bills, yet the fact that 40% of billpayers have never switched supplier speaks volumes about the magnitude of customer disengagement. What efforts have the incumbent energy providers made to mine their available data to understand consumers and serve them better? Evidently, not very much at all.   

Big data, used smartly, can help energy brands transform the marketplace. It's what drives real change. Better informed, previously disenfranchised customers become more empowered, confident consumers.

The tricky bit is to successfully target, manage and distribute the most valuable data. You can't do everything. Disruptor brands must pinpoint the areas that differentiate their offering and focus all efforts and resources on implementing their competitive edge.

Like most challengers, First Utility is focused on being a fully data-enabled business. We look to the heroes of Silicon Valley for learnings. We see big potential in applying the principles of Spotify, Netflix and Amazon to strategically and creatively engage energy consumers. It's ambitious, but then so are we.

It's ambitious because the energy market is multi-faceted, made even more difficult by the fact that these complexities are unique to the UK. There's no precedent for getting it right. In fact, it's often about mixing the company's disruptor gene with specialist technical skills.

Data is at the heart of our decision-making process to help us optimise our offering. As the CIO, I work with my team to set the conditions we believe will create an organic and incrementally useful approach to gathering, understanding and acting on this ever-increasing pool of information.

But data should not be owned by just one department or the data chief. It must be intrinsically woven into multiple business functions and used by general management to inform day-to-day decisions. It must be approached with a focused and united vision, implemented by individual operational teams that affect the total customer journey.

Challenger companies can take advantage of their agility to disrupt the landscape for the better. They carry less historical operational baggage than the larger incumbent rivals. They can do things differently and make faster decisions. With data clearly the present and future of business, this is the ideal time for disruptors to design and shape new models. From a relatively blank piece of paper we can connect with consumers in a way that has not been seen in the energy sector before.

Challengers have the opportunity to blindside the competition with the power of information. The truth is that even the big incumbents could benefit from challenger thinking. It's unfortunate for the market, and specifically for consumers, that there is currently little hint of a wave of change.

At my company, big data is at the heart of our competitive edge. Our mission is to get consumers more engaged in their energy usage, so we consistently develop innovative technologies that provide the insights and tools they need to take better control. Our My Energy platform applies smart, highly complex customer usage information, fuelled by big data. Customers can review real-time energy use over time, see their predicted future usage based on current behaviours, and even contextualise their spend by comparing it to similar households in their neighbourhood.

We also use big data to maintain sharp business performance. Our churn dashboard tells us which customers are migrating, from which tariffs and to which suppliers, and from this we learn and grow. We are already seeing the cold, hard benefits of putting data at the heart of our business.

Disruptors recognise that knowledge is power. In the right hands and with clear differentiating focus, big data provides the fuel to get ahead.

Can a maths genius save the world?

| No Comments
| More

This is a guest blog post by Laurie Miles, head of analytics, SAS UK & Ireland


If I were to believe the feedback I get, statisticians are among the most difficult people to work with - and, what's more, the only group that should be allowed to work in data analytics. It sounds harsh, but this perception may explain why big data projects continually fail to launch successfully in so many businesses.

What businesses actually need are statisticians who are easy to work with, because conversations based purely on maths and statistics do not solve business problems. Far from it, in fact.

Businesses need to overcome the perception that data science - despite the lexicon - is about feeding data into an engine and analysing the statistics to get answers. Delivering real value requires a logical as well as a creative mind. And the starting point should really be 'what are the business challenges we need answered?' The statistics come later, as just one part of the process of reaching your business solution.

Creative genius saving the world?

Problem solving is a cognitive function that relies heavily on the creative side of our brains. Humans are curious beings. It's our nature to want to solve mysteries and understand the world around us. It's a rewarding experience that creates a strong motivation for people to want to do more.

Business leaders can tap into this behaviour by giving employees more interesting problems to solve. Data science provides the opportunity to satisfy someone's curiosity, whether they are a genius or not.

That problem solving doesn't have to be at an individual level either. After all, data science is about team work. In another blog, I explored the different roles in building a data science team.

Every data analytics project is unique, so every project will have a unique team set-up. Our education system now provides the opportunity to nurture the talent required for the business manager, business analyst, data management expert and statistical modeller roles. Bringing different geniuses together is the key to business value.

Geniuses from across the educational spectrum

Young people now have so much more choice over what they study at school and university. 'Data scientist' is seen as a technical role but it's only a small part of the job: they are also business consultants and creatives. This is why we need to recruit talent from all disciplines, from arts and humanities to STEM subjects.

Businesses need to be open-minded in their approach to data science. Hiring only statisticians is probably the worst thing a business can do.

Changing your approach won't happen overnight, but when building a data science team, first look inside your organisation to assess what skills and interests are already there. Once you have identified your candidates, provide them with training courses and help them carve out a clearly defined learning pathway to develop their role within the business. Then explore the wider circles in universities and other industries to hire new talents that supplement your existing areas of expertise.

For more insight into what makes a great data scientist, check out what we at SAS found when we asked those in the industry.

Bill McDermott: best wishes


At this year's Sapphire, I was part of a group of journalists from outside the US who interviewed Bill McDermott, chief executive officer of SAP.

At the end of the interview he gave each of us a copy of his autobiography to date, signing each in front of us. Winners Dream is the title of the book, co-written with Joanne Gordon, a former Forbes journalist.

I thought at the time it was a gracious thing to do.

I was shocked to learn of Mr McDermott's injury. His own account tells one much about the calibre of the man. I wish him well.

Smart cities and the IoT - not just a load of rubbish


This is a guest post by David Socha, utilities practice leader, Teradata

What really makes a city smart? From my perspective, Smart Parking, Smart Homes, Smart Lighting and the like are really just the next steps on a journey that began by replacing the cry of "gardyloo"* with city plumbing.

In fact many of the things happening in today's Smart Cities could more honestly be labelled as "progress". 

What will really make a city become smart is the integration and analysis of data from these otherwise disparate initiatives and all the others like them.  Once that happens, a new intelligence will enable the city to deliver new services to its citizens - from genuinely integrated public, private and personal transport systems to energy profiles that incorporate our homes, workplaces, vehicles and more. 

But ... how does that work, exactly?  Surely it consists of more than attaching sensors to everything?  Yes, of course it does. 

And to understand how all this integration and analytics will bring Smart City citizens some actual benefits, you will have to let me get technical - just briefly though, I promise. We need to examine three types of data that we're going to encounter in our Smart City. Here they are:

1. Traditional, unexciting, structured data from enterprise systems. Information like weather forecasts from the Meteorological Office; census analyses from Government and, say, public transport performance statistics.

2. Slightly cooler "big data" from all sorts of social media (and other sources too). This can be valuable for sentiment analysis; for personalising services and offers and for all manner of business-to-customer or perhaps city-to-customer relationships.

3. New and exciting Machine-to-Machine (M2M) data. Now we're talking! This is the stuff the Internet of Things (IoT) is made of, isn't it? This is the future! Well, yes and no.

We can lift the lid on the oft-used example of the smart waste bin to guide us through how we journey from sensors to real benefits for citizens.  The first nugget you'll hear in a typical Case of The Smart Waste Bin story is pretty simple.  If a bin has a sensor that knows it's nearly full, it can call and request someone comes to empty it.  Is that "Smart"?  As I said before, yes and no.  Rubbish might be collected more often, but costs will rocket.  Lorries could be going back to the same street to empty smart bins that transmit their "I'm full!" message just a few hours apart.  Not so smart now, is it? 

Of course we can fix this. Sensors close to one another could communicate and check if any other bins close by are nearly full too. Companies like Smartbin offer both sensors and a route optimisation solution for the teams that have to collect the rubbish. So here we are, already integrating M2M data and boring old structured data. Now our citizens will enjoy cleaner streets without having to pay extra for the privilege. This is merely the beginning. Additional analytics on the data we have in this example alone can lead to better planning decisions - for example, on where more or fewer bins are required, or how staff and vehicles can be more efficiently deployed.
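To make the nearby-bin idea concrete, here is a minimal Python sketch. Everything in it - the names, thresholds and straight-line distance test - is a hypothetical illustration, not Smartbin's actual product logic: a bin only triggers a collection run when enough bins within walking distance are also nearly full.

```python
import math

FULL_THRESHOLD = 0.8   # report "nearly full" above 80% capacity
NEARBY_METRES = 200    # how far we look for other full bins
MIN_FULL_BINS = 3      # don't send a lorry out for a single bin

def distance_m(a, b):
    """Straight-line distance between two (x, y) positions in metres."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def should_collect(bin_id, bins):
    """bins maps bin_id -> {'pos': (x, y), 'fill': 0.0-1.0}."""
    me = bins[bin_id]
    if me['fill'] < FULL_THRESHOLD:
        return False
    # Count nearly-full bins within range (including the reporting bin).
    nearby_full = [
        other_id for other_id, other in bins.items()
        if distance_m(me['pos'], other['pos']) <= NEARBY_METRES
        and other['fill'] >= FULL_THRESHOLD
    ]
    return len(nearby_full) >= MIN_FULL_BINS

bins = {
    'A': {'pos': (0, 0),    'fill': 0.90},
    'B': {'pos': (50, 0),   'fill': 0.85},
    'C': {'pos': (100, 0),  'fill': 0.95},
    'D': {'pos': (5000, 0), 'fill': 0.99},  # full, but isolated
}
```

With this data, bin A triggers a collection (three full bins in its street) while the isolated bin D waits, which is exactly the cost-saving behaviour described above.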

So let's mix in more data and see what else we can do.  What if we also added Wi-Fi to the bins, as is happening in New York?  Suddenly, citizens will be connecting directly with a "smart solar-powered, connected technology platform that is literally sitting in the streets of New York".  This new service not only delivers a 'connected city' for its citizens - it also offers a chance to learn more about the people our Smart City is serving.  By applying some sentiment analysis, we can even work out just what they think about the new Smart Bin services we're providing. 
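As an illustration of that sentiment analysis step, here is a deliberately tiny, lexicon-based sketch in Python. A real deployment would use a trained model and a far larger vocabulary; the word lists below are invented purely for the example.

```python
# Toy lexicon-based sentiment scoring: count positive vs negative words.
POSITIVE = {'great', 'love', 'clean', 'useful', 'fast'}
NEGATIVE = {'broken', 'hate', 'smelly', 'slow', 'useless'}

def sentiment_score(text):
    """Return (positive word hits - negative word hits) for a text."""
    words = {w.strip('.,!?').lower() for w in text.split()}
    return len(words & POSITIVE) - len(words & NEGATIVE)

tweets = [
    "Love the free wifi from the new smart bins, really useful!",
    "The bin on my street is broken and smelly again",
]
scores = [sentiment_score(t) for t in tweets]
```

Aggregating such scores over time would give the city a rough read on how the new Smart Bin services are being received.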

We've come a long way from that initial installation of a sensor that occasionally shouts out "I'm nearly full".  And that's the point.  This is just one example of how the Internet of Things will actually deliver benefits to people living in Smart Cities. 

It's not just about sensors.  It's not just about M2M.  Just as important is the integration of many different types of data - the cool stuff and the boring. It's about analysing the data in its entirety to reveal the relationships, dependencies and connections. And it's about taking informed, positive actions based on the new information available. Now that's what I call Smart.

*An Edinburgh phrase, first recorded in 1662.  You can take the boy out of Edinburgh...

Obama supercharges data science


This is a guest blogpost by Mike Weston, CEO of data science consultancy Profusion, in which he discusses supercomputing and its implications for data science.

We are all creating data. A lot of data. The figures involved are mind-blowing. According to Information Service ACI, five exabytes of content were created between what it calls "the birth of the world" and 2003. In 2013, five exabytes of content were created each day. Just so you know, an exabyte is a quintillion bytes. Every minute (on average) we send around 204 million emails, make four million Google searches, and send 277,000 tweets.

With each individual creating and receiving more and more data, computers are in an arms race to keep up. Earlier this month another shot was fired: President Obama issued an executive order designed to ensure that the US leads the field in supercomputers, by building an exascale computer capable of undertaking one quintillion calculations per second. The computer will be used for, among other things, climate science, medicine and aerospace. However, from my perspective, the most exciting proposition is the application of exascale computers to data science.

The first noticeable advantage in having increased computing power is a reduction in the time it will take to carry out data science projects. Reducing the time it takes to receive results will allow for more real-time decision making. This will have a significant impact on industries such as retail, where a shop could automatically alter its pricing strategy instantaneously based on weather data, customer demographics and footfall.
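As a toy illustration of that kind of automatic pricing, the sketch below nudges a base price up or down from footfall and weather signals. The rules and thresholds are entirely hypothetical; a production system would learn them from data rather than hard-code them.

```python
def adjusted_price(base_price, footfall_per_hour, raining):
    """Adjust a base price from simple real-time demand signals."""
    multiplier = 1.0
    if footfall_per_hour > 500:      # busy store: demand is high
        multiplier += 0.05
    elif footfall_per_hour < 100:    # quiet store: tempt shoppers in
        multiplier -= 0.10
    if raining:                      # e.g. umbrellas sell on wet days
        multiplier += 0.03
    return round(base_price * multiplier, 2)

# A busy, rainy afternoon nudges the price up by 8%.
price = adjusted_price(10.00, footfall_per_hour=600, raining=True)
```

The point is not the rules themselves but the latency: with exascale-class computing, far richer versions of this calculation could run continuously across an entire product range.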

Next, the processes involved in data science will become ultra-efficient. There will be decreased processing time and less time spent accumulating and preparing data. This will open up data science to data that previously wasn't accessible - for instance, assisting in the mapping of the human brain and combining that information with data on a participant's emotions and lifestyle to obtain a picture of how the brain is affected by external factors.

The advanced computing power will also lead to more accuracy and the ability to create more detailed and advanced models. This will enable data science to answer more complicated questions with a larger range of structured, unstructured, historical and real-time datasets. Machine learning will become much more powerful. More computing power will allow more interactions to be presented to the machine to create artificial intelligence. Eventually, the majority of computations will become automated, with data scientists managing the AI as opposed to carrying out the day to day processes.

These new algorithms could be applied to everyday activities, such as tracking the real-time weather conditions impacting on aircraft, along with their locations and speed, the identities of all passengers on board and overall customer satisfaction as detailed through individuals' social media accounts. All of this information could be combined into one user-friendly interface for airline staff to monitor and respond to.

There will be additional benefits to product design, especially in the field of aeronautics. Proposed designs could be simulated without the need for wind tunnels and other expensive, not readily available, tools. Potentially one of the most exciting advances will be the development of personalised medicines. Data science will be able to look at an individual's genome and lifestyle, and alter drug properties accordingly to make them more effective.

The analysis of big data has already had revolutionary impacts on the commercial sector and within scientific discovery -- from assisting in relief efforts following natural disasters to tailoring the consumer journey on eBay. In the future we can expect to see more advanced weather forecasts, natural disaster prediction services and more accurate cancer diagnosis. With data science also unlocking key Islamic State military strategies, it's going to play a bigger role within US national security.

In the short term, the biggest impact for consumers will be in relation to the 'Internet of Things'. With more real-time data readily available, the productivity of autonomous vehicles would greatly improve. Imagine a scenario where every vehicle within a city could be mapped onto a central computer, with all those vehicles able to tell each other their locations, speeds and proposed routes. Driving would certainly be better informed and safer than it is currently.

Data science is going to undergo a rapid transformation into a faster, more accurate and more efficient process. The range of tasks that will be undertaken by machines will increase, spurred along by advances in machine learning and faster computer speeds. What we may be able to calculate in a week, in the future will take minutes. The scope of data we will be able to deal with will also increase and a greater variety of data will lead to more insights that can be found from seemingly disparate data sets.

This will lead to an exciting future where we are better informed and, as a result, should be able to make more educated decisions. A master painter is only as good as his brush, and the advent of better computing will create better data scientists who will produce better data insights. More powerful computers will lead to a more empowered society.

3 ways data lakes are transforming analytics


This is a guest blogpost by Suresh Sathyamurthy, senior director, emerging technologies, EMC

Data lakes have arrived, greeted by the tech world with a mix of scepticism and enthusiasm. In the sceptic corner, the data lake is under scrutiny as a "data dump," with all data consolidated in one place. In the enthusiasts' corner, data lakes are heralded as the next big thing for driving unprecedented storage efficiencies in addition to making analytics attainable and usable for every organization.

So who's right?

In a sense, they both are. Data lakes, like any other critical technology deployment, need infrastructure and resources to deliver value. That's nothing new. So a company deploying a data lake without the needed accoutrements is unlikely to realize the promised value.

However, data lakes are changing the face of analytics quickly and irrevocably--enabling organizations who struggle with "data wrangling" to see and analyze all their data in real time. This results in increased agility and more thoughtful decisions regarding customer acquisition and experience -- and ultimately, increased revenues.

Let's talk about those changes and what they mean for the world today, from IT right on down to the consumer.


Breaking data silos

· Data silos have long been the storage standard -- but these are operationally inefficient and limit the ability to cross-correlate data to drive better insights.

· Cost cutting is also a big driver here. In addition to management complexity, silos require multiple licensing, server and other fees, while the data lake can be powered by a single infrastructure in a cost-efficient way.

· As analytics become progressively faster and more sophisticated, organizations need to evolve in the same way in order to explore all possibilities. Data no longer means one thing; with the full picture of all organizational data, interpretation of analytics can open new doors in ways that weren't previously possible.


Bottom line: by breaking down data silos and embracing the data lake, companies can become more efficient, cost-effective, transparent -- and ultimately smarter and more profitable -- by delivering more personalized customer engagements.


Leveraging real-time analytics (Big Data wrangling)

Here's the thing about data collection and analytics: it keeps getting faster and faster. Requirements like credit card fraud alerts and stock ticker analytics need to be met seconds after the action has taken place. But real-time analytics aren't necessary 100% of the time; some data (such as monthly sales data, quarterly financial data or annual employee performance data) can be stored and analyzed only at specified intervals. Organizations need to be able to build the data lake that offers them the most flexibility for analytics.

Here's what's happening today:

· Companies are generating more data than ever before. This presents the unique problem of equipping themselves to analyze it, instead of just store it -- and the data lake coupled with the Hadoop platform provides the automation and transparency needed to add value to the data.

· The Internet of Things is both a data-generating beast and a continuous upsell opportunity -- provided that organizations can provide compelling offers in real time. Indeed, advertisers are on the bleeding edge of leveraging data lakes for consumer insights, and converting those insights into sales.

· Putting "real-time" in context: data lakes can reduce time-to-value for analytics from months or weeks, down to minutes.

Bottom line: Analytics need to move at the speed of data generation to be relevant to the customer and drive results.


The rise of new business models

Data lakes aren't just an in-house tool; they're helping to spawn new business models in the form of Analytics-as-a-Service, which offers self-service analytics by providing access to the data lake.

Analytics-as-a-Service isn't for everyone -- but what are the benefits?

· The cost of analytics plummets due to outsourced infrastructure and automation. This means that companies can try things out and adjust on the fly with regard to customer acquisition and experience, without taking a big hit to the wallet.

· Service providers who store, manage and secure data as part of Analytics-as-a-Service are a helpful avenue for companies looking to outsource.

· Knowledge workers provide different value -- with the manual piece removed or significantly reduced, they can act more strategically on behalf of the business, based on analytics results.

· Analytics-as-a-Service is an effective path to early adoption, and to getting ahead of the competition in industries such as retail, utilities and sports clubs.

Bottom line: companies don't have to DIY a data lake in order to begin deriving value.

Overall, it's still early days for data lakes, but global adoption is growing. For companies still operating with data silos, perhaps it's time to test the waters of real-time analytics.

App-based approach key to achieving efficient self-service Business Intelligence (BI)


This is a guest blog by Sylvain Pavlowski, senior vice president of European Sales at Information Builders

As workers and business units clamour for more control over data analysis, to gain insights at their fingertips, the use of self-service business intelligence (BI) tools is rising to meet demand. But this is not without its challenges, for IT teams in particular.

A gap between business users and IT has ensued because historically IT departments have created a centralised BI model and taken ownership over BI. They want to maintain control over aspects like performance measures and data definitions, but workers are striving to gain access to the data they want, when they want it, and don't want IT to 'hand hold' them. This is creating a redistribution of self-service BI and could inhibit business success if IT departments and business users don't find a happy medium.

Gartner argues that, "Self-service business intelligence and analytics requires a centralised team working in collaboration with a finite number of decentralised teams. IT leaders should create a two-tier organisational model where the business intelligence competency centre collaborates with decentralised teams."

I agree that managing all types of data in one place and one structure is difficult at the best of times. It is all the more difficult these days, with a move towards individualism and personalisation in which users want to help themselves, in real time, to the data they need for their job roles. To manage the push and pull between IT and users, businesses need to look at ways to redefine self-service BI - and it's not just about the IT organisational model. Any approach needs to address more than IT departments' needs.

Implementing an app-based approach to self-service BI can help appease everyone concerned. IT departments can build apps for self-service BI to serve every individual, irrespective of back-end systems and data formats. "Info Apps", for example, is a new term used to describe interactive, purpose-built BI applications designed to make data more readily accessible to business users who simply don't have the skills or the technical know-how to use complex reporting and analysis tools to satisfy their own day-to-day needs. Some studies have even shown that such individuals can make up more than 75% of an organisation's BI user base. Using an app-based approach is therefore an extremely effective way to give business professionals the exact information they need, through an app paradigm, without requiring any analytical sophistication.

Next-generation BI portals play an important role here too. They can provide enterprises with a way to seamlessly deliver self-service BI apps to business users. By organising and presenting BI apps to users in a way that is simple and intuitive (similar to the Apple App Store), companies can empower workers with faster, easier, more interactive ways to get information.

These next-generation portals also offer high levels of customisation and personalisation so business users have full control over their BI content at all times. They will be empowered with the ability to determine what components they view, how they're arranged, how they're distributed across multiple dashboard pages, and how they interact with them. By offering unparalleled ease and convenience - giving them what they need, when and how they want it - organisations can encourage business users to take advantage of self-service BI in new and exciting ways, whilst having the peace of mind that IT departments are ensuring data quality and integrity in the background. This will all drive higher levels of BI pervasiveness, which in turn, will boost productivity, optimise business performance, and maximise return on investments. 

Understanding data: the difference between leaders and followers


A guest blogpost by Emil Eifrem, CEO of Neo Technology.

Data is vital to running an efficient enterprise. We can all agree on that.

Of course, from there, thoughts and opinions differ widely, and it's no surprise why.

Too much of the data conversation is focused on acquiring and storing information. But the real value of data is derived from collecting customer insights, informing strategic decisions and ultimately taking action in a way that keeps your organisation competitive.

Leaders who conduct this level of analysis distinguish themselves from the rest. Data followers merely collect; data leaders connect.

Yet, with so many ways to analyze data for actionable insights, the challenge is to find the best approach.

The most traditional form of analysis is the simplest: batch analysis where raw data is examined for patterns and trends. The results of batch analysis, however, depend heavily on the ingenuity of the user in asking the right questions and spotting the most useful developments.

A more sophisticated approach is relationship analysis. This approach derives insights not from the data points themselves but from a knowledge and understanding of the data's entire structure and its relationships. Relationship analysis is less dependent on an individual user and also doesn't analyse data in a silo.

Real-World Success

Take a look at the biggest and best leading companies and you'll see a strong investment not only in data analysis but also analysis of that data's structure and inherent relationships.

For example, Google's PageRank algorithm evaluates the density of links to a given webpage to determine the ranking of search results. Or consider Facebook and LinkedIn: each site evaluates an individual's network to make highly relevant recommendations about other people, companies and jobs.
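To make the link-analysis idea concrete, here is a minimal power-iteration sketch of the simplified, textbook form of PageRank (not Google's production algorithm): each page's score is built up from the scores of the pages that link to it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # Each page shares its current score among the pages it links to.
            share = rank[page] / len(outgoing) if outgoing else 0
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# A toy web of three pages; names are invented for the example.
links = {
    'home': ['about', 'blog'],
    'about': ['home'],
    'blog': ['home', 'about'],
}
ranks = pagerank(links)
```

Here 'home' ends up with the highest score because every other page links to it - scoring by relationships rather than by the content of any single data point, which is exactly the contrast with batch analysis drawn above.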

Together, these three organisations have developed real insight into their customers, markets and future challenges. In turn, they have become leaders in the Internet search, social media and recruitment sectors, respectively.

Every Data Point Matters

When it comes to effective data analysis, your enterprise must be gleaning insight from all of the data at its disposal, not just a portion of it.

With so much data to sift through, it's no surprise that most organisations fall into a similar trap, focusing their data analysis efforts on a small subset of their data instead of looking at the larger whole.

For instance, it's much easier for enterprises to only examine transactional data (the information customers supply when they purchase a product or service). However, this subset of data can only tell you so much.

The vast store of data a typical enterprise doesn't use is known as "dark data", defined by Gartner as "information assets that organisations collect, process and store during regular business activities, but generally fail to use for other purposes". Mining your dark data adds wider context to insights derived from transactional data.

Of course, data only tells part of the story with surface-level analysis. Enterprises need curious and inquiring minds to ask the right questions of their data. That's why so many leading organisations recruit data scientists solely to make sense of their data and then feed these insights back to strategic decision makers.

Ultimately, the real value of data lies not only in bringing your enterprise closer to the customer but also to prospective customers. And building a better bottom line is something we can all agree on.
