What do businesses really look for in open data?


This is a guest blog by Harvey Lewis, Deloitte

 

"The value of an idea lies in the using of it." Thomas A. Edison, American Inventor.

 

In 2015, the UK's primary open data portal, www.data.gov.uk, will be six years old. The portal hosts approximately 20,000 official data sets from central government departments and their agencies, local authorities and other public sector bodies across the country. Just over half of these data sets are available as open data under the Open Government Licence (OGL). Data.gov.uk forms part of an international network of over three hundred open data efforts that have seen not just thousands but millions of data sets worldwide become freely available for personal or commercial use. [See http://datacatalogs.org and www.quandl.com].

Reading the latest studies that highlight the global economic potential of open data, such as that sponsored by the Omidyar Network, you get a sense that a critical mass has finally been achieved and the use of open data is set for explosive growth.

These data sets include the traditional 'workhorses', like census data, published by the Office for National Statistics, which provides essential demographic information to policy makers, planners and businesses.  There are many examples of more obscure data sets, such as that covering the exposure of burrowing mammals to Radon Rn-222 in Northwest England, published by the Centre for Ecology and Hydrology.   

Although I'm not ruling out the possibility there may yet be a business in treating rabbits affected by radiation poisoning, simply publishing open data does not guarantee that a business will use it. This is particularly true in large organisations that struggle to maximise use of their own data, let alone be aware of the Government's broader open data agenda. The Government's efforts to stimulate greater business use of open data can actually be damaged by a well-intentioned but poorly targeted approach to opening up public sector information - an approach that may also leave more difficult-to-publish but still commercially and economically important data sets closed.

But is business use predicated on whether these data sets are open or not? And what is the impact on economic success?

Businesses would obviously prefer external data to be published under a genuinely open licence, such as the OGL.  The data is free for commercial use with no restrictions other than the requirement to share alike or to attribute the data to the publisher. However, if businesses are building new products or services, or relying on the data to inform their strategy, a number of characteristics other than just openness become critical in determining success:

  • Provenance - what is the source of the data and how was it collected? Is it authoritative?
  • Completeness and accuracy - are the examples and features of the data present and correct, and, if not, is the quality understood and documented?
  • Consistency - is the data published in a consistent, easy-to-access format, and are any changes documented?
  • Timeliness - is the data available when it is needed, for the time periods needed?
  • Richness - does the data contain a level of detail sufficient to answer our questions?
  • Guarantees of availability - will the data continue to be made available in the future?

If these characteristics cannot be guaranteed in open data or are unavailable except under a commercial licence then many businesses would prefer to pay to get them. While some public sector bodies - particularly the Trading Funds - have, over the years, established strong connections with business users of their data and understand their needs implicitly, the Open Data Institute is the first to cement these characteristics into a formal certification scheme for publishers of open data.

A campaign is needed to get publishers to adopt these certificates and to recognise that, economically at least, they are as important as Sir Tim Berners-Lee's five-star scale for linked open data.  For example, although spending data may achieve a three- or even a four-star rating in the UK, not all central government departments publish in a timely manner, in a consistent format or at the same level of richness, and some local authority spending data is missing completely. These kinds of deficiencies, which are shared by many other open data sets, are inhibiting innovation and business take-up, yet are not necessarily penalised by the current set of performance indicators used to measure success.  

It's time for open data to step up. If it is to be taken seriously by businesses then the same standards they expect to see in commercially licensed data need to be exhibited in open data - and especially in the data sets that form part of the 'core reference layer' used to connect different data sets together.

Publishing is just the first and, arguably, the easiest step in the process. The public sector's challenge is to engage with businesses to improve awareness of open data, to understand business needs and harness every company's constructive comments to improve the data iteratively. We may have proven that sunlight is the best disinfectant for public sector information, but understanding and working with business users of open data is the best way of producing a pure and usable source in the first place. 

 

Harvey Lewis is the research director for Deloitte Analytics and a member of the Public Sector Transparency Board.

Data quality everywhere


This is a guest blog by Jean Michel Franco, Talend

Data quality follows the same principles as other well-defined, quality-related processes. It is all about creating an improvement cycle to define and detect, measure, analyse, improve and control.

This should be an ongoing effort - not just a one-off. Think about big ERP, CRM or IT consolidation projects where data quality is a top priority during the roll out, and then attention fades away once the project is delivered.

A car manufacturer, for example, makes many quality checks across its manufacturing and supply chain and needs to identify the problems and root causes in the processes as early as possible. It is costly to recall a vehicle at the end of the chain, once the product has been shipped - as Toyota experienced recently when it recalled six million vehicles at an estimated cost of $600 million.

Quality should be a moving picture too. While working through the quality cycle, there is the opportunity to move upstream in the process. Take the example of General Electric, known for years as best-in-class for putting quality methodologies such as Six Sigma at the heart of its business strategy. Now it is pioneering the use of big data for the maintenance process in manufacturing. Through this initiative, it has moved beyond detecting quality defects as they happen. It is now able to predict them and do the maintenance needed in order to avoid them.

What has been experienced in the physical world of manufacturing applies in the digital world of information management as well. This means positioning data quality controls and corrections everywhere in the information supply chain. And I see six usage scenarios for this.

Six data quality scenarios

The first one is applying quality when data needs to be repurposed. This scenario is not new; it was the first principle of data quality in IT systems. Most companies adopted it in the context of their business intelligence initiatives. It consolidates data from multiple sources, typically operational systems, and gets it ready for analysis. To support this scenario, data quality tools can be provided as stand-alone tools with their own data marts or, even better, tightly bundled with data integration tools.

A similar usage scenario, but "on steroids", happens in the context of big data. Here, the role of data quality is to add a fourth V, for Veracity, to the well-known three Vs that define big data: Volume, Variety and Velocity. Managing extreme Volumes mandates new approaches to processing data quality; the controls have to move to where the data is, rather than the other way round. Technically speaking, this means that data quality should run natively on big data environments such as Hadoop, and leverage their native distributed processing capabilities, rather than operate on top as a separate processing engine. Variety is also an important consideration. Data may come in different forms such as files, logs, databases, documents, or data interchange formats such as XML or JSON messages. Data quality then needs to turn the "oddly" structured data often seen in big data environments into something more structured that can be connected to the traditional enterprise business objects, like customers, products, employees and organisations. Data quality solutions should therefore provide strong capabilities in terms of profiling, parsing, standardisation and entity resolution. These capabilities can be designed up front by IT professionals and applied before the data is stored - the traditional way to deal with data quality. Or, data preparation can be delivered on an ad-hoc basis at run time by data scientists or business users, which is sometimes referred to as data wrangling or data blending.
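To make the variety point concrete, here is a minimal, hypothetical sketch in plain Python (not any particular vendor's tooling) of what profiling and standardisation can look like: a few semi-structured JSON records with inconsistent field names are mapped onto one schema, and a simple completeness profile is produced. The field names and rules are invented for illustration.

```python
import json
from collections import Counter

# Hypothetical raw records as they might land in a big data environment:
# "oddly" structured, with missing and inconsistently named fields.
raw_lines = [
    '{"cust_id": "C001", "email": "ANNA@EXAMPLE.COM", "amount": "19.99"}',
    '{"customer": "C002", "email": null, "amount": 5}',
    '{"cust_id": "C003", "amount": "not-a-number"}',
]

def to_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None  # flag for the data steward rather than guessing

def standardise(record: dict) -> dict:
    """Map naming variants onto one schema and normalise simple values."""
    return {
        "customer_id": record.get("cust_id") or record.get("customer"),
        "email": (record.get("email") or "").strip().lower() or None,
        "amount": to_float(record.get("amount")),
    }

rows = [standardise(json.loads(line)) for line in raw_lines]

# Minimal profiling: how complete is each field after standardisation?
completeness = Counter()
for row in rows:
    for field, value in row.items():
        completeness[field] += value is not None

for field in ("customer_id", "email", "amount"):
    print(f"{field}: {completeness[field]}/{len(rows)} populated")
```

In a real big data environment the same logic would run where the data lives, for example as a distributed job over Hadoop, rather than as a single-machine script.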

The third usage scenario lies in the ability to create data quality services, which allow data quality controls to be applied on demand. An example could be a website with a web form to capture customer contact information. Instead of letting a web visitor type in any data they want, a data quality service could apply checks in real time - verifying information such as the email address, postal address, company name, phone number and so on. Even better, it can automatically identify the customer without requiring them to explicitly log on or provide contact information, as social networks and best-in-class websites or mobile applications such as Amazon.com already do.

Going back to the automotive example, this scenario provides a way to cut the costs of data quality, because controls can be applied at the earliest steps of the information chain, even before erroneous data enters the system. Marketing managers may be the best people to understand the value of this: they struggle with the poor quality of the contact data they collect through the internet. Once it has entered the marketing database, poor-quality data becomes very costly and badly impacts key activities such as segmentation, targeting and calculating customer value. Of course, the data can be cleansed at later stages, but this takes significant effort and the related cost is much higher.
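As an illustration of such a data quality service - a hedged sketch only, with invented validation rules rather than any product's actual checks - the following function could sit behind a web form and accept or reject a contact record in real time:

```python
import re

# Illustrative patterns only; production rules would be far richer
# (reference data, address verification services, and so on).
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\+?[\d\s()-]{7,20}$")

def validate_contact(form: dict) -> dict:
    """Return a verdict the web form can act on before the data is stored."""
    errors = {}
    if not EMAIL_RE.match(form.get("email", "")):
        errors["email"] = "does not look like a valid email address"
    if not PHONE_RE.match(form.get("phone", "")):
        errors["phone"] = "does not look like a valid phone number"
    if not form.get("company", "").strip():
        errors["company"] = "company name is missing"
    return {"accepted": not errors, "errors": errors}

print(validate_contact(
    {"email": "jane.doe@example.com", "phone": "+44 20 7946 0000", "company": "Acme Ltd"}
))
print(validate_contact({"email": "not-an-email", "phone": "123", "company": ""}))
```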

Then there is quality for data in motion. This applies to data that flows from one application to another; for example, an order that goes from sales to finance and then to logistics. As explained in the third usage scenario, it is best practice for each system to implement gatekeepers at the point of entry, in order to reject data that doesn't match its data quality standards. Data quality then needs to be applied in real time, under the control of an Enterprise Service Bus. This fourth scenario can happen inside the enterprise, behind its firewall. Alternatively, data quality may also run in the cloud, and this is the fifth scenario.
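The gatekeeper idea can be sketched in a few lines: a check sits between two applications and either passes a message downstream or diverts it to a rejection queue for remediation. The queue names and required fields below are assumptions made for the example, not part of any specific ESB product.

```python
from queue import Queue

downstream: Queue = Queue()   # e.g. the finance application
rejected: Queue = Queue()     # dead-letter queue for the data stewards

REQUIRED_FIELDS = ("order_id", "customer_id", "amount")

def gatekeeper(message: dict) -> None:
    """Reject orders that do not meet the receiving system's quality standard."""
    missing = [f for f in REQUIRED_FIELDS if message.get(f) in (None, "")]
    if missing or not isinstance(message.get("amount"), (int, float)):
        rejected.put({"message": message, "reason": f"missing/invalid: {missing or ['amount']}"})
    else:
        downstream.put(message)

gatekeeper({"order_id": "O-1", "customer_id": "C001", "amount": 250.0})
gatekeeper({"order_id": "O-2", "customer_id": "", "amount": "n/a"})
print(downstream.qsize(), "passed,", rejected.qsize(), "rejected")
```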

The last scenario is data quality for Master Data Management (MDM). In this context, data is standardised into a golden record, with the MDM system acting as a single point of control. Applications and business users share a common view of the data related to entities such as customers, employees, products, the chart of accounts and so on. Data quality then needs to be fully embedded in the master data environment and to provide deep capabilities in terms of matching and entity resolution.
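Matching and entity resolution can be illustrated with a deliberately naive sketch: candidate customer records are normalised, compared with a string-similarity score, and collapsed into golden records when they look close enough. Real MDM matching is far richer; the fields, normalisation rules and threshold below are assumptions made for the example.

```python
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Ltd",     "city": "London"},
    {"id": 2, "name": "ACME Limited", "city": "london"},
    {"id": 3, "name": "Widget Corp",  "city": "Leeds"},
]

def normalise(rec: dict) -> str:
    """Lower-case and expand/contract common variants before comparing."""
    return f"{rec['name']} {rec['city']}".lower().replace("limited", "ltd")

def similarity(a: dict, b: dict) -> float:
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

MATCH_THRESHOLD = 0.85  # assumed; real matching rules are tuned per domain

golden_records = []
for rec in records:
    for golden in golden_records:
        if similarity(rec, golden) >= MATCH_THRESHOLD:
            break  # considered the same entity; keep the existing golden record
    else:
        golden_records.append(rec)

print(f"{len(records)} source records -> {len(golden_records)} golden records")
```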

Designing data quality solutions so that they can run across all these scenarios is a driver for my company. Because our unified platform generates code that can run everywhere, our data quality processing can run in any context, which we believe is a key differentiator. Data quality is delivered as a core component of all our platforms: it can be embedded into a data integration process, deployed natively in Hadoop as a MapReduce job, or exposed as a data quality service to any application that needs to consume it in real time.

Even more importantly, data quality controls can move upstream in the information chain over time. Think about customer data that is initially quality-proofed in the context of a data warehouse through data integration capabilities. Later, through MDM, this unified customer data can be shared across applications. In this context, data stewards can learn more about the data and be alerted when records are erroneous. This will help them to identify the root cause of bad data quality - for example, a web form that brings junk emails into the customer database. Data quality services can then come to the rescue to prevent erroneous inputs on the web form, and to reconcile the entered data with the MDM through real-time matching. Finally, big data could provide an innovative approach to identity resolution, so that the customer can be automatically recognised by a cookie after they opt in, making the web form redundant.

Such a process doesn't happen overnight. Continuous improvement is the target.

The rise of the Chief Data Officer


This is a guest blog by Karthik Krishnamurthy, Global Business Leader for Enterprise Information Management, Cognizant

 

While there is a huge amount written about data scientists, much less has been said about the role of the Chief Data Officer (CDO).  However, the value of this individual to any business must not be underestimated. In fact, Wired describes the emergence of the Chief Data Officer as "a transformational change that elevates the importance of data to the top of the organisation".

In the last couple of years, businesses have recognized the role of data, and many have identified data as part of their core business strategy. Businesses are also acknowledging that data in the business setting is separate from the systems running it. There is now an understanding that data is hugely valuable; if harnessed and analysed properly, it can make businesses run better by realising cost efficiencies, and run differently by bringing innovative products and services to market. Insight from data gives a better understanding of customer preferences, helping organisations develop new commercial models, deliver tangible business value and remain competitive. This evolution has created demand for new business roles, the most prominent of which is the CDO.

The CDO in financial services

The role of the CDO first emerged as a valid and valuable role in the financial services industry to deal with the extreme pressure that arose from the financial crisis and rapidly evolving regulations. While a large part of the CDO's immediate focus was around helping banks to manage and orchestrate their risk response, the focus then shifted to identifying data-driven revenue opportunities through micro-personalization and marketing triaging. As a result, the CDO's focus is now on building integrated data ecosystems that bring in social cluster data to identify unusual patterns in transactional behaviour, flagging them to prevent loss and fraud. Interestingly, this is not something that is traditionally part of financial services per se, but is increasingly central to financial businesses.

The CDO plays a pivotal role in helping financial companies stay ahead by managing risk and remaining compliant more efficiently.

The CDO in retail

Retail, which has witnessed a huge change in the way its global value chains work, is another industry where CDOs are bringing real business value. Through harnessing customer data, retailers can offer targeted products and services and improve customer satisfaction levels significantly. What data analysis has revealed for retailers is that shoppers have fewer issues with the cost of products but are more concerned with the overall retail experience. Sentiment analysis can detect subtle nuances in the relationship between customer and retailer. Focusing on the tonality of a customer's voice, both face-to-face and when liaising with them over other touchpoints such as the phone, social media, fora, etc., can help retailers detect the true feelings of their customers.

Other industries are rapidly catching up: many new technology companies are driven by their ability to collect vast amounts of data and their desire to monetize that data or utilize it in product design and services delivered to customers. Sectors such as telecommunications, energy & utilities, pharmaceuticals, and automotive manufacturing have all identified the value of the data and are creating business leaders responsible for data.

Data management now sits at the C-suite level, emphasising the value the role of the CDO brings to organisations.

CDO traits

Here are Cognizant's insights into the traits that make up the ideal CDO:

  • Has a deep knowledge of data and the ability to identify it as a corporate asset
  • Has strong business acumen and the ability to identify business opportunities through new information products
  • Provides vision, strategy and insight for all data-related initiatives within the organization
  • Takes ownership of and accountability for data and its governance within the organization
  • Has a passion for, and an interest in, technology

Preparation needs to start now for imminent European data protection changes


This is a guest blog by Mike Davis, author of a report published by AIIM.

 

The forthcoming European General Data Protection Regulation (GDPR) changes signal a major opportunity for cloud providers to deliver EU-wide services under a single operations model.

'Making sense of European Data Protection Regulations as they relate to the storage and management of content in the Cloud' is an AIIM report that details the changes the IT industry will need to make in response to imminent pan-European data protection changes.

These are changes that will affect anyone interested in hosting content in the cloud, be they service provider or end user.

The study examines the forthcoming GDPR, which is set to inaugurate major change in how customer data regarding EU citizens is stored and how organisations must respond if a data breach occurs.

The change, effectively the creation of a single European data law, will mean organisations will incur fines of up to €100 million if found guilty of a 'negligent breach' of privacy or loss of data.

That is a serious threat. However, GDPR also presents a number of opportunities and could clarify a lot of issues, as well as offer prospects for long-term planning by cloud specialists.

Aim and scope

The purpose of the GDPR is to provide a single law for data protection to cover the whole of the EU, instead of the present Directive that has ended up being implemented differently in each member state.

The GDPR will also see the establishment of a European Data Protection Board to oversee the administration of the Regulation, a move Brussels is confident will make it easier for European and non-European companies to comply with data protection requirements.

The GDPR also covers organisations operating in Europe irrespective of where data is physically stored. The new regulation is a major opportunity for cloud providers to deliver EU-wide services under a single operations model; meanwhile it also means US based cloud firms need to demonstrate compliance with Europe's new privacy operating model.

A broader definition of 'personal' data

In addition to a common approach to privacy, the GDPR covers privacy for cloud computing and social media, extending the definition of personal data to include email address(es), IP address of computer(s) and posts on social media sites.

That extension has implications for cloud-delivered services that both users and cloud firms need to be aware of.

A GDPR-compliant plan of attack

Organisations need to set a GDPR compliant strategy in whichever part of Europe they operate in before the end of the transition period (currently 2017; track to see if this changes).

An important part of that work will be to establish GDPR-supportive procedures and start the process of gaining explicit consent for the collection and processing of customer data ready for the new regime.

If you're a cloud provider, we recommend drafting a GDPR-compliant strategy, educating your staff on the implications of the changes and amending your contracts and provisioning to be fully compliant.

To sum up: if handled correctly, GDPR will help organisations make more informed decisions about cloud versus on-premise storage; while for the cloud services market, there may be opportunity to deliver truly pan-European services that customers can have assurance are privacy-safe.

 

The author is a Principal Analyst at msmd advisors and wrote the new AIIM report on EU data issues, produced in collaboration with the London law firm Bird and Bird.

Opportunities and challenges for data processing in the Internet of Things


This is a guest blog by Michael Hausenblas, Chief Data Engineer at MapR Technologies.

According to Gartner's Hype Cycle, the Internet of Things (IoT) is supposed to peak in 2014. This sounds like a good time to look into opportunities and challenges for data processing in the context of IoT.

So, what is IoT in a nutshell? It is the concept of a ubiquitous network of devices that facilitates communication between the devices themselves, as well as between devices and human end users. Use cases can be grouped by the scope of an application: Personal IoT focuses on a single person (such as the quantified self); Group IoT covers a small group of people (for example, the smart home); Community IoT usually operates in the context of public infrastructure, such as smart cities; and Industrial IoT, one of the most mature areas, deals with applications either within an organization (the smart factory) or between organizations (such as a retailer's supply chain).

It is fair to say that the data IoT devices generate lends itself to the 'Big Data approach', that is, using scale-out techniques on commodity hardware in a schema-on-read fashion, along with community-defined interfaces, such as Hadoop's HDFS or the Spark API. Now, why is that so? Well, in order to develop a full-blown IoT application you need to be able to capture and store all the incoming sensor data to build up the historical references (volume aspect of Big Data). Then, there are dozens of data formats in use in the IoT world and none of the sensor data is relational per se (variety aspect of Big Data). Last but not least, many devices generate data at a high rate and usually we cope with data streams in an IoT context (the velocity aspect of Big Data).

Architectural considerations & requirements for an IoT data processing platform

Before we go into architectural considerations, let's have a look at common requirements for an IoT data processing platform:

  • Native raw data support. Both in terms of data ingestion and processing, the platform should be able to deal natively with IoT data.
  • Support for a variety of workload types. IoT applications usually require that the platform supports stream processing from the get-go and can handle low-latency queries against semi-structured data items, at scale.
  • Business continuity. Commercial IoT applications usually come with SLAs in terms of uptime, latency and disaster recovery metrics (RTO/RPO). Hence, the platform should be able to guarantee those SLAs innately. This is especially critical for IoT applications in domains such as healthcare, where people's lives are at stake.
  • Security and privacy. The platform must ensure secure operation, which is currently considered challenging to achieve end to end. Last but not least, the privacy of users must be warranted by the platform, from data provenance support through data encryption to masking.

Now, we come back to the architectural considerations. While there are no widely accepted reference architectures yet, a number of proposals exist. All of them have one thing in common, though, which can be summarised in the term polyglot processing: the concept of combining multiple processing modes (from batch through streaming to low-latency queries) within one platform. Two of the more popular and well-understood example architectures in this context are Nathan Marz's Lambda Architecture and Jay Kreps' Kappa Architecture.
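To give a flavour of what polyglot processing means in practice, here is a toy, framework-free sketch of a Lambda-style flow: a batch layer that recomputes complete views from the raw history, a speed layer that keeps incremental counts for events arriving between batch runs, and a query that merges the two. It is an illustration of the pattern rather than a reference implementation, and the sensor events are made up.

```python
from collections import Counter, defaultdict

# Immutable master data set: every sensor reading captured so far.
master_data = [
    {"device": "thermostat-1", "reading": 21.5},
    {"device": "thermostat-1", "reading": 22.0},
    {"device": "meter-7",      "reading": 301.0},
]

def batch_layer(events):
    """Recompute complete batch views from the full history (slow, thorough)."""
    return dict(Counter(e["device"] for e in events))

# Speed layer: incremental state for events that arrived after the last batch run.
realtime_counts = defaultdict(int)

def speed_layer(event):
    """Update low-latency views as each new event streams in."""
    realtime_counts[event["device"]] += 1

def query(device, batch_view):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view.get(device, 0) + realtime_counts.get(device, 0)

batch_view = batch_layer(master_data)                      # runs periodically
speed_layer({"device": "thermostat-1", "reading": 22.3})   # arrives between batch runs
print(query("thermostat-1", batch_view))                   # -> 3 (2 from batch + 1 real-time)
```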

 

With this we conclude our little excursus into data processing challenges and opportunities in the context of the Internet of Things and we're looking forward to a lively discussion concerning the requirements and potential reference architectures.

 

About the author

Michael Hausenblas is the Chief Data Engineer for MapR. His background is in large-scale data integration, the Internet of Things, and web applications.

Twitter: @mhausenblas

Start small with big data


This is a guest blog by Allen Bonde, VP of product marketing and innovation, Actuate. In it he explains his view that we need to think 'small' when it comes to 'Big' Data.

 

Big Data is big business. But it may also be heading for a stumble, even with all its hype. That should alert us to the reality that Big Data sometimes presents big problems. After all, Big Data is only useful if it offers the business something marketing, finance or sales (i.e. non-data-scientists) can apply to business objectives.

 

So what is 'Small' Data? At its simplest, it's the alerts or answers gleaned from Big Data that are most meaningful to the broadest set of users. Furthermore, as I've defined it before, Small Data involves packaging insights visually (say, via dashboards or an infographic) so they are accessible, understandable, and actionable for everyday tasks.

 

It's an approach that takes its inspiration from consumer apps and the observation that the best digital experiences are inherently simple, smart, responsive and social.

 

In practical terms, that means we need to pay closer attention to the content we have already captured, by exploring its applicability for supporting specific business scenarios, and delivering the resulting insights on every relevant channel, in a friendly, sharable format.

 

To understand why the Small Data movement is gaining traction, let's consider what factors are at play:

 

  • Big Data is tricky. Doing it right - i.e. with a big payoff - may take longer than the business can wait. What's more, the majority of marketers don't need the masses of Big Data to run their campaigns; they need differentiating insights that allow them to personalise offers to customers.
  • Small Data is 'on trend.' Small Data thinking helps to democratise the way apps are constructed. This is not mainstream yet, but it is rapidly becoming the preferred route of travel for IT today.
  • Small Data is all around us. Think about the vast amount of personalised data feeds available already, especially as more devices get wired up and more consumers shop and share within social networks like Facebook. And think of apps like Kayak's "When to Book" travel tool that tells you whether that too-high fare might fall within the next week. To create a complete picture of customers, we need to combine insights from these social channels along with Web analytics and transactional data. These rich datasets will be the centre of a new customer experience based not only on Big, but also on Small Data - the data that's relevant to the task of personalised marketing.
  • Small Data is buzzing. Software vendors such as Adobe, IBM, SAP - and Actuate - now promote the relevance of the movement in industry forums and blogs.
  • Small Data focuses on the user. Big Data is still the domain of techies. Yet, to drive adoption, we need a platform that enables experiences that are easy to use, smart, scalable, and visually appealing for non-techies.

If you focus back on the customer and 'think small,' you can sidestep many of the Big Data trip wires - and deliver the useful, interactive data-driven apps your organisation needs today.

 

You can follow Allen Bonde on Twitter at @abonde.

 

Andy Jennings, chief analytics officer, FICO on mainstreaming of big data


 2014 has seen a steady move to the mainstream of the characteristic themes of the big data movement: ways and means of dealing with unstructured data; how to operationalize big data analytics; how to build up a data science capability.

One of the stimulating conversations I've had about the big data phenomenon this year was with Andrew Jennings, FICO's chief analytics officer and head of FICO Labs.

He has held a number of leadership positions at the credit scoring outfit since joining the company in 1994, including a stint as director of FICO's European operations. Andrew was head of unsecured credit risk for Abbey National and has an academic hinterland as a lecturer in economics and econometrics at the University of Nottingham.

Did he think the big data cycle was winding down? He did, but it has not become less relevant, he said, "but we are over the more outrageous things said in the last year or so. The Wild West of Pig, Hive and Hadoop has become more tamed; it's being made easier for people to use [big data technologies] in order to do analytics with data, and make decisions".

Dr Jennings was, in part, referring to his own company's analytic cloud service, which comprises open source tools, their own products, and other third party elements. But also efforts being made by the big suppliers, such as IBM, SAP and Oracle.

"Data driven organisations do need more tools beyond the spread sheet, so there is more tendency for big data technologies to be integrated".

Jennings sees the predictive analytics technologies developed over many years for financial services companies, by the likes of FICO or SAS, as having a broader applicability, and cites network security as an adjacent area.

"And in retail, the credit risk models developed over 25 years can be extended to the best action to take for an individual consumer", depending on their price sensitivity.

FICO is experienced in recruiting and developing people it is now fashionable to call 'data scientists'. Does he think such people should get more focus than the development of data savvy managers?

"Data scientists will get frustrated if management around them has no understanding of what they are doing. So, they need data savvy managers, too".

On data scientists, as such, he said "by 'data scientist' people mean something more than a statistician or a computer scientist who knows about database technologies, but someone with a broader set of skills: who can manipulate data, write code, hack something together, do the mathematical analysis but also understand the business context in which that is important.

"In my experience the really hard role to fill is that [data analytics] person who can also understand what the business goals are and can talk to the business. Can help us to sell, in FICO's case".

The rarity of such people means that building data science teams is the way forward, he concludes.

"It always comes down to: 'What's the decision I am trying to improve?' Or 'what operation am I trying to improve?'".

FICO's approach, on his account, is to make decisions easier and repeatable. "You've got to be able to deploy the model. We put our time not just into the algorithm, but into getting the model to the point at which a decision can be made and you can execute at the right speed".

As for big data technologies, he said "I've been in analytics for years, and had never heard of Hadoop five years ago. It is now in our everyday language. All the big players - Oracle, SAP, and so on - are moving to make it less geeky. We're focused on the analytics and decisioning component of that".

Digitalisation feeds big data


This essay is a guest blog by Yves de Montcheuil, Vice President of Marketing at Talend.

 

Knowingly or not, many enterprises have embarked on a path to digitally transform their organisation. Also known as "digitalisation", this transformation takes many forms, which can range from the introduction of data processing at every step of existing business processes, all the way to the pivoting of the enterprise's business model towards the monetisation of its data assets.

Examples of digitalised organisations abound. Century-old catalogue retailers adopt modern big data technologies to get a 360-degree view of their customers and optimise their marketing. Banks from the City use advanced statistical models to predict the risk of customers defaulting on their loans. Manufacturers deploy sensors along their production lines to fine tune capacity and provide predictive maintenance, avoiding costly downtime.

Business model pivoting

Outsourcers of all kinds are especially prone to business model pivoting: because their models provide them with data sets from many clients, they can resell insight on these combined data sets - in an anonymised and aggregated mode, of course. From payroll providers selling statistics on salary ranges per function and location, to advertising agencies comparing a client's ad returns with benchmarks, to ski resorts informing leisure skiers how they fared against local champions, or even aircraft manufacturers optimising routings (and fuel savings) based on information gathered from aircraft flying the same route earlier - the examples are limited only by the creativity of business development executives (and the propensity of clients to pay for such services).

Born digital

Some companies do not need to "digitise" - they were born digital. Looking beyond the obvious examples - Google, Amazon, Facebook, Twitter - many companies' business models are based on the harvesting of data and its trade, in one form or another. Next-generation private car hire or hitchhiking/ridesharing providers are not transportation companies but intermediation platforms, bringing together drivers and riders based on location, and ensuring a smooth transaction between the parties.  Fitness apps collect data from exercising enthusiasts, providing this data, reports and alerts in an easily consumable format to the member, and further reselling it to interested third parties.

The common thread across all these different organisations? Their digital businesses are consuming and producing vast amounts of data. Social data, location data, transaction data, log data, sensor data constitute both the raw material and the outcome of their business processes.

For companies that were not born digital, some of this data existed before digitalisation began: banks stored credit instalments in paper ledgers for centuries and in computers for decades, aircraft have fed measurements to flight data recorders since the 1960s, and sensors in factories are nothing new but were historically used primarily to raise alert conditions. As digitalisation is embraced, this legacy data (or "small data") becomes a key part of the big data that is used to re-engineer business processes, or to build new ones. It will be augmented in two dimensions: more of the same data, and new data.

More of the same: digitalisation requires, and produces, more of the same data. In order to accurately predict consumer behaviour, detailed transaction data must be collected - not only aggregate orders. Predictive maintenance requires sensor data to be collected at all times, not just to raise an alert when a value exceeds a threshold. Route optimisation demands collection of location data and speed parameters at frequent intervals.

New: digitalisation also requires, and produces, new types of sources of data. Meteorological information is a key input to combine with aircraft sensors to optimise fuel consumption. Traffic data helps compute transit time. Web logs and social data augment transaction history to optimise marketing. 

The technical key to successful digitalisation is the ability of the organisation to collect and process all the required data, and to inject this data into its business processes in real-time - or more accurately, in right-time.  Data technology platforms have evolved dramatically in recent years - from the advent of Hadoop and then its transition to a data operating system with a broad range of processing options, to the use of NoSQL databases with relaxed consistency to favour speed at an affordable cost, and, of course, to the role open source is playing in the democratisation of these technologies.

Of course, while technology is necessary, it is not sufficient. Business executives need to fully embrace the possibilities that are now presented to them by the new wave of data platforms.

 

About the author

Yves de Montcheuil

Yves de Montcheuil is the Vice President of Marketing at Talend, which does open source integration. Yves holds a master's degree in electrical engineering and computer science and has 20 years of experience in software product management, product marketing and corporate marketing. He is also a presenter, author, blogger, social media enthusiast, and can be followed on Twitter: @ydemontcheuil.

Data Scientist: the New Quant


This is a guest blog by Yves de Montcheuil, Vice President of Marketing at Talend.

 

When big data was still in its infancy - or rather, before the term was minted - a population of statisticians with advanced analytical expertise would dominate the data research field. Sometimes called "quants" (short for "quantitative analysts"), these individuals had the skills to tackle a mountain of data and find the proverbial needle. Or rather, the path to that needle - so that such a path, once identified and handed over to skilled programmers, could be turned into a repeatable, operational algorithm.

Challenges facing quants were multiple. Gathering and accessing the data was the first one: often, the only data available was the data already known in advance to be useful. In order to test a theory, the quant would need to obtain access to unusual or unexpected sources, assuming these were available at all. Digging, drilling and sifting through all this data with powerful but intricate statistical languages was another issue. And then, of course, once a quant had found the gold nugget, operationalising the algorithms to repeat this finding would require another, very different set of skills.  Not only would quants command sky-high compensation packages, but they also needed a full-scale support system, from databases and IT infrastructure, to downstream programmers for operationalisation.

The coming of age of big data has seen a reshuffling of the cards. Nowadays, many an organisation collects and stores any data it produces, even if its use is not immediately relevant. This is enabled by a dramatic plunge in the cost of storing and processing data - thanks to Hadoop, which decreases the cost per terabyte by a factor of fifty. Navigating, viewing and parsing data stored in Hadoop is made intuitive and fast by the combination of next-generation data visualisation tools and the advent of new so-called "data preparation" or "data wrangling" technologies - while still in their infancy, these provide an Excel-like, intuitive interface for sifting through data. And the latest advances in Hadoop make the operationalisation of big data glimmer on the now-not-so-distant horizon.

These technology shifts have made it a lot simpler to harvest the value of data. Quants are being replaced by a new population: the data scientists. A few years ago there was a joke that a "data scientist" was simply what a business analyst living in California was called. This is no longer true. Data scientists now live and work on Wall Street and in the City of London, in the car factories of Detroit and Munich, and in the apparel districts of Madrid and Paris.

But simpler does not mean easy. True, the data scientist works without the complex support system that the quant required, and uses tools that are much quicker to learn. But the data scientist still needs to know what to look for. The data scientist is an expert in his industry and domain. He knows where to find the data, what it means, and how his organisation can optimise processes, reduce costs and increase customer value. More importantly, the data scientist has a feel for data: structured, semi-structured or unstructured, with or without metadata, he thrives when handed a convoluted data set.

There are still very few data scientists out there. Few universities train them: whereas one can get a master's degree in statistics in almost any country in the world, the few data science courses that exist are mostly delivered in California. And while big data technologies are becoming more and more pervasive, few people can point to years of experience and proven returns on big data projects.

Today, as an industry, we are only scratching the surface of the potential of big data. Data scientists hold the keys to that potential. They are the new statisticians. They are the new quants.

 

About the author

 

Yves de Montcheuil

 

Yves de Montcheuil is the Vice President of Marketing at Talend, which does open source integration. Yves holds a master's degree in electrical engineering and computer science and has 20 years of experience in software product management, product marketing and corporate marketing. He is also a presenter, author, blogger, social media enthusiast, and can be followed on Twitter: @ydemontcheuil.

Is data science a science?


Imperial College, London has officially launched its Data Science Institute, announced last year. And the government has announced £42m of funding for the Alan Turing Institute, location to be decided.

Data Science is, then, officially in vogue. Not just the pet name for data analytics at Silicon Valley companies, like Google, LinkedIn, Twitter, and the rest, but anointed as a 'science'.

Imperial College is doing a great deal with data, for its science, already: from the crystallisation of biological molecules for x-ray crystallography, through the hunt for dark matter, to the development of an ovarian cancer database. And much else besides.

What will make the college's Data Science Institute more than the sum of these parts? I asked this question of Professor David Gann, chairman of the research board at Imperial's new institute. His response was: "Imperial College specialises in science, engineering and medicine, and also has a business school. In each of those areas we have large scale activities: largest medical school in Europe, largest engineering school in the world. And we are a top ten player in the university world globally.

"So you would expect us to be doing a lot with data. As for our developing something that is more than the sum of the parts, I would say we genuinely mean that there is a new science about how we understand data. We are going to take a slice through the [current] use of large data sets in incumbent fields of science, engineering, medicine, and business to create a new science that stands on its own two feet in terms of analytics, visualisation, and modelling. That will take us some time to get right: three to five years".

Founding director of the Institute Professor Yike Guo added: "creating value out of data is key, too. Our approach at Imperial is already multi-disciplinary, with the individual fields of study as elements of a larger chemistry, which is data".

I put the same question to Duncan Ross, director of data science at Teradata, at the vendor's recent 'Universe' conference in Prague. Duncan made the traditional scientist's joke that if you have to put the word 'science' at the end of a noun, then you don't really have science. He then went on to say: "There is an element of taking a scientific approach to data which is worth striving for. But Bayes' Theorem of 1763 is hardly new; it is just that we now have the computing technology to go with it".

At the same event, Tim Harford, the 'undercover economist' who presents Radio 4's More or Less programme, ventured this take on the data science concept: "It [the data science role] seems like a cool new combination of computer science and statistics. But there is no point in hiring an elite team of data geeks who are brilliant but who no one in management understands or takes seriously".

There was a time when computer science was not considered to be a science, or at least not much of one. And, arguably, it is more about 'technology' and 'engineering' than it is about fundamental science. Can the same be said of 'data science'? The easy thing to say is that it does not matter. Perhaps an interesting test would be how many IT professionals would want their children to graduate in Data Science in preference to Mathematics, Physics, or, indeed, History, Law or PPE?

Moreover, do we want scientists and managers who are data savvy or do we need a new breed of data scientist - part statistician, part computer programmer, part business analyst, part communications specialist? Again, it is easy to say: "we want both", when investment choices will always have to be made.

As for the Alan Turing Institute, David Gann at Imperial told me: "As you can imagine, we would be interested, but the process is just starting. Other good universities would say the same".

If any institution has a decent shot of forging a new discipline (shall we just call it that?) of data science, it is Imperial College, London. That said, King's College, Cambridge and the University of Manchester might well have a word or two to say about the eventual location of the Alan Turing Institute.

The industrialisation of analytics


The automation of knowledge work featured in a McKinsey report last year as one of ten IT-enabled business trends for the decade ahead: 'advances in data analytics, low-cost computer power, machine learning, and interfaces that "understand" humans' were cited as technological factors that will industrialise the knowledge work of 200 million workers globally.

On the surface, this seems at odds with the rise of the data scientist. It has become commonplace in recent years to say that businesses and other organisations are crying out for a new breed of young workers who can handle sophisticated data analysis, but who also have fluent communication skills, business acumen and political nous: data scientists.

The problem is, not surprisingly, finding them. I've heard a few solutions offered. Stephen Brobst, Teradata's CTO, suggested that physicists and other natural scientists - that is to say, not only mathematicians - are a good source.

Another approach is to automate the problem, in different ways and up to different points. Michael Ross, chief scientist at eCommera and founder of online lingerie retailer Figleaves, contends that online retailing does require the industrialisation of analytics.

He told me: "E-commerce is more Toyota than Tesco. It's more about the industrialisation of decisions based on data. It's not about having an army of data analysts. It's about automating. Physical retail is very observable. Online you've got lots of interconnected processes that look much more like a production line".

And he drew a further parallel with the Industrial Revolution, which de-skilled craftsmen: "This stage is all about replacing knowledge workers with algorithms".

As it happens, Ross is a McKinsey (and Cambridge maths) alumnus himself, but was basing his observations upon his experience at Figleaves, and elsewhere.

The supplier community - and Ross belongs to that at his company - is keen to address this problem space. For instance, SAP is developing its predictive analytics offer in the direction of more automation, in part through the Kxen Infinite Insight software it acquired last year. Virgin Media is using the software to generate sales leads by analysing customer behaviour patterns.

The limitations of Hadoop

Actian, the company that encompasses the Ingres relational database, has now positioned itself as an analytics platform provider. The pieces of that platform have come from a raft of recent acquisitions: VectorWise, Versant, Pervasive Software, and ParAccel. I attended a roundtable the supplier held last week, at which CEO Steve Shine and CTO Mike Hoskins talked about the company's vision. Both deprecated what they see as a regression in data management inadvertently caused by the rise of the Hadoop stack and related NoSQL database technologies. Hadoop requires such a "rarefied skills set" that first phase big data projects have yielded little value, said Shine.

Hoskins said his colleague had, if anything, been too kind. "MapReduce is dragging us back to the 1980s or even 1950s", he said. "It's like Cobol programming without metadata".

He said the data technology landscape is changing so massively that "entire generations of software will be swept away". Mounting data volumes in China and elsewhere in Asia reinforce much of what has been said in the west about the "age of data", he continued, and he characterized the phrase 'internet of things' as "instrumenting the universe. We are turning every dumb object into a smart object which is beaming back data".

As for a putative industrialisation of analytics, he said: "the Holy Grail is 'closed loop analytics', where one is not just doing iterative data science to improve a recommender system or fraud detection by 10%, but rather driving meaningful insight into a predictive model or rule which then goes into day-to-day operational systems. So it's about closed loop analytics that enable constant improvement".
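To illustrate what a closed loop might look like in the simplest possible terms - this is a conceptual sketch, not FICO's, Actian's or anyone else's actual approach - the snippet below scores transactions with a stand-in threshold "model", records the observed outcomes, and periodically refits the threshold on that feedback:

```python
import random

# Stand-in "model": flag transactions above a learned amount threshold as risky.
model = {"threshold": 500.0}
feedback_log = []  # (amount, was_actually_fraud) pairs gathered from operations

def score(transaction):
    """Operational decision made in the day-to-day system."""
    return "review" if transaction["amount"] > model["threshold"] else "approve"

def record_outcome(transaction, was_fraud):
    """Outcomes flow back from operations into the analytics loop."""
    feedback_log.append((transaction["amount"], was_fraud))

def retrain():
    """Close the loop: refit the threshold on observed outcomes."""
    fraud_amounts = [amt for amt, fraud in feedback_log if fraud]
    if fraud_amounts:
        model["threshold"] = min(fraud_amounts) * 0.9  # simplistic update rule

# Simulate one cycle of operate -> observe -> retrain.
for _ in range(100):
    txn = {"amount": random.uniform(10, 1000)}
    score(txn)  # decision taken in production
    record_outcome(txn, was_fraud=txn["amount"] > 400 and random.random() < 0.3)

retrain()
print(f"updated review threshold: {model['threshold']:.2f}")
```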

The automation of data analytics does seem to make business sense. Will bands of data scientists emerge to contest its worth?
