Data Scientist: the New Quant

This is a guest blog by Yves de Montcheuil, Vice President of Marketing at Talend.


When big data was still in its infancy - or rather, before the term was minted - a population of statisticians with advanced analytical expertise would dominate the data research field. Sometimes called "quants" (short for "quantitative analysts"), these individuals had the skills to tackle a mountain of data and find the proverbial needle. Or rather, the path to that needle - so that such a path, once identified and handed over to skilled programmers, could be turned into a repeatable, operational algorithm.

Challenges facing quants were multiple. Gathering and accessing the data was the first one: often, the only data available was the data already known in advance to be useful. In order to test a theory, the quant would need to obtain access to unusual or unexpected sources, assuming these were available at all. Digging, drilling and sifting through all this data with powerful but intricate statistical languages was another issue. And then, of course, once a quant had found the gold nugget, operationalising the algorithms to repeat this finding would require another, very different set of skills. Not only would quants command sky-high compensation packages, but they also needed a full-scale support system, from databases and IT infrastructure to downstream programmers for operationalisation.

The coming of age of big data has seen a reshuffling of the cards. Nowadays, many an organisation collects and stores any data it produces, even if its use is not immediately apparent. This is enabled by a dramatic plunge in the cost of storing and processing data - thanks to Hadoop, which decreases the cost per terabyte by a factor of fifty. Navigating, viewing and parsing data stored in Hadoop is made intuitive and fast by the combination of next-generation data visualisation tools and new so-called "data preparation" or "data wrangling" technologies - while still in their infancy, these provide an Excel-like intuitive interface to sift through data. And the latest advances in Hadoop make the operationalisation of big data glimmer on the now-not-so-distant horizon.

These technology shifts have made it a lot simpler to harvest the value of data. Quants are being replaced by a new population: the data scientists. A few years ago, the joke was that "data scientist" was simply what a business analyst living in California was called. This is no longer true. Data scientists now live and work on Wall Street and in the City of London, in the car factories of Detroit and Munich, in the apparel districts of Madrid and Paris.

But simpler does not mean easy. True, the data scientist works without the complex support system that the quant required, and uses tools that have a much gentler learning curve. But the data scientist still needs to know what to look for. The data scientist is an expert in his industry and domain. He knows where to find the data, what it means, and how his organisation can optimise processes, reduce costs and increase customer value. And more importantly, the data scientist has a feel for data: structured, semi-structured, unstructured, with or without metadata, he thrives when handed a convoluted data set.

There are still very few data scientists out there. Few universities train them: whereas one can get a master's degree in statistics in almost any country in the world, the few data science courses that exist are mostly delivered in California. And while big data technologies are becoming more and more pervasive, few people can claim years of experience and show proven returns on big data projects.

Today, as an industry, we are only scratching the surface of the potential of big data. Data scientists hold the keys to that potential. They are the new statisticians. They are the new quants.


About the author


Yves de Montcheuil


Yves de Montcheuil is the Vice President of Marketing at Talend, which does open source integration. Yves holds a master's degree in electrical engineering and computer science and has 20 years of experience in software product management, product marketing and corporate marketing. He is also a presenter, author, blogger, social media enthusiast, and can be followed on Twitter: @ydemontcheuil.

Is data science a science?

Imperial College, London has officially launched its Data Science Institute, announced last year. And the government has announced £42m in funding for the Alan Turing Institute, location to be decided.

Data Science is, then, officially in vogue. Not just the pet name for data analytics at Silicon Valley companies, like Google, LinkedIn, Twitter, and the rest, but anointed as a 'science'.

Imperial College is already doing a great deal with data for its science: from the crystallisation of biological molecules for x-ray crystallography, through the hunt for dark matter, to the development of an ovarian cancer database. And much else besides.

What will make the college's Data Science Institute more than the sum of these parts? I asked this question of Professor David Gann, chairman of the research board at Imperial's new institute. His response was: "Imperial College specialises in science, engineering and medicine, and also has a business school. In each of those areas we have large scale activities: largest medical school in Europe, largest engineering school in the world. And we are a top ten player in the university world globally.

"So you would expect us to be doing a lot with data. As for our developing something that is more than the sum of the parts, I would say we genuinely mean that there is a new science about how we understand data. We are going to take a slice through the [current] use of large data sets in incumbent fields of science, engineering, medicine, and business to create a new science that stands on its own two feet in terms of analytics, visualisation, and modelling. That will take us some time to get right: three to five years".

Founding director of the Institute Professor Yike Guo added: "creating value out of data is key, too. Our approach at Imperial is already multi-disciplinary, with the individual fields of study as elements of a larger chemistry, which is data".

I put the same question to Duncan Ross, director of data science at Teradata, at the vendor's recent 'Universe' conference in Prague. Duncan made the traditional scientist's joke that if you have to put the word 'science' at the end of a noun, then you don't really have science. He then went on to say: "There is an element of taking a scientific approach to data which is worth striving for. But Bayes' Theorem of 1763 is hardly new, it is just that we now have the computing technology to go with it".

At the same event, Tim Harford, the 'undercover economist' who presents Radio 4's More or Less programme, ventured this take on the data science concept: "It [the data science role] seems like a cool new combination of computer science and statistics. But there is no point in hiring an elite team of data geeks who are brilliant but who no one in management understands or takes seriously".

There was a time when computer science was not considered to be a science, or at least not much of one. And, arguably, it is more about 'technology' and 'engineering' than it is about fundamental science. Can the same be said of 'data science'? The easy thing to say is that it does not matter. Perhaps an interesting test would be how many IT professionals would want their children to graduate in Data Science in preference to Mathematics, Physics, or, indeed, History, Law or PPE?

Moreover, do we want scientists and managers who are data savvy or do we need a new breed of data scientist - part statistician, part computer programmer, part business analyst, part communications specialist? Again, it is easy to say: "we want both", when investment choices will always have to be made.

As for the Alan Turing Institute, David Gann at Imperial told me: "As you can imagine, we would be interested, but the process is just starting. Other good universities would say the same".

If any institution has a decent shot at forging a new discipline (shall we just call it that?) of data science, it is Imperial College, London. That said, King's College, Cambridge and the University of Manchester might well have a word or two to say about the eventual location of the Alan Turing Institute.

The industrialisation of analytics

The automation of knowledge work featured in a McKinsey report last year as one of ten IT-enabled business trends for the decade ahead: 'advances in data analytics, low-cost computer power, machine learning, and interfaces that "understand" humans' were cited as technological factors that will industrialise the knowledge work of 200 million workers globally.

On the surface, this seems at odds with the rise of the data scientist. It has become commonplace in recent years to say that businesses and other organisations are crying out for a new breed of young workers who can handle sophisticated data analysis, but who also have fluent communication skills, business acumen and political nous: data scientists.

The problem is, not surprisingly, finding them. I've heard a few solutions offered. Stephen Brobst, Teradata's CTO, suggested that physicists and other natural scientists - that is to say, not only mathematicians - are a good source.

Another approach is to automate the problem, in different ways and up to different points. Michael Ross, chief scientist at eCommera and founder of online lingerie retailer Figleaves, contends that online retailing does require industrialisation of analytics.

He told me: "E-commerce is more Toyota than Tesco. It's more about the industrialisation of decisions based on data. It's not about having an army of data analysts. It's about automating. Physical retail is very observable. Online you've got lots of interconnected processes that look much more like a production line".

And he drew a further parallel with the Industrial Revolution, which de-skilled craftsmen: "This stage is all about replacing knowledge workers with algorithms".

As it happens, Ross is a McKinsey (and Cambridge maths) alumnus himself, but was basing his observations upon his experience at Figleaves, and elsewhere.

The supplier community - and Ross belongs to that at his company - is keen to address this problem space. For instance, SAP is developing its predictive analytics offer in the direction of more automation, in part through the Kxen Infinite Insight software it acquired last year. Virgin Media is using the software to generate sales leads by analysing customer behaviour patterns.

The limitations of Hadoop

Actian, the company that encompasses the Ingres relational database, has now positioned itself as an analytics platform provider. The pieces of that platform have come from a raft of recent acquisitions: VectorWise, Versant, Pervasive Software, and ParAccel. I attended a roundtable the supplier held last week, at which CEO Steve Shine and CTO Mike Hoskins talked about the company's vision. Both deprecated what they see as a regression in data management inadvertently caused by the rise of the Hadoop stack and related NoSQL database technologies. Hadoop requires such a "rarefied skills set" that first phase big data projects have yielded little value, said Shine.

Hoskins said his colleague had, if anything, been too kind. "MapReduce is dragging us back to the 1980s or even 1950s", he said. "It's like Cobol programming without metadata".

He said the data technology landscape is changing so massively that "entire generations of software will be swept away". Mounting data volumes in China and elsewhere in Asia reinforce much of what has been said in the west about the "age of data", he continued, and he characterised the 'internet of things' phrase as "instrumenting the universe. We are turning every dumb object into a smart object which is beaming back data".

As for a putative industrialisation of analytics, he said: "The Holy Grail is 'closed loop analytics', where one is not just doing iterative data science to improve a recommender system or fraud detection by 10%, but rather to drive meaningful insight into a predictive model or rule which then goes into day-to-day operational systems. So it's about closed loop analytics that enable constant improvement".

The automation of data analytics does seem to make business sense. Will bands of data scientists emerge to contest its worth?
