Startup and early-stage data-focused companies in Silicon Valley are coming up with technologies that promise to help CIOs glean value from investments in the so-called “new oil” of data.
Computer Weekly was represented on a recent European IT press visit to business applications and data analytics companies in San Francisco and Silicon Valley.
As is often the case, hints about the future shape of UK enterprise IT can be discerned in what is coming out of new and relatively new tech companies in northern California.
Alation: data cataloguing to aid rational thought
Satyen Sangani, founder and CEO of Alation, says the rationale for the company’s technology is to find ways to think rationally about data presented for consumption.
“What is required is something between where data is stored and where the questions are asked. There is an infinity of both data and questions,” he says.
Why can’t database providers do this themselves? “Each database supplier can only optimise for their own stack. So you need a data catalogue to act as a ‘Switzerland’, as neutral,” says Sangani.
“Also, with the self-service trend represented by Tableau et al, users have to be the target market [as opposed to just IT]. And that the data stays where it is – this is a critical thing for us.”
Alation bills itself as a “trusted catalogue for data”, with machine-learnt recommendations for how data is tagged. It creates a unitary reference source for an organisation, based on all its data stores. eBay and GoDaddy are among the customers that have used its technology to build catalogues – the former drawing on a Teradata data warehouse, the latter using Tableau.
The company cites the former chief data officer at eBay, Zoher Karu, positing a data governance problem that Alation’s software addressed: “The biggest sin of data governance is if a random person queries some data, puts it in Excel, modifies it, puts it into a PowerPoint and ships it around. We had this happening a lot.”
According to Aaron Kalb, vice-president of design at the company, “collaborative filtering” is the core idea behind its data catalogue product.
This “data curation” principle is adduced by another customer, Munich Re. Wolfgang Hauner, chief data officer at the re-insurer, is cited by Alation as saying: “At Munich Re, our data strategy is geared to offer new and better risk-related services to our customers. A core piece in that strategy is our integrated self-service data analytics platform. Alation’s social catalogue is part of that platform and already helps more than 600 users in the group to discover data easily and to share knowledge with each other.”
MapD: GPUs for data analytics
MapD has its origins, in a sense, in the Arab Spring of 2010. Todd Mostak, founder and CEO of the data visualisation firm, built the prototype of its technology to explore big datasets interactively while doing research at Harvard on the use of Twitter during the uprisings across the Arab world.
He then went on to the Massachusetts Institute of Technology (MIT) as a research fellow focused on graphics processing unit (GPU) databases. GPUs are able to render images faster than central processing units (CPUs) because of their parallel processing architectures. GPU chips are used for computer games and other resource-intensive tasks.
MapD has turned the technology towards general-purpose analytics, especially operational analytics, geospatial use cases and data science.
Investors include the CIA venture fund In-Q-Tel, GPU manufacturer Nvidia and, an early customer, US telecoms company Verizon. Other reference customers are Volkswagen, which uses MapD to visualise so-called “black-box” artificial intelligence (AI) and machine learning (ML) models, and a Los Angeles-based geospatial property visualisation organisation, Patriglo, which says it is using the software to address the city’s housing crisis.
“GPUs are not good for everything,” says Mostak. “A lot of problems in computing are relatively sequential, and if you are dealing with unstructured data, that’s harder with GPUs. But GPUs with, today, thousands of cores are great for when you can massively parallelise [structured data]. And when you look at the hardware trends today, you can see large effects downstream. That is why Nvidia’s stock price has gone up.”
MapD is also part of an initiative to create common data frameworks to speed up the use of data analytics on GPUs, the GPU Open Analytics Initiative (GOAI).
Aerospike: NoSQL database optimised for flash
Aerospike is a NoSQL database supplier with a strong heritage in servicing adtech companies, which use it to orchestrate real-time bidding for advertisement slots online, and is moving more into financial services. Its database use cases include identifying new fraud patterns, assessing financial risk in intra-day trading and handling online seat reservations.
The company was founded in 2009, has 125 paying customers and has close links with Intel. It positions itself as a specialist in unstructured data, able to do real-time transactions as well as analytics, and so, on its account, addressing the domains of both other NoSQL databases – such as Couchbase, Cassandra and MongoDB – and Hadoop. The company says it combines high speed with consistency by eliminating caching from its database architecture.
Brian Bulkowski, co-founder and CTO of Aerospike, says the hybrid flash storage and in-memory architecture it provides translates to a dramatically smaller server footprint. “This is the ‘Copernicus moment’ for a CIO or CTO, when we can, say, reduce 450 Cassandra [database] nodes to 60. The only way people believe us is when they do the proof of concept themselves,” he says.
“I was recently with the CIO of a very large telco, with thousands of servers [running a NoSQL database]. For every 50 nodes of that database we replace, we can save it $350,000 a year every year. We are the only key-value store out there that can do that.”
Similar technology runs at Google and Facebook, according to another Aerospike spokesperson, “but they don’t let it out”.
GridGain: staking bet on in-memory future of computing
GridGain founder and CTO Nikita Ivanov says he knows Aerospike well, and acknowledges its database technology’s speed against similar outfits, but contends that his company’s datastore, based on Apache Ignite, is even faster by virtue of being entirely in-memory, not in flash storage.
CEO Abe Kleinfeld says digital transformation, as a process, is driving adoption of in-memory computing, since the traditionally bifurcated data warehousing/operational database model is not agile enough for the purpose.
“GridGain is like an open sourced Hana,” he says, but argues that SAP Hana is not going to be adopted by startups or by companies that have not already invested in SAP technology more generally, “because it is proprietary, high end and expensive”.
“The reason why SAP Hana has such a huge base of customers is because [SAP] is putting [Hana] into its applications. Customers are not using Hana for non-SAP applications. Most companies today have an open source first approach for greenfield applications. The world is favouring our approach to the proprietary SAP, Oracle, Microsoft approach,” adds Kleinfeld.
Recent customer wins for the supplier include Barclays, Société Générale and ING in financial services, and Workday, Microsoft and Huawei in technology, he says, adding that Ignite has reached around one million downloads per year and is the fifth most committed-to project at Apache. GridGain has around 100 paying customers.
One of the company’s early customers was the Russian bank Sberbank, which, GridGain reports, has built one of the world’s largest in-memory database clusters, comparable with those of Amazon Web Services, Alibaba, et al.
The IT Press Tour previously met with GridGain in 2017. It has since almost doubled in revenue, and has 80% more employees, says Kleinfeld.
Waterline Data: bringing data up from the depths
Waterline Data was another repeat visit for the tour.
The automatic addition of inferred business labels to aid the discovery of an organisation’s data is the selling point of Waterline’s data cataloguing technology.
This time around, CEO and founder Alex Gorelik and his team were keen to present a General Data Protection Regulation (GDPR)-oriented dashboard that Waterline Data has been developing to surface silos of personal information stores in an organisation.
Gorelik says the company is hearing the plaintive cry, “What do we do?” from customers, over and over, but he believes the EU regulation will be a good thing since it will put an end to any “just pay the fine” mentality.
CreditSafe is a reference customer that is using the technology to automatically identify and tag GDPR-relevant personal data.
GlaxoSmithKline, though it is not using the GDPR system, is a reference customer for Waterline Data’s broader technology.
Mark Ramsey, chief data officer, research and development data, at GlaxoSmithKline, says it is using Waterline to analyse the lineage of masses of scientific data. He says the cataloguing software’s capacity to enable a dynamic understanding of dispersed data – across schemas and attributes – among its research scientists is valuable.
Though GSK is rationalising its scientific data according to pharmaceutical industry standards, that won’t, according to Ramsey, render the data sufficiently discoverable – and that is where Waterline comes in.
“Today, as we are populating our data lake, we are building dashboards to look at that data. As we are bringing Waterline more into the environment, it will open up the opportunity for self-service, and so the scientists and researchers will have a better life. They will be able to discover data more easily, understand its location and lineage, and will directly be able to access and analyse the data. Today, we are more in a pre-service mode where we define what they get access to through guided analytics. And so it will really open up the population of data to the scientists,” he says.
That, finally, is the moral of the story for CIOs and heads of data management and analytics from this cohort of Silicon Valley-based companies: they are focused on enabling businesses and organisations to get more value from the data investments they have made in recent years, in this era of so-called big data.
Read more about Silicon Valley-based data analytics technologies
- A group of California-based startup and early-stage data analytics and management companies aim to make big data, including sensor data, more tractable for analysis.
- Read analyst Mike Ferguson on Hadoop and its context.
- Companies based in Silicon Valley discuss the issues created by big data technologies as they mature.