Choosing storage for streaming large files in big data sets
There is little question that big data has broken through, not only to the enterprise IT agenda, but also to the public imagination.
Barely a day passes without mention of new capabilities for mining the internet to yield new insights on customers, optimise search, or personalise web experiences for communities comprising tens or hundreds of millions of members.
Ovum considers big data to be a problem that requires powerful alternatives beyond traditional SQL database technology.
The attributes of big data include “the three Vs” – volume, variety (structured, along with variably structured data) and velocity. Ovum believes that a fourth V – value – must also be part of the equation.
A variety of platforms have emerged to process big data, including advanced SQL (sometimes called NewSQL) databases that adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms that may range from file systems to document or columnar data stores that typically dispense with the need for modelling data.
Examples of Fast Data applications
- Sensory applications that provide snapshots of phenomena or events from a variety of data points that are aggregated and processed in real time or near real time.
- Stream-processing applications that process high-speed data feeds with embedded, rules-driven logic to either alert people to make decisions, or to trigger automated closed-loop operational responses.
- High-speed, low-latency series of events that, as with stream processing, generate alerts or closed-loop automated responses based on embedded rules or business logic, such as high-frequency trading (HFT).
- Real-time or near real-time analytics.
- Real-time transactional or interactive processing applications involving large, multi-terabyte, internet-scale data sets.
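The rules-driven stream processing described in the list above can be illustrated with a minimal sketch. It is not taken from the Ovum report: the `Reading` and `Rule` shapes, the rule names and the thresholds are all hypothetical, and the feed is assumed to arrive in timestamp order. Each incoming reading is added to a sliding time window, every rule is evaluated against that window, and a matching rule emits either an alert for a person or an automated closed-loop response.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Reading:
    """One event from a feed (hypothetical shape, not from the report)."""
    sensor_id: str
    timestamp: float  # seconds since an arbitrary start
    value: float


@dataclass(frozen=True)
class Rule:
    """A named condition over the current window and the response it triggers."""
    name: str
    condition: Callable[[Deque[Reading]], bool]
    response: str  # "alert" (a person decides) or "auto" (closed-loop action)


def process_stream(
    readings: Iterable[Reading], rules: List[Rule], window_seconds: float
) -> Iterator[Tuple[str, str, float]]:
    """Yield (response, rule name, timestamp) whenever a rule fires."""
    window: Deque[Reading] = deque()
    for reading in readings:
        window.append(reading)
        # Evict readings that have fallen out of the time window.
        while reading.timestamp - window[0].timestamp > window_seconds:
            window.popleft()
        for rule in rules:
            if rule.condition(window):
                yield (rule.response, rule.name, reading.timestamp)


# Illustrative rules only; the thresholds are assumptions.
RULES = [
    Rule("high_average", lambda w: sum(r.value for r in w) / len(w) > 100, "auto"),
    Rule("spike", lambda w: w[-1].value > 150, "alert"),
]
```

Nothing is persisted here: only the readings inside the current window are held in memory, which mirrors the "processed immediately" style of fast data application.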
Most of the early implementations of big data, especially with NoSQL platforms such as Hadoop, have focused more on volume and variety, with results delivered through batch processing.
Behind the scenes, there is a growing range of use cases that also emphasise speed. Some of them consist of new applications that take advantage not only of powerful back-end data platforms, but also the growth in bandwidth and mobility. Examples include mobile applications such as Waze that harness sensory data from smartphones and GPS devices to provide real-time pictures of traffic conditions.
On the horizon there are opportunities for mobile carriers to track caller behaviour in real time to target ads, location-based services, or otherwise engage their customers. Conversely, existing applications are being made more accurate, responsive and effective as smart sensors add more data points, intelligence and adaptive control.
These are as diverse as optimising supply chain inventories, regulating public utility and infrastructure networks, or providing real-time alerts for homeland security. The list of potential opportunities for fast processing of big data is limited only by the imagination.
Fast data is the subset of big data implementations that require velocity. It enables instant analytics or closed-loop operational support with data that is either not persisted, or is persisted in a manner optimised for instant, ad hoc access. Fast data applications are typically driven by rules or complex logic or algorithms.
Regarding persistence, the data is either processed immediately and is not persisted, such as through extreme low-latency event processing, or it is persisted in an optimised manner. This is typically accomplished with silicon-based flash or memory storage, and is either lightly indexed (to reduce scanning overhead) or not indexed at all. The rationale is that the speed of silicon either eliminates the need for sophisticated indexing, or allows customised data views to be generated dynamically.
Connectivity is also critical. While ultra low-latency messaging links are not mandatory (they are typically only utilised by securities trading firms), optimising connectivity through high-speed internal buses, such as InfiniBand, is essential for computation of large blocks of data. Direct links to high-speed wide-area network (WAN) backbones are key for fast data applications digesting data from external sources.
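The idea of persisting data in memory with no index and generating customised views dynamically can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any product: `UnindexedStore`, its `view` method and the row fields are hypothetical names. Rows are held in a plain in-memory list, and each view is built at query time by a full scan that filters and aggregates, instead of relying on a pre-built index.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]


class UnindexedStore:
    """Hypothetical in-memory store: rows are appended and never indexed."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def append(self, row: Row) -> None:
        self._rows.append(row)

    def view(
        self, where: Callable[[Row], bool], group_by: str, measure: str
    ) -> Dict[object, float]:
        """Build an ad hoc view by scanning every row at query time."""
        totals: Dict[object, float] = defaultdict(float)
        for row in self._rows:
            if where(row):
                totals[row[group_by]] += float(row[measure])
        return dict(totals)


# Usage: the same rows answer two different questions with no index.
store = UnindexedStore()
store.append({"region": "north", "kind": "flow", "value": 10.0})
store.append({"region": "north", "kind": "flow", "value": 5.5})
store.append({"region": "south", "kind": "flow", "value": 2.0})
store.append({"region": "south", "kind": "pressure", "value": 99.0})

flow_by_region = store.view(lambda r: r["kind"] == "flow", "region", "value")
total_by_kind = store.view(lambda r: True, "kind", "value")
```

The trade-off is that every query costs a full scan, which is only tolerable because the data sits in memory. On disk, the same access pattern would normally call for an index.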
What fast data is not
Fast transaction systems that can be updated interactively but do not automatically close the loop on how an organisation responds to events (for example, the systems are read manually or generate reports) are not considered applications of fast data. Conventional online transaction processing (OLTP) databases are typically designed with some nominal degree of optimisation, such as locating hot (frequently or recently used) data on the most accessible parts of disk (or sharded across multiple disks in a storage array), more elaborate indexes, and/or table designs to reduce the need for joins, and so on. In these cases, the goal is optimising the interactive response for frequent, routine queries or updates. These systems are not, however, designed for processing inordinately large volumes or varieties of data in real time.
Fast Data is not new
Real-time databases are not new. Capital markets firms have long relied on databases capable of ingesting and analysing “tick” data – a task that requires not only speed but the ability to process fairly complex analytics algorithms. Traditionally, these firms have looked to niche suppliers with highly specialised engines that could keep pace, and in some cases these platforms have been backed by Kove’s in-memory storage appliance. In-memory data stores have been around for roughly 20 years, but their cost has typically restricted them to highly specialised applications. These include extreme high-value niches, such as event streaming systems developed by investment banks for conducting high-speed trading triggered by patterns in real-time capital markets feeds; small-footprint subsystems, such as directories of router addresses for embedded networking systems; and deterministic, also known as “hard”, real-time systems, where the system must be guaranteed to respond within a specific interval.
These systems often have extremely small footprints, sized in kilobytes or megabytes, and are deployed as embedded controllers, typically as firmware burned into application-specific integrated circuit (ASIC) chips, for applications such as avionics navigation or industrial machine control. The common threads are that real-time systems for messaging, closed-loop analytics and transaction processing have been restricted to specific niches because of their cost and the fact that the amount of memory necessary for many of these applications is fairly modest.
What has changed?
It should sound familiar. The continuing technology price/performance curve, fed by trends such as Moore’s Law for processors, and corollaries for other parts of IT infrastructure, especially in storage, is making fast data technologies and solutions economical for a wider cross-section of use cases.
The declining cost of storage has become the most important inflection point. Silicon-based storage, either flash or memory, has become cheap enough to be deployed with sufficient scale to not only cache input/output (I/O), but also store significant portions of an entire database. Enabling technology has inflated user expectations accordingly.
A recent Jaspersoft big data survey of its customer base provided a snapshot of demand. Business intelligence customers, who have grown increasingly accustomed to near real-time or interactive ad hoc querying and reporting, are carrying the same expectations over to big data. Jaspersoft’s survey revealed that over 60% of respondents have deployed or are planning to deploy big data analytics within 18 months, with nearly 50% expecting results in real time or near real time. Speed is being embraced by mainstream enterprise software players and start-ups alike. Oracle and SAP are commercialising a former niche market of in-memory databases, and Tibco is promoting a “two-second advantage”: delivering just enough information in context to make snap decisions, based on a combination of messaging, in-memory data grid, rules and event processing systems.
Core SQL platforms are being reinvented to raise limits on speed and scale. For instance, the latest models of Oracle Exadata engineered appliances pack up to 75% more memory, and up to four times more flash memory at similar price points, compared with previous models.
Tony Baer is a research director at Ovum. This is an extract of Ovum’s report What is fast data?