Will RainStor data deduplication change the database game?

RainStor says its technology deduplicates relational databases with space savings ranging from 20:1 to 60:1. Is it a game-changing technology?

Category killer, game changer, killer app -- marketers always hype their technology. But RainStor does appear to be unique. It performs data deduplication on database records and structured information -- a Holy Grail that has so far proved elusive to others.

RainStor launched a year ago under the name Clearpace Software and then went quiet, operating underneath the publicity radar. Since then, it has changed its name to RainStor (a reference to Relational Archiving Infrastructure Storage) and released an eponymous offering.


The technology ingests a relational database structure, such as an Oracle instance, and replaces all duplicate entries within and across columns with pointers to a master item. This is classic sub-file deduplication, but at the DBMS record level, not at the block level. The deduplicated database remains intact and doesn't need rehydrating before it can be read or searched. Any SQL-based database query applications run as before.
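To make the idea concrete, here is a minimal Python sketch of value-level deduplication. It is not RainStor's actual implementation, which the company has not detailed here; the function names `deduplicate` and `select_where` and the sample rows are made up for illustration. Each distinct field value is stored once as a master item, every row becomes a tuple of pointers into that shared list, and a simple lookup runs against the pointers without rebuilding the whole table first.

```python
def deduplicate(rows):
    """Store each distinct value once; rows become tuples of pointers to it."""
    masters = []   # unique values, in first-seen order (the "master items")
    index = {}     # value -> position in masters
    encoded = []
    for row in rows:
        pointers = []
        for value in row:
            if value not in index:
                index[value] = len(masters)
                masters.append(value)
            pointers.append(index[value])
        encoded.append(tuple(pointers))
    return masters, encoded


def select_where(masters, encoded, column, value):
    """Return rows whose given column equals value, comparing pointers only."""
    if value not in masters:
        return []
    target = masters.index(value)
    # Only the matching rows are expanded back into real values.
    return [tuple(masters[p] for p in row)
            for row in encoded if row[column] == target]


# Hypothetical sample data: the same values repeat within and across columns.
rows = [
    ("alice", "london", "london"),
    ("bob", "paris", "london"),
    ("alice", "paris", "paris"),
]
masters, encoded = deduplicate(rows)
print(masters)
print(encoded)
print(select_where(masters, encoded, 1, "paris"))
```

Nine stored fields shrink to four master values plus small integer pointers. Real savings would depend on how repetitive the data is, which is presumably why the quoted ratios span such a wide range.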

The technology takes a database and breaks it up into partitions, potentially containing millions of records, and these are then stored as objects on any storage array you like, including EMC's Centera content-addressed storage family. It means that Centera, which stores unstructured reference files, can also store structured, post-deduplication database records. It can then be used as a secondary data warehouse facility.

If you need data warehouse facilities, you generally need DBMS licenses and database administrators (DBAs), and lots of hard drives and servers for the query processing. If you transform your database into a set of objects stored on whatever storage you wish -- direct-attached, network-attached, block-accessed storage-area network (SAN), content-addressed like Centera, whatever -- then you can store it on commodity drives, have the RainStor software run as a VMware virtual machine (VM) and throw cores at it to get the processing power up.

This means you need no extra DBAs, DBMS licenses, fast disk arrays or masses of server processors. RainStor CEO John Bantleman says that with RainStor you would need "1/20th of the hardware, 1/40th of the storage [and] no design, tuning or maintenance." It would save a whole lot of cash, in both CAPEX and OPEX.

According to Bantleman, it's not good enough for a real-time data warehouse but it is good enough for mining old data, and for compliance and archiving of data that is relatively inactive but could use faster access than a tape-based archive.

Databases used to hold telco call history records or finance house trading data can be huge, many terabytes in size. Bantleman says the product returns anything from 20:1 to 60:1 space savings, but he settles on 40:1. A terabyte of structured data becomes 25 GB. If those numbers are real, can you imagine the product's appeal to telcos and finance houses?
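The arithmetic behind those figures is simple to check. The short Python sketch below assumes a decimal terabyte (1,000 GB) and uses a made-up helper name, `deduplicated_size_gb`; it only restates the ratios Bantleman quotes and is not a measurement of the product.

```python
def deduplicated_size_gb(raw_gb, ratio):
    """Size after deduplication for a given N:1 space-saving ratio."""
    return raw_gb / ratio


RAW_GB = 1000  # assumption: one decimal terabyte of structured data

for ratio in (20, 40, 60):
    size = deduplicated_size_gb(RAW_GB, ratio)
    print(f"{ratio}:1 -> {size:.1f} GB")
```

At the claimed extremes, a terabyte would shrink to somewhere between roughly 17 GB and 50 GB, with the 40:1 midpoint giving the 25 GB quoted above.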

Other possible applications include software as a service (SaaS) data, little-used DBMS data, reference data from appliances that have filled up their storage, cold data from data warehouses, application archiving, and log and record retention. RainStor can apparently cope with schema and query product changes, and can present a database as it stood at any point in time.

This all sounds too good to be true, but if it actually is true, it could be game-changing technology.

Partner prospects: EMC, Hitachi Data Systems, HP and NEC?

Bantleman says Clearpace technology has been sold by four partners (Informatica is one) and has snared 50 or so blue-chip clients. He expects to add another four partners soon, including AT&T, GE and -- wait for it -- EMC, which stores its Oracle e-business data in a RainStor repository. That's a very nice win.

As EMC has been tussling with Oracle over VMware running Oracle RAC, this idea of performing data deduplication on DBMS records and storing them in Centera must be especially satisfying to EMC, and perhaps explains why Bantleman was quoted in the recent Oracle federated Centera cluster announcement.

With RainStor, EMC's Centera has seen its market opened up to take in structured reference information. That will excite EMC's executives, as Centera gets a clear run (for the time being) at the users in this new market. They will surely think it's game-changing, especially with reference to the EMC-Oracle arena.

This technology could also be of interest to Hitachi Data Systems with its Hitachi Content Archive Platform, and to NEC with its Hydrastor archiving platform, if they see the same lucrative prospects as EMC. Hewlett-Packard (HP) might also be interested if David Donatelli, executive vice president, enterprise servers, storage and networking, was exposed to RainStor when he worked at EMC.

With these partner prospects and potential enterprise customers in mind, Bantleman is setting up a San Francisco headquarters, with research and development continuing in Gloucester in the UK. Bantleman himself will move to San Francisco.

RainStor and the cloud

RainStor 3.5, the latest version, has been improved to handle larger data sets with high record ingest rates. It has multi-tenancy features, with private containers per client, and is suited, Bantleman says, to cloud storage deployment, with the front-end querying happening on-premise and behind the firewall. The queries are then executed against a RainStor repository in the cloud, running on Amazon storage or -- he didn't say this but it has to be an obvious possibility -- EMC's Atmos.

RainStor could be used to shrink the data to be stored in the cloud, with a terabyte becoming the aforementioned 25 GB. The RainStor front-end system would encrypt it and then upload it, making the network link feasible for greater amounts of raw data that would otherwise need an overnighted hard disk drive transfer.

There are huge changes for RainStor here: a new software release; the extension to the cloud; a large increase in partnerships, with EMC obviously needing lots of attention; the company rebranding; and relocating its headquarters from grassy old-world Gloucester to live-wire Silicon Valley.

If Bantleman's right and the technology is as good as he says, and EMC loves it to bits, then, fingers crossed, fame and an IPO fortune await. If he's wrong, then a fate like that of Copan Systems, InPhase Technologies and Verari Systems lies ahead.

BIO: Chris Mellor is storage editor at The Register.
