Instaclustr anomaly detection scaled to 19 billion

The instant clustering aficionados at Instaclustr have created an anomaly detection application capable of processing and vetting real-time events at a uniquely massive scale – 19 billion events per day.

They did it using the open source Apache Cassandra database, the Apache Kafka streaming platform and Kubernetes container orchestration.

Keen to show just how much scale its own managed platform technology could handle, Instaclustr completed the exercise and has made both the detailed design information and the source code available.

According to an Instaclustr white paper, anomaly detection is the identification of unusual events within an event stream – often indicating fraudulent activity, security threats or in general a deviation from the expected norm.

Anomaly detection applications are deployed across numerous use cases, including financial fraud detection, IT security intrusion and threat detection, website user analytics and digital ad fraud, IoT systems and beyond.

“Anomaly detection applications typically compare inspected streaming data with historical event patterns, raising alerts if those patterns match previously recognised anomalies or show significant deviations from normal behaviour. These detection systems utilise a stack of [technologies] that often include machine learning, statistical analysis and algorithm optimisation and [use] data-layer technologies to ingest, process, analyse, disseminate and store streaming data,” notes Instaclustr.
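To make the "compare against historical patterns" idea concrete, here is a minimal sketch of one simple statistical approach: keep a bounded window of recent values per key and flag a new value that sits too many standard deviations from the window's mean. This is an illustration only and not Instaclustr's detector; the class name, the window size and the threshold are all hypothetical, and an in-memory dictionary stands in for the historical event store.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW = 50       # hypothetical: number of past events kept per key
THRESHOLD = 3.0   # hypothetical: standard deviations that count as anomalous


class WindowedDetector:
    """Flags a value that deviates strongly from recent history for the same key."""

    def __init__(self, window=WINDOW, threshold=THRESHOLD):
        self.threshold = threshold
        # In-memory stand-in for the historical store: one bounded window per key.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def check(self, key, value):
        past = self.history[key]
        anomalous = False
        if len(past) >= 2:  # need some history before judging a new value
            mu = mean(past)
            sigma = pstdev(past)
            if sigma == 0:
                # Perfectly flat history: any change at all is a deviation.
                anomalous = value != mu
            else:
                anomalous = abs(value - mu) / sigma > self.threshold
        past.append(value)  # the new event becomes part of the history
        return anomalous


detector = WindowedDetector()
for amount in [10, 12, 10, 12, 10, 12, 100]:
    if detector.check("user-1", amount):
        print("anomaly for user-1:", amount)
```

Each check needs a read of that key's history and a write of the new event, which is why the storage layer's throughput matters so much once event volumes climb.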

The company notes that the challenge comes in designing an architecture capable of detecting anomalies in high-scale environments where the volume of daily events reaches into the millions or billions.

At that volume, a streaming data pipeline application needs to be engineered for mass scale from the outset.

“To achieve this, Instaclustr teamed the NoSQL Cassandra database and the Kafka streaming platform with application code hosted in Kubernetes to create an architecture with the scalability and performance required for the solution to be viable in real-world scenarios. Kafka supports fast, scalable ingestion of streaming data, and uses a store and forward design that provides a buffer preventing Cassandra from being overwhelmed by large data spikes,” notes Instaclustr.
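The buffering effect described in that quote can be sketched with a small simulation. This is a toy model under stated assumptions and uses no Kafka or Cassandra client at all: a plain in-memory queue stands in for the log, a fixed per-tick write budget stands in for the database, and the function name and parameters are hypothetical.

```python
from collections import deque


def run_pipeline(arrivals_per_tick, sink_capacity_per_tick):
    """Simulate bursty arrivals passing through a buffer into a rate-limited sink.

    Returns (writes_per_tick, peak_backlog).
    """
    if sink_capacity_per_tick <= 0:
        raise ValueError("sink capacity must be positive")

    buffer = deque()       # stand-in for the store-and-forward log
    writes_per_tick = []   # what the sink actually had to absorb each tick
    peak_backlog = 0
    ticks = list(arrivals_per_tick)
    tick = 0

    # Keep running after arrivals stop until the backlog has drained.
    while tick < len(ticks) or buffer:
        arrivals = ticks[tick] if tick < len(ticks) else 0
        for _ in range(arrivals):
            buffer.append(tick)  # store: the event waits in the buffer
        peak_backlog = max(peak_backlog, len(buffer))

        batch = min(sink_capacity_per_tick, len(buffer))
        for _ in range(batch):
            buffer.popleft()     # forward: the sink pulls at its own pace
        writes_per_tick.append(batch)
        tick += 1

    return writes_per_tick, peak_backlog


writes, peak = run_pipeline([2, 10, 0, 0], sink_capacity_per_tick=4)
print(writes, peak)
```

In this run the spike of 10 events never reaches the sink as a spike: the sink sees at most 4 writes per tick, the backlog absorbs the rest, and every event is still delivered once the burst has passed.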

Cassandra serves as a linearly scalable, write-optimised database for storing high-velocity streaming data. Taking an incremental development approach, Instaclustr then monitored, debugged, tuned and retuned specific functions within the pipeline to optimise its capabilities.

No mention of customer case study references, just hardcore data crunching based upon open source technologies, and that’s why we quite liked this.