Auto-tech series - Netskope: Automating data classification
This is a guest post for the Computer Weekly Developer Network written by Krishna Narayanaswamy in his role as CTO and co-founder at Netskope – a company known for its cybersecurity technology, which delivers cloud, data and network security to help organisations apply zero trust principles to protect data.
Narayanaswamy suggests that accurate data classification is an essential foundation for best practice in data protection, but that today it remains a largely manual and time-intensive task in most organisations.
He argues that the criticality of the situation is clear: if existing and incoming data streams are not clearly categorised as sensitive or non-sensitive (to use the most basic level of labelling as an example), security systems will not be able to implement security policies and processes such as allocating access permissions and encryption.
Despite classification sitting at the heart of any data protection strategy, Narayanaswamy explains that for information security teams the process is hugely time-consuming and is not an exact science.
Narayanaswamy writes as follows…
Let’s consider the scale and nature of the task. Imagine you are responsible for data classification at a 1,000-person organisation.
One thousand people create a lot of data in a day – all of it with differing levels of sensitivity. Time-saving approaches (such as bulk categorisation built on assumptions about user, role or application) are highly inaccurate, either leaving sensitive data potentially exposed, or incurring unnecessary costs by implementing heavy-handed protections where they are not needed.
Fortunately, two things have happened in recent years which are combining to help the information security professional:
- Firstly, data has moved to the cloud, creating data flows during which it is possible to make use of cloud-based classification support.
- Secondly, the evolution of AI and ML technologies has enabled significant advances in automation of classification.
Identifying sensitive data
Natural language processing (NLP) uses language algorithms to develop machine learning models that can automatically classify a document as sensitive. These models are trained on documents that have already been classified, allowing them to spot patterns that inform the classification of new documents entering the system. NLP can also harness other tools, such as Named Entity Recognition (NER), to spot the names, addresses and account details typically found in sensitive data. Such systems can be of significant help to finance, legal and HR teams.
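To make the entity-spotting idea concrete, here is a minimal, hypothetical sketch: hand-written regular expressions standing in for a trained NER model. The patterns and labels are illustrative only – a production system learns these entities from labelled data rather than hard-coding them.

```python
import re

# Hypothetical patterns standing in for a trained NER model;
# a real system learns entity boundaries rather than using fixed rules.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "iban_like": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def find_entities(text):
    """Return a list of (label, match) pairs found in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((label, match))
    return hits

def classify(text):
    """Label a document 'sensitive' if any entity pattern matches."""
    return "sensitive" if find_entities(text) else "non-sensitive"
```

Even this toy version shows the shape of the workflow: detection of entities first, then a document-level label derived from what was found.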
Convolutional neural networks (CNN) are another automation tool, which work in a slightly different way and particularly help with image-based data. These are deep learning algorithms that can assign importance to specific aspects and objects within an image. This is useful for classifying scanned images of documents which might be sensitive, including passports, credit cards, or corporate identification. These documents all contain sensitive personally identifiable information (PII) that falls under compliance regulations like GDPR and PCI.
Without machine learning and automation tools, such image files can be hard to track and classify, as they do not contain ‘digitised’ information and often evade legacy data scanning technologies.
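Under the hood, a CNN assigns importance to parts of an image through learned convolution filters followed by non-linearities and pooling. The sketch below shows just those mechanics with NumPy – the filter here is hand-crafted for illustration, whereas a real CNN learns its filters from labelled scans (passports, ID cards and so on) during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most deep learning libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Downsample by taking the max over non-overlapping size x size blocks."""
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    return x[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

# Hand-crafted filter that responds to dark-to-bright vertical edges –
# one example of the kind of feature a CNN would learn automatically.
kernel = np.array([[-1., 1.], [-1., 1.]])
image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # left half dark, right half bright
feature_map = relu(conv2d(image, kernel))  # high response at the edge column
pooled = max_pool(feature_map)             # compact summary of where the edge is
```

The feature map lights up only where the edge sits, and pooling condenses that response – stacked and repeated with learned filters, this is what lets a CNN distinguish a passport scan from an ordinary photo.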
Building policy on automated classification
With automated classification and specific labelling in place, organisations can much more easily implement restrictions that limit sensitive data access to only those employees who need it to perform their roles.
It also becomes much easier to determine which data sets warrant more advanced encryption – giving additional security in the event of a data loss incident – and data exfiltration can be better controlled using data loss prevention (DLP) techniques that stop users sending sensitive or critical information outside the corporate network.
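Once labels exist, enforcement reduces to simple rules keyed on the label. A hypothetical sketch (the roles, labels and rule names are illustrative – real DLP products express policy in their own configuration languages):

```python
# Hypothetical policy table keyed on classification labels.
POLICY = {
    "sensitive": {
        "allowed_roles": {"finance", "hr"},
        "encrypt": True,
        "block_external_share": True,
    },
    "non-sensitive": {
        "allowed_roles": {"*"},          # any role may read
        "encrypt": False,
        "block_external_share": False,
    },
}

def check_access(label, role):
    """Allow access only to roles listed for the document's label."""
    rule = POLICY[label]
    return "*" in rule["allowed_roles"] or role in rule["allowed_roles"]

def check_share(label, destination_is_external):
    """Block external sharing of labelled-sensitive content."""
    rule = POLICY[label]
    return not (destination_is_external and rule["block_external_share"])
```

The point of the sketch is the dependency: none of these rules can fire correctly unless the classification label feeding them is accurate in the first place.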
With automated data classification and stronger image inspection capabilities, new use cases can develop hand-in-hand with digital transformation and behavioural changes within an organisation. Imagine, for instance, the ways specific watermarks could be placed into internal-only presentation decks in the manner of a £10 note.
Any internal team member sharing screenshots of internal-only conference call content could be easily blocked and reminded of policy.
Adoption & implementation
The process of adoption and implementation is broken into two parts: model building and data classification automation. Model building is the process where AI and machine learning models are developed for different categories of documents that are considered sensitive.
This requires the acquisition of accurately labelled data and the choice of the right algorithm to apply. Some commercial DLP solutions come with prebuilt models for common data sets like photo IDs, CVs, etc. Data classification automation is the development of a pipeline that can connect to cloud services and iteratively analyse the data stored there.
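The two-part split above can be sketched as a loop that pulls objects from cloud storage and applies a pre-built model. Everything here is hypothetical: `fetch_documents` stands in for a real cloud connector, and keyword scoring stands in for a trained classifier.

```python
def fetch_documents():
    """Hypothetical stand-in for a cloud-storage connector."""
    return [
        {"name": "passport_scan.png", "text": "passport number X1234567"},
        {"name": "lunch_menu.txt", "text": "soup of the day"},
    ]

# Stand-in for a pre-built model: keyword matching instead of trained weights.
SENSITIVE_TERMS = {"passport", "iban", "salary", "national insurance"}

def model_predict(text):
    return any(term in text.lower() for term in SENSITIVE_TERMS)

def run_pipeline():
    """Iteratively analyse stored data and attach a label to each object."""
    labels = {}
    for doc in fetch_documents():
        labels[doc["name"]] = "sensitive" if model_predict(doc["text"]) else "non-sensitive"
    return labels
```

In practice the connector would page through cloud APIs continuously and the model would be retrained as new labelled examples arrive, but the pipeline's shape – fetch, predict, label – stays the same.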
Enterprises will continue to create data at an exponential rate, especially with the rise in IoT traffic, and data protection regulations are only going to grow. Automating data classification is an obvious win for organisations, sparing IT teams a laborious manual process.
Ultimately, AI streamlines the decision-making process, improving efficiency and reducing the risk of data being misclassified or inadequately protected, while freeing up IT and security teams to focus on other problems that can’t be solved through automation.