The role of classification in data protection

How do you secure the flood of data in a company? Richard Chirgwin looks at the revival of interest in data classification as a business security tool.

It’s the kind of story nobody wants to see happen to them. According to the Sydney Morning Herald, a confidential police interview of former judge Marcus Einfeld leaked because it was accessible on the computers of a transcription service, APT Transcriptions.

The story (if accurate) illustrates the gap between government and private sector bodies. While government staff routinely work with information classified as public, secret or top secret, such terms sound melodramatic or even quaint to most in the private sector.

But that may be changing. Even medium-sized businesses have to deal with more data and a wider range of applications generating that data.

“There’s an absolute flood of information-generating tools,” says Michelle Phillips, marketing manager for Information Management at HP Software and Solutions, “so the responsibility for managing organizational content or data assets can no longer remain in the hands of a few specialists.”

That’s leading private sector Chief Security Officers to take a fresh look at data classification – a discipline attracting attention from vendors new and old as a key part of data loss prevention and data protection strategies.

Why data protection matters

Generally, IT security has focused on networks and hosts: you secured your hosts (servers and desktops) to try to stop them being compromised by people with access to your network; and you secured networks to protect them against unauthorized access.

In that (simplified) model, the security of data tended to be a function of the security of its host. If you had sensitive data, you put it on a server accessible by only a few people.

That, according to RSA ANZ pre-sales engineering manager Greg Singh, is no longer enough. While access-level security is still required, if that’s all you have “it’s like saying once someone has the right to go inside the bank, they can take whatever they want.”

Amichai Shulman, CTO of Imperva, explained that simple access security “does not provide enough granularity to describe the access levels permitted to different types of user ... and requires a lot of maintenance.”

For example, Shulman told TechTarget ANZ, a healthcare administrator might require access to a patient record to maintain information such as the patient’s contact details; but that person should not be able to see medical history. This, he said, is difficult to implement at the user/group level.

Data classification, he explained, allows organizations to map data types against sensitivity and access, and with the right tools, this can be substantially automated.

Some markets are already familiar with these issues: the more sensitive government bodies (security, defence and finance ministries, to name a few), along with larger or heavily regulated enterprises, such as those in the financial industries.

However, as the APT Transcriptions story illustrates, data protection is an issue that reaches all the way down to medium and smaller businesses.

Three steps to (partial) success

Data classification boils down to some quite straightforward steps – the devil, as always, will be in the detail.

Shulman nominated the steps as discovery; classification; and securing the discovered data according to the classification you apply.
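In outline, those three steps form a simple pipeline. The Python below is purely illustrative – the keyword test, the labels and the actions are invented stand-ins for what real discovery and classification tooling does, not any vendor’s product:

```python
# Hypothetical sketch of the three steps: discovery, classification,
# and securing data according to the classification applied.

SENSITIVE_MARKERS = ("tax file number", "credit card", "medical history")

def discover(store):
    """Discovery: enumerate every document, wherever it lives."""
    return [(name, text) for name, text in store.items()]

def classify(text):
    """Classification: a crude keyword test stands in for real tooling."""
    hits = [m for m in SENSITIVE_MARKERS if m in text.lower()]
    return "sensitive" if hits else "public"

def secure(name, label):
    """Securing: map each label to a protection action."""
    action = "restrict access" if label == "sensitive" else "no action"
    return f"{name}: {action}"

store = {
    "newsletter.txt": "Staff picnic on Friday.",
    "patient-42.txt": "Medical history: see attached notes.",
}
for name, text in discover(store):
    print(secure(name, classify(text)))
```

The interesting engineering all hides inside `discover` and `classify` – which is exactly where, as the interviewees note below, the devil lives.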

Discovery is the challenge that most of the tools seek to address, and it’s the challenge that has run ahead of most enterprises’ capabilities. Even if we ignore the host of ways in which data can leave the “safe” world (laptops, USB keys, burned DVDs and so on), data is too easily duplicated away from its intended use.

For example, a sensitive document once e-mailed among employees exists in their mailboxes and, if forwarded, in their sent mail folders; a database may be created in the development environment and copied to the live server; and that’s not counting systems using content caches to improve performance.

“You can’t shy away from the fact that an organisation has to assess the types of information assets it creates, and how it looks after them,” said Phillips. No matter what tools exist to help, “you need intelligence and understanding at the grassroots level.”

CA Technologies’ principal consultant for security, Trevor Iverach, said “the tools you use need to have intelligence based on a particular country, and what’s important in that country’s regulatory environment. In America, Sarbanes-Oxley awareness is important; in Australia, you might be watching out for tax file numbers.”

“You need intelligent discovery – not just identifying 12 or 16 random numbers as a credit card, but determining the context of the information, the identity of the user, where the information is stored, and what the user intends to do with it.”
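Iverach’s distinction between a random run of digits and a plausible card number can be illustrated with the Luhn checksum, a standard first-pass filter. Real tools then layer on the contextual checks he describes; the sketch below covers only that first step:

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: rejects most random digit strings."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def candidate_cards(text: str):
    """Find 13-16 digit runs that also pass the Luhn check."""
    return [m for m in re.findall(r"\b\d{13,16}\b", text) if luhn_valid(m)]

# 4111111111111111 is a well-known Luhn-valid test number;
# 1234567812345678 is sixteen digits but fails the checksum.
print(candidate_cards("card 4111111111111111, invoice 1234567812345678"))
```

Even this filter only says “could be a card number” – deciding whether it *is* one still depends on the surrounding context, identity and location Iverach lists.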

RSA’s Singh agrees.

“If I told you to organize your laptop properly, you’d probably find it a scary job.” Now expand that to a server, or a lot of servers – and remember that the task is open to human error: “The document looks fine, but there’s sensitive information on page 37.

“Some things can be pre-packaged – credit card numbers, tax file numbers, ABNs, resumes, intellectual property. Others are more specialized and need consulting work.”


Singh notes that the process of classification can expand the scope of knowledge needed by the security professional. In the RSA solution, he explained, some of the discovered data will be quarantined for inspection because it falls outside the categories of the discovery tool. In that case, the security professional has to decide how to treat the quarantined data – meaning their role is no longer just understanding the security systems; they also need to understand how particular types of data fall within corporate data protection policies.

Shulman identifies two broad approaches to the tools that help automate classification: name-based and context-based. Both, he says, have their constraints.

While fast, a name-based tool is constrained by its simplicity: it can identify a pattern as a date, for example – but a date may not be sensitive information until it’s associated with a credit card number and becomes the expiry date.

Another constraint nominated by Shulman is that in an environment like a database table, column names may not have any relationship to the data they contain: they might simply be creations of the developer’s convenience.

On the other hand, while a context-based tool can discover relationships such as that between a credit card number and its expiry date, it’s more time-consuming. Imperva’s approach is to combine the two: a “name-based” classification for speed, followed by a context-based analysis to improve the accuracy of the results.
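A toy version of that two-pass idea might look like the following – the column-name hints and heuristics are invented for illustration, not Imperva’s actual logic:

```python
import re

# Hypothetical two-pass classifier: a fast name-based scan of column
# names, then a slower context check on the data those columns hold.

def name_pass(columns):
    """Fast: guess from column names alone (developers may mislabel these)."""
    labels = {}
    for col in columns:
        if "card" in col:
            labels[col] = "card_number"
        elif "date" in col or "exp" in col:
            labels[col] = "date"
    return labels

def context_pass(table, labels):
    """Slow: a date only becomes sensitive once the table also
    holds something that looks like a card number."""
    has_card = any(
        re.fullmatch(r"\d{13,16}", str(v))
        for col, label in labels.items() if label == "card_number"
        for v in table[col]
    )
    out = {}
    for col, label in labels.items():
        if label == "date":
            out[col] = "card_expiry" if has_card else "nonsensitive_date"
        else:
            out[col] = label
    return out

table = {"cardno": ["4111111111111111"], "exp_date": ["12/27"], "notes": ["ok"]}
print(context_pass(table, name_pass(table.keys())))
# {'cardno': 'card_number', 'exp_date': 'card_expiry'}
```

Note how the same `exp_date` column would be waved through as a harmless date in a table with no card numbers – which is precisely the nuance a name-based pass alone misses.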

The results of an analysis like this, Iverach said, should be viewed through the prism of its impact on a company – and this is specific to the company. He cited Salesforce.com as an example: “For a company like cloud-based provider Salesforce.com, one data breach would have a significant impact – but a second breach might destroy the business model entirely.”


It’s only when you know what you’ve got, and you have it classified, that you can start applying granular data-level protection to it.

At bottom, data protection looks a lot like more traditional security – only there’s more of it. Instead of applying the access controls to a server or a network, they’re applied to a data entity like a document (yes, data protection also applies to databases, but there isn’t enough space to give that the attention it deserves).

At some level of classification, simple access controls (such as putting a document in a particular folder with appropriate access controls) may suffice. Other documents may need further protection, such as encryption, with some solutions (such as RSA’s) designed to manage the key exchange for users with appropriate access. And, of course, some of the discovered data may need to simply be deleted.
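That graduated response amounts to a policy table mapping classification levels to protections. A minimal sketch, with hypothetical labels and actions:

```python
# Hypothetical policy table for the graduated protections described
# above: folder ACLs, encryption with managed keys, or deletion.

POLICY = {
    "public":       "no action",
    "internal":     "restricted folder ACL",
    "confidential": "encrypt; manage keys centrally",
    "redundant":    "delete",
}

def protection_for(label: str) -> str:
    # Fail safe: treat anything unclassified as confidential.
    return POLICY.get(label, POLICY["confidential"])

print(protection_for("internal"))   # restricted folder ACL
print(protection_for("unknown"))    # encrypt; manage keys centrally
```

The fail-safe default matters: data the classifier couldn’t place is exactly the quarantined material Singh describes, and it should err on the side of protection until a human rules on it.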

As HP’s Phillips points out, some quite established environments are finding a new role in a world where regulators increasingly mandate data protection.

Document management systems like HP’s TRIM, she explained, address issues of control over data, and have done so since 1985.

If all a company’s records are managed in a single repository and don’t get siloed across different platforms, she said, then data security is also improved.

“We can apply security via a business classification scheme, or at different levels – the user might see the whole document, or only the title of the document, or nothing at all.”

Iverach notes that whatever protection scheme is applied to the data, it has to integrate well with the existing user-based security systems, because the protection you apply to data has to be able to respond to changes in a user’s roles: what’s appropriate to someone working in financial administration won’t be appropriate if they move to sales.

On the outside

However robust your internal data protection, it can all come undone once you have to send information to external partners.

Oddly enough, while organizations are wary of cloud-based solutions for security reasons, a cloud-like approach can help protect data that has to be handled by outsiders. As Phillips pointed out, a Web interface to internal documents (as provided by most serious document management solutions) can give third parties access without sending the document itself.

When it goes wrong

Data classification and data protection will augment access and network security – but no scheme is perfect. Some users will accidentally do things that compromise data; others will do so deliberately but not understanding either corporate policy or the risk they’re taking; and still others will be malicious.

Evan Stubbs, solutions manager for analytics at SAS, highlights the role of analytical software in identifying patterns of user behaviour that might, if unchecked, bring a company’s data protection schemes to naught.

The capabilities the company provides for fraud solutions are also applicable to more generalized data protection problems: identifying behavioural patterns so that you can flag potentially aberrant behaviour.

“Things like user accounts and login times tend to be quite structured. When you start looking at access to documents and reports, and matching that back against roles and responsibilities, you can find patterns that are different to what you’d normally expect from a person in a particular role.”

By including analytics as part of the overall data protection strategy, Stubbs said, companies can guard against “the unknown unknowns”.

“Rules are very easy to codify – but you can only codify what you already know. Analysis means you can pick up things you weren’t already aware of or weren’t already investigating.”
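A crude version of that baseline-and-flag idea can be sketched in a few lines. Real analytical software is statistical and far subtler; the names and log format below are invented for the example:

```python
from collections import Counter

# Hypothetical behavioural baseline: learn which documents people in
# each role normally access, then flag accesses with no precedent
# for that role.

def role_baseline(access_log, roles):
    """Count, per role, which documents its members have accessed."""
    baseline = {}
    for user, doc in access_log:
        baseline.setdefault(roles[user], Counter())[doc] += 1
    return baseline

def flag_anomalies(access_log, roles, baseline):
    """Flag accesses to documents never before seen for that role."""
    return [(user, doc) for user, doc in access_log
            if baseline.get(roles[user], Counter())[doc] == 0]

roles = {"ann": "finance", "bob": "finance", "cat": "sales"}
history = [("ann", "ledger.xls"), ("bob", "ledger.xls"), ("cat", "leads.csv")]
baseline = role_baseline(history, roles)

today = [("bob", "ledger.xls"), ("cat", "ledger.xls")]
print(flag_anomalies(today, roles, baseline))  # [('cat', 'ledger.xls')]
```

Bob reading the ledger is routine for finance; a salesperson reading it has no precedent in the baseline, so it gets flagged – a rule nobody had to write in advance, which is Stubbs’ point about the unknown unknowns.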

SAS has also found wide application for social network analysis, since social networking sites are not only a common vector for data leakage, they can also indicate when that data leakage is malicious.
