Tip

How to find credit card numbers and other sensitive data on your users' computers

If you worry that important and sensitive data like credit card numbers is lurking on your users' hard drives, read on to learn how to search for and corral this information.

Mike Chapple, University of Notre Dame

Published: 05 May 2009

There's little question that any security manager who's suffered through a lost laptop incident knows what aspect of the ordeal causes an organisation real damage. When a laptop goes missing, it's not the loss of a $2,000 asset that causes heartburn; it's the fear, uncertainty and doubt that results from not knowing whether sensitive information was stored on the missing device.

Fortunately, security professionals may take advantage of a number of sensitive information discovery tools to identify and eradicate sensitive information stored on endpoint devices.

Sensitive information discovery algorithms
efore we delve into the tools available to assist with the search, it's important to have a basic understanding of the algorithms used to detect sensitive numbers. Only by understanding how these algorithms work is it possible to judge the effectiveness of individual scanning tools. We'll specifically look at two types of sensitive numbers commonly sought out by sensitive data discovery tools: credit card numbers and Social Security numbers.

Credit card numbers issued by the major providers follow a standard format that makes it easy to detect them using regular expressions. The rules for valid numbers include:

Visa numbers have either 13 or 16 digits and always start with a 4.
MasterCard numbers have 16 digits and always start with a 5, followed by a digit between 1-5.
American Express numbers have fifteen digits beginning with 34 or 37.
Discover Card numbers have 16 digits beginning with 6011, 622, 644-649 or 65.

These guidelines are a great starting point for ruling out quite a few false positives because they can easily be adapted to a regular expression. For example, the following regular expression can be plugged into a search tool to find potential Visa card numbers, even if there are whitespace characters between the groups of four digits:

\b4\d{3}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}[ -]\b

There's also a validation algorithm built into credit card numbers that provides even greater confidence in a match. The Luhn algorithm verifies that a card number passes the "check digit" test, which enables error detection by identifying number patterns that are known to be invalid. The algorithm works by summing all the card number digits and then performing the mod 10 operation on the sum. For those of you forgetting high school math, to perform the mod 10 operation, simply divide the number by 10. The integer remainder is the result. For example, let's verify the following credit card number:

4128 0057 1492 1925

Add the first 15 digits together, which produces a sum of 55. Divide that by 10 and you get "5 remainder 5". In other words, the remainder (5) matches the last digit of the card number (also 5), so you know it is potentially valid. This algorithm does not, of course, confirm the number corresponds to an active account, but it does provide additional confidence in your match.

Social Security numbers, on the other hand, are not quite as easy to match because there is no Luhn algorithm equivalent to verify their validity. You can look for patterns of nine digit numbers surrounded by white space and take advantage of a few clues to help with the search:

SSNs are often (but not always) written with hypens between the digits, in the form xxx-xx-xxxx. If you're willing to accept the risk of missing unformatted numbers, you can restrict your search to numbers hyphenated in this pattern to dramatically reduce false positives.
SSNs will never have all 0's in a digit group (i.e. 000-xx-xxxx, xxx-00-xxxx or xxx-xx-0000).
SSNs will never begin with 666, 732-749 or any number higher than 772.
Given the first three digits of an SSN, you can determine the highest possible values for the next two digits by consulting the Social Security Administration's High Group Number list.

Software tools to assist in the search
Unless you're looking for an adventure, it's not necessary to write your own code to implement these searches. There are a variety of open source and commercial products available to assist you in detecting these sensitive numbers on enterprise systems. Some examples include:

Cornell University's Spider
University of Texas at San Antonio's Sensitive Number Finder
Identity Finder

These tools use the algorithms described above and allow you to tinker with the settings, such as whether to restrict a search to formatted numbers, numbers in particular file types and other parameters.

Managing sensitive information and data
After deciding upon a search strategy for finding potentially sensitive information, the next step is to decide on a strategy for managing the mountains of results data.

There are two basic approaches to this problem: centralised review or decentralised authority. In the centralised approach, the tools report all results to a central administrator who is responsible for validating and eradicating suspicious data. This is an extremely time-consuming process and taxes valuable IT resources. However, it ensures consistency of rule interpretation and the thorough review of findings.

In the decentralised approach, end users are given responsibility (and accountability) for reviewing results. This distributes the workload among the entire workforce and provides the added benefit of having staff with contextual knowledge perform the review. For example, a staffer who knows that an Excel spreadsheet contains information about parts orders may be able to immediately disregard reports of SSNs in that document, while a centralised reviewer might not know the difference between that and any other file.

The downside of this approach is obvious: It's a lot harder to get all the individuals in an organisation to search their systems than it is to have a centralised staff perform the searches as a core responsibility. If you choose the decentralised approach, you'll probably want to develop a reporting mechanism to allow individual employees to provide progress reports, allowing you to track down those that ignore your requests. You'll also want to provide detailed training, perhaps through the use of screencast videos or detailed documentation, walking users through the scanning process. Finally, you'll need to make technical support available to users who have difficulty performing or interpreting the scans.

Scanning systems for sensitive data is a complex problem but, fortunately, there are a variety of tools and techniques available to assist in the process. Minimisation, the searching and eradication of sensitive information on endpoints, is a powerful strategy in the arsenal of security administrators seeking to reduce enterprise risk.

How to find credit card numbers and other sensitive data on your users' computers

If you worry that important and sensitive data like credit card numbers is lurking on your users' hard drives, read on to learn how to search for and corral this information.

Read more on Security policy and user awareness

personally identifiable information (PII)

ACLU of RI Sues RIPTA, UnitedHealthcare Over Healthcare Data Breach

Phishing Attack on Five Rivers Health Impacts Data of 156K Patients

Data of 3.3M 20/20 Hearing Care Patients Hacked From Cloud Database