Global data quality management: local headaches, international rewards

Global data is varied. It originates from many countries, contains country-specific values and is provided and stored in different languages and character sets. Learn how to handle it.

Thanks to the Internet, even the smallest business can be global. Customers from all over the world can see what a business offers, and if the company is prepared to operate beyond its own country’s borders, these global customers can provide sales opportunities that didn’t exist previously.

Larger businesses that have offices worldwide are also moving to operate on an international basis, centralizing databases and business operations to gain efficiencies and save valuable funds. However, many of those who have ventured into the global arena have found that managing the international data that enables truly global business operations is not an easy task.

Global data introduces variation: it originates from multiple countries, contains varying local formats and country-specific values, and is stored in different languages and character sets.

When dealing with global data, there is much to take into consideration. For example, the differences in international addresses are vast. Not only are there thousands of languages around the world (roughly 7,000 by most counts), but there are also more than a hundred different addressing formats, some fairly similar, others quite diverse.

Naming conventions for both individuals and businesses vary by country. Many languages have alphabets that include diacritical marks, and some writing systems are logographic, using characters that stand for whole words or ideas rather than individual letters. Other data elements, like dates, times and phone numbers, also differ.

For effective global data quality management, these and other challenges need to be understood and properly managed.

Address formats. Currently, there are 131 different address formats in the world. At the most extreme, some start with the street information and end with the postal code, while others begin with the postal code and end with the street information.

Some postal systems have preferences about the casing of address elements and also the use of abbreviations and punctuation. Likewise, some countries prescribe the use of symbols in a precise manner in order to denote elements of a complex address.
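As a rough illustration of why the order of address elements cannot be hard-coded, the sketch below renders the same components with a different line layout per country. The templates, the `format_address` helper and the sample data are hypothetical simplifications, not official postal formats; real formats carry many more rules about casing, abbreviations and symbols.

```python
# Hypothetical, heavily simplified line layouts keyed by ISO country code.
# They only illustrate that element order differs between countries.
ADDRESS_TEMPLATES = {
    # Street first, postal code last.
    "US": ["{recipient}", "{house_number} {street}", "{city}, {region} {postal_code}"],
    # House number follows the street; postal code precedes the city.
    "DE": ["{recipient}", "{street} {house_number}", "{postal_code} {city}"],
    # Postal code first, narrowing down to street and recipient.
    "JP": ["{postal_code}", "{region} {city}", "{street} {house_number}", "{recipient}"],
}


def format_address(parts: dict, country: str) -> str:
    """Render address components using the layout for the given country."""
    template = ADDRESS_TEMPLATES[country]
    return "\n".join(line.format(**parts) for line in template)


sample = {
    "recipient": "A. Example",
    "house_number": "12",
    "street": "Main St",
    "city": "Springfield",
    "region": "IL",
    "postal_code": "62701",
}

if __name__ == "__main__":
    for code in ADDRESS_TEMPLATES:
        print(f"--- {code} ---")
        print(format_address(sample, code))
```

Storing the components separately and applying the layout only at output time keeps the data itself independent of any one country's format.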

Postal address maturity. Postal systems vary around the world. Many countries are just beginning to develop formal processes. In many countries, mail is not delivered to premises. Post office boxes fill the gap, and people and companies travel to the post office to pick up their mail.

This varying maturity means that postal address files used for data validation are often limited and may even be ambiguous, as many entries might only contain a city name and country.

Address formats may also evolve over time. Old, outdated formats are replaced with newer ones, but the use of new formats takes time to be accepted by the entire population. While this happens, both formats must be anticipated as part of a global data management strategy.

Personal names. The components and order of personal names can differ by country and also by culture. These diverse possibilities highlight the problems associated with using the common form of “First Name” and “Last Name.” In the majority of countries, the expectation is that names are shown as given name followed by family name. But that isn’t the case in many places. In fact, some countries have multiple options based mainly on culture or religion.
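One way to move beyond fixed "First Name" and "Last Name" fields is to store name parts by role and decide the display order separately. The following is a minimal sketch; the `PersonName` class and its `family_first` flag are hypothetical, and real naming conventions (patronymics, multiple family names, honorifics) need a richer model.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Name parts stored by role rather than by position."""
    given: str
    family: str = ""            # may be empty for people who use a single name
    family_first: bool = False  # hypothetical flag: display family name first

    def display(self) -> str:
        # Choose the order from the flag, then skip any empty part.
        if self.family_first:
            parts = [self.family, self.given]
        else:
            parts = [self.given, self.family]
        return " ".join(part for part in parts if part)


if __name__ == "__main__":
    print(PersonName("Kathy", "Hunter").display())
    print(PersonName("Taro", "Yamada", family_first=True).display())
    print(PersonName("Sukarno").display())
```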

Regional variations. Names and addresses are just the start. Many aspects of customer information have variations based on location:

Phone numbers

There is a standard format for international phone numbers (ITU-T E.164: a country code of one to three digits followed by the national number, up to 15 digits in total). However, each section of a number can have a different length, including the country code. The overall length of a phone number changes by country as well. To make numbers easier to read, people often add characters like hyphens, parentheses and even slashes to them, potentially causing confusion within systems.
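A common first step is to strip the readability characters before storing or comparing numbers. The sketch below does only that; `normalize_phone` is a hypothetical helper, and it does not validate lengths or country codes, which would require country-specific numbering rules.

```python
import re


def normalize_phone(raw: str) -> str:
    """Remove spaces, hyphens, parentheses, slashes and similar characters.

    A leading '+' (international prefix) is preserved; everything else
    that is not a digit is dropped.
    """
    raw = raw.strip()
    prefix = "+" if raw.startswith("+") else ""
    digits = re.sub(r"\D", "", raw)
    return prefix + digits


if __name__ == "__main__":
    for example in ["+44 (20) 7946-0958", "+1 (555) 010-9999", "030/1234 5678"]:
        print(example, "->", normalize_phone(example))
```

Note that this naive approach mishandles conventions such as a bracketed trunk prefix, as in "+44 (0)20 ...", where the zero should be dropped for international dialing; cases like that are why country-aware parsing is usually needed on top.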

Business naming conventions

Types of business (public companies, limited companies, partnerships, sole traders and so on) are generally country-specific. When building a business database, it’s important to understand the different types so that business details can be standardized and duplicate entries avoided.
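To show how standardizing legal forms helps catch duplicates, here is a minimal sketch that maps trailing legal-form variants to one canonical token before comparing names. The `LEGAL_FORMS` table and the `business_match_key` helper are hypothetical and deliberately tiny; a real table would be maintained per country, and some countries place the legal form before the name.

```python
import re

# Hypothetical mapping of legal-form spellings to a canonical token.
LEGAL_FORMS = {
    "ltd": "LTD",
    "limited": "LTD",
    "inc": "INC",
    "incorporated": "INC",
    "gmbh": "GMBH",
    "sa": "SA",
}


def business_match_key(name: str) -> str:
    """Build a comparison key: drop punctuation, lowercase the name,
    and canonicalize a trailing legal form if one is recognized."""
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    if tokens and tokens[-1] in LEGAL_FORMS:
        tokens[-1] = LEGAL_FORMS[tokens[-1]]
    return " ".join(tokens)


if __name__ == "__main__":
    for name in ["Acme Ltd.", "ACME Limited", "Acme GmbH", "Exemple S.A."]:
        print(name, "->", business_match_key(name))
```

The key is used only for matching; the name as the customer supplied it should still be stored unchanged.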

Post office boxes

As mentioned above, post office boxes have great importance in some countries because of the lack of premises-level postal deliveries. It is often the case that a P.O. box will be used for a mailing address and should be stored alongside the physical address of a company, if one exists.
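In data-model terms, this means a company record needs room for both kinds of address rather than a single address field. A minimal sketch follows; the `CompanyAddresses` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompanyAddresses:
    """Keeps the mailing P.O. box and the physical address side by side."""
    company: str
    mailing_po_box: Optional[str] = None    # e.g. "P.O. Box 1234"
    physical_address: Optional[str] = None  # may not exist at all

    def mailing_target(self) -> Optional[str]:
        # Prefer the P.O. box for mail; fall back to the physical address.
        return self.mailing_po_box or self.physical_address


if __name__ == "__main__":
    record = CompanyAddresses("Acme", "P.O. Box 1234", "12 Main St")
    print(record.mailing_target())
```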

Cultural influences. It is often easier to deal with language barriers than to overcome cultural issues. Unfortunately, getting this wrong can have a dire impact on people’s perception of an organization. Care must be taken to handle the strict protocols that some cultures have for names and addresses.

A company’s reputation can be damaged when personal details are presented in a culturally disrespectful way. A combination of access to local knowledge and excellent language skills is the best way to overcome these issues.

Diacritics and other special characters. A diacritic is a mark attached to a letter to change its pronunciation or stress. Diacritics are integral to many alphabets. As a result, when dealing with global data, there likely will be quite a number of diacritical marks that need to be maintained. For example, languages like German and French present issues because of accented characters, such as the umlaut (ü) in German and the acute accent, or accent aigu (é), in French.
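Maintaining diacritics consistently is harder than it looks, because the same visible character can be stored either as one precomposed code point or as a base letter plus a combining mark. The sketch below uses Python's standard `unicodedata` module to normalize stored text and to build a separate accent-insensitive key for matching; the helper names `to_nfc` and `fold_for_matching` are hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize to composed form (NFC) so equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", text)


def fold_for_matching(text: str) -> str:
    """Build an accent-insensitive, case-insensitive key for matching only.

    The text is decomposed (NFD), combining marks are dropped, and the
    result is case-folded. The original value should still be stored.
    """
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


if __name__ == "__main__":
    composed = "M\u00fcller"      # ü as a single code point
    decomposed = "Mu\u0308ller"   # u followed by a combining diaeresis
    print(composed == decomposed)          # False: different code points
    print(to_nfc(decomposed) == composed)  # True after normalization
    print(fold_for_matching(composed))     # muller
```

Simply dropping the mark is a crude choice for matching; in German, for example, ü is conventionally transliterated as "ue", so language-specific rules may be preferable.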

Diacritical marks are just the beginning of issues related to different character sets. Around the globe, individual languages use a wide variety of character sets, and many of the world’s emerging markets are in areas that use the ones least familiar to Western systems. For example, some writing systems, including those of several Asian languages, are based on logograms rather than letters. Other scripts that differ from the Latin alphabet include Cyrillic (used in Russia and Bulgaria), Greek, and the scripts used in the Middle East.

When data is managed locally, there is generally no need to put anything special in place to deal with a country’s character sets because the existing local IT infrastructure will have been set up to handle that process. However, when data is managed globally or even regionally, multiple character sets can become quite a problem.

Character sets require encoding so they can be supported within systems. These encodings are commonly known as code pages—tables of mappings that manage the relationships between the codes and the characters they represent.
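To see why mixing code pages causes trouble, consider what happens when the same stored byte is read under different ones. The sketch below decodes a single byte with three Windows code pages using Python's built-in codecs; the `decode_with` helper is hypothetical.

```python
def decode_with(raw: bytes, code_pages: list[str]) -> dict[str, str]:
    """Decode the same bytes under several code pages and collect the results."""
    return {code_page: raw.decode(code_page) for code_page in code_pages}


if __name__ == "__main__":
    raw = bytes([0xE9])  # one byte, three possible meanings
    results = decode_with(raw, ["cp1252", "cp1251", "cp1253"])
    for code_page, character in results.items():
        print(code_page, "->", character)
    # cp1252 (Western European) -> é
    # cp1251 (Cyrillic)         -> й
    # cp1253 (Greek)            -> ι
```

Unless the code page is recorded alongside the data, there is no way to tell from the byte alone which character was intended.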

Unicode is a single standard that brings all known character sets together in a way that is language-independent, replacing the patchwork of separate code pages. Using Unicode greatly simplifies the management of multiple character sets within IT systems and helps safeguard against data corruption. To build a single, centralized database for global data management, it is essential to use Unicode; currently, UTF-8 is the dominant character encoding for Unicode implementations.
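The practical difference is easy to demonstrate: a single code page can hold only some of the names a global database will receive, while UTF-8 can hold all of them. The following sketch checks this with Python's built-in codecs; the `can_encode` and `utf8_length` helpers and the sample names are hypothetical.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


SAMPLE_NAMES = ["Müller", "José", "Иван", "Ελένη", "山田"]

if __name__ == "__main__":
    for name in SAMPLE_NAMES:
        print(
            name,
            "| cp1252:", can_encode(name, "cp1252"),
            "| utf-8:", can_encode(name, "utf-8"),
            "| utf-8 bytes:", utf8_length(name),
        )
```

UTF-8 is a variable-length encoding, so byte counts differ from character counts; column sizes in a centralized database should be planned with that in mind.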

Is Going Global Worth It?

With all the issues that need to be handled, it would be easy to think going global on data management isn’t worth the effort. However, the challenges are not insurmountable, and the rewards can be great. An excellent reference source for managing global data is The Global Sourcebook for Name and Address Data Management, written and published by Graham Rhind; I swear by it.

With a clear understanding of the challenges and in-depth knowledge of the multinational data that needs to be managed, you can successfully unlock your organization’s global data potential.

Kathy Hunter is an information management consultant at Kynetika. She has more than 20 years of information systems experience and started her focus on information quality improvement 13 years ago.
