Show me the proof

In an age of increasing computerisation there are times when we need to actually question what the computer is telling us. Take spreadsheets for example. It is not uncommon to discover that the built-in functions can behave differently to the underlying macro and programming languages.

An example of this is the simple rounding problem. The built-in function may round up on a 5 while the equivalent function in code may round down. Use a mix and they may balance out and they may not. You also have to be aware of the number of decimal places that you are working with in the underlying data and then how many places you display on screen.

In a complex spreadsheet you can find many examples of rounding but in a BI environment where you are simply importing the data into a spreadsheet you may have, literally, hundreds of small errors. The rounding error is something that you can fix through settings but strangely enough, few people do. Much more serious is the complex functions that few users want to understand beyond the simple entry of numbers and the delivery of a result.

So why isn’t this constantly raised as an issue? Actually, it used to be. Visicalc, SuperCalc, Lotus, Smart, Excel, PlanPerfect, Quattro Pro, OpenOffice – all of these products have had and still have this issue. But, over time, as vendors have refused to acknowledge or fix it, people have given up trying to get a solution.

That solution would require any spreadsheet vendor to send their function library to an independent source to have it properly validated. This has been suggested but every vendor has resisted it because, ironically, the last thing that they want is to be told they are doing things in the wrong way.

The impact on a database world might seem negligible but it isn’t. User Defined Fields in some classes of databases use functions to calculate values based on other fields. The more complex the underlying data, the less chance of users actually examining that data to check for errors.

In financial and scientific fields these function errors can be serious. For example, if you are doing computational fluid dynamics (CFD) for a Formula 1 racing car and you have a built in function that is calculating a key value wrongly, your car could be slower than the competition. It doesn’t take much performance degradation to cost you places and waste large amounts of money.

In aircraft, rocket or missile technology, a mistake could take a long time to be spotted given the underlying complexity of the work. For those in financial market, there are often a lot of other systems that may provide the checks and balances but those systems are becoming ever more tightly integrated which means that data may be a calculated field in one system but entered as a value in the next. Now the problem cannot be traced to its source and has assumed a more serious proportion.

There are instances, for example, where such errors are unlikely to have much impact at all on the decisions made. For example, in a retail environment when looking at the sales and stocking levels of goods. However, if you were looking at gross margin, a product could move from marginal to drop.

BI products are aimed at allowing users to extract data from these underlying sources, do a range of manipulations on that data and then use the results to make business decisions. In some cases, that data is returned to the databases and used by other users and teams as source data for their work. As in the financial market example, the error is now compounded.

Recently, I spoke with Aster Data, a vendor of Massively Parallel Processing database solutions and we talked about a recent press release in which they announced that they had added over 1,000 MapReduce-ready functions to their data analytics suite. Such a massive increase in the number of functions without any third-party validation has to be of concern.

Aster Data say that they have not had any requests from customers to validate their functions or prepackaged applications. They do see a lot of requests from customers for more functions and applications and believe that customers are already starting to build their own use cases to validate data.

While Aster Data customers might be doing their own checks there is no widespread evidence that this is a common practice among users of spreadsheets, BI tools or databases. With the currently explosion in end-user BI, it’s time for software vendors to up their game and prove that their functions are consistent, accurate and fit for purpose.


Archived comments:

———Giles Thomas | July 23, 2010 7:15 PM

This is an interesting post; spreadsheet integrity is an increasingly critical issue, especially in the financial markets. Traders, for example, use spreadsheets to create pricing models to help them quickly decide on what positions to take, and when to take them. The spreadsheets are then routinely shared, cut and pasted and adapted across trading desks or even whole trading floors. As a result, they can easily become cumbersome, botched-together “Frankensheets”, and contain inherently difficult-to-spot inaccuracies.

While working at a large investment bank, I learned just how widely-used spreadsheets really are, and how frequently they end up being a source of frustration (or financial loss) to their users. It after several years of this that I went on to build my own spreadsheet, Resolver One. In it, the formulae you put into the code are compiled down into code in the Python programming language before being executed, while the equivalent of macro code is already in Python — so the built-in functions are identical in both, so (obviously) can’t behave differently.

A separate problem with spreadsheet accuracy — one you don’t touch on, perhaps because it doesn’t happen so much in databases, but which I’ve seen in the work of spreadsheet users who’ve reached the dangerous stage where they have started writing macros but don’t have much experience in they yet — is the difference between the “functional” model on the grid, where you can (in theory) recompute the cells in any order that respects their mutual dependencies, and even skip cells whose dependencies haven’t been changed since the last recalculation, and the “imperative” model in the macro and user-defined function (UDF) language, where a function can change global state and have side-effects. Because you can write a UDF that returns a different value each time it’s called, based on global state, it’s easy for an inexperienced macro developer to write UDFs that make the spreadsheets that use them generate inconsistent results.

———Marc | October 7, 2010 2:49 PM

I would agree 100% with Giles comments, that users writing their own macro’s can get themselves into hot water.

I can’t see the connecton between the author’s dicussion about producing sound macro’s in excel, and his conclusion that Aster Data systems would (may ?) suffer from the same issues.

Maybe this article is just a rehack of the old garbage in, garbage out adage.

MapReduce is used by google to index the world wide web, is that not some proof?