The desert island procedural set-based thinking challenge


The text below is contributed content for the Computer Weekly Developer Blog, written by Michael Vessey, senior consultant with software quality specialist SQS.

One of the biggest challenges for modern-day developers is dealing with the number of platforms thrown at them by technology providers. Depending on the structure of their team and the size of the project, a typical Microsoft developer may have to work with SQL Server, C#, WCF, WPF, CSS, LINQ, ASP and any number of other tools of the trade.

It's rare to find a developer who will advertise on their C.V. that they work with only one of these technologies; developers are by nature curious creatures that want to play with the latest toys and value any form of diversity in their work.

With this in mind, it's hard to imagine how a developer can be a master of all of these tools and still manage to keep abreast of the most recent version of each. Throw into this development minefield the mentality differences between different platform specialists and you have a recipe for disaster.

To highlight the issue let's look at the differences between set-based-thinking database professionals and procedural-thinking development professionals.

In 1971 Edgar F. Codd defined the term "Third Normal Form" (3NF), which is regarded as "THE" major design rule when building databases. With the assistance of Raymond F. Boyce and a few others, modern-day "set-based" approaches were formed that dealt with the optimal way of retrieving data from databases.

A difference of "mindsets"

The idea behind set-based thinking is that you gather all of the data you need in one large operation, attempting to touch each database table only once (or as few times as possible) and with as few round-trips as possible.

But these techniques go strongly against the mentality of procedural developers, who work in a world of iterations and loops that allow them to apply complex business logic to single instances of a class. For an application developer, sitting and waiting 30 seconds for data to load is simply not on the agenda.
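To make the two mindsets concrete, here is a minimal T-SQL sketch; the dbo.Customers and dbo.Orders tables and their columns are purely illustrative and not taken from any real system. The first block mirrors the procedural approach, issuing one round-trip per customer, while the second gathers the same data with a single set-based statement.

-- procedural mindset: loop and issue one query per customer (illustrative only)
declare @customerId int
set @customerId = 1
while @customerId <= 50
begin
    select OrderId, OrderDate
    from dbo.Orders
    where CustomerId = @customerId

    set @customerId = @customerId + 1
end

-- set-based mindset: touch each table once, in a single round-trip
select c.CustomerId, o.OrderId, o.OrderDate
from dbo.Customers c
join dbo.Orders o on o.CustomerId = c.CustomerId
where c.CustomerId between 1 and 50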

A real-world example

To provide a real-world visualisation of this, let's consider a less conceptual scenario: a desert island with 50 people stranded because of an airplane crash. They have one bucket, five cups and a pool of fresh water a short walk from their campsite.

In order to make sure the 50 stranded people don't dehydrate, they need to move water from the pool to the campsite. The bucket is set up at the campsite, one cup is left with it, and in true multi-threaded application style two of the crew go off to the pool, each holding two cups (one in each hand).

They make several trips and slowly fill the bucket, while the cup left behind allows the poor unfortunate souls on the beach to take drinks in turn from the limited water supply available at camp (perhaps even depleting the bucket back to its original empty state).

This embodies the procedural developer's approach to the solution, often implemented automatically by domain entity modelling systems such as NHibernate, where the database code can be written by the computer rather than by a database expert. Data is fetched row by row, or as it is referred to on many database forums, RBAR - "row by agonising row" - a term coined by Jeff Moden.

This technique looks absolutely amazing when you have a development system with the bare minimum of data, but can become woefully inadequate when looking at larger data sets.

How would a database pro do it?

The approach of the database professional is very different. They pick up the bucket, walk to the pool and fill the bucket. It's very heavy, so they walk very slowly back to camp. When they arrive there are five cups available and 50 people can drink (five at a time). The survivors had to wait five times as long for the bucket to arrive, but the crew didn't need to make 13 trips to the pool (at four cups per trip) in order to provide the last man with a drink.

Does this happen in the real world?

To show how this really does apply in the development world, I went to a site where I was asked to look at a purge process for old data. The system had worked absolutely fine when it was released; however, over the course of a year it had become very slow and was taking up to eight hours to run.

The developer was at his wit's end trying to figure out how to solve the problem and was looking at Windows performance statistics trying to work out whether it was a hardware or software related issue. The project manager confided in me that he was already in talks with the operations team about improving the server hardware at a cost of about £18,000.

Fortunately the problem was very easy to spot. Looking at the code for the purge I found the following:

-- declare a cursor over every id that is due to be purged
declare @id int
declare curs1 cursor for select id from dbo.mytable where status=3
open curs1
fetch next from curs1 into @id
while @@fetch_status=0
begin
    -- delete the rows one at a time, one round-trip per row
    delete from dbo.mytable where id=@id
    fetch next from curs1 into @id
end
close curs1
deallocate curs1

A cursor loop inside the database code was loading up a list of records to delete and then finding each one individually (amongst several million) and deleting it.

I suggested the following much simpler line of code:
delete from dbo.mytable where status=3

The developer was sceptical; he suggested several reasons why he did not think this would improve things and proceeded to line up a demonstration on the development system with 50 records of data. The demonstration showed no measurable difference between the sub-second performance of the two blocks of code.

I asked at that point to re-run the test with one million rows of data, promising that if there was a performance issue then we could abort the code execution and look for other solutions. The developer was astonished when the million record delete that was taking up to eight hours in the live environment had completed in just less than eight seconds with the new code.
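As a footnote to that test, reproducing the comparison needs a data-heavy environment in the first place. A rough sketch of how one might create one follows; the table name dbo.mytable_test and its two columns are assumptions for illustration, not taken from the client's system.

-- rough sketch: build a million-row test copy (table shape assumed: an int id plus a status column)
select top (1000000)
    row_number() over (order by (select null)) as id,
    abs(checksum(newid())) % 5 as status
into dbo.mytable_test
from sys.all_objects a
cross join sys.all_objects b

-- time the set-based purge against the data-heavy copy
set statistics time on
delete from dbo.mytable_test where status=3
set statistics time off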

So what lessons have we learned?

After leaving the slightly confused developer to re-run the test and after reporting back to the project manager that the issue was resolved, I began to think about lessons learned and how such poor code had gotten into the database.

The developer wasn't bad at his job; in fact his middle-tier code was very tidy and lean. The problem was that he was working across multiple technologies and one of them was not his forte. A peer review of the code by a technical expert would have identified the issue before any testing. In addition, if the test and development environments had been data heavy, the issue would have been detected before go-live.

Neither way of thinking is right or wrong, but if you can think like both then you'll have a great career in software.
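Where the two mindsets genuinely meet, a purge like the one above is often run as set-based batches inside a small loop, so that each pass stays fast while locks and the transaction log remain manageable. The sketch below assumes the same dbo.mytable and an arbitrary batch size of 10,000 rows.

-- hybrid sketch: set-based deletes issued in batches from a small loop (illustrative)
declare @rows int
set @rows = 1
while @rows > 0
begin
    delete top (10000) from dbo.mytable where status=3
    set @rows = @@rowcount
end

Each pass does its work as a set; the loop exists only to keep each transaction to a sensible size.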
