Techniques for terminating spam before it reaches the inbox still have some way to go

E-mail software can only do so much to spot spam and eliminate it. Try people power

Until there are new laws on e-mail use, there are really only three ways to eliminate spam.

You can change your e-mail address, but your new address will soon be spotted by the spammers. Of the other two ways, one is successful - yet requires the user to be proactive - and the other, which is not as successful, has the added benefit of allowing people to be more passive.

The active approach involves the use of a third party to act as a buffer between the sender and the recipient. Choice-mail is an example of this type of service. In this set-up, e-mail is only forwarded when the user informs an intermediary that a particular sender's e-mail is safe. From that point, any message sent by a ratified sender is automatically forwarded. This is bound to be effective, if a little time-consuming.

The passive approach relies on automatic filtering techniques. These typically employ pattern-matching or, in the best systems, probabilistic and statistical methods. A computer examines the content of an e-mail and automatically decides whether or not it is spam. The computer involved may be the user's local PC or it may be remote.
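As an illustration of the statistical approach, here is a minimal sketch of a naive Bayes word-frequency classifier. The article does not name a specific algorithm, so the choice of naive Bayes, the function names and the tiny training set are all assumptions made for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    log_score = {}
    for label in (True, False):
        # Prior: share of training messages carrying this label.
        log_p = math.log(totals[label] / (totals[True] + totals[False]))
        n_words = sum(counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            log_p += math.log((counts[label][word] + 1) / (n_words + len(vocab)))
        log_score[label] = log_p
    # Turn the two log scores into a normalised probability.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights[True] / (weights[True] + weights[False])

# Hypothetical training data, far too small for real use.
TRAINING = [
    ("buy cheap pills now", True),
    ("cheap pills special offer", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
counts, totals = train(TRAINING)
likely_spam = spam_probability("cheap pills", counts, totals)
likely_ham = spam_probability("monday meeting", counts, totals)
```

With this toy data the first message scores well above 0.5 and the second well below it. A real filter would need far more training mail and better tokenising.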

A local filter is blocking software. It would be used if someone had never requested information on Viagra but nevertheless received spam on the subject.

To set up a filter, the user creates an inbox rule, such as "if the body and/or subject of an e-mail contains the word Viagra, delete it". This will stop certain e-mails getting to the inbox, but it will not stop all spam.
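The rule described above can be sketched in a few lines. This is an illustration only, not the syntax of any particular mail client. The function name and the sample messages are hypothetical.

```python
import re

def matches_rule(subject: str, body: str, keyword: str = "viagra") -> bool:
    """Return True if the keyword appears as a whole word in subject or body."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return bool(pattern.search(subject) or pattern.search(body))

# Hypothetical inbox of (subject, body) pairs.
inbox = [
    ("Cheap Viagra today", "Order now"),
    ("Meeting notes", "See attached agenda"),
    ("Special offer", "V1agra at low prices"),  # obfuscated spelling
]

# Keep only the messages the rule does not match.
kept = [msg for msg in inbox if not matches_rule(*msg)]
```

The third message slips through because its spelling is disguised, which shows why a rule of this kind cannot stop all spam.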

The problem with a local filter is that it cannot know whether any given e-mail is spam or not - even if it is privy to every piece of information contained in it (the return address, subject line, content, etc). Spam has been sent to many people, so how can a PC know how many others have received the same e-mail?

There is another reason why a local filter is not truly useful. When a user receives spam, the spammer (the sender) can tell that the e-mail was received, because no notification of non-delivery comes back. This validates the user's e-mail address, so the user is soon likely to receive spam on another subject that no rule has been set up to handle.

The alternative approach is to filter e-mail on the server before it is downloaded. A proper spam filter must be able to ascertain three things: 

  • That an identical, or very similar e-mail has been sent to a high number of people
  • That all the e-mails have been sent at more or less the same time
  • That the content has not been requested.

The comparison of incoming e-mail is usually carried out in some central way, such as via the user's internet service provider and/or the company that temporarily holds e-mail, such as Hotmail.

However, there are several challenges in comparing e-mail messages. One trivial way to defeat filtering is a spamming technique known as padding, in which different filler text is added to each copy of a message. The idea behind this is to fool spam filters that look at and compare the content of multiple e-mails. When systems do this they see some similarity (the sell) in such e-mails, but overall they do not see enough similar text to tip the balance and have the e-mail marked as spam.

There are a couple of ways to tackle padding: one is easy and the other is hard. If the padding is truly random, even if it is made up of random words, it can be detected through simple syntactic analysis.

If a large proportion of an e-mail looks the same as a significant number of others, and if the differences between them make no sense, the message is most likely spam. Of course, getting a machine to decide what makes sense and what does not is the tricky part, and something that is beyond the capabilities of those systems that treat message data as a simple byte stream.

There is another easy clue in this type of e-mail that is broadly useful in the case of truly random data - padding is always at the bottom. After all, everyone starts reading at the top (which gives systems trying to spot this a little extra help).
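The effect of padding, and the help given by its position at the bottom, can be sketched as follows. Word shingles and the Jaccard score are one common way to measure text overlap. They are an assumption here, not a method named in this article, and the sample messages and the eight-word cut-off are hypothetical.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of k-word shingles that two texts have in common."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def head(text: str, n_words: int) -> str:
    """Keep only the first n_words words of the text."""
    return " ".join(text.split()[:n_words])

# The same sell, followed by different random padding in each copy.
SELL = "buy cheap pills online today with free delivery"
msg_a = SELL + " purple window lantern orbit candle river stone maple"
msg_b = SELL + " harbour ticket velvet anchor summit pepper walnut meadow"

full_score = jaccard(msg_a, msg_b)                    # dragged down by padding
head_score = jaccard(head(msg_a, 8), head(msg_b, 8))  # top of message only
```

Comparing whole messages gives a low score because the padding differs, while comparing only the top of each message shows that the sell is identical. A real system would not know in advance where the sell ends, so a fixed cut-off is a simplification.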

How about the case when the random data is made up of non-random words, such as by taking an arbitrary paragraph of text from the web and inserting a different one in each and every outgoing spam e-mail?

Again, to spot this requires some analysis of the text. For example, spam is usually penned in a particular way and it is quite likely that the writing style will vary from that in the padding-text.

Before delving into textual analysis of an e-mail message to determine if it looks like spam, a simple test is to determine if the e-mail has been requested. But how can an e-mail system know what has been requested and what has not? The answer is really that it cannot. For example, say an e-mail provider decides to automatically delete e-mails that are:

  • Exactly (or more or less) the same
  • Sent to a high number of people
  • Sent at exactly (or more or less) the same time.

If they did, they would certainly delete e-mails containing things users were interested in, such as special offers. In other words, if Hotmail were to do this, it would delete messages that have been specifically requested by people who have signed up to certain e-mail lists, which is not acceptable.

So to perform truly effective filtering, systems require specific information, and to get that, they need to have users train anti-spam systems themselves.

Systems exist that rely on people to do most of the work - users check incoming e-mail then vote on a given e-mail message to alert others that it is spam. In a typical scenario a number of people receive an unsolicited e-mail and decide it is spam by marking it as such.

A central server gathers this information, and when a pre-determined threshold of users have voted the e-mail as spam, the central server blocks the e-mail automatically.

Obviously, for users to do this, they need to have received the spam in the first place - affecting their own personal internet-bandwidth in the process.

Commercial systems generally focus on checking the subject line, because a full analysis of the body text in an e-mail is computationally intensive, complex and potentially wasteful of bandwidth. There is also a concern over privacy. Would a user be entirely happy to allow some central filtering service to see all their incoming e-mail?

The body of an e-mail can be large and so although it makes sense to check the entire content, it does not make sense for a commercial spam filter to collect messages uploaded by the user to a central server for spam analysis.

Search engine Google may be closer to a solution than others. First, its raison d'etre is to understand content. Following its initial public offering, it will be under constant commercial pressure to improve its capabilities in this direction.

Google is now offering Gmail - a centralised e-mail system. Because of the storage capacity it is offering its customers and, most importantly, the Gmail end-user agreements, Google should be in a position to perform an extra step in the detection of spam - namely, to continuously perform current and historical analysis of the data users retain in their mail folders.

This means that as spam detection techniques and algorithms evolve and improve, Google should be able to test and verify the efficiency of these on millions of e-mails. Who knows, in time, Google may even be able to forecast how spammers are likely to disguise spam - and adapt along with it.

What is more, Google has grid computing expertise - should it require even more computing power than it has now to achieve this end.

It is uncertain whether spam filtering can be improved, but what is clear is that there need to be better natural language processing techniques, and these must be centralised to improve efficiency and increase detection rates.

Some spam detection systems such as Spamassassin already go a little way towards detecting certain features or other types of pattern present in spam.

However, what Spamassassin does not do is look for true linguistic features. It does not attempt to perform sophisticated semantic analysis on e-mail, but instead applies rules.

If there are deeper patterns in spam that can be used to detect it, it may be possible to extract enough meaning from an e-mail to make a correct judgement.

The challenge is whether there actually is anything typical about spam that can help identify it. Furthermore, if there is, could its definition go so far as to be able to signify that an e-mail has not been requested?

No one currently knows. However, an answer could certainly be pursued - given enough time, money, willingness and expertise, and of course, lots of spam.

Although there are plans in the pipeline that may well see an end to spam, such as the Sender Policy Framework and Caller ID for E-Mail, it is certain that whoever cracks the problem of spam will get the keys to their very own goldmine.

Peet Morris is a member of the Artificial Intelligence Group at the University of Oxford


The community approach to filtering out spam can ensure that context analysis is carried out and, by requiring a majority of people to "vote" on each message, judgements are not based on whim. Here is how it works: 

  • A number of people receive the same e-mail message 
  • A certain proportion of these people also subscribe to the spam filtering service Spamnet. These subscribers examine the message and click an Outlook add-in toolbar-button provided by Spamnet to mark the message as spam 
  • Having removed the message from the subscriber's inbox, the Spamnet Outlook add-in informs a central server of the title/subject of the e-mail 
  • As the number of subscribers who have marked the message as spam increases to a certain threshold, the central server becomes increasingly confident the message is spam 
  • When the threshold is reached, the message is considered to be spam 
  • When future subscribers receive the same message, their Spamnet add-in is informed it is spam, allowing it to be removed automatically.
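The steps above can be sketched as a small vote-counting server. This is an illustration of the idea only, not Spamnet's actual implementation. The class name, the threshold of three and the subscriber identifiers are hypothetical.

```python
from collections import defaultdict

class VoteServer:
    """Central server that counts spam reports per message subject."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        # Normalised subject -> set of subscribers who reported it.
        self.votes = defaultdict(set)

    @staticmethod
    def _key(subject: str) -> str:
        # Ignore case and spacing differences between copies.
        return " ".join(subject.lower().split())

    def report(self, subscriber_id: str, subject: str) -> None:
        """Record one vote; a set means each subscriber counts only once."""
        self.votes[self._key(subject)].add(subscriber_id)

    def is_spam(self, subject: str) -> bool:
        """True once enough distinct subscribers have reported the subject."""
        return len(self.votes.get(self._key(subject), ())) >= self.threshold

server = VoteServer(threshold=3)
server.report("alice", "Cheap pills today")
server.report("bob", "cheap pills  today")
server.report("alice", "Cheap pills today")    # repeat vote, not counted twice
below_threshold = server.is_spam("Cheap pills today")
server.report("carol", "CHEAP PILLS TODAY")
at_threshold = server.is_spam("Cheap pills today")
```

Storing voters in a set stops a single subscriber from pushing a message over the threshold alone, which is the point of requiring a number of people to vote.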


Microsoft has been working on a technical specification to help counter e-mail spoofing, one of the practices used by spammers. The proposal is under review by the Internet Engineering Task Force, with the aim of making Sender ID a standard way to combat spam. It is part of an overall goal called the Co-ordinated Spam Reduction Initiative. 

Spoofing attempts to get the user to open a message which looks like it came from a legitimate source. According to Microsoft, Sender ID aims to check that every e-mail message originates from the internet domain from which it claims to have been sent. This is accomplished by checking the address of the server sending the mail against a registered list of servers that the domain owner has authorised to send e-mail on its behalf.

The comparison is automatically performed by the ISP or recipient's mail server before the e-mail message is delivered. If the Sender ID verification passes, the message is delivered as regular mail. If the check fails, the message is further analysed and may be refused by the receiving server, or flagged to the user as a possibly deceptive message.
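The check can be sketched as a lookup of the sending server's address against the list registered for the claimed domain. This is a simplified illustration, not the Sender ID specification. The in-memory table, the function name and the result labels are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical stand-in for the registered list of authorised servers.
AUTHORISED_SENDERS = {
    "example.com": ["192.0.2.0/24"],
    "example.org": ["198.51.100.25/32"],
}

def check_sender(claimed_address: str, connecting_ip: str) -> str:
    """Return 'pass', 'fail' or 'unknown' for a claimed sender address."""
    domain = claimed_address.rpartition("@")[2].lower()
    networks = AUTHORISED_SENDERS.get(domain)
    if networks is None:
        # The domain has registered nothing, so no judgement can be made.
        return "unknown"
    ip = ipaddress.ip_address(connecting_ip)
    if any(ip in ipaddress.ip_network(net) for net in networks):
        return "pass"
    return "fail"

genuine = check_sender("alice@example.com", "192.0.2.10")
spoofed = check_sender("alice@example.com", "203.0.113.5")
unlisted = check_sender("bob@example.net", "192.0.2.10")
```

A receiving server could deliver on "pass" and analyse further, refuse or flag on "fail", as described above. How to treat domains with no registered list is a policy choice that this sketch leaves open.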

Companies supporting the scheme include Sendmail and Verisign. 

Another aspect of Microsoft's Co-ordinated Spam Reduction Initiative is that the sender's computer is required to perform a simple task which requires brute-force computation. Since there is no simple way to run the task without expending computing power, anyone attempting to send a large number of e-mail messages will find their servers grinding to a halt.
