Techniques for terminating spam before it reaches the inbox still have some way to go
- Posted: 16:40 01 Nov 2004
Until there are new laws on e-mail use, there are really
only three ways to eliminate spam.
You can change your e-mail address, but your new address will soon
be spotted by the spammers. Of the other two ways, one is
successful - yet requires the user to be proactive - and the other,
which is not as successful, has the added benefit of allowing
people to be more passive.
The active approach involves the use of a third party to act as a
buffer between the sender and the recipient. Choice-mail is an
example of this type of service. In this set-up, e-mail is only
forwarded when the user informs an intermediary that a particular
sender's e-mail is safe. From that point, any message sent by a
ratified sender is automatically forwarded. This is bound to be
effective, if a little time-consuming.
The passive approach relies on automatic filtering techniques.
These typically employ pattern-matching or, in the best systems,
probabilistic and statistical methods. A computer examines the
content of an e-mail and automatically decides whether or not it is
spam. The computer involved may be the user's local PC or it may be
remote.
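As a rough illustration of the probabilistic approach, the sketch below scores a message with a small naive Bayes word model. The article does not say which statistical method any particular product uses, so the choice of naive Bayes, the function names and the training examples are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(examples):
    """Count words per class from (text, is_spam) pairs.

    Assumes at least one spam and one non-spam example are supplied.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in examples:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Estimate the probability that text is spam, using add-one smoothing."""
    vocab = set(counts[True]) | set(counts[False])
    spam_words = sum(counts[True].values()) + len(vocab)
    ham_words = sum(counts[False].values()) + len(vocab)
    # Start from the prior odds, then add the evidence from each word.
    log_odds = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        p_spam = (counts[True][word] + 1) / spam_words
        p_ham = (counts[False][word] + 1) / ham_words
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical training data, for illustration only.
counts, totals = train([
    ("cheap pills buy now", True),
    ("win cheap prize now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday then", False),
])
print(spam_probability("cheap pills", counts, totals) > 0.5)
```

A real filter would be trained on thousands of messages and would need more careful tokenising than a simple split on whitespace.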
A local filter is blocking software. It would be used if someone
had never requested information on Viagra but nevertheless receives
spam on the subject.
To set up a filter, the user creates an inbox rule, such as "if the
body and/or subject of an e-mail contains the word Viagra, delete
it". This will stop certain e-mails getting to the inbox, but it
will not stop all spam.
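The inbox rule quoted above could be sketched as follows. The message layout (a dictionary with "subject" and "body" keys) and the function names are assumptions made for this example, not the format of any particular mail client.

```python
def matches_rule(subject, body, keyword="viagra"):
    """Return True if the keyword appears in the subject or the body."""
    needle = keyword.lower()
    return needle in subject.lower() or needle in body.lower()


def filter_inbox(messages, keyword="viagra"):
    """Keep only the messages that the rule does not match."""
    return [
        m for m in messages
        if not matches_rule(m["subject"], m["body"], keyword)
    ]


# Hypothetical inbox contents.
inbox = [
    {"subject": "Cheap VIAGRA", "body": "Order today"},
    {"subject": "Minutes", "body": "Notes from Monday's meeting"},
    {"subject": "Special offer", "body": "V1agra at low prices"},
]
print([m["subject"] for m in filter_inbox(inbox)])
```

The third message slips through because the keyword has been misspelled, which is one reason a rule of this kind does not stop all spam.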
The problem with a local filter is that it cannot know whether any
given e-mail is spam or not - even if it is privy to every piece of
information contained in it (the return address, subject line,
content, etc). Spam has been sent to many people, so how can a PC
know how many others have received the same e-mail?
There is also another reason why a local filter is not truly
useful. When a user receives spam, the spammer (the sender) gets no
notification of non-delivery and can therefore infer that the
e-mail was received. This validates the user's e-mail address, and
so, very soon, the user is likely to receive spam on another
subject that no rule has been set up to handle.
The alternative approach is to filter e-mail on the server before
it is downloaded. A proper spam filter must be able to ascertain
three things:
- That an identical, or very similar e-mail has been sent to a high number of people
- That all the e-mails have been sent at more or less the same time
- That the content has not been requested.
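A minimal sketch of the first two checks, as a server might run them, is shown below. It groups messages by a fingerprint of their text and flags any fingerprint seen by several recipients within a short window. The thresholds, the tuple layout and the function names are assumptions. The sketch only catches messages that are identical after case and whitespace are normalised, and it makes no attempt at the third check, whether the content was requested.

```python
import hashlib
from collections import defaultdict


def fingerprint(body):
    """Hash the body after normalising case and whitespace."""
    normalised = " ".join(body.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_bulk(messages, min_recipients=3, window_seconds=600):
    """Return fingerprints sent to many recipients at about the same time.

    messages is an iterable of (recipient, timestamp_in_seconds, body).
    """
    groups = defaultdict(list)
    for recipient, timestamp, body in messages:
        groups[fingerprint(body)].append((timestamp, recipient))

    bulk = set()
    for fp, hits in groups.items():
        hits.sort()
        start = 0
        # Slide a time window over the sorted deliveries.
        for end in range(len(hits)):
            while hits[end][0] - hits[start][0] > window_seconds:
                start += 1
            recipients = {r for _, r in hits[start:end + 1]}
            if len(recipients) >= min_recipients:
                bulk.add(fp)
                break
    return bulk


# Hypothetical traffic: three copies of one message and one unrelated message.
traffic = [
    ("a@example.com", 0, "Buy now"),
    ("b@example.com", 60, "buy  NOW"),
    ("c@example.com", 120, "Buy now"),
    ("d@example.com", 30, "Lunch on Friday?"),
]
print(fingerprint("buy now") in find_bulk(traffic))
```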
The comparison of incoming e-mail is usually carried out in some
central way, such as via the user's internet service provider
and/or the company that temporarily holds e-mail, such as
Hotmail.
However, there are several challenges in comparing e-mail messages.
One trivial way to overcome filtering is a spamming technique known
as padding, in which different filler text is added to each copy of
a message. The idea behind this is to fool spam filters that look at
and compare the content of multiple e-mails. When systems do this
they see some similarity (the sell) in such e-mails, but overall
they do not see enough similar text to tip the balance and have the
e-mail marked as spam.
There are a couple of ways to tackle padding: one is easy and the
other hard. If the padding is truly random, even if it is made up
of random words, it can be detected through simple syntactic
analysis.
If a large proportion of an e-mail looks the same as a significant
number of others, and if the differences between them make no
sense, the message is most likely spam. Of course, getting a
machine to decide what makes sense and what does not is the tricky
part, and something that is beyond the capabilities of those
systems that treat message data as a simple byte stream.
There is another easy clue in this type of e-mail that is broadly
useful in the case of truly random data - padding is always at the
bottom. After all, everyone starts reading at the top (which gives
systems trying to spot this a little extra help).
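One way to exploit that clue is to compare only the opening words of two messages, so that padding at the bottom cannot dilute the match. The sketch below does this with the Jaccard similarity of three-word sequences. The similarity measure, the eight-word cut-off and the sample messages are assumptions chosen for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of items common to both sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def head_similarity(msg_a, msg_b, head_words=8):
    """Compare only the first head_words words of each message."""
    head_a = " ".join(msg_a.split()[:head_words])
    head_b = " ".join(msg_b.split()[:head_words])
    return jaccard(shingles(head_a), shingles(head_b))


# Hypothetical spam: the same pitch followed by different random padding.
pitch = "buy cheap pills now at our online store today"
first = pitch + " zebra lantern quartz violet mumble"
second = pitch + " orbit candle fjord pickle anthem"

print(jaccard(shingles(first), shingles(second)))  # whole messages
print(head_similarity(first, second))              # opening words only
```

Comparing whole messages gives a low score because of the padding, while the opening words match exactly.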
How about the case when the random data is made up of non-random
words, such as by taking an arbitrary paragraph of text from the
web and inserting a different one in each and every outgoing spam
e-mail?
Again, to spot this requires some analysis of the text. For
example, spam is usually penned in a particular way and it is quite
likely that the writing style will vary from that in the
padding-text.
Before delving into textual analysis of an e-mail message to
determine if it looks like spam, a simple test is to determine if
the e-mail has been requested. But how can an e-mail system know
what has been requested and what has not? The answer is really that
it cannot. For example, say an e-mail provider decides to
automatically delete e-mails that are:
- Exactly (or more or less) the same
- Sent to a high number of people
- Sent at exactly (or more or less) the same time.
If it did, it would certainly delete e-mails containing
things users were interested in, such as special offers. In other
words, if Hotmail were to do this, it would delete messages that
have been specifically requested by people who have signed up to
certain e-mail lists, which is not acceptable.
So to perform truly effective filtering, systems require specific
information, and to get that, they need to have users train
anti-spam systems themselves.
Systems exist that rely on people to do most of the work - users
check incoming e-mail then vote on a given e-mail message to alert
others that it is spam. In a typical scenario a number of people
receive an unsolicited e-mail and decide it is spam by marking it
as such.
A central server gathers this information, and when a
pre-determined threshold of users have voted the e-mail as spam,
the central server blocks the e-mail automatically.
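The voting scheme can be sketched in a few lines. The class name, the threshold of three and the message identifiers are assumptions; a real service would also need some way of recognising that two users had received the same message.

```python
class VoteServer:
    """Central server that blocks a message once enough users report it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.votes = {}  # message id -> set of users who reported it

    def report_spam(self, message_id, user_id):
        # A set means repeated reports by one user count only once.
        self.votes.setdefault(message_id, set()).add(user_id)

    def is_blocked(self, message_id):
        return len(self.votes.get(message_id, ())) >= self.threshold


server = VoteServer(threshold=3)
for user in ("alice", "bob", "bob"):
    server.report_spam("msg-1", user)
print(server.is_blocked("msg-1"))  # two distinct voters so far

server.report_spam("msg-1", "carol")
print(server.is_blocked("msg-1"))  # third distinct voter
```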
Obviously, for users to do this, they need to have received the
spam in the first place - affecting their own personal
internet-bandwidth in the process.
Commercial systems generally focus on checking the subject line, as
a full analysis of the body text in an e-mail is computationally
intensive, complex and potentially wastes bandwidth. There is also
a concern over privacy. Would a user be entirely happy to allow
some central filtering service to see all their incoming
e-mail?
The body of an e-mail can be large and so although it makes sense
to check the entire content, it does not make sense for a
commercial spam filter to collect messages uploaded by the user to
a central server for spam analysis.
Search engine Google may be closer to a solution than others. First,
its raison d'etre is to understand content. Since its initial public
offering, it has been under constant commercial pressure to improve
its capabilities in this direction.
Google is now offering Gmail - a centralised e-mail system. Because
of the storage capacity it is offering its customers and, most
importantly, the Gmail end-user agreements, Google should be in a
position to perform an extra step in the detection of spam -
namely, to continuously perform current and historical analysis of
the data users retain in their mail folders.
This means that as spam detection techniques and algorithms evolve
and improve, Google should be able to test and verify the
efficiency of these on millions of e-mails. Who knows, in time,
Google may even be able to forecast how spammers are likely to
disguise spam - and adapt along with it.
What is more, Google has grid computing expertise - should it
require even more computing power than it has now to achieve this
end.
It is uncertain whether spam filtering can be improved, but what is
clear is that there need to be better natural language processing
techniques and these must be centralised to improve efficiency and
increase detection rates.
Some spam detection systems such as SpamAssassin already go a
little way towards detecting certain features or other types of
pattern present in spam.
However, what SpamAssassin does not do is look for true linguistic
features. It does not attempt to perform sophisticated semantic
analysis on e-mail, but instead applies rules.
If there are deeper patterns in spam that can be used to detect it,
it may be possible to extract enough meaning from an e-mail to make
a correct judgement.
The challenge is whether there actually is anything typical about
spam that can help identify it. Furthermore, if there is, could its
definition go so far as to be able to signify that an e-mail has
not been requested?
No one currently knows. However, an answer could certainly be
pursued - given enough time, money, willingness and expertise, and
of course, lots of spam.
Although there are plans in the pipeline that may well see an end
to spam, such as the Sender Policy Framework and Caller ID for
E-Mail, it is certain that whoever cracks the problem of spam will
get the keys to their very own goldmine.
Peet Morris is a member of the Artificial Intelligence Group at
the University of Oxford
For more on the SpamAssassin rules: http://spamassassin.apache.org/tests.html
GET PEOPLE INVOLVED IN CHECKING SPAM
The community approach to filtering out spam can ensure that context analysis is carried out and, by requiring a majority of people to "vote" on each message, judgements are not based on whim. Here is how it works:
- A number of people receive the same e-mail message
- A certain proportion of these people also subscribe to the spam filtering service Spamnet. These subscribers examine the message and click an Outlook add-in toolbar-button provided by Spamnet to mark the message as spam
- Having removed the message from the subscriber's inbox, the Spamnet Outlook add-in informs a central server of the title/subject of the e-mail
- As the number of subscribers who have marked the message as spam increases to a certain threshold, the central server becomes increasingly confident the message is spam
- When the threshold is reached, the message is considered to be spam
- When future subscribers receive the same message, their Spamnet add-in is informed it is spam, allowing it to be removed automatically.
MICROSOFT OUTLINES STANDARD FOR STOPPING E-MAIL SPOOFING
Microsoft has been working on a technical specification to help counter e-mail spoofing, one of the practices used by spammers. The proposal is under review by the Internet Engineering Task Force, with the aim of making Sender ID a standard way to combat spam. It is part of an overall goal called the Co-ordinated Spam Reduction Initiative.
Spoofing attempts to get the user to open a message which looks like it came from a legitimate source. According to Microsoft, Sender ID aims to check that every e-mail message originates from the internet domain from which it claims to have been sent. This is accomplished by checking the address of the server sending the mail against a registered list of servers that the domain's owner has authorised to send e-mail on its behalf.
The comparison is automatically performed by the ISP or recipient's mail server before the e-mail message is delivered. If the Sender ID verification passes, the message is delivered as regular mail. If the check fails, the message is further analysed and may be refused by the receiving server, or flagged to the user as a possible deceptive message.
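The check performed by the receiving server can be sketched as a simple lookup. Here a hard-coded dictionary stands in for the registered list of authorised servers, which in practice would be published by each domain rather than held by the recipient. The domain, the addresses and the function names are hypothetical.

```python
# Hypothetical registry: domain -> IP addresses allowed to send its mail.
AUTHORISED_SERVERS = {
    "example.com": {"192.0.2.10", "192.0.2.11"},
}


def sender_domain(from_address):
    """Return the part of the address after the last @, in lower case."""
    return from_address.rsplit("@", 1)[-1].lower()


def check_sender(from_address, connecting_ip, registry=AUTHORISED_SERVERS):
    """Return 'pass', 'fail', or 'unknown' if the domain lists no servers."""
    allowed = registry.get(sender_domain(from_address))
    if allowed is None:
        return "unknown"
    return "pass" if connecting_ip in allowed else "fail"


print(check_sender("offers@example.com", "192.0.2.10"))    # listed server
print(check_sender("offers@example.com", "198.51.100.7"))  # unlisted server
```

A message that fails would then be analysed further, refused, or flagged to the user, as described above.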
Companies supporting the scheme include Sendmail and Verisign.
Another aspect of Microsoft's Co-ordinated Spam Reduction Initiative is that the sender's computer is required to perform a simple task which requires brute-force computation. Since there is no simple way to run the task without expending computing power, anyone attempting to send a large number of e-mail messages will find their servers grinding to a halt.
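The article does not say what the task is. One common way to build such a task, assumed here purely for illustration, is to make the sender search for a number that gives a hash of the message stamp a required run of leading zero bits. Finding the number takes many attempts, but checking it takes one. The stamp format, the difficulty and the function names are all hypothetical.

```python
import hashlib


def leading_zero_bits(digest):
    """Count the zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_digest(stamp, nonce):
    return hashlib.sha256(f"{stamp}:{nonce}".encode("utf-8")).digest()


def solve(stamp, difficulty=12):
    """Sender side: try nonces in turn until the hash is hard enough."""
    nonce = 0
    while leading_zero_bits(stamp_digest(stamp, nonce)) < difficulty:
        nonce += 1
    return nonce


def verify(stamp, nonce, difficulty=12):
    """Recipient side: a single hash confirms the work was done."""
    return leading_zero_bits(stamp_digest(stamp, nonce)) >= difficulty


# Hypothetical stamp combining the recipient and the date.
stamp = "alice@example.com:2004-11-01"
nonce = solve(stamp)
print(verify(stamp, nonce))
```

Each extra bit of difficulty roughly doubles the sender's work, which is negligible for one message but adds up quickly across a large mailing.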