How Content Filters Can Improve at Identifying Spam
As mentioned in ‘How does a Spam Filter Work’, content filters improve their ability to identify spam messages accurately by ‘learning’ through use. Bayesian spam filters calculate the probability of a message being spam based on its content, and learn to differentiate spam from good mail, resulting in robust and efficient anti-spam protection. A key aim behind the development of Bayesian techniques was to avoid ‘false positives’, that is, bona fide messages being identified as spam.
This is done by making simple content-scoring filters able to adapt. The filters scan messages for pre-defined indicators of spam such as words and characteristics deemed typical of spam. These can include:
- Capitalisation unlikely in regular e-mail: "CLICK HERE!" or "FREE! BUY NOW!"
- Spammy phrases like ‘Money Back Guarantee’ or ‘Why Pay More’.
- Individual words common in spam: ‘Viagra’, ‘Singles’, ‘Discount’.
Each of these characteristic elements is assigned an arbitrary score, and if the total of these scores exceeds a certain threshold, the message is identified as spam. In theory this approach can work well, but it is vulnerable to clever spammers disguising the content (e.g. ‘V1agra’), and it can only ever catch what its pre-defined filtering criteria anticipate.
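The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration: the rule list, the weights and the threshold value are assumptions invented for the example, not settings taken from any real filter.

```python
import re

# Hypothetical rules and weights, chosen only for illustration.
RULES = [
    # Shouted calls to action (deliberately case-sensitive).
    (re.compile(r"\b(?:CLICK HERE|BUY NOW|FREE)\b"), 2.0),
    # Spammy phrases, in any capitalisation.
    (re.compile(r"money back guarantee|why pay more", re.IGNORECASE), 1.5),
    # Individual words common in spam.
    (re.compile(r"\b(?:viagra|singles|discount)\b", re.IGNORECASE), 1.0),
]
THRESHOLD = 3.0  # assumed cut-off


def spam_score(message: str) -> float:
    """Add up the weight of every rule that matches at least once."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message: str) -> bool:
    """Flag the message when its total score exceeds the threshold."""
    return spam_score(message) > THRESHOLD
```

With these assumed rules, a message such as "CLICK HERE! Money Back Guarantee on Viagra" trips all three rules and is flagged, while a message containing only ‘V1agra’ scores zero, which is exactly the weakness of a fixed list.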
Bayesian spam filters work on a similar content-scoring approach, but without the weakness of a manually built list of filtering characteristics. Instead, they build their own scoring scheme by comparing legitimate mail with known spam, calculating the probability of each characteristic appearing in spam and in legitimate mail.
The characteristics a Bayesian spam filter might compare include:
- Words in the message body, including pairs and phrases as outlined earlier.
- Header information: sender identity and message relay chain.
- Aspects of HTML code such as colours.
- Meta information: for instance where a particular phrase appears.
If, by comparison, the email spam filter records that the word “bulldog” often appears in your legitimate e-mails, but never in spam, it will decide the probability of “bulldog” indicating spam is near zero. But if the word “dating” appears frequently and exclusively in spam, it will be assigned a very high probability (almost 100%) of indicating spam. By establishing and recording these probabilities, Bayesian spam filters build their own user-specific spam definitions. If the words “bulldog” and “dating” were to appear in the same e-mail (unlikely but theoretically possible!), the two signals would roughly cancel out, so the spam filter would look at some of the e-mail's many other characteristics to decide how to classify it.
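The “bulldog” and “dating” example can be made concrete with a minimal sketch. It assumes a simple naive-Bayes-style combination of per-word probabilities; the function names, the clamping range of 0.01 to 0.99 and the neutral value of 0.5 for unseen words are choices made for this illustration, and real filters differ in the details.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def train(spam_msgs, ham_msgs):
    """Estimate, per word, the probability that a message containing it is spam.

    Assumes both lists are non-empty.
    """
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    probs = {}
    for token in set(spam_counts) | set(ham_counts):
        spam_freq = spam_counts[token] / len(spam_msgs)
        ham_freq = ham_counts[token] / len(ham_msgs)
        p = spam_freq / (spam_freq + ham_freq)
        # Clamp so that one word can never make the verdict absolutely certain.
        probs[token] = min(0.99, max(0.01, p))
    return probs


def spam_probability(message, probs, unknown=0.5):
    """Combine the per-word probabilities, treating words as independent."""
    log_spam = log_ham = 0.0
    for token in tokenize(message):
        p = probs.get(token, unknown)  # unseen words are treated as neutral
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Keep the exponent in a safe range for very long messages.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))


probs = train(
    spam_msgs=["dating offer now", "hot dating singles"],
    ham_msgs=["my bulldog is sick", "bulldog walk at noon"],
)
```

In this toy data set “bulldog” is clamped to 0.01 and “dating” to 0.99, so a message containing only those two words comes out at about 0.5: the evidence cancels, and other characteristics would have to settle the matter.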
Through this system of auto-adaptive classification, Bayesian filters use both their decisions and yours – e.g. manual correction of a misjudgement by the filters – to become very effective at identifying spam. Their power is initially limited by the number of good emails and spam that they have to compare, but with use they can develop their efficiency with minimal input from you. By assessing your good e-mail they can deliver a very low rate of false positives, and are effective against spammers adapting their methods, as it is virtually impossible for a mass e-mail campaign to imitate the individual characteristics of multiple users’ legitimate e-mail.
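One way to picture how a manual correction feeds back into the filter is the sketch below. The `AdaptiveFilter` class and its method names are hypothetical, invented for this illustration; the sketch only tracks per-word counts and shows how moving one misfiled message from the spam pile to the good pile shifts a word's probability.

```python
import re
from collections import Counter


class AdaptiveFilter:
    """Hypothetical per-user word counts that are updated as mail is classified."""

    def __init__(self):
        # Keyed by is_spam: True holds spam counts, False holds good-mail counts.
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    @staticmethod
    def tokens(text):
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, message, is_spam):
        """Record a message under the given class."""
        self.counts[is_spam].update(self.tokens(message))
        self.totals[is_spam] += 1

    def correct(self, message, was_spam):
        """The user overrides a wrong verdict: move the message to the other class."""
        self.counts[was_spam].subtract(self.tokens(message))
        self.totals[was_spam] -= 1
        self.learn(message, not was_spam)

    def token_spam_probability(self, token):
        """Share of the evidence for this word that comes from spam."""
        spam_freq = self.counts[True][token] / max(1, self.totals[True])
        ham_freq = self.counts[False][token] / max(1, self.totals[False])
        if spam_freq + ham_freq == 0:
            return 0.5  # never seen: neutral
        return spam_freq / (spam_freq + ham_freq)


f = AdaptiveFilter()
f.learn("cheap dating offer", True)
f.learn("bulldog show results", False)
f.learn("bulldog show tickets offer", True)   # wrongly filed as spam
before = f.token_spam_probability("bulldog")
f.correct("bulldog show tickets offer", was_spam=True)
after = f.token_spam_probability("bulldog")
```

Here `before` is roughly one third, because one of the two spam messages mentions “bulldog”; after the correction the word has only ever been seen in good mail and `after` falls to zero.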
In 2010, a spam outbreak broadcasting fake iTunes purchase receipts gave a good example of how spammers can improve their results by combining several criminal tactics. Spammers sent out empty supposed iTunes invoices with links offering a web-based view of the email. Readers who clicked on these links were asked to download PDF documents, which actually infected their computers with the Zeus Trojan horse, a piece of malware that steals banking details. This made it one of the most effective multi-vector spamming schemes.
Tactics included using familiar, trusted brands, compromising legitimate web sites, and redirecting victims to pharmaceutical and fake anti-virus web sites.
Pharmacy-related spam remains the most common type of spam email, although its share dropped from over 40% in Q4 of 2010 to just under 30%.
September 2010 brought to light the largest social networking-related spam scheme. Billions of spam messages per day were sent out worldwide pretending to have originated from the business social networking service LinkedIn. The messages contained links which, if clicked, infected the targeted computer with the Zeus Trojan horse.
The infection’s purpose was to gather personal banking details, and many experts considered the scheme a “cash cow”. The FBI’s Internet Crime Complaint Center warned that in 2009 more than US$100 million was stolen from commercial accounts using similar methods.
In 2010, more than 70% of all spam messages were related to pharmaceutical products, while Q1 of 2011 showed numbers dropping to almost half. Moreover, events like the March Chile earthquake and the June World Cup created a new opening for spammers and boosted spam levels worldwide.