Review – A Quantitative Study of Forum Spamming Using Context-based Analysis
Authors – Yuan Niu, Yi-Min Wang, Hao Chen, Ming Ma, and Francis Hsu
Year – 2007
Published in – Proceedings of the Network & Distributed System Security (NDSS) Symposium
Link – ftp://ftp.research.microsoft.com/pub/tr/TR-2006-173.pdf
Importance to my Research – High
MY REVIEW
The approach presented by the authors is very interesting and practical. However, I found one major weakness: the authors should have considered how often spammers change domain names. If domain names are changed on a weekly basis, the Strider URL Tracer has to run every week to find the spam domains and blacklist them. This can be computationally expensive and may incur a lot of network traffic, not only for the spam domains but also for the non-spam domains.
The SMF forum software is not studied in this project. I assume it is one of the most popular forum packages; perhaps it was excluded because it has very good spam protection.
Another thought came to me while reading a different paper, and I think it may be useful here as well: if blogs or forums are to be robust against spam, blog and forum providers should work hand in hand with the search engines. Anti-spamming features should be enabled in blogs and forums. Some heuristics or practical traits can be checked when comments are added, e.g.
1. How long does it take to post a comment? For example, if a comment is 1000 words in length and is posted in a fraction of a second, it is more likely to be hijacked content or spam, i.e. a copy-and-paste scenario. This information is normally not displayed on the blog or forum, but if it were made publicly accessible it would benefit the search engines. At the least, forum or blog operators can implement the check themselves to control the spread of spam in the first place.
2. A second thought concerns email spam. Email providers may want to check how quickly people delete emails from their mailbox: if the time between logging in and deleting many emails is very short, that should be an indicator of spam. On the other hand, it could also reflect users who subscribed to mailing lists but are no longer interested in them.
3. Another way to address comment spam would be to use CAPTCHA. I am not sure why this simple option is not implemented in many forums and blogs.
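Heuristic 1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `CommentSubmission` record, the idea that the forum records when the comment form was served, and the 200 words-per-minute threshold are hypothetical and are not taken from the paper or from any forum software.

```python
from dataclasses import dataclass

# Assumed threshold: sustained typing above ~200 words per minute is
# treated as implausible for a human. Tune this for a real deployment.
MAX_PLAUSIBLE_WPM = 200.0


@dataclass
class CommentSubmission:
    """Hypothetical record a forum could keep for each posted comment."""
    text: str
    form_loaded_at: float  # seconds, when the comment form was served
    submitted_at: float    # seconds, when the comment was posted


def words_per_minute(sub: CommentSubmission) -> float:
    """Typing speed implied by the comment length and the elapsed time."""
    word_count = len(sub.text.split())
    elapsed = sub.submitted_at - sub.form_loaded_at
    if elapsed <= 0:
        # No measurable time passed: any non-empty comment is "infinitely fast".
        return float("inf") if word_count else 0.0
    return word_count / (elapsed / 60.0)


def looks_pasted_or_automated(sub: CommentSubmission,
                              max_wpm: float = MAX_PLAUSIBLE_WPM) -> bool:
    """Flag comments that were written faster than a person could type."""
    return words_per_minute(sub) > max_wpm
```

A flag from this check would only be one signal among several, since a legitimate user may paste a comment drafted elsewhere.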
Cite this article as
Review on “A Quantitative Study of Forum Spamming Using Context-based Analysis” by V. Potdar, 21st Feb 2008. Available online – http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/
Abstract
Forum spamming has become a major means of search engine spamming. To evaluate the impact of forum spamming on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all the 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputably well maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that spammers target abandoned pages and that most comment spam is meant to increase page rank rather than generate immediate traffic. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that these analyses are very effective in identifying spam pages.
Important Terms
- Content Based Analysis
- Cloaking Analysis (Click through Cloaking)
- URL Redirection
- Spammer’s Modus Operandi
- Honey Monkey (aka Monkey Program)
- Doorway Pages
- Context based spam detection
- Behaviour & Signature based spam detection
Reference Sheet
Anecdotal evidence as well as our own experience indicates that spammers have successfully promoted their web sites in search results through forum spamming.
The observation on our honey forums confirms that spammers target abandoned pages and that most comment spam is meant to increase page rank rather than generate immediate traffic.
The tracer provides a key functionality called the Top Domain view: given a list of (primary) URLs, the tracer launches an actual browser to visit each URL and records all secondary URLs visited as a result.
At the end of the batched scan, the Top Domain view provides the list of third-party domains that received secondary-URL traffic and rank them by the number of primary URLs that generated traffic to them.
If the input is a list of highly suspicious spam URLs (such as those collected from a spammed forum), the Top Domain view highlights those behind-the-scenes spammer domains that are associated with a large number of doorway URLs.
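The ranking step of the Top Domain view described in the excerpts above can be approximated in a few lines of Python. This is only a sketch and not the Strider URL Tracer itself: it assumes the browser traces have already been collected into a mapping from each primary URL to its secondary URLs, and it treats any host name different from the primary URL's host as "third-party", which is a simplification. The function names and the input format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def top_domain_view(traces: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Rank third-party hosts by how many primary URLs sent traffic to them.

    `traces` maps each primary URL to the secondary URLs a browser
    visited as a result of loading it (assumed to be recorded already).
    """
    primaries_by_host = defaultdict(set)
    for primary, secondaries in traces.items():
        primary_host = host_of(primary)
        for secondary in secondaries:
            host = host_of(secondary)
            # Skip same-host requests: only third-party traffic is counted.
            if host and host != primary_host:
                primaries_by_host[host].add(primary)
    # Most widely shared hosts first; ties broken alphabetically.
    return sorted(
        ((host, len(primaries)) for host, primaries in primaries_by_host.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Counting distinct primary URLs rather than raw requests matches the ranking described in the excerpt: a host that receives traffic from many doorway URLs rises to the top even if each doorway contacts it only once.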
To identify heavily spammed forums, we looked for sites that appeared in the search results for different keywords as well as sites whose pages appeared multiple times in the results for a single keyword.
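The two selection criteria in the excerpt above (a site appearing for several different keywords, and a site appearing several times for one keyword) can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' code: the input format, the function name, the use of the URL host name as the "site", and both thresholds are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple
from urllib.parse import urlsplit


def spammed_forum_candidates(
    results: Dict[str, List[str]],
    min_keywords: int = 2,
    min_hits_per_keyword: int = 2,
) -> Tuple[Set[str], Set[str]]:
    """Return (sites seen across keywords, sites repeated within a keyword).

    `results` maps each search keyword to the result URLs returned for it.
    """
    keywords_by_site = defaultdict(set)
    hits = Counter()  # (site, keyword) -> number of result URLs
    for keyword, urls in results.items():
        for url in urls:
            site = (urlsplit(url).hostname or "").lower()
            if not site:
                continue
            keywords_by_site[site].add(keyword)
            hits[(site, keyword)] += 1
    across_keywords = {
        site for site, kws in keywords_by_site.items() if len(kws) >= min_keywords
    }
    repeated_in_keyword = {
        site for (site, _kw), n in hits.items() if n >= min_hits_per_keyword
    }
    return across_keywords, repeated_in_keyword
```

A site that lands in both sets would be a natural first candidate for closer inspection.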