Detecting Spam Web Pages through Content Analysis
Authors – Alexandros Ntoulas, Marc Najork, Mark Manasse, Dennis Fetterly
Year – 2006
Published in – Proceedings of the 15th International World Wide Web Conference
Link – http://research.microsoft.com/research/sv/sv-pubs/www2006.pdf
Importance to my Research – High
MY REVIEW
This paper is well written and I really enjoyed reading it. The authors list 10 heuristics for classifying web pages as spam or non-spam. Most of these heuristics correspond to SEO techniques that are often implemented by SEO companies, and the analysis of how well they classify content as spam shows interesting results. The authors study the following heuristics:
- Number of words in page
- Number of words in page title
- Average length of words
- Amount of anchor text
- Fraction of visible content
- Compressibility
- Fraction of page drawn from globally popular words
- Fraction of globally popular words
- Independent n-gram likelihoods
- Conditional n-gram likelihoods
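A few of the simpler heuristics in this list can be sketched in Python. This is only an illustration under my own assumptions, not the authors' implementation: the function name `content_features`, the word-splitting regex, and the use of zlib as a stand-in compressor are all hypothetical choices, and compressibility is taken here as uncompressed size divided by compressed size.

```python
import re
import zlib

# Hypothetical tokenizer: runs of letters, digits and apostrophes count as words.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def content_features(title: str, visible_text: str, raw_html: str) -> dict:
    """Compute a handful of content-based features for one page.

    visible_text is assumed to be the page text with markup already removed;
    raw_html is the full page source.
    """
    words = WORD_RE.findall(visible_text.lower())
    title_words = WORD_RE.findall(title.lower())
    text_bytes = visible_text.encode("utf-8")
    compressed = zlib.compress(text_bytes)

    return {
        "num_words": len(words),
        "num_title_words": len(title_words),
        # Average word length in characters (0.0 for an empty page).
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Share of the page source that is visible text rather than markup.
        "visible_fraction": len(visible_text) / len(raw_html) if raw_html else 0.0,
        # Highly repetitive pages compress well, so this ratio grows.
        "compression_ratio": len(text_bytes) / len(compressed) if text_bytes else 0.0,
    }


repetitive = "cheap pills " * 200
varied = "The authors study ten heuristics and combine them with a classifier."
spammy = content_features("cheap pills", repetitive, "<html><body>" + repetitive + "</body></html>")
normal = content_features("A review", varied, "<html><body>" + varied + "</body></html>")
print(spammy["compression_ratio"] > normal["compression_ratio"])
```

The point of the comparison at the end is only that keyword-stuffed text, being repetitive, compresses far better than ordinary prose; any threshold on these features would have to be learned from judged data.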
As far as the review goes, I have a few recommendations for the authors for future research. As the authors mention, NLP techniques would be very handy for understanding the semantics of web page content. This is a really good research direction, and it would be very good to see WordNet used; however, it is going to be computationally expensive, so I would recommend using it only in the topmost layer of a layered spam detection system.
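To make the layering idea concrete, here is a minimal sketch of a cascade in which cheap checks run first and the expensive semantic check is reached only when the earlier layers are undecided. Everything in it is hypothetical: the layer names, the 0.2 repetition threshold, and the `semantic_check` placeholder (which merely stands in for a WordNet-style analysis and does no real NLP) are my assumptions, not anything from the paper.

```python
from typing import Callable, List, Optional, Tuple

# A layer returns True (spam), False (not spam), or None (undecided).
Layer = Callable[[str], Optional[bool]]

calls = {"semantic": 0}  # counts how often the expensive layer actually runs


def repetition_check(text: str) -> Optional[bool]:
    """Cheap layer: flag pages whose vocabulary is extremely repetitive."""
    words = text.lower().split()
    if not words:
        return None
    unique_fraction = len(set(words)) / len(words)
    return True if unique_fraction < 0.2 else None  # hypothetical threshold


def semantic_check(text: str) -> Optional[bool]:
    """Expensive layer: placeholder for a WordNet-style semantic analysis."""
    calls["semantic"] += 1
    return False  # the placeholder always answers "not spam"


LAYERS: List[Tuple[str, Layer]] = [
    ("repetition", repetition_check),
    ("semantic", semantic_check),
]


def classify_layered(text: str, layers: List[Tuple[str, Layer]]) -> Tuple[Optional[bool], str]:
    """Run layers in order and stop at the first one that reaches a verdict."""
    for name, layer in layers:
        verdict = layer(text)
        if verdict is not None:
            return verdict, name
    return None, "undecided"


print(classify_layered("buy " * 10, LAYERS))
print(classify_layered("the quick brown fox jumps over the lazy dog", LAYERS))
```

With this ordering the costly layer is only paid for on pages the cheap heuristics cannot decide, which is the reason for putting WordNet at the top of the stack.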
The research could also look at detecting spam on forums or blogs, e.g. detecting messages that claim to come from Nigeria or some other African country and offer to transfer millions of dollars to your account. I did not find any articles on forum spamming, but I am very interested in reading more about it. Within forums, spamming can take a completely different form: it could be simple advertisements placed as forum posts, with no relevance to the forum's theme.
Furthermore, the heuristics could be enhanced by looking at other SEO tips and tricks and evaluating whether these tricks can be made to backfire, by identifying patterns in them that help classify spam web pages.
Finally, I hope this page is not classified as spam, because I have too many links on this page as well as a lot of important terms, and the number of words is more than 650 :)
Cite this article as
Review on “Detecting Spam Web Pages through Content Analysis” by V. Potdar, 19th Feb 2008. Available Online – http://drvidy.wordpress.com/2008/02/19/detecting-spam-web-pages-through-content-analysis-2/
ABSTRACT
In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
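The percentages quoted in the abstract can be reproduced from the raw counts. The short check below uses only the numbers given in the abstract; the variable names and the `pct` helper are mine.

```python
# Counts taken directly from the abstract.
total_pages = 17168    # judged collection
spam_pages = 2364      # pages judged to be spam
spam_detected = 2037   # spam pages correctly identified
misidentified = 526    # spam and non-spam pages classified incorrectly


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(pct(spam_detected, spam_pages))   # share of spam pages caught: 86.2
print(pct(spam_pages, total_pages))     # share of spam in the collection: 13.8
print(pct(misidentified, total_pages))  # share of all pages misclassified: 3.1
```

Note that the 86.2% figure is relative to the spam pages only, while the 3.1% figure is relative to the whole judged collection.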
Important Terms
- Link Farming
- Link Stuffing
- Keyword Stuffing
- Black Hat SEO Techniques
- White Hat SEO Techniques
- Search Engine Optimization
- Exponential Decay
- Spam Detection Heuristics
- Classification Techniques (Decision Tree, Rule Based, Neural Nets, Support Vector Machines)
- Cloaking
- Trust Rank
- Layered Spam Detection System
Related Useful References
D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. In 7th International Workshop on the Web and Databases, June 2004.
D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. In 28th Annual International ACM SIGIR Conference on Research and Development in Information
Z. Gyöngyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information
G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
Google’s New Algorithm to Rank Pages and Detect Spam: PhraseRank?
Detecting spam related and biased contexts for programmable search engines
Interesting spammer pattern – how they find sites
Page Quality and Web Spam: Using Content Analysis to Detect Spam Pages
SEO Blackhat


