Detecting Spam Blogs: A Machine Learning Approach

Posted on February 20, 2008. Filed under: Spam


Authors
Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, and Anupam Joshi

Year - 2006

Published in - Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006)

Link - http://ebiquity.umbc.edu/get/a/publication/260.pdf

Importance to my Research - Very High

MY REVIEW
The authors propose a splog detection method in this paper. They treat splog detection as a classification problem and argue that a blog can be classified into one of three categories:

  1. Splog ==> Not Authentic Blog
  2. Blog ==> Authentic Blog
  3. Undecided ==> A decision cannot be made on the status of the blog
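
To make the three categories concrete, here is a minimal sketch of how a classifier score could be mapped onto them. The score range, the function name label_blog and both thresholds are my own assumptions for illustration; they are not taken from the paper.

```python
def label_blog(splog_score, low=0.3, high=0.7):
    """Map a hypothetical splog score in [0, 1] to one of three categories.

    The thresholds `low` and `high` are illustrative assumptions,
    not values reported by the authors.
    """
    if not 0.0 <= splog_score <= 1.0:
        raise ValueError("splog_score must be in [0, 1]")
    if splog_score >= high:
        return "splog"      # not an authentic blog
    if splog_score <= low:
        return "blog"       # authentic blog
    return "undecided"      # no decision can be made
```

Anything that falls between the two thresholds is left as undecided rather than forced into one of the other two categories.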

The proposed approach classifies blogs using Local Features and Link Features. The paper identifies several interesting local features for classifying splogs.
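
As a rough sketch of what local features might look like in practice, the snippet below collects a bag-of-anchors and a bag-of-urls from the HTML of a single page, using only the Python standard library. The class name, the function name and the way URLs are split into tokens are my assumptions; the paper's exact tokenization may differ.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class LocalFeatureParser(HTMLParser):
    """Collects anchor text and link targets from one page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.anchor_words = Counter()  # bag-of-anchors
        self.url_tokens = Counter()    # bag-of-urls

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            href = dict(attrs).get("href") or ""
            # Assumption: split the URL on every non-alphanumeric character.
            tokens = re.split(r"[^a-z0-9]+", href.lower())
            self.url_tokens.update(t for t in tokens if t)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # Only text that appears inside an <a> element counts as anchor text.
        if self.in_anchor:
            self.anchor_words.update(re.findall(r"[a-z0-9]+", data.lower()))


def local_features(html):
    """Return (bag_of_anchors, bag_of_urls) for one HTML page."""
    parser = LocalFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.anchor_words, parser.url_tokens
```

Both bags are computed from the page itself, so no linked page has to be fetched to build them.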

Another feature the authors could have employed is the presence of contextual ads: blogs without contextual ads are more likely to be Authentic Blogs than those with contextual ads.

Overall, it was a good paper to read. Compared to “Detecting Spam Web Pages through Content Analysis”, it takes a very different approach to detecting spam, and it focuses purely on blogs, whereas the other paper deals with web pages in general. It would be interesting to study the similarities between these two papers to see whether there is an intersection that can be adopted for blogs or web pages.

Cite this article as
Review on “Detecting Spam Blogs: A Machine Learning Approach” by V. Potdar, 20th Feb 2008, Available Online – http://drvidy.wordpress.com/2008/02/20/detecting-spam-blogs-a-machine-learning-approach/

ABSTRACT
Weblogs or blogs are an important new way to publish information, engage in discussions, and form communities on the Internet. The Blogosphere has unfortunately been infected by several varieties of spam-like content. Blog search engines, for example, are inundated by posts from splogs – false blogs with machine generated or hijacked content whose sole purpose is to host ads or raise the PageRank of target sites. We discuss how SVM models based on local and link-based features can be used to detect splogs. We present an evaluation of learned models and their utility to blog search engines; systems that employ techniques differing from those of conventional web search engines. We evaluate the effectiveness of a combination of features, and finally report our informal analysis of a blog search engine index. 
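
For readers new to SVMs, here is a minimal sketch of a linear SVM trained by plain sub-gradient descent on the hinge loss. This is a textbook stand-in rather than the authors' setup (the review does not say which SVM implementation or kernel they used), and the function names, the toy feature vectors and the hyperparameters are all assumptions.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.01, reg=0.01):
    """Train weights w and bias b with sub-gradient descent on the hinge loss.

    `labels` must be +1 (splog) or -1 (authentic blog). The hyperparameters
    are illustrative assumptions.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # The sample violates the margin: move the boundary toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, made-up feature vectors: [spam-like term weight, ordinary term weight].
toy_samples = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
toy_labels = [1, 1, -1, -1]
toy_w, toy_b = train_linear_svm(toy_samples, toy_labels)
```

In a real system the feature vectors would come from the local and link-based features described in the paper instead of two hand-picked numbers.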

Important Terms

  1. Support Vector Machine (SVM) Models
  2. Splogs, Splog Detection
  3. Bag-of-words (Bag-of-word-N-Grams, Bag-of-anchors feature, Bag-of-urls)
  4. TFIDF
  5. Machine Generated Content
  6. Blogosphere
  7. Ping Servers
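
To illustrate two of the terms above, here is a small sketch that turns documents into bag-of-words counts and weights them with TFIDF. It uses raw term frequency multiplied by log(N / document frequency), which is one common TFIDF variant and only an assumption here; the function names are mine as well.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    weight = term count in the document * log(N / number of documents
    containing the term). Other TFIDF variants exist.
    """
    bags = [Counter(tokenize(doc)) for doc in documents]  # bag-of-words
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each term once per document
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in bag.items()}
        for bag in bags
    ]
```

A term that occurs in every document gets a weight of zero, so the terms that distinguish one document from the rest are the ones that stand out.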

Reference Sheet

Weblogs or blogs are web sites consisting of dated entries typically listed in reverse chronological order on a single page.

Blog search engines, for example, are inundated by posts from splogs – false blogs with machine generated or hijacked content whose sole purpose is to host ads or raise the PageRank of target sites.

Independent of the content genre of blogs, they constitute such an influential subset on the Web, that they collectively create what is now known as the Blogosphere.

As the Blogosphere continues to grow, several capabilities have become critical for blog search engines. The first is the ability to recognize blog sites, understand their structure, identify constituent parts and extract relevant metadata. A second is to robustly detect and eliminate spam blogs (splogs).

Approximately 75% of such pings (the update notifications that blogs send to ping servers) are received from splogs.

Splogs typically link to non-splog Web pages, attempting to influence their PageRank, and it is infeasible for blog search engines to fetch all such pages before making a splog judgment.

Related Useful References 

Pranam Kolari et al., “Blog Track Open Task: Spam Blog Classification”, In Collection, TREC 2006 Blog Track Notebook, November 2006

Pranam Kolari et al., “Characterizing the Splogosphere”, In Proceedings of the 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, May 2006

Pranam Kolari et al., “SVMs for the Blogosphere: Blog Identification and Splog Detection”, In Proceedings, AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006

Pranam Kolari et al., “Memeta: A Framework for Multi-Relational Analytics on the Blogosphere”, In Proceedings, AAAI 2006 Student Abstract Program, February 2006
