<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Vidy's Blog on Social Software</title>
	<atom:link href="http://drvidy.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://drvidy.wordpress.com</link>
	<description>Blog on Latest Research in the field of Social Software &#38; Web 2.0</description>
	<lastBuildDate>Wed, 19 Mar 2008 07:18:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='drvidy.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Vidy's Blog on Social Software</title>
		<link>http://drvidy.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://drvidy.wordpress.com/osd.xml" title="Vidy&#039;s Blog on Social Software" />
	<atom:link rel='hub' href='http://drvidy.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Mining and Summarizing Customer Reviews</title>
		<link>http://drvidy.wordpress.com/2008/03/10/mining-and-summarizing-customer-reviews/</link>
		<comments>http://drvidy.wordpress.com/2008/03/10/mining-and-summarizing-customer-reviews/#comments</comments>
		<pubDate>Mon, 10 Mar 2008 02:48:28 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Text Semantics]]></category>
		<category><![CDATA[nlp]]></category>
		<category><![CDATA[opinion classification]]></category>
		<category><![CDATA[opinion identification]]></category>
		<category><![CDATA[parts-of-speech]]></category>
		<category><![CDATA[review mining]]></category>
		<category><![CDATA[text mining]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=33</guid>
		<description><![CDATA[Authors- Minqing Hu and Bing Liu Year &#8211; 2004 Published in- Proceedings of the 10th ACM International conference on knowledge discovery and data mining.  Link - http://sifaka.cs.uiuc.edu/course/591cxz04f/peng1.pdf Importance to my Research &#8211; Very High MY REVIEW This paper proposes a product review classification system that downloads reviews from the web and classifies them as positive [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong>- Minqing Hu and Bing Liu</p>
<p><strong>Year</strong> &#8211; 2004</p>
<p><strong>Published in</strong>- Proceedings of the 10th ACM International conference on knowledge discovery and data mining. </p>
<p><strong>Link </strong>- <a href="http://sifaka.cs.uiuc.edu/course/591cxz04f/peng1.pdf">http://sifaka.cs.uiuc.edu/course/591cxz04f/peng1.pdf</a></p>
<p><span class="a"></span><strong>Importance to my Research</strong> &#8211; Very High</p>
<p><strong><u>MY REVIEW</u></strong></p>
<p>This paper proposes a product review classification system that downloads reviews from the web and classifies the opinions in them as positive or negative for each product feature. In doing so, the framework automatically identifies the features of the product or service being reviewed and then mines the relevant sentences for opinions. Finally, the system presents the user with a summary of the reviews for a particular product. The authors downloaded product reviews from Amazon and C|Net for their experiments. The proposed system can be used by product manufacturers to improve their products, and by customers to decide which product to buy.</p>
<p>Before I begin the review, I should say that this is one of the best-written papers I have read. It is so well organized and thought through that every question I had in mind was answered at the right time. I congratulate the authors on their excellent effort in writing this paper, which made reading it a pleasure.</p>
<p>Some Future Directions</p>
<ol>
<li>
<div>It would be a good idea to provide some additional statistics based on the review classification. For example, I would be interested in knowing how many people say that the <u>quality of picture </u>is acceptable under bright sunlight but not very good in dim light. I guess the current system only identifies <u>picture quality</u> as a feature and puts all the reviews related to picture quality underneath it. Am I correct?</div>
</li>
</ol>
<p>more details coming soon&#8230;</p>
<p><strong>Cite this article as</strong><br />
Critical Review on &#8220;Mining and Summarizing Customer Reviews&#8221; by V. Potdar, 10th Mar, 2008. Available Online &#8211; <a href="http://drvidy.wordpress.com/2008/03/10/mining-and-summarizing-customer-reviews/">http://drvidy.wordpress.com/2008/03/10/mining-and-summarizing-customer-reviews/</a></p>
<p><span id="more-33"></span><strong>Abstract<br />
</strong>Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.</p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Feature Based Summarization</div>
</li>
<li>
<div>Cognitive Linguistics</div>
</li>
<li>
<div>Sentiment Classifiers</div>
</li>
<li>
<div>Text Summarization</div>
</li>
<li>
<div>Subject Genre Classification</div>
</li>
<li>
<div>Terminology Finding</div>
</li>
<li>
<div><a href="http://acl.ldc.upenn.edu/C/C00/C00-1044.pdf">Sentence Subjectivity Classification</a></div>
</li>
</ol>
<p><strong>Reference Sheet<br />
</strong>In this research, we study the problem of generating <u>feature-based summaries </u>of customer reviews of products sold online. Here, features broadly mean product features (or attributes) and functions. Given a set of customer reviews of a particular product, the task involves three subtasks:</p>
<ol>
<li>Identifying features of the product that customers have expressed their opinions on (called product features);</li>
<li>For each feature, identifying review sentences that give positive or negative opinions; and</li>
<li>Producing a summary using the discovered information.</li>
</ol>
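<p>The third subtask can be pictured with a minimal sketch. It assumes the first two subtasks have already produced (feature, polarity, sentence) triples; the function names and the sample sentences are made up for illustration and are not taken from the paper.</p>

```python
from collections import defaultdict


def summarize(opinions):
    """Group labelled opinion sentences by product feature.

    `opinions` is an iterable of (feature, polarity, sentence) tuples where
    polarity is "positive" or "negative". Producing these tuples (feature
    mining and opinion classification) is assumed to have happened already.
    """
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return dict(summary)


def format_summary(summary):
    """Render the structured summary as indented plain text."""
    lines = []
    for feature in sorted(summary):
        groups = summary[feature]
        lines.append(f"Feature: {feature}")
        for polarity in ("positive", "negative"):
            lines.append(f"  {polarity}: {len(groups[polarity])}")
            for sentence in groups[polarity]:
                lines.append(f"    - {sentence}")
    return "\n".join(lines)


# Hypothetical input, invented for this example.
example = [
    ("picture quality", "positive", "The pictures are razor sharp."),
    ("picture quality", "negative", "Photos look grainy in dim light."),
    ("battery life", "positive", "The battery lasts all day."),
]

if __name__ == "__main__":
    print(format_summary(summarize(example)))
```

<p>The result is structured (one entry per feature, with counts and supporting sentences) rather than a shorter free-text document, which is the distinction the following paragraphs draw.</p>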
<p>Our task is different from traditional text summarization [15, 39, 36] in a number of ways.</p>
<p><u>First</u> of all, a summary in our case is structured rather than another (but shorter) free text document as produced by most text summarization systems.</p>
<p><u>Second</u>, we are only interested in features of the product that customers have opinions on and also whether the opinions are positive or negative.</p>
<p>We do not summarize the reviews by selecting or rewriting a subset of the original sentences from the reviews to capture their main points as in traditional text summarization.</p>
<p><u>Genre classification </u>classifies texts into different styles, e.g., “editorial”, “novel”, “news”, “poem” etc. Although some techniques for genre classification can recognize documents that express opinions [23, 24, 14], they do not tell whether the opinions are positive or negative.</p>
<p>A more closely related work is [17], in which the authors investigate sentence subjectivity classification and conclude that the presence and type of adjectives in a sentence is indicative of whether the sentence is subjective or objective.</p>
<p>Useful References</p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/03/10/mining-and-summarizing-customer-reviews/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Identifying Comparative Sentences in Text Documents</title>
		<link>http://drvidy.wordpress.com/2008/03/05/identifying-comparative-sentences-in-text-documents/</link>
		<comments>http://drvidy.wordpress.com/2008/03/05/identifying-comparative-sentences-in-text-documents/#comments</comments>
		<pubDate>Wed, 05 Mar 2008 11:48:38 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Text Semantics]]></category>
		<category><![CDATA[identify comparative sentences]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=32</guid>
		<description><![CDATA[Authors &#8211; Nitin Jindal and Bing Liu Year - 2006 Published in - Proceedings of the 29th Annual International ACM SIGIR Conference 2006. Link - http://www.www2007.org/htmlposters/poster930/ Importance to my Research &#8211; High MY REVIEW In this paper, the authors propose a solution to identify comparative sentences from text documents &#8211; news articles, product reviews, forum [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Nitin Jindal and Bing Liu</p>
<p><strong>Year </strong>- 2006</p>
<p><strong>Published in </strong>- Proceedings of the 29th Annual International ACM SIGIR Conference 2006.</p>
<p><strong>Link </strong>- <a href="http://www.www2007.org/htmlposters/poster930/">http://www.www2007.org/htmlposters/poster930/</a></p>
<p><strong>Importance to my Research</strong> &#8211; High</p>
<p><strong><u><em>MY REVIEW</em></u></strong></p>
<p><em>In this paper, the authors propose a solution for identifying comparative sentences in text documents &#8211; news articles, product reviews, forum discussions. Forum discussions are very important to my research, and I am interested in knowing more about this approach, as I can apply it to my research directly.<br />
</em><em><br />
Since I am not an expert in data mining, I cannot write a review from a technical perspective, so I just summarize the key points for my future reference. </em></p>
<p><em>Here the authors argue that comparative sentences and opinions are different. Opinions may use a lot of adjectives when describing a product or service, e.g. "I don't like the car because it is ugly" or "because it is big". Here, ugly and big are adjectives.</em></p>
<p><em>Comparative sentences, however, use comparative adverbs or comparative adjectives (better, longer etc.), although there can be exceptions, as the authors point out. </em></p>
<p><em>The authors use class sequential rules with multiple minimum supports to distinguish between comparative and non-comparative sentences. </em></p>
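<p>To make the keyword idea concrete, here is a rough sketch of a candidate filter. It is only a surface heuristic that I wrote for illustration, not the authors' class sequential rules, and the keyword list and function name are my own assumptions.</p>

```python
import re

# Hypothetical keyword list for illustration; the paper's list is not reproduced here.
COMPARATIVE_KEYWORDS = {
    "more", "less", "most", "least",
    "better", "worse", "best", "worst",
    "versus", "superior", "inferior",
}


def is_candidate_comparative(sentence):
    """Return True if the sentence shows a surface cue of a comparison.

    Cues checked: a comparative keyword, an "-er than" pair such as
    "longer than", or an "as ... as" construction such as "as good as".
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    if any(word in COMPARATIVE_KEYWORDS for word in words):
        return True
    # "-er" word directly followed by "than"
    if any(first.endswith("er") and second == "than"
           for first, second in zip(words, words[1:])):
        return True
    # "as X as" with exactly one word in between
    return any(first == "as" and third == "as"
               for first, third in zip(words, words[2:]))


if __name__ == "__main__":
    for text in ("Car X is very ugly",
                 "Car X is much better than Car Y",
                 "Car X is 2 feet longer than Car Y"):
        print(text, is_candidate_comparative(text))
```

<p>A filter like this would only shortlist candidates; as the authors point out, there are exceptions in both directions, which is why a learned classifier is needed on top.</p>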
<p><strong><em>Cite this review as</em><br />
</strong>Review on &#8220;<em>Identifying Comparative Sentences in Text Documents</em>&#8221; by V. Potdar, 6th March 2008, Available Online &#8211; <a href="http://drvidy.wordpress.com/2008/03/05/identifying-comparative-sentences-in-text-documents/">http://drvidy.wordpress.com/2008/03/05/identifying-comparative-sentences-in-text-documents/</a></p>
<p><strong><span id="more-32"></span></strong></p>
<p><strong>Abstract<br />
</strong>This paper studies the problem of identifying comparative sentences in text documents. The problem is related to but quite different from sentiment/opinion sentence identification or classification. Sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. An important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers’ opinions on its products. Comparisons on the other hand can be subjective or objective. Furthermore, a comparison is not concerned with an object in isolation. Instead, it compares the object with others. An example opinion sentence is “the sound quality of CD player X is poor”. An example comparative sentence is “the sound quality of CD player X is not as good as that of CD player Y”. Clearly, these two sentences give different information. Their language constructs are quite different too. Identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. This paper proposes to study the comparative sentence identification problem. It first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. Experiment results using three types of documents, news articles, consumer reviews of products, and Internet forum postings, show a precision of 79% and recall of 81%. More detailed results are given in the paper.</p>
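<p>For my own reference, a small sketch of how precision and recall are computed and how the two figures quoted in the abstract combine into an F1 score. The counts in the usage lines are hypothetical; only the 79% and 81% come from the abstract.</p>

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from confusion counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 8 correct hits, 2 false alarms, 2 misses.
    print(precision_recall_f1(8, 2, 2))
    # Figures quoted in the abstract: precision 0.79, recall 0.81.
    print(round(f1_score(0.79, 0.81), 3))
```

<p>With the abstract's figures the F1 score works out to roughly 0.80.</p>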
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Precision &amp; Recall</div>
</li>
<li>
<div>Sentiment &amp; Opinion Extraction &amp; Classification</div>
</li>
<li>
<div>Pattern Discovery</div>
</li>
<li>
<div><a href="http://en.wikipedia.org/wiki/Supervised_learning">Supervised Learning</a> (Classification)</div>
</li>
<li>
<div>Unsupervised Learning (Clustering)</div>
</li>
<li>
<div>Machine Learning Model</div>
</li>
<li>
<div>Class Sequence Rules (Class Confidence)</div>
</li>
<li>
<div>Naive Bayesian Model</div>
</li>
<li>
<div>Support Vector Machines</div>
</li>
<li>
<div>Gradable Keywords (e.g. more, less, good, better, etc.)</div>
</li>
<li>
<div>Comparative Constructions (Implicit comparatives)</div>
</li>
<li>
<div><a href="http://www.cs.jhu.edu/~brill/">Brill's Tagger</a> (<a href="http://www.cis.upenn.edu/~treebank/">Penn Treebank</a>)</div>
</li>
<li>
<div>Gradable &amp; Non-Gradable Comparatives</div>
</li>
</ol>
<p><strong>Reference Sheet<br />
</strong>Comparisons are one of the most convincing ways of evaluation. For example, in the business environment, whenever a new product comes into market, the product manufacturer wants to know consumer opinions on the product, and how the product compares with those of its competitors. Much of such information is now readily available on the Web in the form of customer reviews, forum discussions, blogs, etc. Extracting such information can significantly help businesses in their marketing and product benchmarking efforts.</p>
<p>Comparisons are related to but also quite different from sentiments and opinions, which are subjective. Comparisons on the other hand can be subjective or objective. For example, an opinion sentence on a car may be “Car X is very ugly”.</p>
<p>A subjective comparative sentence may be<br />
“Car X is much better than Car Y”</p>
<p>An objective comparative sentence may be<br />
“Car X is 2 feet longer than Car Y”</p>
<p><strong>Useful References</strong></p>
<ol>
<li>
<div><a href="http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf">Supervised Machine Learning: A Review of Classification Techniques</a></div>
</li>
<li>
<div><a href="http://www.cs.utsa.edu/~bylander/cs6243/kotsiantis-clustering.pdf">Recent Advances in Clustering: A Brief Survey</a></div>
</li>
<li>
<div><a href="http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf">Semi-Supervised Learning Literature Survey</a></div>
</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/03/05/identifying-comparative-sentences-in-text-documents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review Spam Detection</title>
		<link>http://drvidy.wordpress.com/2008/03/05/review-spam-detection/</link>
		<comments>http://drvidy.wordpress.com/2008/03/05/review-spam-detection/#comments</comments>
		<pubDate>Wed, 05 Mar 2008 04:12:36 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Spam]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=31</guid>
		<description><![CDATA[Authors &#8211; Nitin Jindal and Bing Liu Year &#8211; 2007 Published in &#8211; Proceedings of the International World Wide Web Conference Committee (IW3C2).  Link - http://www.www2007.org/htmlposters/poster930/ Importance to my Research - High MY REVIEW In this paper, the authors make an attempt to study a new category of spam, i.e. review spam, and also provide insights into a [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Nitin Jindal and Bing Liu</p>
<p><strong>Year</strong> &#8211; 2007</p>
<p><strong>Published in</strong> &#8211; <em>Proceedings of the International World Wide Web Conference Committee (IW3C2).</em> </p>
<p><strong>Link </strong>- <a href="http://www.www2007.org/htmlposters/poster930/">http://www.www2007.org/htmlposters/poster930/</a></p>
<p><strong>Importance to my Research </strong>- High</p>
<p><em><u><strong>MY REVIEW<br />
</strong></u>In this paper, the authors make an attempt to study a new category of spam, i.e. <strong>review spam</strong>, and also provide insights into a <strong>detection approach </strong>to identify such spam. Review spam is divided into two categories:</em></p>
<ol>
<li>
<div><em>Duplicate Reviews &#8211; Duplicate or near-duplicate reviews of the same product or of multiple products.</em></div>
</li>
<li>
<div><em>Spam Reviews &#8211; Fake reviews, or reviews which are falsified or untrustworthy. </em></div>
</li>
</ol>
<p><em>The authors propose to use the <strong>shingle method </strong>to <strong>detect duplicate reviews</strong>. Detecting duplicate or near-duplicate content has become a major concern with the flourishing of blogs and forums. Authors who spend real time and effort compiling a good article and posting it on the Internet are at risk of copyright infringement. </em></p>
<p><em>So far copyright infringement has been a major concern for the music and film industries, but it is becoming more and more prevalent in the text domain as well. Digital watermarking algorithms for text need to be developed to address this issue. However, the success of this technology would be quite limited, as digital watermarking relies on hiding the copyright signal in noise, which is far less available in text than in audio or video files. So implementing the shingle method should do for now, although it can only detect duplication, not prevent it. </em></p>
<p><em>I liked the authors' idea of treating duplicate reviews as positive training examples of spam and using them to model the features of non-duplicate reviews. I guess this approach made their job a bit easier, because finding duplicate reviews is easy with the shingle method, and assuming that duplicate reviews correspond to spam is an educated guess that would hold on most occasions: spammers don't have time to write fresh reviews, since it would not be cost-effective for them. </em></p>
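<p>A minimal sketch of shingle-based duplicate detection, as I understand the general technique: each review is turned into a set of overlapping word n-grams (shingles), and two reviews are compared by the Jaccard similarity of their sets. The shingle width, the 0.9 threshold and the function names are my own assumptions for illustration, not settings confirmed from the paper.</p>

```python
import re


def shingles(text, w=3):
    """Return the set of w-word shingles (as tuples) in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return set()
    # A text shorter than w words yields one shingle holding all its words.
    count = max(len(words) - w + 1, 1)
    return {tuple(words[i:i + w]) for i in range(count)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))


def is_near_duplicate(review_a, review_b, w=3, threshold=0.9):
    """Flag two reviews as near-duplicates (threshold is an assumed value)."""
    return jaccard(shingles(review_a, w), shingles(review_b, w)) >= threshold


if __name__ == "__main__":
    first = "Great camera, takes sharp pictures and the battery lasts long"
    second = "Great camera, takes sharp pictures and the battery lasts long"
    third = "Terrible phone, the screen cracked after one week"
    print(is_near_duplicate(first, second))
    print(is_near_duplicate(first, third))
```

<p>As noted above, this only detects duplication after the fact; it does nothing to prevent copying.</p>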
<p><em>more details coming soon&#8230;</em></p>
<p><strong><em>Cite this article as<br />
</em></strong>Critical Review on &#8220;Review Spam Detection&#8221; by V. Potdar, 29th Feb, 2008. Available Online &#8211; <a href="http://drvidy.wordpress.com/2008/03/05/review-spam-detection/">http://drvidy.wordpress.com/2008/03/05/review-spam-detection/</a></p>
<p><span id="more-31"></span></p>
<p><strong>Abstract<br />
</strong>It is now a common practice for e-commerce Web sites to enable their customers to write reviews of products that they have purchased. Such reviews provide valuable sources of information on these products. They are used by potential customers to find opinions of existing users before deciding to purchase a product. They are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. Unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. In this paper, we make an attempt to study review spam and spam detection. To the best of our knowledge, there is still no reported study on this problem.  </p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Opinion Based Applications</div>
</li>
<li>
<div>Review Spam</div>
</li>
<li>
<div>Logistic Regression</div>
</li>
<li>
<div>2-Class Classification Problem</div>
</li>
<li>
<div>Naive Bayes (Text Classification)</div>
</li>
</ol>
<p><strong>Reference Sheet<br />
</strong>Positive opinions can result in significant financial gains and/or fames for organizations and individuals. This gives good incentives for review/opinion spam [8].</p>
<p>There are generally two types of spam reviews.</p>
<ol>
<li>
<div>The <em><u>first type</u></em> consists of those that deliberately mislead readers or automated opinion mining systems by giving undeserving positive opinions to some target products in order to promote them and/or by giving unjust or malicious negative reviews to some other products in order to damage their reputation.</div>
</li>
<li>
<div>The <em><u>second type</u></em> consists of non-reviews (e.g., ads) which contain no opinions on the product. </div>
</li>
</ol>
<p>Review spam is related to but also different from Web or email spam.</p>
<ol>
<li>
<div>The <u><em>objective</em> of <em>Web spam</em></u> is to attract people to some target pages by manipulating the content of the pages and/or their link structures so that they will be ranked high by search engines. Spam emails are mainly ads.</div>
</li>
<li>
<div><em><u>Spam reviews </u></em>are very different as they <em><u>give false opinions</u></em>, which are much <em><u>harder to detect even manually</u></em>. Thus, most existing methods for detecting web spam and email spam [3, 7, 9, 11] are unsuitable for review spam.</div>
</li>
</ol>
<p>We discovered that spam activities are widespread. For example, we found a large number of duplicate and near-duplicate reviews written</p>
<ol>
<li>
<div>by the same reviewers on different products, or</div>
</li>
<li>
<div>by different reviewers (possibly different userids of the same persons) on the same products or different products.</div>
</li>
</ol>
<p>We propose to perform spam detection based on <u><em>duplicate finding</em> </u>and <em><u>classification</u></em>. For classification, we regard spam detection as a <em>2-class classification problem</em>, <em>spam</em> and <em>non-spam</em>.</p>
<p>To build a <em><u>classification model</u></em>, we need <em><u>labeled training examples</u> </em>of <em><u>spam reviews</u></em> and <em><u>non-spam reviews</u></em>. Recognizing whether a review is a spam review or not is extremely difficult by manually reading the reviews because one can carefully craft a spam review which is just like any other innocent review and the number of spam reviews is also small.</p>
<p>We tried to read a large number of reviews and were unable to identify reliable spam reviews except finding a few obvious advertisements, which are irrelevant to the products being reviewed and contain no opinions. Thus, other ways have to be used to find training examples. </p>
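<p>To illustrate what treating spam detection as a 2-class classification problem can look like, here is a small logistic regression sketch in plain Python. The single feature and the four training rows are invented for the example (imagine a count of near-duplicate copies found for each review); the paper's actual features and training setup are not reproduced here.</p>

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(rows, labels):
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def spam_probability(weights, bias, features):
    """Predicted probability that a review with these features is spam."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Hypothetical training data: one made-up feature per review,
# label 1 = spam, label 0 = non-spam.
ROWS = [[0.0], [1.0], [2.0], [3.0]]
LABELS = [0, 0, 1, 1]

if __name__ == "__main__":
    w, b = train_logistic_regression(ROWS, LABELS)
    for row in ROWS:
        print(row, round(spam_probability(w, b, row), 3))
```

<p>The hard part, as the passage above says, is not the classifier itself but obtaining reliable labeled examples to train it on.</p>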
<p><strong>Useful References</strong><br />
 </p>
<p><strong>Useful Links</strong><br />
<a href="http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html">http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html</a> </p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/03/05/review-spam-detection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Defining and Measuring Quality in Online Discussions &#8211; A Critical Review</title>
		<link>http://drvidy.wordpress.com/2008/02/29/defining-and-measuring-quality-in-online-discussions-a-critical-review/</link>
		<comments>http://drvidy.wordpress.com/2008/02/29/defining-and-measuring-quality-in-online-discussions-a-critical-review/#comments</comments>
		<pubDate>Fri, 29 Feb 2008 01:04:34 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Data Quality Measurement]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[forums]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=30</guid>
		<description><![CDATA[Authors &#8211; Alexandru Spatariu, Kendall Hartley, and Lisa D. Bendixen Year &#8211; 2004 Published in &#8211; The Journal of Interactive Online Learning, Volume 2, Number 4, Spring 2004, ISSN: 1541-4914, www.ncolr.org Link &#8211; http://www.ncolr.org/jiol/issues/PDF/2.4.2.pdf Importance to my Research &#8211; High MY REVIEW This paper studies different methods to evaluate quality in online discussion forums within [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Alexandru Spatariu, Kendall Hartley, and Lisa D. Bendixen</p>
<p><strong>Year</strong> &#8211; 2004</p>
<p><strong>Published in</strong> &#8211; The Journal of Interactive Online Learning, Volume 2, Number 4, Spring 2004, ISSN: 1541-4914, <a href="http://www.ncolr.org/">www.ncolr.org</a></p>
<p><strong>Link</strong> &#8211; <a href="http://www.ncolr.org/jiol/issues/PDF/2.4.2.pdf">http://www.ncolr.org/jiol/issues/PDF/2.4.2.pdf</a></p>
<p><strong>Importance to my Research</strong> &#8211; High</p>
<p><em><strong><u>MY REVIEW<br />
</u></strong>This paper studies different methods to evaluate quality in online discussion forums within an educational setting. The literature is categorised into four different classes as follows:</em></p>
<ol>
<li>
<div><em>Levels of Disagreement</em></div>
</li>
<li>
<div><em>Argument Structure Analysis</em></div>
</li>
<li>
<div><em>Interaction Based Coding</em></div>
</li>
<li>
<div><em>Content Analysis</em></div>
</li>
</ol>
<p><em>All these techniques code individual sentences against quality attributes, e.g. low, medium or high. For example, the Levels of Disagreement method tags sentences with varying levels of disagreement, i.e. how much (or how little) one forum post agrees with the previous forum post. Here the level of disagreement is the prime quality classifier, so posts with low disagreement are rated low in quality whereas those with high disagreement are rated high in quality. </em></p>
<p><em>The next technique, Argument Structure Analysis, evaluates quality by studying how arguments are made within each post; arguments supported by theory or facts are weighted more than others. This technique can be used along with the Levels of Disagreement technique, to see whether posts that are rated as high in quality actually have supporting arguments as classified by Argument Structure Analysis. </em></p>
<p><em>Content Analysis takes an altogether different approach to analyzing the quality of posts: it analyzes content or meta-content. For example, when evaluating a student forum the contents would include student participation rates, electronic interaction patterns, social cues, etc., as identified by Hara, Bonk, and Angeli (2000).</em></p>
<p><em>This paper provided a good insight into the existing quality evaluation parameters. I rated this paper as being of High Importance to my research because these techniques are not automated: a person (editor, tutor or lecturer) has to manually code each post with different levels of quality. I would like this to be automated, and I am more interested in looking at NLP to do this for me. </em></p>
<p><em><strong>Cite this article as<br />
</strong>Critical Review on &#8220;Defining and Measuring Quality in Online Discussions&#8221; by V. Potdar, 29th Feb, 2008. Available Online &#8211; <a href="http://drvidy.wordpress.com/2008/02/29/defining-and-measuring-quality-in-online-discussions-a-critical-review/">http://drvidy.wordpress.com/2008/02/29/defining-and-measuring-quality-in-online-discussions-a-critical-review/</a></em></p>
<p><span id="more-30"></span><em></em></p>
<p><strong>Abstract<br />
</strong>In support of research examining relationships between learner characteristics and the quality of online discussions, this paper surveys different methods for evaluating discussions. The paper will present coding methods used in our own research as well as methods used by others interested in quality online discussions. Key topics include what constitutes quality in online discussions and how that quality can be measured?</p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Asynchronous Communication</div>
</li>
<li>
<div>Toulmin Model</div>
</li>
<li>
<div>Idea Units</div>
</li>
</ol>
<p><strong>Reference Sheet<br />
</strong><em>Some Useful Statistics</em> &#8211; According to the U.S. Department of Education (2003), 89% of public, four-year institutions offered distance education courses during the 2000–2001 academic year. Of those offering distance education courses, 90% offered Internet courses.</p>
<p><strong>Useful References</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/29/defining-and-measuring-quality-in-online-discussions-a-critical-review/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review on &#8211; Finding High Quality Content in Social Media</title>
		<link>http://drvidy.wordpress.com/2008/02/28/finding-high-quality-content-in-social-media/</link>
		<comments>http://drvidy.wordpress.com/2008/02/28/finding-high-quality-content-in-social-media/#comments</comments>
		<pubDate>Thu, 28 Feb 2008 02:26:15 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Data Quality Measurement]]></category>
		<category><![CDATA[content analysis]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[HITS]]></category>
		<category><![CDATA[PageRank]]></category>
		<category><![CDATA[Yahoo Answers]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=28</guid>
		<description><![CDATA[Authors &#8211; Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne Year &#8211; 2008 Published in &#8211; WSDM 2008 Link &#8211; http://download.tailrank.com/wsdm2008/p183.pdf Importance to my Research &#8211; Very High MY REVIEW The authors propose a model for quality analysis in social media. Yahoo Answers has been used as a test bed to study the results. The [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne</p>
<p><strong>Year</strong> &#8211; 2008</p>
<p><strong>Published in</strong> &#8211; WSDM 2008</p>
<p><strong>Link</strong> &#8211; <a href="http://download.tailrank.com/wsdm2008/p183.pdf">http://download.tailrank.com/wsdm2008/p183.pdf</a></p>
<p><strong>Importance to my Research</strong> &#8211; Very High</p>
<p><strong><u><em>MY REVIEW<br />
</em></u></strong><em>The authors propose a model for quality analysis in social media. Yahoo Answers has been used as a test bed to study the results. The proposed model relies on feedback from users of the system, who can ask questions (askers), provide answers (answerers) or evaluate answers (evaluators). </em></p>
<p><em>This work is based on ideas similar to those in the paper I reviewed earlier, <a href="http://drvidy.wordpress.com/2008/02/27/automatically-assessing-the-post-quality-in-online-discussions-on-software/">Automatically Assessing the Post Quality in Online Discussions on Software</a>. This work also evaluates quality based on user feedback. As I mentioned in that review, the majority of forums find that only a handful of their users actually provide feedback. Yahoo Answers, however, is designed with quality content in mind, so the answers to a question have to be evaluated by the user who posted the question. This provides a very good mechanism to enforce feedback. </em></p>
<p><em>However, there should be some incentives for the users to provide feedback, and these incentives should be clearly stated e.g. if you provide feedback: </em></p>
<ol>
<li>
<div><em>you will achieve higher reputation in the community or </em></div>
</li>
<li>
<div><em>you will get authority over other users in the community or</em></div>
</li>
<li>
<div><em>you will be financially rewarded by the portal etc.</em></div>
</li>
</ol>
<p><em>My research on contribution measurement for rewarding users could very well be applied in this case. I am also interested in finding quality assessment models that are not based on user feedback, because at times judging user feedback is itself a big concern and a possible avenue for fraud. I am planning to look at some work in the field of automated essay grading (AES), and luckily we have a very strong group at Curtin University working on AES. </em></p>
<p><strong><em>Cite this article as<br />
</em></strong>Review on &#8220;<em>Finding High Quality Content in Social Media</em>&#8221; by V. Potdar, 28th Feb, 2008. Available online &#8211; <a href="http://drvidy.wordpress.com/2008/02/28/finding-high-quality-content-in-social-media/">http://drvidy.wordpress.com/2008/02/28/finding-high-quality-content-in-social-media/</a></p>
<p><span id="more-28"></span></p>
<p><strong>Abstract</strong></p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Sentiment Classification</div>
</li>
<li>
<div>Graph Based Framework</div>
</li>
<li>
<div>User Roles &#8211; Asker, Answerer, Evaluator</div>
</li>
<li>
<div>Expertise Rank</div>
</li>
<li>
<div>Data Readability (Gunning-Fog Index, Flesch-Kincaid Formula, SMOG Grading)</div>
</li>
<li>
<div>Formality Score</div>
</li>
<li>
<div>KL Divergence</div>
</li>
<li>
<div>Intrinsic Content Quality &amp; Measurement</div>
</li>
</ol>
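<p>Several of the terms above are standard formulas, so a small sketch may help when reading the paper. The code below computes the three readability scores and the KL divergence from raw counts, using their commonly published forms. The function names are my own, the counts (words, sentences, syllables) are assumed to come from some tokenizer that is not shown, and how the paper itself applies these measures is not covered in this review.</p>

```python
import math


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning-Fog index; complex_words = words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def smog_grade(polysyllables, sentences):
    """SMOG grade; polysyllables = count of words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30.0 / sentences)) + 3.1291


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for two discrete distributions.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever
    p_i is positive, otherwise the divergence is undefined.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

<p>All three readability scores rise as sentences get longer or words get more complex, so they can be compared across posts but not directly with each other. The KL divergence is zero only when the two distributions are identical.</p>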
<p><strong>Reference Sheet</strong></p>
<p><strong>Useful References</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/28/finding-high-quality-content-in-social-media/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review on &#8211; Assessing Discussion Forum Participation: In Search of Quality</title>
		<link>http://drvidy.wordpress.com/2008/02/27/assessing-discussion-forum-participation-in-search-of-quality/</link>
		<comments>http://drvidy.wordpress.com/2008/02/27/assessing-discussion-forum-participation-in-search-of-quality/#comments</comments>
		<pubDate>Wed, 27 Feb 2008 04:13:34 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Data Quality Measurement]]></category>
		<category><![CDATA[cognitive communication]]></category>
		<category><![CDATA[content analysis]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[discourse analysis]]></category>
		<category><![CDATA[forum activity measurement]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=26</guid>
		<description><![CDATA[Authors &#8211; Stephen Corich, Kinshuk, and Lynn M. Hunt Year &#8211; 2004 Published in &#8211; Not Available Link &#8211; http://itdl.org/Journal/Dec_04/article01.htm Importance to my Research &#8211; Medium MY REVIEW The research aims to compare two different content analysis models, proposed by Hara et al. (2000) and Garrison et al. (2001), in an attempt to assess the possibility of [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Stephen Corich, <a href="http://scis.athabascau.ca/scis/staff/index.jsp?ct=kinshuk&amp;sn=director">Kinshuk</a>, and Lynn M. Hunt</p>
<p><strong>Year</strong> &#8211; 2004</p>
<p><strong>Published in</strong> &#8211; Not Available</p>
<p><strong>Link</strong> &#8211; <a href="http://itdl.org/Journal/Dec_04/article01.htm">http://itdl.org/Journal/Dec_04/article01.htm</a></p>
<p><strong>Importance to my Research</strong> &#8211; Medium</p>
<p><strong><u><em>MY REVIEW</em><br />
</u></strong><em>The research aims to compare two different content analysis models, proposed by <a href="http://crlt.indiana.edu/publications/journals/techreport.pdf">Hara et al. (2000)</a> and <a href="http://www.communityofinquiry.com/files/CogPres_Final.pdf">Garrison et al. (2001)</a>, in an attempt to assess the possibility of automatically determining the level of critical thinking amongst undergraduate students in an online discussion forum. These two models belong to the quantitative content analysis category. </em></p>
<p><em>A <u>sentence </u>was used as a human cognitive <u>unit of analysis </u>in this research when evaluating the above models. A total of 74 posts were made in 3 weeks, which resulted in 484 sentences. These sentences were hand coded against the two models by the instructor. I am not sure what this means; I should study the two models first before writing a detailed review here. However, the study concluded that the Garrison et al. (2001) method was more accurate than that of Hara et al. (2000). </em></p>
<p><em>I will come back to this post later once I finish reading the above papers. There are quite a few interesting references in this paper, which are worth a read. </em></p>
<p><strong><em>Cite this review as</em><br />
</strong>Review on <em>&#8220;Assessing Discussion Forum Participation: In Search of Quality&#8221;</em> by V. Potdar, 27th Feb. 2008, Available Online &#8211; <a href="http://drvidy.wordpress.com/2008/02/27/assessing-discussion-forum-participation-in-search-of-quality/">http://drvidy.wordpress.com/2008/02/27/assessing-discussion-forum-participation-in-search-of-quality/</a></p>
<p><span id="more-26"></span></p>
<p><strong>Abstract<br />
</strong>The flexibility that e-learning offers and the growing maturity of e-learning management systems has led to a rapid growth in the acceptance of e-learning as a method of delivering educational and vocational training. The use of computer-mediated communication (CMC) tools, and in particular asynchronous discussion forums, as a means of promoting communication and collaboration between e-learning participants has led to a growing interest by the academic and training community in the pedagogical value of such tools.</p>
<p>This paper looks at the role of asynchronous discussion forums in e-learning and attempts to address the issue of the quality of interaction of discussion forum participants. A number of measurement models are investigated and two of them are used to assess the quality of forum contribution for students participating in a first year undergraduate degree course. The paper concludes by attempting to identify areas where the models could be improved and discusses areas for future study.</p>
<p>The paper will be of interest to those who are involved in delivering e-learning courses and who would like to use discussion forums as a possible assessment tool. It would also be of value to learners who choose to enroll in distance learning courses and who are asked to participate in assessed discussion forum debate.</p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Computer Mediated Communication </div>
</li>
<li>
<div>Discourse Analysis</div>
</li>
<li>
<div>Assessed Discussion Forums</div>
</li>
<li>
<div>Activity Measurement on Discussion Forums</div>
</li>
<li>
<div>Learning Management Systems</div>
</li>
<li>
<div>Community of Learning Model</div>
</li>
<li>
<div>Ethnographic Study</div>
</li>
<li>
<div>Quantitative Content Analysis Model</div>
</li>
<li>
<div>Coefficient of Reliability (Holsti, 1969)</div>
</li>
</ol>
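<p>The last term, Holsti's coefficient of reliability, measures how closely two coders agree when they code the same material. It is usually written as CR = 2M / (N1 + N2), where M is the number of coding decisions the coders agree on and N1 and N2 are the numbers of decisions each coder made. Here is a minimal sketch, assuming both coders label every sentence; the function name and the sample labels are made up for illustration and are not data from the paper.</p>

```python
def holsti_cr(codes_a, codes_b):
    """Holsti's coefficient of reliability: CR = 2M / (N1 + N2).

    Assumes both coders labelled the same units in the same order.
    """
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must label the same units")
    if not codes_a:
        raise ValueError("need at least one coded unit")
    # M: number of units on which the two coders chose the same category
    agreements = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
    return 2 * agreements / (len(codes_a) + len(codes_b))


# Hypothetical codes for five sentences, using the four cognitive-presence stages
coder_1 = ["triggering", "exploration", "exploration", "integration", "resolution"]
coder_2 = ["triggering", "exploration", "integration", "integration", "resolution"]
print(holsti_cr(coder_1, coder_2))  # 4 agreements out of 5 units per coder
```

<p>Note that this coefficient does not correct for agreement that would happen by chance, so it tends to look better when there are only a few categories.</p>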
<p><strong>Reference Sheet<br />
</strong>Discussion forums are said to allow students to see different perspectives which can help to foster new meaning construction (Heller &amp; Kearsley, 1996; Ruberg et al., 1996).</p>
<p>Discussion forums encourage student ownership of learning and collaborative problem-solving skills (Becker, 1992).  </p>
<p>They encourage participants to put their thoughts into writing in a way that others can understand, promoting self-reflective dialogue and dialogue with others (Valacich, Dennis, &amp; Connolly, 1994).</p>
<p>Discussion forums have the potential to expose students to a broader range of views than face-to-face talk, and hence enable them to develop more complex perspectives on a topic (Prain and Lyons, 2000).</p>
<p>Methods for assessing the quality of online discussion can be categorised into four different sets:</p>
<ol>
<li>
<div>Level of Disagreement</div>
</li>
<li>
<div>Argument Structure Analysis</div>
</li>
<li>
<div>Interaction Based</div>
</li>
<li>
<div>Content Analysis</div>
</li>
</ol>
<p>Studies belonging to the level of disagreement category adopt the approach of coding messages according to the level of disagreement that is exhibited in relation to previous posting.</p>
<p>The argument structure analysis category codes messages according to the argument quality demonstrated by participants.</p>
<p>Interaction-based coding methods place an emphasis on the message as part of a larger discussion.</p>
<p>The content analysis approach codes messages according to the message type. A review of literature suggests that content analysis is the most popular approach used by researchers to evaluate quality in discussion forum postings.</p>
<p><strong>Henri (1992)</strong> developed an <strong><em>analytical model</em></strong> that highlights <strong><em>five dimensions of the learning process </em></strong>that can be found in messages.</p>
<p><strong>Newman et al. (1995) </strong>developed an <strong><em>analytical method </em></strong>for the study of critical thinking, which presented a <strong><em>list of indicators of critical thinking</em></strong>.</p>
<p><strong>Garrison et al. (2000)</strong> assessed inquiry capabilities as well as critical thinking through a three-dimensional model which measured <em><strong>cognitive presence, teaching presence, and social presence</strong></em>.</p>
<p><strong>Henri (1992) </strong>identified the following five dimensions which can be used to evaluate CMC: participative, social, interactive, cognitive and metacognitive. The cognitive and metacognitive dimensions measure reasoning, critical thought and self-awareness and as such are more likely to be of interest when attempting to <strong><em>reward participants for assessed discussion forum contribution</em></strong>.</p>
<p>The <strong><em>Community of Learning</em></strong> model assumes that learning occurs through the interaction of <em>three core components</em>:</p>
<ol>
<li>
<div><strong>cognitive presence</strong> &#8211; the extent to which the participants in any particular configuration of a community are able to construct meaning through sustained communication</div>
</li>
<li>
<div><strong>social presence </strong>- deals with all those declarations of the students or tutors where the creation of a dynamic group is promoted, including social relationships, expressions of emotions, and affirmation messages</div>
</li>
<li>
<div><strong>teaching presence </strong>- considers the interactions of teachers and students, as they formulate questions, expose ideas and answer questions.</div>
</li>
</ol>
<p>The cognitive presence concept was expanded by Garrison, Anderson, &amp; Archer (2001) into a four-stage cognitive-processing model, which was used to assess critical thinking skills in on-line discussions. The model classified student responses into <strong><em>triggering, exploration, integration </em></strong>and <strong><em>resolution </em></strong>categories.</p>
<p><strong>Useful References<br />
</strong>Campos, M. (2004, April). <a href="http://www.sloan-c.org/publications/jaln/v8n2/pdf/v8n2_campos.pdf">A Constructivist method for the analysis of networked cognitive communication and the assessment of collaborative learning and knowledge building</a>. Journal of Asynchronous Learning Networks, 8(2), 1-29.</p>
<p>Garrison, D.R. (2000). Theoretical challenges for distance education in the twenty-first century: A shift from structural to transactional issues. International Review of Research in Open and Distance Learning, 1(1). Available: <a href="http://www.icap.org/iuicode?149.1.1.2">http://www.icap.org/iuicode?149.1.1.2</a></p>
<p>Garrison, D.R., Anderson, T., &amp; Archer, W. (2000). Critical inquiry in a text-based environment: Computer conferencing in higher education. The Internet and Higher Education, 2(2-3), 87-105.</p>
<p>Garrison, D. R., Anderson, T., and Archer, W. (2001). <a href="http://www.communityofinquiry.com/files/CogPres_Final.pdf">Critical Thinking, Cognitive Presence, and Computer Conferencing in Distance Education</a>. The American Journal of Distance Education 15(1),  7–23.</p>
<p>Hara, N., Bonk, C. J., &amp; Angeli, C. (2000). <a href="http://crlt.indiana.edu/publications/journals/techreport.pdf">Content analysis of online discussion in an applied educational psychology course</a>. Instructional Science, 28, 115-152. [<a href="http://crlt.indiana.edu/publications/journals/techreport.pdf">additional source</a>]</p>
<p>Henri, F. (1992). Computer conferencing and content analysis. In A. R. Kaye (Ed.), Collaborative learning through computer conferencing: The Najaden Papers (pp. 116-136). Berlin: Springer-Verlag.</p>
<p>Järvelä, S., &amp; Häkkinen, P. (2002). <a href="http://www.edu.helsinki.fi/svy/kvali/neuro/mat/artikkeli_3_%20monimen.pdf">The levels of Web-based discussions—Using perspective-taking theory </a>as an analysis tool. In H. Van Oostendorp (Ed.), Cognition in a digital world (pp. 77-95). Mahwah, NJ: Erlbaum.</p>
<p>Meyer, K.A. (2004, April). <a href="http://www.sloan-c.org/publications/JALN/v8n2/pdf/v8n2_meyer.pdf">Evaluating online discussions: four different frames of analysis</a>. Journal of Asynchronous Learning Networks, 8(2), 101-114.</p>
<p>Newman, G., Webb, B., &amp; Cochrane, C. (1995). <a href="http://www.qub.ac.uk/mgt/papers/methods/contpap.html">A content analysis method to measure critical thinking in face-to-face computer supported group learning</a>. Interpersonal Computing and Technology, 3(2), 56-77.</p>
<p>Schaeffer, E. L., McGrady, J. A., Bhargava, T., &amp; Engel, C. (2002, April). <a href="http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1a/18/49.pdf">Online debate to encourage peer interactions in the large lecture setting: Coding and analysis of forum activity</a>. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.</p>
<p>Spatariu, A., Hartley, K. &amp; Bendixen, L.D. (2004, Spring) <a href="http://www.ncolr.org/jiol/issues/PDF/2.4.2.pdf">Defining and Measuring Quality in On-line Discussion</a>. Journal of Interactive Online Learning. Vol 2(4).</p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/27/assessing-discussion-forum-participation-in-search-of-quality/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review on &#8211; Automatically Assessing the Post Quality in Online Discussions on Software</title>
		<link>http://drvidy.wordpress.com/2008/02/27/automatically-assessing-the-post-quality-in-online-discussions-on-software/</link>
		<comments>http://drvidy.wordpress.com/2008/02/27/automatically-assessing-the-post-quality-in-online-discussions-on-software/#comments</comments>
		<pubDate>Wed, 27 Feb 2008 02:54:26 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Data Quality Measurement]]></category>
		<category><![CDATA[forum data quality]]></category>
		<category><![CDATA[post data quality]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=25</guid>
		<description><![CDATA[Authors &#8211; Markus Weimer and Iryna Gurevych and Max Muhlhauser Year &#8211; 2007 Published in &#8211; Proceedings of the Association for Computational Linguistics (ACL) Link &#8211; http://elara.tk.informatik.tu-darmstadt.de/publications/2007/acl2007.pdf Importance to my Research &#8211; Very High MY REVIEW The main idea explored in the present paper is the feasibility of automatically assessing the perceived [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Markus Weimer and Iryna Gurevych and Max Muhlhauser</p>
<p><strong>Year</strong> &#8211; 2007</p>
<p><strong>Published in</strong> &#8211; Proceedings of the Association for Computational Linguistics (ACL)</p>
<p><strong>Link</strong> &#8211; <a href="http://elara.tk.informatik.tu-darmstadt.de/publications/2007/acl2007.pdf">http://elara.tk.informatik.tu-darmstadt.de/publications/2007/acl2007.pdf</a></p>
<p><strong>Importance to my Research</strong> &#8211; Very High</p>
<p><em><strong><u>MY REVIEW<br />
</u></strong>The main idea explored in the present paper is the feasibility of automatically assessing the perceived quality of user generated content. As the authors say, some work has been done on automatic assessment of other types of user generated content, such as essays, product reviews or student online discussions. However, this is the first work that attempts to analyze the quality of forum posts. The work attempts to develop a system that can automatically adapt itself to the quality standards existing in any user community by learning the relation between a set of features and the perceived quality of posts.</em></p>
<p><em>It is not clear whether the forum post selection is done automatically or manually (1968 posts were selected in the experiments); however, the authors outline that some posts were deleted (i.e. <strong>posts with spam, attachments and those that were not in English</strong>). I am not sure how practical this would be, i.e. if you want to divide the posts into <strong>good</strong> and <strong>bad</strong>, can that be done automatically or not?</em></p>
<p><em><strong>Removing posts with attachments is not a good idea</strong>: it is quite possible that the user has attached some useful documents rather than copying the content into the post. So I am pretty much against this constraint. In some domains a picture may be worth a thousand words, so excluding them outright is not reasonable. This should be considered by the authors in future work.</em></p>
<p><em>Having said that, I find the <strong>major drawback</strong> of this work is that it decides the quality of posts by relying heavily on the ratings that users provide. If no ratings are provided by users, this algorithm would not be useful. The authors also point out in the introduction that only 0.1% of posts in Nabble are rated by users, so I am not sure that assessing quality for such a minor fraction is going to be of much help. It is also not clear from the paper how the <strong>Syntactic Features </strong>are used in the assessment.</em></p>
<p><em>It is also intresting to find some more forum specific features for classification, other than IsHTML, IsMail, Quote Fraction, URL &amp; Path Count. The work done by </em><a href="http://kevinchai.net/publications/"><em>Kevin Chai </em></a><em>on assessing user contribution identifies around 16 different forum features.</em></p>
<p><em>Overall it was a good paper to read and I really found this work very useful.</em></p>
<p><strong>Cite this review as<br />
</strong>Review on <em>&#8220;Automatically Assessing the Post Quality in Online Discussions on Software&#8221;</em> by V. Potdar, 27th Feb. 2008, Available Online &#8211; <a href="http://drvidy.wordpress.com/2008/02/27/automatically-assessing-the-post-quality-in-online-discussions-on-software/">http://drvidy.wordpress.com/2008/02/27/automatically-assessing-the-post-quality-in-online-discussions-on-software/</a></p>
<p><span id="more-25"></span><strong></strong></p>
<p><strong>Abstract<br />
</strong>Assessing the quality of user generated content is an important problem for many web forums. While quality is currently assessed manually, we propose an algorithm to assess the quality of forum posts automatically and test it on data provided by Nabble.com. We use state-of-the-art classification techniques and experiment with five feature classes: Surface, Lexical, Syntactic, Forum specific and Similarity features. We achieve an accuracy of 89% on the task of automatically assessing post quality in the software domain using forum specific features. Without forum specific features, we achieve an accuracy of 82%.</p>
<p><strong>Important Terms </strong></p>
<ol>
<li>
<div>Speech Act Analysis</div>
</li>
<li>
<div>Most Authoritative Answer</div>
</li>
<li>
<div>Author&#8217;s Trustworthiness</div>
</li>
<li>
<div><a href="http://elara.tk.informatik.tu-darmstadt.de/publications/2007/acl2007.pdf">Automatic Quality Classification</a></div>
</li>
<li>
<div>Perceived Quality of Posts</div>
</li>
<li>
<div>Feature Classes (Surface, Lexical, Syntactic, Forum Specific and Similarity features)</div>
</li>
<li>
<div>Feature Vector &amp; Feature Values</div>
</li>
<li>
<div>Support Vector Machines</div>
</li>
<li>
<div>Confusion Matrix</div>
</li>
<li>
<div>Unigram Vector (Posts UV vs. Forums UV)</div>
</li>
</ol>
<p><strong>Reference Sheet<br />
</strong>End users have trouble navigating large repositories of information and finding high-quality information quickly.</p>
<p>The main idea explored in the present paper is to investigate the feasibility of  automatically assessing the perceived quality of user generated content.</p>
<p>The perceived quality is not an objective measure. Rather, it models how the community at large perceives post quality. We choose a machine learning approach to automatically assess it.</p>
<p>Feng et al. (2006) and Kim et al. (2006a) describe a system to find the most authoritative answer in a forum thread. The latter add speech act analysis as a feature for this classification.</p>
<p>Another feature is the author’s trustworthiness, which could be computed based on the automatic quality classification scheme proposed in the present paper.</p>
<p>Finding the most authoritative post could also be defined as a special case of the quality assessment. However, it is definitely different from the task studied in the present paper.</p>
<p>We assess the perceived quality of a given post, based solely on its intrinsic features. Any discussion thread may contain an indefinite number of good posts, rather than a single authoritative one.</p>
<p>Assumptions by Authors &#8211; Posts with less than three stars are rated “bad”. Posts with more than three stars are “good”.</p>
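<p>A minimal Python sketch of this labelling assumption follows. The function name <code>label_post</code> is hypothetical, and leaving posts with exactly three stars unlabelled is my own assumption, since the statement above does not say how such posts are treated.</p>

```python
from typing import Optional


def label_post(stars: float) -> Optional[str]:
    """Map a star rating to the binary quality label described above."""
    if stars == 3:
        # Assumption: exactly three stars is neither "good" nor "bad".
        return None
    # More than three stars is "good"; anything else is "bad".
    return "good" if stars > 3 else "bad"


# Example ratings turned into training labels, dropping unlabelled posts.
ratings = [5, 1, 3, 4, 2]
labels = [label_post(r) for r in ratings if label_post(r) is not None]
```

<p>Labels produced this way would serve as the target values for a classifier, with the feature classes listed earlier as inputs.</p>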
<p><strong>Useful References<br />
</strong>Chih-Chung Chang and Chih-Jen Lin, 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.</p>
<p>Donghui Feng, Erin Shaw, Jihie Kim, and Eduard Hovy. 2006. Learning to detect conversation focus of threaded discussions. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL).</p>
<p>Jihie Kim, Grace Chern, Donghui Feng, Erin Shaw, and Eduard Hovy. 2006a. Mining and assessing discussions on the web through speech act analysis. In Proceedings of the Workshop on Web Content Mining with Human Language Technologies at the 5th International Semantic Web Conference.</p>
<p>Jihie Kim, Erin Shaw, Donghui Feng, Carole Beal, and Eduard Hovy. 2006b. Modeling and assessing student activities in on-line discussions. In Proceedings of the Workshop on Educational Data Mining at the conference of the American Association of Artificial Intelligence (AAAI-06), Boston, MA.</p>
<p>Soo-Min Kim, Patrick Pantel, Tim Chklovski, and Marco Pennacchiotti. 2006c. Automatically assessing review helpfulness. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 423 – 430, Sydney, Australia, July.</p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/27/automatically-assessing-the-post-quality-in-online-discussions-on-software/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review &#8211; Web Spam Taxonomy</title>
		<link>http://drvidy.wordpress.com/2008/02/26/web-spam-taxanomy/</link>
		<comments>http://drvidy.wordpress.com/2008/02/26/web-spam-taxanomy/#comments</comments>
		<pubDate>Tue, 26 Feb 2008 02:03:18 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Spam]]></category>
		<category><![CDATA[taxonomy]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=24</guid>
		<description><![CDATA[Authors &#8211; Zoltan Gyongyi, Hector Garcia-Molina Year &#8211; 2006 Published in &#8211; In 1st International Workshop on Adversarial Information Link &#8211; http://airweb.cse.lehigh.edu/2005/gyongyi.pdf MY REVIEW I was planning to write a review for this paper, however I found that Mr Pedram Hayati has written an excellent review on this article. So I got a bit lazy [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=drvidy.wordpress.com&amp;blog=2785435&amp;post=24&amp;subd=drvidy&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><strong>Authors</strong> &#8211; Zoltan Gyongyi, Hector Garcia-Molina</p>
<p><strong>Year</strong> &#8211; 2006</p>
<p><strong>Published in</strong> &#8211; 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb)</p>
<p><strong>Link</strong> &#8211; <a href="http://airweb.cse.lehigh.edu/2005/gyongyi.pdf">http://airweb.cse.lehigh.edu/2005/gyongyi.pdf</a></p>
<p><em><strong><u>MY REVIEW<br />
</u></strong>I was planning to write a review of this paper; however, I found that </em><a href="http://pi3ch.wordpress.com/"><em>Mr Pedram Hayati</em></a><em> has written an excellent review of it, so I got a bit lazy <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . Please find the review on &#8220;Web Spam Taxonomy&#8221; at </em><a href="http://pi3ch.wordpress.com/2008/02/22/web-spam-taxonomy/"><em>http://pi3ch.wordpress.com/2008/02/22/web-spam-taxonomy/</em></a><em>. You can still have a look at my </em><a href="http://drvidy.wordpress.com/2008/02/26/web-spam-taxanomy/"><em>reference sheet</em></a><em>, which lists some key statements made by the authors.</em></p>
<p><strong>Cite this article as<br />
</strong>Review on &#8220;<em>Web Spam Taxonomy</em>&#8221; by Pedram Hayati, 23rd Feb, 2008. Available Online at &#8211; <a href="http://pi3ch.wordpress.com/2008/02/22/web-spam-taxonomy/">http://pi3ch.wordpress.com/2008/02/22/web-spam-taxonomy/</a></p>
<p><strong><span id="more-24"></span> Abstract<br />
</strong>Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.</p>
<p><strong>Important Terms</strong></p>
<ul>
<li>
<div>Spamdexing</div>
</li>
<li>
<div>Term Spamming</div>
</li>
</ul>
<p><strong>Reference Sheet</strong></p>
<p>The primary consequence of web spamming is that the quality of search results decreases.</p>
<p>The secondary consequence of spamming is that search engine indexes are inflated with useless pages, increasing the cost of each processed query.</p>
<p>We believe that the first step in combating spam is understanding it, that is, analyzing the techniques the spammers use to mislead search engines.</p>
<p>The objective of a search engine is to provide high-quality results by correctly identifying all web pages that are relevant for a specific query, and presenting the user with some of the most important of those relevant pages.</p>
<p>Importance refers to the global (query-independent) popularity of a page, as often inferred from the link structure (e.g., pages with many incoming links are more important), or perhaps other indicators.</p>
<p>There are <strong>two categories</strong> of <strong>techniques </strong>associated with <strong>web spam</strong>. The first category includes the <strong><em>boosting techniques</em></strong>, i.e., methods through which one seeks to achieve high relevance and/or importance for some pages.</p>
<p>The second category includes <strong><em>hiding techniques</em></strong>, methods that by themselves do not influence the search engine&#8217;s ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users.</p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/26/web-spam-taxanomy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review &#8211; A Quantitative Study of Forum Spamming Using Context-based Analysis</title>
		<link>http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/</link>
		<comments>http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/#comments</comments>
		<pubDate>Thu, 21 Feb 2008 06:35:19 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Spam]]></category>
		<category><![CDATA[forums]]></category>
		<category><![CDATA[splogs]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/</guid>
		<description><![CDATA[Authors -Yuan Niu, Yi-Min Wang, Hao Chen, Ming Ma, and Francis Hsu Year &#8211; 2007 Published in- Proceedings of the Network &#38; Distributed System Security (NDSS) Symposium Link &#8211; ftp://ftp.research.microsoft.com/pub/tr/TR-2006-173.pdf Importance to my Research - High MY REVIEW The approach presented by the authors is very interesting and practical, however I found one major weakness and [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=drvidy.wordpress.com&amp;blog=2785435&amp;post=21&amp;subd=drvidy&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><strong>Authors </strong>-Yuan Niu, Yi-Min Wang, Hao Chen, Ming Ma, and Francis Hsu</p>
<p><strong>Year</strong> &#8211; 2007</p>
<p><strong>Published in</strong>- Proceedings of the Network &amp; Distributed System Security (NDSS) Symposium</p>
<p><strong>Link</strong> &#8211; <a href="ftp://ftp.research.microsoft.com/pub/tr/TR-2006-173.pdf">ftp://ftp.research.microsoft.com/pub/tr/TR-2006-173.pdf</a></p>
<p><strong>Importance to my Research </strong>- High</p>
<p><em><strong><u>MY REVIEW<br />
</u></strong>The approach presented by the authors is very interesting and practical; however, I found one major weakness: the authors should have considered how often spammers change domain names. If the domain names change on a weekly basis, the Strider URL Tracer has to run every week to find the spam domains and blacklist them. This can be computationally expensive and may incur a lot of network traffic, not only for the spam domains but also for the non-spam domains.</em></p>
<p><em>SMF Forum is not studied in this project. I assume it is one of the most popular forum software packages; perhaps it was excluded because it has very good spam protection.</em></p>
<p><em>Another thought, which came to me while reading a different paper, may be useful here as well: if blogs or forums are to be robust against spam, the blog and forum providers should work hand in hand with the search engines. Anti-spamming features should be enabled in blogs and forums. Some heuristics or practical traits can be checked when comments are added, e.g.</em></p>
<p><em>1. How long does it take to post a comment? If a comment is 1000 words in length and is posted in a fraction of a second, it is more likely to be hijacked content or spam, i.e. a copy-and-paste scenario. This information is normally not displayed on the blog or forum, but if it were made publicly accessible it would benefit the search engines. At the least, forum or blog operators could implement the check themselves to control the spread of spam in the first place.</em></p>
<p><em>2. A second thought concerns email spam. Email providers may want to check how quickly people delete emails from their mailbox: if the time between logging in and deleting many emails is very short, that should be an indicator of spam emails. On the other hand, it could also point to users who have subscribed to mailing lists but are no longer interested in them.</em></p>
<p><em>3. Another way comment spam can be addressed would be by using </em><a href="http://www.captcha.net/"><em>CAPTCHA</em></a><em>. I am not sure why this simple option is not implemented in many forums and blogs.</em></p>
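<p>The first heuristic can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name <code>looks_pasted</code>, the availability of a compose time in seconds, and the 120 words-per-minute ceiling are all hypothetical and would need tuning on real forum data.</p>

```python
def looks_pasted(word_count: int, seconds_to_compose: float,
                 max_words_per_minute: float = 120.0) -> bool:
    """Flag a comment whose implied typing speed is implausibly high."""
    # Clamp the compose time so zero or negative timings cannot divide by zero.
    minutes = max(seconds_to_compose, 0.001) / 60.0
    words_per_minute = word_count / minutes
    return words_per_minute > max_words_per_minute


# A 1000-word comment submitted in half a second is flagged;
# a 100-word comment written over five minutes is not.
suspicious = looks_pasted(1000, 0.5)
ordinary = looks_pasted(100, 300)
```

<p>A flag like this would only be one signal among several, since legitimate users also paste text they drafted elsewhere.</p>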
<p><strong>Cite this article as<br />
</strong>Review on <em>&#8220;A Quantitative Study of Forum Spamming Using Context-based Analysis&#8221; </em>by V. Potdar, 21st Feb 2008. Available online &#8211; <a href="http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/">http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/</a></p>
<p><span id="more-21"></span><strong></strong></p>
<p align="left"><strong>Abstract<br />
</strong>Forum spamming has become a major means of search engine spamming. To evaluate the impact of forum spamming on search quality, we have conducted a comprehensive study from three perspectives: that of the search user, the spammer, and the forum hosting site. We examine spam blogs and spam comments in both legitimate and honey forums. Our study shows that forum spamming is a widespread problem. Spammed forums, powered by the most popular software, show up in the top 20 search results for all the 189 popular keywords. On two blog sites, more than half (75% and 54% respectively) of the blogs are spam, and even on a major and reputably well maintained blog site, 8.1% of the blogs are spam. The observation on our honey forums confirms that spammers target abandoned pages and that most comment spam is meant to increase page rank rather than generate immediate traffic. We propose context-based analyses, consisting of redirection and cloaking analysis, to detect spam automatically and to overcome shortcomings of content-based analyses. Our study shows that these analyses are very effective in identifying spam pages.</p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>Content Based Analysis</div>
</li>
<li>
<div>Cloaking Analysis <em>(Click-through Cloaking)</em></div>
</li>
<li>
<div>URL Redirection</div>
</li>
<li>
<div>Spammer&#8217;s Modus Operandi</div>
</li>
<li>
<div><a href="http://research.microsoft.com/HoneyMonkey/NDSS_2006_HoneyMonkey_Wang_Y_camera-ready.pdf">Honey Monkey </a>(<em>aka Monkey Program</em>)</div>
</li>
<li>
<div>Doorway Pages</div>
</li>
<li>
<div>Context based spam detection</div>
</li>
<li>
<div><a href="http://research.microsoft.com/URLTracer/">Strider URL Tracer</a></div>
</li>
<li>
<div>Behaviour &amp; Signature based spam detection</div>
</li>
</ol>
<p><strong>Reference Sheet<br />
</strong>Anecdotal evidence as well as our own experience indicates that spammers have successfully promoted their web sites in search results through forum spamming.</p>
<p>The observation on our honey forums confirms that spammers target abandoned pages and that most comment spam is meant to increase page rank rather than generate immediate traffic.</p>
<p>The tracer provides a key functionality called the Top Domain view: given a list of (primary) URLs, the tracer launches an actual browser to visit each URL and records all secondary URLs visited as a result.</p>
<p>At the end of the batched scan, the Top Domain view provides the list of third-party domains that received secondary-URL traffic and ranks them by the number of primary URLs that generated traffic to them.</p>
<p>If the input is a list of highly suspicious spam URLs (such as those collected from a spammed forum), the Top Domain view highlights those behind-the-scenes spammer domains that are associated with a large number of doorway URLs.</p>
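<p>The ranking step of the Top Domain view can be approximated with a simple count. The sketch below is not the Strider URL Tracer itself: it assumes the browser traces are already available as a dictionary (the hypothetical <code>traces</code>) mapping each primary URL to the secondary URLs recorded for it, and it compares full hostnames rather than registered domains. All example URLs are made up.</p>

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(traces):
    """Rank third-party hosts by how many primary URLs sent traffic to them."""
    counts = Counter()
    for primary, secondaries in traces.items():
        primary_host = urlparse(primary).hostname
        # Use a set so each host is counted once per primary URL.
        hosts = {urlparse(url).hostname for url in secondaries}
        hosts.discard(primary_host)  # keep third-party hosts only
        hosts.discard(None)          # drop URLs without a hostname
        counts.update(hosts)
    return counts.most_common()


# Hypothetical traces from three spammed forum pages.
traces = {
    "http://forum-a.example/post1": ["http://spam.example/x",
                                     "http://forum-a.example/img.png"],
    "http://forum-b.example/post2": ["http://spam.example/y",
                                     "http://ads.example/z"],
    "http://forum-c.example/p": ["http://spam.example/y",
                                 "http://spam.example/w"],
}
ranking = top_domains(traces)
```

<p>In this made-up input, the host reached from all three doorway pages ends up at the top of the ranking, which is the behind-the-scenes pattern the paragraph above describes.</p>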
<p>To identify heavily spammed forums, we looked for sites that appeared in the search results for different keywords as well as sites whose pages appeared multiple times in the results for a single keyword.</p>
<p><strong>Related Useful References</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/21/a-quantitative-study-of-forum-spamming-using-context-based-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
		<item>
		<title>Review &#8211; Bias and Controversy: Beyond the Statistical Deviation</title>
		<link>http://drvidy.wordpress.com/2008/02/20/bias-and-controversy-beyond-the-statistical-deviation/</link>
		<comments>http://drvidy.wordpress.com/2008/02/20/bias-and-controversy-beyond-the-statistical-deviation/#comments</comments>
		<pubDate>Wed, 20 Feb 2008 10:59:12 +0000</pubDate>
		<dc:creator>vidy</dc:creator>
				<category><![CDATA[Paper Reviews]]></category>
		<category><![CDATA[Controversy]]></category>
		<category><![CDATA[Evaluating Bias]]></category>

		<guid isPermaLink="false">http://drvidy.wordpress.com/?p=20</guid>
		<description><![CDATA[Authors - Hady W. Lauw, Ee-Peng Lim, and Ke Wang Year &#8211; 2006 Published in - Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD&#8217;06) Link - http://www.hadylauw.com/kdd06.pdf Importance to my Research - High MY REVIEW coming soon&#8230;.. Cite this review as  Review on “Bias and Controversy: Beyond the Statistical Deviation” by [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=drvidy.wordpress.com&amp;blog=2785435&amp;post=20&amp;subd=drvidy&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><strong>Authors </strong>- <a href="http://www.hadylauw.com/">Hady W. Lauw</a>, Ee-Peng Lim, and Ke Wang</p>
<p><strong>Year &#8211; 2006</strong></p>
<p><strong>Published in </strong>- Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD&#8217;06)</p>
<p><strong>Link </strong>- <a href="http://www.hadylauw.com/kdd06.pdf">http://www.hadylauw.com/kdd06.pdf</a></p>
<p><strong>Importance to my Research </strong>- High</p>
<p><em><strong><u>MY REVIEW<br />
</u></strong>coming soon&#8230;..</em></p>
<p><strong>Cite this review as</strong> <br />
Review on “Bias and Controversy: Beyond the Statistical Deviation” by V. Potdar, 20th Feb 2008, Available Online <a href="http://drvidy.wordpress.com/">http://drvidy.wordpress.com/2008/02/20/bias-and-controversy-beyond-the-statistical-deviation/</a><br />
<span id="more-20"></span></p>
<p><strong>Important Terms</strong></p>
<ol>
<li>
<div>bipartite graph</div>
</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://drvidy.wordpress.com/2008/02/20/bias-and-controversy-beyond-the-statistical-deviation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/d7e932c20e7cb205e974d8e8cf16019f?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">vidy</media:title>
		</media:content>
	</item>
	</channel>
</rss>
