Detecting spam blogs from blog search results

How blog spammers manipulate search engines and how to stop them

Abstract

Blogging has become a major medium for self-expression and information sharing. However, the growth of spam blogs (splogs) significantly reduces the value of blog platforms and blog search engines. This research proposes a framework that detects splogs by monitoring online search results, focusing on splogs that have already slipped past existing spam filters.

The method records the temporal behavior of each blog in a “blog profile” and uses that profile to estimate how likely the blog is to be a splog. Experiments on real data indicate that this approach detects splogs accurately without modifying the existing blog search engine.


Key Contributions

  1. A splog detection framework that operates without training data or human input.

  2. Introduction of blog profiles based on temporal behavior.

  3. Spam-post scoring functions that quantify suspicious posting activity.

  4. An evaluation on 1.5 years of real-world data that shows high detection accuracy.


Problem Statement

Splogs aim to manipulate search engine rankings to gain traffic and promote products or services. Existing detection techniques (content-based, link-based, or collaborative) are limited by:

  • Dynamic content changes

  • Sparse blog link structures

  • User-generated noisy links (e.g., in comments)


Proposed Method

  • Monitor search results from real-time user queries.

  • Select a targeted query set that is more likely to attract splogs.

  • Analyze the top-k ranked blog posts returned for each query (typically k = 50).

  • Record blog activity in a blog profile as a sequence of time-stamped “blog state tuples.”

  • Use scoring functions to detect spam-posts and classify blogs as splogs.
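
The recording step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source does not list the fields of a blog state tuple, so the fields of `BlogStateTuple` (observation time, query, rank, post URL), the `(blog_url, post_url)` shape of a search hit, and the `record_results` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BlogStateTuple:
    """One time-stamped observation of a blog in a result list.

    The field set is assumed for illustration; the source only says that
    a profile is a sequence of time-stamped blog state tuples.
    """
    observed_at: datetime
    query: str
    rank: int       # 1-based position in the result list
    post_url: str


# Hypothetical shape of one search hit: (blog_url, post_url).
SearchHit = Tuple[str, str]


def record_results(
    profiles: Dict[str, List[BlogStateTuple]],
    query: str,
    hits: List[SearchHit],
    observed_at: datetime,
    k: int = 50,
) -> None:
    """Append one state tuple per top-k hit to the owning blog's profile."""
    for rank, (blog_url, post_url) in enumerate(hits[:k], start=1):
        profiles.setdefault(blog_url, []).append(
            BlogStateTuple(observed_at, query, rank, post_url)
        )
```

Calling `record_results` once per monitored query and per polling round grows each blog's profile over time, which is what later makes temporal analysis possible without touching the search engine itself.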


Assumptions

  1. Authentic blogs are more prevalent than splogs in search engine indexes.

  2. Splogs appear frequently in top results for certain high-traffic queries.


Modules in Framework

  • Spam-post Detection: Assigns a score to each blog post based on extracted features (e.g., similarity, repetition).

  • Splog Detection: Uses blog profiles to identify temporal patterns typical of splogs.
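
To make the two modules concrete, here is a hedged sketch. The source names similarity and repetition as example features but gives no formulas, so this example substitutes a common stand-in: Jaccard similarity over word shingles between a post and the blog's earlier posts. The function names, the thresholds (`post_threshold`, `blog_ratio`, `min_posts`), and the ratio-based blog rule are hypothetical and far simpler than a real temporal analysis of a blog profile.

```python
from typing import Sequence, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Return the set of n-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def spam_post_score(post: str, earlier_posts: Sequence[str]) -> float:
    """Score a post by its highest similarity to any earlier post.

    A score near 1.0 means the post largely repeats earlier content.
    """
    target = shingles(post)
    return max((jaccard(target, shingles(p)) for p in earlier_posts),
               default=0.0)


def is_splog(
    scores: Sequence[float],
    post_threshold: float = 0.8,  # assumed cut-off for a spam-post
    blog_ratio: float = 0.5,      # assumed share of spam-posts per blog
    min_posts: int = 3,           # assumed minimum evidence per blog
) -> bool:
    """Flag a blog when enough of its observed posts look like spam-posts."""
    if len(scores) < min_posts:
        return False
    flagged = sum(1 for s in scores if s >= post_threshold)
    return flagged / len(scores) >= blog_ratio
```

The `min_posts` guard reflects a design choice worth keeping in any variant: a blog observed only once or twice should not be classified, since the profile carries too little temporal evidence.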


Experimental Setup

  • Data collected from a popular blog search engine.

  • Two experiments conducted:

    • Evaluating scoring function effectiveness

    • Measuring the impact of varying detection parameters
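
The second experiment, varying a detection parameter, can be illustrated with a small sweep. This sketch assumes that each blog has a single aggregate score and that a manually labeled set of splogs is available for evaluation only; neither the score definition nor the metrics used in the source are specified here, so `precision_recall` and `sweep_threshold` are hypothetical helpers.

```python
from typing import Dict, Iterable, List, Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a predicted splog set.

    By convention here, precision is 0.0 when nothing is predicted and
    recall is 0.0 when there are no labeled splogs.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def sweep_threshold(
    blog_scores: Dict[str, float],
    labeled_splogs: Set[str],
    thresholds: Iterable[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each candidate threshold."""
    rows = []
    for threshold in thresholds:
        predicted = {blog for blog, score in blog_scores.items()
                     if score >= threshold}
        precision, recall = precision_recall(predicted, labeled_splogs)
        rows.append((threshold, precision, recall))
    return rows
```

Labels are used here only to measure the detector, which is consistent with a framework that needs no training data to run.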


Related Work

  • Earlier splog detection work focused on content analysis or user flagging.

  • This approach is the first to leverage live search results and temporal behavior without needing labeled datasets.


Conclusion

The study introduces an effective and flexible framework that detects splogs in real time using only the results of a blog search engine. It works independently of existing systems, so it can be paired with any blog search engine, and it offers a scalable and robust way to combat blog spam.