Red By Association
Posted by Shane Keats on December 16, 2005 06:37 PM
When we started SiteAdvisor last spring, we thought that our job would be relatively straightforward: sign up for stuff and download stuff and tell you the results. We knew it would be hard to do in practice, but at least it was a relatively predictable problem to tackle. Along the way towards implementation, we realized that the nature of the Web’s sketchy and suspicious practices was more complex, less transparent and more dangerous than any of us first thought.
Michael Kearns is a computer science professor at the University of Pennsylvania. Before he joined UPenn, he spent a decade doing artificial intelligence and machine learning research at AT&T Labs and Bell Labs. He’s one of a handful of true pioneers in these fields.
Now, one of the byproducts of the millions of tests our Web bots conduct is an enormous data set we’ve built, not just of adware bundles or spam factories, but of relationships between Web sites. Michael and his grad student Jenn Wortman helped us approach this data in a novel way. Take a look at Screensaver.com for a second.

We initially rated screensaver.com ‘Green’ – safe to use for browsing, signing-up and downloading. Yet after downloading screen savers from here, our PC started popping up contextual ads.
Here’s what’s really happening:

From my perspective as a user, I’m on a site called screensaver.com, downloading a piece of software from them. From a technical perspective, however, my PC is actually calling a host computer run by freeze.com. Not only do I not notice this, but even if I did, it wouldn’t help. As an average user, I don’t know anything about freeze.com.
But our database does. What Michael and Jenn helped us realize is that we could use the data from our Web crawl to help users understand where they really are on the Web. This guidance will in turn help users make better, more informed decisions about whom and what they can trust online.
Defining Links
Enter Matt Gattis, a young developer who joined us from MIT. “What defines a ‘bad’ link?” Matt asked. He developed an algorithm for measuring the degree of association between two sites by looking at their linking relationships. And because machines running Matt’s code can’t be fooled by link obfuscation and other social engineering tricks, SiteAdvisor is able to see patterns and relationships that were effectively invisible to the human eye. What we’ve done with link analysis is make the Web more transparent. In fact, we think we’ve created something kind of cool.
The Weakest Links
Here’s how SiteAdvisor’s link analysis works in practice. Take a look at our link diagram for Screensaver.com:
Among many other things, our link analysis shows some basic relationships between sites. For example, the short arrow to freeze.com documents that the biggest ‘target’ for screensaver.com’s outbound links is freeze.com. (In fact, freeze.com bought screensaver.com in 2003 from risoftsystems, another red-flagged friend. According to a freeze.com press release, the sale included a five-year “sponsorship contract highlighting RISS products.”)
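For the curious, here is a rough sketch of the idea of a "biggest target" for a site's outbound links. It is not Matt's algorithm, which this post doesn't spell out. It simply assumes that association can be approximated by the share of a site's off-site links that point at each other host, and it flags hosts that take a large share and already carry a red rating. The function names, the 0.5 threshold, the sample URLs and the ratings table are all made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def outbound_shares(site, links):
    """Fraction of a site's off-site links that point at each other host."""
    hosts = (host_of(link) for link in links)
    targets = Counter(h for h in hosts if h and h != site)
    total = sum(targets.values())
    return {h: n / total for h, n in targets.items()} if total else {}


def red_by_association(site, links, ratings, threshold=0.5):
    """Red-rated hosts that receive at least `threshold` of the off-site links.

    `ratings` maps host -> colour; the threshold is an arbitrary assumption.
    """
    shares = outbound_shares(site, links)
    return sorted(
        h for h, share in shares.items()
        if share >= threshold and ratings.get(h) == "red"
    )


# Hypothetical crawl data; these URLs and ratings are illustrative only.
SAMPLE_LINKS = [
    "http://www.freeze.com/download/a.exe",
    "http://freeze.com/download/b.exe",
    "http://freeze.com/terms",
    "http://example.org/",
    "http://screensaver.com/about",  # same-site link, ignored
]
SAMPLE_RATINGS = {"freeze.com": "red", "example.org": "green"}

print(red_by_association("screensaver.com", SAMPLE_LINKS, SAMPLE_RATINGS))
```

With this sample data, three of the four off-site links go to freeze.com, so it is the one host reported. A real system would need to weigh far more than raw link counts, but the sketch shows why a site can look clean on its own and still be worth a warning.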
Improving the Odds
In an ideal world, users get full disclosure. Web sites not only tell the user what’s being installed, they disclose where the install is coming from in a way that’s meaningful to the non-technical user. I for one am not holding my breath. As a practical matter, without our link data, users are effectively browsing while blind. Clicking through to an unknown site is like betting it all on black. Heaven forbid if the marble lands on red. I’m not here to argue against aimless browsing; I love the serendipitous Web discovery. The problem with surfing blindly is that within three or four clicks, you can find yourself in places where all safety bets are off.
With SiteAdvisor, I know if the site I’m on engages in link practices that can land me in hot water. Browsing with our link analysis data is like going to a party where the only person you know happens to be the most social person in the room. He can tell you who’s friends with whom, who’s hooking up and who has trouble holding their liquor. Good person to know.

Comments
Hi, blog spammers, this is chris from SiteAdvisor. As you can imagine, we’re getting all kinds of spam, especially blog spam. Interesting specimens for our research lab... keep it up!
Posted by: cdixon | December 29, 2005 11:42 PM
Spammers should be shot ...
Posted by: JPV | January 15, 2006 09:57 PM
If a registered user pushes the submit button by mistake before finishing a review, has a recovery plan been considered?
(For instance, letting the user correct the review afterwards with an edit function.)
Posted by: m-file | November 5, 2006 06:42 AM
In your link analysis article you refer to sites being marked as yellow because they automatically change the homepage when installed. This is obviously VERY annoying. Just wanted to mention that even reputable big companies do that, e.g. Microsoft for MSN and Yahoo when you download their messengers. Even if you just update them. Downloading a free service they offer (obviously in return for registering the user's details etc.) doesn't necessarily mean the user wants to set the parent site as a homepage.
Posted by: Malena Platis | April 26, 2007 07:56 AM
Hi,
Thanks for the in-depth research & analysis. I was really puzzled how 2 totally unrelated sites could be sharing the same IP address. I'd be very worried if my site were sharing an IP address with some notorious sites. Your report really enlightens me a lot. Thank you!
Posted by: Janelle | October 23, 2007 12:31 AM