Clone Detection
Posted by Shane Keats at 09:45 AM
Is Google Adwords getting scammed? McAfee SiteAdvisor has a solution.
The Web is full of interesting people. Take Ken Miyazawa, for example.

Ken is a man of strong opinions. Check out his endorsement of mymusicinc.com:
"I am very impressed with the speed of your download technology. The quality of the music is superior to my prior service."
He says the same thing about mimem.org/emusic:
"I am very impressed with the speed of your download technology. The quality of the music is superior to my prior service."
Peter Sanchez agrees. He has this to say about K-Lite Pro:
"I am very impressed with the speed of your download technology. The quality of the music is superior to my prior service."
Cindy Griffin is also a convert to K-Lite Pro, though not for Peter's reasons:
"I just happened to get into this, and now this is one of my favourite sites. I visit very often to get new music, softwares and stuff. Really a cool one. Thanx to the team." (sic)
She's identically effusive about k-lite.tk.
"...now this is one of my favourite sites."
In fact, Cindy is so compelling that her opinions themselves are worth copying. Hendick (sic?) Theodore writes that he too:
"...just happened to get into this, and now this is one of my favourite sites. I visit very often to get new music, softwares and stuff. Really a cool one. Thanx to the team."
Wow. Steve Richards feels the same way about Ares Ultra:
"I just happened to get into this, and now this is one of my favourite sites. I visit very often to get new music, softwares and stuff. Really a cool one. Thanx to the team."
What’s going on here? Has someone finally cloned a human being? Or is a massive wave of plagiarism sweeping the web?
The Clone Wars
What’s going on is that the bad guys are using shortcuts, and we’ve found a way to cut them off at the pass. Why are there so many nearly identical scam sites on the web? One reason for this proliferation is the quest for the perfect site, the one that maximizes profits and minimizes costs. And we can safely assume that profit maximization can be enhanced by testing sales pitch variations. Does site design A yield better click-through than site design B? What about site design C? Cost minimization can be enhanced by making site changes incrementally. Swap this graphic for that graphic, but keep the text and HTML. Change the URL but keep the text. Rearrange frames in the HTML but keep the graphics the same.
Ben Edelman, a technical advisor to SiteAdvisor and a spyware researcher, noted another, perhaps more significant motivation. Recall Google's recent announcement of a change in Adwords policy. Going forward, Adwords would allow at most one ad listing for a given landing domain name. By copying the same site onto multiple domain names, a site can try to avoid this restriction and get multiple ad listing slots. It's possible that small tweaks like variations in color scheme, text or layout could also be helpful to this end, by keeping Google's automated (and perhaps also human) reviewers from flagging all the sites as dupes.
When we first noticed these common text strings for file sharing scam sites, we asked Hugo Liu, a post-doc at the MIT Media Lab and another one of our technical advisors, if there was any way he could use this to help. Hugo specializes in semantic analysis. He tries to find patterns and meaning in what appear to be random data. Hugo began to play.
What would happen, Hugo wondered, if you took an interesting phrase and created a map of sites that shared that phrase? If the phrase originated from a known scam site, could it be used as a prospecting tool to find other similar scam sites? Perhaps clones of the original?
Hugo found that once a strong phrase is identified, it can be tested against a group of public Web sites. Think of the process as passing a lens over a stream of text as you look for a string of key words. Hugo likens it to a kid passing a magic decoder over a "spy" book. This windowing, as the process is known in corpus-based linguistics, works because bad actors re-use content in their effort to maximize profit and minimize cost.
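For the curious, here is a minimal sketch of what windowing might look like in Python. It is our own illustration, not Hugo's actual code: the function names, the five-word window size and the example.com-style site names are all assumptions made for the example.

```python
import re

def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def windows(text, size=5):
    """Slide a `size`-word window over the text and collect every phrase seen."""
    words = normalize(text)
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def shared_phrases(known_scam_text, candidate_text, size=5):
    """Phrases of `size` words that appear in both texts."""
    return windows(known_scam_text, size) & windows(candidate_text, size)

def sites_sharing_phrase(phrase, pages):
    """Map a suspicious phrase to the sites that contain it.

    `pages` is a hypothetical dict of site name -> page text.
    """
    needle = " " + " ".join(normalize(phrase)) + " "
    hits = []
    for site, text in pages.items():
        haystack = " " + " ".join(normalize(text)) + " "
        if needle in haystack:  # padded with spaces so only whole words match
            hits.append(site)
    return sorted(hits)
```

Feed `sites_sharing_phrase` a phrase lifted from a known scam site, such as "now this is one of my favourite sites", and it returns the other pages worth a closer look.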
Building a case
Hugo noticed that text analysis alone delivered a lot of false positives. For example, early efforts at clone detection yielded a lot of Wikipedia trawling sites – sites that copy a Wikipedia excerpt and then surround it with text ads. Other results had parasitically pulled text from many different sites, presumably in an effort to increase their reach in search engine results or piggyback on a more established brand. Distasteful for sure, but not a scam, at least as defined by SiteAdvisor.
Shared phrasing is like circumstantial evidence. It’s enough to bring a suspect in, but typically insufficient to convict. Hugo needed a DNA match and he found it in structural analysis.
Web developers know that a lot of HTML production is idiosyncratic. Could decisions like when to capitalize a tag be used as evidence of cloning? The answer is yes. The developer of a scam site typically reuses the template as much as possible to reduce time and cost. Consequently, any of the developer’s original quirks inherent in the template get replicated. And that gives us a huge clue.
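To make the idea concrete, here is a rough sketch of a structural fingerprint in Python. It is an illustration under our own assumptions, not the production algorithm: it reduces a page to its sequence of tag names exactly as the developer typed them (capitalization quirks included) and compares two sequences with the standard library's difflib. The names `tag_skeleton` and `structural_similarity` are made up for this example.

```python
import re
from difflib import SequenceMatcher

# Captures each tag name (with a leading "/" for closing tags),
# preserving the original capitalization used by the page's author.
TAG = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)")

def tag_skeleton(html):
    """Return the page's tag names in document order, case preserved."""
    return TAG.findall(html)

def structural_similarity(html_a, html_b):
    """Score from 0.0 to 1.0 for how closely two tag skeletons match."""
    return SequenceMatcher(None, tag_skeleton(html_a), tag_skeleton(html_b)).ratio()
```

Because the comparison ignores the visible text entirely, two pages with different sales copy but the same oddly capitalized template still score close to 1.0, while a page that merely quotes the same text inside a different layout scores low.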
U2canbecloned
Lead Streams Marketing is a multi-level marketing company that offers "the Home Business System" as one of its programs. Our traditional testing discovered 4thepackage.com, an LSM affiliate, and rated it red for sending high-volume, somewhat spammy e-mail. Clone detection recently uncovered u2canbesuccessful.com. The sites are identical except for a tiny bit of text indicating ownership.

LSM isn't the only company posting many copies of a single web site. We've long followed sites set up by MarketEngines, CashEngines and Euclid Networks (the last company purportedly based on the island of St. Kitts). Many of their sites are scams -- charging users for software that can be found elsewhere for free, and purporting to offer tech support that we've found to be practically nonexistent. They post some sites themselves, and they pay affiliates to post copies of these sites. For example, consider imusicaccess.com:

Now, compare it to imusicnow and winmx-downloading. Structure and layout are nearly identical. The occasional font, a few strings of text and some color schemes are all that separate these clones from one another:

These sites are clones of each other but can’t be detected using SiteAdvisor’s other automated tests, because their red ratings are due not to e-mail practices or bundled spyware but to the low or nonexistent value of the services they provide. Clone detection lets us flag all these duplicate sites. After we identify the problem with one such site, clone detection helps us make sure we catch all its copies too.
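Putting the pieces together, the "bring in a suspect, then look for a DNA match" approach could be sketched roughly like this in Python. Treat it as a toy under stated assumptions rather than our real pipeline: the `find_clones` function, the crude tag stripping and both threshold values are hypothetical, chosen only to show the two-stage shape.

```python
import re
from difflib import SequenceMatcher

def words(html):
    """Crude visible-text extraction: strip tags, keep lowercase words."""
    return re.findall(r"[a-z0-9']+", re.sub(r"<[^>]*>", " ", html).lower())

def tags(html):
    """Tag names in document order, original capitalization preserved."""
    return re.findall(r"</?[A-Za-z][A-Za-z0-9]*", html)

def similarity(a, b):
    """Score from 0.0 to 1.0 for how closely two sequences match."""
    return SequenceMatcher(None, a, b).ratio()

def find_clones(known_bad_html, candidates, text_min=0.5, html_min=0.9):
    """Return candidate sites that look like copies of a known bad site.

    `candidates` is a hypothetical dict of site name -> raw HTML.
    `text_min` and `html_min` are illustrative thresholds, not tuned values.
    """
    flagged = []
    for site, html in candidates.items():
        # Stage 1: shared phrasing is only enough to bring a suspect in.
        if similarity(words(known_bad_html), words(html)) < text_min:
            continue
        # Stage 2: matching structure is the "DNA" evidence.
        if similarity(tags(known_bad_html), tags(html)) >= html_min:
            flagged.append(site)
    return sorted(flagged)
```

A page that scrapes the same text into a different layout passes the first stage but fails the second, which is how the Wikipedia trawlers and other false positives drop out.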
Harder and harder to hide
Human beings are pretty sophisticated consumers, but when it comes to the Web, it’s relatively easy to fool our "sixth sense." If the site looks well produced, if it appears to have original content, it’s relatively easy to overcome our basic level of skepticism. In the absence of a tool like clone detection, a typical consumer will be hard-pressed to know that a particular site is a template that shares 95% of its text and 95% of its HTML with a site known to provide a bad value.
While clone detection uses technical algorithms, it succeeds thanks to economic fundamentals. Financially motivated scammers need customers, so they have no choice but to use public methods like search engine ads to reach their victims. That need to be public is their Achilles heel. Along with our automated testing for spam, spyware and exploits, tools like clone detection make it increasingly efficient for us to search for and find the bad guys.
