Thursday, February 7, 2008

Web Search and Targeted Ads Considered Harmful

1. Knowing the last 5-10 web pages you visited is typically enough to single you out from everyone else (a back-of-the-envelope sketch follows this list)
2. The same principle applies to your search terms
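
To see why so few page visits suffice, here is a back-of-the-envelope sketch in TypeScript; the population figure and the bits-per-page estimate are my own rough guesses, not numbers from the talk.

    // Back-of-the-envelope: how many bits does it take to single out one person,
    // and how many page visits might supply them? All constants below are
    // illustrative guesses, not measured values.
    const population = 6.6e9;                       // rough 2008 world population
    const bitsToIdentify = Math.log2(population);   // ~32.6 bits to name one person

    // Assume each visited page contributes, conservatively, ~5 bits of surprise
    // (popular pages contribute little, obscure ones a lot).
    const bitsPerPage = 5;
    const pagesNeeded = Math.ceil(bitsToIdentify / bitsPerPage);

    console.log(`${bitsToIdentify.toFixed(1)} bits identify one person;`);
    console.log(`at ~${bitsPerPage} bits per page, ~${pagesNeeded} visits suffice.`);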

Someone from Yahoo! Research gave a talk tonight about a security dilemma they're facing: they want to release user query logs to researchers so we can come up with better search methods, but they must also somehow scrub identities out of the data. Netflix's released ratings data has already been correlated, with near certainty, to IMDB user accounts, and the same can be done with search logs. However, the dimensionality of query records is much higher than that of movie ratings, and mixed with their sparseness and a little common sense, that makes them dangerous. For example, something like half the people in their database have performed 'vanity' searches for their own name. Add a few local restaurant or business names, and voila. Other search terms reveal sensitive information on their own: did you look at an AIDS clinic's website recently? Search for some medical symptoms? Search companies can't release that sort of information. One example of malicious use is blackmail: "I'll tell your spouse that you looked for an adult club."
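
As a concrete illustration of how quickly the anonymity set collapses, here is a small TypeScript sketch; the log format, user ids, and query terms are all invented for illustration.

    // Sketch: a few quasi-identifying queries usually pin down a single user.
    type QueryLog = Map<string, Set<string>>;   // pseudonymous user id -> query terms

    function matchingUsers(log: QueryLog, knownTerms: string[]): string[] {
      // Keep only users whose history contains every term we know about them.
      return Array.from(log.entries())
        .filter(([, terms]) => knownTerms.every(t => terms.has(t)))
        .map(([user]) => user);
    }

    // A vanity search plus one local business is often enough:
    const queryLog: QueryLog = new Map([
      ["u1017", new Set(["jane doe", "thai palace berkeley", "flu symptoms"])],
      ["u2044", new Set(["jane doe", "weather"])],
      ["u3391", new Set(["cheap flights", "thai palace berkeley"])],
    ]);
    console.log(matchingUsers(queryLog, ["jane doe", "thai palace berkeley"])); // ["u1017"]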

I felt good that they were thinking about this, but then I thought: wait a minute, aren't they already exposing some of this data? In particular, they let advertisers display clickable ads against particular search terms. That is almost a full two-way channel! It doesn't reveal all of your data, but it reveals enough to identify you in a potentially incriminating or otherwise undesirable way, and to communicate that fact back to you. For example, if both store Good and store Bad are in your town and use Google AdSense, a malicious advertiser can place Flash ads against both terms and record the IP addresses of everyone the ads are shown to. They can build up a 'hit list' of IP addresses that appear for both, and then blackmail you through some other term: the next time you see one of their targeted ads (for some other local search term), it pops up a window that says "IP 294.294.32.32, we know you searched for store Bad and we're gonna tell your wife..." except more imaginative.
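
For concreteness, here is a sketch of the bookkeeping such an advertiser's server might do once each Flash ad phones home with the keyword it was served against and the viewer's IP. The record format and example values are invented; the IP just echoes the one in the post.

    // Sketch: intersect per-keyword IP sets to build the 'hit list' described above.
    interface Impression { keyword: string; ip: string; }

    function hitList(impressions: Impression[], keywords: string[]): string[] {
      // Group viewer IPs by the keyword whose ad they were shown against.
      const byKeyword = new Map<string, Set<string>>();
      for (const { keyword, ip } of impressions) {
        if (!byKeyword.has(keyword)) byKeyword.set(keyword, new Set());
        byKeyword.get(keyword)!.add(ip);
      }
      // An IP makes the list only if it was seen for every keyword of interest.
      const sets = keywords.map(k => byKeyword.get(k) ?? new Set<string>());
      if (sets.length === 0) return [];
      return Array.from(sets[0]).filter(ip => sets.every(s => s.has(ip)));
    }

    // The same IP was served ads for both the innocuous and the sensitive store:
    const adLog: Impression[] = [
      { keyword: "store good", ip: "294.294.32.32" },
      { keyword: "store bad",  ip: "294.294.32.32" },
      { keyword: "store good", ip: "10.1.2.3" },
    ];
    console.log(hitList(adLog, ["store good", "store bad"])); // ["294.294.32.32"]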

The costs involved today are still somewhat steep, but with more thought I suspect a better version of this ploy is possible, and, more importantly, it stresses that these companies must tread very carefully in how they interface with advertisers. The problem is a fundamental one, however: proving an interface like this secure is a challenge. Stephen McCamant has some neat work on tracking quantitative information flow that helps set the PL/SE mood before you switch to game theory or anything fancier.

2 comments:

lee said...

JavaScript also considered harmful. If you allow arbitrary JavaScript to run in the browser, someone can load up arbitrary hrefs, style them with CSS so that visited and unvisited links get different colors, read back the rendered colors, and recover as much of your browsing history as they like.
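
For readers who haven't seen it, here is a minimal TypeScript sketch of the attack lee describes; it worked in 2008-era browsers, and modern browsers now deliberately report the unvisited style to block it. The candidate URLs are made up.

    // Sketch of :visited history sniffing: style visited links distinctively,
    // render candidate URLs as links, and read back their computed color.
    function sniffHistory(candidateUrls: string[]): string[] {
      const style = document.createElement("style");
      style.textContent = "a:visited { color: rgb(1, 2, 3); }";
      document.head.appendChild(style);

      const visited: string[] = [];
      for (const url of candidateUrls) {
        const a = document.createElement("a");
        a.href = url;
        document.body.appendChild(a);
        // If the computed color matches the :visited rule, the URL is in history.
        if (getComputedStyle(a).color === "rgb(1, 2, 3)") visited.push(url);
        document.body.removeChild(a);
      }
      document.head.removeChild(style);
      return visited;
    }

    console.log(sniffHistory(["https://mail.example.com/", "https://badstore.example/"]));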

lmeyerov said...

An oldie but goodie :) I bet content caching can also be detected (rough sketch at the end of this comment).

In terms of attacks, I'm really concerned about ones on pages I trust, so XSS attacks on popular content management systems, or sniffing through common ad systems (I ought to install a blocker...), are scary. If Gmail lets an email run JS on the same page without first running the script through an interpreter, well, I'm sticking to other clients.

Anyways, the post is more about the danger of aggregating information and then exposing it: even if companies show only a bit of information about you, that can be enough. Not as bad as the demonstrated fact that people will give away enough data to forge their identity in exchange for a cookie, but still worth thought. Heck, even storing aggregate data about your users isn't such a good idea; you might get hacked, or some government official may ask for a peek and thereby alienate your libertarian user base.
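
Rough sketch of the cache-detection idea, as promised: time how long a resource the target site is known to embed takes to load; a suspiciously fast load suggests the browser already had it cached. The URL and threshold below are invented, and the probe itself caches the resource, so each candidate can only be tested once per visitor.

    // Sketch: detect a prior visit by timing a resource the target site embeds.
    function probeCached(url: string, thresholdMs = 20): Promise<boolean> {
      return new Promise(resolve => {
        const img = new Image();
        const start = performance.now();
        img.onload = () => resolve(performance.now() - start < thresholdMs);
        img.onerror = () => resolve(false);
        img.src = url;
      });
    }

    probeCached("https://badstore.example/logo.png")
      .then(cached => console.log(cached ? "probably visited before" : "probably not"));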