The Good Soldier LMeyerov: February 2008

Tuesday, February 26, 2008

Sharing Secrets for Fun and Profit

Congrats to Mira for defending an inspiring thesis today on anonymous, fair, secret exchange, aka e-cash. It was a fun few days visiting RI and talking to Thea, Mira, Shriram, Dave M., and Chris E. about the research going on at Brown since I left.

Mira's defense was just as eye opening to me today as was her proposal last year. The primary motivation for the work is that when we authenticate to perform some action, we reveal who we are, even though all that is import for the action is that we are simply authenticated. For example, the fact that you have a social security number is generally more important than what it is. Similarly, while Apple may want to know all the music you bought on iTunes so they can better target advertisements against you, there is no reason beyond this that you need a unique iTunes account to buy music. The basic solution to these issues, anonymous currency whose duplication can be detected, actually also scratches another itch: it provides an anonymous foundation for incentive systems. As I'll get to below, this is applicable to resource sharing such as for network bandwidth and files in bit torrent style systems with an added touch of anonymity, and as Thea and I speculate, even automatic whitelist-based spam filtering. Mira has focused on getting the crypto up to snuff: tying up theoretical loose ends to build enough building blocks with useful anonymity properties and sufficient efficiency properties so that JJ and Anna's groups are now pursuing a fleet of applications. Applications, and I wager application frameworks, are now more pertinent questions.

The first application (after much discussion on previous theoretical work) was onion routing. Imagine playing a game of telephone where you send a message down a line of friends, with each person whispering it into the ear of the next. This is unsafe; everybody knows the message intended for the last person and the path it will take. So, you can encrypt the message n times and have the next step destination point as part of each message, then transmit. However, as invariably happens in a game of telephone, not only is there no incentive to preserve its integrity, it's actually worth doing something more funny. As an incentive, we can use e-cash: each layer of encryption also has some money wrapped into it. That is not enough, however. To get a person to forward on the message, the money is actually encrypted again, using two keys, but you also get a key of your own. One key to decrypt the money comes from the person before, and another from the one after: you exchange your key with the person from before for theirs, and again with the person after, and thus you unlock your coin. If transactions require money, you must then play tit-for-tat: the way to get coins is to help others. You can apply extensions of this basic idea to most karma systems. What's more, after you get a coin, nobody knows what exchange you got it from, just potentially the person, if it is ever double paid. There's a paper on Mira's website about this.

The second application Thea and I schemed a little about today over a cup of tea. For spam, whitelists are great: you are guaranteed not to lose messages from your friends. However, the difficulty is automating the addition of people to this list: what happens when someone emails you for the first time and therefore is not on the list? Enter e-cash, stage right. When you exchange emails, you exchange coins. E-cash actually supports inflation in a funny sense: in one paper, Mira decided to support the case that a customer crashes in the midst of paying. If communication protocols are written correctly, this isn't so much of a concern. However, some reason she did it, and has the result that paying with the same money k times can be allowed. That means every time you get one coin, you actually get k of them. I think it was a little more subtle than that: you get at most k, and if you go through a lot of exchanges with people, you aren't too sure how much you have. I'm not 100% on it: if that's true, you have inflation, if not, each coin just has its value multiplied by k and thus no true change. Either way, a coin gives a message benefit of the doubt, and a whitelist is exactly that. The first email may use a coin, and a response causes an implicit placement on a whitelist. There's mobility in what checking happens where; whitelists can be private data, so all checking can be on the server, or distributed and prioritized. The more a person spams in such a system, the more catchable (eg, duplicated coins) they are. An interesting question is newsletters: the newsletter shouldn't send register emails as it'd run out of coins if attacked with fake register requests, but if users send coins, the newsletter would become a money sink. It might be argued that users talk to more people than newsletters they register to so that this isn't a problem, but I'm not too satisfied.

I think e-cash is one of those technologies about to break out. Furthermore, the tools surrounding it may change how we deal with the web: no more of this Google-knows-all mentality. Finally, as a PL and SE guy, it opens a lot of development questions: how do we actually use this stuff? Can I build a CMS that facilitates anonymous capabilities, or a library to make such CMSs? Wikileaks is one of the few websites even starting to come close, and I suspect it's already fairly limited with respect to the general model being built.

Anyways, neat-o. I also talked to Shriram a bit about future web trends, optimistic replication (what's the real problem here?) and local storage, and again about capability based security (forces a re-encoding of access control all over again for anything interesting, is hell for informal debugging analysis and formal verification, and doesn't play well in multilingual and modal settings). Also, a question I've been toying with: should MAC make a comeback, perhaps as a subversive element of local/server browser storage consistency APIs? I gave an example on the flapjax support list for wanting universal data security semantics between applications:

I'll rehash the old idea that an environment should support a 'shared' use access control mode that is some join point of capabilities of all users acting. (*Aside for others: there are multiple useful ways to merge users' capabilities.) Different data should be displayed and otherwise interacted with based on the context; if I wanted to show something on my screen, I might be concerned about both confidentiality and integrity of data in all open applications. I can close everything manually... but why not have a standard API and have everything react to the necessary extent? Wasn't that part of the point of reactive security objects for all data? My bank page in another tab or window could stay mostly open, but really private data would disappear. If I could tag particular emails as super secret, I could even safely show a gmail without worrying about preview features on others!

More paper writing tomorrow and then flying back to Berkeley!

Monday, February 11, 2008

Splatter

Busy times, but interesting ones:

Reading "Synchronous Kahn Networks": relating efficient synchronous computation and listlessness optimizations by presenting streams as lazy lists following a clock calculus. Addition of dependent types to the clock calculus is embarrassingly impeding my progress, but a good motivator to get a better feel for it. End goal: "N-Synchronous Kahn Networks" follow-up paper.
Reading "The Categorical Abstract Machine": slide 4 describes the goal here - compiling the original basis of ML in a rigorous way. End goal: points! Or knowledge.
Wrote the final technical part of my paper, so filler, an example, and heavy editing are all that's left. Oh, and submitting it somewhere.
The more I try to answer what a browser and web program are, the less I can. Tricky to make a definition that will hold in 5-15 years. Wikipedia's definition of a browser is sufficiently vague and conservative to squeak by, but I do not buy its one for web applications.
Coming enlightenment by Bill Thies (streaming computation? developing region and assistive tech? race detection? programmable microfluids?), as well as Burak Emir (pattern based extraction support in Scala, Done Right)

Also, something about camping in Death Valley, least RE set encodings for lambda calculus, and running something ginormous on the similarly ginormous NERSC cluster through LBL. For the latter - what will it be? A type miner, a concolic tester, a browser bug finder, ... oh boy oh boy. Maybe I should give Dawn a poke, I bet she could burn a few cycles in the name of Good :)

Ah, and congrats to Chang-Seo for rocking his prelims. He is that much closer to being a free man.

Thursday, February 7, 2008

Web Search and Targeted Ads Considered Harmful

1. Knowing the last 5-10 web pages you visited is typically enough to isolate you against other people
2. A related principle can be applied to search terms

Someone from Yahoo! Research gave a talk tonight about a security dilemma they're facing: they want to release user query records to researchers so we can come up with better methods, but they must also somehow scrub identities out of the data. The data in Netflix's search set has already been correlated with certainty to IMDB user accounts; same thing with search. However, the dimensionality in query records is much bigger than in movies, and mixed with sparseness and common sense, dangerous. For example, something like half the people in their database have performed 'vanity' searches for their own name. Add in some local restaurant names or businesses, and voila. Other search terms will reveal sensitive information: did you look at an AID clinic website recently? Perhaps for some medical symptoms? Search companies can't release that sort of information. One of the examples of malicious use are blackmail: "I'll tell your spouse that you looked for an adult club."

I felt good that they were thinking about this, but then I thought: wait a minute, aren't they already exposing some of this data? In particular, they allow advertisers to display clickable ads for particular terms. That is almost a full two-way channel! It doesn't reveal all of your data, but still enough to identify you in a potentially incriminating or otherwise undesirable manner and communicate that fact to you. For example, if both store Good and store Bad are in your town and use Google adSense, a malicious user can place Flash ads for both and thus record IP addresses of visitors. They can build up a 'hit list' of IP addresses that match, and then blackmail you on some other term: next time you see a targeted ad by these guys (for some other local search term), they show a window that says "IP 294.294.32.32, we know you searched for store Bad and we're gonna tell your wife..." except more imaginative.

The costs involved today are still somewhat steep, but with more thought, I suspect a better related ploy is possible, and, more importantly, this stresses that these companies must tread very softly in how they interface with advertisers. The problem is a fundamental one, however. Proving an interface like this secure is a challenge. Stephen McCamant has some neat work on tracking quantitative information flow that helps set the PL/SE mood before you switch into game theory or anything fancier.

Sunday, February 3, 2008

Old School CS and New

Mobile devices, optimal scheduling, and SK combinators have been on my mind recently. They don't intersect too well, so this post is about the last item and a teaser on the first.

Numbers: SK, Gödel, and Catalan

We've all heard about the lambda calculus and how we can encode the factorial function in it, and this all makes sense to us assuming we know Scheme (or Lisp/ML/Haskell, depending on your age and geographic history). However, can we drill down the essence of computation even further? In particular, given a few specific lambda-like functions, can we still encode everything? In comes the SKI combinator calculus, which show that with just three, yes, we can. One of the first things we learn when using them is that the third combinator, "I" for identity, can actually be encoded via "SKK" or "SKS". If we do our homework a bit further, we'll even see that we only really need one iota combinator for a sufficient calculus!

In our intro to semantics course, Dana Scott started with all the traditional Church/Kleene encodings (booleans, numbers, recursion) - except using SK combinators. Two novel thoughts came up for me out of it:

First, naming: S for Slide and K for Kill. This makes sense when you look at the rewrite rules:

S x y z => x z [y z]
K x y => x

Further in that spirit are Chris Rathman's combinator birds.

Second, complexity:

In the spirit of Iota, we can reduce are representation further. We might want to use some sort of Gödel numbering system to more concisely represent terms in our language - every term should correspond to a unique number. Something like trinary would be obvious in our system (or binary, if we encode I). Ideally, we would be able to do it from within our own system, constructively shown by writing a combinator that takes another and returns that one's unique number. We can break down the problem: as we are using SK combinators, whose terms correspond to binary trees, we want a function G: Z+ * Z+ -> Z+ to determine the label of a composition whose constituents have already been converted. To start, we assume 0 corresponds to S, 1 to K, and 2 to I. Using Gödel's original strategy of prime factorization, we would have:

G(a,b) = 2 + (2^a) * (2 * b + 1)

where a, b (lowercase) are the unique numbers for A, B, (uppercase) and thus G(a,b) uniquely represents A[B].

We can encode this formula in SK combinators easily - once we have iteration, addition is iterating adding 1, multiplication is iterated addition, and exponentiation iterated multiplication - but the spacing between numbers is horrible so our representation of terms blows up exponentially.

So, we need a lossless numbering scheme. We can just say that we project the sparse scheme in order, but that's hard to write using combinators. Instead, we can use a grid: the X and Y axis represent Gödel numbers, the coordinate (G(A),G(B)) represents the combinator A[B], and the trick is how to write a combinator F(G(A),G(B)) = i without missing any numbers. Bad memories of Cantor diagonalization emerge, and so squares and triangles are instant choices. I prefer squares due to ordering (shouldn't go into details as this was an optional homework problem).

I wrote a closed form representation of F using easily attainable addition, exponentiation, max, and comparison operators. But, we want to make sure we're good: for a given length combinator string, what is the bound on the corresponding Gödel number? The general intuition for the square algorithm is that all smaller length terms are encoded (all the inner squares), so for a length n string, each character can be one of S, K, or I, and that the string can be split up into any binary tree. Thus, we have:

G(E) is <= 2 + (3^n) * C_(n-1) where n = len(E)

C_n is a Catalan number: the notion of splitting a string into all possible binary trees is surprisingly common and powerful.

Edit: The formula should really be the sum of Catalan numbers up to that point, which probably can be reduced further.

I'm still puzzling about introspecting on argument structure: how far can one go without hitting the halting problem?

The Future of Mobile

We were over at Mozilla last week for their all-hands meeting (some awesome t-shirts and hardcore developers there) and I gave my bit on achieving implicitly parallel scripting by using domain specific abstractions and component based design, but a looming question was, "why parallel?" One of our traditional arguments has been that manycore is the way to go for the 5-15 year period when computing on a mobile device due to power: you don't want to drain the battery or burn the user. An alternative strategy making the rounds is seen with a flurry of proxy-based computing mobile browsers. A device need only handle basic interactions for the browser, and a proxy server can be used to do as much as to totally emulate the environment to as little as to better encode or reformat the transmitted content. A big question is, in terms of security, power, latency, and connectivity - is this realistic? We've been trying to put together some numbers on that as it's not actually too obvious, and gets hazy as you go from 2, to 5, to 10 years. We'll probably have a post about it as our sophomore one on the parallel browser blog soon, so stay tuned...