Actual parallel parsing is easy. Real applications provide significantly simplifying assumptions. Case in point: for our browser-based big data visualization framework, we needed to somehow load in big JSON arrays. Modern browsers provide a fast native function "JSON.parse :: String -> JSObj" that is fine for normal sized data sets, but parsing 9MB of sparse data takes over 1s, which requires a loading screen and eats your battery. The solution took ~2 days: on the server, split the data into multiple arrays, and on the client, use web workers to parse each array in parallel. Even though I'm controlling the parsing in JavaScript, the actual parsing is still being done natively by calling into JSON.parse for handling each fragment.
Anyways, the proof is in the pudding. The relevant benchmarks are impact on file size and load time.
Benchmark 1: file size
Splitting one of our sparse JSON data files into multiple chunks has no real hit to the file size, and due to some uninteresting phenomena, actually helps when gzip is used. I'm also showing the benefit of compressing repeating zeros ("sparse json") for our particular data set.
Benchmark 2: speedup
For the second benchmark, I'm testing loading the data set from a local hard drive when using different formats. The first two columns are sequential data formats, and with the second column primarily demonstrating the benefit of doing preprocessing for my particular application. The blue part of each column shows, in ms, how much time is spent reading the file from disk and parsing it.
The important part is comparing the blue of the third column against the blue part of the second: using workers to parse JSON chunks is 2x faster than parsing the file in one sequential go.
For further parallel speedups, it is important to realize that this application hit Amdahl's Law. In particular, the sequential stuff, like the black subcolumns, must be parallelized. The blue columns also have sequential code due to APIs I'm using. First, Safari does not have zero-copy message passing (transferable objects), so a sequential copy is made in communicating a parsed fragment back to the master. Second, I'm actually creating a big typed array, so even current web worker APIs do not help: I want to the Buffer into the worker, not get a new buffer out of one, which means I have to do a sequential copy even in Firefox and Chrome. Basically, we're seeing the web's poor support of message passing forcing a lot of slow, sequential copies.
Finally, I only benchmarked the case of local file loads. Our application, ultimately, will be used online with a server providing the data on-the-fly. The optimizations I showed should still help even if the application is network bound: they also enable pipeline parallelism. By splitting the JSON, parsing one chunk is now done in parallel with downloading the next. Modern webpages alternate between waiting on the network and the CPU, which this solution elegantly addresses.
Conclusion
So there you have it -- parallel parsing, in practice, should be easy. I did a constrained case of handling big JSON arrays. Even for more general JSON, it should only take a day more of work: partition the JSON via work stealing -- I successfully did a similar thing in my thesis for HTML in a few hundred lines of C++.
If you'd like the code, it'll be part of our Superconductor (automatically GPU-accelerated web programming framework for big data visualization) release next month. Ping me if you'd be interested in working on parallel parsing of arbitrary JSON -- it should be awesome!
UPDATE
I more carefully measured the overheads and also did a couple more optimizations. The chart below shows the new results. It also now includes the message passing overheads in gray, where ~90% of it would go away in Chrome with transferable objects and the other 10% might be eliminated with ownership transfer of buffer views (sub-buffers), not just buffers.
Two big things are going on:
- The workers now perform decompression rather than the master. However, decompressing in the worker has a cost:  each worker now returns an inflated result message. Safari does not have ownership transfer so almost 100ms is spent copying the message back to the main thread rather than passing a pointer. 
 The overhead is still worth it -- the blue region shrunk a lot, even if there is now a big gray region. (The gray region is in both because the sequential job runs in a worker in order to avoid pressuring the UI thread.) If/when Safari gets transferrable objects, 90% of the gray bars will disappear.
- Faster reduction of worker results into the final GPU buffer. Typed arrays support method set that can also help such reductions. It acts like a memcpy, so result.set(chunk, len) copies array buffer "chunk" into array buffer "result" starting at byte offset "len" into "result." It only takes about 13ms. If the typed array spec was extended so that disjoint sub-buffers could be transferred across workers, the other 10% of the gray bars would disappear.



I am not sure I follow this statement:
ReplyDeleteSecond, I'm actually creating a big typed array, so even current web worker APIs do not help: I want to the Buffer into the worker, not get a new buffer out of one, which means I have to do a sequential copy even in Firefox and Chrome.
You can transfer an array buffer both ways, both to and from web worker. Anything I am missing?
Awesome -- I missed that it was symmetric. (I didn't read closely because I have to use Safari, which doesn't have transferable objects at all).
ReplyDeleteHowever, I don't think this is enough: I want to have each worker yield a disjoint view of the buffer. Can I transfer ownership of a disjoint buffer view into the worker, then when done, pass it back?
(Also, all of this only applies to parallel parsing of arrays -- I'm still stuck on copying for the more general JSON case.)
After another day of work:
ReplyDelete1. The shared subbuffer issue is sufficiently solved via ArrayBuffer.set (~=memcpy). The idea is to transfer the worker's buffer and then copy it into the appropriate sub buffer. This reduction takes 15ms.
2. Safari doesn't have transferable objects, so the story is much worse in our actual system: transferring adds another 100ms overhead, which is ~50% of the time.
Your post got shared on G+ and re-shared by me; you should read the comments there.
ReplyDelete