CloudCrowd Architecture

What runs where?

The central server keeps track of the status of all the active jobs and work units, but does not perform any action processing itself. Every split, process, and merge is performed on a worker daemon. Worker daemons process a single work unit at a time, and clean up their temporary files when finished.
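
To make the division of labor concrete, here is a minimal sketch of an action. It is hypothetical: it assumes that `input` holds the unit's input inside `split` and `process`, and the collected `process` outputs inside `merge`; check the CloudCrowd::Action class for the exact contract. The central server only schedules these methods; worker daemons are what actually run them, one work unit at a time.

```ruby
# Hypothetical action illustrating where each phase runs. All three methods
# execute on worker daemons; the central server never runs them itself.
class LineCount < CloudCrowd::Action

  # Optional: break the original input into smaller work units. Each element
  # of the returned array becomes the input of one `process` call.
  def split
    input.split("\n\n")
  end

  # Required: do the expensive work for a single unit and return its output.
  def process
    input.split("\n").length.to_s
  end

  # Optional: combine the outputs of all the `process` calls into one result.
  # (Assumption: `input` here is the list of process outputs.)
  def merge
    input.map { |count| count.to_i }.inject(0) { |sum, n| sum + n }.to_s
  end
end
```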

A common bottleneck for distributed processing systems is file storage, which holds all inputs, outputs, and intermediate results. Google’s original MapReduce uses an in-house distributed filesystem (GFS); Hadoop uses its own HDFS, the Hadoop Distributed File System. CloudCrowd uses S3, which makes it a particularly good idea to deploy your CloudCrowd cluster on EC2. Alternative storage backends can be plugged into the AssetStore class and selected in config.yml.
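
As a rough illustration of what an alternative backend might look like, the sketch below assumes a shared NFS mount and an interface with `save` and `cleanup` methods resembling the built-in stores. The module name, the method signatures, the directory layout, and any config.yml key (for example `storage: nfs`) are assumptions made for the sketch, so consult the AssetStore class for the real contract before writing one.

```ruby
require 'fileutils'

# Hypothetical storage module; the built-in S3 and filesystem backends follow
# a similar shape, but the names and signatures here are assumptions, not the
# documented API.
module NfsStore

  # Copy a worker's local output file onto shared storage and return a
  # location that other workers and the central server can reach.
  def save(local_path, save_path)
    destination = File.join('/mnt/shared/cloudcrowd', save_path)
    FileUtils.mkdir_p(File.dirname(destination))
    FileUtils.cp(local_path, destination)
    "file://#{destination}"
  end

  # Remove all of a finished job's intermediate files. The directory layout
  # (action name / job id) is an assumption for this sketch.
  def cleanup(job)
    FileUtils.rm_rf(File.join('/mnt/shared/cloudcrowd', job.action, job.id.to_s))
  end
end
```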

These constraints make CloudCrowd appropriate for moderate volumes of work that is expensive in CPU, memory, or bandwidth. The overhead of a running worker daemon is around 10 megabytes, so running a large number of them should let you max out the CPU without hitting memory limits (often the limiting factor with Ruby), especially if you prefer shelling out to external commands over reading files into memory.
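
For example, an action that delegates the heavy lifting to an external tool keeps the worker's footprint flat. The sketch below is hypothetical: the `download` and `save` helpers and the GraphicsMagick `gm` command are assumptions about the environment; the point is simply that the file never has to pass through Ruby's memory.

```ruby
# Sketch of the shelling-out pattern: the external tool streams the file, so
# the Ruby worker's memory stays near its ~10 MB baseline regardless of the
# file's size. Swap in whatever your Action API provides for fetching inputs
# and storing outputs if `download` and `save` are not available.
class Grayscale < CloudCrowd::Action
  def process
    download(input, 'original.jpg')                            # fetch this unit's input file
    `gm convert original.jpg -colorspace GRAY grayscale.jpg`   # external tool does the heavy lifting
    save('grayscale.jpg')                                      # push the result to the asset store
  end
end
```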
