Ruby on Rails, Io, Lisp, JavaScript, Dynamic Languages, Prototype-based programming and more...

Technoblog reader special: $10 off web hosting by FatCow!

Wednesday, August 16, 2006

MapReduce for Ruby: Ridiculously Easy Distributed Programming

Yahoo! My Web 2.0

I am very happy to announce that Google's MapReduce is now available for Ruby (via gem install starfish). MapReduce is the technique used by Google to do monstrous distributed programming over 30 terabyte files. I have been reading about MapReduce recently and thought that it was very exciting for Google to have laid out the ideas that ran Google. I also wondered how they could be applied to everyday applications.

Recently, I gave a talk on Ridiculously easy ways to distribute processor intensive tasks using Rinda and DRb. This talk came from my work with Rinda recently at MOG. We use distributed programming to handle real-time processor intensive needs for over 1 million requests a day. We also use it to make large changes or clean up our database. I realized that the plumbing I wrote in Rinda to accomplish these tasks could be abstracted and easily conform to the MapReduce technique.

Before I move on, I will provide a little more background of Google's MapReduce. MapReduce is a C++ library written by Google. There are about 12 MapReduce programs used to create the inverted index of the www that Google uses for searching. The term MapReduce itself refers to map and reduce functions. Joel recently wrote an article that explains what map a reduce do, so I will refrain from repeating him. One of the parts Joel unfortunately messed up on was this sentence though:

[...] you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant [...]

Google, nor anyone I know, has written a map function that will "replace" your existing calls to map, like a plugin. In fact, here is some real world MapReduce example code that is used to provide a word count on an arbitrarily sized document:

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
virtual void Map(const MapInput& input) {
const string& text = input.value();
const int n = text.size();
for (int i = 0; i < n; ) {
// Skip past leading whitespace
while ((i < n) && isspace(text[i]))
// Find word end
int start = i;
while ((i < n) && !isspace(text[i]))
if (start < i)


// User's reduce function
class Adder : public Reducer {
virtual void Reduce(ReduceInput* input) {
// Iterate over all entries with the
// same key and add the values
int64 value = 0;
while (!input->done()) {
value += StringToInt(input->value());
// Emit sum for input->key()


int main(int argc, char** argv) {
ParseCommandLineFlags(argc, argv);
MapReduceSpecification spec;

// Store list of input files into "spec"
for (int i = 1; i < argc; i++) {
MapReduceInput* input = spec.add_input();

// Specify the output files:
// /gfs/test/freq-00000-of-00100
// /gfs/test/freq-00001-of-00100
// ...
MapReduceOutput* out = spec.output();

// Optional: do partial sums within map
// tasks to save network bandwidth

// Tuning parameters: use at most 2000
// machines and 100 MB of memory per task

// Now run it
MapReduceResult result;

if (!MapReduce(spec, &result)) abort();
// Done: 'result' structure contains info
// about counters, time taken, number of
// machines used, etc.
return 0;

MapReduce takes a large data set (in this case a large file), divides the file into many different pieces, and lets 2000 machines each count words and aggregate statistics for a small part of that file, aggregating the result together in the end.

One of the parts that stood out to me is how there is a clear separation of how to do the call to map and how to do the call to reduce. The other part is all the set calls like spec.set_machines(2000);. I love the simplicity: you tell the system how to map, you tell it how to reduce, you set some options, and run it. Notice specifically that you are not writing network code... this is obviously a very network intensive task, but that is all hidden behind #include "mapreduce/mapreduce.h". This is much like Rinda for Ruby where you do not have to write any network code to distribute objects over the network. You do however have to learn an API to use either Rinda or DRb. MapReduce feels much less like an API and more like a layout, a template that you fill in.

I took the lessons from MapReduce, injected my background of Ruby and came up with what I call Starfish. The backend implementation of Starfish is vastly different than Google's MapReduce: MapReduce is highly optimized for speed and best use of 2000 computer resources at a time, Starfish is highly optimized for speed of development and ease of use. That said, the goal of Starfish is the same as MapReduce.

Starfish takes a large data set (in this case a database), divides the table into many different sections, and lets machines each do work on sections of the database in parallel, aggregating the result together in the end.

Here is some example code:

class Item < ActiveRecord::Base; end

server do |map_reduce|
map_reduce.type = Item

client do |item|
logger.info item.some_processor_intensive_task

You will notice a few major differences quite quickly. First, you do not need to require any libraries, if this file was called item.rb you would run
starfish item.rb
on the command line on as many servers as you want and it will do everything it needed to start working and distributing the work. Next, you do not specify map and reduce functions, rather you specify a client and a server. I loved the simplicity and clarity of defining the two most important parts to Google's MapReduce, but in Ruby it would have been silly to do so because it is not C++ and mapping and reducing is too easy. So I gave it some thought and came up with what I thought was the most important part of distributed programming: what does the server serve and how do the client process the served objects.

Aside from the differences, you will notice the similarity, in the server you are setting options, setting map_reduce.type = Item much like input->set_format("text"); in MapReduce. In the near future, you will be able to tell Starfish that the type is File and let Starfish process files the same way we saw MapReduce do it in the example. Also, logger.info sends some information back to the server that logs it to a file much the same way that out->set_filebase("/gfs/test/freq"); works.

However the biggest major difference is that Starfish is open-source and easy to use. Performing distributed tasks is now a ridiculously easy reality for programmers that may not have been steeped enough in CORBA or some other library to accomplish before.

I hope that you find this library helpful, please tell me how you use it and how I can make it work better for you. There any many options I didn't cover, so if you do use it, please read the documentation.

UPDATE: I wrote an example of how I sent emails 10x faster than before using Starfish.

You should follow me on twitter here.

Technoblog reader special: click here to get $10 off web hosting by FatCow!


Blogger Amr Malik said...

Lucas, this sounds very interesting indeed! Are you at liberty to discuss the architecture at MOG?

I am wondering why it bacame necessary (from an architecture perspective) to read huge datasets (30gb?) from the database and process in memory?

Aren't DBMS systems optimized for this sort of processing or is it something that the DBMS does not offer (horizontally partitioned tables or federated views etc) and therefore necessary to do it in this manner.

I would be very interested in your thoughts on this.


1:31 PM, August 18, 2006

Anonymous Joe Martinez said...

very nice sir!

2:28 PM, August 18, 2006

Anonymous Anonymous said...

The beauty of google's system is that the data was distributed as well. Aren't you going to have bottleneck issues with database disk i/o?

5:06 PM, August 18, 2006

Blogger Lucas Carlson said...

Certainly, but Google built their own distributed file system to deal with that issue. Starfish is not ideal for super-large scale distributed efforts, but it certainly helps for many of the distributed systems you regularly encounter.

5:23 PM, August 18, 2006

Anonymous Anonymous said...

Looking at your reference to GFS ("/gfs/test/freq"), does it mean that your library works on GFS? I didn't know Google had relased this piece of technology. And the same for Google's clusters of commodity Linux PCs. Without that infrastructure, how your library works? Thanks.

5:55 PM, August 18, 2006

Anonymous Anonymous said...

mmm... i have a feeling you don't understand the mapreduce operation well. by spliting into server and clients, your abstraction not only leaks the underlaying network, but you completely forgot the reduce part. certainly you made a distributed system but not a map reduce operation. it looks to me like a small wrapper on top of drb.

6:04 PM, August 18, 2006

Blogger Lucas Carlson said...

To address the GFS comment, GFS is a completely independent tool, a filesystem, used at Google exclusively. Neither Starfish nor MapReduce directly know or care about the underlying filesystem involved. Google does utilize the implications of GFS when using MapReduce to handle some of the disk i/o issues. I am not interested in solving that particular problem since I have not personally reached a point where i/o is the limiting factor.

To address the question to my competence, I would like to say that Starfish and MapReduce were written in vastly different languages with different goals in mind. After studying the papers associated with MapReduce as well as the code examples, I thought to myself about the problems MapReduce set out to solve. I am not interested in writing an exact clone of the code involved to run MapReduce, I am interested in solving a similar problem on a smaller scale.

Secondarily, I am very interested in making distributed programming more accessible to programmers who have never tried to do distributed programming before. I wanted to keep familiar terms like "client" and "server," and keep unfamiliar terms at bay. I feel this helps to quickly understand what is going on which aids in maintenance of Starfish code.

To directly address whether or not I understand the what the mapreduce operation does, I can tell you that in Ruby, mapreduce is simply a call to inject.

To address why I did not feel compelled to add a distributed reduce function to the initial release of starfish, I felt that such a function would be much simpler to write as a helper method to the server than a whole separate operation into itself. I may change my mind about that, but I was going for ease of development of distributed applications, and I find helper methods defined in the server much quicker and easier to write than full reduction methods.

Starfish is not built to compete with MapReduce, it is meant to solve smaller scale distributed programming problems. That said, I believe it succeeds at what it does like no other library I have ever seen.

As a final comment, more of a correction, Starfish is a small wrapper built upon Rinda, which is a small wrapper built upon DRb. I would like to thank Matz for making a language like Ruby and Masatoshi Seki for writing DRb (originally in less than 90 lines of code), it is a joy to stand on the shoulder of giants and write such useful utilities like Starfish as a small wrapper.

9:34 PM, August 18, 2006

Anonymous Anonymous said...

"MapReduce feels much less like an API and more like a layout, a template that you fill in." it is indeed an occurrence of the "template method" design pattern...

3:39 AM, August 19, 2006

Anonymous Anonymous said...

A key feature of Google's MapReduce
is that it's built on top of a batch
system to provide fault tolerance.
So that if a worker node fails
during execution, its subtask will
be transparently reassigned to another node. I haven't looked at the code - does your solution provide fault tolerance? Without it, I doubt the solution would be very practical (but the fact that it's in Ruby is cool!)

I work with Condor and I myself have
recently been thinking about implementing a "poor man's" MapReduce in Python or Ruby on top
of Condor.

7:07 AM, August 19, 2006

Blogger Lucas Carlson said...

It currently does not support fault tolerance, but it will shortly, fault tolerance is easy to add to Starfish the way that I programmed it.

11:26 AM, August 19, 2006

Blogger DJB said...

Looks interesting. I posted a few comments on the project page regarding ways to make it more Windows friendly. :)

- Dan

9:41 PM, August 19, 2006

Anonymous Anonymous said...

it seems to be a map only, no reduce solution

9:48 PM, August 19, 2006

Anonymous Anonymous said...

This is quite misleading:

"I am very happy to announce that Google's MapReduce is now available for Ruby"

Furthermore, the class is named MapReduce.

BUT it's not MapReduce, and it's certainly not "Google's" either.

Yes, it is missing reduce, and there's no point in trying to refute that. MapReduce is NOT simply a call to inject, the same way the stuff you wrote is not "simply a call to #each".

Starfish seems fairly useful on its own; there's no need for you to publicize it (indirect- and somewhat unfortunately) as "Google's MapReduce".

12:02 PM, August 20, 2006

Blogger Lucas Carlson said...

You are entitled to have your anonymous opinion and I am entitled to mine. There are actually ways to do the reduce function in Starfish which I will go into more detail in soon.

1:09 PM, August 20, 2006

Blogger Ronie Uliana said...

That's a great API, indeed! I'd like we have something like that on other languages, maybe in a near future. But, so far, to have something like that on Ruby is great for me.

6:54 AM, August 21, 2006

Blogger Adam Rosien said...

It's not really MapReduce, but I like your Starfish library. I write about my own MapReduce at my blog. Maybe we can collaborate!

10:50 AM, August 21, 2006

Blogger Lucas Carlson said...

I would love to collaborate, let's see where we can go with this.

11:00 AM, August 21, 2006

Anonymous Adrian Madrid said...

Lucas, your library sounds very interesting to. I guess I'm in your target market since I have not played with distributed programming and I have a project coming that would need it. Could you elaborate some more on a n example of how it would be used?


Adrian Madrid

12:18 PM, August 21, 2006

Anonymous Anonymous said...

Lucas, I am might impressed with your work thus far. I look forward to seeing the fault tolerance implemented.

Most distributed frameworks are needlessly complex; please stick to your guns and keep the core simple.

Thanks again for your contribution!


10:05 PM, August 22, 2006

Anonymous Anonymous said...

when installing the gem i've got

lib/starfish.rb:179:27: Couldn't find RingFinger. Assuming it's a module

ruby 1.8.4 on tiger

6:05 AM, September 01, 2006

Blogger Lucas Carlson said...

That is nothing to worry about, just a warning.

11:35 AM, September 01, 2006

Blogger Sausheong said...

Starfish doesn't seem to work in Windows.

6:58 AM, October 20, 2006

Anonymous Anonymous said...

The examples here: http://rufy.com/starfish/doc/ don't show this technology using multiple machines. How easy is it to setup in a distributed environment? It's looking great btw!

12:44 PM, November 03, 2006

Anonymous Anonymous said...

i buy hydrocodone at buy hydrocodone - can't find any cheaper

7:06 AM, January 28, 2007

Anonymous Jared said...

Like some others have said... This seems like it might be useful in a distributed computing environment for doing big tasks... From the example code, I have seen, this is not Map Reduce in the same sense as the Google Labs paper.

Perhaps it's your choice of example? Would you might posting a complete example code of how you would implement the classic example of term counting presented in Section 2.1 of the original MapReduce paper from Google?

4:20 PM, January 29, 2007

Blogger f00biebletch said...

I've tried playing with starfish, one thing that is not clear at all from the examples is how to run a client. I've set up a server hitting a mysql table, and running multiple instances of starfish on a single physical node works great. However, if I simply copy the same code to another node and run starfish concurrently on multiple nodes, I get redundant processing. How do I configure the code to run one server and multiple clients across multiple physical nodes? Thanks.

11:58 AM, January 30, 2007

Blogger Wendy said...

I'm having trouble getting starfish to work with a class hierarchy implemented as single table inheritance in rails.

Is this something starfish can handle, or does it only work with a single class per table?


9:13 PM, June 05, 2007

Anonymous Anonymous said...

Old thread but i found this in relation to:


12:48 AM, January 03, 2008

Blogger Adam said...

I'm the author of Skynet. This article and the code that Lucas wrote was heavily influential in the development of Skynet. Thanks Lucas!


2:10 PM, January 06, 2008

Anonymous Anonymous said...

Lucas, is there a way to remove items from the ActiveRecord source immediately after a client processes it from the map_reduce queue? In your email example, this would be the same as deleting an email after it was sent. Thanks!

8:37 PM, February 09, 2008

Anonymous شركة بن طامى said...

ارخص شركة تنظيف بالرياض

5:30 AM, May 21, 2016

Anonymous شركة بن طامى said...

شركة تنظيف منازل شمال الرياض

افضل شركة جلي بلاط بالرياض

ارخص شركة تنظيف موكيت

افضل شركة تنظيف كنب بالرياض

شركة نظافة

5:31 AM, May 21, 2016

Anonymous شركة بن طامى said...

افضل شركة مكافحة نمل ابيض بالرياض

مكافحة نمل ابيض بالرياض

افضل شركة نقل اثاث بالرياض

ارخص شركة نقل عفش بالرياض

افضل شركة تخزين اثاث بالرياض

شركة شراء اثاث

افضل شركة كشف تسربات بالرياض

ارخص شركة كشف تسربات

افضل شركة ترميمات بالرياض

ارخص شركة دهانات بالرياض

شركة شفط بيارات بالرياض

5:40 AM, May 21, 2016

Blogger Lai Weizhou said...

chung cư tháp doanh nhân
tháp doanh nhân / tháp doanh nhân hà đông / dự án tháp doanh nhân hà đông / chung cư the legend / chung cư the legend 109 nguyễn tuân
chung cư 360 giải phóng / dự án 360 giải phóng / chung cư vietracimex / chung cư 201 minh khai / chung cư thanh xuân tower / chung cư 35 lê văn thiêm
chung cư eco green city / chung cu eco green city / chung cư eco green city nguyễn xiển / chung cu eco green city nguyen xien / eco green city / eco green city nguyễn xiển / dự án eco green city / toà ct4 eco green city / toà ct3 eco green city / toà ct1 eco green city

11:08 AM, May 21, 2016

Blogger Zhenhong Bao said...

toms outlet
michael kors online
ralph lauren uk
coach factory outlet
tory burch outlet
replica watches
toms shoes
fitflops sale
coach outlet
michael kors factory outlet
oakley sunglasses wholesale
longchamp handbags
air max 90
michael kors handbags wholesale
louis vuitton sunglasses
michael kors outlet online
louis vuitton bags
oakley sunglasses
ray ban sunglasses
fitflops clearance
chaussure louboutin
swarovski crystal
michael kors outlet store
michael kors outlet sale
chrome hearts outlet
louboutin pas cher
coach outlet online
louis vuitton outlet
louis vuitton neverfull sale
ralph lauren outlet
rolex watches
swarovski outlet
nike huarache
oakley sunglasses
ralph lauren outlet

6:22 PM, May 30, 2016

Blogger Guu cà phê said...

Mua cây kim tiền ở đâu
Cây kim tiền

2:13 AM, May 31, 2016


Post a Comment

Subscribe to Post Comments [Atom]

<< Home


If you like this blog, you might also like top photography schools.