Quantcast

Ruby on Rails, Io, Lisp, JavaScript, Dynamic Languages, Prototype-based programming and more...

Technoblog reader special: $10 off web hosting by FatCow!

Wednesday, August 16, 2006

MapReduce for Ruby: Ridiculously Easy Distributed Programming


Digg
del.icio.us
FURL
Yahoo! My Web 2.0
Reddit


I am very happy to announce that Google's MapReduce is now available for Ruby (via gem install starfish). MapReduce is the technique used by Google to do monstrous distributed programming over 30 terabyte files. I have been reading about MapReduce recently and thought that it was very exciting for Google to have laid out the ideas that ran Google. I also wondered how they could be applied to everyday applications.

Recently, I gave a talk on Ridiculously easy ways to distribute processor intensive tasks using Rinda and DRb. This talk came from my work with Rinda recently at MOG. We use distributed programming to handle real-time processor intensive needs for over 1 million requests a day. We also use it to make large changes or clean up our database. I realized that the plumbing I wrote in Rinda to accomplish these tasks could be abstracted and easily conform to the MapReduce technique.

Before I move on, I will provide a little more background of Google's MapReduce. MapReduce is a C++ library written by Google. There are about 12 MapReduce programs used to create the inverted index of the www that Google uses for searching. The term MapReduce itself refers to map and reduce functions. Joel recently wrote an article that explains what map a reduce do, so I will refrain from repeating him. One of the parts Joel unfortunately messed up on was this sentence though:

[...] you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant [...]

Google, nor anyone I know, has written a map function that will "replace" your existing calls to map, like a plugin. In fact, here is some real world MapReduce example code that is used to provide a word count on an arbitrarily sized document:

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
public:
virtual void Map(const MapInput& input) {
const string& text = input.value();
const int n = text.size();
for (int i = 0; i < n; ) {
// Skip past leading whitespace
while ((i < n) && isspace(text[i]))
i++;
// Find word end
int start = i;
while ((i < n) && !isspace(text[i]))
i++;
if (start < i)
Emit(text.substr(start,i-start),"1");
}
}
};

REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
virtual void Reduce(ReduceInput* input) {
// Iterate over all entries with the
// same key and add the values
int64 value = 0;
while (!input->done()) {
value += StringToInt(input->value());
input->NextValue();
}
// Emit sum for input->key()
Emit(IntToString(value));
}
};

REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
ParseCommandLineFlags(argc, argv);
MapReduceSpecification spec;

// Store list of input files into "spec"
for (int i = 1; i < argc; i++) {
MapReduceInput* input = spec.add_input();
input->set_format("text");
input->set_filepattern(argv[i]);
input->set_mapper_class("WordCounter");
}

// Specify the output files:
// /gfs/test/freq-00000-of-00100
// /gfs/test/freq-00001-of-00100
// ...
MapReduceOutput* out = spec.output();
out->set_filebase("/gfs/test/freq");
out->set_num_tasks(100);
out->set_format("text");
out->set_reducer_class("Adder");

// Optional: do partial sums within map
// tasks to save network bandwidth
out->set_combiner_class("Adder");

// Tuning parameters: use at most 2000
// machines and 100 MB of memory per task
spec.set_machines(2000);
spec.set_map_megabytes(100);
spec.set_reduce_megabytes(100);

// Now run it
MapReduceResult result;

if (!MapReduce(spec, &result)) abort();
// Done: 'result' structure contains info
// about counters, time taken, number of
// machines used, etc.
return 0;
}

MapReduce takes a large data set (in this case a large file), divides the file into many different pieces, and lets 2000 machines each count words and aggregate statistics for a small part of that file, aggregating the result together in the end.

One of the parts that stood out to me is how there is a clear separation of how to do the call to map and how to do the call to reduce. The other part is all the set calls like spec.set_machines(2000);. I love the simplicity: you tell the system how to map, you tell it how to reduce, you set some options, and run it. Notice specifically that you are not writing network code... this is obviously a very network intensive task, but that is all hidden behind #include "mapreduce/mapreduce.h". This is much like Rinda for Ruby where you do not have to write any network code to distribute objects over the network. You do however have to learn an API to use either Rinda or DRb. MapReduce feels much less like an API and more like a layout, a template that you fill in.

I took the lessons from MapReduce, injected my background of Ruby and came up with what I call Starfish. The backend implementation of Starfish is vastly different than Google's MapReduce: MapReduce is highly optimized for speed and best use of 2000 computer resources at a time, Starfish is highly optimized for speed of development and ease of use. That said, the goal of Starfish is the same as MapReduce.

Starfish takes a large data set (in this case a database), divides the table into many different sections, and lets machines each do work on sections of the database in parallel, aggregating the result together in the end.

Here is some example code:

class Item < ActiveRecord::Base; end

server do |map_reduce|
map_reduce.type = Item
end

client do |item|
logger.info item.some_processor_intensive_task
end

You will notice a few major differences quite quickly. First, you do not need to require any libraries, if this file was called item.rb you would run
starfish item.rb
on the command line on as many servers as you want and it will do everything it needed to start working and distributing the work. Next, you do not specify map and reduce functions, rather you specify a client and a server. I loved the simplicity and clarity of defining the two most important parts to Google's MapReduce, but in Ruby it would have been silly to do so because it is not C++ and mapping and reducing is too easy. So I gave it some thought and came up with what I thought was the most important part of distributed programming: what does the server serve and how do the client process the served objects.

Aside from the differences, you will notice the similarity, in the server you are setting options, setting map_reduce.type = Item much like input->set_format("text"); in MapReduce. In the near future, you will be able to tell Starfish that the type is File and let Starfish process files the same way we saw MapReduce do it in the example. Also, logger.info sends some information back to the server that logs it to a file much the same way that out->set_filebase("/gfs/test/freq"); works.

However the biggest major difference is that Starfish is open-source and easy to use. Performing distributed tasks is now a ridiculously easy reality for programmers that may not have been steeped enough in CORBA or some other library to accomplish before.

I hope that you find this library helpful, please tell me how you use it and how I can make it work better for you. There any many options I didn't cover, so if you do use it, please read the documentation.

UPDATE: I wrote an example of how I sent emails 10x faster than before using Starfish.

You should follow me on twitter here.

Technoblog reader special: click here to get $10 off web hosting by FatCow!

41 Comments:

Blogger Amr Malik said...

Lucas, this sounds very interesting indeed! Are you at liberty to discuss the architecture at MOG?

I am wondering why it bacame necessary (from an architecture perspective) to read huge datasets (30gb?) from the database and process in memory?

Aren't DBMS systems optimized for this sort of processing or is it something that the DBMS does not offer (horizontally partitioned tables or federated views etc) and therefore necessary to do it in this manner.

I would be very interested in your thoughts on this.

thanks,

1:31 PM, August 18, 2006

 
Anonymous Joe Martinez said...

very nice sir!

2:28 PM, August 18, 2006

 
Anonymous Anonymous said...

The beauty of google's system is that the data was distributed as well. Aren't you going to have bottleneck issues with database disk i/o?

5:06 PM, August 18, 2006

 
Blogger Lucas Carlson said...

Certainly, but Google built their own distributed file system to deal with that issue. Starfish is not ideal for super-large scale distributed efforts, but it certainly helps for many of the distributed systems you regularly encounter.

5:23 PM, August 18, 2006

 
Anonymous Anonymous said...

Looking at your reference to GFS ("/gfs/test/freq"), does it mean that your library works on GFS? I didn't know Google had relased this piece of technology. And the same for Google's clusters of commodity Linux PCs. Without that infrastructure, how your library works? Thanks.

5:55 PM, August 18, 2006

 
Anonymous Anonymous said...

mmm... i have a feeling you don't understand the mapreduce operation well. by spliting into server and clients, your abstraction not only leaks the underlaying network, but you completely forgot the reduce part. certainly you made a distributed system but not a map reduce operation. it looks to me like a small wrapper on top of drb.

6:04 PM, August 18, 2006

 
Blogger Lucas Carlson said...

To address the GFS comment, GFS is a completely independent tool, a filesystem, used at Google exclusively. Neither Starfish nor MapReduce directly know or care about the underlying filesystem involved. Google does utilize the implications of GFS when using MapReduce to handle some of the disk i/o issues. I am not interested in solving that particular problem since I have not personally reached a point where i/o is the limiting factor.

To address the question to my competence, I would like to say that Starfish and MapReduce were written in vastly different languages with different goals in mind. After studying the papers associated with MapReduce as well as the code examples, I thought to myself about the problems MapReduce set out to solve. I am not interested in writing an exact clone of the code involved to run MapReduce, I am interested in solving a similar problem on a smaller scale.

Secondarily, I am very interested in making distributed programming more accessible to programmers who have never tried to do distributed programming before. I wanted to keep familiar terms like "client" and "server," and keep unfamiliar terms at bay. I feel this helps to quickly understand what is going on which aids in maintenance of Starfish code.

To directly address whether or not I understand the what the mapreduce operation does, I can tell you that in Ruby, mapreduce is simply a call to inject.

To address why I did not feel compelled to add a distributed reduce function to the initial release of starfish, I felt that such a function would be much simpler to write as a helper method to the server than a whole separate operation into itself. I may change my mind about that, but I was going for ease of development of distributed applications, and I find helper methods defined in the server much quicker and easier to write than full reduction methods.

Starfish is not built to compete with MapReduce, it is meant to solve smaller scale distributed programming problems. That said, I believe it succeeds at what it does like no other library I have ever seen.

As a final comment, more of a correction, Starfish is a small wrapper built upon Rinda, which is a small wrapper built upon DRb. I would like to thank Matz for making a language like Ruby and Masatoshi Seki for writing DRb (originally in less than 90 lines of code), it is a joy to stand on the shoulder of giants and write such useful utilities like Starfish as a small wrapper.

9:34 PM, August 18, 2006

 
Anonymous Anonymous said...

"MapReduce feels much less like an API and more like a layout, a template that you fill in." it is indeed an occurrence of the "template method" design pattern...

3:39 AM, August 19, 2006

 
Anonymous Anonymous said...

A key feature of Google's MapReduce
is that it's built on top of a batch
system to provide fault tolerance.
So that if a worker node fails
during execution, its subtask will
be transparently reassigned to another node. I haven't looked at the code - does your solution provide fault tolerance? Without it, I doubt the solution would be very practical (but the fact that it's in Ruby is cool!)

I work with Condor and I myself have
recently been thinking about implementing a "poor man's" MapReduce in Python or Ruby on top
of Condor.

7:07 AM, August 19, 2006

 
Blogger Lucas Carlson said...

It currently does not support fault tolerance, but it will shortly, fault tolerance is easy to add to Starfish the way that I programmed it.

11:26 AM, August 19, 2006

 
Blogger DJB said...

Looks interesting. I posted a few comments on the project page regarding ways to make it more Windows friendly. :)

- Dan

9:41 PM, August 19, 2006

 
Anonymous Anonymous said...

it seems to be a map only, no reduce solution

9:48 PM, August 19, 2006

 
Anonymous Anonymous said...

This is quite misleading:

"I am very happy to announce that Google's MapReduce is now available for Ruby"

Furthermore, the class is named MapReduce.

BUT it's not MapReduce, and it's certainly not "Google's" either.

Yes, it is missing reduce, and there's no point in trying to refute that. MapReduce is NOT simply a call to inject, the same way the stuff you wrote is not "simply a call to #each".

Starfish seems fairly useful on its own; there's no need for you to publicize it (indirect- and somewhat unfortunately) as "Google's MapReduce".

12:02 PM, August 20, 2006

 
Blogger Lucas Carlson said...

You are entitled to have your anonymous opinion and I am entitled to mine. There are actually ways to do the reduce function in Starfish which I will go into more detail in soon.

1:09 PM, August 20, 2006

 
Blogger Ronie Uliana said...

That's a great API, indeed! I'd like we have something like that on other languages, maybe in a near future. But, so far, to have something like that on Ruby is great for me.

6:54 AM, August 21, 2006

 
Blogger Adam Rosien said...

It's not really MapReduce, but I like your Starfish library. I write about my own MapReduce at my blog. Maybe we can collaborate!

10:50 AM, August 21, 2006

 
Blogger Lucas Carlson said...

I would love to collaborate, let's see where we can go with this.

11:00 AM, August 21, 2006

 
Anonymous Adrian Madrid said...

Lucas, your library sounds very interesting to. I guess I'm in your target market since I have not played with distributed programming and I have a project coming that would need it. Could you elaborate some more on a n example of how it would be used?

Thanks,


Adrian Madrid

12:18 PM, August 21, 2006

 
Anonymous Anonymous said...

Lucas, I am might impressed with your work thus far. I look forward to seeing the fault tolerance implemented.

Most distributed frameworks are needlessly complex; please stick to your guns and keep the core simple.

Thanks again for your contribution!

Jim
http://www.runfatboy.net

10:05 PM, August 22, 2006

 
Anonymous Anonymous said...

when installing the gem i've got

lib/starfish.rb:179:27: Couldn't find RingFinger. Assuming it's a module

ruby 1.8.4 on tiger

6:05 AM, September 01, 2006

 
Blogger Lucas Carlson said...

That is nothing to worry about, just a warning.

11:35 AM, September 01, 2006

 
Blogger Sausheong said...

Starfish doesn't seem to work in Windows.

6:58 AM, October 20, 2006

 
Anonymous Anonymous said...

The examples here: http://rufy.com/starfish/doc/ don't show this technology using multiple machines. How easy is it to setup in a distributed environment? It's looking great btw!

12:44 PM, November 03, 2006

 
Anonymous Anonymous said...

i buy hydrocodone at buy hydrocodone - can't find any cheaper

7:06 AM, January 28, 2007

 
Anonymous Jared said...

Like some others have said... This seems like it might be useful in a distributed computing environment for doing big tasks... From the example code, I have seen, this is not Map Reduce in the same sense as the Google Labs paper.

Perhaps it's your choice of example? Would you might posting a complete example code of how you would implement the classic example of term counting presented in Section 2.1 of the original MapReduce paper from Google?

4:20 PM, January 29, 2007

 
Blogger f00biebletch said...

I've tried playing with starfish, one thing that is not clear at all from the examples is how to run a client. I've set up a server hitting a mysql table, and running multiple instances of starfish on a single physical node works great. However, if I simply copy the same code to another node and run starfish concurrently on multiple nodes, I get redundant processing. How do I configure the code to run one server and multiple clients across multiple physical nodes? Thanks.

11:58 AM, January 30, 2007

 
Blogger Wendy said...

I'm having trouble getting starfish to work with a class hierarchy implemented as single table inheritance in rails.

Is this something starfish can handle, or does it only work with a single class per table?

thanks.

9:13 PM, June 05, 2007

 
Anonymous Anonymous said...

Old thread but i found this in relation to:

http://skynet.rubyforge.org/

12:48 AM, January 03, 2008

 
Blogger Adam said...

I'm the author of Skynet. This article and the code that Lucas wrote was heavily influential in the development of Skynet. Thanks Lucas!

adam

2:10 PM, January 06, 2008

 
Anonymous Anonymous said...

Lucas, is there a way to remove items from the ActiveRecord source immediately after a client processes it from the map_reduce queue? In your email example, this would be the same as deleting an email after it was sent. Thanks!

8:37 PM, February 09, 2008

 
Anonymous Anonymous said...

hi Lucas,
could you please explain something to show this technology using multiple machines? I want to send many mails once but how can I call 10 clients at a time?

4:43 AM, August 26, 2008

 
Blogger micko said...

I just found this blog and find the information on the ssae 16 audit very helpful. Keep up the great work, its hard to find good ones. I have added to my favorites. sizegenetics discountInteresting post. I have been wondering about this issue,so thanks for posting.

9:28 PM, March 14, 2014

 
Anonymous Anonymous said...

I'm often to running a weblog and i truly admire your content material. The article has truly peaks my interest. I am going to bookmark your website and maintain checking for new info.stellar phoenix photo recovery


11:04 PM, March 14, 2014

 
Anonymous Anonymous said...

his can be among the ideal posts that I've ever seen; you might contain some extra ideas inside the similar theme. I'm nonetheless waiting for some interesting thoughts from your side inside your subsequent post.wondershare dr fone

11:06 PM, March 14, 2014

 
Blogger basma gaber said...

شركة نظافة فلل بالرياض
تنظيف فلل بالرياض
شركة تنظيف فلل
شركة تنظيف فلل بالرياض
تنظيف فلل

شركة نظافة
شركة نظافة بالرياض
شركات تنظيف بالرياض
مؤسسة نظافة بالرياض
شركة تنظيف
افضل شركة تنظيف بالرياض
شركة تنظيف بالرياض

شركة تنظيف شقق بالرياض
شركة تنظيف شقق

11:20 PM, April 02, 2014

 
Blogger basma gaber said...


شركة نظافة شقق بالرياض

شركة تنظيف منازل بالرياض
شركة نظافة منازل بالرياض
شركة تنظيف منازل
شركة تنظيف بيوت بالرياض

شركة تنظيف مسابح بالرياض

شركة تنظيف مجالس بالرياض

شركة تنظيف كنب بالرياض

جلي بلاط في الرياض

شركة تنظيف موكيت بالرياض
شركة تنظيف سجاد بالرياض
شركات تخزين اثاث بالرياض

افضل شركة لنقل الاثاث بالرياض
نقل اثاث داخل الرياض

11:21 PM, April 02, 2014

 
Blogger basma gaber said...

شركة تنظيف بالرياض

شركة كشف تسربات بالرياض

شركة نظافة فلل بالرياض

شركة مكافحة حشرات بالرياض

شركة نقل اثاث

شركة رش مبيدات بالرياض

شركة تسليك مجاري

شركة تنظيف مساجد بالرياض

شركة تنظيف منازل بالمدينة

2:42 PM, May 18, 2014

 
Blogger hakeem nour said...

شركة تنظيف شقق بالرياض
شركة تنظيف واجهات بالرياض
شركة تنظيف مجالس بالرياض
شركة تنظيف موكيت بالرياض
شركة غسيل السجاد بالرياض
شركة تنظيف مسابح بالرياض
شركة كشف تسربات بالرياض
كشف تسربات المياه بالرياض
عزل خزانات واسطح
شركة تنظيف خزانات بالرياض
شركة عزل اسطح بالرياض
شركة عزل خزانات بالرياض
مكافحة حشرات
شركة رش مبيدات بالرياض
شركة مكافحة حشرات بالرياض

11:46 AM, June 07, 2014

 
Blogger كيمو نور said...

here
here
here

تنظيف مجالس
شركة تنظيف مجالس بالرياض
شركة تنظيف موكيت بالرياض
شركة غسيل السجاد بالرياض
شركة تسليك مجارى بالرياض
شركة تنظيف بيارات بالرياض
شركة شفط بيارات بالرياض
شركة تنظيف بالرياض
شركة تنظيف فلل بالرياض
شركة تنظيف شقق بالرياض
شركة تنظيف منازل بالرياض
شركة تنظيف بيوت بالرياض
شركة كشف تسربات بالرياض
شركة تسربات المياه بالرياض

11:45 AM, July 21, 2014

 
Blogger كيمو نور said...


شركة تنظيف واجهات بالرياض
شركة تنظيف مسابح بالرياض
شركة تنظيف بالرياض
عزل خزانات واسطح
شركة عزل خزانات بالرياض
شركة تنظيف خزانات بالرياض
شركة عزل اسطح بالرياض
نقل اثاث
شركة نقل اثاث بالرياض
شركة نقل عفش بالرياض
شركة تخزين عفش بالرياض
شركة رش مبيدات بالرياض
شركة مكافحة حشرات بالرياض
مكافحة حشرات

11:45 AM, July 21, 2014

 
Blogger كيمو نور said...

شركة تنظيف فلل بالرياض
شركة مكافحة حشرات بالرياض
شركة عزل خزانات بالرياض
شركة تسليك مجارى بالرياض
شركة تخزين اثاث بالرياض
كشف تسربات المياه
شركة نقل عفش بالرياض
تنظيف خزنات بالرياض

11:46 AM, July 21, 2014

 

Post a Comment

Subscribe to Post Comments [Atom]

<< Home

 

If you like this blog, you might also like top photography schools.