Ruby on Rails, Io, Lisp, JavaScript, Dynamic Languages, Prototype-based programming and more...

Monday, August 28, 2006

How I processed a log file 20x SLOWER than before

Sometimes you have to sit back and laugh at yourself. Having recently written Starfish to help speed up slow tasks, I tried to find as many uses as I could for it. At MOG we have to parse huge log files so I thought I would be clever and try to use Starfish for the task. After running it for a while, I looked at the stats only to find that it had been processing my file 20x slower than it would have without distributing it. At first I was puzzled, until I realized a very important thing about distributing processes. You have to make sure that the task you distribute takes longer than the distribution process.
if overhead_time > processing_time then puts "Don't use Starfish" end

It turns out that I could process 10,000 lines of the log in about a second... so to send each one of those lines over the network to have them processed was just silly. Even sending 10,000 lines at a time is relatively unnecessary.
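The rule of thumb above can be checked before distributing anything. What follows is only a minimal sketch and not part of Starfish: `worth_distributing?`, `parse_line`, and the 5 ms per-message overhead are hypothetical names and numbers picked for illustration. It times the local work for one batch and compares it with an assumed network cost per message.

```ruby
require 'benchmark'

# Hypothetical per-line work; stands in for whatever the log parser does.
def parse_line(line)
  line.split(' ').length
end

# Returns true only when processing one batch locally takes longer than the
# (assumed) cost of shipping that batch over the network.
def worth_distributing?(batch, overhead_per_message)
  processing_time = Benchmark.realtime { batch.each { |line| parse_line(line) } }
  processing_time > overhead_per_message
end

lines = Array.new(10_000) { |i| "127.0.0.1 - - GET /page/#{i} 200" }

# Assumed round-trip cost of one network message, in seconds.
overhead = 0.005

# One line per message: the work per message is tiny next to the overhead.
puts "1 line per message worth it?      #{worth_distributing?(lines.first(1), overhead)}"
# A whole batch per message: the answer depends on how fast this machine is.
puts "10,000 lines per message worth it? #{worth_distributing?(lines, overhead)}"
```

The useful part is the comparison itself: measure the real per-batch work and the real per-message cost on your own network before deciding how big a batch has to be.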

I share this story so that you might not make the same mistake that I did. However, I realized that Starfish can detect when the overhead makes distribution not worth the trouble, so it can warn people when a task is or is not a good use of resources. I will be adding this to the next release, which will come out shortly.

Wednesday, August 23, 2006

Title Match of the Century: Speed of Development vs. Speed of Computation

It is entertaining to read the comments people have about Starfish over on Reddit:
If your task is intensive enough to warrant parallelization, it is intensive enough to warrant investigating faster languages. Ruby is good for a lot of things, but if my choice is between throwing more processors at the problem or finding a better solution I will go for the better solution every time.

Interesting point. Not that he is right, but what he omits gives me pause. The vast majority of the comments were people talking about Ruby being 1000 times slower than their language, but they give no consideration at all to the most striking aspect of Starfish (in my opinion): I can do relatively advanced distributed programming in 6 lines of code.

I'll say it again because it is important. 6 lines of code.

With hardly more than a flick of my wrist, I can parallelize a task and get performance gains of 10, 20, 30 times, whatever I need. In less than a minute, I can write code that will go through a 10 GB log file grepping for a string, parsing and collecting that information, and waiting for new lines to process on demand, in a distributed system that can work over N machines.

I have written much simpler processes in faster languages like C and it takes me hours and hours, not only for writing the hundreds of lines of code but for debugging the darn thing. If I was tasked with creating a distributed log parser in C that did something non-trivial with each line of the log, it could take me a week and it still wouldn't be right.

I work at startups. I don't work for banks, I don't work for Microsoft, I don't work for enterprise. Can I, as a head programmer at a startup, afford over 80 hours of my time writing a log parser in C because it could be 1000 times faster? Not if my startup wants to succeed. Can Microsoft afford to have one of its tens of thousands of programmers spend the time to do that? Of course.

Starfish can and does, on a daily basis, parallelize and speed up what would have otherwise been a slower process. It does so with almost no code.

A few minutes and N times faster than a regular Ruby script, or a few weeks and N times faster than a regular Ruby script. I know which I choose, and it works extremely well for us. I am always a fan of the right tool for the right purpose. I know that Starfish is not always the right tool, but it is amazing how quickly people discount a tool without considering all of the issues involved. It is not always about processing power. Man hours saved can be much more valuable than a few extra orders of magnitude in processor power.

More Advanced Starfish Feature

I promised you in Dynamically Add Methods to Classes Through their Objects in Ruby that there was a good use for that idea coming up. The time has come to show you how to use it.

server do |map_reduce|
  map_reduce.type = File
  map_reduce.input = "some_file_name.txt"

  map_reduce.process = lambda do |text|
    do_something_on_the_server(text)
  end

  map_reduce.finished = lambda do
    do_something_when_the_collection_is_totally_processed()
  end
end

client do |line|
  if line =~ /some_regular_expression/
    server.process($1)
  end
end

Notice how I am dynamically adding methods to map_reduce in the server declaration. I define the process and finished methods. The process method is called from the client via server.process and the finished method is called when the collection has been fully processed.

Astute readers will notice that being able to dynamically add server-side helper methods gives you a non-distributed version of reduce (from MapReduce), which is good enough for many real world situations. Enjoy!
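To make the "non-distributed reduce" idea concrete, here is a sketch in plain Ruby that does not use Starfish at all. `FakeServer` and `run_client` are hypothetical stand-ins: the client picks the interesting piece out of each line, the server folds every value it receives into one running tally, and `finished` reports the tally once at the end.

```ruby
# Hypothetical stand-in for the server side; not the Starfish implementation.
class FakeServer
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called once per value sent by a client. This is the "reduce" step,
  # running in a single place instead of being distributed.
  def process(text)
    @counts[text] += 1
  end

  # Called once after every line has been handled.
  def finished
    @counts.sort.map { |key, count| "#{key}: #{count}" }
  end
end

# The client part: extract the captured group from each matching line
# (the map step) and hand it to the server.
def run_client(server, lines)
  lines.each do |line|
    server.process($1) if line =~ /status=(\d+)/
  end
  server.finished
end

log = ["GET / status=200", "GET /missing status=404", "GET /about status=200", "noise"]
puts run_client(FakeServer.new, log)
```

Because all the aggregation happens in one object, there is nothing to merge across machines, which is why this only approximates the reduce half of MapReduce.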

Tuesday, August 22, 2006

Starfish is MapReduce for Ruby

MapReduce and CORBA are huge honking power drills; Starfish is a nice little screwdriver. I call Starfish the MapReduce of Ruby because they both do the same task: screw.

I am trying to build the simplest to use, easiest to set up, and fastest to enjoy screwdriver on earth. People who ridicule Starfish for its simplicity and lack of features are inadvertently praising me for succeeding at the goal I set for myself. Therefore, ridiculers, I thank you.

How I sent emails 10x faster than before

Like many startups, at MOG we send out regular updates to our users with news and information. As our user base expands, sending this email takes more and more time. Even though the call to deliver the mail only puts it in the sendmail queue, it can take a chunk of time to do so with so many users.

When I demoed Starfish to people, the common response was: that's great, I wish I had a use for it, I wish I had a DB source big enough to use it on. Well, here is one of the ways we use Starfish. It is not mission critical, but it is still damn cool.

require 'config/environment'
require 'user'
require 'notifier'

server do |map_reduce|
  map_reduce.type = User
  map_reduce.conditions = "opt_out = 0"
end

client do |user|
  Notifier.deliver_email(user)
end

This tiny amount of code, with next to nothing that needs to be memorized, takes 30 seconds to write down and can potentially save you hours in delivery time. Even running 10 clients at once on the SAME MACHINE made the job nearly 10x faster than running it serially. This was not mission critical, but it gives you a good sense of ways to apply Starfish to mission critical applications.
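A speedup from several clients on one machine is plausible when each delivery spends most of its time waiting rather than computing. The sketch below illustrates that effect with plain Ruby threads and a shared queue. It is not how Starfish works internally, and `deliver` and `deliver_all` are hypothetical: `deliver` just sleeps to imitate the wait that a real `Notifier.deliver_email(user)` call might incur.

```ruby
require 'benchmark'

# Hypothetical delivery call: sleeps briefly to imitate time spent waiting
# on the mail queue.
def deliver(user)
  sleep 0.01
  "sent to #{user}"
end

# Drain a shared queue of users with the given number of worker threads and
# return every delivery result.
def deliver_all(users, worker_count)
  queue = Queue.new
  users.each { |user| queue << user }
  results = Queue.new

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        user = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << deliver(user)
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

users = (1..50).map { |i| "user#{i}@example.com" }
serial   = Benchmark.realtime { deliver_all(users, 1) }
parallel = Benchmark.realtime { deliver_all(users, 10) }
puts format("1 worker: %.2fs, 10 workers: %.2fs", serial, parallel)
```

The gain shown here comes entirely from overlapping waits. A task that is CPU-bound in Ruby would need separate processes or machines to see a similar improvement.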

Wednesday, August 16, 2006

MapReduce for Ruby: Ridiculously Easy Distributed Programming



I am very happy to announce that Google's MapReduce is now available for Ruby (via gem install starfish). MapReduce is the technique used by Google to do monstrous distributed programming over 30 terabyte files. I have been reading about MapReduce recently and thought that it was very exciting for Google to have laid out the ideas that ran Google. I also wondered how they could be applied to everyday applications.

Recently, I gave a talk on Ridiculously easy ways to distribute processor intensive tasks using Rinda and DRb. This talk came from my work with Rinda recently at MOG. We use distributed programming to handle real-time processor intensive needs for over 1 million requests a day. We also use it to make large changes or clean up our database. I realized that the plumbing I wrote in Rinda to accomplish these tasks could be abstracted and easily conform to the MapReduce technique.
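For readers who have not used Rinda, here is a rough sketch of the kind of plumbing meant here. It is not the code from the talk or from Starfish: `square_with_tuplespace` and the `:task` / `:result` tuple layout are assumptions made for illustration, and everything runs in one process with threads instead of being shared over DRb.

```ruby
require 'rinda/tuplespace'

# Square each number by passing tasks and results through a tuple space.
# Everything runs in one process here; with DRb the same TupleSpace object
# could be shared by workers on other machines.
def square_with_tuplespace(numbers, worker_count)
  space = Rinda::TupleSpace.new

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        # nil in a template is a wildcard, so this blocks until a task arrives.
        _tag, number = space.take([:task, nil])
        space.write([:result, number, number * number])
      end
    end
  end

  numbers.each { |number| space.write([:task, number]) }

  # Collect exactly one result per task, then stop the idle workers.
  results = numbers.map { space.take([:result, nil, nil]) }
  workers.each(&:kill)

  results.map { |_tag, number, square| [number, square] }.sort
end

p square_with_tuplespace([1, 2, 3, 4], 2)
```

The workers never talk to each other or to the producer directly. They only take and write tuples, which is the property that makes this pattern easy to spread across machines.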

Before I move on, I will provide a little more background on Google's MapReduce. MapReduce is a C++ library written by Google. There are about 12 MapReduce programs used to create the inverted index of the www that Google uses for searching. The term MapReduce itself refers to map and reduce functions. Joel recently wrote an article that explains what map and reduce do, so I will refrain from repeating him. One of the parts Joel unfortunately messed up on was this sentence though:

[...] you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it's a zillion times faster which means it can be used to tackle huge problems in an instant [...]

Neither Google nor anyone I know has written a map function that will "replace" your existing calls to map, like a plugin. In fact, here is some real world MapReduce example code that is used to provide a word count on an arbitrarily sized document:

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start,i-start),"1");
    }
  }
};

REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};

REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  // /gfs/test/freq-00000-of-00100
  // /gfs/test/freq-00001-of-00100
  // ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map
  // tasks to save network bandwidth
  out->set_combiner_class("Adder");

  // Tuning parameters: use at most 2000
  // machines and 100 MB of memory per task
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;

  if (!MapReduce(spec, &result)) abort();
  // Done: 'result' structure contains info
  // about counters, time taken, number of
  // machines used, etc.
  return 0;
}

MapReduce takes a large data set (in this case a large file), divides the file into many different pieces, and lets 2000 machines each count words and aggregate statistics for a small part of that file, aggregating the result together in the end.
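The same divide, count, and merge flow can be sketched in a few lines of single-process Ruby. This is only an illustration of the idea, not Google's library or Starfish: `map_chunk`, `reduce_counts`, and `word_count` are hypothetical names, and splitting the text into groups of lines stands in for handing pieces of a file to different machines.

```ruby
# Map step: count the words in one piece of the input.
def map_chunk(text)
  text.split.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

# Reduce step: merge the partial counts, adding values that share a key.
def reduce_counts(partials)
  partials.reduce({}) do |total, partial|
    total.merge(partial) { |_word, left, right| left + right }
  end
end

# Divide the input into pieces, map each piece, then reduce the results.
# In MapReduce each piece would go to a different machine; here the pieces
# are simply processed one after another.
def word_count(text, lines_per_chunk)
  chunks = text.lines.each_slice(lines_per_chunk).map(&:join)
  reduce_counts(chunks.map { |chunk| map_chunk(chunk) })
end

text = "the quick brown fox\njumps over the lazy dog\nthe end\n"
p word_count(text, 1)
```

Because each chunk is counted independently and the merge only adds numbers, the result does not depend on how the input is split, which is what lets the real system hand chunks to as many machines as it likes.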

One of the parts that stood out to me is how there is a clear separation of how to do the call to map and how to do the call to reduce. The other part is all the set calls like spec.set_machines(2000);. I love the simplicity: you tell the system how to map, you tell it how to reduce, you set some options, and run it. Notice specifically that you are not writing network code... this is obviously a very network intensive task, but that is all hidden behind #include "mapreduce/mapreduce.h". This is much like Rinda for Ruby where you do not have to write any network code to distribute objects over the network. You do however have to learn an API to use either Rinda or DRb. MapReduce feels much less like an API and more like a layout, a template that you fill in.

I took the lessons from MapReduce, injected my background in Ruby, and came up with what I call Starfish. The backend implementation of Starfish is vastly different from Google's MapReduce: MapReduce is highly optimized for speed and for making the best use of 2000 computers at a time, while Starfish is highly optimized for speed of development and ease of use. That said, the goal of Starfish is the same as that of MapReduce.

Starfish takes a large data set (in this case a database), divides the table into many different sections, and lets machines each do work on sections of the database in parallel, aggregating the result together in the end.

Here is some example code:

class Item < ActiveRecord::Base; end

server do |map_reduce|
  map_reduce.type = Item
end

client do |item|
  logger.info item.some_processor_intensive_task
end

You will notice a few major differences quite quickly. First, you do not need to require any libraries. If this file were called item.rb, you would run
starfish item.rb
on the command line on as many servers as you want, and it would do everything it needs to start working and distributing the work. Next, you do not specify map and reduce functions; rather, you specify a client and a server. I loved the simplicity and clarity of defining the two most important parts of Google's MapReduce, but in Ruby it would have been silly to do so because it is not C++, and mapping and reducing are too easy. So I gave it some thought and came up with what I thought was the most important part of distributed programming: what does the server serve, and how does the client process the served objects?

Aside from the differences, you will notice the similarity: in the server you are setting options, with map_reduce.type = Item playing much the same role as input->set_format("text"); in MapReduce. In the near future, you will be able to tell Starfish that the type is File and let Starfish process files the same way we saw MapReduce do it in the example. Also, logger.info sends some information back to the server, which logs it to a file, much the same way that out->set_filebase("/gfs/test/freq"); works.
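To illustrate the "what does the server serve, and how does the client process it" split, here is a toy, single-process imitation of that layout. It is emphatically not how Starfish is implemented: `MiniStarfish`, `Config`, and `run` are hypothetical, a plain Array stands in for the ActiveRecord class, and nothing is sent over a network.

```ruby
# Toy imitation of the server/client layout; nothing here is distributed.
module MiniStarfish
  # Options the server block may set. Only :type is used in this sketch.
  Config = Struct.new(:type)

  @server_block = nil
  @client_block = nil

  class << self
    # Remember how the server should be configured.
    def server(&block)
      @server_block = block
    end

    # Remember what a client does with each served object.
    def client(&block)
      @client_block = block
    end

    # Configure the "server", then feed every object it serves to the client.
    def run
      config = Config.new
      @server_block.call(config)
      config.type.map { |object| @client_block.call(object) }
    end
  end
end

# Stand-in for a model class: anything that responds to #map works here.
ITEMS = [1, 2, 3].freeze

MiniStarfish.server do |map_reduce|
  map_reduce.type = ITEMS
end

MiniStarfish.client do |item|
  item * 10 # placeholder for some_processor_intensive_task
end

p MiniStarfish.run
```

The point of the layout is that the script only declares two blocks. Where those blocks run, and on how many machines, is left to whatever executes the file.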

However, the biggest difference is that Starfish is open source and easy to use. Performing distributed tasks is now ridiculously easy for programmers who may not have been steeped enough in CORBA or some other library to accomplish them before.

I hope that you find this library helpful. Please tell me how you use it and how I can make it work better for you. There are many options I didn't cover, so if you do use it, please read the documentation.

UPDATE: I wrote an example of how I sent emails 10x faster than before using Starfish.

Friday, August 11, 2006

Ruby Cookbook PDF

I have some exciting news to share with you all: the Ruby Cookbook is now available for half price in PDF form from O'Reilly. That's right, you can now download and search the cookbook for all your recipes and needs. I usually don't prefer PDF books, but even if I hadn't written it, the cookbook would certainly be an exception, since each recipe is about a page or two in length. I think the cookbook makes more sense as a PDF than some other technical books do. Do you guys prefer hard copies or PDF copies?

Thursday, August 03, 2006

Dynamically Add Methods to Classes Through their Objects in Ruby

Playing with the implementation of a new library I am working on called starfish (I will blog about it shortly), I came up with this fun little Ruby hack that makes Ruby seem more like a prototype-based language than it already does.


class Foo
  def method_missing(name, *args)
    if name.to_s =~ /(.*)=$/ && args[0].is_a?(Proc)
      self.class.instance_eval do
        define_method($1, args[0])
      end
    else
      super
    end
  end
end

f = Foo.new
f.greet = lambda {|t| "Hello #{t}!"}
f.greet "Lucas Carlson" # => Hello Lucas Carlson!

j = Foo.new
j.greet "World" # => Hello World!
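Note that the hack above defines the new method on the class, which is why the second object `j` can call `greet` too. If you wanted something closer to a prototype-based feel, where only the object you assigned to gains the method, one possible variant is sketched below. `Bar` is a hypothetical class for illustration, and this is not what Starfish does.

```ruby
# Variant sketch: assigning a Proc adds the method to this one object only.
class Bar
  def method_missing(name, *args)
    if name.to_s =~ /(.*)=$/ && args[0].is_a?(Proc)
      define_singleton_method($1, args[0])
    else
      super
    end
  end

  # Keep respond_to? consistent with the setters handled above.
  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?('=') || super
  end
end

b = Bar.new
b.greet = lambda { |t| "Hello #{t}!" }
puts b.greet("Lucas Carlson")

other = Bar.new
puts other.respond_to?(:greet) # the second object did not gain the method
```

Which version is preferable depends on intent: the class-wide form suits configuration objects that share behavior, while the per-object form keeps one instance's additions from leaking into the others.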


Hope you enjoy it as much as I do and I can't wait to show you how I am using it!
