
Archive for the ‘Technology’ Category

And here it comes again, as it does every year: Google Summer of Code!

As usual, I will be working on Apache Pig this summer as well. This time, however, I will be on the other side of the fence: I will be mentoring!

As I have already graduated and am no longer a student (shock!), I cannot take part as a student (no way!). So I decided to mentor.

  • Pro: you need much less time to mentor compared to being a student.
  • Con: you don’t learn as much, nor have as much fun, nor get the same recognition, nor get paid as much. 😉

However, the great thing about mentoring is that you propose ideas you would like to see realized in the project, and then students make them real!
Indeed, the project I am mentoring this year grew out of an idea I had a while ago.
It happens very often (to me and to my colleagues alike) that you have a list of tuples and need to attach a unique number to each of them.
For example, when creating the lexicon for a collection of documents, you have the list of unique words appearing in the collection, and you want to transform each word into a numerical id to reduce memory usage and ease later processing.

A while ago I implemented a MapReduce algorithm for this task that scales very well (i.e., no single reducer does all the work).
However, this same problem comes up very often, and it is cumbersome to fit a MapReduce job in the middle of a Pig pipeline while you are building the script iteratively (read: while you don't yet know what you are doing). On top of that, the code I wrote is not general enough to be applied as is, so each time I need to customize it a bit.
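
The trick that makes it scale maps naturally onto two passes: first count how many items fall into each partition, then turn those counts into per-partition offsets with a prefix sum, so that every partition can number its items independently. Here is a minimal in-memory sketch of the idea (plain Java, hypothetical names, none of the Hadoop plumbing):

import java.util.*;

public class UniqueIds {
    // Assigns a unique id to every word without funneling all the data
    // through a single process: each partition only needs its own offset,
    // computed from the sizes of the partitions that precede it.
    public static Map<String, Long> assignIds(List<List<String>> partitions) {
        // Pass 1: count the items per partition and prefix-sum the counts.
        long[] offsets = new long[partitions.size()];
        long total = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = total;
            total += partitions.get(p).size();
        }
        // Pass 2: each partition numbers its items independently,
        // starting from its own offset, so the ids are globally unique.
        Map<String, Long> ids = new HashMap<>();
        for (int p = 0; p < partitions.size(); p++) {
            long next = offsets[p];
            for (String word : partitions.get(p)) {
                ids.put(word, next++);
            }
        }
        return ids;
    }
}

In the MapReduce version, only the per-partition counts need to be gathered in one place, which is a trivially small amount of data; the tuples themselves never have to squeeze through a single reducer.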
“Indeed, why the hell does Pig not do it already?” was my reaction. So PIG-2353 was born.

PIG-2353 is quite a stretch for a night’s coding, so it sat there for a while, until a student got interested in it for GSoC. Enter Allan Avendaño.

You can also follow Allan’s progress on his blog.
He has already got his first patch in (PIG-2583) and started working on the main project this week.
I expect great stuff coming out of this project!

I am ready for a great summer, “flipping bits not burgers”.

Read Full Post »

Just discovered this command line tool today!

purge

Yes, that’s all folks! 🙂

Read Full Post »

<rant>
I wonder how you can carry on a deep conversation on any topic when the medium is short slogans 140 characters long. Read in isolation, they always sound convincing, mostly because there is no context around them.

To me it looks very much like trying to communicate by throwing small stones at each other while trying to draw the attention of passers-by.
</rant>

Read Full Post »

I have recently started to look at the BSP (Bulk Synchronous Parallel) model of computation for large-scale graph analysis.
The model itself is quite old, but it has seen a resurgence lately, mainly because of the Pregel system by Google. Here is the paper.
The idea of Pregel is to use synchronous barriers to cleanly separate messages into rounds.
In each step, a vertex can read the messages sent to it in the previous step, perform some computation, and generate messages for the next step.
When all the vertices vote to halt and no more messages are in flight, the computation stops.
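
To make the model concrete, here is a minimal sketch of a Pregel-style vertex program for the classic example from the paper: propagating the maximum value through the graph. The interface is hypothetical, loosely modeled on the paper rather than on any specific implementation:

import java.util.List;

// Minimal Pregel-style vertex (hypothetical interface). In every
// superstep the framework calls compute() with the messages sent to
// this vertex during the previous superstep.
abstract class Vertex {
    int value;
    long superstep;  // current round number, set by the framework
    abstract void compute(List<Integer> messages);
    void sendToNeighbors(int msg) { /* provided by the framework */ }
    void voteToHalt()             { /* provided by the framework */ }
}

// Classic example: every vertex ends up with the maximum value in the graph.
class MaxValueVertex extends Vertex {
    @Override
    void compute(List<Integer> messages) {
        boolean changed = (superstep == 0);  // round 0: everyone announces itself
        for (int m : messages) {
            if (m > value) { value = m; changed = true; }
        }
        if (changed) {
            sendToNeighbors(value);  // wake the neighbors up next superstep
        } else {
            voteToHalt();            // nothing new: go dormant until a message arrives
        }
    }
}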

As with MapReduce and Hadoop, a number of clones of Pregel have popped up.
I will briefly review the most interesting ones in this post.


Apache Giraph

Giraph is a Hadoop-based graph processing engine open sourced by Yahoo! and now in the Apache Incubator.
It is at a very early stage, but the community looks very active and diverse. Giraph developers play in the same backyard as Hadoop, Pig, and Hive developers, and this makes me feel confident about its future.
Giraph uses Hadoop to schedule its workers and to get the data loaded in memory, but then implements its own communication protocol on top of HadoopRPC. It also uses ZooKeeper for fault tolerance via periodic checkpointing.
Interestingly, Giraph does not require installing anything on the cluster, so you can try it out on your existing Hadoop infrastructure. A Giraph job runs like a normal map-only job; it can actually be thought of as a graph-specific Hadoop library.
Because of this, however, the API currently lacks encapsulation: users need to write a custom Reader + Writer + InputFormat + OutputFormat for each Vertex program they create. A library to read common graph formats would be a nice addition.

The good:
Very active community with people from the Hadoop ecosystem.
Runs directly on Hadoop.

The bad:
Early stage.
Hadoop leaks in the API.


Apache Hama

Hama is a Hadoop-like solution and the oldest member of the group. It is currently in the Apache Incubator and was initially developed by an independent Korean programmer.
Hama focuses on general BSP computations, so it is not only for graphs. For example there are algorithms for matrix inversion and linear algebra (I know, one could argue that a graph and a matrix are actually the same data structure).
Unfortunately, the project seems to be moving slowly, even though lately there has been a spike of activity, probably caused mainly by GSoC (it works!).
It is still at a very early stage: the current version doesn’t provide a proper I/O API or a data partitioner. From my understanding, there is no fault tolerance either.
From the technical point of view, Hama uses ZooKeeper for coordination and HDFS for data persistence. It is designed to be “the Hadoop of BSP”.
Given its focus on general BSP, the kind of primitives that it provides are at a low level of abstraction, very much like a restricted version of MPI.
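
To give an idea of what that means in practice, a general BSP program is built around explicit message sends followed by a global barrier, repeated once per superstep, very much like MPI. Here is a sketch of the pattern against a hypothetical peer interface (not Hama’s actual API):

import java.util.List;

// Hypothetical low-level BSP peer interface, sketched only to show the
// level of abstraction; this is not Hama's actual API.
interface BspPeer {
    void send(String peerName, String message);  // queue a message for another peer
    String receive();                            // next delivered message, or null
    void sync() throws InterruptedException;     // global synchronization barrier
    List<String> allPeerNames();
}

class BroadcastExample {
    void run(BspPeer peer, String myName) throws InterruptedException {
        // Superstep 1: send a greeting to every peer in the job.
        for (String other : peer.allPeerNames()) {
            peer.send(other, "hello from " + myName);
        }
        peer.sync();  // messages sent above are delivered only after the barrier
        // Superstep 2: drain what arrived during superstep 1.
        String msg;
        while ((msg = peer.receive()) != null) {
            System.out.println(msg);
        }
    }
}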

The good:
General BSP.
Complete system.
The logo is cute.

The bad:
Early stage.
Requires additional infrastructure.
Graph processing library not yet released.


GoldenOrb

GoldenOrb is a Hadoop-based Pregel clone open sourced by Ravel. It should be in the process of getting into the Apache Incubator.
It is a close clone of Pregel and very similar to Giraph: a system to run distributed iterative algorithms on graphs.
The components of the system, like vertex values (state), edges, and messages, are built on top of the Hadoop Writables system.
One thing I don’t understand is why they didn’t leverage Java generics. To get the value of a message, you need to do ugly explicit casts:

@Override
public void compute(Collection<IntMessage> messages) {
  int _maxValue = 0;
  for (IntMessage m : messages) {
    // getMessageValue() returns an untyped value, hence the explicit cast
    int msgValue = ((IntWritable) m.getMessageValue()).get();
    _maxValue = Math.max(_maxValue, msgValue);
  }
}
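
For comparison, here is a sketch of what a generically typed message interface could look like (hypothetical, not GoldenOrb’s actual API). The type parameter lets the compiler track the payload type, and the cast disappears:

import java.util.Collection;

// Hypothetical generically typed message, not GoldenOrb's actual API.
interface Message<T> {
    T getMessageValue();
}

class MaxVertex {
    int maxValue = 0;
    // Same loop as above, but no cast: the compiler knows the payload type.
    void compute(Collection<Message<Integer>> messages) {
        for (Message<Integer> m : messages) {
            maxValue = Math.max(maxValue, m.getMessageValue());
        }
    }
}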

As with Giraph, Hadoop details leak into the API. Unlike Giraph, however, GoldenOrb requires additional infrastructure to run: an OrbTracker needs to be running on each Hadoop node. It also uses HadoopRPC for communication. The administrator can define the number of partitions to assign to each OrbTracker, and the number of threads to launch per node can be configured on a per-job basis.

The good:
Commercial support.

The bad:
Early stage, not yet in Incubator.
API has rough edges.
Requires additional infrastructure.


JPregel

A final mention goes to JPregel. It is a Java clone of Pregel that is not Hadoop-based.
The programming interface has been designed from the ground up and is very clean.
Right now it is at a very, very early stage of development: for example, messages can only be doubles and the system itself is not yet finalized.
Interestingly, there is no explicit halting primitive. From what I understand, it deduces termination automatically from the absence of new messages.
It is a very nice piece of work, especially considering that it was done by just a couple of first-year master’s students.


PS: I waited too long before publishing this post and was beaten to it by one of the Giraph committers, even though he also reviews some other pieces of software. Have a look at his post.

Read Full Post »

In science, if you know what you are doing, you should not be doing it. In engineering, if you do not know what you are doing, you should not be doing it. Of course, you seldom, if ever, see either pure state.

Richard Hamming, The Art of Doing Science and Engineering

Read Full Post »

Pig committer

Starting from today, I am officially a Pig committer!

The Pig PMC has decided to “promote” me to committer after 2 years of continued involvement in the platform.

I am really excited to be part of this great community, and to receive this acknowledgement of my work.

I will do my best to contribute to the project and make Pig a better platform!

Cheers!

Update: the updated “Who we are” Web page

Read Full Post »

http://blog.s4.io/2011/08/s4-0-3-0-released/

Read Full Post »

Actually, this should not be an update but a wrap-up, as I have basically finished my project for this year. My last patch has already got a +1 and is just waiting for the tests to finish before being committed.

I completed my selected tasks, PIG-1926 and PIG-1904 (see my previous post for an explanation of what they do), plus some more small fixes here and there: PIG-2156, PIG-2136, PIG-2060, PIG-2026, PIG-2025, PIG-2024.

I also offered some longer-term ideas on how to refactor the grammar to make it safer and easier to modify, and on some new features: PIG-2138, PIG-2123, PIG-2119, PIG-2047.

However, given that I still have one month left before the official end of GSoC, I will tackle the rest of the “Sugar” projects listed on the PIG GSoC page, which means adding syntax support for Tuple/Map/Bag conversions: PIG-1387.

All my fixes will go in Pig 0.10, as 0.9 has already been branched and will be out very soon.

Working on the front end has been a very interesting and enriching experience.

  • I got to learn how to use ANTLR (my mentor called me an “ANTLR expert” :P).
  • I learned how Pig scripts are compiled and how to work with the logical, physical and mapreduce levels.
  • I have a full understanding of the workflow and the dataflow of the operators in Pig. I am sure this will come in handy in the future.
  • I also increased my proficiency in Pig Latin scripting.
  • Finally, I really got to seriously use and appreciate git. It makes working on different patches at the same time a breeze.

See you in a month for the actual wrap-up!

Read Full Post »

Pointers in C

I’ve come to realize that understanding pointers in C is not a skill, it’s an aptitude. In first year computer science classes, there are always about 200 kids at the beginning of the semester, all of whom wrote complex adventure games in BASIC for their PCs when they were 4 years old. They are having a good ol’ time learning C or Pascal in college, until one day the professor introduces pointers, and suddenly, they don’t get it. They just don’t understand anything any more. 90% of the class goes off and becomes Political Science majors, then they tell their friends that there weren’t enough good looking members of the appropriate sex in their CompSci classes, that’s why they switched. For some reason most people seem to be born without the part of the brain that understands pointers.

Joel Spolsky (The Guerrilla Guide to Interviewing)

Read Full Post »

git to svn patch

After discovering git I practically fell in love with it.
So I decided to use the git Apache mirror for Pig for this year’s GSoC.
One problem I found is that the ASF (Apache Software Foundation) uses svn (Subversion) patches, but git by default produces a slightly different diff format that is not readily understood by the patch utility. A simple workaround is to use the --no-prefix option of git diff (using -p1 instead of -p0 in the patch command should also work).
To make it completely transparent that I am using a different repository, I also keep a separate pristine tree checked out from svn and always up to date with trunk. To try out my modifications, I can simply check out the branch I want to test, generate a patch, apply it on the fly to the pristine svn tree, and run ant test while I continue working on the git tree. To generate the final patch to submit, I resort again to svn.

git checkout PIG-XXXX                                       # switch to the feature branch
git diff trunk --no-prefix | patch -p0 -d ../pigpristine/   # apply the changes to the pristine svn tree
cd ../pigpristine/ && svn diff > PIG-XXXX.patch             # produce an svn-style patch

In the snippet I assume that the git and svn trees are siblings in the filesystem, and that the svn tree is called pigpristine.

Read Full Post »
