
Posts Tagged ‘graph’

I have recently started to look at the BSP (Bulk Synchronous Parallel) model of computation for large-scale graph analysis.
The model itself is quite old, but it has seen a resurgence lately, mainly because of the Pregel system by Google. Here is the paper.
The idea of Pregel is to use synchronous barriers to cleanly separate messages into rounds.
In each step, a vertex can read the messages sent to it in the previous step, perform some computation, and generate messages for the next step.
When all the vertices have voted to halt and no more messages are in flight, the computation stops.
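To make the superstep idea concrete, here is a minimal single-process sketch of it. It is not Pregel's API: the class `MaxValueBsp` and its `run` method are names I made up for illustration, and a real system would spread the vertices over many workers. It propagates the maximum value through a graph whose edges are listed in both directions, so every vertex ends up with the largest value in its connected component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-process simulation of Pregel-style supersteps.
class MaxValueBsp {

    /**
     * @param values    initial value of each vertex (index = vertex id)
     * @param neighbors neighbors[v] lists the vertices that v sends messages to
     * @return the value held by each vertex when the computation stops
     */
    static int[] run(int[] values, int[][] neighbors) {
        int n = values.length;
        int[] value = Arrays.copyOf(values, n);
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyBoxes(n);

        for (int superstep = 0; ; superstep++) {
            List<List<Integer>> outbox = emptyBoxes(n);
            boolean anyMessageSent = false;

            for (int v = 0; v < n; v++) {
                List<Integer> messages = inbox.get(v);
                // A halted vertex is woken up only by an incoming message.
                if (halted[v] && messages.isEmpty()) continue;

                int max = value[v];
                for (int m : messages) max = Math.max(max, m);

                // Send in the first superstep, or whenever a larger value was learned.
                if (superstep == 0 || max > value[v]) {
                    value[v] = max;
                    for (int u : neighbors[v]) {
                        outbox.get(u).add(max);
                        anyMessageSent = true;
                    }
                }
                halted[v] = true; // vote to halt
            }

            // Barrier: messages sent now become readable only in the next superstep.
            inbox = outbox;
            // Every vertex has voted to halt; stop if nothing is in flight.
            if (!anyMessageSent) break;
        }
        return value;
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>(n);
        for (int i = 0; i < n; i++) boxes.add(new ArrayList<>());
        return boxes;
    }
}
```

Swapping `inbox` for `outbox` at the end of the loop plays the role of the barrier: nothing sent during a superstep can be read before the next one starts.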

As with MapReduce and Hadoop, a number of clones of Pregel have popped up.
I will briefly review the most interesting ones in this post.

 

Apache Giraph

Giraph is a Hadoop-based graph processing engine open sourced by Yahoo! and now in the Apache Incubator.
It is at a very early stage, but the community looks very active and diverse. Giraph developers play in the same backyard as Hadoop, Pig and Hive developers, and this makes me feel confident about its future.
Giraph uses Hadoop to schedule its workers and load the data into memory, but then implements its own communication protocol using HadoopRPC. It also uses ZooKeeper and achieves fault tolerance through periodic checkpointing.
Interestingly, Giraph does not require installing anything on the cluster, so you can try it out on your existing Hadoop infrastructure. A Giraph job runs like a normal map-only job. It can actually be thought of as a graph-specific Hadoop library.
However, because of this, the API currently lacks encapsulation. Users need to write a custom Reader + Writer + InputFormat + OutputFormat for each Vertex program they create. A library to read common graph formats would be a nice addition.

The good:
Very active community with people from the Hadoop ecosystem.
Runs directly on Hadoop.

The bad:
Early stage.
Hadoop details leak into the API.

 

Apache Hama

Hama is a Hadoop-like solution and the oldest member of the group. It was initially developed by an independent Korean programmer and is currently in the Apache Incubator.
Hama focuses on general BSP computations, so it is not only for graphs. For example, there are algorithms for matrix inversion and linear algebra (I know, one could argue that a graph and a matrix are actually the same data structure).
Unfortunately, the project seems to be moving slowly, although there has been a recent spike of activity, probably driven mainly by GSoC (it works!).
It is still at a very early stage: the current version doesn’t provide a proper I/O API or a data partitioner. From my understanding, there is no fault tolerance either.
From a technical point of view, Hama uses ZooKeeper for coordination and HDFS for data persistence. It is designed to be “The Hadoop of BSP”.
Given its focus on general BSP, the primitives it provides are at a low level of abstraction, very much like a restricted version of MPI.
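To give a feeling for what "low level" means here, this is a small in-process sketch of BSP with nothing but send and sync. It is not Hama's API: `BspSum` is a hypothetical name, the peers are plain threads, the "send" is a write into a shared queue, and the "sync" is a `CyclicBarrier` from the JDK. Each peer sums its own chunk, sends the partial result to peer 0, and peer 0 adds everything up after the barrier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-process BSP job with MPI-like send/sync primitives.
class BspSum {

    /** Sums all numbers; chunks[p] is the data owned by peer p. */
    static long sum(long[][] chunks) {
        int peers = chunks.length;
        if (peers == 0) return 0L;

        // inboxes.get(p) collects the messages addressed to peer p.
        List<Queue<Long>> inboxes = new ArrayList<>();
        for (int p = 0; p < peers; p++) inboxes.add(new ConcurrentLinkedQueue<>());
        CyclicBarrier barrier = new CyclicBarrier(peers);
        AtomicLong result = new AtomicLong();

        Thread[] threads = new Thread[peers];
        for (int p = 0; p < peers; p++) {
            final int id = p;
            threads[p] = new Thread(() -> {
                try {
                    // Superstep 0: local computation, then send to peer 0.
                    long local = 0;
                    for (long x : chunks[id]) local += x;
                    inboxes.get(0).add(local); // "send"
                    barrier.await();           // "sync"

                    // Superstep 1: peer 0 reads what was sent before the barrier.
                    if (id == 0) {
                        long total = 0;
                        for (long partial : inboxes.get(0)) total += partial;
                        result.set(total);
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            threads[p].start();
        }

        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result.get();
    }
}
```

Note that there is no notion of vertex or edge anywhere: the programmer decides who talks to whom and when to synchronize, which is exactly why it feels like a restricted MPI.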

The good:
General BSP.
Complete system.
The logo is cute.

The bad:
Early stage.
Requires additional infrastructure.
Graph processing library not yet released.

 

GoldenOrb

GoldenOrb is a Hadoop-based Pregel clone open sourced by Ravel. It should be in the process of getting into the Apache Incubator.
It is a close clone of Pregel and very similar to Giraph: a system to run distributed iterative algorithms on graphs.
The components of the system, such as vertex values (state), edges and messages, are built on top of the Hadoop Writables system.
One thing I don’t understand is why they didn’t leverage Java generics. To get the value of a message you need to do ugly explicit casts:

@Override
public void compute(Collection<IntMessage> messages) {
  int _maxValue = 0;
  for (IntMessage m : messages) {
    // The message value is untyped, so an explicit cast to IntWritable is needed.
    int msgValue = ((IntWritable) m.getMessageValue()).get();
    _maxValue = Math.max(_maxValue, msgValue);
  }
}
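For comparison, here is a rough sketch of what a generics-based design could look like. None of this is GoldenOrb code: `Message<T>`, `TypedVertex<T>` and `MaxVertex` are hypothetical names, and I use a plain `Integer` payload to keep the example free of Hadoop dependencies (in a real system the type parameter would presumably be bounded by `Writable`).

```java
import java.util.Collection;

// Hypothetical generic message: the payload type is part of the signature.
class Message<T> {
    private final T value;

    Message(T value) {
        this.value = value;
    }

    T getMessageValue() {
        return value;
    }
}

// Hypothetical vertex base class parameterized by its message payload type.
abstract class TypedVertex<T> {
    abstract void compute(Collection<Message<T>> messages);
}

// Same max computation as above, but without any cast.
class MaxVertex extends TypedVertex<Integer> {
    int maxValue = 0;

    @Override
    void compute(Collection<Message<Integer>> messages) {
        for (Message<Integer> m : messages) {
            // The compiler already knows the payload is an Integer.
            maxValue = Math.max(maxValue, m.getMessageValue());
        }
    }
}
```

The cast disappears because the payload type is checked at compile time instead of at run time.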

As with Giraph, Hadoop details leak into the API. Unlike Giraph, however, GoldenOrb requires additional infrastructure to run: an OrbTracker needs to be running on each Hadoop node. It also uses HadoopRPC for communication. The administrator can define the number of partitions to assign to each OrbTracker, and the number of threads to launch per node can be configured on a per-job basis.

The good:
Commercial support.

The bad:
Early stage, not yet in Incubator.
API has rough edges.
Requires additional infrastructure.

 

JPregel

A final mention goes to JPregel. It is a Java clone of Pregel which is not Hadoop-based.
The programming interface is designed from the ground up and is very clean.
Right now it is at a very early stage of development, e.g. messages can only be doubles and the system itself is not yet finalized.
Interestingly, no explicit halting primitive is present. From what I understood, it automatically detects termination from the absence of new messages.
It is a very nice piece of work, especially considering that it has been done by only a couple of first-year master's students.

 

PS: I waited too long before publishing this post and was beaten to it by one of the Giraph committers, although he also reviews some other pieces of software. Have a look at his post.


My latest work, “Social Content Matching in MapReduce”, has been accepted at VLDB (Very Large Data Bases).

YES!!!

(as you can tell, I am extremely happy about this 🙂 )

In the paper we tackle the problem of content distribution on a social media website like Flickr: we model it as a b-matching problem on a graph and solve it with a smart iterative algorithm in MapReduce. We also show how to design a scalable greedy algorithm for the same problem in MapReduce.

Here is the abstract:

Matching problems are ubiquitous. They occur in economic markets, labor markets, internet advertising, and elsewhere. In this paper we focus on an application of matching for social media. Our goal is to distribute content from information suppliers to information consumers.
We seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity, e.g., ensuring that no consumer is overwhelmed with data and that all suppliers have chances to deliver their content.

We propose two matching algorithms, GreedyMR and StackMR, geared for the MapReduce paradigm. Both algorithms have provable approximation guarantees, and in practice they produce high-quality solutions. While both algorithms scale extremely well, we can show that StackMR requires only a poly-logarithmic number of MapReduce steps, making it an attractive option for applications with very large datasets. We experimentally show the trade-offs between quality and efficiency of our solutions on two large datasets coming from real-world social-media web sites.
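For readers unfamiliar with b-matching, here is a small sequential sketch of the classic greedy strategy: scan the edges by decreasing weight and keep an edge only if both endpoints still have capacity left. To be clear, this is not GreedyMR or StackMR from the paper, which are designed for MapReduce; the class `GreedyBMatching` and its fields are names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sequential greedy for weighted b-matching (not the paper's GreedyMR).
class GreedyBMatching {

    // An edge between nodes u and v, e.g. an item and a consumer, with a relevance weight.
    static final class Edge {
        final int u;
        final int v;
        final double weight;

        Edge(int u, int v, double weight) {
            this.u = u;
            this.v = v;
            this.weight = weight;
        }
    }

    /**
     * @param edges    candidate edges of the graph
     * @param capacity capacity[i] is the maximum number of matched edges at node i
     * @return the edges selected by the greedy scan
     */
    static List<Edge> match(List<Edge> edges, int[] capacity) {
        int[] residual = capacity.clone();
        List<Edge> sorted = new ArrayList<>(edges);
        // Heaviest edges first.
        sorted.sort(Comparator.comparingDouble((Edge e) -> e.weight).reversed());

        List<Edge> matching = new ArrayList<>();
        for (Edge e : sorted) {
            // Keep the edge only if neither endpoint is saturated.
            if (residual[e.u] > 0 && residual[e.v] > 0) {
                residual[e.u]--;
                residual[e.v]--;
                matching.add(e);
            }
        }
        return matching;
    }

    static double totalWeight(List<Edge> matching) {
        double total = 0.0;
        for (Edge e : matching) total += e.weight;
        return total;
    }
}
```

The capacities are what keep a consumer from being overwhelmed and give every supplier a chance to be matched.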

On a final note, thanks to my co-authors for their hard work and guidance:
Aris Gionis from Yahoo! Research and Mauro Sozio from Max Planck Institut.


Code Quality vs Time to deadline
