Posts Tagged ‘cloud’

“Big Data”

Read Full Post »

Directly from my PhD thesis.

Read Full Post »

I have recently started to look at the BSP (Bulk Synchronous Parallel) model of computation for large scale graph analysis.
The model itself is quite old, but it has seen a resurgence lately mainly because of the Pregel system by Google. Here the paper.
The idea of Pregel is to use synchronous barriers to cleanly separate messages in rounds.
In each step, a vertex can read the messages sent to it in the previous step, perform some computation, and generate messages for the next step.
When all the nodes vote to halt the  algorithm and no more messages arrive, the computation stops.

As with MapReduce and Hadoop, a number of clones of Pregel have popped up.
I will briefly review the most interesting ones in this post.


Apache Giraph

Giraph is a Hadoop-based graph processing engine open sourced by Yahoo! and now in the Apache Incubator.
It is at a very early stage, but the community looks very active and diverse. Giraph developers play in the same backyard of Hadoop, Pig and Hive developers, and this makes me feel confident about its future.
Giraph uses Hadoop for scheduling its workers and get the data loaded in memory, but then implements its own communication protocol using HadoopRPC. It also uses Zookeeper for fault tolerance by periodic checkpointing.
Interestingly, Giraph does not require installing anything on the cluster, so you can try it out on your existing Hadoop infrastructure. A Giraph job runs like a normal map-only job. It can actually be thought as a graph-specific Hadoop library.
However, because of this fact, the API currently lack encapsulation. Users need to write a custom  Reader + Writer + InputFormat + OutputFormat for each Vertex program they create. A library to read common graph formats would be a nice addition.

The good:
Very active community with people from the Hadoop ecosystem.
Runs directly on Hadoop.

The bad:
Early stage.
Hadoop leaks in the API.


Apache Hama

Hama is a Hadoop-like solution and is the oldest member in the group. Currently it is in the Apache Incubator and it was initially developed by an independent Korean programmer.
Hama focuses on general BSP computations, so it is not only for graphs. For example there are algorithms for matrix inversion and linear algebra (I know, one could argue that a graph and a matrix are actually the same data structure).
Unfortunately, the project seems to be moving slowly even though lately there has been a spike of activity, probably caused mainly by the GSoC (it works!).
Currently it is still at a very early stage:  the current version doesn’t provide a proper I/O API and data partitioner. From my understanding there is no fault-tolerance either.
From the technical point of view, Hama uses Zookeeper for  coordination and HDFS for data persistence. It is designed to be “The Hadoop of BSP”.
Given its focus on general BSP, the kind of primitives that it provides are at a low level of abstraction, very much like a restricted version of MPI.

The good:
General BSP.
Complete system.
The logo is cute.

The bad:
Early stage.
Requires additional infrastructure.
Graph processing library not yet released.



GoldenOrb is a Hadoop based Pregel clone open sourced by Ravel. It should be in the process of getting into the Apache Incubator.
It is a close clone of Pregel and very similar to Giraph: a system to run distributed iterative algorithms on graphs.
The components of the system like vertex values (state), edges and messages are built on top of the Hadoop Writables system.
One thing I don’t understand is why they didn’t leverage Java generics. To get the value of a message you need to do ugly explicit casts:

@Override public void compute(Collection<IntMessage> messages) {
  int _maxValue = 0;
  for(IntMessage m: messages) {
    int msgValue = ((IntWritable)m.getMessageValue()).get();
    _maxValue = Math.max(_maxValue, msgValue);

As with Giraph, Hadoop details leak into the API. However, GoldenOrb requires additional infrastructure to run. An OrbTracker needs to be running on each Hadoop node. It Also uses HadoopRPC for communication. The administrator can define the number of partitions to assign to each OrbTracker and the threads per node to launch can be configured on a per-job basis.

The good:
Commercial support.

The bad:
Early stage, not yet in Incubator.
API has rough edges.
Requires additional infrastructure.



A final mention goes to JPregel. It is a Java clone of Pregel which is not Hadoop based.
The programming interface is thought from the ground up and is very clean.
Right now it is at a very very early stage of development, e.g. messages can only be doubles and the system itself is not yet finalized.
Interestingly, no explicit halting primitive is present. From what I understood it automatically deduces the termination by the absence of new messages.
It is a very nice piece of work, especially considering the fact that it has been done by only a couple of first year master students.


PS: I waited too much before publishing this post and I was beaten on time by one of the Giraph committers, even though he reviews also some other piece of software. Have a look at his post.

Read Full Post »

The Cloud

Read Full Post »

Data Intensive Scalable Computing
DISC=Data Intensive Scalable Computing

ML=Machine Learning
DM=Data Mining
IR=Information Retrieval
DS=Distributed Systems
DB=Data Bases

Read Full Post »

Cloud Consulting

Dilbert - Cloud Consulting

Read Full Post »


The name SMAQ (Storage MapReduce And Query) Cloud Stack was proposed by Edd Dumbill in this article on O’Reilly Radar (the article is sure worth reading).

While I find that a nice name is a good way to crystallize a concept, I am not sure it captures the whole picture about Big Data.
For example, compare the image on the left from the article, with the one on the right that comes directly from my Ph.D. research proposal.

The SMAQ stack for Big Data

View of the SMAQ stack from O'Reilly Radar

Cloud Computing Stack

View of the Data Intensive Cloud Computing stack from my Ph.D. research proposal

I will now dub my stack proposal CDC (Computation Data Coordination) stack (suggestions for better names super welcome!).

The query layer from SMAQ would map to my High Level Languages layer.
This layer includes systems like Pig, Hive and Cascading.

The MapReduce layer from SMAQ would map to my Computation layer.
The only other system in this layer is Dryad, for the moment, but there could be many others.

The Storage layer from SMAQ would map to my Distributed Data layer.
The SMAQ classification does not differentiate between systems like HDFS and HBase.
According to me, the high level interface is what distinguishes HDFS from HBase (or for example Voldemort from Cassandra).
That is why I would put HDFS and Voldemort in the Distributed Data layer, and HBase and Cassandra in the Data Abstraction layer (even though Cassandra does not actually rely on another system to store its data).

Finally, the SMAQ stack is totally lacking the Coordination layer.
This is comprehensible, as the audience of radar is more “analyst” oriented. From a operational perspective, systems like Chubby and Zookeeper are useful to build the frameworks in the stack.

What is missing from both stacks, and will be the main trend in 2011, is the Real-Time layer (even though I would have no idea where to put it 🙂 )

Read Full Post »

define: Cloud Computing

I have been looking for a good definition of Cloud Computing for a while. Cloud Computing is of course a buzzword, so no wonder its meaning is fuzzy. The official definition of NIST reminds me of some standards: put everything together to make everyone happy.

Even Wikipedia gets a bit fuzzy about Cloud Computing, basically because it mixes up technical definitions, marketing, business models and a lot of other things. Also the critics do not help to define the thing, as they say things like “Cloud is everything we do” or “Technologies now dubbed as Cloud existed long before the name”.

Given that a definition is always an approximation (ontologically, because it is just a categorization for our mind), the best technical definition (what I am interested in) I found was given in this blog post. I summarize it here “Distributed location-independent scale-free cooperative agents”. You can check the post to see what each piece of  the definition means.

While this was the best definition I found, it is not exactly what I have in mind when I think about Cloud Computing. Also, this does not encompass a lot of technologies that I can think of when I say Cloud (one for all, MapReduce). So I will take a stab at defining what Cloud Computing is:

“Distributed, transparent, scale-free computing system”

Yes, it doesn’t change much, does it? But the core point here is that I do not care what kind of system we are talking about, but I just care that the system is distributed and scale-free. Furthermore location independence is not the only interesting property: access, failure and replication transparency are important as well. You should aim to the best transparency you can get without impacting performance (too much transparency hinders optimization).

The rationale is that a Cloud Computing is such that you can solve a problem faster/better just throwing more hardware at it. So scalability is the key feature, and in particular being scale-free (the scale of the system is not a design parameter).

Read Full Post »

Here is a little trick I had to learn while developing Apache Pig.

Pig uses JUnit as test framework. JUnit tests are very useful for unit testing, but end-to-end testing is not as easy. Even more in the case of Pig, that uses Hadoop (a distributed MapReduce engine) to execute its scripts. The MiniCluster class addresses this issue: it simulates a full execution environment on the local machine, with HDFS and everything you need. More information here.

MiniCluster is very easy to use, assuming you are running your tests via ant. But if you want to debug and trace your test (using Eclipse, for instance) there are a couple of catches. Basically, you need to reproduce the environment the ant script builds inside Eclipse.

The first thing to set is the hadoop.log.dir property, that tells where to put logs. Its default value is build/test/logs. To set it, go in the Run Configurations screen, Arguments tab, and add this line to the VM arguments:


If you forget to set this, you will get a nice NullPonterException:

ERROR mapred.MiniMRCluster: Job tracker crashed
at java.io.File.<init>(File.java:222)
at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:151)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1617)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:106)
at java.lang.Thread.run(Thread.java:619)

The other thing to take care of is where to find MiniCluster‘s configuration file. For Pig, you should first create it by running the ant test target once from the command line. This will create a standard minimum configuration file for your use in ${HOME}/pigtest/conf. To set it, you should add this directory to the classpath in the Classpath tab, under User Entries using the Advanced… button.

If you forget to set this, you get a nice ExecException:

org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in 
 classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use 
 local mode, please put -x local option in command line
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
 at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
 at org.apache.pig.PigServer.<init>(PigServer.java:216)
 at org.apache.pig.PigServer.<init>(PigServer.java:205)
 at org.apache.pig.PigServer.<init>(PigServer.java:201)
 at org.apache.pig.test.TestSecondarySort.setUp(TestSecondarySort.java:73)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
 at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
 at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
 at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Even after this, you will still get some exceptions (regarding threads, manifest files, jars), but they are not a problem and debugging will work.

Hope this helps!

Read Full Post »

Freedom in the Cloud

A.K.A. why I do not have a Facebook profile (actually, why Facebook does not have a profile on me)

The cloud means that we can’t even point in the direction of the server anymore and because we can’t even point in the direction of the server anymore we don’t have extra technical or non-technical means of reliable control over this disaster in slow motion. You can make a rule about logs or data flow or preservation or control or access or disclosure but your laws are human laws and they occupy particular territory and the server is in the cloud and that means the server is always one step ahead of any rule you make or two or three or six or poof! I just realized I’m subject to regulation, I think I’ll move to Oceana now.

Which means that in effect, we lost the ability to use either legal regulation or anything about the physical architecture of the network to interfere with the process of falling away from innocence that was now inevitable in the stage I’m talking about, what we might call late Google stage 1.

It is here, of course, that Mr. Zuckerberg enters.

The human race has susceptibility to harm but Mr. Zuckerberg has attained an unenviable record: he has done more harm to the human race than anybody else his age.

Because he harnessed Friday night. That is, everybody needs to get laid and he turned it into a structure for degenerating the integrity of human personality and he has to a remarkable extent succeeded with a very poor deal. Namely, “I will give you free web hosting and some PHP doodads and you get spying for free all the time”. And it works.

That’s the sad part, it works.

How could that have happened?

There was no architectural reason, really. There was no architectural reason really. Facebook is the Web with “I keep all the logs, how do you feel about that?” It’s a terrarium for what it feels like to live in a panopticon built out of web parts.

And it shouldn’t be allowed. It comes to that. It shouldn’t be allowed. That’s a very poor way to deliver those services. They are grossly overpriced at “spying all the time”. They are not technically innovative. They depend upon an architecture subject to misuse and the business model that supports them is misuse. There isn’t any other business model for them. This is bad.

I’m not suggesting it should be illegal. It should be obsolete. We’re technologists, we should fix it.

I’m glad I’m with you so far. When I come to how we should fix it later I hope you will still be with me because then we could get it done.

But let’s say, for now, that that’s a really good example of where we went wrong and what happened to us because. It’s trickier with gmail because of that magical untouched by human hands-iness. When I say to my students, “why do you let people read your email”, they say “but nobody is reading my email, no human being ever touched it. That would freak me out, I’d be creeped out if guys at Google were reading my email. But that’s not happening so I don’t have a problem.”

Now, this they cannot say about Facebook. Indeed, they know way too much about Facebook if they let themselves really know it. You have read the stuff and you know. Facebook workers know who’s about to have a love affair before the people do because they can see X obsessively checking the Facebook page of Y.

Like a lot of unfreedom, the real underlying social process that forces this unfreedom along is nothing more than perceived convenience.

Read the full story here

Read Full Post »

Older Posts »