Posts Tagged ‘cloud’

define: Cloud Computing

I have been looking for a good definition of Cloud Computing for a while. Cloud Computing is of course a buzzword, so no wonder its meaning is fuzzy. The official definition of NIST reminds me of some standards: put everything together to make everyone happy.

Even Wikipedia gets a bit fuzzy about Cloud Computing, basically because it mixes up technical definitions, marketing, business models and a lot of other things. Also the critics do not help to define the thing, as they say things like “Cloud is everything we do” or “Technologies now dubbed as Cloud existed long before the name”.

Given that a definition is always an approximation (ontologically, because it is just a categorization for our mind), the best technical definition (what I am interested in) I found was given in this blog post. I summarize it here “Distributed location-independent scale-free cooperative agents”. You can check the post to see what each piece of  the definition means.

While this was the best definition I found, it is not exactly what I have in mind when I think about Cloud Computing. Also, this does not encompass a lot of technologies that I can think of when I say Cloud (one for all, MapReduce). So I will take a stab at defining what Cloud Computing is:

“Distributed, transparent, scale-free computing system”

Yes, it doesn’t change much, does it? But the core point here is that I do not care what kind of system we are talking about, but I just care that the system is distributed and scale-free. Furthermore location independence is not the only interesting property: access, failure and replication transparency are important as well. You should aim to the best transparency you can get without impacting performance (too much transparency hinders optimization).

The rationale is that a Cloud Computing is such that you can solve a problem faster/better just throwing more hardware at it. So scalability is the key feature, and in particular being scale-free (the scale of the system is not a design parameter).


Read Full Post »

Here is a little trick I had to learn while developing Apache Pig.

Pig uses JUnit as test framework. JUnit tests are very useful for unit testing, but end-to-end testing is not as easy. Even more in the case of Pig, that uses Hadoop (a distributed MapReduce engine) to execute its scripts. The MiniCluster class addresses this issue: it simulates a full execution environment on the local machine, with HDFS and everything you need. More information here.

MiniCluster is very easy to use, assuming you are running your tests via ant. But if you want to debug and trace your test (using Eclipse, for instance) there are a couple of catches. Basically, you need to reproduce the environment the ant script builds inside Eclipse.

The first thing to set is the hadoop.log.dir property, that tells where to put logs. Its default value is build/test/logs. To set it, go in the Run Configurations screen, Arguments tab, and add this line to the VM arguments:


If you forget to set this, you will get a nice NullPonterException:

ERROR mapred.MiniMRCluster: Job tracker crashed
at java.io.File.<init>(File.java:222)
at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:151)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1617)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:106)
at java.lang.Thread.run(Thread.java:619)

The other thing to take care of is where to find MiniCluster‘s configuration file. For Pig, you should first create it by running the ant test target once from the command line. This will create a standard minimum configuration file for your use in ${HOME}/pigtest/conf. To set it, you should add this directory to the classpath in the Classpath tab, under User Entries using the Advanced… button.

If you forget to set this, you get a nice ExecException:

org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in 
 classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use 
 local mode, please put -x local option in command line
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
 at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
 at org.apache.pig.PigServer.<init>(PigServer.java:216)
 at org.apache.pig.PigServer.<init>(PigServer.java:205)
 at org.apache.pig.PigServer.<init>(PigServer.java:201)
 at org.apache.pig.test.TestSecondarySort.setUp(TestSecondarySort.java:73)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
 at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
 at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
 at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Even after this, you will still get some exceptions (regarding threads, manifest files, jars), but they are not a problem and debugging will work.

Hope this helps!

Read Full Post »

Freedom in the Cloud

A.K.A. why I do not have a Facebook profile (actually, why Facebook does not have a profile on me)

The cloud means that we can’t even point in the direction of the server anymore and because we can’t even point in the direction of the server anymore we don’t have extra technical or non-technical means of reliable control over this disaster in slow motion. You can make a rule about logs or data flow or preservation or control or access or disclosure but your laws are human laws and they occupy particular territory and the server is in the cloud and that means the server is always one step ahead of any rule you make or two or three or six or poof! I just realized I’m subject to regulation, I think I’ll move to Oceana now.

Which means that in effect, we lost the ability to use either legal regulation or anything about the physical architecture of the network to interfere with the process of falling away from innocence that was now inevitable in the stage I’m talking about, what we might call late Google stage 1.

It is here, of course, that Mr. Zuckerberg enters.

The human race has susceptibility to harm but Mr. Zuckerberg has attained an unenviable record: he has done more harm to the human race than anybody else his age.

Because he harnessed Friday night. That is, everybody needs to get laid and he turned it into a structure for degenerating the integrity of human personality and he has to a remarkable extent succeeded with a very poor deal. Namely, “I will give you free web hosting and some PHP doodads and you get spying for free all the time”. And it works.

That’s the sad part, it works.

How could that have happened?

There was no architectural reason, really. There was no architectural reason really. Facebook is the Web with “I keep all the logs, how do you feel about that?” It’s a terrarium for what it feels like to live in a panopticon built out of web parts.

And it shouldn’t be allowed. It comes to that. It shouldn’t be allowed. That’s a very poor way to deliver those services. They are grossly overpriced at “spying all the time”. They are not technically innovative. They depend upon an architecture subject to misuse and the business model that supports them is misuse. There isn’t any other business model for them. This is bad.

I’m not suggesting it should be illegal. It should be obsolete. We’re technologists, we should fix it.

I’m glad I’m with you so far. When I come to how we should fix it later I hope you will still be with me because then we could get it done.

But let’s say, for now, that that’s a really good example of where we went wrong and what happened to us because. It’s trickier with gmail because of that magical untouched by human hands-iness. When I say to my students, “why do you let people read your email”, they say “but nobody is reading my email, no human being ever touched it. That would freak me out, I’d be creeped out if guys at Google were reading my email. But that’s not happening so I don’t have a problem.”

Now, this they cannot say about Facebook. Indeed, they know way too much about Facebook if they let themselves really know it. You have read the stuff and you know. Facebook workers know who’s about to have a love affair before the people do because they can see X obsessively checking the Facebook page of Y.

Like a lot of unfreedom, the real underlying social process that forces this unfreedom along is nothing more than perceived convenience.

Read the full story here

Read Full Post »

Today I presented my PhD research topic at ISTI CNR, institute of computer science and technology from the Italian National Research Council.

I have been working with the HPC lab since late November 2009, when I chose my thesis supervisor, Claudio Lucchese.

The topic of the seminar was “How to survive the Data Deluge: Petabyte scale Cloud Computing”.
In the seminar I gave an introduction to the problem of large scale data management and to its motivations. I described the new technologies that are used today to perform analysis on these large datasets (mainly focusing on the MapReduce paradigm) and the difference with the other competing technology, Parallel DBMS.

There is a very harsh ongoing debate on which technology is the best one, as there are advantages on both sides. One of the main detractors of the MapReduce paradigm is Michael Stonebraker, Professor of Computer Science at MIT and strong DataBase supporter, given also that he co-founded one of the companies that produces Vertica, a PDBMS that targets more or less the same analytical workloads of MapReduce, even if in a different fashion.

He published a post, together with Prof. DeWitt, in which he basically blamed MapReduce for not being a DataBase. The post received very harsh critiques (read also the comments to the original post as they are very interesting). Stonebraker and DeWitt doubled with another post in which they replied to the answers they received, providing examples of DataBase superiority. They then decided to push this forward and published a paper comparing the two systems on various workloads, showing how Vertica is far superior to Hadoop in almost all tasks.

The last page in this story is in this month’s Communications of the ACM. I said page but they are actually pages, because the editor published two very interesting articles side by side. The first one is the latest from the Stonebraker&DeWitt couple, that basically says MapReduce and PDBMS serve different purposes and have to coexist. The latest one is a reply by the original authors of MapReduce (Jeffrey Dean and Sanjay Ghemawat) to all the critiques to their creature. They show how most of the flaws identified by S&DW are actually implementation problems rather than limits of the paradigm. Dean and Ghemawat also let slip through that the comparison performed in their article is biased towards database oriented tasks. In their words “The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems”

I will abstain from commenting on this issue for now, even though I deem it as very interesting for my future research. I just think I do not have matured my opinion enough to express it.

In the meanwhile, here is the slide deck I used for my presentation.

Petabyte Scale Cloud Computing

Read Full Post »


Read Full Post »

« Newer Posts