
Archive for the ‘PhD’ Category

ICDM 2010 in OZ

I am back from ICDM 2010 (IEEE International Conference on Data Mining).
This year the conference was held in Sydney, Australia, or OZ as they like to say.
The Aussie experience was very interesting.

I presented my work on Self-Similarity Join using MapReduce. The time allotted for the presentation was hardly enough, but overall I was satisfied with my performance. The conference was OK, though I would not call it excellent: I did not get many ideas from the presenters. Luckily I met some nice people there and made some good contacts.

Australia is a wonderful country: cheerful people, nice weather, a relaxed atmosphere, great nature, awesome beaches, and food from all around the world.
Only downside: it is outright expensive.
If you have a chance to go there, be sure to make a trip to the Blue Mountains; they really deserve it. The name comes from the light blue mist that hovers over the eucalyptus forests there: the oil of the trees evaporates and causes the mist.

Being hugged by a kangaroo is an experience I will never forget. Their kicks are also remarkable; I think my legs will remember them too.
I got my vengeance by eating them! ‘Roo loins are delicious: the meat is very lean, full of protein, and a bit reminiscent of horse meat. Kangaroo meat is quite common there; you can find it in many restaurants and even in the supermarket. On the exotic side, I also tried crocodile skewers. It is a bit of a cliché, but it tastes very much like chicken. The texture is different though: it is much tougher. Recommended place to try them: Cafe Ish, 82 Campbell St. Be sure to try their Wattle Mocaccino too 🙂


define: Cloud Computing

I have been looking for a good definition of Cloud Computing for a while. Cloud Computing is of course a buzzword, so no wonder its meaning is fuzzy. The official NIST definition reminds me of some design-by-committee standards: put everything in to make everyone happy.

Even Wikipedia gets a bit fuzzy about Cloud Computing, basically because it mixes up technical definitions, marketing, business models and a lot of other things. The critics do not help to define it either, as they say things like “Cloud is everything we do” or “the technologies now dubbed Cloud existed long before the name”.

Given that a definition is always an approximation (ontologically, because it is just a categorization for our mind), the best technical definition (which is what I am interested in) I found was given in this blog post. I summarize it here as “Distributed location-independent scale-free cooperative agents”. You can check the post to see what each piece of the definition means.

While this was the best definition I found, it is not exactly what I have in mind when I think about Cloud Computing. Also, this does not encompass a lot of technologies that I can think of when I say Cloud (one for all, MapReduce). So I will take a stab at defining what Cloud Computing is:

“Distributed, transparent, scale-free computing system”

Yes, it doesn’t change much, does it? But the core point here is that I do not care what kind of system we are talking about; I just care that the system is distributed and scale-free. Furthermore, location independence is not the only interesting property: access, failure and replication transparency are important as well. You should aim for the best transparency you can get without impacting performance (too much transparency hinders optimization).

The rationale is that a Cloud Computing system is one where you can solve a problem faster or better just by throwing more hardware at it. So scalability is the key feature, and in particular being scale-free (the scale of the system is not a design parameter).


¡Y!

¡Hola!

From 20 Sep 2010 to 31 Mar 2011 I will be visiting Yahoo! Research Labs in Barcelona, Spain as a “Research Intern”.

I am really glad for this opportunity to work in a thriving environment and live in this wonderful city. I already love the place.

I will still be working on my thesis during these 6 months, but I will probably (and hopefully) open up new research paths. I will also try to continue my work on Apache Pig, as it is widely used inside Yahoo!

Hasta luego!


I was watching Jeff Dean’s keynote presentation for the ACM Symposium on Cloud Computing 2010 (SOCC), which was held yesterday, and I found this very interesting bit of information. It is so useful that every Computer Scientist and Engineer should learn it by heart!

Operation                              Time (nsec)
L1 cache reference                             0.5
Branch mispredict                                5
L2 cache reference                               7
Mutex lock/unlock                               25
Main memory reference                          100
Compress 1K bytes with Zippy                 3,000
Send 2K bytes over 1 Gbps network           20,000
Read 1 MB sequentially from memory         250,000
Round trip within same datacenter          500,000
Disk seek                               10,000,000
Read 1 MB sequentially from disk        20,000,000
Send packet CA -> Netherlands -> CA    150,000,000

These numbers give you some insight into why random reads from a disk are a really bad idea.
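To see just how bad, the table lends itself to exactly the kind of back-of-the-envelope calculation it is meant for. The sketch below (plain Python, numbers copied from the table; the 4 KB page size is my own assumption) estimates how long it takes to read 1 GB from disk sequentially versus through random 4 KB reads, each paying a seek:

```python
# Latency figures from the table above, in nanoseconds.
DISK_SEEK_NS = 10_000_000          # one disk seek
DISK_READ_1MB_NS = 20_000_000      # read 1 MB sequentially from disk

GB = 1024  # size to read, expressed in MB

# Sequential: one seek, then stream the whole gigabyte.
sequential_ns = DISK_SEEK_NS + GB * DISK_READ_1MB_NS

# Random: read the same gigabyte as 4 KB pages, paying a seek per page
# plus the (tiny) transfer time for 4 KB at the sequential rate.
pages = GB * 1024 // 4
random_ns = pages * (DISK_SEEK_NS + DISK_READ_1MB_NS * 4 / 1024)

print(f"sequential: {sequential_ns / 1e9:7.1f} s")
print(f"random:     {random_ns / 1e9:7.1f} s")
print(f"slowdown:   {random_ns / sequential_ns:.0f}x")
```

Even with these idealized numbers, paying a seek per page makes the random scan two orders of magnitude slower than the sequential one.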

This piece of information complements the very nice image from Adam Jacobs and his excellent “The Pathologies of Big Data” article.

Comparison of random and sequential speeds for Memory, SSD and Disk

Random is BAD (and SSD is NOT going to solve the problem)

What should we learn from all this stuff?

  • Do your back-of-the-envelope calculations
  • Do avoid random operations
  • Do benchmark your system


PhD thesis proposal

I filed my thesis proposal for my PhD on February 15th 2010.

The real part of the proposal (the last chapter) is deliberately short and generic. I believe in an agile approach to planning: you can’t know everything you are going to do upfront, so planning every tiny detail in advance is a useless waste of time. While you do research you get a deeper understanding of the subject (as in programming), and so you get new ideas or trash old ones.

A nice thing to do would be to transform the state-of-the-art chapter into a comparative survey. To do this I would need to experimentally evaluate most of the systems I review. It is certainly not a quick task, and I would also have to find a good benchmark.

Here is the PDF of my thesis proposal.


HPC-Europa2

HPC-Europa2 is calling for applications from researchers working in Europe to visit any of the 7 centres in its Transnational Access programme.

The programme offers visiting European researchers:
– access to some of the most powerful High Performance Computing (HPC) facilities in Europe;
– travel costs, subsistence expenses and accommodation (may be in a shared flat).

HPC-Europa2 has been funded for 4 years from January 2009, and selection meetings will be held approximately 4 times per year.

HPC-Europa2


Scale-free systems

I have been looking around for the source of the definition of scale-free system I came up with; I had totally forgotten where I took it from. I think I kind of ripped it from this post and condensed it into a concise form.

During the presentation I gave a couple of days ago I was asked whether this “scale-free” had anything to do with scale-free networks, which are networks whose degree distribution follows a power law (a few vertices with high degree [hubs] and many vertices with low degree). These networks are used as a model for the Internet, social networks and a lot of other things. They have some interesting properties that are preserved no matter the scale (number of nodes) of the network, and thus they exhibit self-similarity, fractal structure and the like.

My answer is: no.

Or at least, when I say scale-free I mean that the scale is not a design parameter of the system; alternatively, that the system’s design is free of scale, so you can run the system on 10 or 10^10 nodes without modifications to the architecture. There might be some way to design a scale-free system using a power-law-distribution-of-something structure, but that is not the main point.
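For completeness, here is a toy illustration of the other kind of scale-free, the networks, not my notion of system: preferential attachment, one classic process that yields a power-law degree distribution. The code is plain Python and entirely my own sketch:

```python
import random
from collections import Counter

def preferential_attachment(n, seed=42):
    """Grow a network where each new node attaches to an existing node
    with probability proportional to its degree (Barabasi-Albert style)."""
    random.seed(seed)
    # `ends` holds one entry per edge endpoint, so picking a uniformly
    # random element of it is exactly degree-proportional sampling.
    ends = [0, 1]  # start from a single edge between nodes 0 and 1
    for new_node in range(2, n):
        target = random.choice(ends)
        ends.extend([new_node, target])
    return Counter(ends)  # node -> degree

degrees = preferential_attachment(10_000)
hist = Counter(degrees.values())  # degree -> number of nodes with it
for d in sorted(hist)[:5]:
    print(d, hist[d])
print("max degree:", max(degrees.values()))
```

Most nodes end up with degree 1 while a few hubs accumulate a large share of the edges, and the shape of that distribution stays the same however big you grow the network, which is what “properties preserved at any scale” means for these graphs.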


Today I presented my PhD research topic at ISTI-CNR, the Institute of Information Science and Technologies of the Italian National Research Council.

I have been working with the HPC lab since late November 2009, when I chose my thesis supervisor, Claudio Lucchese.

The topic of the seminar was “How to survive the Data Deluge: Petabyte scale Cloud Computing”.
In the seminar I gave an introduction to the problem of large scale data management and its motivations. I described the new technologies that are used today to perform analysis on these large datasets (mainly focusing on the MapReduce paradigm) and the differences from the competing technology, Parallel DBMSs.
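To make the paradigm concrete, here is the canonical word-count example simulated in plain Python. In a real system the framework runs many map and reduce tasks in parallel across machines and performs the shuffle itself; the programming model the user sees is just these two functions:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input record independently emits (key, value) pairs.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

# Shuffle: group all emitted values by key (the framework's job in a
# real MapReduce system, simulated here with a dictionary).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the list of values for each key.
def reduce_fn(key, values):
    return key, sum(values)

lines = ["the cloud is big", "the data is bigger"]
pairs = chain.from_iterable(map_fn(line) for line in lines)
result = dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'the': 2, 'cloud': 1, 'is': 2, 'big': 1, 'data': 1, 'bigger': 1}
```

Because every map call and every reduce call is independent, the framework is free to scatter them over thousands of nodes, which is where the scalability comes from.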

There is a very harsh ongoing debate on which technology is the best, as there are advantages on both sides. One of the main detractors of the MapReduce paradigm is Michael Stonebraker, Professor of Computer Science at MIT and a strong database supporter, also given that he co-founded Vertica, whose PDBMS targets more or less the same analytical workloads as MapReduce, albeit in a different fashion.

He published a post, together with Prof. DeWitt, in which he basically blamed MapReduce for not being a database. The post received very harsh critiques (read also the comments to the original post, as they are very interesting). Stonebraker and DeWitt doubled down with another post in which they replied to the responses they had received, providing examples of database superiority. They then decided to push this further and published a paper comparing the two systems on various workloads, showing how Vertica is far superior to Hadoop in almost all tasks.

The last page in this story is in this month’s Communications of the ACM. I said page, but it is actually pages, because the editor published two very interesting articles side by side. The first one is the latest from the Stonebraker & DeWitt duo, which basically says that MapReduce and PDBMSs serve different purposes and have to coexist. The second is a reply by the original authors of MapReduce (Jeffrey Dean and Sanjay Ghemawat) to all the critiques of their creation. They show how most of the flaws identified by S&DW are actually implementation problems rather than limits of the paradigm. Dean and Ghemawat also point out that the comparison performed in the Stonebraker et al. paper is biased towards database-oriented tasks. In their words: “The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems.”

I will abstain from commenting on this issue for now, even though I deem it very interesting for my future research. I just do not think my opinion has matured enough to express it.

In the meanwhile, here is the slide deck I used for my presentation.

Petabyte Scale Cloud Computing

