Today I presented my PhD research topic at ISTI-CNR, the Institute of Information Science and Technologies of the Italian National Research Council.
The topic of the seminar was “How to survive the Data Deluge: Petabyte scale Cloud Computing”.
In the seminar I introduced the problem of large-scale data management and the motivations behind it. I described the technologies used today to analyze these large datasets (focusing mainly on the MapReduce paradigm) and how they differ from the main competing technology, parallel DBMSs (PDBMS).
There is a heated ongoing debate over which technology is better, as each side has its advantages. One of the main detractors of the MapReduce paradigm is Michael Stonebraker, Professor of Computer Science at MIT and a strong database supporter. He also co-founded Vertica, the company behind the PDBMS of the same name, which targets more or less the same analytical workloads as MapReduce, albeit in a different fashion.
He published a post, together with Prof. DeWitt, in which he basically blamed MapReduce for not being a database. The post received very harsh criticism (read the comments on the original post too, as they are very interesting). Stonebraker and DeWitt doubled down with a second post in which they replied to the responses they had received, providing examples of database superiority. They then decided to push this further and published a paper comparing the two approaches on various workloads, showing Vertica to be far superior to Hadoop on almost all tasks.
The last page in this story is in this month’s Communications of the ACM. I said page, but there are actually several, because the editor published two very interesting articles side by side. The first one is the latest from the Stonebraker and DeWitt duo, and basically says that MapReduce and PDBMSs serve different purposes and have to coexist. The second one is a reply by the original authors of MapReduce (Jeffrey Dean and Sanjay Ghemawat) to all the criticism of their creation. They show how most of the flaws identified by Stonebraker and DeWitt are actually implementation problems rather than limits of the paradigm. Dean and Ghemawat also suggest that the comparison performed in Stonebraker and DeWitt's paper is biased towards database-oriented tasks. In their words: “The conclusions about performance in the comparison paper were based on flawed assumptions about MapReduce and overstated the benefit of parallel database systems.”
I will abstain from commenting on this issue for now, even though I find it very interesting for my future research. I just think my opinion has not matured enough for me to express it.
In the meantime, here is the slide deck I used for my presentation.