Posts Tagged ‘big data’

Big Data, the term

2012 was for sure the year “big data” went ballistic, and throughout 2013 and 2014 it became commonplace and commodified. It is so prevalent nowadays in both industry and academia that is has almost lost any meaning. But when did this trend start? Or, to be more concrete, when was the term “big data” coined?

It was for sure before 2010. One of the possible culprit is Randal E. Bryant, who also coined the term DISC (Data-Intensive Scalable Computing), which I prefer over “big data” to describe tools such as Hadoop — it’s just much more precise. However, this happened just around the corner in 2008.

You might think that “big data” is a recent things, at least more recent than 2000. Well, think again. This paper by Diebold shows a few references from the ’90s. In particular, footnote 9 says:

On the academic side, Tilly (1984) mentions Big Data, but his article is not about the Big Data phe- nomenon and demonstrates no awareness of it; rather, it is a discourse on whether statistical data analyses are of value to historians. On the non-academic side, the margin comments of a computer program posted to a newsgroup in 1987 mention a programming technique called “small code, big data.” Fascinating, but off-mark. Next, Eric Larson provides an early popular-press mention in a 1989 Washington Post article about firms that assemble and sell lists to junk-mailers. He notes in passing that “The keepers of Big Data say they do it for the consumer’s benefit.” Again fascinating, but again off-mark. (See Eric Larson, “They’re Making a List: Data Companies and the Pigeonholing of America,” Washington Post, July 27, 1989.) Finally, a 1996 PR Newswire, Inc. release mentions network technology “for CPU clustering and Big Data applications…” Still off-mark, neither reporting on the Big Data phenomenon nor demonstrating awareness of it, instead reporting exclusively on a particular technology, the so-called high-performance parallel interface.

The best guess at when the term was coined is 1998, by John Mashey (retired former Chief Scientist at SGI) who produced slide deck entitled “Big Data and the Next Wave of InfraStress”. However, the famous 3V’s of big data came around 2001 introduced by Laney at Gartner.

Read Full Post »

Ever itched to do machine learning and data mining on streams? On huge, big data streams?

We have a solution for you!

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is currently in Alpha stage, and is developed in Yahoo Labs in Barcelona. It is released under an Apache Software License v2.

Thanks to everybody who made this release possible!

read more here on Yahoo engineering

Read Full Post »

Nice blog post on approximate quantiles by the guys behind Druid at Metamarkets. The basic technique they use is the histogram proposed by Ben-Haim & Tom-Tov for their Streaming Parallel Decision Tree.

Read the blog post.

Read Full Post »

A lot of people seem to think performance is about doing the same thing, just doing it faster. That’s not what performance is all about. If you can do something really fast really well, people start using it differently.

Linus Torvalds (speaking about git)

Because more is not just more. More is different.

Read Full Post »

It’s something that’s difficult to explain, has many interpretations,
and the best way to learn it is to do it.

Shamelessly copied from The Apache Way

Read Full Post »

On Data Science:

The word Data tells you that I transform raw information into actionable information. The word Scientist emphasizes my commitment to making sure that the analyses my colleagues and I produce are verifiable and repeatable—as all good science should be.

Not sure I agree on the whole argument in the post, but the definition of data science is the best I have seen so far.

Melinda Thielbar

“Any field of study followed by the word “science”, so goes the old wheeze, is not really a science, including computer science, climate science, police science, and investment science.”—Ray Rivera, Forbes Magazine

I too have engaged in my fair share of hand-wringing over “data science”, how the term is used and mis-used, the high quantity of snake oil available, and some generally sloppy practices that seem to be becoming the norm in the internet’s new data-based gold rush.

However, as my mama used to say, “I can beat up on my brothers all I want, but you, sir, are not family.”

Data, harnessed for good, is going to transform our world and the way we do business. People who understand data, the mathematics of how data streams relate to each other, and how computers interact with that data, are going to be indispensable to this process. I don’t always…

View original post 444 more words

Read Full Post »

S4 and Storm are two distributed, scalable platforms for processing continuous unbounded streams of data.

I have been involved in the development of S4 (I designed the fault-recovery module) and I have used Storm for my latest project, so I have gained a bit of experience on both and I want to share my views on these two very similar and competing platforms.

First, some commonalities.
Both are distributed stream processing platforms, run on the JVM (S4 is pure Java while Storm is part Java part Clojure), are open source (Apache/Eclipse licenses), are inspired by MapReduce and are quite new. Both frameworks use keyed streams as their basic building block.

Now for some differences.

Programming model.

S4 implements the Actors programming paradigm. You define your program in terms of Processing Elements (PEs) and Adapters, and the framework instantiates one PE per each unique key in the stream. This means that the logic inside a PE can be very simple, very much like MapReduce.

Storm does not have an explicit programming paradigm. You define your program in terms of bolts and spouts that process partitions of streams. The number of bolts to instantiate is defined a-priori and each bolt will see a partition of the stream.

To make things more clear, let’s use the classic “hello world” program from MapReduce: word count.

Let’s say we want to implement a streaming word count. In S4, we can define a word to be a key, and our PE would need to keep track of the number of instances it processes by using a single long (again, very much like MapReduce). In Storm, we need to program each bolt as if it had to process the whole stream, so we would use a data structure like a Map<String, Long> to keep track of the word counts. The distribution and parallelism are orthogonal to the program.

In synthesis, in S4 you program for a single key, in Storm you program for the whole stream. Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework. To use an analogy from Java build systems, Storm is more like Ant and S4 is more like Maven.

My personal preference here goes to S4, as it makes programming much easier. Most of the times in Storm you will anyway end mimicking the Actors model by implementing a hash based structure on a key, like in the example above.

Data pipeline.

S4 uses a push model, events are pushed to the next PE as fast as possible. If receiver buffers get full events are dropped, and this can happen at any stage in the pipeline (from the Adapter to any PE).

Storm uses a pull model. Each bolt pulls event from its source, be it a spout or another bolt. Event loss can thus happen only at ingestion time, in the spouts if they cannot keep up with the external event rate.

In this case my preference goes to Storm, as it makes deployment much easier: you need to tune buffer sizes in order to deal peaks and event loss only at single place, the spout. If your deployment is badly sized in terms of parallelism level, at worst you get a performance hit in terms of throughput and latency, but the algorithm will produce the same result.

Fault tolerance.

S4 provides state recovery via uncoordinated checkpointing. When a node crashes, a new node takes over its task and restarts from a recent snapshot of its state. Events sent after the last checkpoint and before the recovery are lost. Indeed, events can be lost in any case due to overload, so this design makes perfect sense. State recovery is very important for long running machine learning programs, where the state represents days or weeks worth of data.

Storm provides guaranteed delivery of events/tuples. Each tuple traverses the entire pipeline within a time interval or is declared as failed and can be replayed from the start by the spout. Spouts are responsible to keep tuples around for replay, or can rely on external services to do so (like Apache Kafka). However, the framework provides no state recovery.

I declare a tie here. State recovery is needed for many ML applications, although guaranteed delivery makes it easier to reason about the state of applications. Having both would be ideal, but implementing both of them without performance penalties is not trivial.


There are many other differences, but for sake of brevity I just present a short summary of the pros of each platform that the other one lacks.

S4 pros:

  • Clean programming model.
  • State recovery.
  • Inter-app communication.
  • Classpath isolation.
  • Tools for packaging and deployment.
  • Apache incubation.

Storm pros:

  • Pull model.
  • Guaranteed processing.
  • More mature, more traction, larger community.
  • High performance.
  • Thread programming support.
  • Advanced features (transactional topologies, Trident).

Now the hard question: “Which one should I use for my new project?”.

Unfortunately there is no easy answer, it mostly depends on your needs. I think the biggest factor to consider is whether you need guaranteed processing of events or state recovery. Also worth considering, Storm has a larger and more active user community, but the project is mainly a one-man effort, while S4 is in incubation with the ASF. This difference might be important if you are a large organization trying to decide on which platform to invest for the long term.

Read Full Post »

It looks like people are now realizing the need for powerful real-time analytics engines. Dremel was designed by Google as an interactive query engine for cluster environments. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. It’s the perfect tool to power your Big Data dashboards. Now a bunch of open source clones are appearing. Will they have the same luck as Hadoop?

Here the contenders:

  • The regular: Apache Drill
    Mainly supported by MapR Technologies and currently in incubator. At the moment of writing there are a total of 12 issues in Jira (for a comparison, we just got to 3000 in Pig this week). Even the name still needs to be confirmed. As usual in the Apache style, this will most likely be a community-driven project, with all the pros and cons of the case. The Java + Maven + Git combo should be familiar and enable contributors to get up to speed quickly.
  • The challenger: Cloudera Impala
    Just open-sourced a couple of days ago. Surprisingly, it even offers the possibility of joining tables (something Dremel didn’t do for efficiency reasons). Unfortunately it is all C++, which I have hanged on the nail after my master’s thesis. I hope this choice won’t scare away contributors. However I suspect that Cloudera wants to drive most of the development in-house rather than build a community project.
  • The outsider: Metamarkets Druid
    Even though it has been around for a year, it has only recently become open source. It’s interesting to read how these guys were frustrated by existing technology so they just decided to roll their own solution. My (unsupported) feeling is that this is by far the most mature of the three clones. One interesting feature of Druid is real-time ingestion of data. From what I gather, Impala relies on Hive tables and Drill on Avro files, so my guess is that both of them cannot do the same. (For the records, also here Java + Maven + Git).

As a technical side note, and a personal curiosity, I wonder whether these project would benefit from YARN integration. I guess it will be easier for Drill than for the others. However startup latency could be an issue in this case.

The whole situation seems like a déjà vu of the Giraph/Hama/GoldenOrb clones of Pregel. Who will win the favor of the Big Data crowd?
Who will be able to grow the largest community? Technical issues are only a part of the equation in this case.

I am quite excited to see this missing piece of the Big Data ecosystem getting more attention and thrilled by the competition.

PS: I have read somewhere around the Web that this will be the end of Hadoop and MapReduce. Nothing can be more wrong than this idea. Dremel is the perfect complement for MapReduce. Indeed, how better could you analyze the results of you MapReduce computation? Often the results are at least as big as the inputs, so you need a way to quickly generate small summaries. Hadoop counters have been (ab)used for this purpose, but more flexible and powerful post-processing capabilities are needed.

PPS: Just to be clear, there is nothing tremendously innovative in the science behind these products. Distributed query execution engines have been around for a while in parallel database systems. What’s yet to see is whether they can deliver their promise of extreme scalability, which parallel database system have failed to offer.

Read Full Post »

“Big Data”

Read Full Post »

Is that if anything has the slightest chance to go wrong, it will go horribly wrong.

It’s Murphy’s law at its highest peak.

Read Full Post »

Older Posts »