Archive for the ‘Research’ Category
Frequentist vs. Bayesian
Posted in Fun, Research, tagged bayesian, comic, frequentist, statistics, Twitter, xkcd on 25 June 2013 | Leave a Comment »
It seems like much of the research on Twitter data going on nowadays would benefit from reading xkcd.
Algorithms Every Data Scientist Should Know: Reservoir Sampling
Posted in Research, tagged data mining, hadoop, mapreduce, reservoir sampling, sampling, stratified sampling on 2 May 2013 | Leave a Comment »
Say you have a stream of items of large and unknown length that you can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.
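For the record, here is a minimal single-pass sketch of the classic answer, Algorithm R with a reservoir of size one; the Java class and method names are mine, not from the linked post.

```java
import java.util.Iterator;
import java.util.Random;

// Minimal sketch of reservoir sampling with a reservoir of size one.
public class ReservoirSampler {

    // Returns one item chosen uniformly at random from the stream,
    // or null if the stream is empty.
    public static <T> T sampleOne(Iterator<T> stream, Random rng) {
        T chosen = null;
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            // Keep the i-th item with probability 1/i; by induction, every
            // item ends up selected with probability 1/n after n items.
            if (rng.nextDouble() < 1.0 / seen) {
                chosen = item;
            }
        }
        return chosen;
    }
}
```

The same trick generalizes to sampling k items: keep the first k in the reservoir, then replace a uniformly chosen slot with probability k/i for the i-th item.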
Big data is sort of like Zen
Posted in Fun, Research, tagged big data, quote, zen on 5 April 2013 | Leave a Comment »
It’s something that’s difficult to explain, has many interpretations,
and the best way to learn it is to do it.
Shamelessly copied from The Apache Way
Research makes you feel stupid
Posted in Fun, Musing, PhD, Research, tagged phd, research on 1 March 2013 | Leave a Comment »
On Data Science
Posted in Research, Technology, tagged big data, data mining, data science, definition, fad on 11 January 2013 | Leave a Comment »
“The word Data tells you that I transform raw information into actionable information. The word Scientist emphasizes my commitment to making sure that the analyses my colleagues and I produce are verifiable and repeatable—as all good science should be.”
I am not sure I agree with the whole argument in the post, but this definition of data science is the best I have seen so far.
“Any field of study followed by the word “science”, so goes the old wheeze, is not really a science, including computer science, climate science, police science, and investment science.”—Ray Rivera, Forbes Magazine
I too have engaged in my fair share of hand-wringing over “data science”, how the term is used and mis-used, the high quantity of snake oil available, and some generally sloppy practices that seem to be becoming the norm in the internet’s new data-based gold rush.
However, as my mama used to say, “I can beat up on my brothers all I want, but you, sir, are not family.”
Data, harnessed for good, is going to transform our world and the way we do business. People who understand data, the mathematics of how data streams relate to each other, and how computers interact with that data, are going to be indispensable to this process. I don’t always…
Distributed stream processing showdown: S4 vs Storm
Posted in Research, Technology, tagged big data, distributed systems, S4, showdown, Storm, stream processing on 2 January 2013 | 5 Comments »
S4 and Storm are two distributed, scalable platforms for processing continuous unbounded streams of data.
I have been involved in the development of S4 (I designed the fault-recovery module) and I have used Storm for my latest project, so I have gained a bit of experience with both and want to share my views on these two very similar, competing platforms.
First, some commonalities.
Both are distributed stream processing platforms, run on the JVM (S4 is pure Java while Storm is part Java part Clojure), are open source (Apache/Eclipse licenses), are inspired by MapReduce and are quite new. Both frameworks use keyed streams as their basic building block.
Now for some differences.
Programming model.
S4 implements the Actors programming paradigm. You define your program in terms of Processing Elements (PEs) and Adapters, and the framework instantiates one PE for each unique key in the stream. This means that the logic inside a PE can be very simple, very much like in MapReduce.
Storm does not have an explicit programming paradigm. You define your program in terms of bolts and spouts that process partitions of streams. The number of bolts to instantiate is defined a priori, and each bolt sees one partition of the stream.
To make things clearer, let’s use the classic “hello world” program of MapReduce: word count.
Let’s say we want to implement a streaming word count. In S4, we can define each word to be a key, and our PE only needs to keep track of the number of occurrences it processes with a single long counter (again, very much like MapReduce). In Storm, we need to program each bolt as if it had to process the whole stream, so we would use a data structure like a Map<String, Long> to keep track of the word counts; the distribution and parallelism are orthogonal to the program.
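To make the contrast concrete, here is a sketch in plain Java. The classes are simplified and hypothetical (the names and interfaces are mine); the real S4 and Storm APIs look different and carry much more machinery.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified classes for illustration only; not the real APIs.

// S4 style: the framework creates one PE instance per unique key (word),
// so the state of each instance is a single counter.
class WordCountPE {
    private long count = 0;

    void onEvent(String word) { // every event routed here carries the same word
        count++;
    }
}

// Storm style: a fixed number of bolt instances each see a partition of the
// stream, so each instance must track the counts of many words itself.
class WordCountBolt {
    private final Map<String, Long> counts = new HashMap<>();

    void execute(String word) {
        counts.merge(word, 1L, Long::sum);
    }
}
```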
In summary, in S4 you program for a single key, while in Storm you program for the whole stream. Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework. To use an analogy from Java build systems, Storm is more like Ant and S4 is more like Maven.
My personal preference here goes to S4, as it makes programming much easier. Most of the time in Storm you will end up mimicking the Actors model anyway, by implementing a hash-based structure keyed on some field, as in the example above.
Data pipeline.
S4 uses a push model: events are pushed to the next PE as fast as possible. If receiver buffers fill up, events are dropped, and this can happen at any stage in the pipeline (from the Adapter to any PE).
Storm uses a pull model: each bolt pulls events from its source, be it a spout or another bolt. Event loss can thus happen only at ingestion time, in the spouts, if they cannot keep up with the external event rate.
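As a toy illustration of the difference (not actual S4 or Storm code), think of the hand-off between two stages as a bounded in-memory queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy illustration of the two hand-off styles between pipeline stages.
class StageBuffer {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);

    // Push model (S4-like): the producer offers the event and moves on; if the
    // buffer is full the event is dropped, and this can happen at every stage.
    boolean push(String event) {
        return buffer.offer(event); // false means the event was dropped
    }

    // Pull model (Storm-like): the consumer takes an event when it is ready,
    // so back-pressure reaches upstream and loss is confined to ingestion.
    String pull() throws InterruptedException {
        return buffer.take();
    }
}
```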
In this case my preference goes to Storm, as it makes deployment much easier: you need to tune buffer sizes to deal with peaks and event loss in only one place, the spout. If your deployment is badly sized in terms of parallelism, at worst you take a performance hit in throughput and latency, but the algorithm will produce the same result.
Fault tolerance.
S4 provides state recovery via uncoordinated checkpointing. When a node crashes, a new node takes over its task and restarts from a recent snapshot of its state. Events sent after the last checkpoint and before the recovery are lost. Indeed, events can be lost in any case due to overload, so this design makes perfect sense. State recovery is very important for long running machine learning programs, where the state represents days or weeks worth of data.
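In spirit, uncoordinated checkpointing looks like the following toy sketch (my illustration, not the actual S4 recovery module):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Toy sketch: each PE snapshots its own state independently, with no
// coordination; after a crash, a replacement restores the latest snapshot.
class CountingPE implements Serializable {
    private long count = 0;

    void onEvent(String word) {
        count++;
    }

    void checkpoint(File snapshot) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(snapshot))) {
            out.writeObject(this);
        }
    }

    static CountingPE recover(File snapshot) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(snapshot))) {
            // Events processed after this snapshot and before the crash are lost.
            return (CountingPE) in.readObject();
        }
    }
}
```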
Storm provides guaranteed delivery of events/tuples: each tuple traverses the entire pipeline within a time interval, or it is declared failed and can be replayed from the start by the spout. Spouts are responsible for keeping tuples around for replay, or they can rely on an external service to do so (like Apache Kafka). However, the framework provides no state recovery.
I declare a tie here. State recovery is needed for many ML applications, although guaranteed delivery makes it easier to reason about the state of applications. Having both would be ideal, but implementing both of them without performance penalties is not trivial.
Summary.
There are many other differences, but for the sake of brevity I will just present a short summary of the pros of each platform that the other one lacks.
S4 pros:
- Clean programming model.
- State recovery.
- Inter-app communication.
- Classpath isolation.
- Tools for packaging and deployment.
- Apache incubation.
Storm pros:
- Pull model.
- Guaranteed processing.
- More mature, more traction, larger community.
- High performance.
- Thread programming support.
- Advanced features (transactional topologies, Trident).
Now for the hard question: “Which one should I use for my new project?”
Unfortunately, there is no easy answer; it mostly depends on your needs. I think the biggest factor to consider is whether you need guaranteed processing of events or state recovery. Also worth considering: Storm has a larger and more active user community, but the project is mainly a one-man effort, while S4 is in incubation at the ASF. This difference might be important if you are a large organization trying to decide which platform to invest in for the long term.
In defense of keeping data private
Posted in Research, tagged advantage, data, private, research, unfair on 25 December 2012 | Leave a Comment »
“The proposal is that certain conferences make it mandatory to publish datasets that were used for the experiments. This is a very bad idea and two things are getting confused here: scientific progress and common access. These two are not identical. Reproducibility is often confused with common access.”
Read more here, by Alex Smola.
GSoC 2012: the other side of the fence
Posted in Research, Technology, tagged GSoC, mentor, pig, rank on 24 May 2012 | Leave a Comment »
And here it comes again, as it does every year: Google Summer of Code!
As usual, I will be working on Apache Pig this summer as well. However, this time I will be working from the other side of the fence: I will be mentoring!
Since I have already graduated and am no longer a student (shock!), I cannot take part as a student (no way!). So I decided to mentor.
- Pro: you need much less time to mentor compared to being a student.
- Con: you don’t learn as much, nor have as much fun, nor get the same recognition, nor get paid as much. 😉
However, the great thing about mentoring is that you propose ideas you would like to see realized in the project, and then students make them real!
Indeed, the project I am mentoring this year is about an idea I had some while ago.
It happens very often (to me and to my colleagues as well) that you have a list of tuples and need to attach a unique number to each one.
For example, when creating the lexicon for a collection of documents, you have a list of the unique words appearing in the collection, and you want to transform each word into a numerical id to reduce memory usage and ease later processing.
A while ago I implemented a MapReduce algorithm for this task that scales very well (i.e. no single reducer that does all the work).
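The core idea, sketched below in plain Java, is my reconstruction of that approach (not the actual code): a first pass counts records per partition, a prefix sum turns the counts into starting offsets, and a second pass assigns offset + local position in parallel, with no single reducer touching the whole dataset.

```java
// Sketch of scalable unique-id assignment over partitioned data.
class UniqueIds {

    // Turn per-partition record counts into per-partition starting offsets.
    static long[] offsets(long[] partitionCounts) {
        long[] offsets = new long[partitionCounts.length];
        long running = 0;
        for (int i = 0; i < partitionCounts.length; i++) {
            offsets[i] = running;
            running += partitionCounts[i];
        }
        return offsets;
    }

    // In the second pass, the j-th record of partition i gets a globally
    // unique, contiguous id, computed independently in each partition.
    static long idFor(long[] offsets, int partition, long localIndex) {
        return offsets[partition] + localIndex;
    }
}
```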
However, this same problem comes up very often, and it is cumbersome to fit a MapReduce job in the middle of a Pig pipeline while you are building the script iteratively (read: you don’t know what you are doing yet), especially because the code I wrote is not general enough to be applied as-is, so each time I need to customize it a bit.
“Indeed, why the hell does Pig not do this already?” was my reaction. So PIG-2353 was born.
PIG-2353 is quite a stretch for a night’s coding, so it sat there for a while, until a student got interested in it for GSoC. Enter Allan Avendaño.
You can also follow Allan’s progress on his blog.
He has already got his first patch in (PIG-2583) and started working on the main project this week.
I expect great stuff coming out of this project!
I am ready for a great summer, “flipping bits not burgers”.
The true problem with big data
Posted in Musing, Research, tagged big data, failure on 3 May 2012 | 3 Comments »
Is that if anything has the slightest chance of going wrong, it will go horribly wrong.
It’s Murphy’s law at its highest peak.