Archive for the ‘Technology’ Category

WhatsApp is a nice mobile app for real-time messaging that I bought some time ago (0.99€). The app is extremely simple and many people I know use it, so it quickly became part of my toolkit for social life.

Now for the bad part. I own an iPhone 3G, and the latest OS that runs on my phone is iOS 4.2. The latest version of WhatsApp dropped support for this iOS version, which is normal in today’s accelerated market. What’s not normal is that WhatsApp forced the upgrade on my phone: the app stopped working until I upgraded it. And of course, after I upgraded I was no longer able to install it due to the new requirements.

End of story: I am left with a blank tile on my iPhone, I have lost the ability to connect with my friends, I have lost all my conversations, and finally I have been robbed of my 0.99€.

Of course they blame it on Apple. Here is what their support answered to my inquiry:


Thanks for your message.

In order to connect to WhatsApp, or activate your number, you will need the latest version of WhatsApp Messenger – v2.8.7 – available from the App Store on your iPhone. Please note that iPod and iPad are not supported devices.

The latest version of WhatsApp for iPhone requires iOS 4.3 or later. Regretfully, Apple does not allow new app updates to be compatible with both iOS 6 and older versions of iOS, effectively ending support for the iPhone 3G. Because of Apple’s decision to stop supporting these devices, we can no longer provide new app updates for iPhone 3G users.

If you have an iPhone 3GS or newer device, please update to the latest version of iOS. Instructions for updating can be found at this Apple Support page: http://support.apple.com/kb/HT4972

WhatsApp is also supported on Android, Windows Phone, BlackBerry, and select Nokia devices. Find out more at http://www.whatsapp.com/

However, this is total BS. In any well-managed company, backwards compatibility is a priority if you don’t want to alienate your user base. It’s perfectly normal to stop supporting an old version of the app and to stop providing new features. It’s totally crazy to force the upgrade on people and make the app stop working.

Thanks WhatsApp, seriously great customer care.

Update: WhatsApp has changed its mind and now allows the old version (2.8.4) to access its servers. So if you have an iPhone 3G and the old version you should be all set. If you (like me) tried to update, then you have to download WhatsApp 2.8.4.ipa from somewhere on the internet, and downgrade your version (as easy as double-clicking on the .ipa). I think I am not allowed by law to put my .ipa bundle online, however here is its md5 hash in order to avoid fakes:

Disclaimer: I am reasonably sure the .ipa I have is the correct one, but you never know. Don’t blame me if things blow up 🙂

Read Full Post »

S4 and Storm are two distributed, scalable platforms for processing continuous unbounded streams of data.

I have been involved in the development of S4 (I designed the fault-recovery module) and I have used Storm for my latest project, so I have gained a bit of experience on both and I want to share my views on these two very similar and competing platforms.

First, some commonalities.
Both are distributed stream processing platforms, run on the JVM (S4 is pure Java while Storm is part Java part Clojure), are open source (Apache/Eclipse licenses), are inspired by MapReduce and are quite new. Both frameworks use keyed streams as their basic building block.

Now for some differences.

Programming model.

S4 implements the Actors programming paradigm. You define your program in terms of Processing Elements (PEs) and Adapters, and the framework instantiates one PE for each unique key in the stream. This means that the logic inside a PE can be very simple, very much like in MapReduce.

Storm does not have an explicit programming paradigm. You define your program in terms of bolts and spouts that process partitions of streams. The number of bolts to instantiate is defined a priori, and each bolt sees one partition of the stream.

To make things clearer, let’s use the classic “hello world” program of MapReduce: word count.

Let’s say we want to implement a streaming word count. In S4, we can define a word to be a key, and our PE would need to keep track of the number of instances it processes by using a single long (again, very much like MapReduce). In Storm, we need to program each bolt as if it had to process the whole stream, so we would use a data structure like a Map<String, Long> to keep track of the word counts. The distribution and parallelism are orthogonal to the program.
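To make the contrast concrete, here is a minimal plain-Java sketch of the two counting styles. The class and method names are mine, not S4’s or Storm’s actual APIs: an S4-style per-key counter, where the framework would create one instance per word, and a Storm-style partition counter that keeps a map over all the words routed to it.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountStyles {
    // S4 style: the framework instantiates one PE per unique key (word),
    // so the PE itself only needs a single counter.
    static class WordCountPE {
        long count = 0;
        void onEvent() { count++; }
    }

    // Storm style: a bolt sees a whole partition of the stream,
    // so it must track every word routed to it in a map.
    static class WordCountBolt {
        final Map<String, Long> counts = new HashMap<>();
        void execute(String word) {
            counts.merge(word, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        String[] stream = {"hello", "world", "hello"};

        // S4 style: one PE per key, managed by hand here to simulate the framework.
        Map<String, WordCountPE> pes = new HashMap<>();
        for (String w : stream) pes.computeIfAbsent(w, k -> new WordCountPE()).onEvent();

        // Storm style: a single bolt handles the whole (one-partition) stream.
        WordCountBolt bolt = new WordCountBolt();
        for (String w : stream) bolt.execute(w);

        System.out.println(pes.get("hello").count);   // 2
        System.out.println(bolt.counts.get("hello")); // 2
    }
}
```

Both produce the same counts; the difference is where the keying lives: in S4 it is the framework’s job, in Storm it is the programmer’s.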

In short, in S4 you program for a single key; in Storm you program for the whole stream. Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework. To use an analogy from Java build systems, Storm is more like Ant and S4 is more like Maven.

My personal preference here goes to S4, as it makes programming much easier. Most of the time in Storm you will end up mimicking the Actors model anyway, by implementing a hash-based structure keyed on some field, as in the example above.

Data pipeline.

S4 uses a push model: events are pushed to the next PE as fast as possible. If receiver buffers fill up, events are dropped, and this can happen at any stage of the pipeline (from the Adapter to any PE).

Storm uses a pull model: each bolt pulls events from its source, be it a spout or another bolt. Event loss can thus happen only at ingestion time, in the spouts, if they cannot keep up with the external event rate.
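The difference can be sketched with a plain Java bounded queue; this is my own illustration, not either framework’s code. In the push model the producer offers events without waiting and drops them when the receiver’s buffer is full; in a pull model the consumer drains the buffer at its own pace and the producer blocks instead, so nothing is lost.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class PushVsPull {
    // Push model: offer() never blocks; when the receiver buffer is full
    // the event is simply dropped. Returns how many events were lost.
    static int pushAll(ArrayBlockingQueue<Integer> buffer, int events) {
        int dropped = 0;
        for (int i = 0; i < events; i++) {
            if (!buffer.offer(i)) dropped++; // full => event lost
        }
        return dropped;
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(2);
        int dropped = pushAll(buffer, 5);
        // In a pull model the consumer would instead drain the buffer with
        // take() at its own pace and the producer would block on put():
        // no loss, at the cost of backpressure on the source.
        System.out.println("dropped=" + dropped); // dropped=3
    }
}
```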

In this case my preference goes to Storm, as it makes deployment much easier: you need to tune buffer sizes to deal with peaks and event loss in only a single place, the spout. If your deployment is badly sized in terms of parallelism, at worst you get a performance hit in throughput and latency, but the algorithm will still produce the same result.

Fault tolerance.

S4 provides state recovery via uncoordinated checkpointing. When a node crashes, a new node takes over its task and restarts from a recent snapshot of its state. Events sent after the last checkpoint and before the recovery are lost. Indeed, events can be lost in any case due to overload, so this design makes perfect sense. State recovery is very important for long-running machine learning programs, where the state represents days’ or weeks’ worth of data.

Storm provides guaranteed delivery of events/tuples. Each tuple traverses the entire pipeline within a time interval, or it is declared failed and can be replayed from the start by the spout. Spouts are responsible for keeping tuples around for replay, or can rely on external services to do so (like Apache Kafka). However, the framework provides no state recovery.
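A toy sketch of what state recovery buys you; this is my own illustration, not S4’s actual checkpointing code. The counter snapshots its state every few events; after a simulated crash, a replacement restarts from the last snapshot, so only the events since that checkpoint are lost, not the whole accumulated state.

```java
public class CheckpointedCounter {
    long count = 0;
    long snapshot = 0;  // last checkpointed state
    final int interval; // checkpoint every `interval` events

    CheckpointedCounter(int interval) { this.interval = interval; }

    void onEvent() {
        count++;
        // Uncoordinated checkpoint: each node snapshots on its own schedule.
        if (count % interval == 0) snapshot = count;
    }

    // A replacement node restarts from the snapshot: events processed after
    // the last checkpoint are lost, just as they can be under overload.
    CheckpointedCounter recover() {
        CheckpointedCounter fresh = new CheckpointedCounter(interval);
        fresh.count = snapshot;
        fresh.snapshot = snapshot;
        return fresh;
    }
}
```

With a checkpoint interval of 5, a crash after 7 events loses only the last 2, while a counter representing weeks of data survives.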

I declare a tie here. State recovery is needed for many ML applications, although guaranteed delivery makes it easier to reason about the state of applications. Having both would be ideal, but implementing both of them without performance penalties is not trivial.


There are many other differences, but for the sake of brevity I will just present a short summary of the pros of each platform that the other one lacks.

S4 pros:

  • Clean programming model.
  • State recovery.
  • Inter-app communication.
  • Classpath isolation.
  • Tools for packaging and deployment.
  • Apache incubation.

Storm pros:

  • Pull model.
  • Guaranteed processing.
  • More mature, more traction, larger community.
  • High performance.
  • Thread programming support.
  • Advanced features (transactional topologies, Trident).

Now the hard question: “Which one should I use for my new project?”.

Unfortunately, there is no easy answer: it mostly depends on your needs. I think the biggest factor to consider is whether you need guaranteed processing of events or state recovery. Also worth considering: Storm has a larger and more active user community, but the project is mainly a one-man effort, while S4 is in incubation with the ASF. This difference might matter if you are a large organization trying to decide which platform to invest in for the long term.

Read Full Post »

A great explanation by Michael G. Noll of workers, executors and tasks in Storm, one of its most confusing bits in my opinion.

Understanding the parallelism of a Storm topology

Read Full Post »

In building there are three stages: Preparation, Production and Proving.

Preparation. The environment must be prepared before work can commence. When painting a room, we have to choose the color scheme, measure, tape up the woodwork, and buy the paint, all before we can start putting it on the wall.

Production is the steepest slope where the maximum rate of measurable work occurs—the most code is written, the most paint is applied.

Proving is the final long tail of the process. This always seems to take longer than it should partly because we invariably find out things we were not expecting—which is where the Second Order Ignorance (unknown unknowns) comes in. In painting, this is the detail work, the tricky corners and, of course, the cleanup—which also always seems to take longer than it should.

All this means that when the product is 90% complete, the activity is only about halfway through its total time (so true).
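The 90%-complete-at-halfway claim is easy to check numerically for a Rayleigh effort curve. A minimal sketch, assuming cumulative completion of the form 1 − exp(−a·t²) with an arbitrary scale a = 1 (my choice, not from the article): the 90% completion point falls at roughly half the time needed to reach ~99.99%.

```java
public class RayleighCompletion {
    // Cumulative completion under a Rayleigh effort curve with scale a = 1.
    static double completion(double t) {
        return 1.0 - Math.exp(-t * t);
    }

    public static void main(String[] args) {
        double t90 = Math.sqrt(Math.log(10)); // time at which completion hits 0.9
        System.out.printf("completion at t90     = %.4f%n", completion(t90));     // 0.9000
        System.out.printf("completion at 2 * t90 = %.4f%n", completion(2 * t90)); // 0.9999
        // Doubling the elapsed time from the 90% point only delivers the
        // last ~10%: at 90% complete you are only about halfway through.
    }
}
```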

Cumulative completion when production follows a Rayleigh curve.

Read more here (you will need access to CACM): How We Build Things

Read Full Post »

Person that does not know how machine learning works,
but knows what to do when machine learning does not work.

Read Full Post »


Inserts a blank line after every line that does not start with the same word as the immediately following line.
Extremely useful for .tsv and .csv files.
(It might need some tweaking for non-word characters.)

Vim awesomeness.

Read Full Post »

It looks like people are now realizing the need for powerful real-time analytics engines. Dremel was designed by Google as an interactive query engine for cluster environments. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. It’s the perfect tool to power your Big Data dashboards. Now a bunch of open source clones are appearing. Will they have the same luck as Hadoop?

Here are the contenders:

  • The regular: Apache Drill
    Mainly supported by MapR Technologies and currently in the Apache Incubator. At the moment of writing there are a total of 12 issues in Jira (for comparison, we just reached 3000 in Pig this week). Even the name still needs to be confirmed. As usual in the Apache style, this will most likely be a community-driven project, with all the pros and cons that entails. The Java + Maven + Git combo should be familiar and enable contributors to get up to speed quickly.
  • The challenger: Cloudera Impala
    Just open-sourced a couple of days ago. Surprisingly, it even offers the possibility of joining tables (something Dremel didn’t do, for efficiency reasons). Unfortunately it is all C++, which I put aside for good after my master’s thesis. I hope this choice won’t scare away contributors. However, I suspect that Cloudera wants to drive most of the development in-house rather than build a community project.
  • The outsider: Metamarkets Druid
    Even though it has been around for a year, it has only recently become open source. It’s interesting to read how these guys were frustrated by existing technology and just decided to roll their own solution. My (unsupported) feeling is that this is by far the most mature of the three clones. One interesting feature of Druid is real-time ingestion of data. From what I gather, Impala relies on Hive tables and Drill on Avro files, so my guess is that neither of them can do the same. (For the record, also here: Java + Maven + Git.)

As a technical side note, and out of personal curiosity, I wonder whether these projects would benefit from YARN integration. I guess it will be easier for Drill than for the others. However, startup latency could be an issue in this case.

The whole situation seems like a déjà vu of the Giraph/Hama/GoldenOrb clones of Pregel. Who will win the favor of the Big Data crowd?
Who will be able to grow the largest community? Technical issues are only a part of the equation in this case.

I am quite excited to see this missing piece of the Big Data ecosystem getting more attention and thrilled by the competition.

PS: I have read somewhere around the Web that this will be the end of Hadoop and MapReduce. Nothing could be more wrong than this idea. Dremel is the perfect complement for MapReduce. Indeed, how better could you analyze the results of your MapReduce computation? Often the results are at least as big as the inputs, so you need a way to quickly generate small summaries. Hadoop counters have been (ab)used for this purpose, but more flexible and powerful post-processing capabilities are needed.

PPS: Just to be clear, there is nothing tremendously innovative in the science behind these products. Distributed query execution engines have been around for a while in parallel database systems. What’s yet to be seen is whether they can deliver on their promise of extreme scalability, which parallel database systems have failed to offer.

Read Full Post »

Against any law of conservation, you can create entire worlds out of nothing!

Read Full Post »

In about the same place as the engineering in computer engineering.

Read Full Post »

Kong Jin Jie

Not all services/applications can be disabled through Preferences on your Mac. These startup items are defined in scripts. You can find them in the StartupItems folders:

  • /Library/StartupItems
  • /System/Library/StartupItems

You should see some folders which contain the startup scripts. To disable a specific service, perform these steps in the Terminal, replacing <appfolder> with the name of the folder/service:

  1. cd /Library/StartupItems/<appfolder>
  2. sudo touch .disabled

This will create an empty file named “.disabled”. To confirm, open up System Profiler, go to Startup Items, and notice that the service is disabled.

View original post

Read Full Post »
