Ever itched to do machine learning and data mining on streams? On huge, big data streams?
We have a solution for you!
SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.
SAMOA is currently in alpha and is developed at Yahoo Labs in Barcelona. It is released under the Apache Software License v2.
Thanks to everybody who made this release possible!
Read more on the Yahoo Engineering blog.
Posted in Research, Technology | Tagged big data, data mining, machine learning, S4, SAMOA, Storm, stream processing
Nice blog post on approximate quantiles by the guys behind Druid at Metamarkets. The basic technique they use is the histogram proposed by Ben-Haim & Tom-Tov for their Streaming Parallel Decision Tree.
Read the blog post.
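For the curious, here is a rough sketch of how such a histogram can work. It keeps a bounded list of (centroid, count) bins and, whenever the limit is exceeded, merges the two closest bins into their weighted average, as in the Ben-Haim & Tom-Tov update step. The class name `StreamingHistogram`, the `max_bins` parameter, and the `quantile` method are my own illustrative choices. The quantile estimate is a simplified linear interpolation between centroids, not necessarily the exact procedure used in the paper or in Druid.

```python
import bisect
from typing import List


class StreamingHistogram:
    """Bounded-size histogram sketch (hypothetical names, simplified quantiles)."""

    def __init__(self, max_bins: int = 64) -> None:
        if max_bins < 2:
            raise ValueError("max_bins must be at least 2")
        self.max_bins = max_bins
        self.centroids: List[float] = []  # kept sorted in ascending order
        self.counts: List[int] = []       # counts[i] belongs to centroids[i]
        self.total = 0

    def update(self, x: float) -> None:
        """Add one point; merge the two closest bins if the limit is exceeded."""
        self.total += 1
        i = bisect.bisect_left(self.centroids, x)
        if i < len(self.centroids) and self.centroids[i] == x:
            self.counts[i] += 1
            return
        self.centroids.insert(i, x)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self) -> None:
        # Find the adjacent pair of centroids with the smallest gap.
        gaps = [
            self.centroids[k + 1] - self.centroids[k]
            for k in range(len(self.centroids) - 1)
        ]
        k = gaps.index(min(gaps))
        merged_count = self.counts[k] + self.counts[k + 1]
        # The merged centroid is the count-weighted average of the pair.
        merged_centroid = (
            self.centroids[k] * self.counts[k]
            + self.centroids[k + 1] * self.counts[k + 1]
        ) / merged_count
        self.centroids[k:k + 2] = [merged_centroid]
        self.counts[k:k + 2] = [merged_count]

    def quantile(self, q: float) -> float:
        """Approximate the q-quantile by interpolating between centroids.

        Assumption: half of each bin's mass lies on either side of its centroid.
        """
        if not self.centroids:
            raise ValueError("histogram is empty")
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        target = q * self.total
        cum_prev = self.counts[0] / 2  # estimated mass up to the first centroid
        if target <= cum_prev:
            return self.centroids[0]
        for k in range(1, len(self.centroids)):
            cum = cum_prev + (self.counts[k - 1] + self.counts[k]) / 2
            if target <= cum:
                frac = (target - cum_prev) / (cum - cum_prev)
                left, right = self.centroids[k - 1], self.centroids[k]
                return left + frac * (right - left)
            cum_prev = cum
        return self.centroids[-1]
```

Memory stays bounded by `max_bins` no matter how long the stream is, at the price of an approximate answer: call `update(x)` for each value as it arrives, then ask for `quantile(0.5)` to get an estimate of the median.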
Posted in Research, Technology | Tagged approximation, big data, quantiles
A lot of people seem to think performance is about doing the same thing, just doing it faster. That’s not what performance is all about. If you can do something really fast really well, people start using it differently.
Linus Torvalds (speaking about git)
Because more is not just more. More is different.
Posted in Technology | Tagged big data, different, performance, scale
It seems that much of the research on Twitter data nowadays would benefit from reading xkcd.
Posted in Fun, Research | Tagged bayesian, comic, frequentist, statistics, Twitter, xkcd
Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.
Algorithms Every Data Scientist Should Know: Reservoir Sampling
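The single-item version of the puzzle has a neat solution: keep the first item, then replace it with the i-th item with probability 1/i. A short induction shows that every item ends up selected with probability 1/n after n items. Here is a minimal sketch in Python; the function name `reservoir_sample_one` and the optional `rng` argument are my own choices, not taken from the linked post.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample_one(
    stream: Iterable[T], rng: Optional[random.Random] = None
) -> Optional[T]:
    """Pick one item uniformly at random from a stream of unknown length.

    Makes a single pass and stores only one item. Returns None for an
    empty stream.
    """
    rng = rng or random.Random()
    chosen: Optional[T] = None
    for i, item in enumerate(stream, start=1):
        # randrange(i) is 0 with probability 1/i, so the i-th item
        # replaces the current choice with exactly that probability.
        if rng.randrange(i) == 0:
            chosen = item
    return chosen
```

Since it only needs an iterator, you can feed it a generator or a file handle, for example `reservoir_sample_one(open("events.log"))` with a hypothetical log file, and it will never hold more than one line in memory.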
Posted in Research | Tagged data mining, hadoop, mapreduce, reservoir sampling, sampling, stratified sampling
In the future, thanks to the Grain, a chip which can be implanted on a hard drive in the brain, every single action that a person makes is recorded and may be played back. Liam, a lawyer, married with a child, suspects that his wife Fi is having a fling with the brash Jonas, whom they meet at a dinner party. After playing clips from his own ‘Grain’, his suspicions are confirmed, and he gets drunk and attacks Jonas, forcing the guilty pair to show him what is in their memory banks. In fact the affair has been going on for months and Liam attacks Jonas, demanding that he erase all memories of Fi from his brain. This, however, does not result in marital reconciliation.
The more I look at videos about Google Glass the more it reminds me of the Grain from “The Entire History of You”, Black Mirror s1e3.
I am not sure I share the enthusiasm with which Google speaks about its project. It’s definitely a technological wonder, but how will it affect personal relationships and society in general? What about split attention? Will we be able to handle the constant information flow?
Sometimes entrance barriers and usage fees (in terms of effort and time in this case) are useful. See for example the tragedy of the commons. How will the ability to check Facebook every second of your waking time change your life?
Posted in Musing, Technology | Tagged black mirror, glass, Google, information overload, society