Archive for the ‘Research’ Category

Let it snow…

Christmas is approaching, even in Korea 🙂

I am happy to announce that the SNOW workshop on Social News on the Web will be running again this year for its second edition. We will be co-located with WWW ’14 in Seoul, Korea. Quoting the workshop page:

The workshop provides an interdisciplinary forum to bring together researchers and professionals working in several fields including journalism, computer science, and social science to present novel ideas and discussing future directions in this scenario.

The call for papers is already out. This year we will also have a data challenge linked to the workshop, with a dataset, prizes, and all the proper shiny things. Stay tuned!



Read Full Post »

Ever itched to do machine learning and data mining on streams? On huge, big data streams?

We have a solution for you!

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is currently in alpha stage and is developed at Yahoo Labs in Barcelona. It is released under the Apache License, Version 2.0.

Thanks to everybody who made this release possible!

Read more on the Yahoo Engineering blog.

Read Full Post »

Nice blog post on approximate quantiles by the guys behind Druid at Metamarkets. The basic technique they use is the histogram proposed by Ben-Haim & Tom-Tov for their Streaming Parallel Decision Tree.

Read the blog post.
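
To give a flavor of the idea, here is a minimal sketch of a streaming histogram in the spirit of Ben-Haim & Tom-Tov: keep at most a fixed number of (centroid, count) bins, and whenever a new point pushes you over the limit, merge the two closest bins into their weighted average. The class name, the `max_bins` default, and especially the `quantile` method are my own assumptions for illustration. The quantile estimate is a simple linear interpolation between centroids, not the exact procedure from the paper or from Druid.

```python
import bisect
from typing import List


class StreamingHistogram:
    """Fixed-size histogram sketch (hypothetical helper, not Druid's code)."""

    def __init__(self, max_bins: int = 64) -> None:
        self.max_bins = max_bins
        self.centroids: List[float] = []  # kept sorted
        self.counts: List[int] = []       # counts[i] belongs to centroids[i]

    def update(self, value: float) -> None:
        i = bisect.bisect_left(self.centroids, value)
        if i < len(self.centroids) and self.centroids[i] == value:
            self.counts[i] += 1
            return
        # Add the point as its own bin, then shrink back if needed.
        self.centroids.insert(i, value)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self) -> None:
        # Find the pair of adjacent bins with the smallest gap.
        gaps = [b - a for a, b in zip(self.centroids, self.centroids[1:])]
        j = gaps.index(min(gaps))
        total = self.counts[j] + self.counts[j + 1]
        merged = (
            self.centroids[j] * self.counts[j]
            + self.centroids[j + 1] * self.counts[j + 1]
        ) / total
        self.centroids[j:j + 2] = [merged]
        self.counts[j:j + 2] = [total]

    def quantile(self, q: float) -> float:
        # Simplified estimate (an assumption of this sketch): treat half of
        # each bin's count as lying below its centroid and interpolate
        # linearly between neighbouring centroids.
        if not self.centroids:
            raise ValueError("empty histogram")
        target = q * sum(self.counts)
        cumulative = []
        running = 0.0
        for c in self.counts:
            cumulative.append(running + c / 2.0)
            running += c
        if target <= cumulative[0]:
            return self.centroids[0]
        if target >= cumulative[-1]:
            return self.centroids[-1]
        i = bisect.bisect_right(cumulative, target)
        left, right = cumulative[i - 1], cumulative[i]
        frac = (target - left) / (right - left)
        return self.centroids[i - 1] + frac * (
            self.centroids[i] - self.centroids[i - 1]
        )
```

Feed it values one at a time with `update` and ask for `quantile(0.5)` to get an approximate median. Memory stays bounded by `max_bins` no matter how long the stream is, which is the whole point.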

Read Full Post »

It seems that much of the research on Twitter data going on nowadays would benefit from reading xkcd.

Read Full Post »

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

Algorithms Every Data Scientist Should Know: Reservoir Sampling
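
For the impatient, here is a minimal sketch of the single-item case in Python. The function name `sample_one` and the optional `rng` parameter are my own choices for illustration, not taken from the linked post. The trick is to keep the n-th item with probability 1/n, which leaves every item seen so far equally likely to be the current choice.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def sample_one(
    stream: Iterable[T], rng: Optional[random.Random] = None
) -> Optional[T]:
    """Pick one item uniformly at random from a stream, in a single pass.

    Returns None if the stream is empty.
    """
    rng = rng or random.Random()
    chosen: Optional[T] = None
    for seen, item in enumerate(stream, start=1):
        # randrange(seen) is 0 with probability 1/seen, so the item
        # number `seen` replaces the current choice with that probability.
        if rng.randrange(seen) == 0:
            chosen = item
    return chosen
```

The first item is always kept (probability 1/1), and by induction each earlier item survives the n-th step with probability (n-1)/n, so after n items every one of them is held with probability 1/n. Only one item is stored at any time, and the length of the stream never needs to be known.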

Read Full Post »

It’s something that’s difficult to explain, has many interpretations, and the best way to learn it is to do it.

Shamelessly copied from The Apache Way

Read Full Post »

If it doesn’t, you’re not doing it right.

And on why research is not like sex.

Read Full Post »
