Archive for the ‘Research’ Category

Big Data, the term

2012 was for sure the year “big data” went ballistic, and throughout 2013 and 2014 it became commonplace and commodified. It is so prevalent nowadays in both industry and academia that it has almost lost any meaning. But when did this trend start? Or, to be more concrete, when was the term “big data” coined?

It was for sure before 2010. One possible culprit is Randal E. Bryant, who also coined the term DISC (Data-Intensive Scalable Computing), which I prefer over “big data” to describe tools such as Hadoop; it’s just much more precise. However, this happened just around the corner, in 2008.

You might think that “big data” is a recent thing, at least more recent than 2000. Well, think again. This paper by Diebold shows a few references from the ’90s. In particular, footnote 9 says:

On the academic side, Tilly (1984) mentions Big Data, but his article is not about the Big Data phenomenon and demonstrates no awareness of it; rather, it is a discourse on whether statistical data analyses are of value to historians. On the non-academic side, the margin comments of a computer program posted to a newsgroup in 1987 mention a programming technique called “small code, big data.” Fascinating, but off-mark. Next, Eric Larson provides an early popular-press mention in a 1989 Washington Post article about firms that assemble and sell lists to junk-mailers. He notes in passing that “The keepers of Big Data say they do it for the consumer’s benefit.” Again fascinating, but again off-mark. (See Eric Larson, “They’re Making a List: Data Companies and the Pigeonholing of America,” Washington Post, July 27, 1989.) Finally, a 1996 PR Newswire, Inc. release mentions network technology “for CPU clustering and Big Data applications…” Still off-mark, neither reporting on the Big Data phenomenon nor demonstrating awareness of it, instead reporting exclusively on a particular technology, the so-called high-performance parallel interface.

The best guess at when the term was coined is 1998, by John Mashey (retired former Chief Scientist at SGI), who produced a slide deck entitled “Big Data and the Next Wave of InfraStress”. However, the famous 3V’s of big data came around 2001, introduced by Laney at Gartner.

Read Full Post »

A thoughtful piece on industrial labs and research styles. My favorite bit is the part about “empowering researchers”. All the managers dealing with researchers should read this paper by Roy Levin.

Windows On Theory

The closing of MSR-SV two months ago raised a fair bit of discussion, and I would like to contribute some of my own thoughts. Since the topic of industrial research is important, I would like the opportunity to counter some misconceptions that have spread. I would also like to share my advice with anyone that (like me) is considering an industrial research position (and anyone that already has one).


On Thursday 09/18/2014, an urgent meeting was announced for all but a few in MSR-SV. The short meeting marked the immediate closing of the lab. By the time the participants came back to their usual building, cardboard boxes were waiting for the prompt packing of personal items (to be evacuated by the end of that weekend). This harsh style of layoffs was one major cause for shock and it indeed seemed unprecedented for research labs of this sort. But I find the following…


Read Full Post »

Because when we sit down and think about a problem, when we take the time to not only understand what our feature space “is” and what it “implies” in the real-world — then we are acting like machine learning scientists. Otherwise, we [are] just a bunch of machine learning engineers, blindly performing black box learning and operating a set of R, MATLAB, and Python libraries.

The takeaway is this: machine learning isn’t a tool. It’s a methodology with a rational thought process that is entirely dependent on the problem we are trying to solve.

Get off the deep learning bandwagon and get some perspective – PyImageSearch

Read Full Post »

It feels a bit strange to write this post’s title, because I don’t find myself defending Facebook very often. But there seems to be some discontent in the socialmediaverse at the moment over a new study in which Facebook data scientists conducted a large-scale–over half a million participants!–experimental manipulation on Facebook in order to show that emotional contagion occurs on social networks. The news that Facebook has been actively manipulating its users’ emotions has, apparently, enraged a lot of people…

In defense of Facebook | [citation needed].

Read Full Post »

Suzana Herculano-Houzel: What is so special about the human brain?

Read Full Post »

In Japan there is a word to describe the various limits in innovative thinking. Taga, which literally describes the metal hoops which keep a tight hold on the wooden boards which make a barrel, is used to describe the current state of Japanese innovation. Taga is what causes organizations to decide unconsciously and automatically what is possible and what is not based on current circumstances, not future predictions, hopes or opportunities. It stops completely the ability of a company to adopt a positive attitude towards any change or new idea. Taga is usually fostered in a tacit agreement to, or unspoken understanding of, customary rules or organizational paradigms within a company. When new people join a company (usually it’s the hope that new people bring new ideas) they tend to quickly become unconsciously accustomed to thinking along the lines of the existing organization paradigm. This means that it can be extremely difficult for a company to be aware of taga limiting creativity and implementation of new ideas within your own company.


Click to access understanding_Taga.pdf

Read Full Post »

My first cube :)

Yahoo patent cube

Read Full Post »

Let it snow…

Christmas is approaching, even in Korea 🙂

I am happy to announce that the SNOW workshop on Social News on the Web will be running again this year for its second edition. We will be co-located with WWW ’14 in Seoul, Korea. Quoting the workshop page:

The workshop provides an interdisciplinary forum to bring together researchers and professionals working in several fields including journalism, computer science, and social science to present novel ideas and discussing future directions in this scenario.

The call for papers is already out. This year we will also have a data challenge linked to the workshop, with dataset, prizes and all the proper shiny things. Stay tuned!


Read Full Post »

Ever itched to do machine learning and data mining on streams? On huge, big data streams?

We have a solution for you!

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is currently in alpha stage and is developed at Yahoo Labs in Barcelona. It is released under the Apache Software License v2.

Thanks to everybody who made this release possible!

Read more here on Yahoo engineering.

Read Full Post »

Nice blog post on approximate quantiles by the guys behind Druid at Metamarkets. The basic technique they use is the histogram proposed by Ben-Haim & Tom-Tov for their Streaming Parallel Decision Tree.

Read the blog post.
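
For the curious, here is a minimal sketch of the kind of fixed-size streaming histogram that Ben-Haim & Tom-Tov describe: keep at most a fixed number of (centroid, count) bins, and whenever a new point pushes you over the limit, merge the two closest bins into their weighted average. The class name, the `max_bins` parameter, and the `quantile` method are my own choices for illustration. In particular, `quantile` uses a simplified linear interpolation between centroids, which is an assumption on my part and not necessarily what the paper or Druid does.

```python
import bisect
import random


class StreamingHistogram:
    """Fixed-size histogram in the style of Ben-Haim & Tom-Tov (illustrative sketch)."""

    def __init__(self, max_bins=32):
        self.max_bins = max_bins
        self.centroids = []  # bin centers, kept sorted
        self.counts = []     # number of points absorbed by each bin

    def update(self, value):
        """Add one point, merging the two closest bins if we exceed max_bins."""
        i = bisect.bisect_left(self.centroids, value)
        if i < len(self.centroids) and self.centroids[i] == value:
            self.counts[i] += 1
            return
        self.centroids.insert(i, value)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair of bins with the smallest gap between centroids.
        gaps = [b - a for a, b in zip(self.centroids, self.centroids[1:])]
        i = gaps.index(min(gaps))
        total = self.counts[i] + self.counts[i + 1]
        # The merged centroid is the count-weighted average of the pair.
        merged = (self.centroids[i] * self.counts[i]
                  + self.centroids[i + 1] * self.counts[i + 1]) / total
        self.centroids[i:i + 2] = [merged]
        self.counts[i:i + 2] = [total]

    def quantile(self, q):
        """Approximate the q-quantile (simplified interpolation, an assumption).

        Half of each bin's mass is treated as lying below its centroid, and
        values are interpolated linearly between neighbouring centroids.
        """
        if not self.centroids:
            raise ValueError("empty histogram")
        target = q * sum(self.counts)
        below = 0.0  # total count of all bins strictly before the current one
        prev_centroid, prev_mass = None, None
        for centroid, count in zip(self.centroids, self.counts):
            mass = below + count / 2
            if target <= mass:
                if prev_centroid is None:
                    return centroid
                frac = (target - prev_mass) / (mass - prev_mass)
                return prev_centroid + frac * (centroid - prev_centroid)
            prev_centroid, prev_mass = centroid, mass
            below += count
        return self.centroids[-1]


random.seed(0)
hist = StreamingHistogram(max_bins=64)
for _ in range(10_000):
    hist.update(random.random())
print(hist.quantile(0.5))  # should land near 0.5 for uniform data
```

The nice property is that memory stays bounded by `max_bins` no matter how long the stream runs, at the cost of some error in the estimated quantiles.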

Read Full Post »
