Archive for the ‘Technology’ Category

Apple Inc. employee Jef Raskin named the Macintosh line of personal computers after the McIntosh. He deliberately misspelled the name to avoid conflict with the hi-fi equipment manufacturer McIntosh Laboratory.


Read Full Post »

Big Data, the term

2012 was for sure the year “big data” went ballistic, and throughout 2013 and 2014 it became commonplace and commodified. It is so prevalent nowadays in both industry and academia that it has almost lost any meaning. But when did this trend start? Or, to be more concrete, when was the term “big data” coined?

It was for sure before 2010. One possible culprit is Randal E. Bryant, who also coined the term DISC (Data-Intensive Scalable Computing), which I prefer over “big data” to describe tools such as Hadoop, because it is much more precise. However, this happened only in 2008.

You might think that “big data” is a recent thing, or at least more recent than 2000. Well, think again. This paper by Diebold shows a few references from the ’90s. In particular, footnote 9 says:

On the academic side, Tilly (1984) mentions Big Data, but his article is not about the Big Data phenomenon and demonstrates no awareness of it; rather, it is a discourse on whether statistical data analyses are of value to historians. On the non-academic side, the margin comments of a computer program posted to a newsgroup in 1987 mention a programming technique called “small code, big data.” Fascinating, but off-mark. Next, Eric Larson provides an early popular-press mention in a 1989 Washington Post article about firms that assemble and sell lists to junk-mailers. He notes in passing that “The keepers of Big Data say they do it for the consumer’s benefit.” Again fascinating, but again off-mark. (See Eric Larson, “They’re Making a List: Data Companies and the Pigeonholing of America,” Washington Post, July 27, 1989.) Finally, a 1996 PR Newswire, Inc. release mentions network technology “for CPU clustering and Big Data applications…” Still off-mark, neither reporting on the Big Data phenomenon nor demonstrating awareness of it, instead reporting exclusively on a particular technology, the so-called high-performance parallel interface.

The best guess is that the term was coined in 1998 by John Mashey (former Chief Scientist at SGI, now retired), who produced a slide deck entitled “Big Data and the Next Wave of InfraStress”. However, the famous 3V’s of big data came around 2001, introduced by Laney at Gartner.

Read Full Post »

It feels a bit strange to write this post’s title, because I don’t find myself defending Facebook very often. But there seems to be some discontent in the socialmediaverse at the moment over a new study in which Facebook data scientists conducted a large-scale–over half a million participants!–experimental manipulation on Facebook in order to show that emotional contagion occurs on social networks. The news that Facebook has been actively manipulating its users’ emotions has, apparently, enraged a lot of people…

In defense of Facebook | [citation needed].

Read Full Post »

LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously. In fact, when you think about any business, the underlying mechanics are almost always a continuous process—events happen in real-time, as Jack Bauer would tell us. When data is collected in batches, it is almost always due to some manual step or lack of digitization or is a historical relic left over from the automation of some non-digital process. Transmitting and reacting to data used to be very slow when the mechanics were mail and humans did the processing. A first pass at automation always retains the form of the original process, so this often lingers for a long time.

Production “batch” processing jobs that run daily are often effectively mimicking a kind of continuous computation with a window size of one day. The underlying data is, of course, always changing. These were actually so common at LinkedIn (and the mechanics of making them work in Hadoop so tricky) that we implemented a whole framework for managing incremental Hadoop workflows.

Seen in this light, it is easy to have a different view of stream processing: it is just processing which includes a notion of time in the underlying data being processed and does not require a static snapshot of the data so it can produce output at a user-controlled frequency instead of waiting for the “end” of the data set to be reached. In this sense, stream processing is a generalization of batch processing, and, given the prevalence of real-time data, a very important generalization.


(emphasis added by me)

via The Log: What every software engineer should know about real-time data’s unifying abstraction | LinkedIn Engineering.
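To make the "daily batch job is just a stream with a one-day window" point concrete, here is a minimal sketch of my own (it is not from the quoted article). It assumes events are `(timestamp_seconds, key)` pairs that arrive in timestamp order; the function name `windowed_counts` and the event layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, key), sorted by timestamp.
Event = Tuple[int, str]


def windowed_counts(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts) every time a time window closes.

    The input may be an unbounded iterator: output is produced as soon as
    an event from a later window shows up, not at the "end" of the data.
    """
    current_window = None
    counts: Dict[str, int] = defaultdict(int)
    for ts, key in events:
        window = ts - ts % window_seconds  # start of the window containing ts
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)  # previous window is complete
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # Only reached for finite inputs: flush the last, still-open window.
        yield current_window, dict(counts)


events = [(0, "view"), (10, "view"), (70, "click"), (130, "view")]

# A "daily batch job": one window of 86400 seconds, a single output.
daily = list(windowed_counts(events, 86400))

# The same computation with a one-minute window: more frequent output.
per_minute = list(windowed_counts(events, 60))
```

The only thing that changes between the "batch" and the "streaming" run is the window size, which is the user-controlled output frequency the quote talks about.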

Read Full Post »

Ever itched to do machine learning and data mining on streams? On huge, big data streams?

We have a solution for you!

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is currently in Alpha stage, and is developed at Yahoo Labs in Barcelona. It is released under the Apache Software License v2.

Thanks to everybody who made this release possible!

Read more here on Yahoo Engineering.

Read Full Post »

Nice blog post on approximate quantiles by the guys behind Druid at Metamarkets. The basic technique they use is the histogram proposed by Ben-Haim & Tom-Tov for their Streaming Parallel Decision Tree.
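For the curious, here is a rough sketch of the idea as I understand it: keep at most a fixed number of (centroid, count) bins, and whenever a new point pushes the histogram over the limit, merge the two closest bins into their weighted average. This is my own toy version, not Druid's implementation. The class name `StreamingHistogram` is made up, and the `quantile` method is a deliberately simplified interpolation between centroids, not the exact procedure from the paper.

```python
import bisect
from typing import List


class StreamingHistogram:
    """Toy fixed-size histogram in the style of Ben-Haim & Tom-Tov."""

    def __init__(self, max_bins: int) -> None:
        self.max_bins = max_bins
        self.centroids: List[float] = []  # kept sorted
        self.counts: List[int] = []       # counts[i] belongs to centroids[i]

    def update(self, value: float) -> None:
        i = bisect.bisect_left(self.centroids, value)
        if i < len(self.centroids) and self.centroids[i] == value:
            self.counts[i] += 1           # exact hit on an existing bin
            return
        self.centroids.insert(i, value)   # otherwise add a new bin (value, 1)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self) -> None:
        # Find the pair of adjacent bins with the smallest gap...
        gaps = [b - a for a, b in zip(self.centroids, self.centroids[1:])]
        j = gaps.index(min(gaps))
        # ...and replace them with one bin at their weighted average.
        total = self.counts[j] + self.counts[j + 1]
        merged = (
            self.centroids[j] * self.counts[j]
            + self.centroids[j + 1] * self.counts[j + 1]
        ) / total
        self.centroids[j:j + 2] = [merged]
        self.counts[j:j + 2] = [total]

    def quantile(self, q: float) -> float:
        """Simplified estimate (an assumption of this sketch, not the paper's
        method): place each centroid at the middle rank of its bin and
        interpolate linearly between neighbouring centroids."""
        if not self.centroids:
            raise ValueError("empty histogram")
        target = q * sum(self.counts)
        ranks, seen = [], 0.0
        for c in self.counts:
            ranks.append(seen + c / 2.0)
            seen += c
        if target <= ranks[0]:
            return self.centroids[0]
        if target >= ranks[-1]:
            return self.centroids[-1]
        k = bisect.bisect_right(ranks, target)
        frac = (target - ranks[k - 1]) / (ranks[k] - ranks[k - 1])
        return self.centroids[k - 1] + frac * (
            self.centroids[k] - self.centroids[k - 1]
        )


hist = StreamingHistogram(max_bins=2)
for x in [1.0, 2.0, 10.0]:
    hist.update(x)
# The third point forces a merge of the two closest bins (1.0 and 2.0).
```

The memory footprint stays bounded by `max_bins` no matter how long the stream is, which is what makes this kind of summary attractive for approximate quantiles.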

Read the blog post.

Read Full Post »

A lot of people seem to think performance is about doing the same thing, just doing it faster. That’s not what performance is all about. If you can do something really fast really well, people start using it differently.

Linus Torvalds (speaking about git)

Because more is not just more. More is different.

Read Full Post »
