Archive for the ‘Technology’ Category

Apple Inc. employee Jef Raskin named the Macintosh line of personal computers after the McIntosh apple. He deliberately misspelled the name to avoid conflict with the hi-fi equipment manufacturer McIntosh Laboratory.


Read Full Post »

Big Data, the term

2012 was certainly the year “big data” went ballistic, and throughout 2013 and 2014 it became commonplace and commodified. It is so prevalent nowadays in both industry and academia that it has almost lost any meaning. But when did this trend start? Or, to be more concrete, when was the term “big data” coined?

It was certainly before 2010. One possible culprit is Randal E. Bryant, who also coined the term DISC (Data-Intensive Scalable Computing), which I prefer over “big data” to describe tools such as Hadoop; it’s just much more precise. However, that was as recent as 2008.

You might think that “big data” is a recent thing, or at least more recent than 2000. Well, think again. This paper by Diebold shows a few references from the ’90s. In particular, footnote 9 says:

On the academic side, Tilly (1984) mentions Big Data, but his article is not about the Big Data phenomenon and demonstrates no awareness of it; rather, it is a discourse on whether statistical data analyses are of value to historians. On the non-academic side, the margin comments of a computer program posted to a newsgroup in 1987 mention a programming technique called “small code, big data.” Fascinating, but off-mark. Next, Eric Larson provides an early popular-press mention in a 1989 Washington Post article about firms that assemble and sell lists to junk-mailers. He notes in passing that “The keepers of Big Data say they do it for the consumer’s benefit.” Again fascinating, but again off-mark. (See Eric Larson, “They’re Making a List: Data Companies and the Pigeonholing of America,” Washington Post, July 27, 1989.) Finally, a 1996 PR Newswire, Inc. release mentions network technology “for CPU clustering and Big Data applications…” Still off-mark, neither reporting on the Big Data phenomenon nor demonstrating awareness of it, instead reporting exclusively on a particular technology, the so-called high-performance parallel interface.

The best guess is that the term was coined in 1998 by John Mashey (retired former Chief Scientist at SGI), who produced a slide deck entitled “Big Data and the Next Wave of InfraStress”. The famous 3V’s of big data, however, came around 2001, introduced by Laney at Gartner.

Read Full Post »

It feels a bit strange to write this post’s title, because I don’t find myself defending Facebook very often. But there seems to be some discontent in the socialmediaverse at the moment over a new study in which Facebook data scientists conducted a large-scale–over half a million participants!–experimental manipulation on Facebook in order to show that emotional contagion occurs on social networks. The news that Facebook has been actively manipulating its users’ emotions has, apparently, enraged a lot of people…

In defense of Facebook | [citation needed].

Read Full Post »

LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously. In fact, when you think about any business, the underlying mechanics are almost always a continuous process—events happen in real-time, as Jack Bauer would tell us. When data is collected in batches, it is almost always due to some manual step or lack of digitization or is a historical relic left over from the automation of some non-digital process. Transmitting and reacting to data used to be very slow when the mechanics were mail and humans did the processing. A first pass at automation always retains the form of the original process, so this often lingers for a long time.

Production “batch” processing jobs that run daily are often effectively mimicking a kind of continuous computation with a window size of one day. The underlying data is, of course, always changing. These were actually so common at LinkedIn (and the mechanics of making them work in Hadoop so tricky) that we implemented a whole framework for managing incremental Hadoop workflows.

Seen in this light, it is easy to have a different view of stream processing: it is just processing which includes a notion of time in the underlying data being processed and does not require a static snapshot of the data so it can produce output at a user-controlled frequency instead of waiting for the “end” of the data set to be reached. In this sense, stream processing is a generalization of batch processing, and, given the prevalence of real-time data, a very important generalization.


(emphasis added by me)

via The Log: What every software engineer should know about real-time data’s unifying abstraction | LinkedIn Engineering.
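To make the quoted point a bit more concrete, here is a minimal sketch of my own (it is not taken from the LinkedIn post) of a tumbling-window count over a timestamped stream. The function name `tumbling_counts` and the `(timestamp, payload)` event shape are assumptions made for illustration, and the sketch assumes events arrive ordered by timestamp. The only knob is the window size: a small window behaves like stream processing, while a window of one day (or one that spans the whole data set) reproduces the classic batch job.

```python
from typing import Iterable, Iterator, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window: int
) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, event_count) each time a window closes.

    Illustrative only: assumes integer timestamps in seconds and events
    that arrive in timestamp order. Windows with no events are skipped.
    """
    current_start = None
    count = 0
    for ts, _payload in events:
        start = ts - ts % window  # start of the window this event falls in
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it without waiting
            # for the "end" of the data set.
            yield current_start, count
            current_start, count = start, 0
        count += 1
    if current_start is not None:
        # Flush the last (possibly partial) window when the input ends.
        yield current_start, count
```

For example, with `events = [(0, "a"), (10, "b"), (65, "c"), (70, "d"), (130, "e")]`, calling `list(tumbling_counts(events, 60))` emits one count per minute as each window closes, while `list(tumbling_counts(events, 86400))` produces a single daily total, which is exactly what a nightly batch job would compute.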

Read Full Post »

Ever itched to do machine learning and data mining on streams? On huge, big data streams?

We have a solution for you!

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is currently in alpha stage, and is developed at Yahoo Labs in Barcelona. It is released under the Apache Software License v2.

Thanks to everybody who made this release possible!

Read more on the Yahoo Engineering blog.

Read Full Post »

Nice blog post on approximate quantiles by the guys behind Druid at Metamarkets. The basic technique they use is the histogram proposed by Ben-Haim & Tom-Tov for their Streaming Parallel Decision Tree algorithm.

Read the blog post.
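For the curious, here is a rough sketch of how such a streaming histogram can work. It is my own simplified reading of the idea, not Druid's code: the class name `StreamingHistogram`, the `max_bins` parameter, and the method names are all made up for illustration. The update step follows the core trick (insert the new point as its own bin, then merge the two closest centroids whenever the bin budget is exceeded). The `quantile` method is a deliberately simplified linear interpolation between centroids, not the exact procedure from the paper.

```python
import bisect
from typing import List


class StreamingHistogram:
    """Fixed-size histogram of [centroid, count] bins, kept sorted by centroid."""

    def __init__(self, max_bins: int = 32) -> None:
        self.max_bins = max_bins
        self.bins: List[List[float]] = []

    def update(self, x: float) -> None:
        """Add one point, merging the two closest bins if over budget."""
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1.0  # exact hit on an existing centroid
        else:
            self.bins.insert(i, [x, 1.0])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self) -> None:
        # Find the adjacent pair with the smallest gap between centroids.
        j = min(
            range(len(self.bins) - 1),
            key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
        )
        (p1, m1), (p2, m2) = self.bins[j], self.bins[j + 1]
        total = m1 + m2
        # Replace the pair with a single bin at the count-weighted mean.
        self.bins[j:j + 2] = [[(p1 * m1 + p2 * m2) / total, total]]

    def quantile(self, q: float) -> float:
        """Simplified estimate: assume half of each bin's count lies on
        either side of its centroid and interpolate linearly in between."""
        if not self.bins:
            raise ValueError("empty histogram")
        target = q * sum(m for _, m in self.bins)
        cum = 0.0
        prev_p = prev_c = None
        for p, m in self.bins:
            mid = cum + m / 2.0  # cumulative count at this centroid
            if target <= mid:
                if prev_p is None:
                    return p  # below the first centroid: clamp
                frac = (target - prev_c) / (mid - prev_c)
                return prev_p + frac * (p - prev_p)
            prev_p, prev_c = p, mid
            cum += m
        return self.bins[-1][0]  # above the last centroid: clamp
```

Memory stays bounded by `max_bins` no matter how long the stream runs, which is what makes this kind of summary attractive for approximate quantiles. The price is accuracy: once bins start merging, the answers are estimates, and this sketch makes no claim about their error bounds.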

Read Full Post »

A lot of people seem to think performance is about doing the same thing, just doing it faster. That’s not what performance is all about. If you can do something really fast really well, people start using it differently.

Linus Torvalds (speaking about git)

Because more is not just more. More is different.

Read Full Post »

Against any law of physics, you can create entire worlds out of nothing!

Read Full Post »

In the future, thanks to the Grain, a chip which can be implanted on a hard drive in the brain, every single action that a person makes is recorded and may be played back. Liam, a lawyer, married with a child, suspects that his wife Fi is having a fling with the brash Jonas, whom they meet at a dinner party. After playing clips from his own ‘Grain’ his suspicions are confirmed, and he gets drunk and attacks Jonas, forcing the guilty pair to show him what is in their memory banks. In fact the affair has been going on for months and Liam attacks Jonas, demanding that he erase all memories of Fi from his brain. This, however, does not result in marital reconciliation.

The more I look at videos about Google Glass the more it reminds me of the Grain from “The Entire History of You”, Black Mirror s1e3.

I am not sure I share the enthusiasm with which Google speaks about its project. It’s definitely a technological wonder, but how will it affect personal relationships and society in general? What about split attention? Will we be able to handle the constant information flow?

Sometimes entrance barriers and usage fees (in terms of effort and time in this case) are useful. See for example the tragedy of the commons. How will the ability to check Facebook every second of your waking time change your life?

Read Full Post »

On Data Science:

The word Data tells you that I transform raw information into actionable information. The word Scientist emphasizes my commitment to making sure that the analyses my colleagues and I produce are verifiable and repeatable—as all good science should be.

Not sure I agree with the whole argument in the post, but the definition of data science is the best I have seen so far.

Melinda Thielbar

“Any field of study followed by the word “science”, so goes the old wheeze, is not really a science, including computer science, climate science, police science, and investment science.”—Ray Rivera, Forbes Magazine

I too have engaged in my fair share of hand-wringing over “data science”, how the term is used and mis-used, the high quantity of snake oil available, and some generally sloppy practices that seem to be becoming the norm in the internet’s new data-based gold rush.

However, as my mama used to say, “I can beat up on my brothers all I want, but you, sir, are not family.”

Data, harnessed for good, is going to transform our world and the way we do business. People who understand data, the mathematics of how data streams relate to each other, and how computers interact with that data, are going to be indispensable to this process. I don’t always…


Read Full Post »
