There are two silences. One when no word is spoken. The other when perhaps a torrent of language is being employed.

Harold Pinter

Substitute “information” for “word” and you get censorship in one case and information overload in the other. Two sides of the same coin?

The WordPress.com stats helper monkeys prepared a 2013 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 20,000 times in 2013. If it were a concert at Sydney Opera House, it would take about 7 sold-out performances for that many people to see it.

Because that’s how you get on the bestseller list. You promise the moon and stars, you say everything you heard before was wrong, and you blame everything on one thing. You get a scapegoat; it’s classic. Atkins made a fortune with that formula. We’ve got Rob Lustig saying it’s all fructose; we’ve got T. Colin Campbell saying it’s all animal food; we now have Perlmutter saying it’s all grain. There’s either a scapegoat or a silver bullet in almost every bestselling diet book.

The recurring formula is apparent: Tell readers it’s not their fault. Blame an agency; typically the pharmaceutical industry or U.S. government, but also possibly the medical establishment. Alluding to the conspiracy vaguely will suffice. Offer a simple solution. Cite science and mainstream research when applicable; demonize it when it is not.

Is complexity too scary to be the subject matter of a bestseller?

via This Is Your Brain on Gluten – James Hamblin – The Atlantic.

LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously. In fact, when you think about any business, the underlying mechanics are almost always a continuous process—events happen in real-time, as Jack Bauer would tell us. When data is collected in batches, it is almost always due to some manual step or lack of digitization or is a historical relic left over from the automation of some non-digital process. Transmitting and reacting to data used to be very slow when the mechanics were mail and humans did the processing. A first pass at automation always retains the form of the original process, so this often lingers for a long time.

Production “batch” processing jobs that run daily are often effectively mimicking a kind of continuous computation with a window size of one day. The underlying data is, of course, always changing. These were actually so common at LinkedIn (and the mechanics of making them work in Hadoop so tricky) that we implemented a whole framework for managing incremental Hadoop workflows.

Seen in this light, it is easy to have a different view of stream processing: it is just processing which includes a notion of time in the underlying data being processed and does not require a static snapshot of the data so it can produce output at a user-controlled frequency instead of waiting for the “end” of the data set to be reached. In this sense, stream processing is a generalization of batch processing, and, given the prevalence of real-time data, a very important generalization.


(emphasis added by me)

via The Log: What every software engineer should know about real-time data’s unifying abstraction | LinkedIn Engineering.
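To make the "window size of one day" remark concrete, here is a minimal Python sketch of my own (it is not from the article). It assumes events are hypothetical `(timestamp, payload)` pairs that arrive in timestamp order, and the name `windowed_counts` is made up for illustration. The same function behaves like a daily batch job or like a finer-grained stream job depending only on the window size you pass in.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[int, str]

DAY = 24 * 60 * 60
HOUR = 60 * 60


def windowed_counts(events: Iterable[Event], window_seconds: int) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, event_count) each time a window closes.

    Assumes events arrive in non-decreasing timestamp order.
    Windows that contain no events are skipped rather than emitted as zero.
    """
    current_start = None
    count = 0
    for timestamp, _payload in events:
        # Align the event to the start of the window it falls into.
        start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The previous window is complete: emit it without waiting
            # for the "end" of the data set.
            yield current_start, count
            current_start, count = start, 0
        count += 1
    if current_start is not None:
        # Flush the last, possibly still open, window.
        yield current_start, count


events = [(0, "a"), (30, "b"), (3600, "c"), (86399, "d"), (86400, "e")]

daily = list(windowed_counts(events, DAY))    # "batch" view: one result per day
hourly = list(windowed_counts(events, HOUR))  # same logic, more frequent output
```

The only thing that changes between the two calls is how often output is produced, which is the sense in which the daily batch job is a special case of the streaming computation.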

Christmas is approaching, even in Korea :)

I am happy to announce that the SNOW workshop on Social News on the Web will be running again this year for its second edition. We will be co-located with WWW ’14 in Seoul, Korea. Quoting the workshop page:

The workshop provides an interdisciplinary forum to bring together researchers and professionals working in several fields including journalism, computer science, and social science to present novel ideas and discuss future directions in this scenario.

The call for papers is already out. This year we will also have a data challenge linked to the workshop, with a dataset, prizes, and all the proper shiny things. Stay tuned!



