Posts Tagged ‘data mining’

Ever itched to do machine learning and data mining on streams? On huge, big data streams?

We have a solution for you!

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

SAMOA is currently in alpha stage and is developed at Yahoo Labs in Barcelona. It is released under the Apache Software License v2.

Thanks to everybody who made this release possible!

Read more on Yahoo Engineering.

Read Full Post »

Say you have a stream of items, of large and unknown length, that you can only iterate over once. Design an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

Algorithms Every Data Scientist Should Know: Reservoir Sampling
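The classic answer is reservoir sampling with a reservoir of size one (Knuth's Algorithm R): keep the i-th item with probability 1/i. A minimal single-pass, O(1)-memory sketch (class and method names are mine, not from the post):

```java
import java.util.Iterator;
import java.util.Random;

public class ReservoirSample {
    // Pick one item uniformly at random from a stream of unknown length,
    // in a single pass: replace the current choice with probability 1/i
    // when the i-th item arrives.
    public static <T> T sample(Iterator<T> stream, Random rng) {
        T chosen = null;
        int seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (rng.nextInt(seen) == 0) { // true with probability 1/seen
                chosen = item;
            }
        }
        return chosen;
    }
}
```

Uniformity follows because item i is picked at step i with probability 1/i and survives each later step j with probability 1 − 1/j, and the product telescopes to 1/n for every item.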

Read Full Post »

On Data Science:

The word Data tells you that I transform raw information into actionable information. The word Scientist emphasizes my commitment to making sure that the analyses my colleagues and I produce are verifiable and repeatable—as all good science should be.

I am not sure I agree with the whole argument in the post, but this definition of data science is the best I have seen so far.

Melinda Thielbar

“Any field of study followed by the word ‘science’, so goes the old wheeze, is not really a science, including computer science, climate science, police science, and investment science.”—Ray Rivera, Forbes Magazine

I too have engaged in my fair share of hand-wringing over “data science”, how the term is used and mis-used, the high quantity of snake oil available, and some generally sloppy practices that seem to be becoming the norm in the internet’s new data-based gold rush.

However, as my mama used to say, “I can beat up on my brothers all I want, but you, sir, are not family.”

Data, harnessed for good, is going to transform our world and the way we do business. People who understand data, the mathematics of how data streams relate to each other, and how computers interact with that data, are going to be indispensable to this process. I don’t always…


Read Full Post »

Grokking data

  • When you have been exploring a dataset for a while, studying its distribution, its composition, its quirks, its innards, and ultimately its “essence”.
  • When you know the answer to a query before even running it.
  • When you have a strict hierarchical organization of the folders for plots.
  • When you know by heart the number of unique URLs and usernames in the dataset.
  • When your methodology for naming files according to their schema has become more complex than the schemas themselves.
  • When you have five different versions of the dataset, but you forgot the reason behind four of them.
  • When the size of the scripts to analyze the dataset begins to rival that of the dataset itself.
  • Ultimately, when the mere thought of putting your hands on that data again gives you urticaria.
That’s when you grokked the data.
Now imagine doing that on hundreds of gigabytes…

Read Full Post »

Wikipedia Miner

I have been playing with Wikipedia Miner for my new research project. Wikipedia Miner is a toolkit that does many interesting things with Wikipedia. The one I am using is “wikification”, that is, “The process of adding wiki links to specific named entities and other appropriate phrases in an arbitrary text.” It is a very useful procedure for enriching a text. “The process consists of automatic keyword extraction, word sense disambiguation, and automatically adding links to documents to Wikipedia”. In my case, I am more interested in topic detection, so I only care about the first two phases (keyword extraction and word sense disambiguation).

Even though the software is a great piece of work, and the online demo works flawlessly, setting it up locally is a nightmare, mainly because of the very limited documentation: the Requirements section is missing the version numbers for all the required software.

To spare others the same ordeal, here is what I discovered about setting up Wikipedia Miner.

  1. MySQL. You can use any version, but beware of version 4: varchars longer than 255 characters get automatically converted to the smallest text field that can contain them. Because text fields cannot be fully indexed, you need to specify how much of them to index; otherwise you will get this nice exception: “java.sql.SQLException: Syntax error or access violation message from server: BLOB/TEXT column used in key specification without a key length”. Therefore, to make it work, add the index lengths (the “an_text(300)” and “ao_text(300)” parts in the keys below) to WikipediaDatabase.java:150 and recompile.

    createStatements.put("anchor", "CREATE TABLE anchor ("
    + "an_text varchar(300) binary NOT NULL, "
    + "an_to int(8) unsigned NOT NULL, "
    + "an_count int(8) unsigned NOT NULL, "
    + "PRIMARY KEY (an_text(300), an_to), "
    + "KEY (an_to)) ENGINE=MyISAM DEFAULT CHARSET=utf8;");

    createStatements.put("anchor_occurance", "CREATE TABLE anchor_occurance ("
    + "ao_text varchar(300) binary NOT NULL, "
    + "ao_linkCount int(8) unsigned NOT NULL, "
    + "ao_occCount int(8) unsigned NOT NULL, "
    + "PRIMARY KEY (ao_text(300))) ENGINE=MyISAM DEFAULT CHARSET=utf8;");

  2. Connector/J. Use version 3.0.17, or set the property jdbcCompliantTruncation=false. If you don’t, you will get a nice “com.mysql.jdbc.MysqlDataTruncation: Data truncation” exception.
  3. Weka. Use version 3.6.4, otherwise you will get deserialization exceptions when loading the models (in my case “java.io.InvalidClassException: weka.classifiers.Classifier; local class incompatible: stream classdesc serialVersionUID = 6502780192411755341, local class serialVersionUID = 66924060442623804”).
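For step 2, if you go the property route, jdbcCompliantTruncation can be appended directly to the JDBC connection URL; the host, port, and database name below are placeholders for your own setup:

```
jdbc:mysql://localhost:3306/wikipedia?jdbcCompliantTruncation=false
```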

So far I haven’t had any problems with trove and servlet-api.

This confirms a well-known fact: one of the biggest problems of open source is the lack of documentation. I should learn from this experience myself 🙂

I hope this spares somebody a few hours of work.

And kudos to David Milne for creating and releasing Wikipedia Miner as open source.
This is the proper way to do science!

Read Full Post »

Data Intensive Scalable Computing
DISC = Data Intensive Scalable Computing

ML = Machine Learning
DM = Data Mining
IR = Information Retrieval
DS = Distributed Systems
DB = Databases

Read Full Post »