Archive for the ‘Technology’ Category

My proposal for this year’s Google Summer of Code (GSoC) has been accepted!
Once again I will be working on Apache Pig.
Last year I worked on the backend and on improving performance; this year I will instead work on the front end and on improving usability, implementing a couple of “syntactic sugar” features for Pig Latin:

  • Variable arguments for SAMPLE and LIMIT (PIG-1926).
    Currently, SAMPLE and LIMIT only accept a constant argument. It would be better to be able to use a variable (a scalar) in place of the constant.
  • Default SPLIT destination (PIG-1904).
    SPLIT partitions a relation into two or more relations.
    It would be useful to have a default destination for tuples that are not assigned to any other relation, much like the default case of a switch statement.

These features are simple but quite useful. My proposal outlines some interesting use cases.

This year I will be mentored by Thejas Nair. I am very happy to be able to contribute again to this very interesting open source project.

It’s a pity I didn’t start GSoCing earlier, since this will be my last year (blame my memory: in my first year as a PhD student I missed the deadline by 3 days…).

Read Full Post »

Wikipedia Miner

I have been playing with Wikipedia Miner for my new research project. Wikipedia Miner is a toolkit that does many interesting things with Wikipedia. The feature I am using is “wikification”, that is, “the process of adding wiki links to specific named entities and other appropriate phrases in an arbitrary text.” It is a very useful way to enrich a text. “The process consists of automatic keyword extraction, word sense disambiguation, and automatically adding links to documents to Wikipedia.” In my case I am more interested in topic detection, so I only care about the first two phases: keyword extraction and disambiguation.

Even though the software is a great piece of work and the online demo works flawlessly, setting it up locally is a nightmare. The main reason is the very limited documentation: in particular, the Requirements section is missing all the version numbers for the required software.

To spare others the same ordeal, I write here what I discovered about setting up Wikipedia Miner.

  1. MySQL. You can use any version, but beware if you use version 4: varchars longer than 255 characters get automatically converted to the smallest text type that can contain them. Because text columns cannot be fully indexed, you need to specify how much of the column to index; otherwise you will get this nice exception: “java.sql.SQLException: Syntax error or access violation message from server: BLOB/TEXT column used in key specification without a key length”. Therefore, to make it work, add the parts that specify the index length (the (300) key lengths in the snippets below) to WikipediaDatabase.java:150 and recompile.


    createStatements.put("anchor", "CREATE TABLE anchor ("
    + "an_text varchar(300) binary NOT NULL, "
    + "an_to int(8) unsigned NOT NULL, "
    + "an_count int(8) unsigned NOT NULL, "
    + "PRIMARY KEY (an_text(300), an_to), "
    + "KEY (an_to)) ENGINE=MyISAM DEFAULT CHARSET=utf8;") ;


    createStatements.put("anchor_occurance", "CREATE TABLE anchor_occurance ("
    + "ao_text varchar(300) binary NOT NULL, "
    + "ao_linkCount int(8) unsigned NOT NULL, "
    + "ao_occCount int(8) unsigned NOT NULL, "
    + "PRIMARY KEY (ao_text(300))) ENGINE=MyISAM DEFAULT CHARSET=utf8;");

  2. Connector/J. Use version 3.0.17 or set the property jdbcCompliantTruncation=false (see the sketch after this list). If you don’t, you will get a nice “com.mysql.jdbc.MysqlDataTruncation: Data truncation” exception.
  3. Weka. Use version 3.6.4; otherwise you will get deserialization exceptions when loading the models (in my case, “java.io.InvalidClassException: weka.classifiers.Classifier; local class incompatible: stream classdesc serialVersionUID = 6502780192411755341, local class serialVersionUID = 66924060442623804”).
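As an illustration of the Connector/J workaround, here is a minimal connection sketch with truncation checks disabled. The host, database name, and credentials are placeholders; adapt them to your setup.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class WikipediaMinerConnection {
        public static void main(String[] args) throws Exception {
            // Old Connector/J versions need the driver loaded explicitly.
            Class.forName("com.mysql.jdbc.Driver");
            // jdbcCompliantTruncation=false stops Connector/J from turning
            // MySQL truncation warnings into MysqlDataTruncation exceptions.
            String url = "jdbc:mysql://localhost:3306/wikipedia"
                       + "?jdbcCompliantTruncation=false";
            Connection conn = DriverManager.getConnection(url, "user", "password");
            System.out.println("Connected: " + !conn.isClosed());
            conn.close();
        }
    }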

So far I didn’t have any problems with trove or servlet-api.

This confirms a well-known fact: one of the biggest problems of open source is the lack of documentation. I should learn from this experience as well 🙂

I hope this spares somebody some hours of work.

And kudos to David Milne for creating and releasing Wikipedia Miner as open source.
This is the proper way to do science!

Read Full Post »

Il computer non è una macchina intelligente che aiuta le persone stupide, anzi è una macchina stupida che funziona solo nelle mani delle persone intelligenti.

(A computer is not an intelligent machine which helps stupid people, rather it’s a stupid machine that only works in the hands of intelligent people.)

Umberto Eco

Read Full Post »

Are you really aware of the price you are paying for Web commodities?

So I can only imagine the reaction in the boardrooms of those traditional firms when Facebook and Google built their Psychographic Marketing Honeypots and disguised them as a social network and a search engine. “All that data we’ve worked so hard to source! Merde! People just sit there all day giving it to them!”

The world has changed though, hasn’t it? We have entered the Matrix, but it’s not our body heat they want. They want the preference model encoded in our amygdala and a list of all the people that might influence that model tomorrow.

You can read the whole post here at O’Reilly Radar: Amygdala FarmVille.

Read Full Post »

Save the last RSS

I have already explained why I don’t have a Facebook account.
Well, I do use RSS feeds, and I find them very useful.
I don’t want to give up control over my interest list, and I don’t want to be tracked by some proprietary platform every time I access my daily feeds. There is no added value in that. If I want to be social, I can opt in to social media, share links, and so on… but my privacy is more valuable.

If RSS isn’t saved now, if browser vendors don’t realise the potential of RSS to save users a whole bunch of time and make the web better for them, then the alternative is that I will have to have a Facebook account, or a Twitter account, or some such corporate-controlled identity, where I have to “Like” or “Follow” every website’s partner account that I’m interested in, and then have to deal with the privacy violations and problems related with corporate owned identity owning a list of every website I’m interested in (and wanting to monetise that list), and they, and every website I’m interested in, knowing every other website I’m interested in following, and then I have to log in and check this corporate owned identity every day in order to find out what’s new on other websites, whilst I’m advertised to, because they are only interested in making the biggest and the best walled garden that I can’t leave.

IF RSS DIES, WE LOSE THE ABILITY TO READ IN PRIVATE

Continue reading here

Read Full Post »

GSoC wrap-up

GSoC 2010 is over!

It was a great experience, and my first time contributing to a top-class project. I must say I was a bit worried at the beginning: understanding a big and complex project like Apache Pig is not an easy task.

I managed to get my project done and passed the final evaluation. I want to thank my mentor Daniel Dai for the support he gave me during the project and for his patience. The result of my efforts is a 10x speed improvement in the comparator I worked on, which translated into a ~20% improvement on the target queries of the standard PigMix2 benchmark suite. Neat!

All the code has already been integrated into Pig and will be out with the 0.8 release (branched a few days ago).

Now I am eager to get my hands on my GSoC t-shirt 🙂

Read Full Post »

Here is a little trick I had to learn while developing Apache Pig.

Pig uses JUnit as its test framework. JUnit is very useful for unit testing, but end-to-end testing is not as easy, even more so in the case of Pig, which uses Hadoop (a distributed MapReduce engine) to execute its scripts. The MiniCluster class addresses this issue: it simulates a full execution environment on the local machine, with HDFS and everything else you need. More information here.

MiniCluster is very easy to use, assuming you run your tests via ant. But if you want to debug and trace your tests (using Eclipse, for instance) there are a couple of catches: basically, you need to reproduce inside Eclipse the environment that the ant script builds.
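For reference, this is roughly the pattern Pig’s own tests follow to run against MiniCluster. It is only a sketch: class and method names may differ slightly between Pig versions, and 'input' is a placeholder path.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.test.MiniCluster;

    public class MiniClusterExample {
        public static void main(String[] args) throws Exception {
            // Start an in-process cluster that simulates HDFS and MapReduce.
            MiniCluster cluster = MiniCluster.buildCluster();
            // Point PigServer at the simulated cluster instead of a real one.
            PigServer pig = new PigServer(ExecType.MAPREDUCE, cluster.getProperties());
            pig.registerQuery("A = LOAD 'input' AS (f1:int, f2:chararray);");
            pig.shutdown();
            cluster.shutDown();
        }
    }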

The first thing to set is the hadoop.log.dir property, which tells Hadoop where to put its logs; its default value is build/test/logs. To set it, open the Run Configurations screen, go to the Arguments tab, and add this line to the VM arguments:

-Dhadoop.log.dir=build/test/logs

If you forget to set this, you will get a nice NullPointerException:

ERROR mapred.MiniMRCluster: Job tracker crashed
java.lang.NullPointerException
at java.io.File.<init>(File.java:222)
at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:151)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1617)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:106)
at java.lang.Thread.run(Thread.java:619)

The other thing to take care of is where MiniCluster’s configuration file is found. For Pig, you should first create it by running the ant test target once from the command line; this creates a standard minimal configuration file in ${HOME}/pigtest/conf. Then add this directory to the classpath: in the Classpath tab, select User Entries and use the Advanced… button.

If you forget to set this, you get a nice ExecException:

org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in 
 classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use 
 local mode, please put -x local option in command line
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
 at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
 at org.apache.pig.PigServer.<init>(PigServer.java:216)
 at org.apache.pig.PigServer.<init>(PigServer.java:205)
 at org.apache.pig.PigServer.<init>(PigServer.java:201)
 at org.apache.pig.test.TestSecondarySort.setUp(TestSecondarySort.java:73)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
 at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
 at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
 at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Even after this you will still get some exceptions (regarding threads, manifest files, and jars), but they are harmless and debugging will work.

Hope this helps!

Read Full Post »

I was watching Jeff Dean’s keynote presentation at the ACM Symposium on Cloud Computing 2010 (SOCC), held yesterday, and I found this very interesting bit of information. It is so useful that every computer scientist and engineer should learn it by heart!

Operation                                Time (ns)
L1 cache reference                             0.5
Branch mispredict                                5
L2 cache reference                               7
Mutex lock/unlock                               25
Main memory reference                          100
Compress 1 KB with Zippy                     3,000
Send 2 KB over a 1 Gbps network             20,000
Read 1 MB sequentially from memory         250,000
Round trip within the same datacenter      500,000
Disk seek                               10,000,000
Read 1 MB sequentially from disk        20,000,000
Send packet CA -> Netherlands -> CA    150,000,000

These numbers give you some insight into why random reads from a disk are a really bad idea.
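A quick back-of-the-envelope check with the numbers above: reading 1 MB from disk as 256 random 4 KB reads costs at least 256 seeks × 10 ms ≈ 2.6 seconds, while a single sequential 1 MB read takes about 20 ms, more than 100 times faster.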

This piece of information complements the very nice image from Adam Jacobs and his excellent article “The Pathologies of Big Data”.

[Image: comparison of random and sequential read speeds for memory, SSD, and disk]

Random is BAD (and SSD is NOT going to solve the problem)

What should we learn from all this stuff?

  • Do your back-of-the-envelope calculations
  • Do avoid random operations
  • Do benchmark your system

Read Full Post »

GSoC

My project has been accepted for Google Summer of Code!
I will be working on Pig; more specifically, I will implement a binary comparator for secondary sort.
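The core idea of a binary comparator is to order records by looking directly at their serialized bytes, skipping deserialization entirely. As a rough illustration, here is essentially the raw comparator Hadoop ships for Text keys (just a sketch of the concept, not the actual Pig patch):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.io.WritableUtils;

    // Orders serialized Text keys without deserializing them first.
    public class RawTextComparator extends WritableComparator {
        protected RawTextComparator() {
            super(Text.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // A serialized Text is a vint length followed by UTF-8 bytes:
            // skip the length header and compare the raw UTF-8 bytes,
            // which preserves the lexicographic order of the strings.
            int n1 = WritableUtils.decodeVIntSize(b1[s1]);
            int n2 = WritableUtils.decodeVIntSize(b2[s2]);
            return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
        }
    }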

GSoC project list

I am really excited about this opportunity to contribute!

Read Full Post »

Freedom in the Cloud

A.K.A. why I do not have a Facebook profile (actually, why Facebook does not have a profile on me)

The cloud means that we can’t even point in the direction of the server anymore and because we can’t even point in the direction of the server anymore we don’t have extra technical or non-technical means of reliable control over this disaster in slow motion. You can make a rule about logs or data flow or preservation or control or access or disclosure but your laws are human laws and they occupy particular territory and the server is in the cloud and that means the server is always one step ahead of any rule you make or two or three or six or poof! I just realized I’m subject to regulation, I think I’ll move to Oceana now.

Which means that in effect, we lost the ability to use either legal regulation or anything about the physical architecture of the network to interfere with the process of falling away from innocence that was now inevitable in the stage I’m talking about, what we might call late Google stage 1.

It is here, of course, that Mr. Zuckerberg enters.

The human race has susceptibility to harm but Mr. Zuckerberg has attained an unenviable record: he has done more harm to the human race than anybody else his age.

Because he harnessed Friday night. That is, everybody needs to get laid and he turned it into a structure for degenerating the integrity of human personality and he has to a remarkable extent succeeded with a very poor deal. Namely, “I will give you free web hosting and some PHP doodads and you get spying for free all the time”. And it works.

That’s the sad part, it works.

How could that have happened?

There was no architectural reason, really. There was no architectural reason really. Facebook is the Web with “I keep all the logs, how do you feel about that?” It’s a terrarium for what it feels like to live in a panopticon built out of web parts.

And it shouldn’t be allowed. It comes to that. It shouldn’t be allowed. That’s a very poor way to deliver those services. They are grossly overpriced at “spying all the time”. They are not technically innovative. They depend upon an architecture subject to misuse and the business model that supports them is misuse. There isn’t any other business model for them. This is bad.

I’m not suggesting it should be illegal. It should be obsolete. We’re technologists, we should fix it.

I’m glad I’m with you so far. When I come to how we should fix it later I hope you will still be with me because then we could get it done.

But let’s say, for now, that that’s a really good example of where we went wrong and what happened to us because. It’s trickier with gmail because of that magical untouched by human hands-iness. When I say to my students, “why do you let people read your email”, they say “but nobody is reading my email, no human being ever touched it. That would freak me out, I’d be creeped out if guys at Google were reading my email. But that’s not happening so I don’t have a problem.”

Now, this they cannot say about Facebook. Indeed, they know way too much about Facebook if they let themselves really know it. You have read the stuff and you know. Facebook workers know who’s about to have a love affair before the people do because they can see X obsessively checking the Facebook page of Y.

Like a lot of unfreedom, the real underlying social process that forces this unfreedom along is nothing more than perceived convenience.

Read the full story here

Read Full Post »
