Posts Tagged ‘pig’

And here it comes again as every year, Google Summer of Code!

As usual, I will be working on Apache Pig again this summer. However, this time I will be on the other side of the fence: I will be mentoring!

As I have already graduated and am no longer a student (shock!), I cannot take part as one (no way!). So I decided to mentor.

  • Pro: you need much less time to mentor compared to being a student.
  • Con: you don’t learn as much, nor have as much fun, nor get the same recognition, nor get paid as much. 😉

However, the great thing about mentoring is that you propose ideas you would like to see realized in the project, and then students make them real!
Indeed, the project I am mentoring this year is about an idea I had a while ago.
It happens very often (to me and to my colleagues as well) that you have a list of tuples and need to attach a unique number to each of them.
For example, when creating the lexicon for a collection of documents, you have a list of unique words appearing in the collection, and you want to transform each word into a numerical id to reduce memory usage and ease later processing.

A while ago I implemented a MapReduce algorithm for this task that scales very well (i.e. no single reducer that does all the work).
However, this same problem comes up very often, and it is cumbersome to fit a MapReduce job in the middle of a Pig pipeline while you are building the script iteratively (read: you don’t know what you are doing). It is even more cumbersome because the code I wrote is not general enough to be applied as is, so each time I need to customize it a bit.
“Indeed, why the hell does Pig not do it already?” was my reaction. So PIG-2353 was born.
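To make the idea concrete, here is a minimal sketch of one way to assign unique ids without funnelling everything through a single worker. It is an assumption about the general technique (count per partition, prefix-sum the counts, then number locally), not the MapReduce code mentioned above nor what PIG-2353 ended up implementing. The partitions are simulated with plain Python lists, and all function names are hypothetical.

```python
from itertools import accumulate


def count_partitions(partitions):
    """Pass 1: each partition reports only its record count."""
    return [len(part) for part in partitions]


def starting_offsets(counts):
    """Exclusive prefix sum: partition i starts after all earlier records."""
    return [0] + list(accumulate(counts))[:-1]


def assign_ids(partitions):
    """Pass 2: each partition numbers its own records independently,
    starting from its offset, so no single worker sees all the data."""
    offsets = starting_offsets(count_partitions(partitions))
    return [
        [(offset + i, record) for i, record in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


# Illustrative data: three partitions of a word list.
words = [["apple", "pig"], ["hadoop"], ["latin", "tuple", "bag"]]
for part in assign_ids(words):
    print(part)
```

Only the small list of counts has to be gathered in one place; the records themselves never leave their partition.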

PIG-2353 is quite a stretch for a night’s coding, so it stayed there for a while, until a student got interested in it for GSoC. Enter Allan Avendaño.

You can also follow Allan’s progress on his blog.
He has already got his first patch in (PIG-2583) and started working on the main project this week.
I expect great stuff to come out of this project!

I am ready for a great summer, “flipping bits not burgers”.

Read Full Post »

Pig committer

Starting from today, I am officially a Pig committer!

The Pig PMC has decided to “promote” me to committer after 2 years of continued involvement in the platform.

I am really excited to be part of this great community, and to receive this acknowledgement of my work.

I will do my best to contribute to the project and make Pig a better platform!

Cheers!

Update: the updated “Who we are” Web page

Read Full Post »

Actually it should not be an update, but a wrap-up, as I have basically finished my project for this year. My last patch already got a +1 and is just waiting for the tests to finish before being committed.

I completed my selected tasks PIG-1926 and PIG-1904 (see my previous post for an explanation of what they do), plus some more small fixes here and there: PIG-2156, PIG-2136, PIG-2060, PIG-2026, PIG-2025, PIG-2024.

I also gave some longer-term ideas on how to refactor the grammar to make it safer and easier to modify, and on some new features: PIG-2138, PIG-2123, PIG-2119, PIG-2047.

However, given that I still have one month left before the official end of GSoC, I will tackle the rest of the “Sugar” projects listed on the Pig GSoC page, which means adding syntax support for Tuple/Map/Bag conversions: PIG-1387.

All my fixes will go in Pig 0.10, as 0.9 has already been branched and will be out very soon.

Working on the front end has been a very interesting and enriching experience.

  • I got to learn how to use ANTLR (my mentor called me an “ANTLR expert” :P).
  • I learned how Pig scripts are compiled and how to work with the logical, physical and mapreduce levels.
  • I have a full understanding of the workflow and the dataflow of the operators in Pig. I am sure this will come in handy in the future.
  • I also increased my proficiency in Pig Latin scripting.
  • Finally, I got to use git seriously and really appreciate it. It makes working on different patches at the same time a breeze.
See you in a month for the actual wrap-up!

Read Full Post »

git to svn patch

After discovering git I practically fell in love with it.
So I decided to use the git Apache mirror for Pig for this year’s GSoC.
One problem I found is that the ASF (Apache Software Foundation) uses svn (Subversion) patches, but git by default produces a slightly different diff format (file paths are prefixed with a/ and b/) that is not readily understood by the patch utility. A simple workaround is to use the --no-prefix option of git diff (using -p1 instead of -p0 in the patch command should also work).
To make it completely transparent that I am using a different repository, I also keep a separate pristine tree checked out of svn and always up to date with trunk. To try my modifications, I simply check out the branch I want to test, generate a patch, apply it on the fly to the pristine svn tree, and run ant test while I continue working in the git tree. To generate the final patch to submit, I resort to svn again.

git checkout PIG-XXXX
git diff trunk --no-prefix | patch -p0 -d ../pigpristine/
cd ../pigpristine/ && svn diff > PIG-XXXX.patch

In the snippet I assume the git and svn trees are siblings in the filesystem, and that the svn tree is called pigpristine.
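The relation between the diff prefix and the patch strip level can be illustrated with a tiny sketch. The function below only mimics the path-stripping part of patch -pN (it is a simplification, not how patch is implemented), and the file path is just an illustrative example.

```python
def strip_components(path, count):
    """Mimic patch -pN: drop the first `count` leading path components."""
    parts = path.split("/")
    return "/".join(parts[count:])


# Header path as written by a default git diff (note the a/ prefix).
git_default = "a/src/org/apache/pig/Main.java"
# Header path as written by git diff --no-prefix.
git_no_prefix = "src/org/apache/pig/Main.java"

# Both combinations resolve to the same file inside the working tree.
print(strip_components(git_default, 1))
print(strip_components(git_no_prefix, 0))
```

Either pairing works as long as the strip level matches the number of extra leading components in the diff headers.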

Read Full Post »

My proposal for this year’s Google Summer of Code (GSoC) has been accepted!
This year I will be working on Apache Pig again.
Last year I worked on the backend and on improving performance. This year, instead, I will work on the front end and on improving usability. I will implement a couple of “syntactic sugar” features for Pig Latin.

  • Variable argument for SAMPLE and LIMIT. (PIG-1926)
    Currently, SAMPLE and LIMIT only take a constant argument. It would be better to be able to use a variable (scalar) in place of a constant.
  • Default SPLIT destination. (PIG-1904)
    SPLIT partitions a relation into two or more relations.
    It would be useful to have a default destination for tuples that are not assigned to any other relation, in a fashion similar to a switch/case/default statement.

These features are simple but quite useful. My proposal outlines some interesting use cases.
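The intended semantics of the two features can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not Pig code and not the syntax the features will use. The function names are hypothetical, and the sketch assumes that a tuple may land in more than one destination when several conditions hold, with the default destination receiving only the tuples that match none.

```python
def split_with_default(rows, conditions):
    """Route each row to every branch whose predicate holds; rows that
    match no predicate go to the 'default' branch (like switch/default)."""
    outputs = {name: [] for name in conditions}
    outputs["default"] = []
    for row in rows:
        matched = [name for name, pred in conditions.items() if pred(row)]
        for name in matched:
            outputs[name].append(row)
        if not matched:
            outputs["default"].append(row)
    return outputs


def limit(rows, n):
    """Keep at most n rows; n may be any value computed at run time."""
    return rows[:n]


rows = [1, 5, 10, 20]
branches = split_with_default(
    rows,
    {"small": lambda x: x < 5, "mid": lambda x: 3 < x < 15},
)
print(branches)

# The bound is derived from the data instead of being a literal constant.
print(limit(rows, len(rows) // 2))
```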

This year I will be mentored by Thejas Nair. I am very happy to be able to contribute again to this very interesting open source project.

It’s a pity I didn’t start GSoCing earlier, as this will be my last year (blame my memory: in my first year as a PhD student I missed the deadline by 3 days…).

Read Full Post »

GSoC wrap-up

GSoC 2010 is over!

It was a great experience. It was my first time contributing to a top-class project. I must say that I was a bit worried at the beginning: understanding a big and complex project like Apache Pig is not an easy task.

I managed to get my project done and I passed the final evaluation. I want to thank my mentor Daniel Dai for the support he gave me during the project and for his patience. The result of my efforts was a 10x improvement in the speed of the comparator I worked on. This translated to a ~20% improvement on the target queries in the standard PigMix2 benchmark suite. Neat!
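For readers wondering why a binary comparator can be so much faster, here is a minimal sketch of the general idea: compare the serialized bytes directly instead of deserializing both keys first. The key format below (two signed 32-bit integers) is purely hypothetical; it is not Pig's tuple serialization, and this is not the code that went into the project.

```python
import struct

BIAS = 2**31  # shifts signed 32-bit values into the unsigned range


def encode_key(first, second):
    """Serialize two signed 32-bit ints so that plain byte order
    matches numeric order (hypothetical format)."""
    return struct.pack(">II", first + BIAS, second + BIAS)


def decode_key(raw):
    """Inverse of encode_key."""
    a, b = struct.unpack(">II", raw)
    return a - BIAS, b - BIAS


def compare_decoded(raw_a, raw_b):
    """Slow path: deserialize both keys, then compare the tuples."""
    a, b = decode_key(raw_a), decode_key(raw_b)
    return (a > b) - (a < b)


def compare_raw(raw_a, raw_b):
    """Fast path: compare the serialized bytes directly."""
    return (raw_a > raw_b) - (raw_a < raw_b)


keys = [encode_key(2, 7), encode_key(-5, 3), encode_key(2, -1)]
print([decode_key(k) for k in sorted(keys)])
```

Both comparators give the same ordering, but the raw one skips object creation entirely, which matters when the comparison runs for every pair of records during a sort.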

All the code has already been integrated in Pig and will be out with the 0.8 release (branched a few days ago).

Now I am eager to get my hands on my GSoC t-shirt 🙂

Read Full Post »

Here is a little trick I had to learn while developing Apache Pig.

Pig uses JUnit as its test framework. JUnit tests are very useful for unit testing, but end-to-end testing is not as easy, even more so in the case of Pig, which uses Hadoop (a distributed MapReduce engine) to execute its scripts. The MiniCluster class addresses this issue: it simulates a full execution environment on the local machine, with HDFS and everything you need. More information here.

MiniCluster is very easy to use, assuming you are running your tests via ant. But if you want to debug and trace your test (using Eclipse, for instance) there are a couple of catches. Basically, you need to reproduce inside Eclipse the environment that the ant script builds.

The first thing to set is the hadoop.log.dir property, which tells Hadoop where to put logs. Its default value is build/test/logs. To set it, go to the Run Configurations screen, open the Arguments tab, and add this line to the VM arguments:

-Dhadoop.log.dir=build/test/logs

If you forget to set this, you will get a nice NullPointerException:

ERROR mapred.MiniMRCluster: Job tracker crashed
java.lang.NullPointerException
at java.io.File.<init>(File.java:222)
at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:151)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1617)
at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:106)
at java.lang.Thread.run(Thread.java:619)

The other thing to take care of is where to find MiniCluster’s configuration file. For Pig, you should first create it by running the ant test target once from the command line. This will create a standard minimal configuration file for your use in ${HOME}/pigtest/conf. Then add this directory to the classpath in the Classpath tab, under User Entries, using the Advanced… button.

If you forget to set this, you get a nice ExecException:

org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in 
 classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use 
 local mode, please put -x local option in command line
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
 at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
 at org.apache.pig.PigServer.<init>(PigServer.java:216)
 at org.apache.pig.PigServer.<init>(PigServer.java:205)
 at org.apache.pig.PigServer.<init>(PigServer.java:201)
 at org.apache.pig.test.TestSecondarySort.setUp(TestSecondarySort.java:73)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
 at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
 at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
 at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
 at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
 at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Even after this, you will still get some exceptions (regarding threads, manifest files, jars), but they are not a problem and debugging will work.
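If you keep forgetting one of the two settings, a small preflight check can help. The helper below is a hypothetical sketch, not part of Pig or Eclipse: it takes the VM arguments of your run configuration as a list of strings and the configuration directory, and reports which of the two steps above is missing. The file names it looks for are the ones mentioned in the error message above.

```python
import os


def check_minicluster_setup(vm_args, conf_dir):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # Step 1: the log directory must be passed as a JVM system property.
    if not any(arg.startswith("-Dhadoop.log.dir=") for arg in vm_args):
        problems.append("hadoop.log.dir is not set in the VM arguments")
    # Step 2: the directory added to the classpath must hold a config file.
    config_names = ("hadoop-site.xml", "core-site.xml")
    if not any(os.path.isfile(os.path.join(conf_dir, name))
               for name in config_names):
        problems.append("no Hadoop configuration file found in " + conf_dir)
    return problems


for problem in check_minicluster_setup(
    ["-Dhadoop.log.dir=build/test/logs"],
    os.path.expanduser("~/pigtest/conf"),
):
    print("WARNING:", problem)
```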

Hope this helps!

Read Full Post »

GSoC

My project has been accepted for Google Summer of Code!
I will be working on Pig; more specifically, I will implement a binary comparator for secondary sort.

GSoC project list

I am really excited about this opportunity to contribute!

Read Full Post »