Here is a little trick I had to learn while developing Apache Pig.
Pig uses JUnit as its test framework. JUnit works well for unit tests, but end-to-end testing is harder, all the more so for Pig, which uses Hadoop (a distributed MapReduce engine) to execute its scripts. The MiniCluster class addresses this issue: it simulates a full execution environment on the local machine, with HDFS and everything you need. More information here.
MiniCluster is very easy to use, assuming you are running your tests via ant. But if you want to debug and trace your test (using Eclipse, for instance) there are a couple of catches. Basically, you need to reproduce inside Eclipse the environment that the ant script builds.
The first thing to set is the hadoop.log.dir property, which tells Hadoop where to put its logs. Its default value is build/test/logs. To set it, go to the Run Configurations screen, Arguments tab, and add this line to the VM arguments:
-Dhadoop.log.dir=build/test/logs
If you forget to set this, you will get a nice NullPointerException:
ERROR mapred.MiniMRCluster: Job tracker crashed
java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:151)
    at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1617)
    at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:183)
    at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:106)
    at java.lang.Thread.run(Thread.java:619)
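If you would rather not depend on every run configuration carrying the VM argument, the same property can in principle be set from code. The following is only a sketch under assumptions: HadoopLogDirSetup and ensureLogDir are hypothetical names, not part of Pig or Hadoop, and the call has to run before MiniCluster is built (for example from a static initializer at the top of the test class), which I have not verified for every Pig test.

```java
/** Hypothetical helper, not part of Pig or Hadoop. */
final class HadoopLogDirSetup {
    static final String PROPERTY = "hadoop.log.dir";
    // Same value as the VM argument shown above.
    static final String DEFAULT_DIR = "build/test/logs";

    private HadoopLogDirSetup() {}

    /**
     * Sets hadoop.log.dir to the default value unless it is already set
     * (for instance through -Dhadoop.log.dir=...), and returns the
     * value that is in effect afterwards.
     */
    static String ensureLogDir() {
        String current = System.getProperty(PROPERTY);
        if (current == null || current.isEmpty()) {
            current = DEFAULT_DIR;
            System.setProperty(PROPERTY, current);
        }
        return current;
    }
}
```

An explicit -Dhadoop.log.dir value always wins here, so the run configuration described above keeps working unchanged.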
The other thing to take care of is where to find MiniCluster's configuration file. For Pig, you should first create it by running the ant test target once from the command line. This will create a standard minimal configuration file for your use in ${HOME}/pigtest/conf. To make it visible to your test, add this directory to the classpath in the Classpath tab, under User Entries, using the Advanced… button.
If you forget to set this, you get a nice ExecException:
org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use local mode, please put -x local option in command line
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
    at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
    at org.apache.pig.PigServer.<init>(PigServer.java:216)
    at org.apache.pig.PigServer.<init>(PigServer.java:205)
    at org.apache.pig.PigServer.<init>(PigServer.java:201)
    at org.apache.pig.test.TestSecondarySort.setUp(TestSecondarySort.java:73)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
    at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
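To check the classpath entry before launching a long debugging session, you can look up the two file names from the error message yourself. This is a rough sketch, not what Pig does internally: HadoopConfCheck and findHadoopConfig are hypothetical names, and the only real API used is ClassLoader.getResource from the JDK.

```java
import java.net.URL;

/** Hypothetical pre-flight check, not part of Pig. */
final class HadoopConfCheck {
    // The two file names mentioned in the ERROR 4010 message.
    static final String[] CONFIG_FILES = {"hadoop-site.xml", "core-site.xml"};

    private HadoopConfCheck() {}

    /**
     * Returns the URL of the first configuration file visible to the
     * given class loader, or null if neither file is on its classpath.
     */
    static URL findHadoopConfig(ClassLoader loader) {
        for (String name : CONFIG_FILES) {
            URL url = loader.getResource(name);
            if (url != null) {
                return url;
            }
        }
        return null;
    }
}
```

Calling HadoopConfCheck.findHadoopConfig(Thread.currentThread().getContextClassLoader()) from a test and printing the result shows which directory the file was picked up from; a null result means the directory is missing from User Entries.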
Even after this, you will still get some exceptions (regarding threads, manifest files, jars), but they are not a problem and debugging will work.
Hope this helps!