October 30, 2013

RESOLVED: Apache Pig with -tagsource/-tagFile option generates incorrect columns

Filed under: Tips, Web Content Pages | Webopius @ 4:42 pm

If you are using the fantastic Apache Hadoop & Pig tools to process large datasets, you may encounter situations where Pig Latin isn’t returning the columns you expect, particularly if you are using PigStorage with the ‘-tagsource’ or ‘-tagFile’ options to generate a pseudo first column containing the name of the file being processed.

Here’s an example. Consider this scenario:

A1 = load '/user/hduser/testdata' using PigStorage(',') as (col0, col1, col2, col3);
B1 = FOREACH A1 GENERATE $0, $2, $3;

As you’d expect, the output consists of col0, col2 and col3.

Now, if you change this slightly:

-- Remember to set pig.splitCombination to false if you are using the -tagFile option
set pig.splitCombination false;
A2 = load '/user/hduser/testdata' using PigStorage(',','-tagFile') as (filename, col0, col1, col2, col3);
B2 = FOREACH A2 GENERATE $1, $3, $4;

(Note that because the pseudo column ‘filename’ has been added, all other columns have moved one position to the right, so $0 in the previous example is now $1.)

What should happen is that exactly the same values appear as in the first example. What actually occurs (at least on my version 0.12.0 of Pig) is that you see results for the filename, col1 and col3 instead of col0, col2 and col3. This also occurs if you reference the columns by name in the FOREACH clause.
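If you want to pin down exactly what the second example should return before comparing it with Pig's output, the column shift can be modelled in a few lines of Python. This is only a sketch of the expected behaviour, not of Pig's internals: the file name 'part-0', the sample rows and the helper names tag_rows and project are all made up for illustration.

```python
import csv
import io


def tag_rows(filename, text, delimiter=","):
    """Model PigStorage(',', '-tagFile'): prepend the file name to every row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[filename] + row for row in reader]


def project(rows, positions):
    """Model FOREACH ... GENERATE $i, $j, ...: keep only the given positions."""
    return [[row[i] for i in positions] for row in rows]


# Hypothetical sample data standing in for /user/hduser/testdata
sample = "a0,a1,a2,a3\nb0,b1,b2,b3\n"

tagged = tag_rows("part-0", sample)        # filename, col0, col1, col2, col3
untagged = [row[1:] for row in tagged]     # col0, col1, col2, col3

expected_b1 = project(untagged, [0, 2, 3])  # first example: $0, $2, $3
expected_b2 = project(tagged, [1, 3, 4])    # second example: $1, $3, $4

print(expected_b1)
print(expected_b2)
```

Both projections should contain col0, col2 and col3 for every row. If the output Pig gives you for B2 differs from expected_b2, you are probably seeing the problem described here.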

Fixing unusual column behaviour in Pig

This odd behaviour can be resolved by launching Pig with the ColumnMapKeyPrune optimisation turned off. The -t (optimizer_off) command line flag disables the named optimizer rule, like this:

pig -x mapreduce -t ColumnMapKeyPrune

Running example 2 above with this option set produces the result you’d expect to see.
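To check whether the flag makes a difference on your own installation, it can help to run the same script twice, once with the rule disabled and once without, and compare the results. Here is a minimal sketch of a helper that builds the two command lines. The script name tagfile_test.pig and the function pig_command are hypothetical; the only Pig options used are the -x and -t flags shown above.

```python
def pig_command(script, disabled_rule=None, exec_mode="mapreduce"):
    """Build a pig command line, optionally turning one optimizer rule off."""
    cmd = ["pig", "-x", exec_mode]
    if disabled_rule is not None:
        # -t switches off the named optimizer rule
        cmd += ["-t", disabled_rule]
    cmd.append(script)
    return cmd


# Hypothetical script containing the second example
baseline = pig_command("tagfile_test.pig")
workaround = pig_command("tagfile_test.pig", disabled_rule="ColumnMapKeyPrune")

print(" ".join(baseline))
print(" ".join(workaround))
```

Either list can be passed to subprocess.run on a machine where pig is on the PATH. The sketch itself only prints the commands, so it is safe to run anywhere.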

You can see this documented on this site along with some other useful debugging tips.

If you’d like to analyse large data sets using products such as Hadoop, Pig and Hive, get in touch with Webopius to see how we can help.
