Check existence of a field in HDFS avro format using Pig/Python

598 views Asked by At

I have a set of files in HDFS stored in Avro format. Some of them have a column named id:int as follows

{
  "type" : "record",
  "name" : "metric",
  "fields" : [ {
    "name" : "timestamp",
    "type" : "long"
  }, {
    "name" : "id",
    "type" : "long"
  }, {
    "name" : "metric",
    "type" : "string"
  }, {
    "name" : "value",
    "type" : "double"
  } ]
}

I need to flag the files (output the file names) having "id" column. Is there a way to get it done using Pig/Python UDF / Pig streaming or embedded Pig in Python. I have used Python UDF with Pig but not sure about how to check the existence of a field. I will appreciate if anybody can post a small sample. Thanks in advance.

1

There are 1 answers

0
brandon.bell On

If Hadoop streaming will work, you can use the AvroAsTextInputFormat which will send one Avro datum (record) in JSON format to the map tasks. (http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html).

$ hadoop fs -ls avro-test
Found 1 items
-rw-r--r--   3 brandon.bell hadoop        548 2015-06-17 12:13 avro-test/twitter.avro

The following:

$ hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar -D mapred.reduce.tasks=0 -files avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar -libjars avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar -input avro-test -output avro-test-output -mapper org.apache.hadoop.mapred.lib.IdentityMapper -inputformat org.apache.avro.mapred.AvroAsTextInputFormat

outputs the JSON :

$ hadoop fs -cat avro-test-output/part-*
{"username": "miguno", "tweet": "Rock: Nerf paper, scissors is fine.", "timestamp": 1366150681} 
{"username": "BlizzardCS", "tweet": "Works as intended.  Terran is IMBA.", "timestamp": 1366154481}

The input files I tested on were found here.

A simple Python script as the mapper will be able to test the key/values for what you are looking for. To output the filename, you can use the environment variable that's set up with Streaming jobs. This should work unless it's changed with more recent versions. file_name = os.getenv('map_input_file')