I have a set of files in HDFS stored in Avro format. Some of them have an "id" column, as in this schema:
```json
{
  "type" : "record",
  "name" : "metric",
  "fields" : [ {
    "name" : "timestamp",
    "type" : "long"
  }, {
    "name" : "id",
    "type" : "long"
  }, {
    "name" : "metric",
    "type" : "string"
  }, {
    "name" : "value",
    "type" : "double"
  } ]
}
```
I need to flag the files (output the file names) that have the "id" column. Is there a way to do this with Pig, a Python UDF, Pig streaming, or embedded Pig in Python? I have used Python UDFs with Pig, but I am not sure how to check for the existence of a field. I would appreciate it if anybody could post a small sample. Thanks in advance.
If Hadoop streaming will work for you, you can use AvroAsTextInputFormat, which sends each Avro datum (record) to the map tasks as a line of JSON (http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html).
A job along the following lines (a sketch only; the streaming jar location, Avro jar versions, and HDFS paths are assumptions, so adjust them for your installation):
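```sh
# Sketch: jar locations, versions, and paths are assumptions.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -libjars avro-1.7.4.jar,avro-mapred-1.7.4.jar \
    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
    -input /data/metrics \
    -output /tmp/metrics-json \
    -mapper /bin/cat \
    -numReduceTasks 0
```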
outputs the JSON, one datum per line. Against the schema above, each line would look something like this (the values are made up for illustration):
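```json
{"timestamp": 1398821026, "id": 42, "metric": "cpu_load", "value": 0.75}
```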
The input files I tested on were found here.
A simple Python script as the mapper can then test the keys/values for the field you are looking for. To output the file name, you can use the environment variable that is set up for each Streaming task. This should work unless it has changed in more recent versions:
```python
file_name = os.getenv('map_input_file')
```
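Putting it together, here is a minimal mapper sketch. It assumes the JSON layout shown above, and it checks both the Hadoop 1.x and Hadoop 2.x names of the environment variable:

```python
#!/usr/bin/env python
# Streaming mapper sketch: reads one JSON-encoded Avro datum per line
# (as produced by AvroAsTextInputFormat) and emits the input file's
# name if any record contains an "id" field.
import json
import os
import sys

def main():
    # 'map_input_file' is the Hadoop 1.x name; Hadoop 2.x exposes the
    # same value as 'mapreduce_map_input_file', so check both.
    file_name = (os.getenv('map_input_file')
                 or os.getenv('mapreduce_map_input_file', 'unknown'))
    has_id = False
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        if 'id' in json.loads(line):
            has_id = True
        # keep reading stdin even after a hit so the framework does
        # not see a broken pipe from an early exit
    if has_id:
        print(file_name)

if __name__ == '__main__':
    main()
```

Because a big file can be split across several map tasks, the same name may be emitted more than once; running the job with a single identity reducer, or piping the collected output through `sort -u`, will deduplicate the list.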