I'm using the MRJob module for Python 2.7. I have created a class that inherits from MRJob and have correctly mapped everything using the inherited mapper function.
The problem is that I would like the reducer function to output a .csv file. Here is the code for the reducer:
    def reducer(self, geo_key, info_list):
        info_list.insert(0, ['Name,Age,Gender,Height'])
        for set in info_list:
            yield set
Then I run it from the command line: python -m map_csv <inputfile.txt> outputfile.csv
I keep getting this error, and don't really understand why:
Counters from step 1:
Unencodable output:
TypeError: 785
The info_list parameter in the reducer is simply a list of lists whose values match the types in the header, i.e.:
    [
        ['Bill', 28, 'Male', 75],
        ['Emily', 16, 'Female', 56],
        ['Jason', 21, 'Male', 63]
    ]
Any idea what the problem is here? Thanks!
To manage input and output formats in mrjob, you need to use protocols. Luckily, there is an existing package which implements a CSV protocol that you could use: https://pypi.python.org/pypi/mr3px
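To make the idea of a protocol concrete, here is a minimal sketch of a CSV output protocol. It relies on mrjob's protocol interface, where a protocol is an object with read(line) and write(key, value) methods. The class name SimpleCsvProtocol is hypothetical, and this is not the mr3px implementation. It is written for Python 3 and assumes an mrjob version whose protocols read and write bytes.

```python
import csv
import io


class SimpleCsvProtocol(object):
    """Hypothetical protocol: ignores the key and treats the value as one CSV row."""

    def read(self, line):
        # Decode one line of bytes and parse it as a single CSV row.
        # The key is always None; every field comes back as a string.
        text = line.decode('utf-8')
        row = next(csv.reader([text]))
        return None, row

    def write(self, key, value):
        # Serialize the value (a list or tuple of fields) as one CSV row.
        # No line terminator is added here (assumption: mrjob appends the newline).
        buf = io.StringIO()
        csv.writer(buf, lineterminator='').writerow(value)
        return buf.getvalue().encode('utf-8')


if __name__ == '__main__':
    print(SimpleCsvProtocol().write(None, ['Bill', 28, 'Male', 75]))
```

To try it in a job, you would set OUTPUT_PROTOCOL = SimpleCsvProtocol on the job class and have the reducer yield (None, row) pairs, since mrjob expects a key and a value from each yield.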
Import the package in your job script, specify the protocol in your job class, and then just yield your list (or tuple) of fields.
Note that you cannot reliably add a header row to this output because Hadoop will use several reducers to generate the output in parallel.
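If you still need a header, one workaround is to add it after the job has finished, when the per-reducer output files are combined. The sketch below assumes the job output has been copied to a local directory containing Hadoop-style part-* files. The function name merge_parts_with_header and the paths in the usage comment are hypothetical.

```python
import glob
import os


def merge_parts_with_header(parts_dir, out_path, header):
    """Write the header once, then append the rows of every part-* file."""
    part_paths = sorted(glob.glob(os.path.join(parts_dir, 'part-*')))
    with open(out_path, 'w') as out:
        out.write(','.join(header) + '\n')
        for path in part_paths:
            with open(path) as part:
                for line in part:
                    # Guard against a final line that lacks a newline.
                    out.write(line if line.endswith('\n') else line + '\n')
    return len(part_paths)


# Example call (hypothetical paths):
# merge_parts_with_header('job_output', 'outputfile.csv',
#                         ['Name', 'Age', 'Gender', 'Height'])
```

Row order across part files is whatever order the reducers produced, so sort afterwards if the order matters.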
To use this package on EMR, you'll need to install it during the instance bootstrap phase by adding an item to the bootstrap section of your config.
Disclaimer: I am the maintainer of the mr3px package, which is forked from mr3po.