Splitting CSV file into multiple sheets in an Excel file based on row limit argument

Question

2.7k views Asked by ramses1592 At 04 September 2017 at 06:28

Hi, I am trying to run a utility script I found on GitHub: https://gist.github.com/Athmailer/4cdb424f03129248fbb7ebd03df581cd

Update 1: I modified the logic so that, rather than splitting the CSV into multiple CSVs, I create a single Excel file with multiple sheets containing the splits. Below is my code:

import os
import csv
import openpyxl
import argparse

def find_csv_filenames( path_to_dir, suffix=".csv" ):
    filenames = os.listdir(path_to_dir)
    return [ filename for filename in filenames if filename.endswith( suffix ) ]

def is_binary(filename):
    """
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if '\0' in block:
                return True
    return False

def split(filehandler, delimiter=',', row_limit=5000,
    output_name_template='.xlsx', output_path='.', keep_headers=True):

    class MyDialect(csv.excel):
        def __init__(self, delimiter=','):
            self.delimiter = delimiter
        lineterminator = '\n'

    my_dialect = MyDialect(delimiter=delimiter)
    reader = csv.reader(filehandler, my_dialect)

    index = 0
    current_piece = 1

    # Create a new Excel workbook
    # Create a new Excel sheet with name Split1
    current_out_path = os.path.join(
         output_path,
         output_name_template
    )
    wb = openpyxl.Workbook()
    ws = wb.create_sheet(index=index, title="Split" + str(current_piece))
    current_limit = row_limit

    if keep_headers:
        headers = reader.next()
        ws.append(headers)

    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            ws = wb.create_sheet(index=index, title="Split" + str(current_piece))
            if keep_headers:
                ws.append(headers)
        ws.append(row)

    wb.save(current_out_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Splits a CSV file into multiple pieces.',
                                     prefix_chars='-+')
    parser.add_argument('-l', '--row_limit', type=int, default=5000,
                        help='The number of rows you want in each output file. (default: 5000)')
    args = parser.parse_args()
    #Check if output path exists else create new output folder
    output_path='Output'
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    with open('Logger.log', 'a+') as logfile:
        logfile.write('Filename --- Number of Rows\n')
        logfile.write('#Unsplit\n')
        #Get list of all csv's in the current folder
        filenames = find_csv_filenames(os.getcwd())
        filenames.sort()
        rem_filenames = []
        for filename in filenames:
            if is_binary(filename):
                logfile.write('{} --- binary -- skipped\n'.format(filename))
                rem_filenames.append(filename)
            else:
                with open(filename, 'rb') as infile:
                    reader_file = csv.reader(infile,delimiter=";",lineterminator="\n")
                    value = len(list(reader_file))
                    logfile.write('{} --- {} \n'.format(filename,value))

        filenames = [item for item in filenames if item not in rem_filenames]
        filenames.sort()
        logfile.write('#Post Split\n')
        for filename in filenames:
            #try:
            with open(filename, 'rb') as infile:
                name = filename.split('.')[0]
                split(filehandler=infile,delimiter=';',row_limit=args.row_limit,output_name_template= name + '.xlsx',output_path='Output')

I have a folder called 'CSV Files' that contains many CSV files which need to be split. I keep this utility script in the same folder.

Running the script produces the following error:

Traceback (most recent call last):
  File "csv_split.py", line 96, in <module>
    split(filehandler=infile,delimiter=';',row_limit=args.row_limit,output_name_template= name + '.xlsx',output_path='Output')
  File "csv_split.py", line 57, in split
    ws.append(row)
  File "/home/ramakrishna/.local/lib/python2.7/site-packages/openpyxl/worksheet/worksheet.py", line 790, in append
    cell = Cell(self, row=row_idx, col_idx=col_idx, value=content)
  File "/home/ramakrishna/.local/lib/python2.7/site-packages/openpyxl/cell/cell.py", line 114, in __init__
    self.value = value
  File "/home/ramakrishna/.local/lib/python2.7/site-packages/openpyxl/cell/cell.py", line 294, in value
    self._bind_value(value)
  File "/home/ramakrishna/.local/lib/python2.7/site-packages/openpyxl/cell/cell.py", line 191, in _bind_value
    value = self.check_string(value)
  File "/home/ramakrishna/.local/lib/python2.7/site-packages/openpyxl/cell/cell.py", line 156, in check_string
    raise IllegalCharacterError
openpyxl.utils.exceptions.IllegalCharacterError

Can someone tell me whether I have to add another for loop that goes through each cell in the row and appends it to the sheet, or whether it can be done in a single go? Also, I seem to have made this logic rather clumsy; can it be optimized further?

Folder structure for your reference

There is 1 answer

**campovski** · Answer 1 · 2017-09-04T06:43:25+00:00

You must pass just the name of the file as a command-line argument:

python splitter.py 'Sports & Outdoors 2017-08-26'

Also, I tried running the above script, and no matter which CSV I run it on, it does not return the first line (which would normally be the header), even though keep_headers = True. Setting keep_headers = False, on the other hand, does print the header line, which is a bit counterintuitive.
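For comparison, here is a minimal sketch of what keep_headers is presumably meant to do: read the header row once, then repeat it at the top of every piece of at most row_limit data rows. This is not the gist's code. The function name `split_rows` and the sample data are hypothetical, and the sketch is written for Python 3, whereas the traceback in the question shows Python 2.7.

```python
import csv
import io
from itertools import islice


def split_rows(reader, row_limit, keep_headers=True):
    """Yield lists of rows, each holding at most row_limit data rows.

    When keep_headers is True, the first input row is treated as the
    header and repeated at the top of every piece.
    """
    headers = next(reader, None) if keep_headers else None
    while True:
        # Take the next row_limit rows; an empty list means the input is done.
        piece = list(islice(reader, row_limit))
        if not piece:
            break
        yield ([headers] + piece) if headers is not None else piece


# Hypothetical sample data: one header line and five data rows.
sample = io.StringIO("id;name\n1;a\n2;b\n3;c\n4;d\n5;e\n")
pieces = list(split_rows(csv.reader(sample, delimiter=";"), row_limit=2))
```

With a row limit of 2, the five data rows end up in three pieces, and each piece starts with the header row. Each piece could then be written to its own sheet or file.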

This script is meant to read a single CSV. To read every CSV in a directory, write another script that loops through all the files in that directory:

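import os
import splitter as sp

directory = '/your/directory'
# Collect only the CSV files in the directory
files = [f for f in os.listdir(directory) if f.endswith('.csv')]
for name in files:
    # os.listdir returns bare names, so join them with the directory
    with open(os.path.join(directory, name), 'r') as f:
        sp.split(f)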
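As for the IllegalCharacterError in the update: a likely cause, assuming the CSV cells contain control characters that Excel cannot store, is that openpyxl refuses such strings when a row is appended. A whole row can be cleaned in one pass before appending it, so no per-cell append loop is needed. The sketch below is hedged: the pattern `ILLEGAL_CHARS` and the helper `clean_row` are hypothetical names, the character ranges are my reading of what openpyxl rejects, and the code targets Python 3.

```python
import re

# Assumption: these ranges mirror the control characters openpyxl rejects
# (everything below 0x20 except tab, newline and carriage return).
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_row(row):
    """Return a copy of row with control characters removed from strings."""
    return [ILLEGAL_CHARS.sub("", cell) if isinstance(cell, str) else cell
            for cell in row]


# Hypothetical row containing a NUL byte, a legitimate tab and a number.
dirty = ["Sports\x00 & Outdoors", "ok\tvalue", 42]
cleaned = clean_row(dirty)
```

In the question's loop this would amount to calling ws.append(clean_row(row)) instead of ws.append(row). Whether silently dropping those characters is acceptable depends on the data; logging the affected file names first may be safer.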
