How to parse a delimited column of a CSV file and make them be part of CSV file by Python

72 views Asked by At

I have a CSV file with the following format/content and would like to see an easier way to further parse the col3 content (delimited) to include them to CSV file by using python.

I'm new to this and the simple looping approach should be working but I would like to know any easier and faster way to implement this.

From:

col1,col2,col3,col4 
1,"David","Job=Sales Manager;Hobby=reading;Sex=Male","31"
2,"Mary","Job=Nurse;Hobby=hiking;Sex=Female","23"

to:

col1,col2,Job,Hobby,Sex,col4 
1,"David","Sales Manager","reading","Male","31"
2,"Mary","Nurse","hiking","Female","23"
2

There are 2 answers

0
Psidom On

You can use pandas library which helps deal with tabular data in a pretty easy way:

import pandas as pd
df = pd.read_csv("xxx.csv")

new_df = pd.concat([df.drop('col3', axis=1), 
                    df.col3.apply(lambda s: pd.Series(dict(tuple(p.split('=')) for p in s.split(";"))))], 
                    axis=1)

enter image description here

To write out as .csv, simply call to_csv(): new_df.to_csv("newXXX.csv")

0
arlyon On

This is a simple class base approach with a parse function and an output function.

import csv

class Person:

    def __init__(self, string):
        self.attributes = {}
        data = string.split(",")
        self.attributes["id"] = data[0]
        self.attributes["name"] = data[1]
        self.attributes["age"] = data[3]

        self.parse_data(data[2])

    def parse_data(self, data):
        for attr in data.split(";"):
            entry = attr.split("=")
            self.attributes[entry[0]] = entry[1]

    def return_data(self):
        return ','.join(self.attributes.values())

input = '''1,"David","Job=Sales Manager;Hobby=reading;Sex=Male","31"
2,"Mary","Job=Nurse;Hobby=hiking;Sex=Female","23"'''

people = []

for line in input.split("\n"):
    person = Person(line)
    people.append(person)

print(','.join(people[0].attributes.keys())) # print the keys

for person in people:
    print(person.return_data()) # print the data

Lightweight (and relatively easy to use), I left reading an writing of csv files out of it. This will return the columns in a consistent format. You will notice however some punctuation that hasn't been taken care of. That can be easily fixed too.

Let me know if this approach works for you.