Pandas read_csv() with multiple delimiters not working


I have a csv file I'm trying to read as a pandas dataframe. I need to skip the first 19 rows of comments; the headers are in the 20th row and the data in subsequent rows. The only issue is that the header row starts with a '#', which shifts the headers over. The rest of the data is delimited with spaces. For some reason, sep=r'#|\s+' introduces two 'Unnamed' columns into the dataset.

Raw Data Input (row number shown):
01|# comments...
02|# comments...
03|# comments...
.
.
.
19|# comments...
20|# Header1 Header2 Header3
21|Data1 Data2 Data3

Code:

df = pd.read_csv(df_path, skiprows=19, sep=r'#|\s+', engine='python', encoding='utf-8')

Output df:

Unnamed:0 Unnamed:1 Header1
Data1 Data2 Data3

Desired Output df:

Header1 Header2 Header3
Data1 Data2 Data3

How can I handle the extra '#' in the header row without running into this issue? I've also tried using

sep=r'[#|\s+]'
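As context for the question: both patterns split the header line into two leading empty fields (the '#' and the space after it each act as a delimiter), which is exactly what pandas surfaces as 'Unnamed' columns. A quick re.split check on the sample header makes this visible:

```python
import re

header = "# Header1 Header2 Header3"

# Alternation: '#' OR a run of whitespace each delimit,
# so '#' and the space after it produce two empty fields.
print(re.split(r"#|\s+", header))
# → ['', '', 'Header1', 'Header2', 'Header3']

# Character class: matches any single '#', '|', whitespace, or '+',
# which splits the sample header the same way.
print(re.split(r"[#|\s+]", header))
# → ['', '', 'Header1', 'Header2', 'Header3']
```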

There are 2 answers

Answer from Atul sanwal:
import pandas as pd

# Define the path to your CSV file
df_path = 'your_file.csv'

# Read the header line (row 20) manually and strip its leading '#';
# with comment='#' alone, pandas would discard the header row entirely.
with open(df_path, encoding='utf-8') as f:
    for _ in range(20):
        header_line = f.readline()
names = header_line.lstrip('#').split()

# Read the data rows on whitespace; comment='#' drops all comment lines
df = pd.read_csv(df_path, sep=r'\s+', engine='python', encoding='utf-8',
                 comment='#', header=None, names=names)

# Print the DataFrame
print(df)
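As a self-contained check, the sample layout from the question can be written to a temporary file, the header line read manually, and the data parsed with comment='#' dropping the comment rows (the sample rows below are stand-ins for the real file):

```python
import os
import tempfile
import pandas as pd

# Write the question's layout to a temporary file:
# 19 comment lines, a '#'-prefixed header on line 20, then one data row.
rows = ["# comment"] * 19 + ["# Header1 Header2 Header3", "Data1 Data2 Data3"]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("\n".join(rows) + "\n")
    path = f.name

# Read the header line (row 20) manually and strip the leading '#'
with open(path, encoding="utf-8") as f:
    for _ in range(20):
        header_line = f.readline()
names = header_line.lstrip("#").split()

# comment='#' discards every comment line, leaving only the data rows
df = pd.read_csv(path, sep=r"\s+", engine="python", comment="#",
                 header=None, names=names)
os.unlink(path)
print(df)
```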
Answer from e-motta:
# From OP
df = pd.read_csv(
    df_path, skiprows=19, sep=r"#|\s+", engine="python", encoding="utf-8"
)
# Get the column names, minus the two unwanted 'Unnamed' ones
new_cols = df.columns[2:]
# Drop the two trailing columns (NaN-filled, since data rows have fewer fields)
df = df.iloc[:, : len(df.columns) - 2]
# Rename the columns
df.columns = new_cols

This will result in:

  Header1 Header2 Header3
0   Data1   Data2   Data3
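This fix can also be verified without a file on disk; io.StringIO stands in for df_path, and the sample rows mirror the question's layout:

```python
import io
import pandas as pd

# In-memory stand-in for the file: 19 comment lines, a '#'-prefixed
# header on line 20, then one data row.
raw = "\n".join(["# comment"] * 19
                + ["# Header1 Header2 Header3", "Data1 Data2 Data3"]) + "\n"

df = pd.read_csv(io.StringIO(raw), skiprows=19, sep=r"#|\s+", engine="python")
new_cols = df.columns[2:]               # the real header names
df = df.iloc[:, : len(df.columns) - 2]  # keep only the columns holding data
df.columns = new_cols
print(df)
```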