How can I separate a text file into columns, but only for specific lines?

122 views Asked by At

I have a .psa file that I need manipulated, but to my knowledge the file needs to be in a .txt or .csv to manipulate using pandas. To solve this issue, I am reading the original file and writing the contents into another .txt file to which my code will apply to.

The original .psa file has text data in it, all separated by commas. I am trying to organize this data into columns and only pulling the data I need. Each line has 30+ values separated by commas, but I only need the 3rd value to put into a column.
I will have a zip folder which needs this code to run through and do the same thing to each file within the folder. Each file will have a different store number in the title.

Example:

filename: 1 Area 2 - store 15 group.psa

prod,123,456,abc,def, etc...
pla,124,uhj,jop,etc. 
prod,321,789,ghi,jkl, etc...
...

Expectation: I want to only take out the third item in the lines that start with prod and put it into a .csv file. I want to also keep the title of the original file in another column (would be very nice if it is only the store number included on there, but not a necessity). Ex.

nums store #
456 15
789 15

Here is the code I have so far:

with open('1 Area 2 - store 15 group.psa','r') as firstfile, open('test.txt','a') as secondfile: 
    # read content from first file 
    for line in firstfile: 
         # append content to second file 
         secondfile.write(line)

file = pd.read_csv("test.txt", sep=',', usecols=[0,1,2], header=0, names=['col 1','col 2','col 3'])
file.to_csv("output.csv", index=False) 

This code is able to give me the columns as output, but the rows end up including lines that don't start with prod and I have 3 columns instead of just the num column (get an error when I only do [usecols=2], so the data is still messy and I have no idea how to get the title of the original file in the second column.

3

There are 3 answers

11
Nick On BEST ANSWER

You can do everything you want inside pandas without writing to a temporary file; just read in the first and third columns, filter on the first column value and then drop it; finally add the store number (extracted from the filename) as the second column:

import re

fname = '1 Area 2 - store 15 group.psa'
df = pd.read_csv(fname, usecols=[0, 2], header=None, names=['type', 'num'])
store = re.search(r'store\s+(\d+)', fname).group(1)
df = df[df['type'] == 'prod'].drop(columns='type').assign(store=store)
df.to_csv("output.csv", index=False) 

Output (for your sample data):

num,store
456,15
789,15
0
Sean Dickson On

Here's a way to do it in pandas that ends with a dataframe with your exact specified column headers...

import pandas as pd

# Your store file
store_file = '1 Area 2 - store 15 group.psa'

# You can use pandas to read your .psa as a csv
df = pd.read_csv(store_file, usecols=[0, 2], header=None, names=['type', 'num'])
df = df[df['type'] == 'prod']

# You don't need to keep the "type" column
df.drop(columns='type', inplace=True)

# Assuming the names of your "store files" are systematically named, this should work for storing the store number...
store = store_file.split(' ')[5] 
df['store #'] = store

# Save df as file if desired...
df.to_csv('out.csv')
0
Zach Young On

You can do this without Pandas, if you want.

The following reads the PSA (as a CSV, because it looks like a CSV to me) into a whole list of rows, then moves that list through a succession of list comprehensions to filter the rows by "prod", cut just the 3rd column, then append the store number. If you have any questions about what each operation actually does you can print out the intermediate results:

def process_psa(fname: str):
    m = re.search(r"store-(\d{1,2})", fname)
    assert m, f"{r"store-(\d{1,2})"} did not match {fname}"
    store_no = m.group(1)

    with open(fname, newline="") as f:
        reader = csv.reader(f)
        rows = list(reader)


    filtered = [x for x in rows if x[0] == "prod"]

    cut = [x[2:3] for x in filtered]  # keep a slice of the 3rd field

    final = [x + [store_no] for x in cut]


    with open(f"output-{store_no}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["nums", "store #"])
        writer.writerows(final)


for fname in glob.glob("*.psa"):
    process_psa(fname)

I mocked up another PSA file, which I labelled as store 23:

prod,654,111,abc,def,etc...
pla,834,888,uhj,jop,etc...
prod,765,523,ghi,jkl,etc...

Running that script with input-store-15.psa and input-store-23.psa I ended up with:

output-15.csv

nums,store #
456,15
789,15

and

output-23.csv

nums,store #
111,23
523,23