I have the following data frame (df):
mut gene pvalue chrom
1:23456_A>G 0.005 chr1
2:28484_A>G 0.0001 chr2
4:47629_A>G 0.05 chr4
3:88382_A>G 0.00001 chr3
10:88273_A>G 0.005 chr10
[30 rows x 4 columns]
I am trying to create four labelled columns from the "mut" column of df and assign them to a newly created data frame, df_new, that looks like this:
chr st ref alt
1 23456 A G
2 28484 A G
4 47629 A G
The resulting data frame (df_new) is basically an extraction of the mut column from df followed by a separation of each part of the string, i.e. split(":"), then split("_"), and finally split(">"). This leaves the 4 parts of the original field (1 23456 A G), each of which is placed into its own column.
Here is my attempt:
df_new["chr"], df_new["st"], df_new["ref"], df_new["alt"] = df.mut.str.split("[:_>]")
but I end up with the following error message:
ValueError: too many values to unpack (expected 4)
A simple print statement reveals the result of this line of code:
df.mut.str.split("[:_>]")
as:
0 [1, 23456, A, G]
1 [2, 28484, A, G]
.
.
.
Is there a way to solve this in pandas, i.e. to create a new data frame by separating the string field into 4 columns with their column labels included?
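Let's try passing expand=True to str.split, which returns a data frame with one column per piece instead of a series of lists:
df.mut.str.split("[:_>]", expand=True)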
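For context, the original attempt most likely fails because unpacking a Series iterates over its rows, so Python sees 30 values where 4 names are expected. Here is a minimal sketch of the expand=True approach, assuming a sample df rebuilt from the five rows shown in the question (only the mut column is reproduced) and that st should end up as an integer:

```python
import pandas as pd

# Sample rows copied from the question; only the "mut" column matters here.
df = pd.DataFrame({
    "mut": ["1:23456_A>G", "2:28484_A>G", "4:47629_A>G",
            "3:88382_A>G", "10:88273_A>G"],
})

# A multi-character pattern is treated as a regular expression, so the
# character class splits on ":", "_" or ">". expand=True returns a
# DataFrame with one column per piece instead of a Series of lists.
df_new = df["mut"].str.split(r"[:_>]", expand=True)

# The expanded columns are numbered 0..3, so label them explicitly.
df_new.columns = ["chr", "st", "ref", "alt"]

# Every piece is still a string; convert the position if numbers are needed.
df_new["st"] = df_new["st"].astype(int)

print(df_new)
```

The chr column is left as a string here, since chromosome labels such as X or Y would not convert to integers.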