Randomly shuffle multiple dataframes

I have a corpus of 400 conversations between two people, stored as strings (or, more precisely, as plain text files). A small example of this might be:

my_textfiles = ['john: hello \nmary: hi there \njohn: nice weather \nmary: yes',
            'nancy: hello \nbill: hi there \nnancy: nice weather \nbill: yes',
            'ringo: hello \npaul: hi there \nringo: nice weather \npaul: yes',
            'michael: hello \nbubbles: hi there \nmichael: nice weather \nbubbles: yes',
            'steve: hello \nsally: hi there \nsteve: nice weather \nsally: yes']

In addition to speaker names, I have also noted each speaker's role in the conversation (leader or follower, depending on whether they are the first or second speaker). I then have a simple script that converts each conversation into a dataframe by separating the speaker ID from the content:

import pandas as pd
import re
import numpy as np
import random

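def convo_tokenize(tf):
    # Split on each newline whose following line contains a colon (i.e. starts a new turn).
    turnTokenize = re.split(r'\n(?=.*:)', tf)
    # Separate the speaker ID from the content at the first colon only.
    turnTokenize = [turn.split(':', 1) for turn in turnTokenize]
    dataframe = pd.DataFrame(turnTokenize, columns = ['speaker','turn'])

    return dataframe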

df_list = [convo_tokenize(tf) for tf in my_textfiles]

The resulting dataframes form the basis of a much longer piece of analysis. However, I would now like to shuffle speakers so that I create entirely random (and likely nonsensical) conversations. For instance, John, who is having a conversation with Mary in the first string, might be randomly assigned Paul (the second speaker in the third string). Crucially, I need to maintain the order of speech within each speaker. It is also important that, when randomly assigning new speakers, I preserve the leader/follower mix, so that I never create a conversation from two leaders or two followers.

To begin, my thinking was to create a standardized speaker label (where 1 = leader, 2 = follower), split each DF into one sub-DF per role, and store the sub-DFs in role-specific lists:

def speaker_role(dataframe):
    leader = dataframe['speaker'].iat[0]
    dataframe['sp_role'] = np.where(dataframe['speaker'].eq(leader), 1, 2)

    return dataframe

df_list = [speaker_role(df) for df in df_list]

leader_df = []
follower_df = []

for df in df_list:
    is_leader = df['sp_role'] == 1
    is_follower = df['sp_role'] != 1

    leader_df.append(df[is_leader])
    follower_df.append(df[is_follower])

I have worked out that I can now simply shuffle one of the two lists of sub-DFs, in this case follower_df:

follower_rand = random.sample(follower_df, len(follower_df)) 
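
One caveat with a plain shuffle is that a follower can land back at its original index and so be paired with its original leader. If that matters, a minimal sketch of one way around it is to reshuffle until no item keeps its position. The function name `shuffle_without_fixed_points` and the toy list of names are assumptions for illustration; with the real data the input would be follower_df.

```python
import random

def shuffle_without_fixed_points(items):
    """Return a shuffled copy in which no item stays at its original index."""
    if len(items) < 2:
        raise ValueError("need at least two items to move every one of them")
    while True:
        shuffled = random.sample(items, len(items))
        # Compare by identity, because DataFrames cannot be compared with ==.
        if all(a is not b for a, b in zip(items, shuffled)):
            return shuffled

# Toy stand-in for follower_df: plain names instead of sub-DFs.
followers = ['mary', 'bill', 'paul', 'bubbles', 'sally']
follower_rand = shuffle_without_fixed_points(followers)
print(follower_rand)
```

Roughly one random shuffle in three already has this property, so the loop is expected to finish after a few attempts even for 400 conversations.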

Having got to this stage I'm not sure where to turn next. I suspect I will need some sort of zip function, but am unsure exactly what. I'm also unsure how I go about merging the turns together such that they form the same dataframe structure I initially had. Assuming Ringo (leader) is randomly assigned to Bubbles (follower) for one of the DFs, I would hope to have something like this...

 speaker   |   turn   |   sp_role
------------------------------------
  ringo       hello          1
 bubbles     hi there        2
  ringo    nice weather      1
 bubbles     yes it is       2
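
One possible way to get that structure, offered as a sketch rather than a definitive solution: number each speaker's turns in their own order, concatenate the two sub-DFs, and sort by turn number and then by role, so that the turns alternate leader, follower, leader, follower. The names `toy`, `interleave`, `turn_no`, `leaders` and `followers` are hypothetical, and the toy frames only stand in for entries of leader_df and follower_df. It assumes the leader always speaks first; if one speaker has more turns than the other, the surplus turns simply trail at the end.

```python
import random
import pandas as pd

def toy(name, role, turns):
    # Hypothetical helper: builds a one-speaker frame shaped like the sub-DFs above.
    return pd.DataFrame({'speaker': name, 'turn': turns, 'sp_role': role})

def interleave(leader, follower):
    """Alternate leader and follower turns, keeping each speaker's own order."""
    # Number each speaker's turns 0, 1, 2, ... in their existing order.
    leader = leader.assign(turn_no=list(range(len(leader))))
    follower = follower.assign(turn_no=list(range(len(follower))))
    combined = pd.concat([leader, follower])
    # Sort by turn number, then role, so the leader (1) precedes the follower (2).
    combined = combined.sort_values(['turn_no', 'sp_role'])
    return combined.drop(columns='turn_no').reset_index(drop=True)

# Toy stand-ins for leader_df and follower_df.
leaders = [toy('john', 1, ['hello', 'nice weather']),
           toy('nancy', 1, ['hello', 'nice weather']),
           toy('ringo', 1, ['hello', 'nice weather'])]
followers = [toy('mary', 2, ['hi there', 'yes']),
             toy('bill', 2, ['hi there', 'yes']),
             toy('paul', 2, ['hi there', 'yes'])]

# Shuffle the followers, then pair each leader with the follower at the same index.
shuffled = random.sample(followers, len(followers))
mixed = [interleave(l, f) for l, f in zip(leaders, shuffled)]
print(mixed[0])
```

With the real data, the same pairing would be `zip(leader_df, follower_rand)`, and the result is again a list of dataframes with the speaker, turn and sp_role columns.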
