I have a corpus of 400 conversations, each between two people, stored as strings (more precisely, as plain text files). A small example might be:
my_textfiles = ['john: hello \nmary: hi there \njohn: nice weather \nmary: yes',
'nancy: hello \nbill: hi there \nnancy: nice weather \nbill: yes',
'ringo: hello \npaul: hi there \nringo: nice weather \npaul: yes',
'michael: hello \nbubbles: hi there \nmichael: nice weather \nbubbles: yes',
'steve: hello \nsally: hi there \nsteve: nice weather \nsally: yes']
In addition to speaker names, I have also noted each speaker's role in the conversation (leader or follower, depending on whether they are the first or second speaker). I then have a simple script that converts each conversation into a DataFrame by separating the speaker ID from the content:
import pandas as pd
import re
import numpy as np
import random
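def convo_tokenize(tf):
    # split on each newline that is followed by a line containing a colon
    turnTokenize = re.split(r'\n(?=.*:)', tf, flags=re.MULTILINE)
    # separate the speaker ID from the turn content at the first colon
    turnTokenize = [turn.split(':', 1) for turn in turnTokenize]
    dataframe = pd.DataFrame(turnTokenize, columns = ['speaker','turn'])
    return dataframe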
df_list = [convo_tokenize(tf) for tf in my_textfiles]
The resulting DataFrames then form the basis of a much longer piece of analysis. However, I would now like to shuffle speakers so that I create entirely random (and likely nonsensical) conversations. For instance, John, who is having a conversation with Mary in the first string, might be randomly paired with Paul (the second speaker in the third string). Crucially, I need to maintain the order of speech within each speaker. It is also important that, when randomly assigning new speakers, I preserve the leader/follower mix, so that I never create a conversation from two leaders or two followers.
To begin, my thinking was to create a standardized speaker label (where 1 = leader, 2 = follower), split each DataFrame into one sub-DataFrame per role, and store these in role-specific lists:
def speaker_role(dataframe):
    leader = dataframe['speaker'].iat[0]
    dataframe['sp_role'] = np.where(dataframe['speaker'].eq(leader), 1, 2)
    return dataframe
df_list = [speaker_role(df) for df in df_list]
leader_df = []
follower_df = []
for df in df_list:
    is_leader = df['sp_role'] == 1
    is_follower = df['sp_role'] != 1
    leader_df.append(df[is_leader])
    follower_df.append(df[is_follower])
I have worked out that I can now simply shuffle one of the two lists of sub-DataFrames, in this case follower_df:
follower_rand = random.sample(follower_df, len(follower_df))
Having got to this stage I'm not sure where to turn next. I suspect I will need some sort of zip function, but am unsure exactly what. I'm also unsure how I go about merging the turns together such that they form the same dataframe structure I initially had. Assuming Ringo (leader) is randomly assigned to Bubbles (follower) for one of the DFs, I would hope to have something like this...
speaker  | turn         | sp_role
------------------------------------
ringo      hello          1
bubbles    hi there       2
ringo      nice weather   1
bubbles    yes            2
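To make the goal concrete, here is a minimal sketch of the zip-and-merge step I have in mind. It is not a tested solution: the helper name recombine and the temporary turn_no column are hypothetical, and the two small conversations are typed in by hand so that the snippet runs on its own. It assumes each speaker's rows are already in speaking order.

```python
import random
import pandas as pd

def recombine(leader, follower):
    # Hypothetical helper: interleave one leader's turns with one follower's turns.
    parts = []
    for part in (leader, follower):
        part = part.reset_index(drop=True)
        # turn_no = position of the turn within this speaker (0, 1, 2, ...)
        parts.append(part.assign(turn_no=range(len(part))))
    combined = pd.concat(parts)
    # sort by turn number, then by role, so the leader (1) precedes the follower (2)
    combined = combined.sort_values(['turn_no', 'sp_role'])
    return combined.drop(columns='turn_no').reset_index(drop=True)

# Hand-made stand-ins for the leader_df / follower_df lists built above
leader_df = [
    pd.DataFrame({'speaker': ['ringo', 'ringo'],
                  'turn': ['hello', 'nice weather'],
                  'sp_role': [1, 1]}, index=[0, 2]),
    pd.DataFrame({'speaker': ['michael', 'michael'],
                  'turn': ['hello', 'nice weather'],
                  'sp_role': [1, 1]}, index=[0, 2]),
]
follower_df = [
    pd.DataFrame({'speaker': ['paul', 'paul'],
                  'turn': ['hi there', 'yes'],
                  'sp_role': [2, 2]}, index=[1, 3]),
    pd.DataFrame({'speaker': ['bubbles', 'bubbles'],
                  'turn': ['hi there', 'yes'],
                  'sp_role': [2, 2]}, index=[1, 3]),
]

# Shuffle the followers, then pair each leader with whichever follower lands opposite
follower_rand = random.sample(follower_df, len(follower_df))
shuffled_convos = [recombine(l, f) for l, f in zip(leader_df, follower_rand)]

# A fixed pairing (Ringo with Bubbles) for illustration
ringo_bubbles = recombine(leader_df[0], follower_df[1])
print(ringo_bubbles)
```

Two caveats with this sketch: random.sample can leave a follower opposite their original leader, and if the two speakers have different numbers of turns, the extra turns of the longer speaker end up back to back at the end of the conversation.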