Problem during merge: either gets no data output, or simply merges them by doubling the column in data

59 views Asked by At
# Here's the first try:

# Create a custom function for merging the data together:
def getXDataMerged():    
print('Income Statement CSV data is(rows, columns): ', df1.shape)
print('Balance Sheet CSV data is: ', df2.shape)
print('Cash Flow CSV data is: ' , df3.shape)

# Merge the data together
    result = pd.merge(df1, df2, on=['Ticker', 'SimFinId', 'Currency',
                      'Fiscal Year', 'Fiscal Period', 'Report Date', 'Publish Date'], how='inner')    
    result = pd.merge(result, df3, on=['Ticker','SimFinId','Currency',
                    'Fiscal Year','Report Date','Publish Date'])        
    print('Merged X data matrix shape is: ', result.shape)    
    return result

# Use getXDataMerged() to retrieve some data, and then save it to a CSV file named "Annual_Stock_Price_Fundamentals.csv"
X = getXDataMerged()
X.to_csv("Annual_Stock_Price_Fundamentals.csv")

# Output for first try:

Income Statement CSV data is(rows, columns):  (17185, 28)
Balance Sheet CSV data is:  (17185, 30)
Cash Flow CSV data is:  (17185, 28)
Merged X data matrix shape is:  (0, 73)

# Second try (only changed the merging method to 'outer', everything else stays the same:
  
    # Merge the data together
    result = pd.merge(df1, df2, on=['Ticker', 'SimFinId', 'Currency',
                      'Fiscal Year', 'Fiscal Period', 'Report Date', 'Publish Date'], how='outer')    
    result = pd.merge(result, df3, on=['Ticker','SimFinId','Currency',
                    'Fiscal Year','Report Date','Publish Date'])        

# Output for second try:

Income Statement CSV data is(rows, columns):  (17185, 28)
Balance Sheet CSV data is:  (17185, 30)
Cash Flow CSV data is:  (17185, 28)
Merged X data matrix shape is:  (34370, 73)

I tried to merge the data using 'inner' and then got no data.

I tried to merge the data using 'outer' and then doubled the columns, but couldn't sort them by merging common values.

2

There are 2 answers

0
Dun On

My method:

def getXDataMerged(myLocalPath='C:/Users/...'):       
 # apply Pandas read that seperates data using delimiter = ; into different var names.
    incomeStatementData=pd.read_csv(myLocalPath+'us-income-annual.csv',
                                    delimiter=';')
    balanceSheetData=pd.read_csv(myLocalPath+'us-balance-annual.csv',
                                 delimiter=';')
    CashflowData=pd.read_csv(myLocalPath+'us-cashflow-annual.csv',
                             delimiter=';')
    # print information on the shapes of the data
    print('Income Statement CSV data is(rows, columns): ',
          incomeStatementData.shape)
    print('Balance Sheet CSV data is: ',
          balanceSheetData.shape)
    print('Cash Flow CSV data is: ' ,
          CashflowData.shape)
    # Merge the data together... merge the first two data together with the specific column names using on= and assign to 'result'
    result = pd.merge(incomeStatementData, balanceSheetData,\
                on=['Ticker','SimFinId','Currency',
                    'Fiscal Year','Report Date','Publish Date'])
    # update 'result' with merge with the third data on the same column names as before    
    result = pd.merge(result, CashflowData,\
                on=['Ticker','SimFinId','Currency',
                    'Fiscal Year','Report Date','Publish Date'])
    
    print('Merged X data matrix shape is: ', result.shape)
    
    return result

X = getXDataMerged()
X.to_csv("Annual_Stock_Price_Fundamentals.csv")

Output:

Income Statement CSV data is(rows, columns):  (17213, 28)
Balance Sheet CSV data is:  (17213, 30)
Cash Flow CSV data is:  (17213, 28)
Merged X data matrix shape is:  (17213, 74)
5
jeste_artyste On

I can't see your data, so it's hard to tell where excatly the problem is, but when using 'inner' statement of pd.merge() method, new dataframe will have only matching keys from both dataframe objects, you trying to merge. It's always good to use 'inner' if you have same keys in both frames etc., from what I understand, your data is not replicable and you have different values across frames you trying to merge, that's why you have no values after merge. Eg. column 'SimFindID' is not the same, for both dataframes, that's why it's not matching results. Try to use less columns to merge on.

In terms of 'outer' method of merge, story is the same, but it gives you "not matching" rows of data, so that's why you will not sort it in any way, as all rows are different for both frames.

If what you want to achieve is just adding to frames together, use 'cross' option of pd.merge() method or look if you have any matching keys in both dataframes, and pick only those columns that are matching.

If you need more help, please add your data for me to look at it