The Problem
I have two datasets that describe, let's say, the temperature at certain depths and at certain latitudes for a sea. The datasets are from two different models and therefore have differing resolution, with model 1 have a higher resolution for latitude and both models having different levels for the depth dimension. I've turned both datasets into pandas dataframes with depth as the vertical index and latitudes as the column labels. I want to mask out the rows (depths) and columns (latitudes) that are not shared between both dataframes since I'll be taking a difference and don't want to interpolate data. I've found how to mask out certain values within rows and columns, but I want to mask out rows and columns in their entirety.
I've used np.intersect1d on the depths as lists to find which depths are not shared between the models, and I created a boolean list using a conditional statement showing True for every index where the value is unique to that dataframe. However, I'm not sure how to use this as a mask or even if I can. DataFrame.mask says the "array conditional must be the same shape as self", but array conditional is one-dimensional and the dataframe is two-dimensional. I'm not sure how to refer to the indexes of the dataframe only to apply the mask. I feel like I'm on the right track, but I'm not entirely sure since I'm still new to pandas. (I've tried searching for similar questions, but none match my problem quite exactly from what I've seen.)
The Code (simplified working example)
Note- This was written in the Jupyter notebook environment
import numpy as np
import pandas as pd
# Model 1 data
depthmod1 = [5, 10, 15, 20, 30, 50, 60, 80, 100] #depth in meters
latmod1 = [50, 50.5, 51, 51.5, 52, 52.5, 53] #latitude in degrees north
tmpumod1 = np.random.randint(273,303,size=(len(depthmod1),len(latmod1))) #temperature
dfmod1 = pd.DataFrame(tmpumod1,index=depthmod1,columns=latmod1)
print(dfmod1)
50.0 50.5 51.0 51.5 52.0 52.5 53.0
5 299 300 300 293 285 293 273
10 273 288 293 292 290 302 273
15 277 279 284 302 280 294 284
20 291 295 277 276 295 279 274
30 281 284 284 275 295 284 282
50 284 276 291 282 286 295 295
60 298 294 289 294 285 289 288
80 285 284 275 298 287 277 300
100 292 295 294 273 291 276 290
# Model 2 data
depthmod2 = [5, 10, 15, 25, 35, 50, 60, 100]
latmod2 = [50, 51, 52, 53]
tmpumod2 = np.random.randint(273,303,size=(len(depthmod2), len(latmod2)))
dfmod2 = pd.DataFrame(tmpumod2,index=depthmod2,columns=latmod2)
print(dfmod2)
50 51 52 53
5 297 282 275 292
10 298 286 292 282
15 286 285 288 273
25 292 288 279 299
35 301 295 300 288
50 277 301 281 277
60 276 293 295 297
100 275 279 292 287
# Find shared depths
depthxsect = np.intersect1d(depthmod1, depthmod2)
print(depthxsect, depthxsect.shape)
Shared depths: [ 5 10 15 50 60 100] (6,)
# Boolean mask for model 1
depthmask = dfmod1.index.isin(depthxsect) == False
print("Bool showing where mod1 index is NOT in mod2: ", depthmask)
Bool showing where mod1 index is NOT in mod2: [False False False True True False False True False]
# Mask data
dfmod1masked = dfmod1.mask(depthmask1,np.nan)
print(dfmod1masked)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-fedf013c2200> in <module>
----> 1 dfmod1masked = dfmod1.mask(depthmask1,np.nan)
2 print(dfmod1masked)
[...]
ValueError: Array conditional must be same shape as self
The Question
How can I mask the rows by index such that I'm left only with rows/indexes [ 5 10 15 50 60 100] useable in both dataframes? I'll be doing similar masking for the columns (latitudes) so hopefully the solution for the rows will work for columns as well. I also do not want to merge the dataframes. They should remain separate unless a merge is needed for this.
depthxsect
returns annp.array
of the indices that you need. So, you can skip creating the boolean arraydepthmask
and just pass the np.array to your datframe using.loc
. You should use.mask
if you are trying to keep all of the rows but just returnNaN
values on the other indices.After getting
dfmod1
anddepthxsect
, you can simply use:Full reproducible code:
I have included the approach you were trying as well. You have t ospecify
mask
on a column. You were doing it on the entire dataframe: