summarise returning -Inf when using na.rm = TRUE


I recently built a simple R script to summarize three different data frames. Since updating to the newest version of R and R Studio, I am running into an output I haven't seen before when using the summarize function in dplyr for only one of the data frames (the other two are fine). I also receive a series of warnings that are unfamiliar to me. Please note that prior to updating, I ran the script exactly as written with no issues for any of the data frames.

The data frame with the problem is called VO2, and it is set up as follows:

Name        Sex       VO2
AthleteA    M         50
AthleteA    M         52
AthleteA    M         NA
AthleteB    M         49
AthleteB    M         56
AthleteB    M         47 
AthleteC    M         42
AthleteC    M         NA
AthleteC    M         41 
AthleteD    M         NA
AthleteD    M         NA
AthleteD    M         NA 

The code I run is:

library(dplyr)  # provides %>%, group_by() and summarise()
Test.Summary.VO2 <- VO2 %>%
  group_by(Name, Sex) %>%
  summarise(Best.Score = max(VO2, na.rm = TRUE))

This code generates the following summary:

Name       Sex     Best.Score
AthleteA    M        52
AthleteB    M        56
AthleteC    M        42
AthleteD    M        -Inf

The -Inf value is completely new in the output. I cannot figure out why it is appearing now for cases where there were only NAs.

As mentioned above, I have the exact same layout for a second data frame and run the same type of summary. Here everything works fine. When I summarize with na.rm=TRUE, it removes the NA cases without replacing NA cases with an -Inf value.

Where this gets a bit more unusual is that when I view the data frame using:

View(Test.Summary.VO2)

I receive the following series of warning messages:

There were 38 warnings (use warnings() to see them)
warnings()
Warning messages:
1: Unknown or uninitialised column: 'Quad'.
2: Unknown or uninitialised column: 'Quad'.
3: Unknown or uninitialised column: 'Quad'.
4: Unknown or uninitialised column: 'Quad'.

Later on in the script I generate a new variable called "Quad". But the warning above appears even after I clear the environment, and restart R Studio. I have even tried renaming the .csv file and importing using a different dataframe name. It's almost as if the column 'Quad' that is generated later in the script is hanging around somewhere in the environment.

I am really at a loss as to what might be happening here.

I hope one of the R experts on Stack can provide me with an idea on how to remedy this issue.

Thanks for your consideration.

There are 2 answers

Hong Ooi (best answer)

See ?max:

The minimum and maximum of a numeric empty set are +Inf and -Inf (in this order!) which ensures transitivity, e.g., min(x1, min(x2)) == min(x1, x2). For numeric x max(x) == -Inf and min(x) == +Inf whenever length(x) == 0 (after removing missing values if requested). However, pmax and pmin return NA if all the parallel elements are NA even for na.rm = TRUE.

You don't have any non-NA values for AthleteD, so once na.rm = TRUE removes the NAs, max is applied to an empty vector and returns the value for an empty set, which is -Inf.
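
The behaviour can be reproduced in base R without dplyr. The snippet below is only a sketch: safe_max is a hypothetical helper name (it is not part of base R or dplyr) that returns NA when every value is missing and otherwise falls back to max.

```r
# An all-NA vector becomes empty once na.rm = TRUE drops the NAs,
# so max() returns -Inf (and emits a warning, silenced here).
x <- c(NA_real_, NA_real_, NA_real_)
empty_max <- suppressWarnings(max(x, na.rm = TRUE))
empty_max                 # -Inf

# Hypothetical helper (an assumption for illustration, not an existing function):
# return NA when there is nothing left to maximise.
safe_max <- function(x) {
  if (all(is.na(x))) {
    return(NA_real_)
  }
  max(x, na.rm = TRUE)
}

safe_max(x)               # NA
safe_max(c(42, NA, 41))   # 42
```

Note that all(is.na(x)) is also TRUE for a zero-length vector, so the helper covers the genuinely empty case too.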

davsjob

Late to the party, but a solution would be to return NA instead of -Inf when there is no value to maximize. This can be done with the s() function from the hablar package.

library(dplyr)
library(hablar)

VO2 %>% 
  group_by(Name, Sex) %>% 
  summarise(Best.Score = max(s(VO2)))

which gives you:

  Name     Sex   Best.Score
  <chr>    <chr>      <int>
1 AthleteA M             52
2 AthleteB M             56
3 AthleteC M             42
4 AthleteD M             NA
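
If you would rather not add another package, the same result can probably be had with dplyr alone by checking for all-NA groups inside summarise. This is a sketch, not a tested drop-in: it rebuilds the example data by hand with VO2 stored as doubles, and it assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Example data rebuilt from the question (VO2 stored as double here).
VO2 <- data.frame(
  Name = rep(c("AthleteA", "AthleteB", "AthleteC", "AthleteD"), each = 3),
  Sex  = "M",
  VO2  = c(50, 52, NA, 49, 56, 47, 42, NA, 41, NA, NA, NA)
)

# Return NA for groups with no non-missing values, max() otherwise.
Test.Summary.VO2 <- VO2 %>%
  group_by(Name, Sex) %>%
  summarise(
    Best.Score = if (all(is.na(VO2))) NA_real_ else max(VO2, na.rm = TRUE),
    .groups = "drop"
  )

Test.Summary.VO2
```

If the real VO2 column is integer, NA_integer_ may be the better placeholder so that every group returns the same type.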