Set ylim() automatically

Here is some data to work with.

df <- data.frame(x1=c(234,543,342,634,123,453,456,542,765,141,636,3000),x2=c(645,123,246,864,134,975,341,573,145,468,413,636))

If I plot these data, it will produce a simple scatter plot with an obvious outlier:

plot(df$x2,df$x1)

Then I can always write the code below to cut the y-axis outlier(s) out of view.

plot(df$x2,df$x1,ylim=c(0,800))

So my question is: is there a way to exclude obvious outliers from scatterplots automatically, the way outline=F does for boxplots, for example? To my knowledge, outline=F doesn't work with scatterplots.

This is relevant because I have hundreds of scatterplots and I want to exclude all obvious outlying data points without setting ylim(...) for each individual scatterplot.

There are 2 answers

JohannesNE On

You could write a function that returns the index of what you define as an obvious outlier. Then use that function to subset your data before plotting.

Here all observations with "a" exceeding 5 * median of "a" are excluded.

df <- data.frame(a = c(1,3,4,2,100), b=c(1,3,2,4,2))

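f <- function(x){
    which(x$a > 5*median(x$a))  # row indices of the outliers
}

# Keep the rows not flagged; unlike df[-f(df),], this also works when f() finds nothing
keep <- setdiff(seq_len(nrow(df)), f(df))
with(df[keep,], plot(b, a))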
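
If the aim is to mirror what outline=F does for boxplots, one option is to reuse the boxplot rule itself. The sketch below is only one possible definition of "obvious outlier", not something the answer above prescribes: it drops rows whose y value lies beyond the boxplot whiskers (1.5 times the interquartile range by default), using boxplot.stats() from grDevices. The helper name drop_y_outliers and its arguments are made up for illustration.

```r
# Hypothetical helper: keep only rows whose y value is inside the boxplot whiskers.
drop_y_outliers <- function(data, ycol, coef = 1.5) {
    out <- boxplot.stats(data[[ycol]], coef = coef)$out  # values flagged as outliers
    data[!(data[[ycol]] %in% out), , drop = FALSE]       # logical index, safe when 'out' is empty
}

# Data from the question
df <- data.frame(x1=c(234,543,342,634,123,453,456,542,765,141,636,3000),
                 x2=c(645,123,246,864,134,975,341,573,145,468,413,636))

clean <- drop_y_outliers(df, "x1")
plot(clean$x2, clean$x1)
```

Raising coef makes the rule more tolerant, so fewer points are dropped.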
tegancp On

There is no easy yes/no option to do what you are looking for (the question of defining what is an "obvious outlier" for a generic scatterplot is potentially quite problematic).

That said, it should not be too difficult to write a reasonable function to give y-axis limits from a set of data points. If we take "obvious outlier" to mean a point with y value significantly above or below the bulk of the sample (which could be justified assuming a sufficient distribution of x values), then you could use something like:

ybounds <- function(y){  # y is the response variable in the dataframe
    # 5% and 95% quantiles of the values passed in
    bounds <- quantile(y, probs=c(0.05, 0.95), type=3, names=FALSE)
    # widen the range by 10% of its spread on each side
    return(bounds + c(-1,1) * 0.1 * (bounds[2]-bounds[1]) )
}

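Then plot each data frame with plot(df$x, df$y, ylim=ybounds(df$y)), substituting its own column names (for the data in the question: plot(df$x2, df$x1, ylim=ybounds(df$x1))).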
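
Since the question involves hundreds of scatterplots, the limits can be computed inside a loop. This is a minimal sketch, assuming the data frames are collected in a list and each has the columns x2 and x1 as in the question. The list name dfs, the second data set, and the helper auto_ylim are invented here; auto_ylim restates the quantile idea with the data passed in as an argument and na.rm=TRUE added.

```r
# Hypothetical helper: y-limits from the 5% and 95% quantiles,
# padded by 10% of their spread on each side.
auto_ylim <- function(y) {
    b <- quantile(y, probs = c(0.05, 0.95), type = 3, names = FALSE, na.rm = TRUE)
    b + c(-1, 1) * 0.1 * (b[2] - b[1])
}

# Assumed layout: a list of data frames with columns x2 (x-axis) and x1 (y-axis).
dfs <- list(
    first  = data.frame(x1 = c(234,543,342,634,123,453,456,542,765,141,636,3000),
                        x2 = c(645,123,246,864,134,975,341,573,145,468,413,636)),
    second = data.frame(x1 = c(12,15,11,14,13,16,12,15,14,13,11,250),
                        x2 = 1:12)
)

# One scatterplot per data frame, each with its own automatic y-limits.
for (name in names(dfs)) {
    d <- dfs[[name]]
    plot(d$x2, d$x1, ylim = auto_ylim(d$x1), main = name)
}
```

With very small samples the 95% quantile can be the largest value itself, in which case the outlier stays in view; the probs values may need adjusting for such data.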