Error in seq.default(1, 1, length.out = nrow(x)) : argument 'length.out' must be of length 1


I am trying to write a simple function that finds outliers and marks each observation with valid.obs=1 if it is not an outlier, or valid.obs=0 if it is. For example, for the variable "income", outliers are identified by the following rule:

If income>=(99percentile(income)+standard_deviation(income)), the observation is an outlier. If income<(99percentile(income)+standard_deviation(income)), it is not an outlier.

rem= function(x){
  u=quantile(x,probs=0.99,na.rm=TRUE) #calculating the 99th percentile
  s=sapply(x,sd,na.rm=TRUE) #calculating the standard deviation
  uc=u+s
  v=seq(1,1,length.out = nrow(x))
  v[x>=uc]=0
  v[x<uc]=1
  x$valid.obs=v
  return(x) 
}

I then apply this function to a single column of a data frame. The data frame has 132 variables with 5000 entries. I choose the variable "income":

apply(data["income"],2,rem)

It then shows the error:

 Error in seq.default(1, 1, length.out = nrow(x)) : 
 argument 'length.out' must be of length 1 

Outside the function "rem", the following code works just fine:

nrow(data["income"])
[1] 5000

I am new to R and there aren't many functions in my armoury yet. The objective of this function is very simple. Please let me know why this error occurs and whether there is an easier way to go about this.


There is 1 answer

ruaridhw On

Use

v = rep(1, length.out = length(x))

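apply iterates over the "margins" (rows or columns) of its input. It first coerces the data frame to a matrix and then passes each column to FUN as a plain vector, with the row names as element names. A vector has a length but no row count, so nrow(x) returns NULL inside rem, and seq then complains that length.out is not of length 1.

I.e., inside rem you are effectively calling something like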

> nrow(c(1,2,3))
NULL
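To see what FUN actually receives, a small check along these lines may help. The data frame df and its values are made up purely for illustration:

```r
# Hypothetical toy data frame, for illustration only
df <- data.frame(income = c(10, 20, 30))

# apply() hands each column to FUN as a plain vector, so nrow() is NULL there
has_no_rows <- apply(df["income"], 2, function(x) is.null(nrow(x)))

# length() works on the same vector
n <- apply(df["income"], 2, length)

has_no_rows  # TRUE for the "income" column
n            # 3 for the "income" column
```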

A few other things not directly related to your error:

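For the same reason as above, there is no need to call sd inside sapply. On a vector, sapply(x, sd) computes sd separately for each single element, which gives NA every time. Just call sd directly on the vector.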

s=sd(x,na.rm=TRUE) #calculating the standard deviation

You can also simplify three lines (and remove your initial problem entirely) by using

v=as.numeric(x<uc)

This creates a logical vector (automatically the same length as x) with TRUE/FALSE values based on x<uc. To get your 0s and 1s, just coerce the logical values with as.numeric. Note that any NA in x stays NA in v.
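As a quick illustration of that coercion, here is a sketch with a made-up vector and cutoff (neither comes from the question's data):

```r
x  <- c(1, 5, 10, NA)  # made-up values, including a missing one
uc <- 5                # made-up cutoff

flags <- x < uc            # logical: TRUE FALSE FALSE NA
v     <- as.numeric(flags) # numeric: 1 0 0 NA
v
```

If missing incomes should count as invalid rather than NA, one option would be to follow up with v[is.na(v)] = 0, though that is a design choice the question does not settle.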

Finally, if all you need to do is add one column to data based on the values in income, have the function return v instead and call it like so:

data$valid.obs <- rem(data$income)

Your function will now return a vector which can essentially be added to data under the new name of valid.obs
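Putting the suggestions together, the whole function might look something like the sketch below. The sample data frame is invented for demonstration (99 identical incomes plus one extreme value); your real data would replace it.

```r
# Returns 1 for valid observations and 0 for outliers,
# where an outlier is any value >= 99th percentile + standard deviation
rem <- function(x) {
  u  <- quantile(x, probs = 0.99, na.rm = TRUE)  # 99th percentile
  s  <- sd(x, na.rm = TRUE)                      # standard deviation
  uc <- u + s                                    # upper cutoff
  as.numeric(x < uc)                             # 1 = valid, 0 = outlier
}

# Made-up example data: 99 ordinary incomes and one extreme value
data <- data.frame(income = c(rep(50000, 99), 1e6))
data$valid.obs <- rem(data$income)

table(data$valid.obs)  # 99 ones and a single zero for the extreme value
```

Because rem now takes and returns a plain vector, there is no need for apply at all, and the same function can be reused on any other numeric column.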