I am trying to write a simple function that finds outliers and marks each observation with valid.obs=1 if it is not an outlier, or valid.obs=0 if it is. For example, for the variable "income", outliers are identified by the following rule: if income>=(99percentile(income)+standard_deviation(income)), the observation is an outlier; if income<(99percentile(income)+standard_deviation(income)), it is not.
rem = function(x){
  u = quantile(x, probs=0.99, na.rm=TRUE) # calculating the 99th percentile
  s = sapply(x, sd, na.rm=TRUE) # calculating the standard deviation
  uc = u+s
  v = seq(1, 1, length.out = nrow(x))
  v[x>=uc] = 0
  v[x<uc] = 1
  x$valid.obs = v
  return(x)
}
I then apply this function to a single column of a data frame that has 132 variables and 5000 entries. I choose the variable "income":
apply(data["income"],2,rem)
It then shows the error:
Error in seq.default(1, 1, length.out = nrow(x)) :
argument 'length.out' must be of length 1
Outside the function "rem", the following code works just fine:
nrow(data["income"])
[1] 5000
I am new to R and there aren't many functions in my armoury yet. The objective of this function is very simple. Please let me know why this error has crept in and whether there is an easier way to go about this.
Use length(x) instead of nrow(x). apply iterates over the "margins" (rows or columns) of a data frame and passes each column to FUN as a named vector. A vector has a length but no row count, i.e. inside rem you are passing a vector, so nrow(x) returns NULL and seq fails with the length.out error.
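The difference is easy to check on a toy example. This is only a minimal sketch: the data frame df and its values are made up for illustration and are not the asker's data.

```r
# Hypothetical toy data frame standing in for the asker's `data`
df <- data.frame(income = c(10, 20, 30, 40, 1000))

# Subsetting with ["income"] keeps a one-column data frame, so nrow() works
nrow(df["income"])

# apply() hands each column to FUN as a plain vector, which has no rows
row_counts   <- apply(df["income"], 2, function(x) is.null(nrow(x)))
lengths_seen <- apply(df["income"], 2, length)

row_counts    # TRUE for income: nrow(x) is NULL inside FUN
lengths_seen  # 5 for income: length(x) still works inside FUN
```

So nrow works outside the function because it is called on a data frame, while inside FUN it is called on a vector.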
A few other things not directly related to your error:
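For the same reason as above, there is no need to call sd inside sapply (applied to a vector, sapply would compute sd on each single element and return NA for every one). Just call it normally on the vector, i.e. sd(x, na.rm=TRUE). You can also simplify three lines (and remove your initial problem entirely) by assigning the comparison directly: v = x < uc.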
This will create a logical vector (automatically the same length as x) with TRUE/FALSE values based on < uc. To get your 0s and 1s, just coerce the logical values with as.numeric.
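As a small illustration of the comparison and the coercion, here is a sketch with made-up values; x and the cutoff uc below are hypothetical and chosen only to make the result obvious.

```r
# Hypothetical income values and cutoff, chosen only for illustration
x  <- c(10, 20, 30, 40, 1000)
uc <- 500

# The comparison is vectorised: one TRUE/FALSE per element of x
v <- x < uc

# Coerce TRUE/FALSE to 1/0
valid <- as.numeric(v)
valid  # 1 1 1 1 0
```

Note that any NA in x stays NA after the comparison and the coercion, so missing incomes are flagged as neither valid nor outliers.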
Finally, if all you need to do is add one column to data based on the values in income, you want to return v instead and call the function on the column directly, like so: data$valid.obs = rem(data$income). Your function will now return a vector, which can be added to data under the new name valid.obs.
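Putting the suggestions together, the function might look something like the sketch below. The data frame here is invented for the example (99 ordinary incomes plus one extreme value); it is an assumption for illustration, not the asker's 5000-row data.

```r
# Sketch of the revised function: takes a numeric vector, returns 0/1 flags
rem <- function(x) {
  u  <- quantile(x, probs = 0.99, na.rm = TRUE)  # 99th percentile
  s  <- sd(x, na.rm = TRUE)                      # standard deviation
  uc <- u + s                                    # outlier cutoff
  as.numeric(x < uc)                             # 1 = valid, 0 = outlier
}

# Hypothetical data: 99 ordinary incomes plus one extreme value
data <- data.frame(income = c(rep(100, 99), 1e6))

# Pass the column as a vector and store the result as a new column
data$valid.obs <- rem(data$income)

table(data$valid.obs)
```

Because rem now works on a plain vector, no call to apply is needed, and the same function can be reused on any other numeric column.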