I am trying to write a simple function that finds outliers and marks the corresponding observation as valid.obs=1 if it is not an outlier, or valid.obs=0 if it is. For example, for the variable "income", outliers are identified by the following rule: if income>=(99percentile(income)+standard_deviation(income)), the observation is an outlier; if income<(99percentile(income)+standard_deviation(income)), it is not.
rem= function(x){
u=quantile(x,probs=0.99,na.rm=TRUE) #calculating the 99th percentile
s=sapply(x,sd,na.rm=TRUE) #calculating the standard deviation
uc=u+s
v=seq(1,1,length.out = nrow(x))
v[x>=uc]=0
v[x<uc]=1
x$valid.obs=v
return(x)
}
I then apply this function to a single column of a data frame. The data frame has 132 variables with 5000 entries. I choose the variable "income":
apply(data["income"],2,rem)
It then shows the error:
Error in seq.default(1, 1, length.out = nrow(x)) :
argument 'length.out' must be of length 1
Outside the function "rem", the following code works just fine:
nrow(data["income"])
[1] 5000
I am new to R and there aren't many functions in my armoury yet. The objective of this function is very simple. Please let me know why this error occurs and whether there is an easier way to go about this.
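Use length(x) instead of nrow(x).
apply iterates through the "margins" (rows or columns) of a data frame and passes the data frame columns as vectors to FUN. A vector has a length but not a row count, i.e. inside rem, x is a plain vector rather than a data frame, so nrow(x) returns NULL and seq() complains that length.out is not of length 1.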
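To see what apply actually hands to FUN, a small check along these lines can help. It is only a sketch: the five-row data frame and the helper inspect are made up for illustration and do not come from the question.

```r
# Hypothetical stand-in for the asker's data frame
data <- data.frame(income = c(10, 20, 30, 40, 1000))

# Hypothetical helper: report what FUN receives from apply()
inspect <- function(x) {
  c(is_data_frame = is.data.frame(x),   # is x still a data frame?
    nrow_is_null  = is.null(nrow(x)),   # does nrow() give NULL?
    n             = length(x))          # length() still works
}

# Same call shape as in the question: one column, MARGIN = 2
res <- apply(data["income"], 2, inspect)
print(res)
```

The logical values are coerced to 0/1 because they are combined with the length in one vector. Outside of apply, nrow(data["income"]) works because data["income"] is still a one-column data frame.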
A few other things not directly related to your error:
For the same reason as above, there is no need to call sd inside sapply. Just call it normally on the vector.
You can also simplify three lines (and remove your initial problem entirely) by using
v = x < uc
This will create a logical vector (automatically the same length as x) with TRUE/FALSE values based on < uc. To get your 0s and 1s, just coerce the logical values with as.numeric.
Finally, if all you need to do is add one column to data based on the values in income, you want to return v instead and call the function like so:
data$valid.obs = rem(data$income)
Your function will now return a vector, which can be added to data under the new name valid.obs.
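Putting these suggestions together, the function might look roughly like this. It is a sketch that keeps the cutoff from the question (99th percentile plus standard deviation); the toy income values are invented purely for illustration.

```r
# Flag observations below the cutoff: 99th percentile + standard deviation
rem <- function(x) {
  u  <- quantile(x, probs = 0.99, na.rm = TRUE)  # 99th percentile
  s  <- sd(x, na.rm = TRUE)                      # standard deviation
  uc <- u + s                                    # outlier cutoff
  as.numeric(x < uc)                             # 1 = valid, 0 = outlier
}

# Hypothetical data: 99 ordinary incomes and one extreme value
data <- data.frame(income = c(rep(50, 99), 100000))

# Pass the column as a vector and store the result as a new column
data$valid.obs <- rem(data$income)
```

Because rem now takes and returns a plain vector, there is no need for apply at all, and missing incomes simply stay NA in valid.obs.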