I am trying to make a function that in the end will run multiple machine learning algorithms on my data set. I have the first little bit of my function below and a small sample of data.
The problem I am running into is with splitting my data into four different data frames and then passing them to the given functions. In the first function I am testing, the data runs through the logistic regression model, but the output shows the model using all of the data and not just 1/4 of the data frame df as I intend. I checked with <<- to see what data is being passed through, and it is a data set that is 1/4 of df, which is what I am looking for. Question: why does the right subset reach my global environment but not my regression function, and how would I correct this?
Data:
zeroFac <- c(1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1)
goal <- c(8.412055, 7.528869, 8.699681, 10.478752, 9.210440, 10.308986, 10.126671, 11.002117, 10.308986, 7.090910, 10.819798, 7.824446, 8.612685,
7.601402, 10.126671, 7.313887, 5.993961, 7.313887, 8.517393, 12.611541)
City_Pop <- c( 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613,
11.64613, 11.64613, 11.64613, 11.64613, 11.64613, 11.64613)
df <- data.frame(zeroFac,goal,City_Pop)
Function:
forestModel <- function(eq1, ...){
#making our original data frame
train <- data.frame(cbind(...))
################
#splitting into 4 data sets
set.seed(123)
ss <- sample(1:4, size = nrow(train), replace=TRUE, prob = c(0.25,0.25,0.25,0.25))
t1 <- train[ss==1,]
t2 <- train[ss==2,]
t3 <- train[ss==3,]
t4 <- train[ss==4,]
################
m <- glm(eq1, family = binomial(link = 'logit'), data = t1)
summary(m)
}
eq1 <- df$zeroFac ~ df$goal + df$City_Pop
forestModel(eq1, df$zeroFac, df$goal, df$City_Pop)
You have to change the formula and name the columns of the train data frame inside the function. The formula changes from
eq1 <- df$zeroFac ~ df$goal + df$City_Pop
to
eq1 <- zeroFac ~ goal + City_Pop
Otherwise the formula contains calls to the data frame df itself and not just column names, so glm takes df$zeroFac, df$goal and df$City_Pop straight from the global df (all 20 rows) and the data = t1 argument has no effect. And after binding the training data together with cbind, you have to name the columns, so that glm knows which columns of t1 you are referring to in the formula.
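The fix described above can be sketched roughly as follows. This is a minimal illustration rather than a definitive version: it assumes the columns are passed to the function in the same order as the variables appear in the formula, so that all.vars(eq1) can be used to name them, and it returns the fitted model instead of its summary so the number of rows used can be checked.

```r
# Sample data from the question
zeroFac <- c(1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1)
goal <- c(8.412055, 7.528869, 8.699681, 10.478752, 9.210440, 10.308986,
          10.126671, 11.002117, 10.308986, 7.090910, 10.819798, 7.824446,
          8.612685, 7.601402, 10.126671, 7.313887, 5.993961, 7.313887,
          8.517393, 12.611541)
City_Pop <- rep(11.64613, 20)
df <- data.frame(zeroFac, goal, City_Pop)

forestModel <- function(eq1, ...) {
  # Build the data frame from the vectors passed in
  train <- data.frame(cbind(...))
  # Assumption: the vectors arrive in the same order as the formula's
  # variables, so the formula can supply the column names
  names(train) <- all.vars(eq1)

  # Split into 4 data sets
  set.seed(123)
  ss <- sample(1:4, size = nrow(train), replace = TRUE, prob = rep(0.25, 4))
  t1 <- train[ss == 1, ]

  # The formula uses bare column names, so glm looks them up in t1
  glm(eq1, family = binomial(link = "logit"), data = t1)
}

# Formula written with column names only, no df$ prefix
eq1 <- zeroFac ~ goal + City_Pop
m <- forestModel(eq1, df$zeroFac, df$goal, df$City_Pop)

# Number of rows the model was actually fitted on
nobs(m)
```

Calling summary(m) afterwards gives the usual output. With a sample this small glm may emit convergence warnings, and because City_Pop is constant in the sample data its coefficient cannot be estimated.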