Fit Partial Proportional Odds Model (Ordered Logit) for Ordinal Response Variable with Big Dataset


I'm seeking advice on fitting a generalized ordered logit model to a large dataset. My aim is to understand the effect of the variable "origin_country_code" on the dependent variable (a score from 0 to 10) when individuals evaluate the same product ("id_product"). I therefore want to estimate an ordinal logistic model that allows for non-proportional odds, since I'm interested in how different values of "origin_country_code" affect each level of the "score" variable.
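
To make the setup explicit, the model I have in mind is, as I understand it, a cumulative logit model with threshold-specific coefficients:

$$\operatorname{logit}\,P(\text{score} \le j \mid x) = \alpha_j + x^\top \beta_j, \qquad j = 1, \dots, J-1,$$

where, unlike the proportional odds model, the slope vector $\beta_j$ is allowed to differ across the thresholds $j$.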

The variable "origin_country_code" is a categorical variable for the country of origin of the individual who made a transaction, and I am interested in knowing how different nationalities change the scores given to the transaction, when I control for the product they are buying.

Here's some information on my dataset:

The dataset consists of four variables and around 50 million observations. The dependent variable ("score") is a score ranging from 0 to 10, stored as a factor. All independent variables are factors: the ID of the product, the country of origin of the individual making the transaction, and the country where the transaction took place. If the same product is being sold, the ID ("id_product") is the same.

Here's a glimpse of the dataset structure:

head(data)
  score origin_country_code  trans_country_code    id_product
1    10                  1                 75             211424
2    10                  1                  2               4510
3    10                  1                122             458737
4    10                  1                168             554101
5     4                  1                168             554701
6     8                  1                241             927203

Given the size of the dataset, I attempted to run the model on a smaller subsample of the data, focusing on the most popular and most international products:

# Subsample of the most popular and international products
data <- data %>%
  group_by(id_product) %>%
  mutate(n_nationalities = n_distinct(origin_country_code)) %>%
  ungroup()

# Keep products rated by at least 110 different nationalities
data <- data[data$n_nationalities >= 110, ]

After subsampling, the dataset dimensions are as follows:

> dim(data)
[1] 317694      5
> nlevels(data$score)
[1] 10
> nlevels(data$origin_country_code)
[1] 249
> nlevels(data$trans_country_code)
[1] 192
> nlevels(data$id_product)
[1] 947954

However, when attempting to fit the model using the vglm() function from the VGAM package, I consistently run into errors. Below is the main code for fitting the model:

mo_no_propodds <- vglm(score ~ origin_country_code + trans_country_code + id_product,
                       family = cumulative(link = "logitlink",
                                           parallel = FALSE), data = data)

I always get an error message like this:

Error: cannot allocate vector of size 15.8 Gb

If I try with a smaller subsample, then I get these kinds of error messages:

> mo_no_propodds <- vglm(score ~ origin_country_code + trans_country_code + id_product,
+                        family=cumulative(link="logitlink", 
+                                          parallel =F), data=data)
Error in tapplymat1(cump, "diff") : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In Deviance.categorical.data.vgam(mu = mu, y = y, w = w, residuals = residuals,  :
  fitted values close to 0 or 1
2: In eval(slot(family, "deriv")) : some probabilities are very close to 0
3: In Deviance.categorical.data.vgam(mu = mu, y = y, w = w, residuals = residuals,  :
  fitted values close to 0 or 1

I have tried various approaches, including grouping some categories of the "score" variable (see the sketch below) and using the vgam() function instead of vglm(), but I still encounter the same issue. Sometimes I also get an error saying that the matrix is not of full rank.
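
For reference, the kind of grouping I mean is along these lines (the cut points below are purely illustrative, not the exact grouping I used):

library(dplyr)

# Illustrative only: collapse the 0-10 score into fewer ordered categories
data <- data %>%
  mutate(score_grouped = cut(as.numeric(as.character(score)),
                             breaks = c(-Inf, 4, 6, 8, 10),
                             labels = c("0-4", "5-6", "7-8", "9-10"),
                             ordered_result = TRUE))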

I also attempted to fit a partial proportional odds model, but the problem persists:

mo_no_partialprop_odds <- vglm(score ~ origin_country_code + trans_country_code + id_product,
                               family = cumulative(link = "logitlink",
                                                   parallel = TRUE ~ -1 +
                                                     trans_country_code + id_product), data = data)
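
My understanding of the VGAM syntax is that this parallel formula imposes the proportional-odds (parallel) constraint only on trans_country_code and id_product, leaving the coefficients of origin_country_code free to vary across the thresholds, which is the partial proportional odds structure I am after.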

I suspect that the dataset size is causing this issue. Are there any R data structures that are more computationally efficient for handling large datasets? My data is currently stored in a data frame. I tried using sparse matrices, but vglm() does not seem to accept them as input.
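
For completeness, this is roughly what I mean by trying sparse matrices (a sketch only; since vglm() works from a formula and a data frame, I could not find a way to pass such a matrix to it):

library(Matrix)

# Illustrative only: a sparse design matrix for the three factor predictors
X_sparse <- sparse.model.matrix(~ origin_country_code + trans_country_code + id_product,
                                data = data)
dim(X_sparse)  # extremely wide, but stored in sparse form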

Any advice or suggestions would be greatly appreciated. Thank you in advance!
