(please tag 'expectreg' - don't have the rep)
This framework and package seem to exist more or less in the shadows, but I'm going to try my luck here.
I'm trying to estimate the conditional distribution of Y|X nonparametrically at values of x that I provide. I'm using the "Allstate Claims Severity" dataset from Kaggle, downloaded manually and extracted into my environment for this MRE. I can switch to other data if that helps.
library(expectreg)
library(dplyr)
library(ggplot2)

dat <- read.csv("train.csv") %>%  ## from kaggle allstate claim severity
  select(id, cont4, loss) %>%
  slice_sample(n = 5000) %>%
  as_tibble()

m1 <- expectreg.ls(loss ~ rb(cont4, type = "pspline", B_size = 10),
                   estimate = "restricted",  # or can use "bundle"
                   smooth = "schall",
                   expectiles = "density",
                   #LAWSmaxCores = 4,
                   data = dat)
Now, from this bundle of densely packed expectiles, I want to estimate the conditional distribution of loss at arbitrary values of cont4 that I provide. I see two methods. The first, cdf.qp(), accepts a length-1 vector for 'x' but doesn't return a well-behaved density. I'm sure this can't be intended, since the result is nonsensical as a distribution, for this data, and given the estimated expectiles:
## attempt 1 with cdf.qp()
densities <- cdf.qp(m1, x = .3)
## densities$x here is our modeled Y i.e. variable 'loss'
tibble(x = densities$x, y = densities$density) %>%
  ggplot(aes(x, y)) +
  geom_line()
The other method, cdf.bundle(), requires a particular estimation method ("restricted" or "bundle"). OK. Inspecting the return object, it appears to contain a vector, density, defining one nice smooth density function, but I'm not sure where along X this density is located. The method has no way to condition the result on a value of X, and I have no clue what this density represents.
## attempt 2, cdf.bundle
## not sure what is x here, or density for that matter
densities <- cdf.bundle(m1)
tibble(x = densities$x, y = densities$density) %>%
  ggplot(aes(x, y)) +
  geom_line()
Note: densities$x doesn't appear to be the covariate X (cont4), since the units differ. It also isn't the same $x returned by cdf.qp, which is the dependent variable Y (loss).
I'm hoping someone is familiar enough with this package to say whether I'm missing something, or whether it is just not 'complete' enough to provide what the authors say it implements and show in some of their papers (see p. 92 of "Expectile smoothing: new perspectives on asymmetric least squares", if you can get it through JSTOR or similar).



Of course, I solved this not long after posting the question, though only after many hours of reading code and testing.
cdf.qp is the correct method, though I had to modify it to handle a larger-scale response variable and more observations than I guess the authors anticipated, because of numeric overflow during the matrix operations.
The density estimates are pretty rough in my application (unlike in a paper from the authors), so I needed the 'lambda' parameter to smooth them.
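For anyone who wants to try the same thing, here is a rough sketch of how the smoothing could be explored. It is not the modified cdf.qp mentioned above, and it does nothing about the overflow issue. The helper names density_mass and smoothed_density are made up for illustration, and the lambda values in the usage comment are arbitrary guesses that would need tuning for your data. It only uses the x and lambda arguments of cdf.qp and the $x and $density components of its return value.

```r
# Trapezoidal-rule area under a density evaluated on a grid.
# A value far from 1 hints that the estimate is not a usable density.
density_mass <- function(grid, dens) {
  sum(diff(grid) * (head(dens, -1) + tail(dens, -1)) / 2)
}

# Hypothetical wrapper (not part of expectreg): conditional density of the
# response at covariate value x0, smoothed with penalty lambda.
smoothed_density <- function(model, x0, lambda) {
  fit <- expectreg::cdf.qp(model, x = x0, lambda = lambda)
  data.frame(loss = fit$x, density = fit$density, lambda = lambda)
}

# Usage, assuming `m1` from the question is in the workspace and the
# lambda values are placeholders to be tuned:
# curves <- do.call(rbind, lapply(c(1, 10, 100),
#                                 function(l) smoothed_density(m1, 0.3, l)))
# ggplot(curves, aes(loss, density)) + geom_line() + facet_wrap(~ lambda)
# with(subset(curves, lambda == 10), density_mass(loss, density))
```

Plotting the curves side by side makes it easier to judge how much smoothing is enough, and the area check gives a quick sanity test on whichever lambda you settle on.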