I'm learning about Gaussian Processes (GPs) and have played with a few GP packages in Python. In most packages the default modelling option is: you have an n × 1 vector y, usually representing the GP-distributed variable of interest, and another n × 1 vector x of non-repeating values giving the position of each element of y in a real space (e.g. if you're modelling repeatedly measured data, x usually denotes the time of measurement).
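For concreteness, here is a minimal sketch of that default single-series setup, using scikit-learn (other GP packages look very similar); the data are invented just to show the shapes:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# x: n x 1 measurement times (non-repeating), y: n x 1 observed values
x = np.sort(np.random.uniform(0, 10, size=30)).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * np.random.randn(30)

# RBF captures smooth temporal structure; WhiteKernel models observation noise
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(x, y)

# Predict the mean and uncertainty at new time points
x_new = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gp.predict(x_new, return_std=True)
```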
I have a dataframe of longitudinal data for many different individuals, and for each individual the times of measurement are continuous and different. For example, say I operate a refectory in a building of 120 employees; these people visit the refectory from time to time (i.e. they don't always visit on the same day, and they can visit any time from 9AM to 5PM). I'm modelling the amount of money they spend on each visit. Obviously the data cluster by individual, which is why my first attempt was to fit a separate GP model to each individual's data. However, there could clearly be inter-individual clustering as well (e.g. employees from department A spend money differently from those in department B). I'm hoping there is a method to fit a single GP, or some special class of GP, to longitudinal data from multiple individuals rather than one, so I can capture covariances at the different clustering levels.
So far, I have tried multi-output GPs, treating each individual's data as a separate output; these are implemented in many Python packages. I think such methods work well when the data form a "time series", i.e. when they were sampled at very regular time intervals or the time axis can be sensibly discretised. However, as described above, my data were sampled at very irregular intervals both within and between individuals. I tried
- reshaping the whole dataset from long to wide format, creating a very sparse training matrix (sparse because NAs mark the times at which one individual was measured and the others were not; see the sketch after this list), and the model performed extremely poorly;
- forcefully discretising my time scale, but because my data were sampled at very irregular intervals, there is little room to discretise without either creating a lot of NAs or destroying the original temporal information.
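To illustrate the first attempt, here is a hypothetical version of the long-to-wide reshape in pandas (the column names and values are made up): because visit times almost never coincide across individuals, the pivoted matrix ends up almost entirely NaN.

```python
import pandas as pd

# Long-format data: one row per (individual, visit)
df = pd.DataFrame({
    "individual": ["A", "A", "B", "C"],
    "time":       [9.25, 13.50, 9.30, 16.75],  # hours, irregular per person
    "spend":      [4.20, 7.10, 3.80, 5.60],
})

# Wide format: one column per individual, one row per observed time
wide = df.pivot(index="time", columns="individual", values="spend")
print(wide)  # each row has exactly one non-NaN entry here
```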
Does anyone have any ideas or suggestions for this problem?
Based on your description, you can share information between individuals within a single GP by building a shared structure into the kernel function (sparse GP approximations can then keep the model tractable as the dataset grows). Briefly, you combine a shared kernel with individual kernels. The shared kernel captures the global structure common to all individuals, i.e. it models the similarity between different individuals. The individual kernels capture the variation specific to each person. Each of these can be built from a basic kernel, such as a radial basis function (RBF) kernel, with its own hyperparameters.
Then you combine the two kernels by summing them. Mathematically, if K_s is the shared kernel and K_i is the individual-specific kernel for individual i, the combined kernel is K_total = K_s + K_i. In practice the individual term should only be active when comparing two observations from the same individual, which you can enforce with an indicator on the individual index: K_total((t, i), (t', i')) = K_s(t, t') + 1[i = i'] · K_i(t, t').
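Here is a from-scratch sketch of that combined kernel in plain NumPy, so the structure is explicit; the inputs are (time, individual id) pairs, and all hyperparameter values are placeholder assumptions:

```python
import numpy as np

def rbf(t1, t2, lengthscale, variance):
    """Squared-exponential (RBF) kernel on scalar time inputs."""
    d2 = (t1[:, None] - t2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def k_total(t1, id1, t2, id2,
            ls_shared=2.0, var_shared=1.0,
            ls_indiv=0.5, var_indiv=0.5):
    """K_total = K_s + 1[same individual] * K_i."""
    K_s = rbf(t1, t2, ls_shared, var_shared)             # shared across everyone
    K_i = rbf(t1, t2, ls_indiv, var_indiv)               # individual-specific part
    same = (id1[:, None] == id2[None, :]).astype(float)  # indicator mask
    return K_s + same * K_i

# Example: covariance matrix for a small mixed dataset
t   = np.array([9.25, 13.5, 9.3, 16.75])  # visit times (hours)
ids = np.array([0, 0, 1, 2])              # which individual each visit belongs to
K = k_total(t, ids, t, ids) + 1e-6 * np.eye(len(t))  # jitter for stability
```

Plugging this kernel into standard GP regression gives you a single model over all individuals: covariance between different individuals flows through K_s, while within-individual structure gets the extra K_i term, and irregular visit times pose no problem because the kernel is evaluated at the raw timestamps. Packages such as GPy and GPflow provide coregionalization kernels that express closely related structure, and their sparse/inducing-point variants help when the total number of observations is large.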