Getting log-likelihood from probabilistic suffix tree


Here is my code:

library(RCurl)
library(TraMineR)
library(PST)

x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)

# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)

data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = "NA", right = "*")

# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = F)
logLik(S1)

For some reason, logLik(S1) does not return a log-likelihood value. Why is this the case, and how can I get one?


There are 2 answers

Gilbert On BEST ANSWER

You have bad values for the missing and right arguments in your seqdef command, which then cause an error in pstree.

With

data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")
# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)
logLik(S1)

we get

'log Lik.' -31011.32 (df=47179)

Note that since you have missing values, I have set with.missing = TRUE in the pstree command.

===============

To ignore the missing values on the right (the NA padding after a sequence ends), set right = "DEL" in seqdef.

seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= "DEL")
S2 <- pstree(seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = F)
logLik(S2)

I don't know what PST computes as logLik(S2), or why we get an NA here. The likelihood of generating the data with the tree S2 can be obtained with the predict function, which returns the likelihood of each sequence in the data. The log-likelihood of the data should then be

sum(log(predict(S2, seq)))

which gives

 [>] 984 sequence(s) - min/max length: 1/32
 [!] sequences have unequal lengths
 [>] max. context length: L=6
 [>] found 1020 distinct context(s)
 [>] total time: 0.588 secs
[1] -4925.79
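
To illustrate why summing the logs of the per-sequence likelihoods gives the log-likelihood of the whole data set, here is a minimal base R sketch. The vectors p and tiny are made-up numbers standing in for what predict would return; they are not taken from the data above. The sketch assumes the sequences are treated as independent, so the joint likelihood is the product of the per-sequence likelihoods.

```r
# Hypothetical per-sequence likelihoods (illustrative values only).
p <- c(0.20, 0.05, 0.10)

# Log-likelihood as the sum of the individual logs ...
loglik_sum <- sum(log(p))

# ... equals the log of the product of the likelihoods.
loglik_prod <- log(prod(p))
stopifnot(isTRUE(all.equal(loglik_sum, loglik_prod)))

# With many sequences the product underflows to 0, so its log is -Inf,
# while the sum of logs stays finite. This is why the sum form is used.
tiny <- rep(1e-5, 100)
underflowed <- log(prod(tiny))   # -Inf
stable <- sum(log(tiny))         # finite
```

With close to a thousand sequences, as in the output above, the sum-of-logs form is the one to use.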
Alexis Gabadinho On

Indeed, there was a problem when computing the likelihood of models fitted to sequences of unequal lengths. This is now fixed. The new version of the PST package (0.94) will be available on R-Forge within a few hours, and later on CRAN. To install it from R-Forge:

install.packages("PST", repos = "http://R-Forge.R-project.org")

Note that since your sequences don't contain any missing values but are of unequal lengths, you need neither set with.missing=TRUE when using the pstree function nor pass any option when using seqdef.
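
One way to check whether a wide-format table has real gaps or only trailing NA padding (that is, sequences of unequal lengths) is sketched below in base R. The data frame toy and the helper has_internal_gap are hypothetical names made up for this illustration; they are not part of TraMineR or PST, and the values are not from the data in the question.

```r
# Toy wide-format data: one row per sequence, NA used as right padding.
toy <- data.frame(
  t1 = c("a", "b", "a"),
  t2 = c("b", NA,  "c"),
  t3 = c("c", NA,  NA),
  stringsAsFactors = FALSE
)

# TRUE if a row has an NA located before its last observed value (a real gap).
has_internal_gap <- function(row) {
  observed <- which(!is.na(row))
  if (length(observed) == 0) return(FALSE)
  any(is.na(row[seq_len(max(observed))]))
}

# Rows with gaps inside the sequence (none in this toy example).
internal_gaps <- unname(apply(toy, 1, has_internal_gap))

# Number of observed states per row; differing values mean unequal lengths.
seq_lengths <- unname(apply(toy, 1, function(row) sum(!is.na(row))))
```

If internal_gaps is FALSE for every row while seq_lengths varies, the NAs are only padding, which matches the situation described above.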

Now when running the following code:

library(RCurl)
library(TraMineR)
library(PST)

x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)

data.seq <- seqdef(data[2:nrow(data),2:ncol(data)])

# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6) 

I get:

> S1@logLik
[1] -4925.79