I have a data set (Facebook posts) (via netvizz) and I use the quanteda package in R. Here is my R code.
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read File
# Facebooks posts could be generated by FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <-as.character(FB_com$comment_message) #one column with 2700 entries
# Define as corpus
fb_corp <-corpus(fb_test)
class(fb_corp)
# LIWC Application
fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
View(fb_liwc)
Everything works until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message? Are there any suggestions to solve the problem?
There was a bug in quanteda version 0.7.2 that caused
dfm()
to fail when using a dictionary when one of the documents contains no features. Your example fails because in the cleaning stage, some of the Facebook post "documents" end up having all of their features removed through the cleaning steps.This is not only fixed in 0.8.0, but also we changed the underlying implementation of dictionaries in
dfm()
, resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and the regular expressions still mean that it is much slower to use than simply indexing tokens. We will work on optimising this further.)It will also work if a document contains zero features after tokenization and cleaning, which is probably what is breaking the older
dfm
you are using with your Facebook texts.