text2vec's vocab_vectorizer output is the function itself

145 views Asked by maloneypatr At 22 May 2020 at 13:47

I am trying to run through text2vec's example on this page. However, whenever I inspect what the vocab_vectorizer function returned, R just prints a function definition. In all my years of R coding I've never seen this before, and it feels odd enough that the cause may extend beyond this one function. Any pointers?

> library(text2vec)
> library(data.table)
> data("movie_review")
> setDT(movie_review)
> setkey(movie_review, id)
> set.seed(2016L)
> all_ids <- movie_review$id
> train_ids <- sample(all_ids, 4000)
> test_ids <- setdiff(all_ids, train_ids)
> train <- movie_review[J(train_ids)]
> test <- movie_review[J(test_ids)]
> 
> prep_fun <- tolower
> tok_fun <- word_tokenizer
> 
> it_train <- itoken(train$review, 
+                   preprocessor = prep_fun, 
+                   tokenizer = tok_fun, 
+                   ids = train$id, 
+                   progressbar = FALSE)
> vocabulary <- create_vocabulary(it_train)
> 
> vec <- text2vec::vocab_vectorizer(vocabulary = vocabulary)
> vec
function (iterator, grow_dtm, skip_grams_window_context, window_size, 
    weights, binary_cooccurence = FALSE) 
{
    vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term, 
        attr(vocabulary, "ngram")[[1]], attr(vocabulary, "ngram")[[2]], 
        attr(vocabulary, "stopwords"), attr(vocabulary, "sep_ngram"))
    setattr(vocab_corpus_ptr, "ids", character(0))
    setattr(vocab_corpus_ptr, "class", "VocabCorpus")
    corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context, 
        window_size, weights, binary_cooccurence)
}
<bytecode: 0x7f9c2e3f7380>
<environment: 0x7f9c18970970>
>


**Mohanasundaram** · Accepted Answer · 2020-05-22T15:30:36+00:00

The output of vocab_vectorizer is supposed to be a function. I ran the function from the example in the documentation as below:

library(text2vec)

data("movie_review")
N = 100
# hash-based vectorizer from the same documentation example
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)

# vocabulary-based vectorizer: build the vocabulary first, then wrap it
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L) )

vectorizer = vocab_vectorizer(v)

The output of vocab_vectorizer:

> vectorizer
function (iterator, grow_dtm, skip_grams_window_context, window_size, 
    weights, binary_cooccurence = FALSE) 
{
    vocab_corpus_ptr = cpp_vocabulary_corpus_create(vocabulary$term, 
        attr(vocabulary, "ngram")[[1]], attr(vocabulary, 
            "ngram")[[2]], attr(vocabulary, "stopwords"), 
        attr(vocabulary, "sep_ngram"))
    setattr(vocab_corpus_ptr, "ids", character(0))
    setattr(vocab_corpus_ptr, "class", "VocabCorpus")
    corpus_insert(vocab_corpus_ptr, iterator, grow_dtm, skip_grams_window_context, 
        window_size, weights, binary_cooccurence)
}
<bytecode: 0x00000147ada65218>
<environment: 0x00000147b2a6dc38>

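The documentation says of this returned function: "It supposed to be used only as argument to create_dtm, create_tcm, create_vocabulary". In other words, you are not meant to inspect or call it directly; you pass it on to one of those functions.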
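Why printing the object shows source code can be seen with a plain base R function factory. The sketch below is only an illustration: make_counter and its tokens argument are hypothetical names, not part of text2vec, and the counting logic is a stand-in for what a real vectorizer does.

```r
# Hypothetical stand-in for vocab_vectorizer(): a function that returns a function.
make_counter <- function(vocabulary) {
  force(vocabulary)  # capture the vocabulary in the returned function's environment
  function(tokens) {
    # count how often each vocabulary term occurs in `tokens`;
    # tokens outside the vocabulary are dropped
    table(factor(tokens, levels = vocabulary))
  }
}

counter <- make_counter(c("good", "bad", "movie"))

is.function(counter)  # the returned value is itself a function
counter               # printing it shows the function body and its <environment: ...>

# The captured vocabulary lives in that environment
get("vocabulary", envir = environment(counter))

# Nothing is computed until the returned function is actually called
counts <- counter(c("good", "movie", "movie", "plot"))
counts
```

The `<environment: ...>` line in the printed output is the hint that the function carries captured data with it, which is the same pattern seen in the vocab_vectorizer output above.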

Finally, when I ran create_dtm(it, vectorizer), I got the expected document-term matrix:

> create_dtm(it, vectorizer)
100 x 5356 sparse Matrix of class "dgCMatrix"
   [[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]
   [[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]

1  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
2  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
3  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
4  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . ......
5  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . ......
6  . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
7  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
8  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
9  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . ......
10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

 ..............................
 ........suppressing 5304 columns and 81 rows in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................
   [[ suppressing 52 column names ‘0.3’, ‘02’, ‘10,000,000’ ... ]]

92  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
93  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
94  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95  . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . ......
96  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . ......
97  . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
98  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
99  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
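The same flow can be reproduced on a tiny corpus, which makes the result easier to read than the 100 x 5356 matrix above. This is a minimal sketch assuming text2vec is installed; the two toy documents and the names docs, vocab and dtm are made up for illustration and are not from the question.

```r
library(text2vec)

# Toy corpus (hypothetical, not the movie_review data)
docs <- c(d1 = "the cat sat", d2 = "the dog sat down")

it <- itoken(docs,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = names(docs),
             progressbar = FALSE)

vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# The vectorizer is only a recipe; create_dtm() applies it to the iterator
dtm <- create_dtm(it, vectorizer)
dim(dtm)       # one row per document, one column per vocabulary term
as.matrix(dtm) # small enough to view densely
```

For the objects in the question, the equivalent call would be create_dtm(it_train, vec).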

I hope this answers your question.
