How to get word count of SOLR document?

Question

1.9k views · Asked by Calabria on 18 June 2015 at 08:08

I have the binary content of a PDF file, and I want to upload it to SOLR and index its content:

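    // Needs the SolrJ classes ContentStreamUpdateRequest and AbstractUpdateRequest
    // (package org.apache.solr.client.solrj.request)
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest('/update/extract')
    up.setParam("literal.id", map.id)
    def tmpFile = File.createTempFile(map.id, ".tmp")
    try {
        tmpFile.append(binary)
        // The second argument is the content type, not a file extension
        up.addFile(tmpFile, "application/pdf")
        // Do the SOLR stuff here
        def solr = getSolrServer()
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true)
        return solr.request(up)
    } finally {
        // Always clean up the temp file, even if the request fails
        tmpFile.delete()
    }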

When I query SOLR, I can retrieve the SOLR document. How can I get the actual content of the file? Basically, I need to find the word count of the document I've uploaded, so I was planning to call size() on the returned string (if that's even possible).

I'm very new to SOLR, so I'm probably on the wrong track. Any assistance is greatly appreciated :)

There is 1 answer

**jay** · Accepted Answer · 2015-06-18T23:59:00+00:00

I am assuming you want to count the number of words in the PDF you have indexed. Make sure that:

1. The entire extracted content of the PDF is indexed into one field.
2. This field has at least a whitespace tokenizer enabled, so that the text is split into words on whitespace.
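To make the whitespace splitting concrete, here is a minimal client-side sketch in plain Java. It is not Solr's tokenizer, only an approximation of what a whitespace tokenizer would emit, and the class and method names are made up for illustration. It could also serve as a fallback if the extracted text is stored and returned as a string.

```java
// Hypothetical helper, not part of Solr or SolrJ.
class WhitespaceWordCount {

    // Counts whitespace-separated tokens in the given text.
    // Returns 0 for null, empty, or whitespace-only input.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        // One or more whitespace characters (spaces, tabs, newlines) separate words.
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Solr extracts  text\nfrom PDF files"));
    }
}
```

Note that this counts every token, including repeated words and tokens with punctuation attached, so the result depends on how "word" is defined for your use case.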

Once this is done, you can find the number of words using either facets or the Term Vector Component. The Stack Overflow answer below might be helpful:

https://stackoverflow.com/a/26933126/689625
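As a rough sketch of the Term Vector Component route, the request parameters could be assembled as below. This assumes several things not stated in the question: the extracted text lives in a field named `content` (hypothetical) that is indexed with `termVectors="true"`, and a request handler with the Term Vector Component is configured (Solr's example configuration exposes one at `/tvrh`, but yours may differ). The class name is made up for illustration, and the code needs Java 10 or later for this `URLEncoder.encode` overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that only builds a query string; it does not contact Solr.
class TermVectorQuery {

    // Builds the parameters for a term vector request that returns the
    // term frequency (tv.tf) of every term of one field in one document.
    static String build(String docId, String field) {
        // Restrict the result to the single document by its id.
        String q = URLEncoder.encode("id:\"" + docId + "\"", StandardCharsets.UTF_8);
        return "q=" + q
                + "&fl=id"          // only the id is needed from the normal result list
                + "&tv=true"        // enable the Term Vector Component
                + "&tv.tf=true"     // include per-term frequencies
                + "&tv.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
                + "&wt=json";
    }
}
```

For example, `TermVectorQuery.build("doc-1", "content")` gives a string that can be appended to the handler URL after a `?`. Summing the `tf` values in the response should give the total number of tokens in that field, while the number of distinct terms gives the unique word count.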
