Elasticsearch schema for multiple versions of the same text

Question

Asked by zmbq on 08 June 2015 at 20:44

I'm working on a system that downloads articles from various news sites and performs various NLP analyses on the texts. I want to store multiple versions and aspects of each article, including:

- The raw HTML
- A cleaned-up text-only version
- CoreNLP output of the article

Since I want to store the text-only version in Elasticsearch, I thought about storing everything else in Elasticsearch as well. I have no Elasticsearch experience, so I can't tell which is the better way to store these:

1. Have one record per article, with the HTML, text and CoreNLP outputs as properties of that article: {html: '....', text: '....', CoreNLP: '....'}
2. Store each kind of information in its own type: /articles/html/1, /articles/text/1, /articles/corenlp/1, etc.

Which one is more common? Is there a third, better option?
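
For concreteness, the two layouts could be sketched as plain Python data, with no Elasticsearch client involved. The article id, the sample content, the `article` type name used for option 1 and the `content` key are all hypothetical placeholders; only the option 2 paths come from the question itself. Note that option 2 relies on having several mapping types in one index, which later Elasticsearch versions no longer support.

```python
import json

ARTICLE_ID = 1  # hypothetical id, for illustration only

# Option 1: one document per article, each version stored as a property.
single_document = {
    "html": "<html><body><p>Example article.</p></body></html>",
    "text": "Example article.",
    "corenlp": "...",  # placeholder for the CoreNLP output
}
# The "article" type name in this path is an assumption, not from the question.
option_1 = {f"/articles/article/{ARTICLE_ID}": single_document}


def split_by_type(index, doc_id, document):
    """Option 2: one small document per version, addressed by its own type."""
    return {
        f"/{index}/{kind}/{doc_id}": {"content": value}
        for kind, value in document.items()
    }


option_2 = split_by_type("articles", ARTICLE_ID, single_document)

print(json.dumps(option_1, indent=2))
print(json.dumps(option_2, indent=2))
```

With option 1 a single request retrieves every version of an article, while option 2 needs one request per version.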

There is 1 answer

**Jettro Coenradie** · Answer 1 · 2015-06-08T22:07:10+00:00

It depends on where you want to do the CoreNLP processing, the HTML clean-up, etc. If you want to do this in Elasticsearch, I would use the multi-field type:

https://www.elastic.co/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html

If you do it outside of Elasticsearch, which would be less common since this is a good task for Elasticsearch, you could use the multiple fields approach.
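
As a rough illustration of the multi-field idea, here is a minimal sketch of an index body built as a Python dictionary. It assumes the newer syntax in which any field can carry a `fields` sub-mapping and mappings are typeless; the linked 0.90 documentation instead describes a dedicated `multi_field` type, and releases before 5.0 use `string` rather than `text`. The field name `html` and the analyzer names `html_standard` and `html_english` are made up for this example. The `html_strip` character filter only affects the indexed tokens, so the raw HTML still comes back in `_source`.

```python
import json

# Two custom analyzers that both strip HTML markup before tokenizing.
# The analyzer names are hypothetical; html_strip, standard, lowercase and
# porter_stem are built-in Elasticsearch components.
analysis = {
    "analyzer": {
        "html_standard": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            "filter": ["lowercase"],
        },
        "html_english": {
            "type": "custom",
            "char_filter": ["html_strip"],
            "tokenizer": "standard",
            "filter": ["lowercase", "porter_stem"],
        },
    }
}

# One source field ("html") indexed two ways: the main field and a
# stemmed sub-field that is queried as "html.english".
index_body = {
    "settings": {"analysis": analysis},
    "mappings": {
        "properties": {
            "html": {
                "type": "text",
                "analyzer": "html_standard",
                "fields": {
                    "english": {"type": "text", "analyzer": "html_english"}
                },
            }
        }
    },
}

# This JSON is what would be sent as the body of an index-creation request.
print(json.dumps(index_body, indent=2))
```

The appeal of this layout is that the article is sent once and Elasticsearch derives the differently analyzed versions at index time. Output produced outside Elasticsearch, such as CoreNLP results, would still need a field of its own.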
