I'm working on a system that downloads articles from various news sites and performs various NLP analyses on the texts. I want to store multiple versions and aspects of each article, including
- The raw HTML
- A cleaned-up text-only version
- CoreNLP output of the article.
Since I want to store the text-only version in Elasticsearch, I thought about storing everything else there as well. I have no Elasticsearch experience, so I can't tell which of these is the better way to store it all:
- Have one record per article, with the HTML, text and CoreNLP outputs as properties of that article:
{html: '....', text: '....', CoreNLP: '....'}
- Store each type of information in its own type:
/articles/html/1
/articles/text/1
/articles/corenlp/1
etc.
Which one is more common? Is there a third, better option?
> Have one record per article, with the HTML, text and CoreNLP outputs as properties of that article:
It depends on where you want to run CoreNLP, the HTML clean-up, etc. If you want to do this in Elasticsearch, I would use the multi-field type:
https://www.elastic.co/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
If you do it outside of Elasticsearch, which would not be common since this is a good task for Elasticsearch, you could use the multiple-fields approach (one record per article, with each output as its own field).
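As a rough sketch of the one-record-per-article layout, here is what the mapping and a document body might look like, written as plain Python dicts so nothing beyond the standard library is needed. This is only an illustration under some assumptions: it uses the `fields` mapping parameter of recent Elasticsearch versions rather than the 0.90 `multi_field` type from the link, the sub-field name `english` and the helper `build_article_doc` are made up for the example, and the decision not to index `html` and `corenlp` is just one possible choice.

```python
import json

# Hypothetical mapping: one document per article, each aspect as its own field.
# Field names (html, text, corenlp) follow the question; the rest is assumed.
ARTICLE_MAPPING = {
    "mappings": {
        "properties": {
            # Raw HTML is kept for retrieval only, so it is not indexed.
            "html": {"type": "text", "index": False},
            # Cleaned text is searchable, with one extra sub-field that is
            # analyzed differently ("english" is a hypothetical name).
            "text": {
                "type": "text",
                "fields": {
                    "english": {"type": "text", "analyzer": "english"}
                },
            },
            # CoreNLP output is stored as an opaque object, not parsed.
            "corenlp": {"type": "object", "enabled": False},
        }
    }
}


def build_article_doc(html, text, corenlp):
    """Bundle all versions of one article into a single document body."""
    return {"html": html, "text": text, "corenlp": corenlp}


doc = build_article_doc(
    html="<p>Hello world</p>",
    text="Hello world",
    corenlp={"sentences": []},  # placeholder, not real CoreNLP output
)

# These JSON bodies would be sent when creating the index and indexing a doc.
print(json.dumps(ARTICLE_MAPPING, indent=2))
print(json.dumps(doc))
```

With this layout a single lookup by article id returns every version at once, and only the `text` field takes part in full-text search.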