Creation of name_tab:
CREATE TABLE name_tab (country string,
new_item ARRAY<STRUCT<ngram:array<string>,
estfrequency:double>>);
Insert statement:
INSERT OVERWRITE TABLE name_tab
SELECT country, ngrams(sentences(var2),3,100) as word_map
FROM bdd
GROUP BY country;
Creation of name_tab_new:
CREATE TABLE name_tab_new (country string, ngram1 string, ngram2 string, ngram3 string, estfrequency double);
Insert statement:
INSERT OVERWRITE TABLE name_tab_new
SELECT country , X.ngram[0], X.ngram[1], X.ngram[2], X.estfrequency
FROM name_tab
LATERAL VIEW explode(new_item) Z as X;
These Hive queries work: they produce n-grams per country.
The problem: for a given country, the n-gram {aa, bb, cc} is treated as different from the n-gram {bb, aa, cc}.
I want a solution in which word order doesn't matter. For a given country, I want no difference between {aa, bb, cc} and {bb, aa, cc}; I want to keep just one of them, with their frequencies added together.
Thank you very much
Results example:
*England, bread, sandwich, juice, 120
England, desk, chair, tool, 54
England, sandwich, bread, juice, 32
Italy, sea, Roma, Coliseo, 47*
What I actually want is:
*England, bread, sandwich, juice, 152
England, desk, chair, tool, 54
Italy, sea, Roma, Coliseo, 47*
I am hoping there is an option in the ngrams function to ignore word order.
In the table bdd, the column "var2" is a list of several words separated by spaces.
In the table name_tab, we have:
First line: England, {"ngram":["bread","sandwich","juice"],"estfrequency":120.0}, {"ngram":["desk","chair","tool"],"estfrequency":54.0}, {"ngram":["sandwich","bread","juice"],"estfrequency":32.0}
Second line: Italy, {"ngram":["sea","Roma","Coliseo"],"estfrequency":47.0}
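To make the desired merge concrete, here is a minimal sketch of one possible approach. It does not rely on any option of ngrams; instead it assumes that sorting the words of each n-gram with the built-in sort_array function is acceptable, so that {bread, sandwich, juice} and {sandwich, bread, juice} both become the same sorted triple, and then sums estfrequency per country and triple. The subquery alias s is only illustrative. Note two consequences of this assumption: the words in each output row come out in sorted order (for example England, bread, juice, sandwich, 152) rather than in their original order, and the sums are still estimates, since ngrams itself only returns estimated frequencies for the top 100 n-grams.

```sql
-- Sketch: order-insensitive n-grams per country.
-- Assumes name_tab and name_tab_new exist as created above.
INSERT OVERWRITE TABLE name_tab_new
SELECT s.country,
       s.ngram1,
       s.ngram2,
       s.ngram3,
       SUM(s.estfrequency) AS estfrequency   -- e.g. 120 + 32 = 152
FROM (
  SELECT country,
         -- sort_array puts the words in a canonical order, so
         -- permutations of the same three words become identical
         sort_array(X.ngram)[0] AS ngram1,
         sort_array(X.ngram)[1] AS ngram2,
         sort_array(X.ngram)[2] AS ngram3,
         X.estfrequency         AS estfrequency
  FROM name_tab
  LATERAL VIEW explode(new_item) Z AS X
) s
GROUP BY s.country, s.ngram1, s.ngram2, s.ngram3;
```

Because the merge happens after ngrams has already kept its top 100 per country, a permutation that fell outside the top 100 is not counted in the sum.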