Ngrams that ignore word order in Hive


Creation of name_tab:

CREATE TABLE name_tab (country string,
new_item ARRAY<STRUCT<ngram:array<string>,
estfrequency:double>>);

Insert statement:

INSERT OVERWRITE TABLE name_tab
SELECT country, ngrams(sentences(var2),3,100) as word_map
FROM bdd 
GROUP BY country;

Creation of name_tab_new:

CREATE TABLE name_tab_new (country string, ngram1 string, ngram2 string,  ngram3 string, estfrequency double);

Insert statement:

INSERT OVERWRITE TABLE name_tab_new
SELECT country , X.ngram[0], X.ngram[1], X.ngram[2], X.estfrequency
FROM name_tab
LATERAL VIEW explode(new_item) Z as X;

These queries work in Hive. They create ngrams by country. The problem: for a given country, the ngram {aa, bb, cc} and the ngram {bb, aa, cc} are treated as two different ngrams.

I want a solution in which word order doesn't matter. For a given country, I want no difference between {aa, bb, cc} and {bb, aa, cc}; I want just one of them, with their frequencies added together.

Thank you very much

Results example:

*England, bread, sandwich, juice, 120

England, desk, chair, tool, 54

England, sandwich, bread, juice, 32

Italy, sea, Roma, Coliseo, 47*

What I actually want is:

*England, bread, sandwich, juice, 152

England, desk, chair, tool, 54

Italy, sea, Roma, Coliseo, 47*

I hope there is an option in the ngrams function to ignore word order.

In the table bdd, the variable "var2" is a list of several words separated by blanks.

In the table name_tab, we have:

First line England, {"ngram":["bread","sandwich","juice"],"estfrequency":120.0}, {"ngram":["desk","chair","tool"],"estfrequency":54.0}, {"ngram":["sandwich","bread","juice"],"estfrequency":32.0}

Second line Italy, {"ngram":["sea","Roma","Coliseo"],"estfrequency":47.0}


There is 1 answer

David דודו Markovitz

Demo

with t as (select 'a  b a c c a b b a a a a c c b c a b c a b' as mycol)

select      sort_array(e.ngram) as ngram
           ,sum(e.estfrequency) as estfrequency
from       (select  explode(ngrams(sentences(mycol),2,1000)) e
            from    t
            ) t
group by    sort_array(e.ngram)
;

+-----------+--------------+
|   ngram   | estfrequency |
+-----------+--------------+
| ["a","a"] | 3.0          |
| ["a","b"] | 6.0          |
| ["a","c"] | 5.0          |
| ["b","b"] | 1.0          |
| ["b","c"] | 3.0          |
| ["c","c"] | 2.0          |
+-----------+--------------+
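
The same idea can be applied to the tables from the question. What follows is only a sketch, not something tested against the asker's data: it assumes name_tab and name_tab_new exist exactly as defined in the question, and the subquery alias t and the column alias ngram are names picked here for illustration. Each ngram is sorted with sort_array so that {bread, sandwich, juice} and {sandwich, bread, juice} become the same array, and the estimated frequencies are then summed per country.

```sql
-- Sketch only: assumes name_tab and name_tab_new as created in the question.
INSERT OVERWRITE TABLE name_tab_new
SELECT   t.country
        ,t.ngram[0]
        ,t.ngram[1]
        ,t.ngram[2]
        ,SUM(t.estfrequency)
FROM    (SELECT  country
                ,sort_array(X.ngram) AS ngram        -- order-independent form of the ngram
                ,X.estfrequency      AS estfrequency
         FROM    name_tab
         LATERAL VIEW explode(new_item) Z AS X
        ) t
GROUP BY t.country
        ,t.ngram                                      -- same words in any order fall into one group
;
```

With the example data, the two England rows with frequencies 120 and 32 would fall into one group with a summed frequency of 152. Note that the three word columns then hold the words in sorted order rather than in the order of the most frequent variant, so this matches the desired result only if the order of the displayed words is not important.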