featuretools: why does dfs() do redundant calculation?

Question

83 views Asked by user3595632 At 13 October 2020 at 01:39

Below is the example code from the official docs:

import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=["sum", "mode"],
    trans_primitives=["cum_max", "month", "cum_count"],
    max_depth=2
)

feature_defs

>> 
[<Feature: zip_code>,
 ....
 <Feature: MODE(sessions.device)>,
 <Feature: MODE(transactions.sessions.device)>,
 ...
 ]

After analyzing the calculation graphs from graph_feature(), it looks like MODE(sessions.device) and MODE(transactions.sessions.device) are the same feature, even though they are calculated in different ways. If I'm right, why does dfs() calculate this redundantly?

There is 1 answer

**Frances Hartwell** · Accepted Answer · 2020-10-13T20:16:37+00:00

Thanks for the question! While they look similar, these are actually different features. MODE(sessions.device) is the mode of devices over all sessions for a customer while MODE(transactions.sessions.device) is the mode of devices over all transactions for a customer.

As a quick example to demonstrate the difference, let's say a customer has 3 sessions:

session_id        device
------------------------
         A        Mobile
         B            PC
         C            PC

There are also 5 transactions, each associated with one of these sessions:

transaction_id      session_id     sessions.device
--------------------------------------------------
             0               A              Mobile
             1               A              Mobile
             2               A              Mobile
             3               B                  PC
             4               C                  PC

In this case, MODE(sessions.device) would be PC, because two of the three sessions used a PC. MODE(transactions.sessions.device) would be Mobile, because three of the five transactions are associated with Session A. In the feature graphs, the key difference is that MODE(transactions.sessions.device) first joins on the transactions entity. Even if you then group by sessions, you won't end up with what you started with, since each transaction now has its own value.
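The same arithmetic can be reproduced with plain pandas. This is only a rough sketch of the idea, not how Featuretools computes the features internally; the customer_id column and its value 1 are assumptions added for illustration, since the tables above show a single unnamed customer.

```python
import pandas as pd

# The three sessions from the example. customer_id is a hypothetical
# column added so there is something to group by.
sessions = pd.DataFrame({
    "session_id": ["A", "B", "C"],
    "customer_id": [1, 1, 1],
    "device": ["Mobile", "PC", "PC"],
})

# The five transactions, each tied to one session.
transactions = pd.DataFrame({
    "transaction_id": [0, 1, 2, 3, 4],
    "session_id": ["A", "A", "A", "B", "C"],
})


def mode(series):
    # Series.mode() returns all most-frequent values; take the first.
    return series.mode().iloc[0]


# Analogue of MODE(sessions.device): one row per session.
mode_sessions_device = sessions.groupby("customer_id")["device"].agg(mode)

# Analogue of MODE(transactions.sessions.device): the device is first
# joined onto every transaction, so there is one row per transaction.
joined = transactions.merge(sessions, on="session_id", how="left")
mode_transactions_sessions_device = (
    joined.groupby("customer_id")["device"].agg(mode)
)

print(mode_sessions_device.loc[1])               # PC
print(mode_transactions_sessions_device.loc[1])  # Mobile
```

The two results differ only because the join repeats Session A's device once per transaction, which changes the counts that the mode is taken over.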
