Below is the example code from the official docs:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=["sum", "mode"],
    trans_primitives=["cum_max", "month", "cum_count"],
    max_depth=2
)
feature_defs
>>
[<Feature: zip_code>,
....
<Feature: MODE(sessions.device)>,
<Feature: MODE(transactions.sessions.device)>,
...
]
After analyzing the calculation shown by graph_feature(), it looks like MODE(sessions.device) and MODE(transactions.sessions.device) are the same even though they are calculated in different ways. If I'm right, why does dfs calculate this redundantly?
Thanks for the question! While they look similar, these are actually different features.
MODE(sessions.device) is the mode of devices over all sessions for a customer, while MODE(transactions.sessions.device) is the mode of devices over all transactions for a customer.
As a quick example to demonstrate the difference, let's say a customer has 3 sessions:
There are also 5 transactions, each associated with one of these sessions:
In this case, MODE(sessions.device) would be PC, but MODE(transactions.sessions.device) would be Mobile because there are more transactions associated with Session A, the Mobile session. In the feature graphs, the key difference is that MODE(transactions.sessions.device) first joins on the transactions entity. Even if you group by sessions, you won't end up with what you started with, since each transaction now has its own value.
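The difference can be reproduced outside featuretools with plain pandas. The sketch below uses made-up data consistent with the description above (Session A on Mobile, the other two sessions on PC, and three of the five transactions in Session A). The column names, the session labels B and C, and the exact transaction split are assumptions, and the merge/groupby only imitates what the two features compute; it is not how featuretools implements them internally.

```python
import pandas as pd

# Hypothetical data: one customer with three sessions.
# Session A is on Mobile; sessions B and C are on PC.
sessions = pd.DataFrame({
    "session_id": ["A", "B", "C"],
    "customer_id": [1, 1, 1],
    "device": ["Mobile", "PC", "PC"],
})

# Hypothetical data: five transactions, three of them in session A.
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "session_id": ["A", "A", "A", "B", "C"],
})


def first_mode(values: pd.Series) -> str:
    """Return the most frequent value (first one if there is a tie)."""
    return values.mode().iloc[0]


# Analogue of MODE(sessions.device): each session counts once.
mode_sessions = sessions.groupby("customer_id")["device"].agg(first_mode)

# Analogue of MODE(transactions.sessions.device): the device is joined
# onto every transaction first, so each transaction counts once.
joined = transactions.merge(sessions, on="session_id")
mode_transactions = joined.groupby("customer_id")["device"].agg(first_mode)

print(mode_sessions.loc[1])      # PC
print(mode_transactions.loc[1])  # Mobile
```

After the join, the device column has one row per transaction (three Mobile, two PC), so the session with the most transactions dominates the mode. That weighting is what makes the two features differ.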