I have the following DataFrame (named df2
later):
recipe_id ingredients
0 3332 [11307, 11322, 11632, 11338, 11478, 11438]
1 3333 [11322, 11338, 11632, 11314, 11682, 11478, 108...
2 3334 [11632, 11682, 11338, 11337, 10837, 11435, 113...
3 3335 [11149, 11322, 11532, 11996, 10616, 10837, 113...
4 3336 [11330, 11632, 11422, 11256, 11338, 11314, 114...
5 3812 [959, 92, 3, 554, 12271, 202]
...
I want to create another DataFrame which will have following colums : ['ingredients', "recipe_id1", "recipe_id2", ..., "recipe_idn"]
, where n is total number of recipes in database. I did that with the following snippet:
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
After i create this DataFrame (which i did already), and populate it (problem i'm having), output should look like this:
In [1]:
# Create and populate ingredient_df by some method
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
ingredient_df = populate_df(ingredient_df, df2)
Out [1]:
In [2]:
ingredient_df
Out[2]:
ingredient ... 3332 3333 3334 3335 3336 ...
...
11322 ... 1 1 0 1 0 ...
...
In the example above, value at (11322, 3334)
is 0
because ingredient 11322 is not present in recipe with id 3334.
In other words, I want for every ingredient to have mapping (ingredient, recipe_id) = 1
if ingredient is present in that recipe, and 0
otherwise.
I've managed to do this by iterating over all recipes and trough all ingredient, but this is very slow. How can i do this in more robust and elegant way using Pandas methods (if this is possible at all)?
setup
method 1
method 2