I've a dataset of firms with year from 1960-1989 and month (1-12) that looks like this.
| PERMNO. | Year | Month |
|---|---|---|
| 10057 | 1960 | 1 |
| 10057 | 1960 | 2 |
| 10057 | 1960 | 3 |
| 10057 | 1960 | 4 |
| 10057 | 1960 | 5 |
| 10057 | 1960 | 6 |
| 10057 | 1960 | 7 |
| 10057 | 1960 | ... |
| 10057 | 1961 | 1 |
| 10057 | 1961 | 2 |
| 10057 | 1961 | 3 |
| 10057 | 1961 | 4 |
| 10057 | 1961 | 5 |
| 10057 | 1961 | 6 |
| 10057 | 1961 | 7 |
| 10057 | 1961 | ... |
| 10057 | 1989 | 1 |
| 10057 | 1989 | 2 |
| 10057 | 1989 | 3 |
| 10057 | 1989 | 4 |
| 10057 | 1989 | 5 |
| 10057 | 1989 | 6 |
| 10057 | 1989 | 7 |
| 10057 | 1989 | ... |
| 11857 | 1960 | 1 |
| 11857 | 1960 | 2 |
| 11857 | 1960 | 3 |
| 11857 | 1960 | 4 |
| 11857 | 1960 | 5 |
| 11857 | 1960 | 6 |
| 11857 | 1960 | 7 |
| 11857 | 1960 | ... |
What I'm trying to do is for each Month of July for each firm (PERMNO), I want all the rows before it. For example, For firm 10057, for July 1960, I want all rows beginning from Jan 1960-July 1960 and for next for the same firm 10057, for July 1961, I want all rows beginning from Jan 1960-July 1961 etc etc. Something like this:
| PERMNO. | Year | Month |
|---|---|---|
| 10057 | 1960 | 1 |
| 10057 | 1960 | 2 |
| 10057 | 1960 | 3 |
| 10057 | 1960 | 4 |
| 10057 | 1960 | 5 |
| 10057 | 1960 | 6 |
| 10057 | 1960 | 7 |
| 10057 | 1960 | 1 |
| 10057 | 1960 | 2 |
| 10057 | 1960 | 3 |
| 10057 | 1960 | ... |
| 10057 | 1961 | 4 |
| 10057 | 1961 | 5 |
| 10057 | 1961 | 6 |
| 10057 | 1961 | 7 |
| 10057 | 1960 | 1 |
| 10057 | 1960 | ... |
| 10057 | 1961 | 1 |
| 10057 | 1961 | ... |
| 10057 | 1962 | 6 |
| 10057 | 1962 | 7 |
and so on for each firm and within each firm, all Julys.
I've written a loop inside a loop. The first firm worked but starting from the second, the loop started to return all rows from the beginning 1960 to the end 1980. Not sure where I did wrong for the loop.
This is my code:
permnolist = CRSP2.drop_duplicates('PERMNO').PERMNO.to_list()
LL = []
for indexer in permnolist:
df = CRSP2[CRSP2['PERMNO']==indexer]
july_list = df[df['Month']==7].Month.index.to_list()
for indexer2 in july_list:
df2 = df.iloc[:indexer2]
LL.append(df2)
I made a list of all unique firms. The first loop attempts to get a dataframe of the first firm and the second loop uses the index for the row where Month=July (Yes, I've reset the index). I use iloc to take all rows before July. The first firm worked beautifully but once it looped to the second firm, it started returning all rows for that firm from beginning year 1960 all the way to the end 1980 for many times.
Given you input example data you could do something like this to avoid loops at all:
to output exactly what you describe as output: