I'm trying to download all the minute bars between two dates for a list of stock symbols using polygon.io. According to their API documentation, each request is limited to 50,000 results.
From their GitHub issues, I've found the following comment:
The aggregates endpoint does not have the next_url for pagination. Instead, if there are more than 50,000 messages in the response, you will need to query a smaller time frame of data. I recommend querying 1 months worth of minute bars per query.
So here's what I've done so far:
Return a list of symbols:
from polygon import RESTClient
import os.path
from IPython.display import display
import pandas as pd

key = ''
all_tickers = []
df_list = []
final_df = []
from_ = '2021-05-01'
to = '2022-12-01'

def get_tickers():
    with RESTClient(key) as client:
        next_url = None
        while True:
            if next_url is None:
                tickers = client.reference_tickers_v3(type="CS")
            else:
                tickers = client._handle_response("ReferenceTickersV3ApiResponse", next_url, {})
            all_tickers.extend(tickers.results)
            if hasattr(tickers, 'next_url'):
                next_url = tickers.next_url
            else:
                break

file_name = 'tickers.csv'
if not os.path.exists(file_name):
    get_tickers()
    all_tickers_copy = pd.DataFrame(all_tickers)
    all_tickers_copy.to_csv(file_name, index=False)
else:
    all_tickers = pd.read_csv(file_name)
    all_tickers = all_tickers['ticker']
Return a list with the start and end days of each month between the from_ and to dates:
import pandas as pd

start_date, end_date = from_, to
dtrange = pd.date_range(start=start_date, end=end_date, freq='d')
months = pd.Series(dtrange.month)
starts, ends = months.ne(months.shift(1)), months.ne(months.shift(-1))
df = pd.DataFrame({'month_starting_date': dtrange[starts].strftime('%Y-%m-%d'),
                   'month_ending_date': dtrange[ends].strftime('%Y-%m-%d')})
# as a list of lists (the first row holds the column names):
months = [df.columns.values.tolist()] + df.values.tolist()
months = pd.DataFrame(months)
I then have a function which loops through my symbols and makes an API request for every month between from_ and to:
def get_daily_agg(from_, to, ticker):
    with RESTClient(key) as client:
        folder_name = 'intraday_bars_gapped_new'
        final_df = pd.DataFrame([])
        try:
            # skip the header row and loop through the month ranges
            for index, row in months[1:].iterrows():
                # save the start and end dates as variables
                from_ = row[0]
                to = row[1]
                print(f'{to} and {from_}')
                r = client.stocks_equities_aggregates(ticker, 1, "minute", from_, to, unadjusted=False, limit='50000')
                print(f'downloading {ticker} from {from_} to {to}')
                df = pd.DataFrame(r.results)
                df = df[['t', 'v', 'o', 'c', 'h', 'l', 'vw']]
                df.columns = ['datetime', 'volume', 'open', 'close', 'high', 'low', 'vwap']
                df['datetime'] = pd.to_datetime(df['datetime'], unit='ms')
                df['time'] = df['datetime'].dt.strftime("%H:%M:%S")
                df['date'] = df['datetime'].dt.strftime("%Y-%m-%d")
                # DataFrame.append returns a new frame rather than modifying in place,
                # so reassign (pd.concat is the preferred form in recent pandas)
                final_df = pd.concat([final_df, df])
        except Exception:
            print(f'nothing found for {ticker} from {from_} to {to}')
        if not os.path.exists(folder_name):
            os.makedirs(folder_name)
        final_df.to_csv('{}/{}.csv'.format(folder_name, ticker), index=False)
import glob
from pathlib import Path
folder = "daily_bars_filtered/*.csv"
for fname in glob.glob(folder)[:20]:
    ticker = Path(fname).stem
    get_daily_agg(from_, to, ticker)
My question is: how do I properly paginate the results from the polygon.io API?
Update: this has been addressed. As of Polygon's Python client v1.10.1, the documentation includes examples (see examples/rest/stocks-aggregates_bars.py in the newer versions) where a for loop handles paginated responses.
Here we can use list_aggs() to iterate over the pages of the result and process each bar (in this example, by appending them to a list).
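A minimal sketch along the lines of that documented example (the ticker, date range, and key variable below are placeholders, not from the original post):

from polygon import RESTClient

client = RESTClient(key)  # key is assumed to hold your polygon.io API key

aggs = []
# list_aggs() follows the pagination cursor for you and yields one bar at a time
for a in client.list_aggs("AAPL", 1, "minute", "2021-05-01", "2022-12-01", limit=50000):
    aggs.append(a)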
Previously...
Prior to the Python client support documented above, I considered two workarounds for accomplishing this. I'll preserve them here, especially since you may find using HTTP requests (instead of the Python client) more performant and/or more quickly adaptable to API updates.
Using HTTP requests instead of the Python client
The Polygon.io REST API can be accessed via a standard HTTP request, whose response you can parse as JSON. If there is data, it will be under the 'results' key; if there is an additional page, its URL will be under the 'next_url' key. So you can make an HTTP request for the data and use a while loop to make additional requests for each 'next_url' (while one exists).
For example:
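A sketch of that approach, assuming the requests library and the v2 aggregates endpoint; the helper name and parameters are illustrative:

import requests

def fetch_minute_bars(ticker, from_, to, api_key):
    # first request: the full date range, capped at 50,000 results per page
    url = (f"https://api.polygon.io/v2/aggs/ticker/{ticker}"
           f"/range/1/minute/{from_}/{to}"
           f"?adjusted=true&sort=asc&limit=50000")
    results = []
    while url:
        # the API key has to be sent with every request, including 'next_url' ones
        data = requests.get(url, params={"apiKey": api_key}).json()
        results.extend(data.get("results", []))
        # 'next_url' is only present when another page of results exists
        url = data.get("next_url")
    return results

bars = fetch_minute_bars("AAPL", from_, to, key)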
Multiple requests divided by date ranges
As noted in the question, you can separate your requests into date ranges so that each request returns fewer than the limit of 50,000 results. The major problem with this approach is that it can greatly increase the number of requests you need to make compared to using the 'next_url' cursor, especially considering most polygon.io accounts are limited in the number of requests they can make in a given timeframe.
That said, it is doable. It's easiest to implement a solution that makes a request per month or per two months. Something like:
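For instance, a sketch using dateutil's rrule to build the monthly windows and the same HTTP endpoint as above (function and variable names are illustrative); the interval parameter lets you widen each window to two months, as discussed below:

from datetime import datetime
from dateutil.relativedelta import relativedelta
from dateutil.rrule import rrule, MONTHLY
import requests

def fetch_by_month(ticker, start, end, api_key, interval=1):
    # one request per window of `interval` months
    results = []
    for window_start in rrule(freq=MONTHLY, interval=interval, dtstart=start, until=end):
        window_end = window_start + relativedelta(months=interval, days=-1)
        url = (f"https://api.polygon.io/v2/aggs/ticker/{ticker}/range/1/minute/"
               f"{window_start:%Y-%m-%d}/{window_end:%Y-%m-%d}")
        data = requests.get(url, params={"apiKey": api_key,
                                         "adjusted": "true",
                                         "limit": 50000}).json()
        results.extend(data.get("results", []))
    return results

bars = fetch_by_month("AAPL", datetime(2021, 5, 1), datetime(2022, 12, 1), key)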
You can probably halve the number of requests required by doing two months at a time. Doing a quick lookup over the last two years of data, the largest one-month set of minute aggregates for a single ticker was AAPL, which had 18,956 records in the range 2022-03-01 to 2022-03-31, so I think you can safely collect two months of data at a time and still stay below the 50,000-record limit. So if you use rrule as in the sketch above, consider setting the interval parameter to 2 (two months), or whatever the equivalent is in the library you are using.
But even at two months at a time, this method is horribly inefficient. There are tickers where you could get the entire 2 years of data in fewer than 3 requests via pagination, whereas even doing two months at a time would require 6 requests for the same data.