I have a relatively simple FastAPI app that accepts a query and streams back the response from ChatGPT's API. ChatGPT is streaming back the result and I can see this being printed to console as it comes in.

What's not working is streaming that response back to the client via FastAPI's StreamingResponse: the response arrives all at once instead. I'm really at a loss as to why this isn't working.

Here is the FastAPI app code:

import os
import time

import openai

import fastapi
from fastapi import Depends, HTTPException, status, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from fastapi.responses import StreamingResponse

auth_scheme = HTTPBearer()
app = fastapi.FastAPI()

openai.api_key = os.environ["OPENAI_API_KEY"]

def ask_statesman(query: str):
    #prompt = router(query)
    completion_reason = None
    response = ""
    while not completion_reason or completion_reason == "length":
        openai_stream = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": query}],
            stream=True,
        )
        for line in openai_stream:
            completion_reason = line["choices"][0]["finish_reason"]
            if "content" in line["choices"][0].delta:
                current_response = line["choices"][0].delta.content
                print(current_response)
                yield current_response

@app.post("/")
async def request_handler(auth_key: str, query: str):
    if auth_key != "123":
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials",
            headers={"WWW-Authenticate": auth_scheme.scheme_name},
        )
    stream_response = ask_statesman(query)
    return StreamingResponse(stream_response, media_type="text/plain")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="", port=8000, debug=True, log_level="debug")

And here is the very simple file to test this:

import requests

query = "How tall is the Eiffel tower?"
url = "http://localhost:8000"
params = {"auth_key": "123", "query": query}

response = requests.post(url, params=params, stream=True)

for chunk in response.iter_lines():
    if chunk:
        print(chunk.decode("utf-8"))
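One thing I'm double-checking on the reading side (a sketch of my understanding, not code from the app): `requests`' `iter_lines()` only yields once it has seen a newline, and the token deltas yielded above contain none, so the client could look non-streaming even if the server streams fine. A pure-Python simulation of the two reading styles, with hypothetical chunk data:

```python
# Simulated streamed chunks, as a server might send them (hypothetical data).
def token_stream():
    for chunk in [b"The ", b"Eiffel ", b"Tower ", b"is ", b"330m ", b"tall.\n"]:
        yield chunk

def iter_lines_like(chunks):
    """Buffer chunks and yield only complete newline-terminated lines,
    mirroring how requests' iter_lines behaves."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:
        yield pending

def iter_content_like(chunks):
    """Yield every chunk as soon as it arrives, like iter_content."""
    yield from chunks

# Line-buffered reading collapses the six chunks into one late line,
# while chunk-by-chunk reading sees all six as they arrive.
lines = list(iter_lines_like(token_stream()))
parts = list(iter_content_like(token_stream()))
```

So switching the test client to `response.iter_content(chunk_size=None)` might be worth trying, but I'd still like to understand where the buffering happens.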
