I would like to know how the context window works in RAG. For example: GPT-3 has a 2048-token window, GPT-4 has 8192 to 32768 tokens.
In the GPT-3 case, does that mean we get a window of 2048 tokens forward and 2048 tokens backward over the documents? Does it mean the model can only retrieve within that window?
In short, I'm looking for an explanation of how the context window works in an LLM.
The input part of the context window determines how much you can send to the model: typically the instruction and question in your prompt, plus any additional context you add.
The "Retrieval" part of RAG happens outside of the LLM, so is unaffected by the LLM context window (though if you use an embedding model, it likely has it's own input size limit). After retrieving content (e.g. from your database via a vector similarity search), the relevant document chunks are then added into the prompt that is passed to the LLM (this is the input part of the context window).
For a QA task, the assembled prompt might look something like this (a typical template; the exact wording varies by library):
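```
Answer the question using only the context provided.

Context: <retrieved chunk 1>
Context: <retrieved chunk 2>
Context: <retrieved chunk 3>

Question: <the user's question>
Answer:
```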
Here the input context is consumed by:

- the instruction text at the top of the prompt,
- each retrieved "Context" chunk, and
- the question itself.
Depending on your library, you can usually control things like the chunk size for your docs/nodes (this is the size of the "Context" blocks) and the number of context nodes you supply. For smaller windows you can use smaller and fewer context nodes, but this means you are providing less information. You likely also have control over the instruction part of the prompt - you could make that bigger or smaller, but it's typically consistent across invocations.
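As a rough illustration of how these knobs trade off against a fixed window (the numbers below are invented, not taken from any particular model or library):

```python
# How the pieces of a RAG prompt consume a fixed context window.
CONTEXT_WINDOW = 2048       # e.g. GPT-3
RESERVED_FOR_OUTPUT = 256   # leave room for the generated answer

instruction_tokens = 50     # the fixed instruction part of the prompt
question_tokens = 30        # the user's question
chunk_size = 512            # tokens per retrieved "Context" block
num_chunks = 3              # how many chunks you retrieve

input_tokens = instruction_tokens + question_tokens + chunk_size * num_chunks

if input_tokens + RESERVED_FOR_OUTPUT > CONTEXT_WINDOW:
    print("Prompt won't fit: use smaller or fewer context chunks")
else:
    print(f"Input uses {input_tokens} tokens, "
          f"{CONTEXT_WINDOW - input_tokens} left for the response")
```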
Output context is purely the generated response, and so the full "context window" is the sum of these two - input + output.
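For example, with an 8192-token window, if you allow up to 1024 tokens for the response, everything in the input (instruction, retrieved context chunks, and the question) has to fit in the remaining 7168 tokens.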