Context:
I have an XML file (DEXPI format) that I want to use as the data source for a Retrieval-Augmented Generation (RAG) system built with llama-index, so that I can fetch the correct context for any natural-language query.
Current Issue:
- I cannot use the XML file like a plain text document: llama-index does not provide any splitter for XML data, so the XML cannot be correctly divided into chunks (nodes).
- Even if I write a custom chunker/splitter, the chunks would still contain a lot of unwanted jargon, such as XML tags and other XML-related metadata.
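One possible starting point for that custom splitter (a minimal sketch using only the standard library, not a llama-index API; the DEXPI-like sample element and attribute names below are invented for illustration) is to walk the XML tree and emit one plain-text chunk per element, keeping tag names, attributes, and text but dropping all angle-bracket markup:

```python
import xml.etree.ElementTree as ET

def xml_to_chunks(xml_string: str) -> list[str]:
    """Flatten each element into a 'Tag attr=value ...' line with no XML syntax."""
    root = ET.fromstring(xml_string)
    chunks = []
    for elem in root.iter():  # depth-first walk over every element
        parts = [elem.tag]
        for name, value in elem.attrib.items():
            parts.append(f"{name}={value}")
        if elem.text and elem.text.strip():
            parts.append(elem.text.strip())
        chunks.append(" ".join(parts))
    return chunks

# Invented DEXPI-style snippet, only to show the shape of the output
sample = '<Plant><Equipment ID="P-101" ComponentClass="Pump"/></Plant>'
print(xml_to_chunks(sample))
# → ['Plant', 'Equipment ID=P-101 ComponentClass=Pump']
```

Each returned string could then be wrapped in a llama-index `Document`/`TextNode`, which avoids the tag-jargon problem at the cost of losing the nesting structure.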
What did I try?
To solve this issue I have 2 approaches:
Approach 1:
Convert the XML into SQL tables (or CSVs). Convert these tables into natural-language English text. Then pass this text to llama-index for further processing. While building the knowledge graph index, llama-index will automatically figure out the vertices (entities) and the edges (relationships) between them.
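The "tables to English" step of this approach could look like the sketch below, where each CSV/SQL row becomes one natural-language sentence that llama-index can then chunk and index like ordinary text (the column names and values are invented, not part of the DEXPI schema):

```python
import csv
import io

def rows_to_sentences(csv_text: str) -> list[str]:
    """Turn each CSV row into one English sentence, one per entity."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sentences = []
    for row in reader:
        tag = row.pop("TagName")  # assumed key column; adapt to your schema
        attrs = ", ".join(f"{k} is {v}" for k, v in row.items())
        sentences.append(f"{tag}: {attrs}.")
    return sentences

sample_csv = "TagName,ComponentClass,NominalDiameter\nP-101,Pump,50 mm\n"
print(rows_to_sentences(sample_csv))
# → ['P-101: ComponentClass is Pump, NominalDiameter is 50 mm.']

# The sentences could then be fed to llama-index, e.g. (untested sketch):
# from llama_index.core import Document, KnowledgeGraphIndex
# docs = [Document(text=s) for s in rows_to_sentences(sample_csv)]
# index = KnowledgeGraphIndex.from_documents(docs)
```

Keeping one entity per sentence makes it easier for the triple-extraction step of the knowledge graph index to find subject/predicate/object patterns.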
Approach 2:
Convert the XML into SQL tables (or CSVs). Convert these SQL tables into graph DB entities and relationships manually. Then query the graph DB with a graph query generated by an LLM.
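The manual conversion step of this approach could be sketched as follows: each table row is turned into (subject, predicate, object) triples that a graph DB or llama-index's graph store could ingest (the schema, column names, and the Cypher fragment in the comment are all invented for illustration):

```python
def rows_to_triples(rows: list[dict]) -> list[tuple[str, str, str]]:
    """Emit one (subject, predicate, object) triple per non-key column."""
    triples = []
    for row in rows:
        subject = row["TagName"]  # assumed key column; adapt to your schema
        for key, value in row.items():
            if key != "TagName":
                triples.append((subject, key, str(value)))
    return triples

rows = [{"TagName": "P-101", "ComponentClass": "Pump", "ConnectedTo": "V-201"}]
print(rows_to_triples(rows))
# → [('P-101', 'ComponentClass', 'Pump'), ('P-101', 'ConnectedTo', 'V-201')]

# A relationship triple might map to a (simplified, hypothetical) Cypher
# statement for a graph DB such as Neo4j:
#   MERGE (a:Item {name: "P-101"})
#   MERGE (b:Item {name: "V-201"})
#   MERGE (a)-[:ConnectedTo]->(b)
```

The trade-off versus Approach 1 is that you control the graph schema exactly, but you also have to maintain the XML-to-graph mapping yourself.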
My Questions:
- Which approach should I choose, and how effective is each one?
- Are there better ways to handle XML data when using llama-index?