Text splitter output is not JSON serializable


Summary

I'm trying to extract text from a PDF using PDFMiner, split it into chunks, and then embed the chunks with a model from Hugging Face. The problem is that the list returned by RecursiveCharacterTextSplitter() contains Document objects, which requests.post() cannot serialize to JSON.

The code fails when querying the Hugging Face model, returning this error message:

TypeError: Object of type Document is not JSON serializable

Note: the full traceback is at the end of this question
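For reference, the same TypeError can be reproduced with the standard json module alone, since requests serializes the json= payload with json.dumps under the hood. (This uses a minimal stand-in class, not langchain's actual Document, purely to illustrate the mechanism.)

```python
import json

# Minimal stand-in for langchain's Document class (illustration only)
class Document:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

try:
    # json.dumps has no idea how to encode an arbitrary object
    json.dumps({"inputs": [Document("some chunk", {})]})
except TypeError as e:
    print(e)  # -> Object of type Document is not JSON serializable
```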

I don't know how to convert the data I receive from RecursiveCharacterTextSplitter() into a JSON-serializable object.

Source code:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import PDFMinerLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

import requests

import pandas as pd

# Extract text from pdf file

output_string = StringIO()
with open('info.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

# Split text into chunks via text_splitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([output_string.getvalue()])

# model_id: embedding model we use on Hugging Face
# hf_token: Hugging Face token used to authenticate against the Inference API

model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "hf..."

# Build request header

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

# Issue query for the model to embed our different text chunks

def query(texts):
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options":{"wait_for_model":True}})
    return response.json()

output = query(texts)
 

The full traceback at runtime is the following:

Traceback (most recent call last):
  File "troubleshoot.py", line 59, in <module>
    output = query(texts)
  File "troubleshoot.py", line 56, in query
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options":{"wait_for_model":True}})
  File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
  File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/sessions.py", line 486, in prepare_request
    p.prepare(
  File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/models.py", line 371, in prepare
    self.prepare_body(data, files, json)
  File "/Users/work/Library/Python/3.8/lib/python/site-packages/requests/models.py", line 511, in prepare_body
    body = complexjson.dumps(json, allow_nan=False)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Document is not JSON serializable

The list I'm trying to JSON-serialize looks like this:

output of print(type(texts)): <class 'list'>

output of print(texts): [Document(page_content='There is a land in the middle of the Pacific Ocean, it’s called AmazingLand.', metadata={}), Document(page_content='The population of it is about 1.4 million and it’s 72% inhabited by Amazings. The other 28%', metadata={}), Document(page_content='consists of hungarians, germans and mongoloids.', metadata={}), Document(page_content='The country is a monarchy, and it is ruled by the Big Amazing King. The Big Amazing King is', metadata={}), Document(page_content='someone who can rap classical music, and it is the best doing it among the Amazing population', metadata={}), Document(page_content='of the AmazingLand.', metadata={})]
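Judging from that printed repr, each Document carries its text in a page_content attribute, so a plain list of strings can be pulled out with a list comprehension, and a list of strings does serialize. This is only a sketch built on a stand-in type that mimics the repr above, since I haven't verified it against the actual API call:

```python
import json
from collections import namedtuple

# Stand-in with the same shape as the printed repr above (illustration only)
Document = namedtuple("Document", ["page_content", "metadata"])

texts = [
    Document(page_content="There is a land in the middle of the Pacific Ocean, it's called AmazingLand.", metadata={}),
    Document(page_content="The population of it is about 1.4 million and it's 72% inhabited by Amazings. The other 28%", metadata={}),
]

# Extract the raw text of each chunk; a list of str is JSON serializable
payload = [doc.page_content for doc in texts]
print(json.dumps({"inputs": payload, "options": {"wait_for_model": True}}))
```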

Question

How do I convert the data I receive from RecursiveCharacterTextSplitter() into a JSON-serializable object?
