Python Reading URL: ChunkedEncodingError


I'm using the Python requests library to open a URL that points to a Word document. In a browser, visiting the URL automatically triggers a download, and the document downloads successfully.

However, using requests, I'm getting a ChunkedEncodingError.

My code:

import requests
url = 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
res = requests.get(url) 
print(res)

The error:

    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(16834 bytes read, 87102 more expected)', IncompleteRead(16834 bytes read, 87102 more expected))

I've also tried other libraries, such as aiohttp and urllib3, but they raise errors as well.

Retrying the request doesn't work, as I am getting an error each time.

If someone could help, that would be great! Some other posts suggest it could be a server-side problem, but the URL works fine in a browser, and the more technical details are beyond me.

1 Answer

Answered by AKX:

This certainly is a server-side problem – it occurs even when using wget, though wget (and your browser) is smart enough to retry from the failing byte:

wget -vvv 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
--2024-01-24 16:05:40--  https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Resolving legalref.judiciary.hk (legalref.judiciary.hk)... 118.143.43.114
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103936 (102K) [application/msword]
Saving to: ‘CACC000213A_2008.doc’

CACC000213A_2008.doc                                   14%[=================>                                                                                                       ]  15,12K  --.-KB/s    in 0s

2024-01-24 16:05:42 (31,4 MB/s) - Read error at byte 15486/103936 (Connection reset by peer). Retrying.

--2024-01-24 16:05:43--  (try: 2)  https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 103936 (102K), 88450 (86K) remaining [application/msword]
Saving to: ‘CACC000213A_2008.doc’

CACC000213A_2008.doc                                  100%[++++++++++++++++++======================================================================================================>] 101,50K  25,8KB/s    in 3,3s

2024-01-24 16:05:50 (25,8 KB/s) - ‘CACC000213A_2008.doc’ saved [103936/103936]

You can implement similar logic yourself: call requests.get(..., stream=True), record the Content-Length of the first response, and compare it to the number of bytes you've managed to read. If an exception occurs and you've read fewer bytes than expected, retry with a Range: bytes={start_byte}- style header:

import requests


def download_with_resume(sess: requests.Session, url: str) -> bytes:
    """Download `url`, resuming with Range requests if the connection drops."""
    data = b""
    expected_length = None
    for _attempt in range(10):
        if len(data) == expected_length:
            break  # All bytes received
        if len(data):
            # Resume: ask only for the bytes we don't have yet
            headers = {"Range": f"bytes={len(data)}-"}
            expected_status = 206  # Partial Content
        else:
            headers = {}
            expected_status = 200
        print(f"{url}: got {len(data)} / {expected_length} bytes...")
        resp = sess.get(url, stream=True, headers=headers)
        resp.raise_for_status()
        if resp.status_code != expected_status:
            # E.g. the server ignored our Range header and sent 200 again
            raise ValueError(f"Unexpected status code: {resp.status_code}")
        if expected_length is None:  # Only set this from the first response
            content_length = resp.headers.get("Content-Length")
            if not content_length:
                raise ValueError("Content-Length header not found")
            expected_length = int(content_length)

        try:
            for chunk in resp.iter_content(chunk_size=8192):
                data += chunk
        except requests.exceptions.ChunkedEncodingError:
            pass  # Connection dropped mid-body; loop again and resume

    if len(data) != expected_length:
        raise ValueError(f"Expected {expected_length} bytes, got {len(data)}")

    return data


with requests.Session() as sess:
    data = download_with_resume(
        sess,
        url="https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc",
    )
    print("=>", len(data))
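As a cross-check on the resume logic above, a 206 response usually carries a Content-Range header of the form `bytes start-end/total`. A small hypothetical helper (not part of the original answer) can parse it and report how many bytes remain after a partial response; note the first example reproduces the 87102-bytes-expected figure from the question's error message:

```python
import re


def remaining_from_content_range(header: str) -> int:
    """Parse a 'bytes start-end/total' Content-Range value and
    return how many bytes remain after this partial response."""
    m = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", header)
    if not m:
        raise ValueError(f"Unparseable Content-Range: {header!r}")
    start, end, total = map(int, m.groups())
    # 'end' is the index of the last byte sent, so end + 1 bytes are in hand
    return total - (end + 1)


# 16834 bytes delivered out of 103936 → 87102 remaining
print(remaining_from_content_range("bytes 0-16833/103936"))
# Final byte of the file delivered → 0 remaining
print(remaining_from_content_range("bytes 15486-103935/103936"))
```

Comparing this value against `expected_length - len(data)` after each partial response would catch a server that reports a different total on resume.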