What does the Git Smart HTTP(S) protocol fully look like in all its glory?


I'm trying to implement a web server that simulates a Git remote. Users should be able to clone or pull from my server, edit files, commit, and push (with authentication), i.e. the normal things to do with Git. However, there is no bare Git repository or anything similar on the server side; data is stored in other formats and only converted when requested.

I've spent a lot of time trying to find out how the Git Smart HTTP protocol works, and here's what I know so far.

From the Git docs on http-protocol, I know that GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.1 should elicit the following (example) response:

HTTP/1.1 200 OK<CRLF>
Content-Type: application/x-git-upload-pack-advertisement<CRLF>
Cache-Control: no-cache<CRLF>
<CRLF>
001e# service=git-upload-pack<LF>
0000<no LF>
004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint<NUL>multi_ack<LF>
003fd049f6c27a2244e12041955e262a404c7faba355 refs/heads/master<LF>
003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0<LF>
003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}<LF>
0000
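
For reference, here is a minimal Python sketch of how a response body like the one above could be parsed. The function names (read_pkt_lines, parse_advertisement) are made up for illustration, and the sample bytes are the documentation example with the <LF> and <NUL> markers replaced by real bytes; it assumes protocol v0 and ignores anything a real client would also need to handle.

```python
def read_pkt_lines(data: bytes):
    """Yield each pkt-line payload; a flush packet (0000) is yielded as None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:  # flush-pkt carries no payload
            yield None
            pos += 4
            continue
        if length < 4:
            raise ValueError(f"special packet {length:04x} not handled in this sketch")
        yield data[pos + 4:pos + length]
        pos += length


def parse_advertisement(body: bytes):
    """Return (refs, capabilities) from a smart info/refs response body."""
    packets = list(read_pkt_lines(body))
    # A smart response starts with "# service=<name>" followed by a flush.
    if len(packets) < 2 or packets[0] is None or packets[1] is not None:
        raise ValueError("not a smart HTTP advertisement")
    if not packets[0].startswith(b"# service="):
        raise ValueError("missing service announcement")
    refs, capabilities = {}, []
    for payload in packets[2:]:
        if payload is None:  # trailing flush ends the ref list
            break
        line = payload.rstrip(b"\n")
        if b"\x00" in line:  # only the first ref line carries capabilities
            line, caps = line.split(b"\x00", 1)
            capabilities = caps.decode().split()
        oid, name = line.split(b" ", 1)
        refs[name.decode()] = oid.decode()
    return refs, capabilities


# The documentation example, as raw bytes.
SAMPLE = (
    b"001e# service=git-upload-pack\n"
    b"0000"
    b"004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\x00multi_ack\n"
    b"003fd049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n"
    b"003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n"
    b"003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n"
    b"0000"
)

refs, capabilities = parse_advertisement(SAMPLE)
```

Here refs maps each ref name to its object ID and capabilities holds whatever followed the NUL on the first ref line.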

From my own experimentation with a repo of mine with very few commits, it seems GitHub is so far entirely within the limits of the protocol as described in the docs:

HTTP/1.1 200 OK<CRLF>
Server: GitHub Babel 2.0<CRLF>
Content-Type: application/x-git-upload-pack-advertisement<CRLF>
Content-Security-Policy: default-src 'none'; sandbox<CRLF>
Transfer-Encoding: chunked<CRLF>
expires: Fri, 01 Jan 1980 00:00:00 GMT<CRLF>
pragma: no-cache<CRLF>
Cache-Control: no-cache, max-age=0, must-revalidate<CRLF>
Vary: Accept-Encoding<CRLF>
X-Frame-Options: DENY<CRLF>
X-GitHub-Request-Id: [redacted]<CRLF>
<CRLF>
001e# service=git-upload-pack<LF>
0000<no LF>0156feee8d0aeff172f5b39e3175175d027f3fd5ecc1 HEAD<NUL>multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want no-done symref=HEAD:refs/heads/master filter object-format=sha1 agent=git/github-g69d6dd5d35d8<LF>
003ffeee8d0aeff172f5b39e3175175d027f3fd5ecc1 refs/heads/master<LF>
0000
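
The capability list after the NUL is a space-separated string in which some entries are bare flags and others are key=value pairs. A rough Python sketch of splitting it up follows; parse_capabilities is a hypothetical helper, and the input is a shortened copy of the GitHub string above.

```python
def parse_capabilities(cap_string: str) -> dict:
    """Map bare flags to True and key=value entries to a list of values.

    Values are kept in a list because a key such as symref may appear
    more than once in a single advertisement.
    """
    caps = {}
    for token in cap_string.split(" "):
        if not token:
            continue
        key, sep, value = token.partition("=")
        if sep:
            caps.setdefault(key, []).append(value)
        else:
            caps[key] = True
    return caps


# Shortened from the GitHub response shown above.
caps = parse_capabilities(
    "multi_ack thin-pack side-band-64k "
    "symref=HEAD:refs/heads/master object-format=sha1 "
    "agent=git/github-g69d6dd5d35d8"
)
```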

However, this is where the easy part ends. What if I want to actually get that commit data? The Git docs on the matter give an example of the POST request to send and some grammar, and then say "TODO: Document this further".

I tried experimenting by using curl against GitHub with the format I see in the docs.

(cwd)>curl https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack -o - -i -X POST -d @-
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0032have 941ea62275547bcbfb78fd97d29be18d09a78190
0009done
0000
^Z
HTTP/1.1 200 OK
Server: GitHub Babel 2.0
Content-Type: application/x-git-upload-pack-result
Content-Security-Policy: default-src 'none'; sandbox
Transfer-Encoding: chunked
expires: Fri, 01 Jan 1980 00:00:00 GMT
pragma: no-cache
Cache-Control: no-cache, max-age=0, must-revalidate
Vary: Accept-Encoding
X-GitHub-Request-Id: [redacted]
X-Frame-Options: DENY

curl: (18) transfer closed with outstanding read data remaining

What?

I tried using Python:

>>> import requests
>>> requests.post('https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack', data=b'''
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0032have 941ea62275547bcbfb78fd97d29be18d09a78190
0009done
0000
'''.strip())
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 572, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 331, in _error_catcher
    yield
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 637, in read_chunked
    self._update_chunk_length()
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 576, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 461, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 665, in read_chunked
    self._original_response.close()
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 349, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    requests.post('https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack', data=b'0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n0032have 941ea62275547bcbfb78fd97d29be18d09a78190\n0009done\n0000')
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 685, in send
    r.content
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 829, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

The rest of the http-protocol docs don't help - another six TODOs appear. The pack-protocol docs at least give me an idea of what I'm supposed to be receiving, but no indication of how.

The Transfer Protocols docs tell me nothing new, and then say "take a look at the Git source code". I tried that, but it's hardcore C and I'd have to understand basically the entire infrastructure of Git itself. (I may yet attempt to do that, but now is not the time.)

I did manage to glean that git upload-pack is involved, and running git upload-pack --stateless-rpc --advertise-refs .git did give me the /info/refs list like before. However, attempts to get an actual pack out of it failed, and not only did they fail, they failed inconsistently between platforms.

On Windows:

(cwd)>git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0009done # I hit Enter and nothing else
fatal: protocol error: bad line length character:
000

(cwd)>git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0000 # likewise
fatal: protocol error: bad line length character:
000

(cwd)>py -c "print('0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n0009done\n0000')" | git upload-pack --stateless-rpc .git
fatal: protocol error: bad line length character:
000

Suspecting it was carriage returns causing problems, I tried WSL:

$ git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0000 # I hit Enter and then ^D after 0000
fatal: The remote end hung up unexpectedly

$ git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0009done # I hit Enter and did NOT hit ^D
fatal: git upload-pack: protocol error, expected to get sha, not 'done'

$ # using Python to pipe each of the above inputs yielded the same results

What am I doing wrong? How can I get GitHub/git-upload-pack to respect me?


There are 2 answers

Answer by bk2204 (best answer)

First of all, it isn't possible to explain the entire protocol in a Stack Overflow answer; the explanation would be too long. However, I'll try to point out a few things to note.

First, when you speak the protocol, you need to be pretty exact; this is not a case where line ending differences and extra bytes will be tolerated. As such, if you're synthesizing data to pass to the remote, it should be done with printf(1) or a programming language. Don't type things at the shell.
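
As one way of doing that from a programming language, here is a small Python sketch that pipes exact bytes into git upload-pack. The helper name run_upload_pack is made up, and the object ID is the one from the question. Passing bytes as input avoids any newline translation, which print() on Windows would otherwise apply.

```python
import subprocess

# Exact request bytes: one want line, a flush packet, then "done".
# Nothing is typed by hand, so no stray CR or trailing newline sneaks in.
REQUEST = (
    b"0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n"
    b"0000"
    b"0009done\n"
)


def run_upload_pack(repo_path: str, request: bytes = REQUEST) -> bytes:
    """Pipe raw bytes to git upload-pack and return its raw stdout."""
    result = subprocess.run(
        ["git", "upload-pack", "--stateless-rpc", repo_path],
        input=request,           # bytes are written to stdin unmodified
        stdout=subprocess.PIPE,  # the response is binary, so keep it as bytes
        check=True,              # raise if git exits with an error
    )
    return result.stdout
```

Calling run_upload_pack(".git") inside a repository that contains that commit should return the same bytes a smart HTTP server would send as the POST response body.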

Git uses the pkt-line format, which means that every line or chunk of data is prefixed with four hex characters giving the length of the data plus the four-character prefix itself. If the sequence is 0000, that's a flush packet and it indicates the end of that chunk of data. If the sequence is 0001, that's a delimiter packet and it's used in protocol v2 to delimit sections of that chunk of data. Otherwise, the value must not exceed 65520 (the 4-byte prefix plus at most 65516 bytes of payload), which is the limit given in the protocol-common documentation.
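
To make the framing concrete, here is a minimal pkt-line encoder and decoder in Python. The names (pkt_line, iter_pkt_lines, FLUSH, DELIM) are illustrative, and the sketch does no length-limit checking.

```python
FLUSH = object()  # stands for the flush packet, 0000
DELIM = object()  # stands for the protocol v2 delimiter packet, 0001


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as four hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def iter_pkt_lines(stream: bytes):
    """Yield payloads in order, with FLUSH and DELIM for the special packets."""
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:
            yield FLUSH
            pos += 4
        elif length == 1:
            yield DELIM
            pos += 4
        elif length < 4:
            # 0002 and 0003 cannot be data packets; not handled in this sketch.
            raise ValueError(f"unsupported packet length {length:04x}")
        else:
            yield stream[pos + 4:pos + length]
            pos += length


encoded = pkt_line(b"want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n")
decoded = list(iter_pkt_lines(encoded + b"0000" + pkt_line(b"done\n")))
```

Note that the length counts the trailing newline, which is why the want line in the question is 0032 and done is 0009.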

In your situation where you're sending want and have lines, you're expected to do multiple iterations until the server sends you a pack. In HTTP, that's multiple requests. The server will send you acknowledgements for the have arguments you've specified. The server expects to find a path from each want directive to an object both sides have (or else, that the client has nothing, in which case the repository is empty).
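
A sketch of building the body for one such request in Python follows. As I read the grammar in the http-protocol document, a request is the want list, a flush, any have lines, and then either a flush (more rounds to come) or done (final round), not both. The function name build_upload_pack_request and its parameters are hypothetical; the object IDs are the ones from the question.

```python
def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as four hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def build_upload_pack_request(wants, haves=(), capabilities=(), done=False) -> bytes:
    """Build one v0 git-upload-pack request body.

    done=False ends the request with a flush (the client expects ACK/NAK and
    will send another request); done=True ends it with "done" (the client
    expects the pack in this response).
    """
    if not wants:
        raise ValueError("at least one want is required")
    parts = []
    for index, oid in enumerate(wants):
        line = "want " + oid
        if index == 0 and capabilities:
            # Requested capabilities ride on the first want line only.
            line += " " + " ".join(capabilities)
        parts.append(pkt_line(line.encode() + b"\n"))
    parts.append(b"0000")  # flush terminates the want list
    for oid in haves:
        parts.append(pkt_line(f"have {oid}\n".encode()))
    parts.append(pkt_line(b"done\n") if done else b"0000")
    return b"".join(parts)


WANT = "feee8d0aeff172f5b39e3175175d027f3fd5ecc1"
HAVE = "941ea62275547bcbfb78fd97d29be18d09a78190"

clone_request = build_upload_pack_request([WANT], done=True)
final_round = build_upload_pack_request([WANT], [HAVE], done=True)
probe_round = build_upload_pack_request([WANT], [HAVE])
```

In this sketch probe_round would be sent first to collect acknowledgements, and final_round is what would be sent once the client has decided to stop negotiating.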

Be aware that this task is actually quite involved. There's now a v2 of the protocol (the old one was v0, and there's a v1, which is the same but with a version header) for fetches. You should also expect to be able to support SHA-256 repositories, which don't currently interoperate with SHA-1 repositories, but are otherwise supported. And Git also provides a large number of extensions which you will practically want to support, like the sideband functionality, which is required if you want to provide output to the user about what your side is doing.
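
To illustrate the sideband part: once side-band or side-band-64k has been negotiated, the pack is sent inside pkt-lines whose first payload byte is a band number (1 for pack data, 2 for progress messages, 3 for a fatal error). Here is a rough Python sketch of demultiplexing such a stream; demux_sideband is a made-up name, and it assumes the input starts after the ACK/NAK lines.

```python
def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as four hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def demux_sideband(stream: bytes):
    """Split side-band pkt-lines into (pack_bytes, progress_text)."""
    pack, progress = bytearray(), bytearray()
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:  # flush packet ends the multiplexed stream
            break
        if length < 5:
            raise ValueError("side-band packet without a band byte")
        band = stream[pos + 4]
        data = stream[pos + 5:pos + length]
        if band == 1:
            pack += data      # raw pack data
        elif band == 2:
            progress += data  # human-readable progress for the user
        elif band == 3:
            raise RuntimeError("remote error: " + data.decode(errors="replace"))
        else:
            raise ValueError(f"unknown band {band}")
        pos += length
    return bytes(pack), progress.decode(errors="replace")


# Hand-built example stream: one progress message, pack data in two pieces.
example = (
    pkt_line(b"\x02Counting objects: 16, done.\n")
    + pkt_line(b"\x01PACK")
    + pkt_line(b"\x01\x00\x00\x00\x02")
    + b"0000"
)
pack_bytes, progress_text = demux_sideband(example)
```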

The documentation mostly lives in Documentation/technical in the Git repository. It is incomplete in some places, but you should mostly be able to discern it with some reading and testing.

Answer by AbyxDev

Okay, after some more experimentation I happened upon the right combination, if you will, by random chance.

$ git upload-pack --stateless-rpc .git > tmp.pack
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
00000009done # Enter with NO ^D
Counting objects: 16, done.
Compressing objects: 100% (14/14), done.
Total 16 (delta 3), reused 0 (delta 0)
$ hd tmp.pack
00000000  30 30 30 38 4e 41 4b 0a  50 41 43 4b 00 00 00 02  |0008NAK.PACK....|
00000010  00 00 00 10 94 2f 78 9c  a5 92 4f 6f db 30 0c c5  |...../x...Oo.0..|
...
>>> import requests
>>> # omitting the trailing \n results in a 200 OK blank response
>>> r = requests.post('https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack', data=b'0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n00000009done\n')
>>> r.text[:20]
'0008NAK\nPACK\x00\x00\x00\x02\x00\x00\x00\x10'
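
For what it's worth, those first bytes can be picked apart like this. It is a sketch that assumes no side-band capability was requested, so the pack follows the NAK line unframed; the function names are made up. The pack header is the 4-byte PACK signature, a 4-byte big-endian version, and a 4-byte big-endian object count. With requests, r.content gives the raw bytes, which is safer than r.text for binary data.

```python
import struct


def split_upload_pack_response(body: bytes):
    """Return (acknowledgement lines, raw pack bytes) for a response without side-band."""
    acks = []
    pos = 0
    # ACK/NAK lines are pkt-lines; the pack itself starts at the PACK signature,
    # which can never be mistaken for a four-hex-digit length prefix.
    while pos < len(body) and not body.startswith(b"PACK", pos):
        length = int(body[pos:pos + 4], 16)
        if length == 0:  # skip a flush packet if one appears
            pos += 4
            continue
        if length < 4:
            raise ValueError(f"unsupported packet length {length:04x}")
        acks.append(body[pos + 4:pos + length].rstrip(b"\n").decode())
        pos += length
    return acks, body[pos:]


def parse_pack_header(pack: bytes):
    """Return (version, object_count) from the 12-byte pack header."""
    magic, version, count = struct.unpack(">4sII", pack[:12])
    if magic != b"PACK":
        raise ValueError("missing PACK signature")
    return version, count


# The first 20 bytes of the response shown above.
head = b"0008NAK\nPACK\x00\x00\x00\x02\x00\x00\x00\x10"
acks, pack = split_upload_pack_response(head)
version, object_count = parse_pack_header(pack)
```

The object count of 16 lines up with the "Total 16" line printed by git upload-pack above.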

However, this only offers me control over which commits I want. If I try to specify which commits I have (like I should be able to), I only get ACKs for my haves:

>>> print(requests.post('https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack', data=b'''
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
00000032have 941ea62275547bcbfb78fd97d29be18d09a78190
0032have 93dbc9cfb21d23c6eb5313419bfaa8213619c73c
0032have 648508d6359b3e8992ee5a6d9fee6f86110202fd
00000009done
'''.lstrip()).text)
0031ACK 941ea62275547bcbfb78fd97d29be18d09a78190
0031ACK 93dbc9cfb21d23c6eb5313419bfaa8213619c73c
0031ACK 648508d6359b3e8992ee5a6d9fee6f86110202fd

(Same deal if I try with git upload-pack.) How do I properly handle the rest of the whole process? Once more, I'm aiming to simulate a(n essentially) complete git remote.