I'm trying to implement a webserver that simulates a Git remote. Users should be able to clone or pull from my server, edit files, commit, and push (with authentication): the normal things you do with Git. However, the server side is not backed by a bare Git repository or anything similar; data is stored in other formats and only converted when requested.
I've spent a lot of time trying to find out how the Git Smart HTTP protocol works, and here's what I know so far.
From the Git docs on http-protocol, I know that GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.1 should elicit the following (example) response:
HTTP/1.1 200 OK<CRLF>
Content-Type: application/x-git-upload-pack-advertisement<CRLF>
Cache-Control: no-cache<CRLF>
<CRLF>
001e# service=git-upload-pack<LF>
0000<no LF>
004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint<NUL>multi_ack<LF>
003fd049f6c27a2244e12041955e262a404c7faba355 refs/heads/master<LF>
003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0<LF>
003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}<LF>
0000
From my own experimentation with a repo of mine with very few commits, it seems GitHub is so far entirely within the limits of the protocol as described in the docs:
HTTP/1.1 200 OK<CRLF>
Server: GitHub Babel 2.0<CRLF>
Content-Type: application/x-git-upload-pack-advertisement<CRLF>
Content-Security-Policy: default-src 'none'; sandbox<CRLF>
Transfer-Encoding: chunked<CRLF>
expires: Fri, 01 Jan 1980 00:00:00 GMT<CRLF>
pragma: no-cache<CRLF>
Cache-Control: no-cache, max-age=0, must-revalidate<CRLF>
Vary: Accept-Encoding<CRLF>
X-Frame-Options: DENY<CRLF>
X-GitHub-Request-Id: [redacted]<CRLF>
<CRLF>
001e# service=git-upload-pack<LF>
0000<no LF>0156feee8d0aeff172f5b39e3175175d027f3fd5ecc1 HEAD<NUL>multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want no-done symref=HEAD:refs/heads/master filter object-format=sha1 agent=git/github-g69d6dd5d35d8<LF>
003ffeee8d0aeff172f5b39e3175175d027f3fd5ecc1 refs/heads/master<LF>
0000
However, this is where the easy part ends. What if I want to actually get that commit data? The Git docs on the matter give an example of the POST request to send, and some grammar, and then say "TODO: Document this further". ????????
I tried experimenting by CURLing GitHub in the format I see in the docs.
(cwd)>curl https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack -o - -i -X POST -d @-
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0032have 941ea62275547bcbfb78fd97d29be18d09a78190
0009done
0000
^Z
HTTP/1.1 200 OK
Server: GitHub Babel 2.0
Content-Type: application/x-git-upload-pack-result
Content-Security-Policy: default-src 'none'; sandbox
Transfer-Encoding: chunked
expires: Fri, 01 Jan 1980 00:00:00 GMT
pragma: no-cache
Cache-Control: no-cache, max-age=0, must-revalidate
Vary: Accept-Encoding
X-GitHub-Request-Id: [redacted]
X-Frame-Options: DENY
curl: (18) transfer closed with outstanding read data remaining
What?
I tried using Python:
>>> import requests
>>> requests.post('https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack', data=b'''
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0032have 941ea62275547bcbfb78fd97d29be18d09a78190
0009done
0000
'''.strip())
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 572, in _update_chunk_length
self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 331, in _error_catcher
yield
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 637, in read_chunked
self._update_chunk_length()
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 576, in _update_chunk_length
raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 751, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 461, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 665, in read_chunked
self._original_response.close()
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\response.py", line 349, in _error_catcher
raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
requests.post('https://github.com/Kenny2github/ConvoSplit.git/git-upload-pack', data=b'0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n0032have 941ea62275547bcbfb78fd97d29be18d09a78190\n0009done\n0000')
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 119, in post
return request('post', url, data=data, json=json, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 685, in send
r.content
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 829, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\models.py", line 754, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
The rest of the http-protocol docs don't help - another six TODOs appear. The pack-protocol docs at least give me an idea of what I'm supposed to be receiving, but no indication of how.
The Transfer Protocols docs tell me nothing new, and then say "take a look at the Git source code". I tried that, but it's hardcore C and I'd have to understand basically the entire infrastructure of Git itself. (I may yet attempt to do that, but now is not the time.)
I did manage to glean that git upload-pack is involved, and running git upload-pack --stateless-rpc --advertise-refs .git did give me the /info/refs list like before. However, attempts to get an actual pack out of it failed, and not only did they fail, they failed inconsistently between platforms.
On Windows:
(cwd)>git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0009done # I hit Enter and nothing else
fatal: protocol error: bad line length character:
000
(cwd)>git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0000 # likewise
fatal: protocol error: bad line length character:
000
(cwd)>py -c "print('0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1\n0009done\n0000')" | git upload-pack --stateless-rpc .git
fatal: protocol error: bad line length character:
000
Suspecting it was carriage returns causing problems, I tried WSL:
$ git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0000 # I hit Enter and then ^D after 0000
fatal: The remote end hung up unexpectedly
$ git upload-pack --stateless-rpc .git
0032want feee8d0aeff172f5b39e3175175d027f3fd5ecc1
0009done # I hit Enter and did NOT hit ^D
fatal: git upload-pack: protocol error, expected to get sha, not 'done'
$ # using Python to pipe each of the above inputs yielded the same results
What am I doing wrong? How can I get GitHub/git-upload-pack to respect me?
First of all, it isn't possible to explain the entire protocol in a StackOverflow answer; the explanation is too long. However, I'll try to point out a few things to note.
First, when you speak the protocol, you need to be pretty exact; this is not a case where line ending differences and extra bytes will be tolerated. As such, if you're synthesizing data to pass to the remote, it should be done with printf(1) or a programming language. Don't type things at the shell.
Git uses the pkt-line format, which means that every line or chunk of data is prefixed with a four hex-character sequence that represents the combined length of the data and the prefix itself. If the sequence is 0000, that's a flush packet and it indicates the end of that chunk of data. If the sequence is 0001, that's a delimiter packet and it's used in protocol v2 to delimit sections of that chunk of data. Otherwise, the hex sequence cannot have a value exceeding 65520 (fff0 in hex): the four-byte prefix plus at most 65516 bytes of data.
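To make the pkt-line framing concrete, here is a minimal Python sketch of an encoder and a decoder. The names pkt_line, read_pkt_lines, and DELIM are made up for illustration and are not part of Git or any library; the decoder only handles the flush and delimiter packets mentioned above and rejects anything else it doesn't recognise.

```python
FLUSH_PKT = b"0000"   # flush packet: ends a chunk of data
DELIM_PKT = b"0001"   # delimiter packet (protocol v2)
DELIM = object()      # hypothetical sentinel yielded for a delimiter packet
# Upper bound on the length value, prefix included, as stated in Git's
# protocol-common documentation (4-byte prefix + up to 65516 payload bytes).
MAX_PKT_LEN = 65520


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload; the length counts the 4-byte prefix too."""
    length = len(payload) + 4
    if length > MAX_PKT_LEN:
        raise ValueError("payload too long for a single pkt-line")
    return b"%04x" % length + payload


def read_pkt_lines(data: bytes):
    """Yield each payload in turn; None for a flush, DELIM for a delimiter."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None
            pos += 4
        elif length == 1:
            yield DELIM
            pos += 4
        elif length < 4:
            raise ValueError("unsupported special packet %04x" % length)
        else:
            yield data[pos + 4:pos + length]
            pos += length


# "done" plus a line feed is 5 bytes of payload, so the prefix is 0009.
print(pkt_line(b"done\n"))
print(list(read_pkt_lines(b"001e# service=git-upload-pack\n0000")))
```

Note that the trailing line feed is part of the payload and is counted in the length, which is why a stray carriage return or a missing newline is enough to make the other side reject the stream.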
In your situation where you're sending want and have lines, you're expected to do multiple iterations until the server sends you a pack. In HTTP, that's multiple requests. The server will send you acknowledgements for the have arguments you've specified. The server expects to find a path from each want directive to an object both sides have (or else, that the client has nothing, in which case the repository is empty).
Be aware that this task is actually quite involved. There's now a v2 of the protocol (the old one was v0, and there's a v1, which is the same but with a version header) for fetches. You should also expect to be able to support SHA-256 repositories, which don't currently interoperate with SHA-1 repositories, but are otherwise supported. And Git also provides a large number of extensions which you will practically want to support, like the sideband functionality, which is required if you want to provide output to the user about what your side is doing.
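As a rough illustration of one round of the want/have exchange in the original (v0) protocol, the sketch below builds a request body in Python. It follows my reading of the grammar in the pack-protocol document: the want lines come first (capabilities, if any, are appended to the first one), a flush packet ends the want list, the have lines follow, and the round ends with either a flush packet (more negotiation to come) or a done pkt-line. The function name upload_pack_request and its parameters are hypothetical, and this is not a complete client.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one payload; the length counts the 4-byte prefix too."""
    return b"%04x" % (len(payload) + 4) + payload


def upload_pack_request(wants, haves=(), capabilities=(), done=True) -> bytes:
    """Build a v0 upload-pack request body (hypothetical helper).

    wants and haves are hex object IDs; capabilities are appended to the
    first want line only.
    """
    if not wants:
        raise ValueError("at least one want is required")
    parts = []
    for i, want in enumerate(wants):
        line = "want " + want
        if i == 0 and capabilities:
            line += " " + " ".join(capabilities)
        parts.append(pkt_line(line.encode("ascii") + b"\n"))
    parts.append(b"0000")                      # flush: end of the want list
    for have in haves:
        parts.append(pkt_line(("have " + have + "\n").encode("ascii")))
    # "done" ends negotiation; a flush asks the server for ACKs instead.
    parts.append(pkt_line(b"done\n") if done else b"0000")
    return b"".join(parts)


body = upload_pack_request(
    wants=["feee8d0aeff172f5b39e3175175d027f3fd5ecc1"],
    haves=["941ea62275547bcbfb78fd97d29be18d09a78190"],
)
print(body)
```

Compared with the body in the question, the flush packet sits between the want and have lines rather than at the very end, and nothing follows done. Over HTTP, the http-protocol document has the client send this with Content-Type: application/x-git-upload-pack-request. For a fresh clone there are no have lines, so the body is just the wants, a flush packet, and done.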
The documentation mostly lives in Documentation/technical in the Git repository. It is incomplete in some places, but you should mostly be able to discern it with some reading and testing.