Convert Mercurial to Git with non-UTF-8 characters in file names


I have many Mercurial repositories that I am trying to convert to Git. I used fast-export (https://github.com/frej/fast-export) and everything went well, except that some of my Mercurial repositories contain files with Russian letters in their names.
These are huge repositories, with about 20k commits and many branches.

On Ubuntu, the docs directory looks like this:

docs/
|-- Android
|-- DataContracts
|-- \302\345\355\344\356\360\373
|-- \304\340\362\340\312\356\355\362\360\340\352\362\373
|-- \310\355\361\362\360\363\352\366\350\350
|-- \310\361\365\356\344\355\340\377\ \344\356\352\363\354\345\355\362\340\366\350\377
|-- \317\360\356\362\356\352\356\353\373
`-- \320\345\353\350\347\355\340\377\ \344\356\352\363\354\345\355\362\340\366\350\377

On Windows, the names look normal:

Get-ChildItem .\docs\


    Каталог: C:\temp\mercurial\Ptk\docs


Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d-----       12.01.2021     10:11                Android
d-----       12.01.2021     10:11                DataContracts
d-----       12.01.2021     10:11                Вендоры
d-----       12.01.2021     10:11                ДатаКонтракты
d-----       12.01.2021     10:11                Инструкции
d-----       12.01.2021     10:11                Исходная документация
d-----       12.01.2021     10:11                Протоколы
d-----       12.01.2021     10:11                Релизная документация
-a----       03.03.2021     12:30              0 Вендоры2
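Comparing the two listings suggests that the backslash escapes in the Ubuntu output are octal byte values of names stored in a single-byte Cyrillic code page. The following is a minimal Python sketch for checking that, assuming the encoding is cp1251 (Windows-1251). The helper name decode_escaped_name is hypothetical and exists only for this illustration.

```python
import re


def decode_escaped_name(escaped, encoding="cp1251"):
    """Turn an octal-escaped name from the listing into text, assuming `encoding`."""

    def unescape(match):
        token = match.group(1)
        # Three octal digits stand for one byte; any other escaped byte
        # (such as the escaped space) is taken literally.
        return bytes([int(token, 8)]) if len(token) == 3 else token

    raw = re.sub(rb"\\([0-7]{3}|.)", unescape, escaped.encode("ascii"))
    return raw.decode(encoding)


# Fourth Cyrillic entry of the Ubuntu listing, copied verbatim.
name = decode_escaped_name(
    r"\310\361\365\356\344\355\340\377\ \344\356\352\363\354\345\355\362\340\366\350\377"
)
# name == "Исходная документация", the same entry as in the Windows listing
```

If the decoded names match the Windows listing for every entry, that is reasonable evidence (though not proof) that the file names in the repository are cp1251 rather than UTF-8.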

Inside the docs folder I have a lot of documentation in Word, PDF, and other formats.

At first I tried to convert with this command:

~/mercurial/fast-export/hg-fast-export.sh -r ~/mercurial/Ptk -fe ISO-8859-1

but after the conversion the characters were broken.
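A likely reason the names came out broken is that ISO-8859-1 maps the same bytes to Western European letters rather than Cyrillic ones. The sketch below compares the two readings of one name. It assumes the stored bytes are cp1251, which is an inference from the directory listings and not something fast-export reports. The helper decode_with is hypothetical.

```python
# Raw bytes of the first Cyrillic directory name in the Ubuntu listing
# (\302\345\355\344\356\360\373 written in hex).
raw = b"\xc2\xe5\xed\xe4\xee\xf0\xfb"


def decode_with(raw_name, encoding):
    """Decode raw_name with the given encoding (hypothetical comparison helper)."""
    return raw_name.decode(encoding)


as_latin1 = decode_with(raw, "iso-8859-1")  # 'Âåíäîðû': Western European reading
as_cp1251 = decode_with(raw, "cp1251")      # 'Вендоры': Cyrillic reading
```

Under this assumption, a converter told that the names are ISO-8859-1 would write the first form into the Git tree as UTF-8, which would explain the broken characters.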

Next, I tried to rename all the files in my repository, following https://serverfault.com/questions/319070/mercurial-convert-filename-encoding:

import sys
for path in sys.stdin:
    old = path[:-1] # strip newline
    new = old.decode("cp1251").encode("utf-8")
    print 'rename "%s" "%s"' % (old, new)

Running

$ hg manifest --all | python rename.py > rename.txt

gives this output:

rename ".gitignore" ".gitignore"
rename ".hgignore" ".hgignore"
rename ".hglf/docs/����������/��������� ������������� ��� �� �������/files/android_root.exe" ".hglf/docs/Инструкции/Первичная инициализация ПТК из коробки/files/android_root.exe"
Traceback (most recent call last):
  File "rename.py", line 4, in <module>
    new = old.decode("cp1251").encode("utf-8")
  File "/usr/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 12: character maps to <undefined>
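One possible explanation for the traceback, offered as an assumption and not verified against the repository: byte 0x98 is unassigned in cp1251, but it is the second byte of the UTF-8 encoding of "И" (0xD0 0x98). In a path starting with ".hglf/docs/И" that byte sits at position 12, so some manifest entries may already be stored as UTF-8. Below is a Python 3 sketch of a more tolerant variant of the script. The names to_utf8 and rename_commands are hypothetical, and the "already valid UTF-8" check is a heuristic that can misjudge rare cp1251 names.

```python
import sys


def to_utf8(raw, legacy="cp1251"):
    """Return raw unchanged if it is valid UTF-8, else re-encode it from `legacy`."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Raises UnicodeDecodeError again if the bytes are not valid `legacy` either.
        return raw.decode(legacy).encode("utf-8")


def rename_commands(lines):
    """Yield one rename line (as bytes) per path whose bytes need to change."""
    for line in lines:
        old = line.rstrip(b"\n")
        try:
            new = to_utf8(old)
        except UnicodeDecodeError:
            # Neither UTF-8 nor cp1251: report it and keep going.
            print("skipped undecodable path: %r" % old, file=sys.stderr)
            continue
        if new != old:
            yield b'rename "' + old + b'" "' + new + b'"'
```

To use it the way the original script was used, iterate over rename_commands(sys.stdin.buffer) and write each result plus a newline to sys.stdout.buffer, so that the paths are handled as bytes from end to end.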

I also tried a different source encoding, cp1252.

I then ran:

file -ib docs/*

Output:

inode/directory; charset=binary
inode/directory; charset=binary
inode/directory; charset=binary
inode/directory; charset=binary
inode/directory; charset=binary
inode/directory; charset=binary
inode/directory; charset=binary
inode/directory; charset=binary
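The file utility inspects what an entry contains, not how its name is encoded, so for directories it can only report inode/directory. A rough Python sketch of checking the names themselves follows. The function names are hypothetical, and "legacy" is only a label meaning "not valid UTF-8"; the code does not detect a specific code page.

```python
import os


def guess_name_encoding(raw):
    """Label a raw file name as 'ascii', 'utf-8', or 'legacy' (some 8-bit code page)."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy"


def survey_names(directory):
    """Map each raw entry name in `directory` (given as bytes) to its label."""
    # Passing a bytes path makes os.listdir return the names as raw bytes.
    return {name: guess_name_encoding(name) for name in os.listdir(directory)}
```

For example, survey_names(b"docs") run on the Ubuntu checkout would show which entries are plain ASCII, which are already UTF-8, and which still use an 8-bit encoding.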

Next, I tried to convert with TortoiseHg (https://tortoisehg.bitbucket.io/):

hg bookmark -r default master
"C:\Program Files\TortoiseHg\hg.exe" push c:\temp\mercurial\converted-repo

After this conversion the characters were broken as well.

I don't want to delete any documentation from my repositories. It is not only documentation that has Russian characters in its names: I also have source files with Russian characters, don't ask why :)

Could you advise me on how to convert these repositories to Git?
