How can I use sed/awk or another tool to search and replace in a 12GB Subversion dump file


I've got a particular situation where I need to remove the operations of a series of commits in a Subversion repository. The entire contents of /trunk, /tags and /branches were tagged, and the tag was subsequently removed when the mistake was realized. I would simply use svndumpfilter to remove the offending nodes, but someone re-used the bad tag name at a later point, so path-based exclusions would cause other problems. Instead, I need to manually edit the dump file, which is 12GB. There is a series of 15 sequential revisions I need to edit, which appear in the dump in the following format:

Revision-number: 60338
Prop-content-length: 143
Content-length: 143

K 7
svn:log
V 41
Tagging test prior to creating xx branch
K 10
svn:author
V 7
userx
K 8
svn:date
V 27
2009-05-27T15:01:31.812916Z
PROPS-END

Node-path: test/tags/XX_8_0_FINAL
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 60337
Node-copyfrom-path: test

Based on testing I've done, I know I need the above section to change to the following:

Revision-number: 60338
Prop-content-length: 112
Content-length: 112

K 7
svn:log
V 38
This is an empty revision for padding.
K 8
svn:date
V 27
2009-05-27T15:01:31.812916Z
PROPS-END

There are 14 more revisions where the same replacement needs to take place. Trying to edit the file manually in Vim is seriously impractical, since the dump is a mixture of binary and ASCII text. If anyone has any awk/sed magic that could help me, I'd be really appreciative.
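To make the target transformation concrete, here is a toy awk sketch of the kind of range replacement I'm after (the file names are made up, the sample is tiny, and this ignores the binary-content problem entirely):

```shell
# Build a tiny stand-in for the dump (the real file is 12GB).
cat > sample.dump <<'EOF'
Revision-number: 60337
PROPS-END

Revision-number: 60338
Prop-content-length: 143
Content-length: 143
Node-copyfrom-path: test

Revision-number: 60339
PROPS-END
EOF

# The replacement section for revision 60338.
cat > newsection <<'EOF'
Revision-number: 60338
Prop-content-length: 112
Content-length: 112
EOF

# Between the revision header and its last node line, emit the
# replacement once and suppress the original lines.
awk '
  /^Revision-number: 60338$/ {
    skip = 1
    while ((getline line < "newsection") > 0) print line
    close("newsection")
  }
  /^Node-copyfrom-path: test$/ && skip { skip = 0; next }
  !skip
' sample.dump > fixed.dump
```

The same pattern would have to be repeated (or parameterized) for each of the 15 revisions.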

There are 4 answers

David Corley (accepted answer)

I ended up using the following steps:

grep -n -C 250 "Revision-number: xxxxx" dump.file

This gave me the exact line numbers in the file of the node-operations for the "bad" commits. I then used sed to remove the range of node operations (by line number) for each commit as follows:

sed -e "123,456d" -e "234,456d" dump.file > dump.fixed

This proved to be pretty fast. For those curious, the reason I needed to remove these commits completely was that our repository scanner (Atlassian Fisheye) was taking days to index them. I was using exclusion rules that SHOULD have worked around the issue, but it turned out I had uncovered a bug in exclusion rules that is due to be fixed in the next release of Fisheye. See: http://jira.atlassian.com/browse/FE-2752
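A self-contained sketch of that two-step workflow on a toy file (the real dump, marker lines, and line numbers will of course differ):

```shell
# Stand-in dump: two good lines surrounding a bad commit's node operations.
printf 'keep1\nBAD start\nBAD middle\nBAD end\nkeep2\n' > toy.dump

# Step 1: find the line numbers of the range to cut (grep -n prints them).
start=$(grep -n '^BAD start$' toy.dump | cut -d: -f1)
end=$(grep -n '^BAD end$' toy.dump | cut -d: -f1)

# Step 2: delete that line range with sed, writing to a new file
# rather than trying to edit a 12GB file in place.
sed -e "${start},${end}d" toy.dump > toy.fixed
```

Because sed addresses by line number here, no pattern ever has to match inside the binary payload, which is what makes this approach safe on a mixed binary/ASCII dump.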

Beta

First a big caveat: sed and awk are designed to work on pure text files. If your file is a mixture of binary and ASCII, then I'm not confident that the following will work (personally I'd use Perl).

I assume that the "Revision-number: 60338" is what you want to use as your trigger (and heaven help you if it occurs in the binary). Put your revised section ("...This is an empty revision...") in a separate file called, say, newsection. Then:

sed -e '/^Revision-number: 60338$/r newsection' -e '/^Revision-number: 60338$/,/^Node-copyfrom-path: test$/d' bigfilename
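A toy run of that r + range-delete idiom, with invented file names and a miniature stand-in for the dump, to show the mechanics:

```shell
# Miniature stand-in for the dump and the replacement section.
printf 'before\nRevision-number: 60338\nold body\nNode-copyfrom-path: test\nafter\n' > big.toy
printf 'Revision-number: 60338\nnew body\n' > newsection

# The r command queues newsection for output at the trigger line; the
# range delete then removes the original section, trigger line included.
# The queued file is still flushed even though the line was deleted.
sed -e '/^Revision-number: 60338$/r newsection' \
    -e '/^Revision-number: 60338$/,/^Node-copyfrom-path: test$/d' big.toy > big.fixed
```

Note that newsection must itself begin with the "Revision-number: 60338" line, since the delete removes the original one.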
ldav1s

How about SvnDumpTool? You might be able to join the initial "good" part with the incrementally dumped edited parts.

khmarbaise

Do those commits contain confidential material, or what is the reason to remove them? Why not leave them in the repository, delete the tags/branches, and be done with it? EDIT: I overlooked that you had already removed the tags/branches...