Regular Expression to add Curly Brackets around the "comments" in a pgn (chess)


EDIT: To see a working version of the JS function, see the accepted answer in this followup thread

EDIT: the final regex I used is this:

var pattern = /((?:\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][O-]{0,3}[!?+#=]{0,2}[NBRQ]?[!?+#]{0,2})[^\)\(])+)|((?:\)\s\())/g;

I'm transcribing a bunch of chess games into official PGN format. Normally, "comments" in PGN format go inside curly brackets (e.g. "{ blah blah blah }"). My game sources are MS Word documents where the games are formatted in a "chess book" style, without the curly brackets around comments.

Here is my regular expression, which matches all blocks of "chess moves", leaving just the comments unmatched. Based on testing, I think it is about 95% correct.

var re = /(\s?[\(\)]?\s?[\(\)]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8\-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2}(\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8\-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2})?\s?[\(\)]?\s?[\(\)]?\s?)+/gm;

You can see my hideously long regular expression in action, along with a sample pgn here: http://regex101.com/r/dI8sQ8

[N.B. This piece of it

[NBRQK]?[a-h1-8]?x?[a-hO][1-8\-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2} 

which finds a chess move, repeats twice in the larger expression because sometimes the comment will come after just a white move (1. e4 comment), sometimes after just a black move (1... e5 comment), and sometimes after a white and black move together (1. e4 e5 comment)]
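To see what that single-move piece accepts on its own, it can be anchored and run against a few tokens. This is only a sketch: the names `movePiece`, `moveOnly` and `isMove` are made up for illustration, and the `^`/`$` anchors are an addition for testing that is not part of the larger expression.

```javascript
// The single-move piece from the question, kept as a string so it can be reused.
// String.raw avoids having to double the backslash. (Illustrative names.)
const movePiece = String.raw`[NBRQK]?[a-h1-8]?x?[a-hO][1-8\-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2}`;

// Anchored so the whole token must be a move (an assumption for testing only).
const moveOnly = new RegExp("^" + movePiece + "$");

function isMove(token) {
  return moveOnly.test(token);
}

// Print which tokens the piece accepts as a complete move.
for (const token of ["e4", "Nf3", "cxd4", "Bd3!?", "O-O", "f1=N+", "Qh8#", "O-O-O", "hello"]) {
  console.log(token, isMove(token));
}
```

With this version of the piece, O-O-O (queenside castling) is not matched in full, which is presumably why the final regex at the top uses [O-]{0,3} instead.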

So far, I can match all the blocks of moves, which is everything EXCEPT the comments. All that remains is to "replace" each matched block of moves with itself plus a " } " before and a " { " after (possibly skipping the beginning and end of the string).

I tried both of these replace functions:

str = str.replace(re,{\1});
str = str.replace(re,{$1});

The \1 thing I saw someone do here: regular expression to add brackets before and after a repeated text

and I have seen documentation of the $1 kind of thing. I guess I don't quite understand the difference, and neither seems to work.
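For what it's worth, in JavaScript the replacement has to be a string (or a function), and inside that string the first capture group is written `$1`. `\1` is a backreference that only has meaning inside the pattern itself (some other tools, such as sed, use `\1` in the replacement). An unquoted {$1} or {\1} is not a string, so it gets read as JavaScript code. A minimal sketch with a deliberately tiny pattern rather than the full move regex; `tiny`, `sample` and `wrapped` are illustrative names only:

```javascript
// Tiny stand-in pattern: one move number plus one pawn move, captured as group 1.
const tiny = /(\d+\.\s[a-h][1-8])/g;
const sample = "1. e4 best by test 2. d4 also fine";

// "$1" in the replacement string inserts whatever group 1 captured.
const wrapped = sample.replace(tiny, "} $1 {");
console.log(wrapped); // } 1. e4 { best by test } 2. d4 { also fine

// "\1" belongs inside the pattern: here it requires the same letter twice in a row.
const doubled = /([a-h])\1/;
console.log(doubled.test("Nbb3")); // true
console.log(doubled.test("Nb3"));  // false
```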

One more wrinkle, which you can see in the linked example: sometimes comments come inside variations (which are delimited by "(" and ")"). If one variation ends with a comment and the next begins with a comment, which looks like " ) ( ", we want to change that to " } ) ( { ". I think that would be easy to do with a second regular expression, after this one has gone through and added most of the curly brackets.
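A second pass along those lines could look like the sketch below. It assumes the text passed in has already been identified as one comment that runs across a variation boundary; the function name `splitAtVariationBoundary` is invented here. Running the same replacement blindly over the whole game would also hit ") (" boundaries that sit next to moves rather than comments, so that part is deliberately left out.

```javascript
// Hypothetical helper: takes the inside of one comment (no braces yet) that
// spans a ") (" boundary, and closes and reopens the curly brackets around it.
function splitAtVariationBoundary(comment) {
  return comment.replace(/\s*\)\s*\(\s*/g, " } ) ( { ");
}

const spanning = "would be a more careful response. ) ( But the best defense was";
const braced = "{ " + splitAtVariationBoundary(spanning) + " }";
console.log(braced);
// { would be a more careful response. } ) ( { But the best defense was }
```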

Thanks in advance for your help. Something tells me that I already did the hard part and I just don't understand the syntax $1 or \1

EDIT: Here is a sample of one of the pgns I'm working with.

Khabarovsk is the capital of Far East of Russia. My 16-year-old opponent was a promising local prodigy. Now he is a very strong FM with a FIDE rating of 2437 and lives... in the USA, too! A small world. 1. e4 c5 2. Nf3 e6 3. c3 Nf6 4. e5 Nd5 5. d4 cxd4 6. cxd4 d6 7. Nc3 Nc6 8. Bd3!? Nxc3 9. bxc3 dxe5 10. dxe5 Qa5 11. O-O Be7 12. Qb3 Nxe5 13. Nxe5 Qxe5 14. Bb5+ Kf8 15. Ba3 Qc7 16. Rad1 g6 17. c4! Bxa3 18. Qxa3+ Kg7 19. Rd6 Rd8 20. c5 Bd7 21. Bc4 Bc6 22. Rfd1 Rd7 23. Qg3 Rad8 Finally with accurate, solid play Black has consolidated yet White still keeps some pressure and has some compensation for the pawn. 24. h4 A typical march in such positions, simply nothing else to do better. 24... h5?! ( 24... h6 would be a more careful response. ) ( But the best defense was 24... Rd6! 25. cd6 Qa5 ) 25. Qe5+ Kh7 26. Bd3 Very natural 26... Kh6? ( Missing 26... Ba4! 27. Qxh5+ Kg7 28. Qe5+ Kg8! and now Black has many own threats. White would have to force a perpetual after 29. h5! Bxd1 30. h6 f6 31. Qxf6 Bh5 32. Qxe6+ Kh7 33. Bxg6+ Bxg6 34. Qxg6+ Kh8 35. Qf6+ Now, after 26...Kh6 everything is ready for preparing a decisive blow. ) 27. Qf6! Kh7 ( There is no 27... Rxd6 28. cxd6 Rxd6? due to 29. Qh8# ) 28. g4! hxg4 29. h5 Rxd6 30. cxd6 Rxd6 31. hxg6+ Kg8 32. g7! This pawn is the vital factor until the end now. With any other move, White loses. 32... Qd8! The only defense against Qh6 and Qh8 checkmating or queening. 33. Qh6 f5 34. Rd2!! The idea is the white rook cannot be taken with a check anymore. The bishop will be easily unpinned with the crushing Bxf5 or Bc4. The Black pin on d file was an illusion! In fact it's Black's rook that is pinned and cannot leave d file. 34... Bd5 ( The best try - to close d file with protecting more e6 pawn. No help is 34... Rd7 35. Bf5 ef5 36. Qh8 Kf7 37. Rd7 ) ( But maybe the best practical chance was 34... g3!? and now 35. Bxf5 doesn't win because of 35... gxf2+ 36. Kh2 f1=N+! 37. Kh3 Bg2+! 38. Rxg2 Rd3+! 39. 
Bxd3 Qxd3+ with an amazing perpetual 40. Kh4 Qe4+ 41. Rg4 Qh1+ 42. Kg5 Qd5+ 43. Kf6 Qd8+ 44. Kg6 Qd3+ ) ( But after 34... g3!? White wins using another wing tactic: 35. Bc4! Bd5 36. Bxd5 exd5 37. Qh8+ Kf7 38. Rc2 gxf2+ 39. Kf1! and there is no defense against Rc8. Now after 35...Bd5 again everything looks well protected. ) 35. Qh8 Kf7 36. Bb5! The bishop still makes his way breaking through. The coming Be8 is a killer. 36... Qg8 37. Be8+! Qxe8 38. Qe8+ Kxe8 39. g8=Q+ Kd7 40. Qg7+ It was White's 40th move Which means time control was over for me. I was short on time. A piece and three pawns for a queen is not enough. Black resigned. 1-0

There is 1 answer

AudioBubble On BEST ANSWER

I think you can match both the moves and the comments with one regex, then just build up a new string from the matches. Below, capture group 1 is the moves and group 2 is the comment.

Edit: a simple change. To get Dot-All behavior in JS, a . was changed to [\S\s]

    # /((?:\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-])[\S\s])+)/

    (                                  # (1 start), Moves
         (?:
              \s? 
              [()]?  \s? [()]? 
              \s? 
              [0-9]{1,3} \.{1,3} 
              \s 
              [NBRQK]? [a-h1-8]? x? [a-hO] [1-8-] [!?+#=O-]{0,2} [NBRQ]* [!?+#]{0,2} 
              (?:
                   \s 
                   [NBRQK]? [a-h1-8]? x? [a-hO] [1-8-] [!?+#=O-]{0,2} [NBRQ]* [!?+#]{0,2} 
              )?
              \s? 
              [()]? \s? [()]? 
              \s? 
         )+
    )                                  # (1 end)
 |  
    (                                  # (2 start), Comments
         (?:
              (?!
                   \s? 
                   [()]?  \s? [()]? 
                   \s? 
                   [0-9]{1,3} \.{1,3} 
                   \s 
                   [NBRQK]? [a-h1-8]? x? [a-hO] [1-8-] 
              )
              [\S\s]                   # Changed from `.` JS doesn't support Dot-All
         )+
    )                                  # (2 end)
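The reason for the [\S\s] substitution can be checked directly: without a dot-all mode, `.` stops at line breaks, while [\S\s] matches any character at all. A small sketch (the sample string is made up). Note that newer JavaScript engines (ES2018 and later) also accept an `s` flag for this, which may not have been available when this answer was written.

```javascript
// A comment that ends on one line, followed by a move on the next line.
const twoLines = "good move\n24. h4";

const plainDot = /move.24/.test(twoLines);      // "." does not match "\n"
const anyChar = /move[\S\s]24/.test(twoLines);  // [\S\s] matches any character
const dotAll = /move.24/s.test(twoLines);       // s flag, ES2018+ engines only

console.log(plainDot, anyChar, dotAll); // false true true
```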

Perl test case

$/ = undef;

$str = <DATA>;

$newstr = '';

while ( $str =~ /((?:\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-])[\S\s])+)/g )
{
    if (defined $1) {
       $newstr .= $1;
    }
    else {
       $newstr .= '{' . $2 . '}';
    }
    print "'$&'\n";
}

print "---------------\n";
print $newstr;


__DATA__
Khabarovsk is the capital of Far East of Russia. My 16-year-old opponent was a promising local prodigy. Now he is a very strong FM with a FIDE rating of 2437 and lives... in the USA, too! A small world. 1. e4 c5 2. Nf3 e6 3. c3 Nf6 4. e5 Nd5 5. d4 cxd4 6. cxd4 d6 7. Nc3 Nc6 8. Bd3!? Nxc3 9. bxc3 dxe5 10. dxe5 Qa5 11. O-O Be7 12. Qb3 Nxe5 13. Nxe5 Qxe5 14. Bb5+ Kf8 15. Ba3 Qc7 16. Rad1 g6 17. c4! Bxa3 18. Qxa3+ Kg7 19. Rd6 Rd8 20. c5 Bd7 21. Bc4 Bc6 22. Rfd1 Rd7 23. Qg3 Rad8 Finally with accurate, solid play Black has consolidated yet White still keeps some pressure and has some compensation for the pawn. 24. h4 A typical march in such positions, simply nothing else to do better. 24... h5?! ( 24... h6 would be a more careful response. ) ( But the best defense was 24... Rd6! 25. cd6 Qa5 ) 25. Qe5+ Kh7 26. Bd3 Very natural 26... Kh6? ( Missing 26... Ba4! 27. Qxh5+ Kg7 28. Qe5+ Kg8! and now Black has many own threats. White would have to force a perpetual after 29. h5! Bxd1 30. h6 f6 31. Qxf6 Bh5 32. Qxe6+ Kh7 33. Bxg6+ Bxg6 34. Qxg6+ Kh8 35. Qf6+ Now, after 26...Kh6 everything is ready for preparing a decisive blow. ) 27. Qf6! Kh7 ( There is no 27... Rxd6 28. cxd6 Rxd6? due to 29. Qh8# ) 28

Output >>

 'Khabarovsk is the capital of Far East of Russia. My 16-year-old opponent was a promising local prodigy. Now he is a very strong FM with a FIDE rating of 2437 and lives... in the USA, too! A small world.'
 ' 1. e4 c5 2. Nf3 e6 3. c3 Nf6 4. e5 Nd5 5. d4 cxd4 6. cxd4 d6 7. Nc3 Nc6 8. Bd3!? Nxc3 9. bxc3 dxe5 10. dxe5 Qa5 11. O-O Be7 12. Qb3 Nxe5 13. Nxe5 Qxe5 14. Bb5+ Kf8 15. Ba3 Qc7 16. Rad1 g6 17. c4! Bxa3 18. Qxa3+ Kg7 19. Rd6 Rd8 20. c5 Bd7 21. Bc4 Bc6 22. Rfd1 Rd7 23. Qg3 Rad8 '
 'Finally with accurate, solid play Black has consolidated yet White still keeps some pressure and has some compensation for the pawn.'
 ' 24. h4 '
 'A typical march in such positions, simply nothing else to do better.'
 ' 24... h5?! ( 24... h6 '
 'would be a more careful response. ) ( But the best defense was'
 ' 24... Rd6! 25. cd6 Qa5 ) 25. Qe5+ Kh7 26. Bd3 '
 'Very natural'
 ' 26... Kh6? ( '
 'Missing'
 ' 26... Ba4! 27. Qxh5+ Kg7 28. Qe5+ Kg8! '
 'and now Black has many own threats. White would have to force a perpetual after'
 ' 29. h5! Bxd1 30. h6 f6 31. Qxf6 Bh5 32. Qxe6+ Kh7 33. Bxg6+ Bxg6 34. Qxg6+ Kh8 35. Qf6+ '
 'Now, after 26...Kh6 everything is ready for preparing a decisive blow.'
 ' ) 27. Qf6! Kh7 ( '
 'There is no'
 ' 27... Rxd6 28. cxd6 Rxd6? '
 'due to'
 ' 29. Qh8# ) '
 '28'
 ---------------
 {Khabarovsk is the capital of Far East of Russia. My 16-year-old opponent was a promising local prodigy. Now he is a very strong FM with a FIDE rating of 2437 and lives... in the USA, too! A small world.} 1. e4 c5 2. Nf3 e6 3. c3 Nf6 4. e5 Nd5 5. d4 cxd4 6. cxd4 d6 7. Nc3 Nc6 8. Bd3!? Nxc3 9. bxc3 dxe5 10. dxe5 Qa5 11. O-O Be7 12. Qb3 Nxe5 13. Nxe5 Qxe5 14. Bb5+ Kf8 15. Ba3 Qc7 16. Rad1 g6 17. c4! Bxa3 18. Qxa3+ Kg7 19. Rd6 Rd8 20. c5 Bd7 21. Bc4 Bc6 22. Rfd1 Rd7 23. Qg3 Rad8 {Finally with accurate, solid play Black has consolidated yet White still keeps some pressure and has some compensation for the pawn.} 24. h4 {A typical march in such positions, simply nothing else to do better.} 24... h5?! ( 24... h6 {would be a more careful response. ) ( But the best defense was} 24... Rd6! 25. cd6 Qa5 ) 25. Qe5+ Kh7 26. Bd3 {Very natural} 26... Kh6? ( {Missing} 26... Ba4! 27. Qxh5+ Kg7 28. Qe5+ Kg8! {and now Black has many own threats. White would have to force a perpetual after} 29. h5! Bxd1 30. h6 f6 31. Qxf6 Bh5 32. Qxe6+ Kh7 33. Bxg6+ Bxg6 34. Qxg6+ Kh8 35. Qf6+ {Now, after 26...Kh6 everything is ready for preparing a decisive blow.} ) 27. Qf6! Kh7 ( {There is no} 27... Rxd6 28. cxd6 Rxd6? {due to} 29. Qh8# ) {28}
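Since the question is about JavaScript, the Perl loop above can be approximated with `String.prototype.replace` and a callback. This is a sketch, not the function from the follow-up thread: `addBraces` and `pgnPattern` are invented names, the pattern is the one from this answer copied as-is, and the short input is made up for illustration.

```javascript
// Pattern from the answer: group 1 = block of moves, group 2 = comment text.
const pgnPattern = /((?:\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2}(?:\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-][!?+#=O-]{0,2}[NBRQ]*[!?+#]{0,2})?\s?[()]?\s?[()]?\s?)+)|((?:(?!\s?[()]?\s?[()]?\s?[0-9]{1,3}\.{1,3}\s[NBRQK]?[a-h1-8]?x?[a-hO][1-8-])[\S\s])+)/g;

// Hypothetical helper mirroring the Perl loop: moves pass through unchanged,
// comments are wrapped in curly brackets. A group that did not take part in
// the match arrives in the callback as undefined.
function addBraces(text) {
  return text.replace(pgnPattern, (match, moves, comment) =>
    moves !== undefined ? moves : "{" + comment + "}"
  );
}

const result = addBraces("A small world. 1. e4 c5 Good. 2. Nf3 e6");
console.log(result);
// {A small world.} 1. e4 c5 {Good.} 2. Nf3 e6
```

As in the Perl output, a comment that runs across a " ) ( " boundary would still come out as a single braced block, so that case needs separate handling.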