How can I apply a regex only inside an infobox?


I need to remove wikicode image tags, but only inside infoboxes, using AutoWikiBrowser (.NET flavour).

For example, here I need to keep only the "xyz.jpg" file name (with its extension), without touching "abc.jpg" or the rest of the infobox:

{{Infobox xyz
|aaa= Xyz
|bbb= [[Xyz]]
|ccc=[[File:xyz.jpg|thumb|Xyz.]]
|ddd= {{lang|en|xyz}}
}}

[[File:abc.jpg|thumb|Abc.]]

I have the regex to remove image tags: \[\[ *fi(?:le|chier) *: *([^\|]*)[^\]]*\]\] (test here), but it also modifies abc.jpg

I also found a regex that selects only the infobox: (?=\{Infobox)(\{([^{}]|(?1))*\}) (test here), but it isn't in .NET flavour and I cannot adapt it to do what I want.

I am not sure that something like the last regex is possible in .NET, as this flavour doesn't seem to support subroutines. Is there a way to do it anyway, given that I must use the .NET flavour?


There are 2 answers

logi-kal On BEST ANSWER

In my experience, infoboxes typically nest templates at most two levels deep, so this should be a practical workaround:

(\{\{Infobox(?:[^{}]|\{\{(?:[^{}]|\{\{[^{}]*\}\})*\}\})*)\[\[ *(?:image|fi(?:le|chier)) *: *([^\|\]]*)[^\]]*\]\]

Note that "Image:" is also a valid namespace prefix for images/files.

See a demo here.

Also, beware that this kind of regex does not match multiple files in the same infobox in a single pass. As a workaround, you can either use a lookbehind so that the match does not start at {{Infobox (like here), or, in AWB's "Advanced settings", set "Apply No. of times" to an acceptable number of replacements per infobox (such as 5).
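To try the pattern outside AWB, here is a minimal Python sketch. Python's re module happens to accept this particular pattern unchanged, since it uses no .NET-only constructs. The function name unwrap_infobox_files, the max_passes parameter and the case-insensitive flag are assumptions made for illustration, not AWB settings; the loop only imitates what "Apply No. of times" would do.

```python
import re

# Same pattern as above; IGNORECASE is an assumption standing in for a
# case-insensitive option in the tool that runs the regex.
INFOBOX_FILE = re.compile(
    r"(\{\{Infobox(?:[^{}]|\{\{(?:[^{}]|\{\{[^{}]*\}\})*\}\})*)"
    r"\[\[ *(?:image|fi(?:le|chier)) *: *([^\|\]]*)[^\]]*\]\]",
    re.IGNORECASE,
)


def unwrap_infobox_files(text, max_passes=5):
    """Replace [[File:name|...]] by name, one file per infobox per pass."""
    for _ in range(max_passes):
        # \1 is everything from {{Infobox up to the file link, \2 the file name.
        text, count = INFOBOX_FILE.subn(r"\1\2", text)
        if count == 0:  # nothing left to replace
            break
    return text
```

Because the first group is greedy, each pass rewrites the last remaining file link of an infobox, so an infobox with n file links needs n passes.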

InSync On

Wikitext is very hard to parse. It also allows some HTML elements, which makes it all the more difficult. Remember: You can't parse [X]HTML with regex.

The following only attempts to balance the number of braces and brackets, but that's about it. It won't handle all the gory details and specific edge cases. Be careful when making edits.

As a general recommendation, learn a programming language and use it, be it Python, C#, JS or whatever. There are already many wikitext parsers. Most of them are not perfect, but they are still more useful than regex alone.

TLDR

(?=\[\[)
(?<=
  {{[Ii]nfobox
  (?(stack)(?!))
  (?>
    (?<-stack>\k<stack>)
  | (?<=(?<stack> {{ ) [^{]* ) }}
  | (?<=(?<stack>\[\[) [^\]]*) ]]
  | (?<!])] | (?<!\[)\[
  | (?<!})} | (?<!{ ){
  | [^[\]{}]
  )+
)
\[\[
(?i:File|Image):[^[|\]]+\|
(?:[^[|\]]*\|)*
(?>
  [^[\]{}]
| \[(?!\[) | ](?!])
| { (?!{ ) | }(?!})
| \[\[(?= [^\]]* (?<stack2>\]\]))
|  {{ (?= [^}]*  (?<stack2> }} ))
| (?<-stack2>\k<stack2>)
)*
(?(stack2)(?!))
\]\]

Try it on regex101.com.

What the heck was that?

I'm glad you asked.

First, please familiarize yourself with .NET's balancing groups.

This partly derives from this regex (the author's blog post), which matches mixed balanced brackets of all kinds:

(                                    # Match something that is either
    [^(){}\[\]]+                     # not a bracket

    | \( (?=[^)]*  (?<Stack> \) ) )  # or an opening parenthesis
                                     # (at this point, we lookahead to the
                                     # closest closing parenthesis
                                     # and add it to the stack)
    | \[ (?=[^\]]* (?<Stack> \] ) )  # or an opening square bracket (same as above)
    | \{ (?=[^}]*  (?<Stack> \} ) )  # or an opening brace (same as above)

    | \k<Stack> (?<-Stack>)          # or the last character stored in the stack
                                     # (after which we remove it from the stack)
)+?                                  # one or more times.
(?(Stack) (?!))                      # If there is still something left in
                                     # the stack, discard the match.

A slo-mo explanation (non-brackets removed for brevity):

Text     : ([{}[({})]]}
Position : ^
Lookahead:         ^
Stack    : )
Text     : ([{}[({})]]}
Position :  ^
Lookahead:          ^
Stack    : )]
Text     : ([{}[({})]]}
Position :   ^
Lookahead:    ^
Stack    : )]}
Text     : ([{}[({})]]}
Position :    ^
Stack[-1]: }
Stack    : )]
Text     : ([{}[({})]]}
Position :     ^
Lookahead:          ^
Stack    : )]]
Text     : ([{}[({})]]}
Position :      ^
Lookahead:         ^
Stack    : )]])
Text     : ([{}[({})]]}
Position :       ^
Lookahead:        ^
Stack    : )]])}
Text     : ([{}[({})]]}
Position :        ^
Stack[-1]: }
Stack    : )]])
Text     : ([{}[({})]]}
Position :         ^
Stack[-1]: )
Stack    : )]]
Text     : ([{}[({})]]}
Position :          ^
Stack[-1]: ]
Stack    : )]
Text     : ([{}[({})]]}
Position :           ^
Stack[-1]: ]
Stack    : )
Text     : ([{}[({})]]}
Position :            ^
Stack[-1]: )             # Fail
Stack    : 
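The same push/pop discipline can be sketched in a few lines of Python. This is only an illustration of the walkthrough, not of how the .NET engine works internally; the names CLOSER and is_balanced are made up for this example.

```python
# Maps each opening bracket to the closer we expect to see later.
CLOSER = {"(": ")", "[": "]", "{": "}"}


def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in CLOSER:
            # Opening bracket: remember which closer must come next.
            stack.append(CLOSER[ch])
        elif ch in CLOSER.values():
            # Closing bracket: it must equal the last closer we pushed.
            if not stack or stack.pop() != ch:
                return False
        # Any other character is ignored, like the non-bracket branch.
    # Leftover entries mean something was never closed.
    return not stack
```

With the text from the walkthrough, is_balanced("([{}[({})]]}") is False, because the final } meets a ) on top of the stack; replacing that last character with ) makes it True.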

Since we are dealing with doubled brackets and braces, let's modify it just a little bit:

(?>                                # Match either
  [^[\]{}]                         # a non-bracket
| \[(?!\[) | ](?!])                # or a standalone square bracket
| \{(?!\{) | }(?!})                # or a standalone brace
| \[\[(?= [^\]]* (?<stack>\]\]))   # or two consecutive square brackets (add to stack)
|  {{ (?= [^}]*  (?<stack> }} ))   # or two consecutive braces (also add to stack)
| (?<-stack>\k<stack>)             # or the last thing we added to stack.
)*                                 #
(?(stack)(?!))                     # 

We want to match only the file links and not other parameters, and the only way to do so is with a lookbehind. Lookbehinds match backwards, so we have to reverse everything. Add something to ensure that the match only occurs inside an infobox, and we're done:

(?<=
  {{[Ii]nfobox
  (?(stack)(?!))
  (?>
    (?<-stack>\k<stack>)
  | (?<=(?<stack> {{ ) [^{]* ) }}
  | (?<=(?<stack>\[\[) [^\]]*) ]]
  | (?<!])] | (?<!\[)\[
  | (?<!})} | (?<!{ ){
  | [^[\]{}]
  )+
)

But wait! File captions may contain links too:

[[File:xyz.jpg|thumb|Xyz. [[Link]] in caption.]]

Fixing this is simple enough: just slap it with the regex we already had. A bit of seasoning, and et voilà:

\[\[
(?i:File|Image):[^[|\]]+\|
(?:[^[|\]]*\|)*
(?>
  [^[\]{}]
| \[(?!\[) | ](?!])
| { (?!{ ) | }(?!})
| \[\[(?= [^\]]* (?<stack2>\]\]))
|  {{ (?= [^}]*  (?<stack2> }} ))
| (?<-stack2>\k<stack2>)
)*
(?(stack2)(?!))
\]\]

Conclusion

This regex matches all three expected results in the following example:

{{Infobox xyz
|aaa= Xyz
|bbb= [[Xyz]]
|ccc=[[File:xyz.jpg|thumb|Xyz.]]
|ddd= {{lang|en|xyz}}
|eee=[[File:xyz.jpg|thumb|Xyz.]]
|eee=[[File:xyz.jpg|thumb|Xyz. [[Link]] in caption.]]
}}

[[File:abc.jpg|thumb|Abc.]]

{{foo|[[File:abc.jpg|thumb|Abc.]]}}

If you need the name of the file, the caption or the options that come between them, wrap the relevant parts in capturing groups.

However, as stated, this may not match all cases, and I'm not going to fine-tune this mess. Use an actual programming language to deal with more complex cases.
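As a rough idea of what the "actual programming language" route could look like, here is a small Python sketch that counts doubled delimiters instead of using regex for the nesting. All names (balanced_end, unwrap_files, unwrap_files_in_infoboxes) are hypothetical, and it assumes plain {{...}} and [[...]] nesting only: triple braces such as {{{param}}}, nowiki tags and HTML comments are not handled.

```python
import re

INFOBOX_START = re.compile(r"\{\{\s*infobox", re.IGNORECASE)
FILE_START = re.compile(
    r"\[\[ *(?:image|fi(?:le|chier)) *: *([^|\[\]]*)", re.IGNORECASE
)


def balanced_end(text, start, opener, closer):
    """Offset just past the closer matching the opener at start, or -1."""
    depth, i = 0, start
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1
            i += len(opener)
        elif text.startswith(closer, i):
            depth -= 1
            i += len(closer)
            if depth == 0:
                return i
        else:
            i += 1
    return -1  # unbalanced: the caller leaves the text alone


def unwrap_files(fragment):
    """Replace every [[File:name|...]] in fragment by its file name."""
    out, pos = [], 0
    while True:
        m = FILE_START.search(fragment, pos)
        end = balanced_end(fragment, m.start(), "[[", "]]") if m else -1
        if end == -1:
            break
        out.append(fragment[pos:m.start()])
        out.append(m.group(1).strip())
        pos = end
    out.append(fragment[pos:])
    return "".join(out)


def unwrap_files_in_infoboxes(text):
    """Apply unwrap_files only inside {{Infobox ...}} templates."""
    out, pos = [], 0
    while True:
        m = INFOBOX_START.search(text, pos)
        end = balanced_end(text, m.start(), "{{", "}}") if m else -1
        if end == -1:
            break
        out.append(text[pos:m.start()])
        out.append(unwrap_files(text[m.start():end]))
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

On the example above, this turns the three file links inside the infobox into xyz.jpg (including the one with a link in its caption) and leaves both abc.jpg links untouched. A dedicated wikitext parser is still the safer choice for real edits.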