How can I use SPARQL regex to parse Wikitext and extract values from parameters in a Wikimedia Commons template?


This query against the Wikidata SPARQL endpoint returns the Wikitext content of the first 50 files in the Wikimedia Commons category "1930s photographs in Auckland Museum". For each file, I want to extract several pieces of data from that content.

Working with just one file, File:("Ultimate" stall) (AM 79483-1).jpg, as an example, the content looks like this:

== {{int:filedesc}} ==
{{Artwork 
| description = {{en|1=At the equestrian show. A man stands in front of a stall selling radios.}} 
| title = ("Ultimate" stall) 
| artist = {{Creator:Tudor Washington Collins}} 
| date = 1938 
| place of creation = 
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
           [https://api.aucklandmuseum.com/id/media/p/806abf5c0952f972e56bc95fed841c5031bcb9ff Photo] 
| accession number = 79483 (object number) 
| object type = 
| technique = Silver gelatin dry plate 
| dimensions = 
| institution = {{Institution:Auckland War Memorial Museum}} 
| permission = This image has been released as "CCBY" by Auckland Museum. For details refer to the
               [[Commons:Batch_uploading/AucklandMuseumCCBY|Commons project page]]. 
| credit line = 
| notes = 
| other_versions = <gallery> ("Ultimate" stall) (AM 79483-2).jpg </gallery>
}}

== {{int:license-header}} ==
{{CC-BY-4.0|1=Auckland Museum}}
[[Category:Images uploaded by Fæ]] [[Category:1930s photographs in Auckland Museum]]
[[Category:Tudor Washington Collins]] [[Category:Radio in Auckland Museum]]
[[Category:Images from Auckland Museum]]

I'm interested in the three values (section, object, and id) in the source parameter. I've tried to parse this content with regex; this is the first expression I wrote, which deals with the bulk of the Wikitext:

^(?>.+{{Images from Auckland Museum\|)(.*?)(?>}}.+)$

I used regex101.com to write this, and from what I can tell it says:

  1. Find (and discard) everything up to the string {{Images from Auckland Museum|, including that string. (This was the most obvious delimiter I could think of).
  2. Capture everything that occurs afterward.
  3. Find (and discard) everything from the first occurrence of a pair of right curly brackets (}}) to the end.

This leaves only the portion I'm interested in:

section=library|object=photography|id=79483

So far, so good.
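
For anyone who wants to reproduce this outside regex101.com, the three steps above can be sketched with Python's re module. This is only an approximation, not the SPARQL behaviour: plain non-capturing groups (?:...) stand in for the atomic groups (?>...), the literal braces are escaped, the DOTALL flag is an assumption about how the multi-line Wikitext gets matched, and WIKITEXT is an abbreviated copy of the file content shown earlier.

```python
import re

# Abbreviated copy of the example file's Wikitext (only the lines that matter here).
WIKITEXT = """{{Artwork
| title = ("Ultimate" stall)
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
           [https://api.aucklandmuseum.com/id/media/p/806abf5c0952f972e56bc95fed841c5031bcb9ff Photo]
| accession number = 79483 (object number)
}}"""

# Same shape as the first expression: skip up to the template name,
# lazily capture up to the first "}}", then consume the rest.
FIRST = re.compile(
    r"^(?:.+\{\{Images from Auckland Museum\|)(.*?)(?:\}\}.+)$",
    re.DOTALL,  # assumption: "." must be allowed to cross line breaks
)

match = FIRST.match(WIKITEXT)
params = match.group(1) if match else None
print(params)  # section=library|object=photography|id=79483
```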

I then created another regex101.com session to work on just that portion, with this expression:

(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)

From what I can tell, this expression says:

  1. Find (and discard) everything up to, and including, the first =.
  2. Capture everything after that, up to, but not including, the first |.

This pattern repeats three times, once for each capture group, giving me the three data points I want.

It seems to work: regex101.com evaluates the expression above against the string "section=library|object=photography|id=79483" and returns the three capture groups.
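
The same check can be sketched in Python, again as an approximation only: non-capturing groups (?:...) replace the atomic groups (?>...), and Python's re engine is not necessarily the engine behind the SPARQL endpoint. PARAMS is simply the string produced by the first expression.

```python
import re

PARAMS = "section=library|object=photography|id=79483"

# The second expression, with (?:...) in place of (?>...).
# Each greedy (.*) first grabs too much and then backtracks until the
# remaining "|" and "=" delimiters can still be matched after it.
SECOND = re.compile(r"(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)")

m = SECOND.match(PARAMS)
values = m.groups() if m else ()
print(values)  # ('library', 'photography', '79483')
```

The result comes from backtracking rather than from the groups literally stopping at each |, which is part of why a cheaper formulation seems worth asking about.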

My questions are these:

  1. How can I combine these regular expressions? Simply slotting the second into the first in place of its (.*?) does not appear to work.
  2. Given that regex allows recursion, is there a better (i.e., more efficient) way to write the second expression? (Would the SPARQL endpoint/language allow this?)
  3. Is there any way in the first expression to simply say, after obtaining the first capture group, something like, "I've got what I want; stop"—and would there be any efficiency gain in doing so?
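
To make questions 1 and 3 concrete, here is a rough Python sketch of the kind of combined expression I have in mind. It is an illustration under assumptions, not something tested against the SPARQL endpoint: it uses negated character classes such as [^|}]* instead of .* so that each piece stops at its delimiter, and it relies on an unanchored search so nothing before or after the template has to be matched. The COMBINED name and the abbreviated WIKITEXT sample are mine.

```python
import re

# Abbreviated copy of the example file's Wikitext.
WIKITEXT = """{{Artwork
| date = 1938
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
| accession number = 79483 (object number)
}}"""

# One pattern: the template name, then three "|name=value" pairs, then "}}".
# [^=|}]* matches a parameter name; ([^|}]*) captures its value.
COMBINED = re.compile(
    r"\{\{Images from Auckland Museum"
    r"\|[^=|}]*=([^|}]*)"  # section
    r"\|[^=|}]*=([^|}]*)"  # object
    r"\|[^=|}]*=([^|}]*)"  # id
    r"\}\}"
)

# search() scans for the first place the pattern fits and stops there,
# so the rest of the Wikitext is never consumed.
m = COMBINED.search(WIKITEXT)
section, obj, ident = m.groups() if m else (None, None, None)
print(section, obj, ident)  # library photography 79483
```

Whether the same pattern behaves identically inside SPARQL's REGEX or REPLACE functions is exactly what I am unsure about.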

Thanks in advance.
