This query against the Wikidata SPARQL endpoint returns the Wikitext content of the first 50 files in the Wikimedia Commons category "1930s photographs in Auckland Museum". For each file, I want to extract several pieces of data from that content.

Working with just one file, File:("Ultimate" stall) (AM 79483-1).jpg, as an example, the content looks like this:

== {{int:filedesc}} ==
{{Artwork 
| description = {{en|1=At the equestrian show. A man stands in front of a stall selling radios.}} 
| title = ("Ultimate" stall) 
| artist = {{Creator:Tudor Washington Collins}} 
| date = 1938 
| place of creation = 
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
           [https://api.aucklandmuseum.com/id/media/p/806abf5c0952f972e56bc95fed841c5031bcb9ff Photo] 
| accession number = 79483 (object number) 
| object type = 
| technique = Silver gelatin dry plate 
| dimensions = 
| institution = {{Institution:Auckland War Memorial Museum}} 
| permission = This image has been released as "CCBY" by Auckland Museum. For details refer to the
               [[Commons:Batch_uploading/AucklandMuseumCCBY|Commons project page]]. 
| credit line = 
| notes = 
| other_versions = <gallery> ("Ultimate" stall) (AM 79483-2).jpg </gallery>
}}

== {{int:license-header}} ==
{{CC-BY-4.0|1=Auckland Museum}}
[[Category:Images uploaded by Fæ]] [[Category:1930s photographs in Auckland Museum]]
[[Category:Tudor Washington Collins]] [[Category:Radio in Auckland Museum]]
[[Category:Images from Auckland Museum]]

I'm interested in the three values in the source parameter (section, object, and id). I've tried to parse this content with regex; this is the first expression I wrote, which deals with the bulk of the Wikitext:

^(?>.+{{Images from Auckland Museum\|)(.*?)(?>}}.+)$

I used regex101.com to write this, and from what I can tell it says:

  1. Find (and discard) everything up to the string {{Images from Auckland Museum|, including that string. (This was the most obvious delimiter I could think of.)
  2. Capture everything that occurs afterward.
  3. Find (and discard) everything from the first occurrence of a pair of right curly brackets (}}) to the end.

This leaves only the portion I'm interested in:

section=library|object=photography|id=79483

So far, so good.
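For anyone who wants to reproduce this outside regex101, here is a minimal Python sketch of the first expression. It rests on a few assumptions that are not part of my original setup: the Wikitext is abbreviated to a handful of lines, the atomic groups `(?>...)` are swapped for plain non-capturing groups `(?:...)` (Python's `re` only supports atomic groups from version 3.11), and `re.DOTALL` is set so that `.` can cross line breaks.

```python
import re

# Abbreviated copy of the sample Wikitext; only a few lines around "source" are kept.
wikitext = """== {{int:filedesc}} ==
{{Artwork
| title = ("Ultimate" stall)
| date = 1938
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
| accession number = 79483 (object number)
}}

== {{int:license-header}} ==
{{CC-BY-4.0|1=Auckland Museum}}
[[Category:Images from Auckland Museum]]
"""

# (?:...) stands in for the atomic (?>...) groups of the original expression.
# re.DOTALL lets "." match newlines, which the multi-line input needs.
first_pattern = re.compile(
    r"^(?:.+\{\{Images from Auckland Museum\|)(.*?)(?:\}\}.+)$",
    re.DOTALL,
)

match = first_pattern.match(wikitext)
source_params = match.group(1) if match else None
print(source_params)
```

On this abbreviated input the captured group is the same `section=...|object=...|id=...` string shown above.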

I then created another regex101.com session to work on just that portion, with this expression:

(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)

From what I can tell, this expression says:

  1. Find (and discard) everything up to, and including, the first =.
  2. Capture everything after that, up to, but not including, the first |.

This then repeats three times, once for each capture group, giving me the three data points I want.

It seems to work: [Screenshot: regex101.com evaluation of the expression (?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*) against the string section=library|object=photography|id=79483]
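The same check can be sketched in Python. As an assumption for portability, the atomic groups `(?>...)` are again replaced with non-capturing groups `(?:...)`, which yields the same captures on this particular input; the variable names are only illustrative.

```python
import re

source_params = "section=library|object=photography|id=79483"

# Second expression, with (?:...) standing in for the atomic (?>...) groups.
second_pattern = re.compile(
    r"(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)"
)

match = second_pattern.match(source_params)
section, obj, ident = match.groups() if match else (None, None, None)
print(section, obj, ident)  # library photography 79483
```

Note that each `(.*)` is greedy: it first grabs the rest of the string and only settles on these values after backtracking until the following `|` and `=` can still be matched.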

My questions are these:

  1. How can I combine these regular expressions? Simply slotting the second into the first in place of its (.*?) does not appear to work.
  2. Given that regex allows recursion, is there a better (i.e., more efficient) way to write the second expression? (Would the SPARQL endpoint/language allow this?)
  3. Is there any way in the first expression to simply say, after obtaining the first capture group, something like, "I've got what I want; stop"—and would there be any efficiency gain in doing so?
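To make questions 1 and 3 concrete, this is roughly the result I am after, sketched in Python rather than SPARQL and not tested against the endpoint. It assumes the three parameters always appear in the order section, object, id, and it uses negated character classes instead of `.*`; the pattern name `combined` is just illustrative. Because `search` is unanchored, nothing before or after the template has to be matched and discarded.

```python
import re

# One line of the sample Wikitext; the full page text would work the same way.
wikitext_line = (
    "| source = {{Images from Auckland Museum"
    "|section=library|object=photography|id=79483}}"
)

# Assumption: parameters are always named and ordered section, object, id.
# [^|}]* stops at the next "|" or "}", so no backtracking over the rest is needed.
combined = re.compile(
    r"\{\{Images from Auckland Museum"
    r"\|section=([^|}]*)"
    r"\|object=([^|}]*)"
    r"\|id=([^|}]*)\}\}"
)

m = combined.search(wikitext_line)
values = m.groups() if m else None
print(values)  # ('library', 'photography', '79483')
```

Whether the SPARQL endpoint's regex functions accept an equivalent pattern is exactly what I'm unsure about.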

Thanks in advance.
