This query against the Wikidata SPARQL endpoint returns the Wikitext content of the first 50 files in the Wikimedia Commons category "1930s photographs in Auckland Museum". For each file, I want to extract several pieces of data from that content.
Working with just one file, File:("Ultimate" stall) (AM 79483-1).jpg, as an example, the content looks like this:
== {{int:filedesc}} ==
{{Artwork
| description = {{en|1=At the equestrian show. A man stands in front of a stall selling radios.}}
| title = ("Ultimate" stall)
| artist = {{Creator:Tudor Washington Collins}}
| date = 1938
| place of creation =
| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}
[https://api.aucklandmuseum.com/id/media/p/806abf5c0952f972e56bc95fed841c5031bcb9ff Photo]
| accession number = 79483 (object number)
| object type =
| technique = Silver gelatin dry plate
| dimensions =
| institution = {{Institution:Auckland War Memorial Museum}}
| permission = This image has been released as "CCBY" by Auckland Museum. For details refer to the
[[Commons:Batch_uploading/AucklandMuseumCCBY|Commons project page]].
| credit line =
| notes =
| other_versions = <gallery> ("Ultimate" stall) (AM 79483-2).jpg </gallery>
}}
== {{int:license-header}} ==
{{CC-BY-4.0|1=Auckland Museum}}
[[Category:Images uploaded by Fæ]] [[Category:1930s photographs in Auckland Museum]]
[[Category:Tudor Washington Collins]] [[Category:Radio in Auckland Museum]]
[[Category:Images from Auckland Museum]]
I'm interested in the three values in the source parameter (library, photography and 79483). I've tried to parse this content with regex; this is the first expression I wrote, which deals with the bulk of the Wikitext:
^(?>.+{{Images from Auckland Museum\|)(.*?)(?>}}.+)$
I used regex101.com to write this, and from what I can tell it says:
- Find (and discard) everything up to the string {{Images from Auckland Museum|, including that string. (This was the most obvious delimiter I could think of.)
- Capture everything that occurs afterward.
- Find (and discard) everything from the first occurrence of a pair of right curly brackets (}}) to the end.
This leaves only the portion I'm interested in:
section=library|object=photography|id=79483
So far, so good.
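For anyone who wants to reproduce this step outside regex101.com, here is a minimal Python sketch. It rests on a few assumptions that are not part of my setup: Python's re module, a shortened copy of the sample Wikitext, the DOTALL flag so that "." also matches newlines, escaped braces, and non-capturing groups (?:...) in place of the atomic groups (?>...), because re only accepts atomic groups from Python 3.11 onward.

```python
import re

# Shortened copy of the sample Wikitext (assumption: only the lines
# around the source parameter matter for this step).
wikitext = (
    "| date = 1938\n"
    "| source = {{Images from Auckland Museum|section=library|object=photography|id=79483}}\n"
    "| accession number = 79483 (object number)\n"
)

# Non-capturing groups stand in for the atomic groups of the original
# expression; the braces are escaped to make them unambiguous literals.
first = re.compile(
    r"^(?:.+\{\{Images from Auckland Museum\|)(.*?)(?:\}\}.+)$",
    re.DOTALL,  # let "." cross line breaks, since the Wikitext is multi-line
)

match = first.match(wikitext)
params = match.group(1) if match else None
print(params)
```

On this sample, params holds the same portion shown above. Whether the SPARQL endpoint's regex flavor accepts the same syntax is a separate matter that this sketch does not test.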
I then created another regex101.com session to work on just that portion, with this expression:
(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)(?>.*?\|)(?>.*?\=)(.*)
From what I can tell, this expression says:
- Find (and discard) everything up to, and including, the first =.
- Capture everything after that, up to, but not including, the first |.
…and this repeats three times, once for each capture group, giving me the three data points I want.
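The second step can be sketched the same way. Again this assumes Python's re module rather than the SPARQL endpoint, and it swaps the atomic groups for non-capturing groups so that it also runs on Python versions before 3.11; the variable names are only illustrative.

```python
import re

# The portion isolated by the first expression.
params = "section=library|object=photography|id=79483"

# Same structure as the expression above, with (?:...) replacing (?>...).
second = re.compile(r"(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)(?:.*?\|)(?:.*?=)(.*)")

match = second.match(params)
values = match.groups() if match else ()
print(values)
```

Each (.*) is greedy, so it first grabs the rest of the string and is then backtracked until the groups that follow it can still match. That backtracking, rather than any explicit "stop at |" instruction, is what makes the first two captures end just before a pipe.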
My questions are these:
- How can I combine these regular expressions? Simply slotting the second into the first in place of its (.*?) does not appear to work.
- Given that regex allows recursion, is there a better (i.e., more efficient) way to write the second expression? (Would the SPARQL endpoint/language allow this?)
- Is there any way in the first expression to simply say, after obtaining the first capture group, something like, "I've got what I want; stop"—and would there be any efficiency gain in doing so?
Thanks in advance.