RegEx: is it possible to make multiple substitutions within a larger capture group?

51 views Asked by At

I need to transform survey results from a comma-separated variable file (originally Excel) where each of the few hundred records contains an arbitrary number of quotes about a location (preceded by "Story xx: " and followed by a new line) to a JSON file that I can use as the data source for a website that will map the responses. A typical record looks like this (I had to remove some URLs or the post was flagged as spam):

1,Alethia Tanner Park,227 Harry Thomas Way NE,Washington,DC,20002,3,"Story 1: First quote of someone commenting on this location.
Story 2: Second quote from a resident that likes this place.
Story 3: Third quote about this park in the city.







",38.91143761,-77.00165721

I want to transform the above to this where each quote is surrounded by HTML left/right double quotes and followed by a horizontal line:

{
    "latitudeLongitude": [38.91143761, -77.00165721], 
    "displayName": "Alethia Tanner Park", 
    "quote": "&ldquo;First quote of someone commenting on this location.&rdquo;<hr />&ldquo;Second quote from a resident that likes this place.&rdquo;<hr />&ldquo;Third quote about this park in the city.&rdquo;<hr />", 
    "numberOfCitations": 3
},

I am using regex101.com (which uses PCRE2 processor by default) and my input regular expression to extract the components of each record is:

^(?'groupId'[0-9]{1,3}),(?'displayName'[^,]+),(?'street'\?\?\?|[0-9 a-zA-Z]*),(?'city'\?\?\?|[ a-zA-Z]*),(?'state'\?\?\?|[A-Z]{2}),(?'zipCode'\?\?\?|[0-9]{5}),(?'numberOfCitations'[0-9]{1,2}),"(?:Story [0-9]{1,2}: (?'quote'[ 0-9a-zA-Z\(\)\-\+,.'&!?]*)\n)+\n*",(?'latitude'38\.[0-9]{4,8}),(?'longitude'-7[6|7]\.[0-9]{4,8})$

Note the part trying to capture all the individual quotes inside the string with a plus sign outside the group to indicate there is at least one:

"(?:Story [0-9]{1,2}: (?'quote'[ 0-9a-zA-Z\(\)\-\+,.'&!?]*)\n)+\n*"

My substitution is

{\n\t"latitudeLongitude": [${latitude}, ${longitude}], \n\t"displayName": "${displayName}", \n\t"quote": "(&ldquo;${quote}&rdquo;<hr />)+", \n\t"numberOfCitations": ${numberOfCitations}\n},

but I capture only the last of the three quotes so the output is

{
    "latitudeLongitude": [38.91143761, -77.00165721], 
    "displayName": "Alethia Tanner Park", 
    "quote": "&ldquo;Third quote about this park in the city.&rdquo;<hr />", 
    "numberOfCitations": 3
},

I hoped that adding parentheses and a plus sign (&ldquo;${quote}&rdquo;<hr />)+ to indicate multiple instances would work but clearly it does not as they just appear in the output as you cn see here https://regex101.com/r/pYaN92/1.

Is there a way to capture all the quotes and process them? I can achieve what I need by processing the quotes separately but I am curious to know if what I want to achieve is possible in a single pass.This post (Regex: Is it possible to do a substitution within a capture group?) suggests not but that is somewhat different as he is trying to match just on quote marks which exist elsewhere; I have a match on a "quote" within the overall capture but only one is processed.

0

There are 0 answers