Here is my best attempt (so far) to solve this issue. I'm new to regular expressions and this problem is pretty substantial, but I'll give it a try. Regexes clearly take some time to master.
This seems to satisfy the delimiter/comma requirements. It seems redundant to me, though, because of the repeated \s*. There is likely a better way.
/\s*[,|\s*]\s*/
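One thing worth checking in the pattern above: inside a character class, | and * are literal characters, so [,|\s*] also matches a pipe or an asterisk. The following is a minimal sketch, not a definitive fix, comparing it with one possible alternative using preg_split(). The sample strings and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input; not taken from the original post.
$input = "one , two   three,four";

// Original attempt: inside [...] the characters | and * are literal,
// so this class also matches a literal pipe or asterisk.
$original = '/\s*[,|\s*]\s*/';

// Possible alternative: a comma with optional surrounding whitespace,
// or a run of one or more whitespace characters.
$alternative = '/\s*,\s*|\s+/';

$a = preg_split($original, $input, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($alternative, $input, -1, PREG_SPLIT_NO_EMPTY);

// Only the original pattern treats a pipe as a delimiter.
$c = preg_split($original, "a|b", -1, PREG_SPLIT_NO_EMPTY);
$d = preg_split($alternative, "a|b", -1, PREG_SPLIT_NO_EMPTY);

print_r($b); // one, two, three, four
print_r($c); // a, b
print_r($d); // a|b
```

Both patterns split the sample input the same way; they only differ on input containing | or *. Neither handles quoted substrings, which is why splitting alone does not meet the requirements below.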
I found this on Stack Overflow and am trying to take it apart and apply it to my problem (not easy). This seems to satisfy most of the "quoting" requirements, but I'm still working on how to solve the delimiter issues in the requirements below.
/"(?:\\\\.|[^\\\\"])*"|\S+/
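To see what this pattern actually produces, here is a small sketch that runs it over the example input from the requirements. It assumes the pattern is written as a single-quoted PHP string, in which each doubled backslash collapses to a single one before the regex engine sees it; that reading of the doubled backslashes is my assumption.

```php
<?php
// Assumption: the pattern lives in a single-quoted PHP string, so the
// regex engine receives "(?:\\.|[^\\"])*"|\S+ after PHP unescaping.
$pattern = '/"(?:\\\\.|[^\\\\"])*"|\S+/';

// The example input string from the requirements.
$input = '"" one " two three " four , five " six seven " " "';

// $matches[0] receives every full match, in order of appearance.
preg_match_all($pattern, $input, $matches);
print_r($matches[0]);
```

On the example input this yields eight tokens: the empty "" and the whitespace-only " " are kept, the quoted phrases keep their inner padding, and the bare comma shows up as its own token because \S+ matches it. That is consistent with the delimiter handling still being unsolved at this point.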
The requirements I'm trying to meet:
- Will be used by the PHP preg_match_all() (or similar) function to break a string into an array of strings. Source language is PHP.
- Words in the input string are delimited by (0 or more whitespace)(optional comma)(0 or more whitespace) or just (1 or more whitespace).
- The input string can also have quoted substrings which become a single element in the output array.
- Quoted substrings in the input string must retain their double quotes when placed in the output array (because we must be able to identify them later as being originally quoted in the input string).
- Leading and trailing whitespace (that is, whitespace between the double-quote character and the string itself) in quoted substrings must be removed when placed into the output array. Example: "<space>hello<space>world<space><tab>" becomes "hello<space>world"
- Whitespace within quoted phrases in the input string must be reduced to a single space when placed into its output array element. Example: "hello<space><tab><space><space>world" becomes "hello<space>world"
- Quoted substrings in the input string that are zero-length or contain only whitespace are not placed into the output array (The output array must not contain any zero-length elements).
- Each element of the output array must be trimmed (left and right) for whitespace.
This example demonstrates all requirements above:
Input String:
"" one " two three " four , five " six seven " " "
Returns this array (double quotes actually exist in the strings shown below):
{one,"two three",four,five,"six seven"}
EDIT 9/13/2013
I have been studying regular expressions hard for a couple days and finally settled on this proposed solution. It may not be the best, but it's what I have at this time.
I will use this regex to split the search string into an array using PHP's preg_match_all() function:
/(?:"([^"]*)"|([^\s",]+))/
The leading and trailing "/" are the pattern delimiters that PHP's preg_match_all() function requires.
Once the matches have been collected, we retrieve them like this ($input is a placeholder for the search string):
preg_match_all(REGEX, $input, $x);
$Array = $x[0];
We have to do this because the function itself returns only the number of matches; the matches are written to the third argument as a compound array whose element 0 contains the full matches of the regex. The other elements contain the values captured by the groups, which we don't need.
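For reference, preg_match_all() returns the number of full matches and fills the array passed as its third argument. The sketch below shows the shape of that array for the regex above, run against the example input from the requirements. The variable names $pattern, $input and $count are placeholders of mine.

```php
<?php
$pattern = '/(?:"([^"]*)"|([^\s",]+))/';
$input   = '"" one " two three " four , five " six seven " " "';

// Return value: number of full matches. $x is filled by reference.
$count = preg_match_all($pattern, $input, $x);

// $x[0]: full matches (quoted tokens keep their double quotes).
// $x[1]: contents of the first group (text inside the quotes).
// $x[2]: contents of the second group (unquoted words).
$Array = $x[0];
print_r($Array);
```

With this input, $Array holds seven elements: "", one, " two three ", four, five, " six seven " and " ". The comma is already gone because neither alternative can match it, but the empty and whitespace-only quoted tokens still need to be filtered out afterwards.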
Now, I will iterate over the resulting array and process each element to meet the requirements (above), which will be much easier than meeting all the requirements in a single step using a single regex.
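As a rough illustration of that per-element pass, here is a minimal sketch. The helper name normalizeTokens() is hypothetical, and this is only one way the step might look, not necessarily how the final function does it.

```php
<?php
// Hypothetical helper; name and structure are assumptions for illustration.
function normalizeTokens(array $tokens): array
{
    $out = [];
    foreach ($tokens as $token) {
        $token = trim($token);
        $quoted = strlen($token) >= 2
            && $token[0] === '"'
            && substr($token, -1) === '"';
        if ($quoted) {
            // Collapse whitespace runs to one space, then trim inside the quotes.
            $inner = (string) substr($token, 1, -1);
            $inner = trim(preg_replace('/\s+/', ' ', $inner));
            if ($inner !== '') {
                $out[] = '"' . $inner . '"'; // keep the double quotes
            }
        } elseif ($token !== '') {
            $out[] = $token; // unquoted word, already trimmed
        }
    }
    return $out;
}

$input = '"" one " two three " four , five " six seven " " "';
preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $input, $x);
$result = normalizeTokens($x[0]);
print_r($result); // one, "two three", four, five, "six seven"
```

Run on the example input, this produces the five elements listed in the expected output above. Splitting the work this way keeps the regex simple and moves trimming, whitespace collapsing and empty-element filtering into ordinary string handling.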
I have finally developed a solution to this problem, which uses a few PHP statements built on regular expressions. Below is the final function.
This function is part of a class which is why it begins with "public".