RegEx in PHP to extract components of nquad


I'm looking for a RegEx that can help me parse an nquad file. An nquad file is a plain text file where each line represents a quad (s, p, o, c), i.e. subject, predicate, object and context:

<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .

The objects can also be literals (instead of URIs), in which case they are enclosed in double quotes:

<http://mysubject> <http://mypredicate> "My object" <http://mycontext> .

I'm looking for a regex that, given one line of this file, will give me back a PHP array in the following format:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "http://myobject"
[3] => "http://mycontext"

...or in the case where the double quotes are used for the object:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "My object"
[3] => "http://mycontext"

One final thing: in an ideal world, the regex will cater for the scenario where there may be one or more spaces between the various components, e.g.

<http://mysubject>     <http://mypredicate>  "My object"       <http://mycontext> .

There are 3 answers

0
nickb On BEST ANSWER

I'm going to add another answer as an additional solution using only a regex and explode:

$line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
$line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';

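$delimiter = '---'; // Can't use a space, because a literal object may contain spaces
// Capture s, p, o, c (the object is either <uri> or "literal") and consume an optional trailing " ."
$result = preg_replace('/^\s*<([^>]*)>\s+<([^>]*)>\s+["<]([^">]*)[">]\s+<([^>]*)>\s*\.?\s*$/', '$1' . $delimiter . '$2' . $delimiter . '$3' . $delimiter . '$4', $line);
$array = explode($delimiter, $result);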
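For comparison, the same capture groups can be read directly with preg_match, which avoids the delimiter round trip. This is only a minimal sketch: the function name parse_nquad_line is hypothetical, and it assumes the subject, predicate and context are always wrapped in angle brackets and that literals contain no double quotes.

```php
<?php
// Hypothetical helper (not from the answer above): returns [s, p, o, c],
// or null when the line does not look like a quad.
function parse_nquad_line(string $line): ?array
{
    // Same groups as the preg_replace pattern: the object may be <uri> or "literal"
    $pattern = '/<([^>]*)>\s+<([^>]*)>\s+["<]([^">]*)[">]\s+<([^>]*)>/';
    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }
    // $matches[0] is the whole match; the four components follow it
    return array_slice($matches, 1);
}

var_dump(parse_nquad_line('<http://mysubject>   <http://mypredicate>  "My object"   <http://mycontext> .'));
```

Because preg_match fills the array itself, there is no need to pick a delimiter that might also occur inside a URI or a literal.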
0
Aziz Shaikh On

This regular expression would help:

/(\S+?)\s+(\S+?)\s+(\S+?)\s+(\S+?)\s+\./

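The (s, p, o, c) values will be in capture groups $1, $2, $3 and $4 (that is, $matches[1] through $matches[4] when used with preg_match), still wrapped in their angle brackets.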
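As a rough sketch of how this whitespace-based pattern might be used from PHP, here is a variant that also accepts a quoted literal containing spaces as the object. The anchors, the "[^"]*" alternative and the substr call are my additions, not part of the answer; the sketch assumes every component is wrapped in exactly one character on each side (<...> or "...").

```php
<?php
// Variant of the pattern above: the third token may be a quoted literal with spaces
$pattern = '/^\s*(\S+)\s+(\S+)\s+("[^"]*"|\S+)\s+(\S+)\s*\.\s*$/';
$line = '<http://mysubject>  <http://mypredicate> "My object"   <http://mycontext> .';

$quad = [];
if (preg_match($pattern, $line, $matches)) {
    foreach (array_slice($matches, 1) as $token) {
        // Drop the single wrapping character at each end (< > or " ")
        $quad[] = substr($token, 1, -1);
    }
}

var_dump($quad);
```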

2
nickb On

It seems this can be accomplished as follows (I do not know your character restrictions, so it may not work for your specific needs, but it worked for your test cases):

$line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
$line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';

// Replace the whitespace between entries with a delimiter (change $line to $line2 for testing)
$delimiter = '---';
$result = preg_replace('/([">])\s+(["<])/', '$1' . $delimiter . '$2', $line);

// Explode on our delimiter
$array = explode($delimiter, $result);
foreach ($array as &$a)
{
    // Strip the wrapping <...> or "..." and any trailing " ." without touching dots inside URIs
    $a = preg_replace('/^\s*[<"]|[>"]\s*\.?\s*$/', '', $a);
}
unset($a); // Break the reference left behind by the foreach

var_dump($array);
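Since the question is about a whole file with one quad per line, the same split-and-clean idea could be applied line by line. The following is only a sketch: the function name nquads_to_arrays is hypothetical, the file contents are assumed to be already loaded into a string, and '---' is assumed never to occur inside a URI or literal.

```php
<?php
// Hypothetical helper, not part of the answer: turns every non-empty line
// of $text into an array of [s, p, o, c].
function nquads_to_arrays(string $text): array
{
    $delimiter = '---'; // assumed never to occur inside a URI or literal
    $quads = [];
    // \R matches any line-break sequence; empty lines are dropped
    foreach (preg_split('/\R/', $text, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        // Mark the boundaries between entries, then split on the marker
        $marked = preg_replace('/([">])\s+(["<])/', '$1' . $delimiter . '$2', trim($line));
        $parts = explode($delimiter, $marked);
        // Strip the wrapping <...> or "..." and a trailing " ." from each part
        $quads[] = preg_replace('/^[<"]|[>"]\s*\.?$/', '', $parts);
    }
    return $quads;
}

$text = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext> .\n"
      . "<http://mysubject>   <http://mypredicate>  \"My object\"   <http://mycontext> .";
var_dump(nquads_to_arrays($text));
```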