Extract JSON from HTML using PHP


I'm reading source code of an online shop website, and on each product page I need to find a JSON string which shows product SKUs and their quantity.

Here are 2 samples:

'{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":10,"sku-SV023435_PU_M":11}'

The sample above shows 3 SKUs.

'{"sku-11430_B_S":"20","sku-11430_B_M":"17","sku-11430_B_L":"30","sku-11430_B_XS":"13","sku-11430_BL_S":"7","sku-11430_BL_M":"17","sku-11430_BL_L":"4","sku-11430_BL_XS":"16","sku-11430_O_S":"8","sku-11430_O_M":"6","sku-11430_O_L":"22","sku-11430_O_XS":"20","sku-11430_LBL_S":"27","sku-11430_LBL_M":"25","sku-11430_LBL_L":"22","sku-11430_LBL_XS":"10","sku-11430_Y_S":"24","sku-11430_Y_M":36,"sku-11430_Y_L":"20","sku-11430_Y_XS":"6","sku-11430_RR_S":"4","sku-11430_RR_M":"35","sku-11430_RR_L":"47","sku-11430_RR_XS":"6"}',

The sample above shows many more SKUs.

The JSON string can contain any number of SKUs, from one upward.

Now, I need a regex pattern to extract this JSON string from each page. At that point, I can easily use json_decode().

Update: I found another problem; sorry that my question was incomplete. There is another, similar JSON string whose keys also start with sku-. Please have a look at the source code of the link below and you will understand. The only difference is that its values are alphanumeric, while the values in the one we need are numeric. Also, please note that our final goal is to extract the SKUs with their quantities, so maybe you have a more straightforward solution.

Source

@chris85

Second update:

Here is another strange issue, which is a bit off topic.

When I fetch the URL content using the code below, there is no JSON string in the source!

$html = file_get_contents("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");

But when I open the URL in my browser, the JSON is there! I'm really confused about this :(

3

There are 3 answers

1
Phil On

Extracting specific values from JSON directly with a regex is usually a bad idea, because of the way JSON is encoded. A better approach is to match the whole JSON string with a regex, then decode it with the PHP function json_decode.

The issue with the missing data is due to a missing required cookie. See my comments in the code below.

<?php

function getHtmlFromDresslinkUrl($url)
{
    $ch = curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);

    //You must send the currency cookie to the website for it to return the json you want to scrape
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Cookie: currencies_code=USD;',
    ));

    $output=curl_exec($ch);

    curl_close($ch);
    return $output;
}

$html = getHtmlFromDresslinkUrl("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");

//Get the arguments of this one JS function call only
if (preg_match("/DL\.items_list\.initItemAttr\((.+)\);/", $html, $matches) === 1) {
    $arguments = $matches[1];

    //Split by argument separator.
    //I know, this isn't great but it seems to work.
    $args_array = explode(", ", $arguments);

    //You need the 5th argument (index 4)
    $fifth_arg = isset($args_array[4]) ? $args_array[4] : '';

    //Strip quotes
    $fifth_arg = trim($fifth_arg, "'");

    //json_decode returns null if the string is not valid JSON
    $qty_data = json_decode($fifth_arg, true);

    //Then you can work with the php array
    if (is_array($qty_data)) {
        foreach ($qty_data as $name => $qtty) {
            echo "Found " . $qtty . " of " . $name . "<br />";
        }
    }
}

?>

Special thanks to @chris85 for making me read the question again. Sorry but I couldn't undo my downvote.

2
Reeno On

A simple /'(\{"[^\}]+\})'/ will match all these JSON strings. Demo: https://regex101.com/r/wD5bO4/2

The first capture group will contain the JSON string for json_decode:

preg_match_all ("/'(\{\"[^\}]+\})'/", $html, $matches);

$html is the HTML to be parsed. With the default PREG_PATTERN_ORDER, the JSON strings (without the surrounding quotes) will be in $matches[1][0], $matches[1][1], $matches[1][2], etc.
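
Since the updated question mentions a second, similar JSON string with alphanumeric values, one way to pick the right one is to decode every candidate and keep the first whose values are all numeric. This is only a minimal sketch: the $html fragment (including the foo() call) and the helper name findSkuQuantities are made up for illustration, and the real page may be laid out differently.

```php
<?php
// Hypothetical page fragment (assumption, not the real page source):
// two single-quoted JSON objects keyed by "sku-...". The first has
// alphanumeric values, the second holds the quantities we want.
$html = <<<'HTML'
<script>
foo('{"sku-SV023435_B_M":"Black","sku-SV023435_BL_M":"Blue"}', '{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":"10"}');
</script>
HTML;

/**
 * Hypothetical helper: return the first quoted JSON object in $html whose
 * keys all start with "sku-" and whose values are all numeric, as an array
 * of sku => int quantity. Returns null if no such object is found.
 */
function findSkuQuantities($html)
{
    // $matches[1] holds the first capture group of every match.
    preg_match_all("/'(\{\"[^\}]+\})'/", $html, $matches);

    foreach ($matches[1] as $candidate) {
        $data = json_decode($candidate, true);
        if (!is_array($data) || count($data) === 0) {
            continue; // not valid JSON, or an empty object
        }
        $ok = true;
        foreach ($data as $key => $value) {
            if (strpos($key, 'sku-') !== 0 || !is_numeric($value)) {
                $ok = false; // wrong key prefix or non-numeric value
                break;
            }
        }
        if ($ok) {
            // Normalise 7 and "10" alike to integers; keys are preserved.
            return array_map('intval', $data);
        }
    }
    return null;
}

$quantities = findSkuQuantities($html);
print_r($quantities);
```

Decoding first and filtering afterwards keeps the regex simple and leaves the decision about which object is the right one to ordinary PHP code.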

2
grill On

You will want to use preg_match_all() to perform the regex matching operation (documentation here).

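The following should do it for you. It will match each substring that begins with "sku-" and runs through the colon and any digits that directly follow it. Quoted values such as "20" yield no digits, because the quote comes right after the colon.

preg_match_all("/sku\-.+?:[0-9]*/", $input, $matches);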

Working example here.
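
If the goal is SKU/quantity pairs rather than raw substrings, capture groups can collect both parts in one pass. The following is a rough sketch, not a replacement for json_decode: the pattern is my own variation, and it assumes keys never contain a double quote and quantities are plain non-negative integers, quoted or not.

```php
<?php
// Shortened sample from the question: values may be bare or quoted numbers.
$input = '{"sku-11430_Y_S":"24","sku-11430_Y_M":36,"sku-11430_Y_L":"20"}';

// Group 1: the SKU key. Group 2: the digits, with optional surrounding quotes.
// The lookahead requires a comma or closing brace next, so a value such as
// "2XL" is skipped instead of being read as 2.
preg_match_all('/"(sku-[^"]+)":"?(\d+)"?(?=[,}])/', $input, $matches);

// Pair each SKU with its quantity, converted to an integer.
$quantities = array_combine($matches[1], array_map('intval', $matches[2]));

print_r($quantities);
```

Because only numeric values match, the similar JSON string with alphanumeric values mentioned in the question contributes no pairs.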

Alternatively, if you want to extract the entire string, you can use:

preg_match_all("/\{.sku\-.*\}/", $input, $matches);

This will grab everything from the opening brace to the last closing brace on the same line, since .* is greedy.

Working example here.

Please note that $input denotes the input string.