Scrape more than 1,000 product details from Shopee using PHP cURL, then store them in a database

I have a project that scrapes product data from Shopee. Scraping works for a small number of products, but when there are thousands of products, only a few hundred succeed; the rest fail with a "forbidden" error. I have tried three PHP approaches: curl_init, curl_multi_init, and a Curl class.

  1. PHP curl_init(). This method returns an array.
function scrapcurl($data){
   $result = [];
   foreach ($data as $key => $value) {
      $url = $value;
      $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
      $handle = curl_init();
                
      // Set the url
      curl_setopt($handle, CURLOPT_URL, $url);
      curl_setopt($handle, CURLOPT_USERAGENT, $ua);
      curl_setopt($handle, CURLOPT_HEADER, 0);
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, 1);
      $output = curl_exec($handle);
      curl_close($handle);
      array_push($result, $output);
   }
   return $result;
}
  2. PHP curl_multi_init(). This method returns an array of JSON strings, e.g. {"error":null,"error_msg":null,"data":{"itemid":14513803134,"shopid":40261202,"userid":0,...}, which I then convert to associative arrays with another function.
function multiRequest($data, $options = array()) {
    // array of curl handles
    $curly = array();
    // data to be returned
    $result = array();

    // multi handle
    $mh = curl_multi_init();

    // loop through $data and create curl handles
    // then add them to the multi-handle
    foreach ($data as $id => $d) {
        $curly[$id] = curl_init();

        $url = (is_array($d) && !empty($d['url'])) ? $d['url'] : $d;
        curl_setopt($curly[$id], CURLOPT_URL,            $url);
        curl_setopt($curly[$id], CURLOPT_HEADER,         0);
        curl_setopt($curly[$id], CURLOPT_RETURNTRANSFER, 1);

        // post?
        if (is_array($d)) {
            if (!empty($d['post'])) {
            curl_setopt($curly[$id], CURLOPT_POST,       1);
            curl_setopt($curly[$id], CURLOPT_POSTFIELDS, $d['post']);
            }
        }

        // extra options?
        if (!empty($options)) {
            curl_setopt_array($curly[$id], $options);
        }

        curl_multi_add_handle($mh, $curly[$id]);
    }

    // execute the handles; wait for socket activity instead of busy-looping
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh);
        }
    } while ($running > 0);


    // get content and remove handles
    foreach($curly as $id => $c) {
        $result[$id] = curl_multi_getcontent($c);
        curl_multi_remove_handle($mh, $c);
    }

    // all done
    curl_multi_close($mh);

    return $result;
}
  3. Curl class. This method returns an array.
use Curl;

function scrap($data)
{
    $resultawal=[];
    $result=[];
    $image=[];
    foreach ($data as $key => $value) {
        # code...
        $curl = new Curl();
        $curl->get($value);
        if ($curl->error) {
            # code...
            echo 'Error: ' . $curl->errorCode . ': ' . $curl->errorMessage . "\n";
        }
        else {
            # code...
            $js = $curl->response;
            // reset per product so images from an earlier product do not carry over
            $image = [];
            foreach ($js->data->images as $imgKey => $imgValue) {
                $image["img$imgKey"] = $imgValue;
            }
            $gambar1 = json_encode($image);
            $harga = substr($js->data->price_max, 0, -5);
            $stok = $js->data->stock;
            $nama = str_replace("'", "", $js->data->name);
            $catid = $js->data->catid;
            $deskripsi = str_replace("'", "", $js->data->description);
            if ($js->data->video_info_list != '') {
                $video = $js->data->video_info_list;
                $video1 = json_encode($video);
            } else {
                $video1 = null;
            }
            $linkss = "https://shopee.co.id/" . str_replace(" ", "-", $nama) . "-i." . $js->data->shopid . "." . $js->data->itemid;
            $berat = 0; // berat
            $min = 1; // minimum_pemesanan
            $etalase = NULL; // etalase
            $preorder = 1; //preorder
            $kondisi = "Baru";
            $sku = NULL;
            $status = "Aktif";
            $asuransi = "optional";
            $item_id = $js->data->itemid;

            $resultawal = array(
                'item_id'=>$item_id,
                'linkss'=>$linkss,
                'nama'=>$nama,
                'deskripsi'=>$deskripsi,
                'catid'=>$catid,
                'berat'=>$berat,
                'min'=>$min,
                'etalase'=>$etalase,
                'preorder'=>$preorder,
                'kondisi'=>$kondisi,
                'gambar1'=>$gambar1,
                'video1'=>$video1,
                'sku'=>$sku,
                'status'=>$status,
                'stok'=>$stok,
                'harga'=>$harga,
                'asuransi'=>$asuransi,
            );
            array_push($result, $resultawal);
        }
    }
    return $result;
}
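
For method 2, the conversion from JSON strings to associative arrays mentioned above could look roughly like the sketch below. The function name decodeResponses() is hypothetical (it is not the function used in the project), and the sample body is a shortened, made-up variant of the response shown earlier. It assumes a usable response always carries a non-null "data" key; anything else (for example a 403 error page) is kept as null so it can be logged or retried.

```php
<?php
// Hypothetical helper: decode the raw bodies returned by multiRequest()
// into associative arrays, keeping null for anything that is not usable.
function decodeResponses(array $responses): array
{
    $decoded = [];
    foreach ($responses as $id => $body) {
        // json_decode(..., true) returns arrays instead of stdClass objects
        $item = is_string($body) ? json_decode($body, true) : null;
        // Assumption: a successful response has a non-null "data" key
        $decoded[$id] = (is_array($item) && isset($item['data'])) ? $item : null;
    }
    return $decoded;
}

// Shortened sample body (illustrative only) plus a non-JSON error page
$sample = [
    'ok'  => '{"error":null,"error_msg":null,"data":{"itemid":14513803134,"shopid":40261202,"userid":0}}',
    'bad' => '<html>403 Forbidden</html>',
];
$out = decodeResponses($sample);
```

Entries that come back as null mark the requests that would need to be retried or inspected.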

My question: with the three methods above, when there are thousands of links, why do methods 1 and 2 return a 403 Forbidden error, and why does method 3 report "error: 403: HTTP/2 403"?

Additional info: the input to the program is thousands of product links. For example:

5Pcs-pt4115-4115-sot-89-IC-Power-IC-LED-i.41253123.1355347598.sp_atk=09264df0-bb8d-4ca5-8970-719bbb2149dd

From this I take shopid=41253123 and itemid=1355347598 and put them into this link:

$link = "https://shopee.co.id/api/v4/item/get?itemid=" . $item_id . "&shopid=" . $shop_id;

and then use the three methods above to scrape the product data.
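
The ID extraction described above can be sketched as follows. The function names parseShopeeLink() and buildItemApiUrl() are hypothetical, and the regular expression assumes every link contains "-i.<shopid>.<itemid>" as in the example. The IDs are kept as strings so large item IDs are not truncated on 32-bit PHP builds.

```php
<?php
// Hypothetical helper: pull shopid and itemid out of a product link or slug.
function parseShopeeLink(string $link): ?array
{
    // Assumption: links contain "-i.<shopid>.<itemid>", optionally followed
    // by tracking parameters such as sp_atk.
    if (preg_match('/-i\.(\d+)\.(\d+)(?=[.?&#]|$)/', $link, $m) !== 1) {
        return null; // not a recognizable product link
    }
    return ['shop_id' => $m[1], 'item_id' => $m[2]];
}

// Hypothetical helper: build the item API URL used in the question.
function buildItemApiUrl(string $itemId, string $shopId): string
{
    return "https://shopee.co.id/api/v4/item/get?itemid=" . $itemId . "&shopid=" . $shopId;
}

$ids = parseShopeeLink('5Pcs-pt4115-4115-sot-89-IC-Power-IC-LED-i.41253123.1355347598.sp_atk=09264df0-bb8d-4ca5-8970-719bbb2149dd');
$link = ($ids !== null) ? buildItemApiUrl($ids['item_id'], $ids['shop_id']) : null;
// $link is "https://shopee.co.id/api/v4/item/get?itemid=1355347598&shopid=41253123"
```

Links that do not match return null, so malformed input can be skipped before any request is sent.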
