Please have a look at the following code:
wcmapper.php (mapper for the Hadoop streaming job)
#!/usr/bin/php
<?php
//sample mapper for hadoop streaming job
$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // lowercase and remove leading/trailing whitespace
    $line = strtolower(trim($line));
    // split the line into words, dropping any empty strings
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word's count on first sight
    // (avoids an "undefined index" notice on the first occurrence)
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word]++;
    }
}

// write the results to STDOUT (standard output), tab-delimited
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>
wcreducer.php (reducer script for the sample Hadoop job)
#!/usr/bin/php
<?php
//reducer script for sample hadoop job
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // skip blank lines so explode() always yields a word and a count
    if ($line === '') {
        continue;
    }
    // parse the tab-delimited "word<TAB>count" pairs emitted by wcmapper.php
    list($word, $count) = explode("\t", $line);
    // convert count (currently a string) to an int
    $count = intval($count);
    // sum counts, initializing each word's total on first sight
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}

// sort the words alphabetically
ksort($word2count);

// write the results to STDOUT (standard output), tab-delimited
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>
This code is for a word-count streaming job using PHP on the Common Crawl dataset.
As written, this code reads the entire input. That is not what I need: I need to read only the first 100 lines and write them to a text file. I am a beginner with Hadoop, Common Crawl, and PHP. How can I do this?
Please help.
Use a counter in the first loop and stop processing once it reaches 100. Then add a dummy loop that simply reads until the end of the input, and continue with the rest of your code (writing the results to STDOUT). Writing the results could also happen before the dummy loop that drains the remaining STDIN input. Sample code follows:
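Here is a minimal sketch of the modified wcmapper.php, keeping everything else from your question unchanged; the line counter, the early break, the drain loop, and the isset() initialization are the only additions:

#!/usr/bin/php
<?php
// sample mapper that only counts words in the first 100 input lines
$word2count = array();
$lineCount = 0;

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    $lineCount++;
    if ($lineCount > 100) {
        break; // stop processing after the first 100 lines
    }
    // lowercase and remove leading/trailing whitespace
    $line = strtolower(trim($line));
    // split the line into words, dropping any empty strings
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word's count on first sight
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word]++;
    }
}

// dummy loop: read and discard the rest of STDIN so the streaming
// framework can finish feeding the mapper without a broken-pipe error
while (fgets(STDIN) !== false) {
    // discard remaining input
}

// write the results to STDOUT (standard output), tab-delimited
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>

Two things to keep in mind: in a Hadoop streaming job each mapper only sees its own input split, so this limits every mapper to 100 lines rather than the whole dataset to 100 lines. And you don't need explicit file-writing code for the "text file" part: whatever the reducer writes to STDOUT ends up in the part-* files under the job's -output directory on HDFS, which you can read back with hadoop fs -cat.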