Please have a look at the following code:
wcmapper.php (mapper for the Hadoop streaming job)
#!/usr/bin/php
<?php
//sample mapper for hadoop streaming job
$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // lowercase and remove leading/trailing whitespace
    $line = strtolower(trim($line));
    // split the line into words, dropping any empty strings
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word's count on first sight
    // (avoids an "undefined index" notice on the first occurrence)
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word]++;
    }
}

// write the results to STDOUT (standard output), tab-delimited
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>
wcreducer.php (reducer script for the sample Hadoop job)
#!/usr/bin/php
<?php
//reducer script for sample hadoop job
$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // skip blank lines so explode() always yields a word and a count
    if ($line === '') {
        continue;
    }
    // parse the tab-delimited "word<TAB>count" pairs emitted by wcmapper.php
    list($word, $count) = explode("\t", $line);
    // convert count (currently a string) to an int
    $count = intval($count);
    // sum counts, initializing each word's total on first sight
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}

// sort the words alphabetically
ksort($word2count);

// write the results to STDOUT (standard output), tab-delimited
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>
This code is for a word-count streaming job using PHP on the Common Crawl dataset.
As written, this code reads the entire input. That is not what I need: I need to read only the first 100 lines and write them to a text file. I am a beginner with Hadoop, Common Crawl, and PHP. How can I do this?
Please help.
Use a counter in the first loop and stop processing once it reaches 100. Then add a dummy loop that simply reads until the end of the input, and continue with the rest of your code (writing the results to STDOUT). Writing the results could also happen before the dummy loop that drains the remaining STDIN input. Sample code follows:
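Here is a minimal sketch of the modified wcmapper.php, keeping everything else from your question unchanged; the line counter, the early break, the drain loop, and the isset() initialization are the only additions:

#!/usr/bin/php
<?php
// sample mapper that only counts words in the first 100 input lines
$word2count = array();
$lineCount = 0;

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    $lineCount++;
    if ($lineCount > 100) {
        break; // stop processing after the first 100 lines
    }
    // lowercase and remove leading/trailing whitespace
    $line = strtolower(trim($line));
    // split the line into words, dropping any empty strings
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word's count on first sight
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word]++;
    }
}

// dummy loop: read and discard the rest of STDIN so the streaming
// framework can finish feeding the mapper without a broken-pipe error
while (fgets(STDIN) !== false) {
    // discard remaining input
}

// write the results to STDOUT (standard output), tab-delimited
foreach ($word2count as $word => $count) {
    echo "$word\t$count\n";
}
?>

Two things to keep in mind: in a Hadoop streaming job each mapper only sees its own input split, so this limits every mapper to 100 lines rather than the whole dataset to 100 lines. And you don't need explicit file-writing code for the "text file" part: whatever the reducer writes to STDOUT ends up in the part-* files under the job's -output directory on HDFS, which you can read back with hadoop fs -cat.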