Where does 'normal' println output go in a Scala jar under Spark?


I'm running a simple jar through Spark, and everything is working fine. But as a crude way to debug, I often find println pretty helpful, unless I really have to attach a debugger.

However, output from println statements is nowhere to be found when run under Spark.

The main class in the jar begins like this:

import ...
object SimpleApp {
  def main(args: Array[String]) {
    println("Starting up!")
    ...

Why does something as simple as this not show up in the driver process's output?

If it matters, I've tested this running Spark locally as well as under Mesos.

Update

As in "Proper way to provide spark application a parameter/arg with spaces in spark-submit", I've simplified the question scenario above; I was actually submitting the command (with spark-submit) over SSH. The actual parameter value was a query from the BigDataBenchmark, namely:

"SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000"

That query wasn't properly escaped in the remote ssh command:

ssh host spark-submit ... "$query"

Became, on the host:

spark-submit ... SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000

So there you have it: the remote shell interpreted the unquoted "> 1000" as output redirection, so all my stdout was going to a file named 1000 on the host, whereas "normal" Spark output still appeared, since it goes to stderr, which I only now realise.
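For the record, one way to fix it (a sketch, assuming the query contains no single quotes) is to add a second layer of quoting so the remote shell receives the query as a single argument:

ssh host "spark-submit ... '$query'"

The local shell expands $query inside the double quotes, and the single quotes survive to the remote host, where they keep the query together as one argument and stop the > from acting as a redirect.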


There are 2 answers

ChristopherB (Best Answer)

This would appear in the stdout of the driver. As an example, see SparkPi. On YARN, I know this appears locally on stdout in client mode, or in the application master's stdout log in cluster mode. Local mode should show it on the normal stdout too (though likely mixed in with lots of logging noise).
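To make that concrete, here's a minimal sketch (hypothetical app, Spark 1.x-era SparkContext API) showing that println on the driver and println inside an RDD operation land in different places:

import org.apache.spark.{SparkConf, SparkContext}

object WherePrintlnGoes {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("WherePrintlnGoes"))

    // Runs on the driver: shows up on the driver's stdout
    // (your terminal in local/client mode).
    println("Starting up!")

    // Runs on the executors: each line lands in that executor's
    // stdout log (e.g. viewable in the Mesos/YARN web UI),
    // not on the driver console.
    sc.parallelize(1 to 4).foreach(i => println("processing " + i))

    sc.stop()
  }
}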

childofsoong

I can't say for sure about Spark, but based on what Spark is, I would assume it starts up child processes, and the standard output of those processes is not sent back to the main process for you to see. You can get around this in a number of ways, such as writing messages to a file, or sending them over a localhost network connection to another process that displays them. If you're just trying to learn the basics, that may be sufficient. However, for a larger project I'd strongly recommend researching what the Spark community has already developed for this purpose, as a more robust debugging setup will benefit you in the long run.
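For instance, a minimal sketch of that community route, using the log4j 1.2 API that Spark ships with (the app and logger names here are made up):

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object LoggingApp {
  def main(args: Array[String]) {
    val log = Logger.getLogger(getClass.getName)
    val sc = new SparkContext(new SparkConf().setAppName("LoggingApp"))

    // Driver-side: goes through Spark's log4j config, which
    // writes to the console on stderr by default.
    log.info("Starting up!")

    sc.parallelize(1 to 4).foreach { i =>
      // Executor-side: get the logger inside the closure, since
      // log4j Loggers are not serializable. Output lands in the
      // executor's stderr log.
      Logger.getLogger("LoggingApp.executor").info("processing " + i)
    }

    sc.stop()
  }
}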