Spark Streaming Connection Refused

81 views Asked by At

I am trying to run the Spark Streaming sample program slightly modified as follows :

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)                                  # 2nd parameter is duration in seconds

    lines = ssc.socketTextStream('localhost', 5101)
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a + b)
    
    counts.pprint()
    #counts.toDF().write.csv("outputDir",mode="append")
    #counts.saveAsTextFiles('output','pm')
    
    ssc.start()
    ssc.awaitTermination()

I am running this in Google Colab using the following technique.First I create ngrok tunnel so that I can expose port 1234 to the internet.

public_url = ngrok.connect(1234,"tcp")

Then I use socat to create a forwarding mechanism from port 1234 to port 5101. I run the socat command in an xterm simulation created by colab-xterm

socat TCP4-LISTEN:1234 TCP4:localhost:5101

Then I execute the spark-streaming job by submitting it in a second xterm as follows

spark-submit wcstream.py

Program runs by keeps saying it is not getting any data, which is fine because I have not pumped any data as yet.

Last step I push data from to the ngrok public address

<NgrokTunnel: "tcp://0.tcp.ngrok.io:17078" -> "localhost:1234">

by running the following command from a Colab cell

!echo 'Hello throuugh TCP tunnel 4' > /dev/tcp/0.tcp.ngrok.io/17078

at this point socat throws and error

/content# socat TCP4-LISTEN:1234 TCP4:localhost:5101
2023/05/30 05:25:49 socat[8041] E connect(5, AF=2 127.0.0.1:5101, 16): Connection refused

My question is, obviously, what am I doing wrong and how can I fix this. I am aware that the netcat approach can be used and have used the same successfully with this same program. However, I am doing all this to see if I can send data to a Spark Streaming program in Colab FROM AN EXTERNAL source. Hence all this effort. Will be grateful for any guidance.

Here is the Colab Notebook, in case anyone wants to try this out.

0

There are 0 answers