I have some data that needs to be classified in Spark Streaming. The classification key-values are loaded into a HashMap at the beginning of the program, so each incoming data packet needs to be compared against these keys and tagged accordingly.
I realize that Spark has broadcast variables and accumulators for distributing objects, but the examples in the tutorials only use simple variables.
How can I share my HashMap across all Spark workers using broadcast variables? Alternatively, is there a better way to do this?
I am writing my Spark Streaming application in Java.
In Spark you can broadcast any serializable object in the same way as a simple variable. This is the best approach because the data is shipped to each worker only once, and you can then read it inside any of your tasks.
The same `broadcast` call is available from both the Scala and the Java API.
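A minimal sketch of the pattern in Java, since that is the language you are using. The socket source, port, and the sample keys in the map are placeholders for illustration; substitute your own input DStream and classification data. Note that the broadcast value is read inside the task via `value()`, not captured as a plain local map.

```java
import java.util.HashMap;

import org.apache.spark.SparkConf;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BroadcastTagging {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("BroadcastTagging")
                .setMaster("local[2]");
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(1));

        // Load the classification key-values once at startup
        // (sample entries here; load yours however you normally do).
        HashMap<String, String> tags = new HashMap<>();
        tags.put("404", "not-found");
        tags.put("500", "server-error");

        // Broadcast the map: it is shipped to each worker only once
        // and cached there for all tasks.
        Broadcast<HashMap<String, String>> broadcastTags =
                jssc.sparkContext().broadcast(tags);

        // Hypothetical source: text records arriving on a socket.
        JavaDStream<String> lines =
                jssc.socketTextStream("localhost", 9999);

        // Tag each incoming record by looking it up in the broadcast map.
        JavaDStream<String> tagged = lines.map(line ->
                line + " -> " + broadcastTags.value()
                        .getOrDefault(line, "unknown"));

        tagged.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```

If the classification map changes over time, a plain broadcast variable is not enough, since broadcasts are read-only after creation; the usual workaround is to rebuild and re-broadcast the map periodically from the driver.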