My team is deploying a new cluster on Amazon EC2 instances. After a bit of research, we decided to go with Apache Mesos as the cluster manager and Spark for computation.
The first question we asked ourselves is what the best way would be to collect logs from all the machines, for each different framework. So far, we have developed some custom bash/python scripts that collect logs from predefined locations, zip them, and send the compressed file to S3. This rotation is triggered by a cron job that runs every hour.
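For reference, a minimal sketch of that kind of rotation script in Python. The compression step is runnable as-is; the S3 upload is shown commented out because it needs boto3 and credentials, and the bucket name `my-log-bucket` is made up:

```python
import gzip
import shutil
import time
from pathlib import Path

def rotate_logs(log_dir, archive_dir):
    """Compress every *.log file in log_dir into archive_dir,
    stamping each archive with the rotation time."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archives = []
    for log_file in Path(log_dir).glob("*.log"):
        target = archive_dir / f"{log_file.stem}-{stamp}.log.gz"
        with open(log_file, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        archives.append(target)
    return archives

# Upload step (hypothetical bucket name; requires boto3 and AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# for archive in rotate_logs("/var/log/myapp", "/tmp/archives"):
#     s3.upload_file(str(archive), "my-log-bucket", archive.name)
```

Run from cron (e.g. `0 * * * *`) this reproduces the hourly rotation described above.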
I have been searching for the "best" (or standard) way to do this. I found Apache Flume, which is a data collector that also handles logs, but I don't understand how it could be integrated into a Mesos cluster (and with Spark) to collect logs.
I found this similar question, but the solutions there are either not open source or no longer supported.
Is there a better way to rotate logs or a standard way I'm missing?
Thank you very much
There is no perfect answer to this. If you are using Spark and are interested in using Flume, you will have to write a custom Flume -> Spark interface, as one doesn't exist as far as I know. However, what you can do is have Flume land the log events somewhere Spark can read them, for example its HDFS sink writing into a directory that Spark Streaming then monitors with textFileStream.
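To make the Flume side concrete, here is an illustrative agent config that tails a log file with an exec source and lands it in HDFS; the agent name, log path, and namenode host are all made up for the example:

```properties
# Hypothetical Flume agent: tail a log file into HDFS.
agent.sources = taillog
agent.channels = mem
agent.sinks = tohdfs

agent.sources.taillog.type = exec
agent.sources.taillog.command = tail -F /var/log/mesos/mesos-slave.log
agent.sources.taillog.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.tohdfs.type = hdfs
agent.sinks.tohdfs.channel = mem
agent.sinks.tohdfs.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
agent.sinks.tohdfs.hdfs.fileType = DataStream
agent.sinks.tohdfs.hdfs.useLocalTimeStamp = true
```

On the Spark side, a StreamingContext's `textFileStream` pointed at the same HDFS directory would then pick up new files as Flume rolls them.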
Spark Streaming is supposedly not yet production grade, but this is one potential solution.