I am relatively new to Apache Flink and I am trying to create a simple project that writes a file to an AWS S3 bucket. Based on the documentation, it looks like I need Hadoop installed in order to do this.
How do I set up my local environment so that I can test this capability? I have installed Apache Flink and Hadoop locally, made the necessary changes to Hadoop's core-site.xml, and added my HADOOP_CONF path to my flink-conf.yaml. When I try to submit my job locally through the Flink UI, I always get this error:
2016-12-29 16:03:49,861 INFO org.apache.flink.util.NetUtils - Unable to allocate on port 6123, due to error: Address already in use
2016-12-29 16:03:49,862 ERROR org.apache.flink.runtime.jobmanager.JobManager - Failed to run JobManager.
java.lang.RuntimeException: Unable to do further retries starting the actor system
at org.apache.flink.runtime.jobmanager.JobManager$.retryOnBindException(JobManager.scala:2203)
at org.apache.flink.runtime.jobmanager.JobManager$.runJobManager(JobManager.scala:2143)
at org.apache.flink.runtime.jobmanager.JobManager$.main(JobManager.scala:2040)
at org.apache.flink.runtime.jobmanager.JobManager.main(JobManager.scala)
I am assuming that I am missing something with how my environment is set up. Is it possible to do this locally? Any help would be appreciated.
While you need the Hadoop libraries, you do not have to have Hadoop installed to run locally and write to S3. I happened to try this out by writing Parquet output, based on an Avro schema and a generated SpecificRecord, to S3. I am running a version of the following code locally through SBT and IntelliJ IDEA. The needed parts:
1) Have the following file specifying the needed Hadoop properties. (Note: hard-coding the AWS access key/secret key is not recommended; it is better to run on an EC2 instance that has an IAM role with read/write access to your S3 bucket. For local testing, though, the keys are needed.)
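A core-site.xml along these lines should do it; the key/secret values and the buffer directory are placeholders for your own settings:

```xml
<configuration>
  <!-- Map the s3:// scheme to the S3A filesystem from hadoop-aws. -->
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <!-- Local buffer directory used by S3A when uploading. -->
  <property>
    <name>fs.s3a.buffer.dir</name>
    <value>/tmp</value>
  </property>
  <!-- For local testing only - see the note above about IAM roles. -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```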
2) Imports:
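EventOnlyRecord is the Avro-generated SpecificRecord from my project; the remaining imports are the standard Flink/Hadoop/Parquet ones needed for the snippet in step 3 (double-check the package names against the versions you pick in step 4):

```scala
import com.uebercomputing.eventrecord.EventOnlyRecord

import org.apache.flink.api.scala._
import org.apache.flink.api.scala.hadoop.mapreduce.HadoopOutputFormat

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

import org.apache.parquet.avro.AvroParquetOutputFormat
```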
3) The Flink code uses HadoopOutputFormat with the above configuration:
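A sketch of roughly how the job is wired together; getEventRecords, the config path, and the S3 output path stand in for your own code and paths:

```scala
def main(args: Array[String]): Unit = {
  val env = ExecutionEnvironment.getExecutionEnvironment

  // Hadoop configuration that picks up the core-site.xml from step 1.
  val hadoopConfig = new Configuration()
  hadoopConfig.addResource(new Path("file:///path/to/hadoop-conf-dir/core-site.xml"))

  // DataSet of (key, value) pairs as the Hadoop output format expects them.
  // AvroParquetOutputFormat is a FileOutputFormat[Void, T], so the key is Void.
  val events: DataSet[(Void, EventOnlyRecord)] = getEventRecords(env) // placeholder

  val outputJob = Job.getInstance(hadoopConfig)
  AvroParquetOutputFormat.setSchema(outputJob, EventOnlyRecord.getClassSchema())
  FileOutputFormat.setOutputPath(outputJob, new Path("s3://your-bucket/some/output/path"))

  // Wrap the Parquet output format in Flink's HadoopOutputFormat so the
  // DataSet can be written through the Hadoop mapreduce API.
  val parquetFormat = new HadoopOutputFormat[Void, EventOnlyRecord](
    new AvroParquetOutputFormat[EventOnlyRecord](), outputJob)

  events.output(parquetFormat)
  env.execute("write-avro-parquet-to-s3")
}
```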
4) Build dependencies and versions used:
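Something along these lines in build.sbt; the versions shown are examples from that era (late 2016), so treat them as a starting point rather than exact requirements:

```scala
scalaVersion := "2.11.8"

val flinkVersion  = "1.1.4"
val hadoopVersion = "2.7.3"

libraryDependencies ++= Seq(
  "org.apache.flink"   %% "flink-scala"                % flinkVersion,
  "org.apache.flink"   %% "flink-clients"              % flinkVersion,
  "org.apache.flink"   %% "flink-hadoop-compatibility" % flinkVersion,
  "org.apache.hadoop"  %  "hadoop-common"              % hadoopVersion,
  "org.apache.hadoop"  %  "hadoop-aws"                 % hadoopVersion, // brings in S3AFileSystem
  "org.apache.parquet" %  "parquet-avro"               % "1.9.0",
  "org.apache.avro"    %  "avro"                       % "1.8.1"
)
```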
Edit for using writeAsText to S3:
1) Create a Hadoop configuration directory (I will refer to it as hadoop-conf-dir) with a core-site.xml file in it.
For example:
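It can be the same file as in step 1 of the Parquet example above; a minimal version (keys are placeholders):

```xml
<configuration>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```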
2) Create a directory (I will refer to it as flink-conf-dir) with a flink-conf.yaml file in it.
For example:
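The only entry it needs is the pointer to the Hadoop configuration directory from step 1 (the path is a placeholder):

```yaml
# Tell Flink where to find the Hadoop configuration (core-site.xml) from step 1.
fs.hdfs.hadoopconf: /path/to/hadoop-conf-dir
```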
3) Edit the IntelliJ run configuration you use to run your S3 Flink job (Run > Edit Configurations...) and add the following environment variable:
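The variable is FLINK_CONF_DIR, pointing at the directory from step 2 (path is a placeholder):

```
FLINK_CONF_DIR=/path/to/flink-conf-dir
```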
4) Run the code with that environment variable set:
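With that in place, a plain writeAsText to an s3:// path works. A minimal sketch, assuming the bucket and path are replaced with your own:

```scala
import org.apache.flink.api.scala._

object WriteTextToS3 {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Any DataSet will do - this one is just a few strings.
    val words: DataSet[String] = env.fromElements("flink", "writes", "to", "s3")

    // The s3:// scheme resolves to S3A via the core-site.xml from step 1.
    words.writeAsText("s3://your-bucket/some/output/path")

    env.execute("write-text-to-s3")
  }
}
```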