Airflow doesn't recognise my S3 Connection setting

I am using Airflow with the Kubernetes executor and testing it locally (using minikube). While I was able to get it up and running, I can't seem to store my logs in S3. I have tried all the suggested solutions I could find and I am still getting the following error:

*** Log file does not exist: /usr/local/airflow/logs/example_python_operator/print_the_context/2020-03-30T16:02:41.521194+00:00/1.log
*** Fetching from: http://examplepythonoperatorprintthecontext-5b01d602e9d2482193d933e7d2:8793/log/example_python_operator/print_the_context/2020-03-30T16:02:41.521194+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='examplepythonoperatorprintthecontext-5b01d602e9d2482193d933e7d2', port=8793): Max retries exceeded with url: /log/example_python_operator/print_the_context/2020-03-30T16:02:41.521194+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fd00688a650>: Failed to establish a new connection: [Errno -2] Name or service not known'))

I implemented a custom Logging class as mentioned in this answer and still no luck.

My airflow.yaml looks like this:

airflow:
  image:
    repository: airflow-docker-local
    tag: 1

  executor: Kubernetes

  service:
    type: LoadBalancer

  config:
    AIRFLOW__CORE__EXECUTOR: KubernetesExecutor
    AIRFLOW__CORE__TASK_LOG_READER: s3.task
    AIRFLOW__CORE__LOAD_EXAMPLES: True
    AIRFLOW__CORE__FERNET_KEY: ${MASKED_FERNET_KEY}
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://postgres:airflow@airflow-postgresql:5432/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://postgres:airflow@airflow-postgresql:5432/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:airflow@airflow-redis-master:6379/0

    # S3 Logging
    AIRFLOW__CORE__REMOTE_LOGGING: True
    AIRFLOW__CORE__REMOTE_LOG_CONN_ID: s3://${AWS_ACCESS_KEY_ID}:${AWS_ACCESS_SECRET_KEY}@S3
    AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: s3://${BUCKET_NAME}/logs
    AIRFLOW__CORE__S3_LOG_FOLDER: s3://${BUCKET_NAME}/logs
    AIRFLOW__CORE__LOGGING_LEVEL: INFO
    AIRFLOW__CORE__LOGGING_CONFIG_CLASS: log_config.LOGGING_CONFIG
    AIRFLOW__CORE__ENCRYPT_S3_LOGS: False
    # End of S3 Logging

    AIRFLOW__WEBSERVER__EXPOSE_CONFIG: True
    AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC: 30
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: airflow-docker-local
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: 1
    AIRFLOW__KUBERNETES__WORKER_CONTAINER_IMAGE_PULL_POLICY: Never
    AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME: airflow
    AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM: airflow
    AIRFLOW__KUBERNETES__NAMESPACE: airflow
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: True
    AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: '{\"_request_timeout\":[60,60]}'

persistence:
  enabled: true
  existingClaim: ''
  accessMode: 'ReadWriteMany'
  size: 5Gi

logsPersistence:
  enabled: false

workers:
  enabled: true

postgresql:
  enabled: true

redis:
  enabled: true

I have tried setting up the connection via the UI and creating it via airflow.yaml, and nothing seems to work. I have been trying this for 3 days now with no luck; any help would be much appreciated.

I have attached screenshots for reference. [Two screenshots; images not available]

There is 1 answer

Jacob Ward (best answer)

I am pretty certain this issue occurs because the S3 logging configuration has not been set on the worker pods. The worker pods are not given configuration that is set using environment variables such as AIRFLOW__CORE__REMOTE_LOGGING: True. If you wish to set this variable in the worker pods, you must copy the variable and prepend AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__ to the copied environment variable name: AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__REMOTE_LOGGING: True.

In this case you would need to duplicate all of the variables that configure S3 logging and prepend AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__ to the names of the copies.
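
As a rough sketch, assuming the same airflow.yaml layout as in the question, the duplicated entries could look something like the following. The values are copied unchanged from the block marked "# S3 Logging" in the question, and the original unprefixed entries stay in place alongside these copies. I have not tested this against the asker's chart, so treat it as an illustration of the naming pattern rather than a verified config.

```yaml
airflow:
  config:
    # Copies of the S3 logging settings from the question. The added prefix
    # places each one in the [kubernetes_environment_variables] section, so
    # the executor sets the unprefixed variable inside every worker pod.
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__REMOTE_LOGGING: True
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__REMOTE_LOG_CONN_ID: s3://${AWS_ACCESS_KEY_ID}:${AWS_ACCESS_SECRET_KEY}@S3
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: s3://${BUCKET_NAME}/logs
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__S3_LOG_FOLDER: s3://${BUCKET_NAME}/logs
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__LOGGING_LEVEL: INFO
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__LOGGING_CONFIG_CLASS: log_config.LOGGING_CONFIG
    AIRFLOW__KUBERNETES_ENVIRONMENT_VARIABLES__AIRFLOW__CORE__ENCRYPT_S3_LOGS: False
```

With entries like these, a worker pod should start with AIRFLOW__CORE__REMOTE_LOGGING and the related variables set in its environment, the same way the scheduler and webserver already have them.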