Since upgrading to the latest Composer version (composer-2.3.5-airflow-2.5.3), all our long-running GKE pods seem to fail.
Example error:
[2023-08-01, 14:20:23 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 2nd time calling it.
[2023-08-01, 14:20:27 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 3rd time calling it.
[2023-08-01, 14:20:31 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 4th time calling it.
[2023-08-01, 14:20:35 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 5th time calling it.
[2023-08-01, 14:20:39 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 6th time calling it.
[2023-08-01, 14:20:43 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 7th time calling it.
[2023-08-01, 14:20:47 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 8th time calling it.
[2023-08-01, 14:20:51 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 9th time calling it.
[2023-08-01, 14:20:55 CEST] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 10th time calling it.
[2023-08-01, 14:20:58 CEST] {pod.py:934} ERROR - (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '<REDACTED>', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 01 Aug 2023 12:20:58 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
It seems there was an issue with the credentials in the google provider, which is being fixed: https://github.com/apache/airflow/issues/31648
Is there a way to get this fix on Composer?
Any other recommendations to fix this? I'm currently trying to disable logging, but I'm afraid this will just postpone the error.
To solve this issue, you can use the release candidate for version 10.5.0 of the apache-airflow-providers-google Python package. It can be found here. According to the Cloud Composer docs, it is allowed to override the shipped packages.
The override can be accomplished either by manually adding a PyPI package in the Cloud Composer environment's settings, or by adding the package to the Terraform resource. The update takes about 15-30 minutes.
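As a rough sketch of the manual route from the command line, the snippet below assembles a `gcloud composer environments update` call using the `--update-pypi-package` flag. The environment name, location and the exact release-candidate version string (`10.5.0rc1`) are placeholders I made up for illustration, so check PyPI for the version that is actually published. The command is only printed, not executed.

```python
import shlex


def build_override_command(environment: str, location: str, requirement: str) -> list:
    """Build the gcloud argument list that pins one PyPI package in a Composer environment."""
    return [
        "gcloud", "composer", "environments", "update", environment,
        "--location", location,
        "--update-pypi-package", requirement,
    ]


# Hypothetical values: replace with your own environment, region and the
# release-candidate version actually published on PyPI.
command = build_override_command(
    environment="my-composer-env",
    location="europe-west1",
    requirement="apache-airflow-providers-google==10.5.0rc1",
)

# Print the command for review instead of executing it.
print(shlex.join(command))
```

Printing first makes it easy to review the pinned version before triggering an environment update that takes a while to complete.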
I tested this and can confirm it works. Tasks can again run longer than 1h.
At the time of writing, the release candidate is still under test. Other people are also experiencing similar issues (GitHub issue, other SO post). The latest version of Cloud Composer, composer-2.3.5-airflow-2.5.3, ships with apache-airflow-providers-google==10.3.0.
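To check which provider version an environment is actually running after the override, something like the following could be run from a Python shell or a small test task. It is a minimal sketch: the helper names are my own, and the 10.5.0 threshold is simply the version this answer names as carrying the fix. Pre-release suffixes such as `rc1` are ignored in the comparison.

```python
import re
from importlib import metadata

FIXED_RELEASE = (10, 5, 0)  # version this answer names as carrying the fix


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release part, e.g. '10.5.0rc1' -> (10, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '10.5' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def has_fix(version: str) -> bool:
    """True if the release part is at least 10.5.0 (pre-release suffixes are ignored)."""
    return release_tuple(version) >= FIXED_RELEASE


def installed_provider_version(package: str = "apache-airflow-providers-google"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_provider_version()
    if version is None:
        print("apache-airflow-providers-google is not installed")
    else:
        print(f"{version}: fix expected = {has_fix(version)}")
```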