So I have been tasked with making an ETL pipeline. My code works with Docker Compose, and so far I have been able to create the tables and load them with all of the data. Now I have to make a CronWorkflow that schedules this task. Two volumes get mounted right now: one for a secret configuration file that holds the secrets the code needs to run, and another for a payload file that is used for mapping attributes. When I specify the two volumes without emptyDir: {}, my containers immediately error out, but the describe output on the CronWorkflow shows that it did indeed run and was started again. However, there is a read error on the CSV file that gets staged for data insertion.
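For context, under Docker Compose those two files are just bind-mounted into the container. The setup looks roughly like this (the service name and host paths are simplified for illustration, not my exact compose file):

services:
  github_metrics:
    image: msr.ci.mirantis.com/dataeng/dataeng_github_metrics:latest
    command: ['--log-level', 'debug']
    environment:
      - DATAENG_CONFIG_PATH=/.dataeng/config.yaml
      - PAYLOAD_TEAMS_CONFIG_PATH=/payloads/teams.yaml
    volumes:
      # secret configuration file with credentials
      - ./config.yaml:/.dataeng/config.yaml
      # payload file used for mapping attributes
      - ./teams.yaml:/payloads/teams.yaml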
Here is my CronWorkflow, with emptyDir specified on the volumes:
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  generateName: dataeng-github-metrics-
  namespace: dataops
spec:
  schedule: "*/1 * * * *" # run every 1 minute
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  workflowSpec:
    #volumes:
    #- name: my-secret-vol
    #  secret:
    #    secretName: my-secret
    # - CONNECTION_STRING=${SNOWFLAKE_USER}:${SNOWFLAKE_PASSWORD}@${SNOWFLAKE_ACCOUNT}/MIRANTIS/DATAENG
    # - DATABASE_BACKEND=snowflake
    # - DATAENG_CONFIG_PATH=/.dataeng/config.yaml
    # - PAYLOAD_TEAMS_CONFIG_PATH=/payloads/teams.yaml
    # - TEAMS_SPEC=all_users
    # - SCHEMA=MIRANTIS
    # - DATABASE=DATAENG
    # - TABLE=GITHUB_CONTRIBUTIONS_STAGE
    entrypoint: run-sync
    templates:
      - name: run-sync
        container:
          imagePullPolicy: Always
          image: msr.ci.mirantis.com/dataeng/dataeng_github_metrics:latest
          imagePullSecrets:
            - name: msrregcred
              namespace: dataops
          args: ['--log-level', 'debug']
          env:
            - name: CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: connection-string
                  key: CONNECTION_STRING
            - name: DATAENG_CONFIG_PATH
              value: /.dataeng/config.yaml
            - name: DATABASE_BACKEND
              value: snowflake
            - name: PAYLOAD_TEAMS_CONFIG_PATH
              value: /payloads/teams.yaml
            - name: TEAMS_SPEC
              value: all_users
            - name: SCHEMA
              value: MIRANTIS
            - name: DATABASE
              value: DATAENG
            - name: TABLE
              value: GITHUB_CONTRIBUTIONS_STAGE
          volumeMounts:
            - mountPath: /.dataeng
              name: config
            - mountPath: /payloads
              name: teamspayload
        volumes:
          - name: config
            emptyDir: {}
            secret:
              secretName: config
              optional: false
          - name: teamspayload
            emptyDir: {}
            configMap:
              name: teamspayload
With emptyDir: {} specified like this, no pods get spun up at all, and I don't see any events in the describe output of the Workflow in question; it's just empty. When I don't specify emptyDir, the two containers main and wait do get spun up:
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  generateName: dataeng-github-metrics-
  namespace: dataops
spec:
  schedule: "*/1 * * * *" # run every 1 minute
  concurrencyPolicy: "Replace"
  startingDeadlineSeconds: 0
  workflowSpec:
    #volumes:
    #- name: my-secret-vol
    #  secret:
    #    secretName: my-secret
    # - CONNECTION_STRING=${SNOWFLAKE_USER}:${SNOWFLAKE_PASSWORD}@${SNOWFLAKE_ACCOUNT}/MIRANTIS/DATAENG
    # - DATABASE_BACKEND=snowflake
    # - DATAENG_CONFIG_PATH=/.dataeng/config.yaml
    # - PAYLOAD_TEAMS_CONFIG_PATH=/payloads/teams.yaml
    # - TEAMS_SPEC=all_users
    # - SCHEMA=MIRANTIS
    # - DATABASE=DATAENG
    # - TABLE=GITHUB_CONTRIBUTIONS_STAGE
    entrypoint: run-sync
    templates:
      - name: run-sync
        container:
          imagePullPolicy: Always
          image: msr.ci.mirantis.com/dataeng/dataeng_github_metrics:latest
          imagePullSecrets:
            - name: msrregcred
              namespace: dataops
          args: ['--log-level', 'debug']
          env:
            - name: CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: connection-string
                  key: CONNECTION_STRING
            - name: DATAENG_CONFIG_PATH
              value: /.dataeng/config.yaml
            - name: DATABASE_BACKEND
              value: snowflake
            - name: PAYLOAD_TEAMS_CONFIG_PATH
              value: /payloads/teams.yaml
            - name: TEAMS_SPEC
              value: all_users
            - name: SCHEMA
              value: MIRANTIS
            - name: DATABASE
              value: DATAENG
            - name: TABLE
              value: GITHUB_CONTRIBUTIONS_STAGE
          volumeMounts:
            - mountPath: /.dataeng
              name: config
            - mountPath: /payloads
              name: teamspayload
        volumes:
          - name: config
            secret:
              secretName: config
              optional: false
          - name: teamspayload
            configMap:
              name: teamspayload
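For completeness, the config Secret and the teamspayload ConfigMap referenced above already exist in the dataops namespace; I created them along these lines (the --from-file arguments here are approximations, not the exact commands I ran):

kubectl create secret generic config -n dataops --from-file=config.yaml
kubectl create configmap teamspayload -n dataops --from-file=teams.yaml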
This second manifest produces a single pod with the two containers main and wait:
$ kubectl get pods -n dataops
NAME                                      READY   STATUS   RESTARTS   AGE
dataeng-github-metrics-2q9b2-1649718900   0/2     Error    0          2m46s
On the wait container I see this from the logs:
$ kubectl logs -n dataops dataeng-github-metrics-l2nvh-1649728560 -c wait
time="2022-04-12T01:56:20.292Z" level=info msg="listed containers" containers="map[main:{d19b209fb5de511d38bda02aa9bcf8f58fe34b60e25128225cd976944717dbb9 Exited {0 63785325361 <nil>}} wait:{c6a5abdcffb229235f68d1db3acc249be81781e8bb326d1ef3b867835e164704 Up {0 63785325360 <nil>}}]"
time="2022-04-12T01:56:20.323Z" level=info msg="listed containers" containers="map[main:{d19b209fb5de511d38bda02aa9bcf8f58fe34b60e25128225cd976944717dbb9 Exited {0 63785325361 <nil>}} wait:{c6a5abdcffb229235f68d1db3acc249be81781e8bb326d1ef3b867835e164704 Up {0 63785325360 <nil>}}]"
time="2022-04-12T01:56:20.323Z" level=info msg="Killing sidecars []"
time="2022-04-12T01:56:20.323Z" level=info msg="Alloc=5137 TotalAlloc=10211 Sys=73809 NumGC=3 Goroutines=7"
On the main container I see this from the logs:
$ kubectl logs -n dataops dataeng-github-metrics-l2nvh-1649728560 -c main
2022/04/12 01:56:20 failed to insert data: open /payloads/dataeng_github_metrics.csv: read-only file system
Why aren't the pods coming up when I specify emptyDir: {} for the mounted payload and secret volumes?
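One direction I have considered is keeping each volume to a single source (just the secret or just the configMap) and adding a separate, dedicated emptyDir volume so the process has somewhere writable for the staged CSV. A rough sketch of what I mean is below (the /staging path is made up, and the code would presumably also need to be pointed at it), but that still doesn't explain why combining emptyDir with the existing volumes stops any pods from being created at all:

          volumeMounts:
            - mountPath: /.dataeng
              name: config
            - mountPath: /payloads
              name: teamspayload
            # hypothetical writable scratch directory for the staged CSV
            - mountPath: /staging
              name: scratch
        volumes:
          - name: config
            secret:
              secretName: config
          - name: teamspayload
            configMap:
              name: teamspayload
          # one volume, one source; emptyDir gets its own volume
          - name: scratch
            emptyDir: {}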