My Mesos version was recently upgraded from 0.28 to 1.2.1.
Jobs are being scheduled using Chronos. My Docker job is invoked properly, but I still get a TASK_FAILED status even though the job completes with exit status zero.
This is always reproducible. I have tried multiple Mesos versions with no luck, and I am wondering if I am missing anything.
OS details :-
Kernel - 3.8.13-98.7.1.el7uek
OS - OL 7.3
Snapshot - 7-2017.6.4
Since this is not the latest kernel and I am not able to install the latest Docker version, I have added the following environment variable to my Chronos job.
{
"name":"DOCKER_API_VERSION",
"value":"1.22"
}
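For reference, this is roughly where that variable sits in the job definition. A minimal sketch of the Chronos job JSON: only the image name, the job name, and the environment variable are taken from my setup; the schedule, resources, network mode, and command are placeholders.
{
  "name": "Job_Task_Test",
  "schedule": "R/2017-09-06T14:05:00Z/PT5M",
  "container": {
    "type": "DOCKER",
    "image": "docker.xyz.com/test/job-action:latest",
    "network": "BRIDGE"
  },
  "cpus": 0.5,
  "mem": 512,
  "command": "",
  "environmentVariables": [
    {
      "name": "DOCKER_API_VERSION",
      "value": "1.22"
    }
  ]
}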
Launching 'mesos-docker-executor' with flags
'--container="mesos-81cb9c2a-d18b-4127-872b-2a5676dfb314-S0.97dc2c67-5d69-4a8c-b4e1-ba15807697cf"
--docker="docker"
--docker_socket="/var/run/docker.sock"
--help="false"
--initialize_driver_logging="true"
--launcher_dir="/usr/libexec/mesos" --logbufsecs="0"
--logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox"
--quiet="false"
--sandbox_directory="/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/docker/links/97dc2c67-5d69-4a8c-b4e1-ba15807697cf"
--stop_timeout="0ns"'
Mesos slave logs :-
I0906 14:05:00.958442 9 slave.cpp:1625] Got assigned task 'ct:1504706700007:0:Job_Task_Test:' for framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.958544 9 slave.cpp:6386] Checkpointing FrameworkInfo to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/framework.info'
I0906 14:05:00.958868 9 slave.cpp:6397] Checkpointing framework pid '[email protected]:43144' to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/framework.pid'
I0906 14:05:00.959430 9 slave.cpp:1785] Launching task 'ct:1504706700007:0:Job_Task_Test:' for framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.966035 9 paths.cpp:547] Trying to chown '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf' to user 'root'
I0906 14:05:00.966223 9 slave.cpp:6903] Checkpointing ExecutorInfo to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/executor.info'
I0906 14:05:00.966648 9 slave.cpp:6479] Launching executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 with resources cpus(*)(allocated: *):0.1; mem(*)(allocated: *):32 in work directory '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf'
I0906 14:05:00.967015 9 slave.cpp:6926] Checkpointing TaskInfo to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf/tasks/ct:1504706700007:0:Job_Task_Test:/task.info'
I0906 14:05:00.967300 9 slave.cpp:2118] Queued task 'ct:1504706700007:0:Job_Task_Test:' for executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.967337 9 slave.cpp:884] Successfully attached file '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf'
I0906 14:05:00.967501 12 docker.cpp:1165] Starting container '97dc2c67-5d69-4a8c-b4e1-ba15807697cf' for task 'ct:1504706700007:0:Job_Task_Test:' (and executor 'ct:1504706700007:0:Job_Task_Test:') of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:00.968152 9 fetcher.cpp:353] Starting to fetch URIs for container: 97dc2c67-5d69-4a8c-b4e1-ba15807697cf, directory: /mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/docker/links/97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:00.969274 16 docker.cpp:1348] Running docker -H unix:///var/run/docker.sock pull docker.xyz.com/test/job-action:latest
I0906 14:05:01.165390 9 docker.cpp:1274] Running docker -H unix:///var/run/docker.sock inspect docker.xyz.com/test/job-action:latest
I0906 14:05:01.267083 13 docker.cpp:465] Docker pull docker.xyz.com/test/job-action:latest completed
I0906 14:05:01.267663 13 docker.cpp:1514] Launching 'mesos-docker-executor' with flags '--container="mesos-81cb9c2a-d18b-4127-872b-2a5676dfb314-S0.97dc2c67-5d69-4a8c-b4e1-ba15807697cf" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --launcher_dir="/usr/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/docker/links/97dc2c67-5d69-4a8c-b4e1-ba15807697cf" --stop_timeout="0ns"'
I0906 14:05:01.269182 13 docker.cpp:803] Checkpointing pid 147 to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf/pids/forked.pid'
I0906 14:05:01.308346 15 slave.cpp:3385] Got registration for executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 from executor(1)@20.426.45.305:48379
I0906 14:05:01.308658 15 slave.cpp:3471] Checkpointing executor pid 'executor(1)@20.426.45.305:48379' to '/mesos-data/slave-1/meta/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf/pids/libprocess.pid'
I0906 14:05:01.309453 15 docker.cpp:1608] Ignoring updating container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf because resources passed to update are identical to existing resources
I0906 14:05:01.309587 15 slave.cpp:2331] Sending queued task 'ct:1504706700007:0:Job_Task_Test:' to executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 at executor(1)@20.426.45.305:48379
I0906 14:05:02.914752 11 slave.cpp:3816] Handling status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 from executor(1)@20.426.45.305:48379
I0906 14:05:02.915037 11 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915063 11 status_update_manager.cpp:500] Creating StatusUpdate stream for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915396 11 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915513 11 status_update_manager.cpp:377] Forwarding update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to the agent
I0906 14:05:02.915587 11 slave.cpp:4256] Forwarding the update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to [email protected]:5050
I0906 14:05:02.915688 11 slave.cpp:4150] Status update manager successfully handled status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:02.915700 11 slave.cpp:4166] Sending acknowledgement for status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to executor(1)@20.426.45.305:48379
I0906 14:05:03.628334 10 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:03.628398 10 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_RUNNING (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:03.628463 10 slave.cpp:3105] Status update manager successfully handled status update acknowledgement (UUID: 72a301e2-f3d3-4e24-8e43-fc5ee44b3730) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:04.256636 14 slave.cpp:5731] Querying resource estimator for oversubscribable resources
I0906 14:05:04.256691 14 slave.cpp:5745] Received oversubscribable resources {} from the resource estimator
I0906 14:05:04.944047 10 slave.cpp:4346] Received ping from slave-observer(1)@20.426.45.305:5050
I0906 14:05:11.049429 12 slave.cpp:3816] Handling status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 from executor(1)@20.426.45.305:48379
I0906 14:05:11.049636 12 docker.cpp:987] Running docker -H unix:///var/run/docker.sock inspect mesos-81cb9c2a-d18b-4127-872b-2a5676dfb314-S0.97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:11.136464 13 status_update_manager.cpp:323] Received status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.136502 13 status_update_manager.cpp:832] Checkpointing UPDATE for status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.136572 13 status_update_manager.cpp:377] Forwarding update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to the agent
I0906 14:05:11.136642 12 slave.cpp:4256] Forwarding the update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to [email protected]:5050
I0906 14:05:11.136739 12 slave.cpp:4150] Status update manager successfully handled status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.136752 12 slave.cpp:4166] Sending acknowledgement for status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 to executor(1)@20.426.45.305:48379
I0906 14:05:11.152230 13 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152271 13 status_update_manager.cpp:832] Checkpointing ACK for status update TASK_FAILED (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152307 13 status_update_manager.cpp:531] Cleaning up status update stream for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152454 13 slave.cpp:3105] Status update manager successfully handled status update acknowledgement (UUID: 83eceb05-6a94-4660-bdb5-3cbc2b166b1b) for task ct:1504706700007:0:Job_Task_Test: of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:11.152475 13 slave.cpp:6882] Completing task ct:1504706700007:0:Job_Task_Test:
I0906 14:05:12.059929 13 slave.cpp:4388] Got exited event for executor(1)@20.426.45.305:48379
I0906 14:05:12.140215 14 docker.cpp:2397] Executor for container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf has exited
I0906 14:05:12.140244 14 docker.cpp:2091] Destroying container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:12.140283 14 docker.cpp:2218] Running docker stop on container 97dc2c67-5d69-4a8c-b4e1-ba15807697cf
I0906 14:05:12.140717 11 slave.cpp:4768] Executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 exited with status 0
I0906 14:05:12.140748 11 slave.cpp:4868] Cleaning up executor 'ct:1504706700007:0:Job_Task_Test:' of framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000 at executor(1)@20.426.45.305:48379
I0906 14:05:12.141062 11 slave.cpp:4956] Cleaning up framework 5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000
I0906 14:05:12.141129 11 gc.cpp:55] Scheduling '/mesos-data/slave-1/slaves/81cb9c2a-d18b-4127-872b-2a5676dfb314-S0/frameworks/5175f6c9-0617-4145-ab46-3b7e64dc67ea-0000/executors/ct:1504706700007:0:Job_Task_Test:/runs/97dc2c67-5d69-4a8c-b4e1-ba15807697cf' for gc 6.99999836858667days in the future
Executor stderr :-
I0906 13:25:02.165338 3399 exec.cpp:162] Version: 1.2.1
I0906 13:25:02.169983 3405 exec.cpp:237] Executor registered on agent f20ab78e-acd3-407a-b1b6-47d67a947eff-S1
Executor stderr with Mesos version 1.3.1 :-
I0906 16:15:02.975457 580 exec.cpp:162] Version: 1.3.1
I0906 16:15:02.979501 586 exec.cpp:237] Executor registered on agent 920fa575-b534-48a8-8305-61f6097f2f49-S1
E0906 16:15:13.523074 589 process.cpp:956] Failed to accept socket: future discarded
I am scheduling a Docker job through Chronos. The job itself runs fine, and I can see the following in the Mesos stdout logs. Partial logs :-
Executor stdout :-
Registered docker executor on <hostname>
Starting task ct:1504704300002:0:Job_Task_Test:
.....
.....
.....
: success: true
: Stopping HTTP executor
: Started Application in 6.734 seconds (JVM running for 7.642)
s.c.a.AnnotationConfigApplicationContext : Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@bebdb06: startup date [Wed Sep 06 17:25:07 UTC 2017]; root of context hierarchy
2017-09-06 17:25:14.050 INFO 1 — [ Thread-1] o.s.c.support.DefaultLifecycleProcessor : Stopping beans in phase 0
2017-09-06 17:25:14.051 INFO 1 — [ Thread-1] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
2017-09-06 17:25:14.051 DEBUG 1 — [ Thread-1] h.i.c.PoolingHttpClientConnectionManager : Connection manager is shutting down
2017-09-06 17:25:14.051 DEBUG 1 — [ Thread-1] h.i.c.DefaultManagedHttpClientConnection : http-outgoing-0: Close connection
2017-09-06 17:25:14.052 DEBUG 1 — [ Thread-1] h.i.c.PoolingHttpClientConnectionManager : Connection manager shut down
The Docker job is a Java process that invokes a remote service and finally exits by calling System.exit(0). We print the following in the Docker job:
Expected status: 200, Actual status: 200, success: true
We see no issues with the task execution.
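For clarity, the job's exit path is essentially the following. This is a deliberately simplified, hypothetical sketch (the real job is a Spring Boot application, as the stdout above shows; the class name and endpoint here are placeholders):

// Hypothetical, simplified version of the job's exit path.
// The real job is a Spring Boot app; class name and URL are placeholders.
import java.net.HttpURLConnection;
import java.net.URL;

public class JobTask {
    public static void main(String[] args) throws Exception {
        // Call the remote service (placeholder endpoint).
        URL url = new URL("https://remote-service.example.com/check");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int actual = conn.getResponseCode();
        conn.disconnect();

        boolean success = (actual == 200);
        System.out.println("Expected status: 200, Actual status: " + actual + ", success: " + success);

        // The JVM exits with status 0 on success, which is what the container reports.
        System.exit(success ? 0 : 1);
    }
}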
The container status shows "Exited (0) 2 minutes ago" when I run the following command:
docker ps -a
CONTAINER ID    IMAGE                COMMAND                   CREATED         STATUS                     PORTS   NAMES
789ec59e32442   docker-test:latest   "/bin/sh -c 'java -ja"    2 minutes ago   Exited (0) 2 minutes ago           mesos-05fb536b-asdff-3d634c4ed860-S1.be003d0e-7701-4020-84be-234643565244
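To double-check that Docker itself did not record a non-zero exit code or an OOM kill, the container state can also be inspected directly; a minimal check, using the container name from the listing above:

docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' \
  mesos-05fb536b-asdff-3d634c4ed860-S1.be003d0e-7701-4020-84be-234643565244

If the exit status really is zero, this should print "0 false", matching the "Exited (0)" status above.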
Thanks,