Cask CDAP services started, but not running during installation


After following the docs for installing CDAP on a MapR system (v6.0) and starting the CDAP services (https://docs.cask.co/cdap/current/en/admin-manual/installation/mapr.html#starting-cdap-services), I am finding that some CDAP services are not running after startup, even though the startup loop shows no errors. The output from starting the services and checking their status is shown below:

[root@mapr007 conf]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:01 HST 2018 Starting CDAP Auth Server service on mapr007.org.local


/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:04 HST 2018 Starting CDAP Kafka Server service on mapr007.org.local


/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:03:07 HST 2018 Starting CDAP Master service on mapr007.org.local


Warning: Unable to determine $DRILL_HOME
Wed Nov 21 16:03:48 HST 2018 Ensuring required HBase coprocessors are on HDFS
Wed Nov 21 16:04:00 HST 2018 Running CDAP Master startup checks -- this may take a few minutes
/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:04:15 HST 2018 Starting CDAP Router service on mapr007.org.local


/usr/bin/id: cannot find name for group ID 504
Wed Nov 21 16:04:17 HST 2018 Starting CDAP UI service on mapr007.org.local



[root@mapr007 conf]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i status ; done
/usr/bin/id: cannot find name for group ID 504
PID file /var/cdap/run/auth-server-cdap.pid exists, but process 12126 does not appear to be running
/usr/bin/id: cannot find name for group ID 504
CDAP Kafka Server running as PID 12653
/usr/bin/id: cannot find name for group ID 504
PID file /var/cdap/run/master-cdap.pid exists, but process 15789 does not appear to be running
/usr/bin/id: cannot find name for group ID 504
CDAP Router running as PID 16184
/usr/bin/id: cannot find name for group ID 504
CDAP UI running as PID 16308

Note that while there is an "Unable to determine $DRILL_HOME" warning, I don't think this should be a big problem, since I have set the explore.enabled value in cdap-site.xml to false. According to cdap-site.xml, the web UI port is set to the default 11011, yet I can't reach the UI (if only to check whether it would tell me more about any errors) even though it reports as running.

Checking the PIDs with ps, I see:

# looking at the processes that report as not running
[root@mapr007 conf.dist]# ps -p 12126
  PID TTY          TIME CMD
[root@mapr007 conf.dist]# ps -p 15789
  PID TTY          TIME CMD

# looking at the rest of the processes
[root@mapr007 conf.dist]# ps -p 12653
  PID TTY          TIME CMD
12653 ?        00:08:12 java
[root@mapr007 conf.dist]# ps -p 16184
  PID TTY          TIME CMD
16184 ?        00:03:02 java
[root@mapr007 conf.dist]# ps -p 16308
  PID TTY          TIME CMD
16308 ?        00:00:01 node
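The manual ps checks above can be wrapped in a small helper. This is only a sketch: the function name check_pid_files is made up for illustration, and the default directory /var/cdap/run is taken from the status output earlier in this post. It uses kill -0, which only tests whether a process exists, so it should be run as root (or as the user owning the processes).

```shell
# Report, for every PID file in a directory, whether the recorded process is alive.
# Assumption: CDAP PID files live in /var/cdap/run, as the status output suggests.
# check_pid_files is a hypothetical helper name, not part of CDAP.
check_pid_files() {
  dir="${1:-/var/cdap/run}"
  for f in "$dir"/*.pid; do
    [ -e "$f" ] || continue          # no PID files at all: nothing to report
    pid=$(cat "$f")
    # kill -0 sends no signal; it only checks that the process exists
    if kill -0 "$pid" 2>/dev/null; then
      echo "running $pid $f"
    else
      echo "stale $pid $f"
    fi
  done
}

# Usage: check_pid_files            # checks /var/cdap/run
#        check_pid_files /some/dir  # checks another directory
```

A "stale" line corresponds to the "PID file ... exists, but process ... does not appear to be running" message from the service status command.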

Also checked if the default security.auth.server.bind.port was being used by some other service

[root@mapr007 conf.dist]# netstat -anp | grep 10009

but nothing detected.
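One caveat with a plain grep for 10009 is that it matches the digits anywhere in the line (a PID, a longer port number, a remote port). A slightly stricter check is sketched below; port_in_listing is a made-up helper name, and it assumes the listing comes from netstat -ltn or ss -ltn, where the local address is the fourth column.

```shell
# Succeed only if some socket in the listing has the given LOCAL port.
# Reads `netstat -ltn` or `ss -ltn` output on stdin; $1 is the port number.
# Assumption: the local address is the 4th whitespace-separated field.
# port_in_listing is a hypothetical helper name.
port_in_listing() {
  awk -v port="$1" '
    $4 ~ (":" port "$") { found = 1 }   # local address must end in :PORT
    END { exit !found }                 # exit 0 if found, 1 otherwise
  '
}

# Usage:
#   netstat -ltn | port_in_listing 10009 && echo "10009 in use" || echo "10009 free"
#   netstat -ltn | port_in_listing 11011 && echo "UI port is listening"
```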

Not sure where to start debugging from here, so any suggestions or information would be appreciated.


UPDATE

Restarting the services to try to get more logging data, I am now seeing some errors (better than it just not complaining and then not working, I guess):

[root@mapr007 conf.dist]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i stop ; done
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:29 HST 2018 Stopping CDAP Auth Server ...
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:29 HST 2018 Stopping CDAP Kafka Server ....

/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:30 HST 2018 Stopping CDAP Master ...
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:31 HST 2018 Stopping CDAP Router ....

/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:32 HST 2018 Stopping CDAP UI ....

[root@mapr007 conf.dist]# for i in `ls /etc/init.d/ | grep cdap` ; do sudo service $i start ; done
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:41 HST 2018 Starting CDAP Auth Server service on mapr007.org.local

/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:44 HST 2018 Starting CDAP Kafka Server service on mapr007.org.local

/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:06:47 HST 2018 Starting CDAP Master service on mapr007.org.local

Warning: Unable to determine $DRILL_HOME
Mon Nov 26 11:07:17 HST 2018 Ensuring required HBase coprocessors are on HDFS
Mon Nov 26 11:08:57 HST 2018 Running CDAP Master startup checks -- this may take a few minutes
[ERROR] Master startup checks failed. Please check /var/log/cdap/master-cdap-mapr007.org.local.log to address issues.
/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:10:08 HST 2018 Starting CDAP Router service on mapr007.org.local

/usr/bin/id: cannot find name for group ID 504
Mon Nov 26 11:10:11 HST 2018 Starting CDAP UI service on mapr007.org.local

Checking the content of the /var/log/cdap/master-cdap-mapr007.org.local.log file, at the bottom I can see:

...
...
...
2018-11-26 11:10:06,996 - ERROR [main:c.c.c.m.s.MasterStartupTool@109] - YarnCheck failed with RuntimeException: Unable to get status of YARN nodemanagers. Please check that YARN is running and that the correct Hadoop configuration (core-site.xml, yarn-site.xml) and libraries are included in the CDAP master classpath.
java.lang.RuntimeException: Unable to get status of YARN nodemanagers. Please check that YARN is running and that the correct Hadoop configuration (core-site.xml, yarn-site.xml) and libraries are included in the CDAP master classpath.
    at co.cask.cdap.master.startup.YarnCheck.run(YarnCheck.java:79) ~[co.cask.cdap.cdap-master-5.1.0.jar:na]
    at co.cask.cdap.common.startup.CheckRunner.runChecks(CheckRunner.java:51) ~[co.cask.cdap.cdap-common-5.1.0.jar:na]
    at co.cask.cdap.master.startup.MasterStartupTool.canStartMaster(MasterStartupTool.java:106) [co.cask.cdap.cdap-master-5.1.0.jar:na]
    at co.cask.cdap.master.startup.MasterStartupTool.main(MasterStartupTool.java:96) [co.cask.cdap.cdap-master-5.1.0.jar:na]
Caused by: java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[na:1.8.0_181]
    at co.cask.cdap.master.startup.YarnCheck.run(YarnCheck.java:76) ~[co.cask.cdap.cdap-master-5.1.0.jar:na]
    ... 3 common frames omitted
2018-11-26 11:10:07,006 - ERROR [main:c.c.c.m.s.MasterStartupTool@113] -   Root cause: TimeoutException: 
2018-11-26 11:10:07,006 - ERROR [main:c.c.c.m.s.MasterStartupTool@116] - Errors detected while starting up master. Please check the logs, address all errors, then try again.
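To pull just these lines out of a long log, a filter along the following lines may help. It is a rough sketch: cdap_log_errors is a made-up name, and the pattern assumes the "date time - LEVEL" layout seen in the excerpt above, plus the "Caused by:" lines of Java stack traces.

```shell
# Print ERROR entries and "Caused by:" lines from a log file, newest last.
# $1 = log file, $2 = maximum number of lines to print (default 20).
# Assumption: entries start with "YYYY-MM-DD HH:MM:SS,mmm - LEVEL", as in the
# excerpt above. cdap_log_errors is a hypothetical helper name.
cdap_log_errors() {
  grep -E '^([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:,]+ - ERROR |Caused by: )' "$1" \
    | tail -n "${2:-20}"
}

# Usage:
#   cdap_log_errors /var/log/cdap/master-cdap-mapr007.org.local.log
#   cdap_log_errors /var/log/cdap/master-cdap-mapr007.org.local.log 5
```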

Following the "CDAP services on Distributed CDAP aren't starting up due to an exception. What should I do?" FAQ in the docs did not seem to help (https://docs.cask.co/cdap/current/en/faqs/cdap.html#cdap-services-on-distributed-cdap-aren-t-starting-up-due-to-an-exception-what-should-i-do).

Will continue debugging, but would appreciate any opinion on these new errors.

1 answer

lampShadesDrifter (best answer)

Restarting Resource Manager and Node Manager services on the cluster seems to have resolved this error. This was done mostly on a guess by another dev based only on the fact that the error was related to CDAP being unable to connect to YARN despite the cluster's RM and NM services running fine.
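Before (re)starting CDAP Master, it may be worth checking from the CDAP host whether YARN answers at all within a reasonable time. The sketch below is not what CDAP's YarnCheck does internally; it just asks the ResourceManager for its node list via the standard yarn CLI and gives up after a timeout. It assumes the yarn CLI and GNU timeout are available on the host, and yarn_nodes_reachable is a made-up name.

```shell
# Ask YARN for its NodeManager list, giving up after a time limit.
# $1 = time limit in seconds (default 60).
# Returns 0 if `yarn node -list` succeeded, 124 if it timed out (GNU timeout
# convention), 2 if the yarn CLI is not on PATH, otherwise yarn's own exit code.
# Assumption: the yarn CLI and GNU coreutils `timeout` are installed here.
# yarn_nodes_reachable is a hypothetical helper name.
yarn_nodes_reachable() {
  limit="${1:-60}"
  if ! command -v yarn >/dev/null 2>&1; then
    echo "yarn CLI not found on PATH" >&2
    return 2
  fi
  timeout "$limit" yarn node -list >/dev/null 2>&1
}

# Usage:
#   yarn_nodes_reachable 60 && echo "YARN responded" || echo "YARN check failed: $?"
```

If this hangs or times out while the RM and NM processes look healthy, that would be consistent with the TimeoutException in the master log, and restarting those services (as was done here) is a reasonable next step.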

Furthermore, the CDAP installation docs for enabling Kerberos (https://docs.cask.co/cdap/current/en/admin-manual/installation/mapr.html#enabling-kerberos) specify using a special keyword _HOST, e.g.

<property>
  <name>cdap.master.kerberos.keytab</name>
  <value>/etc/security/keytabs/cdap.service.keytab</value>
</property>

<property>
  <name>cdap.master.kerberos.principal</name>
  <value><cdap-principal>/_HOST@EXAMPLE.COM</value>
</property>

where the _HOST is not just some doc placeholder, but is some special keyword that is supposed to automatically be filled in (eg. see https://mapr.com/docs/60/Hive/Config-HiveMetastoreForKerberos.html and https://mapr.com/docs/60/SecurityGuide/Config-YARN-Kerberos.html).

Apparently, for MapR client nodes (i.e. nodes that are neither control nor data nodes and simply run the MapR client package to interact with the cluster), this does not work and the Kerberos principal's server host name must be given explicitly (pretty sure docs for this exist, but I can't find them at this time). This was discovered when further examining the logs and seeing that the CDAP services were trying to connect to [email protected] instead of, say, [email protected].
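To see what a _HOST principal would expand to on a given machine, the substitution can be reproduced by hand. In Hadoop-style configuration, _HOST is conventionally replaced at runtime with the fully qualified name of the local host. The sketch below only imitates that substitution for inspection; expand_host, the cdap/ prefix, EXAMPLE.COM, and node1.example.com are illustrative placeholders, not values from this cluster.

```shell
# Show what a principal template containing _HOST would look like once the
# token is replaced by a host name.
# $1 = principal template, $2 = host name (default: this machine's FQDN).
# expand_host is a hypothetical helper; it only imitates the substitution
# and does not talk to Kerberos or read any CDAP/Hadoop configuration.
expand_host() {
  host="${2:-$(hostname -f)}"
  printf '%s\n' "$1" | sed "s/_HOST/$host/"
}

# Usage (placeholder principal and realm):
#   expand_host 'cdap/_HOST@EXAMPLE.COM'                     # uses the local FQDN
#   expand_host 'cdap/_HOST@EXAMPLE.COM' node1.example.com
```

If the expanded principal does not match an entry in the keytab (which can be listed with klist -kt on the keytab file), writing the full principal explicitly in cdap-site.xml instead of relying on _HOST is the workaround described above.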