Accumulo scan/write not running in a standalone Java main program on an AWS EC2 master using Cloudera CDH 5.8.2


We are trying to run a simple write/scan against Accumulo (client jar 1.5.0) from a standalone Java main program (a Maven shade executable), shown below, on the AWS EC2 master (environment described below) via PuTTY:

    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Instance;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableExistsException;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.security.Authorizations;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class AccumuloQueryApp {

      private static final Logger logger = LoggerFactory.getLogger(AccumuloQueryApp.class);

      public static final String INSTANCE = "accumulo"; // miniInstance
      public static final String ZOOKEEPERS = "ip-x-x-x-100:2181"; //localhost:28076

      private static Connector conn;

      static {
        // Accumulo
        Instance instance = new ZooKeeperInstance(INSTANCE, ZOOKEEPERS);
        try {
          conn = instance.getConnector("root", new PasswordToken("xxx"));
        } catch (Exception e) {
          logger.error("Connection", e);
        }
      }

      public static void main(String[] args) throws TableNotFoundException, AccumuloException, AccumuloSecurityException, TableExistsException {
        System.out.println("connection with : " + conn.whoami());

        BatchWriter writer = conn.createBatchWriter("test", ofBatchWriter());

        for (int i = 0; i < 10; i++) {
          Mutation m1 = new Mutation(String.valueOf(i));
          m1.put("personal_info", "first_name", String.valueOf(i));
          m1.put("personal_info", "last_name", String.valueOf(i));
          m1.put("personal_info", "phone", "983065281" + i % 2);
          m1.put("personal_info", "email", String.valueOf(i));
          m1.put("personal_info", "date_of_birth", String.valueOf(i));
          m1.put("department_info", "id", String.valueOf(i));
          m1.put("department_info", "short_name", String.valueOf(i));
          m1.put("department_info", "full_name", String.valueOf(i));
          m1.put("organization_info", "id", String.valueOf(i));
          m1.put("organization_info", "short_name", String.valueOf(i));
          m1.put("organization_info", "full_name", String.valueOf(i));

          writer.addMutation(m1);
        }
        writer.close();

        System.out.println("Writing complete ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");

        Scanner scanner = conn.createScanner("test", new Authorizations());
        System.out.println("Step 1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");
        scanner.setRange(new Range("3", "7"));
        System.out.println("Step 2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");
        scanner.forEach(e -> System.out.println("Key: " + e.getKey() + ", Value: " + e.getValue()));
        System.out.println("Step 3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`");
        scanner.close();
      }

      public static BatchWriterConfig ofBatchWriter() {
        //Batch Writer Properties
        final int MAX_LATENCY  = 1;
        final int MAX_MEMORY = 10000000;
        final int MAX_WRITE_THREADS = 10;
        final int TIMEOUT = 10;

        BatchWriterConfig config = new BatchWriterConfig();   
        config.setMaxLatency(MAX_LATENCY, TimeUnit.MINUTES);
        config.setMaxMemory(MAX_MEMORY);
        config.setMaxWriteThreads(MAX_WRITE_THREADS);
        config.setTimeout(TIMEOUT, TimeUnit.MINUTES);

        return config;
      }
    }

The connection is established correctly, but creating the BatchWriter fails, and the client keeps retrying in a loop with the same error:

[impl.ThriftScanner] DEBUG: Error getting transport to ip-x-x-x-100:10011 : NotServingTabletException(extent:TKeyExtent(table:21 30, endRow:21 30 3C, prevEndRow:null))
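As far as we understand, NotServingTabletException means the tablet server we contacted does not currently host the tablet for that key range. For reference, a minimal pre-flight check before writing (a sketch only, assuming the same `conn` and table name as above) would at least rule out the table not existing or being offline:

    // Sketch: ensure the "test" table exists and ask the master to bring
    // it online before writing; all TableOperations calls below are part
    // of the Accumulo 1.5 client API.
    if (!conn.tableOperations().exists("test")) {
      conn.tableOperations().create("test");
    }
    // Request that the table's tablets be brought online.
    conn.tableOperations().online("test");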

When we run the same code (writing to and reading from Accumulo) inside a Spark job and submit it to the YARN cluster, it runs perfectly. We are struggling to figure out why the standalone program fails, but have found no clue. The environment is described below.

Cloudera CDH 5.8.2 on AWS (4 EC2 instances: one master and 3 child nodes).

The private IPs are as follows:

  1. Master: x.x.x.100
  2. Child1: x.x.x.101
  3. Child2: x.x.x.102
  4. Child3: x.x.x.103

We have the following installed in CDH:

Cluster (CDH 5.8.2)

  1. Accumulo 1.6 (Tracer not installed; Garbage Collector on Child2, Master on Master, Monitor on Child3, Tablet Server on Master)
  2. HBase
  3. HDFS (Master as NameNode, all 3 children as DataNodes)
  4. Kafka
  5. Spark
  6. YARN (MR2 Included)
  7. ZooKeeper

1 Answer

elserj

Hrm, it's very curious that it runs under Spark-on-YARN but not as a regular Java application. Usually, it's the other way around :)

I would verify that the JARs on the classpath of the standalone Java app match the JARs used by the Spark-on-YARN job, as well as the Accumulo server classpath.
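For example, a quick way to see which JAR a given class was actually loaded from is something like this (a rough sketch; `getCodeSource()` can return null for classes loaded from the boot classpath):

    // Print the JARs the Accumulo and ZooKeeper client classes come from,
    // so the standalone app's classpath can be compared against the Spark job's.
    System.out.println(org.apache.accumulo.core.client.Connector.class
        .getProtectionDomain().getCodeSource().getLocation());
    System.out.println(org.apache.zookeeper.ZooKeeper.class
        .getProtectionDomain().getCodeSource().getLocation());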

If that doesn't help, try to increase the log4j level to DEBUG or TRACE and see if anything jumps out at you. If you have a hard time understanding what the logging is saying, feel free to send an email to [email protected] and you'll definitely have more eyes on the problem.
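If editing log4j.properties inside the shaded JAR is awkward, the level can also be raised programmatically at the start of main (a sketch, assuming log4j 1.x is on the classpath, as CDH 5 typically ships):

    // Turn on TRACE logging for the Accumulo client packages at runtime.
    org.apache.log4j.Logger.getLogger("org.apache.accumulo")
        .setLevel(org.apache.log4j.Level.TRACE);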