Pig MAP_ONLY job failing with DataNode going OOM


I have application audit data stored in HDFS for days 09 to 16. This data has 23 fields in total. Two of them are Event_ID and Status. Event_ID can be 1 to 10, and Status can be REQUEST, SUCCESS, or FAILURE.

[root@IDAPHadoop base2]# hdfs dfs -du -h /prod/audit/year=2023/month=08
412.4 M  /prod/audit/year=2023/month=08/day=09
140.2 M  /prod/audit/year=2023/month=08/day=10
550.5 M  /prod/audit/year=2023/month=08/day=11
26.3 M   /prod/audit/year=2023/month=08/day=12
33.0 M   /prod/audit/year=2023/month=08/day=13
50.5 K   /prod/audit/year=2023/month=08/day=14
156.4 K  /prod/audit/year=2023/month=08/day=15
38.0 K   /prod/audit/year=2023/month=08/day=16

I am reading the day=09 data using Pig with MapReduce on YARN. On alias A I apply a filter of Event_ID == 2 and store the result into another alias B. On alias B I apply a filter of Status == 'SUCCESS' OR Status == 'FAILURE'.

A = LOAD '/prod/audit/year=2023/month=08/day=09/ymd=20230809' USING PigStorage('|') AS
(
   COL0     : chararray,
   Event_ID : int,
   Status   : chararray,
   COL3     : chararray,
   .
   .
   .
   COL23    : chararray
);

B = FOREACH (FILTER A BY Event_ID == 2) GENERATE COL0, Event_ID, Status, COL3, ..., COL23;

C = FOREACH (FILTER B BY Status == 'SUCCESS' OR Status == 'FAILURE') GENERATE COL0, Event_ID, Status, COL3, ..., COL23;

STORE C INTO 'hdfs://namenode:9000/auditresult/' USING org.apache.pig.piggybank.storage.MultiStorage('hdfs://namenode:9000/auditresult/','3','none', ',');
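For clarity, my reading of the four MultiStorage arguments above (based on the piggybank documentation, so treat this as my assumption rather than verified behaviour):

-- MultiStorage('hdfs://namenode:9000/auditresult/', '3', 'none', ',')
--   arg 1: parent output path
--   arg 2: '3'    = index of the field whose value names each output sub-directory
--   arg 3: 'none' = no compression
--   arg 4: ','    = field delimiter within the written files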

The JVM container minimum memory is set to 6 GB and the maximum memory to 12 GB for MapReduce on YARN.
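(For completeness, the equivalent per-job memory overrides can also be set from inside the Pig script. The property names below are the standard MapReduce ones and are my assumption; the exact keys used in the cluster configuration are not shown here.)

SET mapreduce.map.memory.mb 6144;          -- requested container size per map task, using the 6 GB figure above
SET mapreduce.map.java.opts '-Xmx5120m';   -- map task JVM heap, kept below the container size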

HDFS configuration: block size is 64 MB with replication factor 1.

When I try to store C into HDFS, the job and its tasks fail. The YARN UI shows 7 MAP tasks only and 0 reducer tasks. The map tasks fail and ultimately the job fails. The YARN UI shows the map tasks failing with a Java "unable to create new native thread" error, and the DataNode, which runs with a 1 GB heap, also shows an OOM error. Sometimes the DataNode even restarts due to maximum resource consumption; maybe the OS kills it.

Below are some statistics, in case they help:

Total number of records: 1426123

+---------+------+
| Event_ID| count|
+---------+------+
|        1|301790|
|        3|116310|
|        9| 19166|
|        4| 58690|
|        8| 93204|
|        7| 22643|
|       10| 18111|
|        2|796209|
+---------+------+


For Event_ID == 2:
+---------+------+
|   STATUS| count|
+---------+------+
|  SUCCESS|377655|
|  REQUEST|395812|
|  FAILURE| 22742|
+---------+------+

The statistics above were calculated using Spark.
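(For reference, the same per-Event_ID counts can be reproduced directly in Pig with a plain GROUP/COUNT; this is just a sketch, the figures above were produced with Spark.)

EVT_GROUPED = GROUP A BY Event_ID;
EVT_COUNTS  = FOREACH EVT_GROUPED GENERATE group AS Event_ID, COUNT(A) AS cnt;
DUMP EVT_COUNTS;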

If I run a COUNT operation on alias C, it works fine. Going further, if I store alias E, which is the result of a join of C and D, then E gets stored successfully. That is strange: as far as I know, storing alias E still involves the same MAP_ONLY operations, so why does that case not give the error?

E = JOIN C BY ..., D BY ...;   -- D and the join keys are defined elsewhere
STORE E INTO '...';            -- this works properly
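The COUNT on C that succeeds is essentially the following (a sketch assuming the usual GROUP ALL idiom; the actual statement may differ slightly):

C_ALL = GROUP C ALL;
C_CNT = FOREACH C_ALL GENERATE COUNT(C);
DUMP C_CNT;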

Only when I try to store C does this issue occur, and strangely not only does the MapReduce job fail, but the DataNode JVM also hits an OOM error.

Stack trace from the YARN UI:

Error: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
        at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.set(DataStreamer.java:299)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:826)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at org.apache.hadoop.hdfs.DataStreamer.initDataStreaming(DataStreamer.java:623)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:711)

Stack trace from the DataNode logs:

==> hadoop/hadoop--datanode-IDAPHadoop-217.log <==
2023-08-19 19:53:03,904 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:719)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:155)
        at java.lang.Thread.run(Thread.java:750)
2023-08-19 19:53:03,952 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: IDAPHadoop-222.hadoop.com:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.7.0.65:50894 dst: /10.7.0.77:50010
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:211)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:211)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:519)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:959)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:867)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
        at java.lang.Thread.run(Thread.java:750)

I can also see these logs via journalctl:

/startup_datanode.sh: fork: retry: Resource temporarily unavailable
/startup_datanode.sh: fork: retry: No child processes
/startup_datanode.sh: fork: retry: Resource temporarily unavailable
/startup_datanode.sh: fork: retry: No child processes
/startup_datanode.sh: fork: retry: No child processes
/startup_datanode.sh: fork: retry: Resource temporarily unavailable
/startup_datanode.sh: fork: retry: No child processes
/startup_datanode.sh: fork: retry: No child processes
/startup_datanode.sh: fork: retry: No child processes
/startup_datanode.sh: fork: retry: Resource temporarily unavailable

From online research I found that this can be caused by a low limit on the maximum number of processes a user can launch, or on the number of open file descriptors. But that is not the case here; in fact, the limits are much more than enough:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030714
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1048576
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I request the community to help me with possible causes.
