I tried to run my own Pregel-based algorithm on a relatively small graph (250k vertices, 1.5M edges). The algorithm I use is very likely non-convergent, meaning that in most cases the maxIterations setting acts as a hard stop that ends all calculations.

I'm using AWS EMR with Apache Spark and m5.2xlarge instances for all nodes, with EMR-managed scaling. Initially, the cluster is set to run 1 master and 4 worker nodes, with expansion up to a maximum of 8.

For the same cluster setup, I gradually increased maxIterations from 100 to 500 in steps of 100 [100, 200, 300, 400, 500]. I assumed that a setup sufficient for 100 iterations would also be sufficient for any other number, because memory that is no longer used would be freed.

However, when I ran a set of jobs with maxIterations increasing from 100 to 500, I found that all jobs with maxIterations > 100 were terminated due to a step error. I checked the Spark logs to find the issue, and this is what I got:

log start

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt1/yarn/usercache/hadoop/filecache/10/__spark_libs__364046395941885636.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for TERM
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for HUP
21/02/13 21:23:24 INFO SignalUtils: Registered signal handler for INT
21/02/13 21:23:24 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:24 INFO SecurityManager: Changing view acls groups to: 
21/02/13 21:23:24 INFO SecurityManager: Changing modify acls groups to: 
21/02/13 21:23:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:24 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:24 INFO ApplicationMaster: Preparing Local resources
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:25 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1613251201422_0001_000001
21/02/13 21:23:25 INFO ApplicationMaster: Starting the user application in a separate Thread
21/02/13 21:23:25 INFO ApplicationMaster: Waiting for spark context initialization...
21/02/13 21:23:25 INFO SparkContext: Running Spark version 2.4.7-amzn-0
21/02/13 21:23:25 INFO SparkContext: Submitted application: Read JDBC Datasites2
21/02/13 21:23:25 INFO SecurityManager: Changing view acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls to: yarn,hadoop
21/02/13 21:23:25 INFO SecurityManager: Changing view acls groups to: 
21/02/13 21:23:25 INFO SecurityManager: Changing modify acls groups to: 
21/02/13 21:23:25 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:25 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:25 INFO Utils: Successfully started service 'sparkDriver' on port 41117.
21/02/13 21:23:25 INFO SparkEnv: Registering MapOutputTracker
21/02/13 21:23:25 INFO SparkEnv: Registering BlockManagerMaster
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/02/13 21:23:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-bc544c91-1a59-41f3-890f-faaa392bea09
21/02/13 21:23:25 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1613251201422_0001/blockmgr-14e3f36f-6d3f-4ffe-a28c-fa3f81f0c5c9
21/02/13 21:23:26 INFO MemoryStore: MemoryStore started with capacity 1008.9 MB
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:26 INFO SparkEnv: Registering OutputCommitCoordinator
21/02/13 21:23:26 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
21/02/13 21:23:26 INFO Utils: Successfully started service 'SparkUI' on port 43659.
21/02/13 21:23:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://ip-172-31-21-88.ec2.internal:43659
21/02/13 21:23:26 INFO YarnClusterScheduler: Created YarnClusterScheduler
21/02/13 21:23:26 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1613251201422_0001 and attemptId Some(appattempt_1613251201422_0001_000001)
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34665.
21/02/13 21:23:26 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
21/02/13 21:23:26 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: fs.s3.buffer.dir;  Ignoring.
21/02/13 21:23:26 WARN Configuration: __spark_hadoop_conf__.xml:an attempt to override final parameter: yarn.nodemanager.local-dirs;  Ignoring.
21/02/13 21:23:27 INFO RMProxy: Connecting to ResourceManager at ip-172-31-29-
  command:
    LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH\" \ 
      {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx4743m \ 
      '-verbose:gc' \ 
      '-XX:+PrintGCDetails' \ 
      '-XX:+PrintGCDateStamps' \ 
      '-XX:OnOutOfMemoryError=kill -9 %p' \ 
      '-XX:+UseParallelGC' \ 
      '-XX:InitiatingHeapOccupancyPercent=70' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.history.ui.port=18080' \ 
      '-Dspark.ui.port=0' \ 
      '-Dspark.driver.port=41117' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://[email protected]:41117 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      2 \ 
      --app-id \ 
      application_1613251201422_0001 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __app__.jar -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/force-pregel.jar" } size: 27378 timestamp: 1613251399566 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_libs__364046395941885636.zip" } size: 239655683 timestamp: 1613251397751 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/__spark_conf__.zip" } size: 274365 timestamp: 1613251399776 type: ARCHIVE visibility: PRIVATE
    hive-site.xml -> resource { scheme: "hdfs" host: "ip-172-31-29-153.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1613251201422_0001/hive-site.xml" } size: 2137 timestamp: 1613251399631 type: FILE visibility: PRIVATE 

===============================================================================
    21/02/13 21:23:27 INFO Configuration: resource-types.xml not found
    21/02/13 21:23:27 INFO ResourceUtils: Unable to find 'resource-types.xml'.
    21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
    21/02/13 21:23:27 INFO ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
    21/02/13 21:23:27 INFO Utils: Using initial executors = 100, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
    21/02/13 21:23:27 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://[email protected]:41117)
    21/02/13 21:23:27 INFO YarnAllocator: Will request up to 100 executor container(s), each with <memory:5632, max memory:2147483647, vCores:2, max vCores:2147483647>
    21/02/13 21:23:27 INFO YarnAllocator: Submitted 100 unlocalized container requests.
    21/02/13 21:23:27 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
   org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
    21/02/13 21:23:27 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
    21/02/13 21:23:27 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000002 on host ip-172-31-21-88.ec2.internal for executor with ID 1 with resources <memory:5632, max memory:12288, vCores:1, max vCores:8>
    21/02/13 21:23:27 INFO YarnAllocator: Launching executor with 4742m of heap (plus 890m overhead) and 2 cores
    21/02/13 21:23:27 INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
    21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000004 on host ip-172-31-25-102.ec2.internal for executor with ID 2 with resources <memory:11264, vCores:2>
    21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
    21/02/13 21:23:28 INFO YarnAllocator: Allocated container container_1613251201422_0001_01_000006 on host ip-172-31-28-143.ec2.internal for executor with ID 3 with resources <memory:11264, vCores:2>
    21/02/13 21:23:28 INFO YarnAllocator: Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores
    21/02/13 21:23:28 INFO YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
  30 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.21.88:53634) with ID 1
    21/02/13 21:23:30 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 1)
    21/02/13 21:23:30 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-21-88.ec2.internal:45667 with 2.3 GB RAM, BlockManagerId(1, ip-172-31-21-88.ec2.internal, 45667, None)
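
For reference, the container sizes in the YarnAllocator lines above appear to be simply heap plus overhead. A small sketch to check that arithmetic, assuming the two "Launching executor" lines are representative; the helper name `container_sizes` is made up for illustration and is not part of Spark or YARN:

```python
import re

# Two lines copied from the YarnAllocator output above.
LOG_LINES = [
    "Launching executor with 4742m of heap (plus 890m overhead) and 2 cores",
    "Launching executor with 9485m of heap (plus 1779m overhead) and 4 cores",
]

# Captures the heap size and the overhead size, both in megabytes.
PATTERN = re.compile(r"(\d+)m of heap \(plus (\d+)m overhead\)")


def container_sizes(lines):
    """Return (heap_mb, overhead_mb, total_mb) for each matching log line."""
    sizes = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            heap, overhead = int(match.group(1)), int(match.group(2))
            sizes.append((heap, overhead, heap + overhead))
    return sizes


for heap, overhead, total in container_sizes(LOG_LINES):
    print(f"heap={heap} MB overhead={overhead} MB container={total} MB")
```

The totals, 5632 MB and 11264 MB, match the `memory:5632` and `memory:11264` values reported for the allocated containers.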

Then approximately 2 MB of similar output follows, and then it finishes:

21/02/13 21:28:25 INFO TaskSetManager: Finished task 199.0 in stage 37207.0 (TID 93528) in 8 ms on ip-172-31-25-102.ec2.internal (executor 2) (158/200)

21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_31 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 252.3 KB, free: 2.1 GB)
21/02/13 21:28:25 ERROR ApplicationMaster: Exception from Reporter thread.
org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy23.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:300)
    at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:279)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$allocationThreadImpl(ApplicationMaster.scala:541)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:607)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
    at org.apache.hadoop.ipc.Client.call(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1394)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy22.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 13 more
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_30 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.8 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 40.0 in stage 37207.0 (TID 93533, ip-172-31-21-88.ec2.internal, executor 1, partition 40, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 31.0 in stage 37207.0 (TID 93532) in 16 ms on ip-172-31-21-88.ec2.internal (executor 1) (162/200)
21/02/13 21:28:25 INFO ApplicationMaster: Final app status: FAILED, exitCode: 12, (reason: Application attempt appattempt_1613251201422_0001_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:353)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:507)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1034)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:931)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2854)
)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 41.0 in stage 37207.0 (TID 93534, ip-172-31-21-88.ec2.internal, executor 1, partition 41, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 30.0 in stage 37207.0 (TID 93531) in 22 ms on ip-172-31-21-88.ec2.internal (executor 1) (163/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_40 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 234.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 48.0 in stage 37207.0 (TID 93535, ip-172-31-21-88.ec2.internal, executor 1, partition 48, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 40.0 in stage 37207.0 (TID 93533) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (164/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_41 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 233.4 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 51.0 in stage 37207.0 (TID 93536, ip-172-31-21-88.ec2.internal, executor 1, partition 51, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 41.0 in stage 37207.0 (TID 93534) in 15 ms on ip-172-31-21-88.ec2.internal (executor 1) (165/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_48 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 235.1 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 57.0 in stage 37207.0 (TID 93537, ip-172-31-21-88.ec2.internal, executor 1, partition 57, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 48.0 in stage 37207.0 (TID 93535) in 11 ms on ip-172-31-21-88.ec2.internal (executor 1) (166/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_57 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 232.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_51 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 244.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 61.0 in stage 37207.0 (TID 93538, ip-172-31-21-88.ec2.internal, executor 1, partition 61, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 57.0 in stage 37207.0 (TID 93537) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (167/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 63.0 in stage 37207.0 (TID 93539, ip-172-31-21-88.ec2.internal, executor 1, partition 63, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 51.0 in stage 37207.0 (TID 93536) in 17 ms on ip-172-31-21-88.ec2.internal (executor 1) (168/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_61 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 228.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 67.0 in stage 37207.0 (TID 93540, ip-172-31-21-88.ec2.internal, executor 1, partition 67, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 61.0 in stage 37207.0 (TID 93538) in 10 ms on ip-172-31-21-88.ec2.internal (executor 1) (169/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_63 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 238.3 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 71.0 in stage 37207.0 (TID 93541, ip-172-31-21-88.ec2.internal, executor 1, partition 71, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 63.0 in stage 37207.0 (TID 93539) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (170/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_67 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 247.2 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_71 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 243.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 77.0 in stage 37207.0 (TID 93542, ip-172-31-21-88.ec2.internal, executor 1, partition 77, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 67.0 in stage 37207.0 (TID 93540) in 18 ms on ip-172-31-21-88.ec2.internal (executor 1) (171/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 79.0 in stage 37207.0 (TID 93543, ip-172-31-21-88.ec2.internal, executor 1, partition 79, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 71.0 in stage 37207.0 (TID 93541) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (172/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_79 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 253.6 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_77 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 222.5 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 86.0 in stage 37207.0 (TID 93544, ip-172-31-21-88.ec2.internal, executor 1, partition 86, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 79.0 in stage 37207.0 (TID 93543) in 12 ms on ip-172-31-21-88.ec2.internal (executor 1) (173/200)
21/02/13 21:28:25 INFO TaskSetManager: Starting task 87.0 in stage 37207.0 (TID 93545, ip-172-31-21-88.ec2.internal, executor 1, partition 87, PROCESS_LOCAL, 19161 bytes)
21/02/13 21:28:25 INFO TaskSetManager: Finished task 77.0 in stage 37207.0 (TID 93542) in 14 ms on ip-172-31-21-88.ec2.internal (executor 1) (174/200)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_86 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 254.5 KB, free: 2.1 GB)
21/02/13 21:28:25 INFO BlockManagerInfo: Added rdd_2738_87 in memory on ip-172-31-21-88.ec2.internal:45667 (size: 267.1 KB, free: 2.1 GB)
  • Am I correct that Pregel doesn't finish 200 or more iterations due to an OutOfMemory error on some of the cluster nodes?
  • If so, how does Pregel work such that 100 iterations don't cause it but 200 or 300 do? My understanding before this issue was that Pregel, like many other iterative approaches, only 'stores' the previous and current iteration's values and results. The values change from iteration to iteration, but their quantity does not increase: it is still a graph with 250k vertices and 1.5M edges, and only the messages valid for the current iteration are added to the heap.
  • Throughout the log I was not able to find any information about low memory, and as seen, there are gigabytes of it available on each node before it terminates.
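
To make the mental model from the second bullet concrete, here is a toy single-machine loop. It is only a sketch of my assumption about how Pregel iterates, not GraphX's actual implementation; the names `pregel` and `run_demo` and the 3-vertex cycle are made up for illustration. The example algorithm never converges, so maxIterations is the only stop condition.

```python
def pregel(vertices, edges, initial_msg, max_iterations, vprog, send_msg, merge_msg):
    """Toy Pregel loop (an illustration, NOT GraphX's implementation).

    vertices: dict of vertex_id -> value
    edges:    list of (src, dst) pairs
    Returns the final vertex state and the largest inbox seen in any superstep.
    """
    # Superstep 0: every vertex processes the initial message.
    state = {vid: vprog(vid, val, initial_msg) for vid, val in vertices.items()}
    peak_messages = 0
    for _ in range(max_iterations):
        # The inbox holds messages for the current superstep only.
        inbox = {}
        for src, dst in edges:
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages left: the algorithm has converged
        peak_messages = max(peak_messages, len(inbox))
        # Only vertices that received a message run vprog; the old state is
        # dropped as soon as the name is rebound to the new dict.
        state = {
            vid: vprog(vid, val, inbox[vid]) if vid in inbox else val
            for vid, val in state.items()
        }
    return state, peak_messages


def run_demo(max_iterations):
    """Non-convergent example: each vertex sends 1 along every out-edge."""
    vertices = {0: 0, 1: 0, 2: 0}
    edges = [(0, 1), (1, 2), (2, 0)]  # a 3-vertex cycle
    return pregel(
        vertices,
        edges,
        initial_msg=0,
        max_iterations=max_iterations,
        vprog=lambda vid, val, msg: val + msg,
        send_msg=lambda src, src_val, dst, dst_val: [(dst, 1)],
        merge_msg=lambda a, b: a + b,
    )


for n in (100, 500):
    final_state, peak = run_demo(n)
    print(n, len(final_state), peak)
```

In this model the number of stored vertex values and the peak number of messages are the same after 100 and after 500 supersteps. If Spark behaved exactly like this, the iteration count alone should not change memory use, which is why the failures above 100 iterations surprise me.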