I am getting a java.lang.OutOfMemoryError when pulling data from a sparklyr table. I am running the code on the university computer cluster, so there should be plenty of spare memory to pull one variable from my 1.48 GB database (the same error occurs when I collect the entire database with collect()). I have already tried many different Spark configurations, as described in https://github.com/rstudio/sparklyr/issues/379 and in "Running out of heap space in sparklyr, but have plenty of memory", but the problem persists.
Also, when I type java -version
in the terminal while connected to the cluster, I get
java version "1.7.0_141"
OpenJDK Runtime Environment (rhel-2.6.10.1.el6_9-x86_64 u141-b02)
OpenJDK 64-Bit Server VM (build 24.141-b02, mixed mode)
so I don't think the problem is with Java, as suggested in "How do I configure driver memory when running Spark in local mode via Sparklyr?"
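As an extra check that the driver-memory setting is actually reaching the JVM, I can query the running driver from R via sparklyr's invoke API (diagnostic sketch only; run after spark_connect succeeds):

```r
library(sparklyr)
library(dplyr)

# Which Java will be picked up, and is JAVA_HOME set?
system("java -version")
Sys.getenv("JAVA_HOME")

# Ask the live driver JVM for its maximum heap in bytes; if
# sparklyr.shell.driver-memory = "10G" took effect, this should be
# close to 10 * 1024^3 rather than the small JVM default.
invoke_static(sc, "java.lang.Runtime", "getRuntime") %>%
  invoke("maxMemory")
```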
Below is the output log:
R version 3.4.1 (2017-06-30) -- "Single Candle"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
[Previously saved workspace restored]
> Sys.info()['nodename']
nodename
"econ14"
>
> #memory.limit(size=10000)
>
> #options(java.parameters = "-Xmx8048m")
>
>
> rm(list = ls()) #clear workspace
> library("sparklyr",lib.loc="/econ_s/saraiva/R_libs")
> library(dplyr,lib.loc="/econ_s/saraiva/R_libs")
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> library("config",lib.loc="/econ_s/saraiva/R_libs")#
Attaching package: ‘config’
The following objects are masked from ‘package:base’:
get, merge
> library("rappdirs",lib.loc="/econ_s/saraiva/R_libs")#
> library("withr",lib.loc="/econ_s/saraiva/R_libs")#
> library("bindrcpp",lib.loc="/econ_s/saraiva/R_libs")#
>
>
> #Sys.setenv("SPARK_MEM" = "20g")
> config <- spark_config()
> #config[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"
> config$`sparklyr.shell.driver-memory` <- "10G"
> config$`sparklyr.shell.executor-memory` <- "10G"
> config$`spark.driver.maxResultSize` <- "10g"
> config$`spark.yarn.executor.memoryOverhead` <- "16g"
>
>
>
> sc<-spark_connect(master = "local",config = config)
* Using Spark: 2.1.0
>
>
>
>
> test=spark_read_json(sc = sc, name = "videos", path = "file/path.json")
>
> #=====Select a subset of variables:=====
> a<-select(test, asin, helpful,overall)#works
>
> #=====Filter Variables:=================
> a <- filter(test, asin=='B000H0X79O')#works
> #Using a function not defined in dplyr: (causes computer to run out of memory)
> tr<-select(test, reviewText)
> tr<-pull(tr)
Error: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuilder.append(StringBuilder.java:132)
at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:364)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:357)
at scala.collection.mutable.ArrayOps$ofRef.addString(ArrayOps.scala:186)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:323)
at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:186)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:325)
at scala.collection.mutable.ArrayOps$ofRef.mkString(ArrayOps.scala:186)
at sparklyr.Utils$.collectImplString(utils.scala:136)
at sparklyr.Utils$.collectImpl(utils.scala:174)
at sparklyr.Utils$$anonfun$collect$1.apply(utils.scala:198)
at sparklyr.Utils$$anonfun$collect$1.apply(utils.scala:198)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at sparklyr.Utils$.collect(utils.scala:198)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sparklyr.Invoke$.invoke(invoke.scala:102)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:97)
Execution halted
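In case it is relevant, here is a sketch of the chunked workaround I am considering instead of a single pull() — splitting the column into row-index ranges with sdf_with_sequential_id() and collecting each range separately (names and chunk size are illustrative, and I have not been able to test this yet):

```r
library(sparklyr)
library(dplyr)

# Pull reviewText in chunks rather than in one pull():
# sdf_with_sequential_id() appends an "id" column (1..n) to the Spark DataFrame.
tr <- test %>% select(reviewText) %>% sdf_with_sequential_id()
n  <- sdf_nrow(tr)

chunk_size <- 100000  # illustrative; tune to available driver memory
chunks <- lapply(seq(1, n, by = chunk_size), function(start) {
  tr %>%
    filter(id >= start, id < start + chunk_size) %>%
    collect()
})

# Reassemble on the R side once every chunk has been collected.
reviews <- bind_rows(chunks)$reviewText
```

Would this sidestep the heap error, or does collectImplString in the stack trace indicate the problem is per-row (very long strings) rather than total volume?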