Release CIMScala-2.11-2.0.1-1.8.1 · derrickoswald/CIMSpark

Fix warning and error messages when creating redges.RData.

Note

Existing R scripts work, but issue warning messages like so:

Warning message:
'sparkR.init' is deprecated.
Use 'sparkR.session' instead.
See help("Deprecated")

Warning message:
'sparkRSQL.init' is deprecated.
Use 'sparkR.session' instead.
See help("Deprecated")

Warning message:
'sql(sqlContext...)' is deprecated.
Use 'sql(sqlQuery)' instead.
See help("Deprecated")

It is possible to eliminate these messages using the script below, but testing this code against large data sets indicates severe memory issues.

For now, we recommend continuing to use the R script from version 1.6.0, ignoring the warning messages, rather than the code below.
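If the deprecation warnings clutter the output of a script that keeps the 1.6.0-style calls, base R can silence them around individual calls. The following is a minimal sketch, not part of this release: `old_init` and `new_session` are hypothetical stand-ins for a deprecated function such as `sparkR.init` and its replacement, so the snippet runs without Spark.

```r
# hypothetical stand-in for a deprecated function such as sparkR.init
old_init = function ()
{
    .Deprecated ("new_session") # emits the "is deprecated" warning
    "context"
}

# suppressWarnings keeps the return value and drops the warning
sc = suppressWarnings (old_init ())
print (sc)
```

Note that `suppressWarnings` hides every warning raised inside the wrapped expression, not only deprecation notices, so it is best wrapped around single calls rather than whole scripts.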

R code changes for Spark 2.0 (avoids warning messages):

# record the load time
begin = proc.time ()

# set up the Spark system
Sys.setenv (YARN_CONF_DIR="/spark/spark-2.0.2-bin-hadoop2.7/conf")
Sys.setenv (SPARK_HOME="/spark/spark-2.0.2-bin-hadoop2.7")
library (SparkR, lib.loc = c (file.path (Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session ("spark://sandbox:7077", "Sample", sparkJars = c ("CIMScala-2.11-2.0.1-1.8.1.jar"), sparkConfig = list (spark.driver.memory="1g", spark.executor.memory="4g", spark.serializer="org.apache.spark.serializer.KryoSerializer"))

# record the start time
pre = proc.time ()

# read the data file and process topologically and make the edge RDD
elements = sql ("create temporary view elements using ch.ninecode.cim options (path 'hdfs://sandbox:8020/data/NIS_CIM_Export_sias_current_20161220_V9.rdf', StorageLevel 'MEMORY_AND_DISK_SER', ch.ninecode.cim.make_edges 'true', ch.ninecode.cim.do_topo 'false', ch.ninecode.cim.do_topo_islands 'false')")
head (sql ("select * from elements")) # triggers evaluation

# record the time after reading, before creating the redges data frame
post = proc.time ()

# read the edges RDD as an R data frame
edges = sql ("select * from edges")
redges = SparkR::collect (edges, stringsAsFactors=FALSE)

# save the redges data frame
save ("redges", file="./NIS_CIM_Export_sias_current_20161220_V9")

finish = proc.time ()

# show timing
print (paste ("setup", as.numeric (pre[3] - begin[3])))
print (paste ("read", as.numeric (post[3] - pre[3])))
print (paste ("redges", as.numeric (finish[3] - post[3])))

# example to read an RDD directly
terminals = sql ("select * from Terminal")
rterminals = SparkR::collect (terminals, stringsAsFactors=FALSE)

# example to read a three-way join of RDD
switches = sql ("select s.sup.sup.sup.sup.mRID mRID, s.sup.sup.sup.sup.aliasName aliasName, s.sup.sup.sup.sup.name name, s.sup.sup.sup.sup.description description, open, normalOpen no, l.CoordinateSystem cs, p.xPosition, p.yPosition from Switch s, Location l, PositionPoint p where s.sup.sup.sup.Location = l.sup.mRID and s.sup.sup.sup.Location = p.Location and p.sequenceNumber = 0")
rswitches = SparkR::collect (switches, stringsAsFactors=FALSE)

Timings on the NIS AWS cluster for this sequence of operations on an 8017082910 byte RDF file are:
setup 3.089 seconds
read 27.636 seconds
redges 1296.595 seconds
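
To put these figures in proportion, the share of each phase in the total can be worked out directly. This is a small sketch using only the three timings reported above; the variable names are illustrative.

```r
# timings reported above, in seconds
timings = c (setup = 3.089, read = 27.636, redges = 1296.595)

# total elapsed time and each phase's share of it, in percent
total = sum (timings)
share = round (100 * timings / total, 1)

print (total)
print (share)
```

The `redges` phase, which in the script covers both the `collect` call and the `save`, dominates the run.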
