
Installing SparkR on the Cloudera Quickstart VM

November 7, 2016

Purpose

SparkR is an extension to Apache Spark that lets you run Spark jobs from the R programming language. Neither Cloudera's nor MapR's distribution ships with SparkR, so it must be installed separately. This blog post describes how to install SparkR on the Cloudera Quickstart VM.

For more information on getting started with the Cloudera Quickstart VM, see this blog post:

https://softwaresanders.wordpress.com/2016/10/24/getting-started-with-the-cloudera-quickstart-vm/

Steps

Install R and dependent packages

  1. Install R and dependencies
    yum install R R-devel libcurl-devel openssl-devel
    
  2. Test
    R -e "print(1+1)"
    
Note:

R gets installed to /usr/lib64/R
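
The check below is a hedged sketch (not from the original post): it confirms R landed on the PATH after the yum install and prints its version line.

```shell
# Sanity check after the yum install: confirm R is on the PATH.
# The note above says it lands in /usr/lib64/R on the Quickstart VM.
if command -v R >/dev/null 2>&1; then
  R --version | head -n1
else
  echo "R is not on PATH; check /usr/lib64/R"
fi
```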

Install SparkR

  1. Run R Console
    R
    
2. Install the Dependent R Packages
    install.packages("devtools")
    install.packages("roxygen2")
    install.packages("testthat")
    
  3. Close out of the R shell
    quit()
    
  4. Get the Spark Version
    spark-submit --version
    
    1. Example output: 1.6.0
    2. Use this output wherever the {SPARK_VERSION} placeholder appears in the steps below
  5. Run R Console
    R
    
  6. Install the SparkR Packages
    devtools::install_github('apache/spark@v{SPARK_VERSION}', subdir='R/pkg')
    install.packages('sparklyr')
    
  7. Close out of the R shell
    quit()
    
  8. Copy the SparkR Files from the Spark Source Archive
    cd /tmp/
    wget https://github.com/apache/spark/archive/v{SPARK_VERSION}.zip
    unzip v{SPARK_VERSION}.zip
    cd spark-{SPARK_VERSION}
    cp -r R /usr/lib/spark/
    cd bin
    cp sparkR /usr/lib/spark/bin/
    
  9. Run Dev Install
    cd /usr/lib/spark/R/
    sh install-dev.sh
    
  10. Create a new file “/usr/bin/sparkR” and set the contents to be:
    #!/bin/bash
    # Autodetect JAVA_HOME if not defined
    . /usr/lib/bigtop-utils/bigtop-detect-javahome
    exec /usr/lib/spark/bin/sparkR "$@"
    
  11. Finish install
    sudo chmod 755 /usr/bin/sparkR
    
  12. You’re done!
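
Since the {SPARK_VERSION} placeholder appears in several of the steps above, it can also be filled in automatically. The sketch below is my own, not from the post: it assumes `spark-submit --version` prints a banner line containing the version number (Spark 1.x writes it to stderr, hence the `2>&1` in the comment), and `parse_spark_version` is a hypothetical helper name.

```shell
# Sketch: derive the {SPARK_VERSION} placeholder instead of copying it by hand.
# Hypothetical helper: pull the first X.Y.Z token out of stdin.
parse_spark_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1
}

# On the Quickstart VM you would pipe the real banner:
#   SPARK_VERSION=$(spark-submit --version 2>&1 | parse_spark_version)
# Here we demonstrate with a sample line from the Spark 1.6.0 banner:
sample='      /_/    version 1.6.0'
SPARK_VERSION=$(echo "$sample" | parse_spark_version)

echo "v${SPARK_VERSION}"                                             # -> v1.6.0
echo "https://github.com/apache/spark/archive/v${SPARK_VERSION}.zip"
```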

Test SparkR

  • Test from R Console
    1. Open the R Console
      R
      
    2. Execute the following commands:
      library(SparkR)
      library(sparklyr)
      Sys.setenv(SPARK_HOME='/usr/lib/spark')
      Sys.setenv(SPARK_HOME_VERSION='1.6.0')
      sc = spark_connect(master = "yarn-client")
      
    3. If everything runs without errors, you know it’s working!
  • Test from SparkR Console
    1. Open the SparkR Console
      sparkR
      
    2. Verify the Spark Context is available with the following command:
      sc
      
    3. If the sc variable is listed then you know it’s working!
  • Sample code you can run to test more
    rdd = SparkR:::parallelize(sc, 1:5)
    SparkR:::collect(rdd)
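
The same smoke test can also be scripted so it runs without an interactive shell. This sketch is my own (not from the post): it writes the test to a file you could hand to spark-submit; `sparkR.init`/`sparkR.stop` are the Spark 1.6-era entry points, and the paths assume the install locations used above.

```shell
# Sketch: save the SparkR smoke test to a file so it can be run
# non-interactively (e.g. from a provisioning script), for example with:
#   spark-submit /tmp/sparkr-smoke-test.R
cat > /tmp/sparkr-smoke-test.R <<'EOF'
library(SparkR)
Sys.setenv(SPARK_HOME = '/usr/lib/spark')
sc <- sparkR.init(master = 'yarn-client')   # Spark 1.6-era entry point
rdd <- SparkR:::parallelize(sc, 1:5)
print(SparkR:::collect(rdd))
sparkR.stop()
EOF
echo "wrote $(wc -l < /tmp/sparkr-smoke-test.R | tr -d ' ') lines"
```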
    