
Installing Livy on the Cloudera Quickstart VM

Purpose

Livy is an open source component for Apache Spark that allows you to submit jobs to your Apache Spark cluster through REST calls. You can view the source code here:

https://github.com/cloudera/livy

In this post I will walk through the steps needed to install it on the Cloudera Quickstart VM. The steps were derived from the source code link above; however, this post also covers a simpler way to test it.

Initial Setup

Enable port forwarding for port 8998 from the localhost to the VM.
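
If you are running the Quickstart VM in VirtualBox with NAT networking, the forwarding rule can be added either through the VirtualBox GUI (Network -> Advanced -> Port Forwarding) or from the host command line. A sketch, assuming the VM is named "Cloudera Quickstart VM" (substitute your own VM name):

# add the rule while the VM is powered off
VBoxManage modifyvm "Cloudera Quickstart VM" --natpf1 "livy,tcp,127.0.0.1,8998,,8998"
# or add it to a VM that is already running
VBoxManage controlvm "Cloudera Quickstart VM" natpf1 "livy,tcp,127.0.0.1,8998,,8998"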

Install Steps

  1. Login as Root
  2. Download the Livy source code
    cd /opt
    wget https://github.com/cloudera/livy/archive/v0.2.0.zip
    unzip v0.2.0.zip
    cd livy-0.2.0
    
  3. Get the version of Spark that is currently installed on your cluster
    1. Run the following command
      spark-submit --version
      
    2. Example: 1.6.0
    3. Use this value in downstream commands as {SPARK_VERSION} (a shell sketch that captures it automatically follows these steps)
  4. Build the Livy source code with Maven
    /usr/local/apache-maven/apache-maven-3.0.4/bin/mvn -DskipTests=true -Dspark.version={SPARK_VERSION} clean package
    
  5. You’re done!
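
As a convenience, steps 3 and 4 can be scripted so the Spark version is captured automatically. A rough sketch; the grep pattern is an assumption about the spark-submit --version banner, so double-check the value it extracts:

cd /opt/livy-0.2.0
# capture the first x.y.z version number printed by spark-submit (it writes to stderr)
SPARK_VERSION=$(spark-submit --version 2>&1 | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -1)
echo "Building Livy against Spark ${SPARK_VERSION}"
/usr/local/apache-maven/apache-maven-3.0.4/bin/mvn -DskipTests=true -Dspark.version=${SPARK_VERSION} clean package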

Steps to Control Livy

Get Status

ps -eaf | grep livy

If Livy is running, it will be listed like the following:

root      9379     1 14 18:28 pts/0    00:00:01 java -cp /opt/livy-0.2.0/server/target/jars/*:/opt/livy-0.2.0/conf:/etc/hadoop/conf: com.cloudera.livy.server.LivyServer

Start

Note: Run as Root

cd /opt/livy-0.2.0/
export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/livy-server start

 

Once started, the Livy Server can be called with the following host and port:

http://localhost:8998

Stop

Note: Run as Root

cd /opt/livy-0.2.0/
./bin/livy-server stop

Testing Livy

  1. Create a new Livy Session
    1. Curl Command
      curl -H "Content-Type: application/json" -X POST -d '{"kind":"spark"}' -i http://localhost:8998/sessions
      
    2. Output
      HTTP/1.1 201 Created
      Date: Wed, 02 Nov 2016 22:38:13 GMT
      Content-Type: application/json; charset=UTF-8
      Location: /sessions/1
      Content-Length: 81
      Server: Jetty(9.2.16.v20160414)
      
      {"id":1,"owner":null,"proxyUser":null,"state":"starting","kind":"spark","log":[]}
      
  2. View Current Livy Sessions
    1. Curl Command
      curl -H "Content-Type: application/json" -i http://localhost:8998/sessions
      
    2. Output
      HTTP/1.1 200 OK
      Date: Tue, 08 Nov 2016 02:30:34 GMT
      Content-Type: application/json; charset=UTF-8
      Content-Length: 111
      Server: Jetty(9.2.16.v20160414)
      
      {"from":0,"total":1,"sessions":[{"id":0,"owner":null,"proxyUser":null,"state":"idle","kind":"spark","log":[]}]}
      
  3. Get Livy Session Info
    1. Curl Command
      curl -H "Content-Type: application/json" -i http://localhost:8998/sessions/0
      
    2. Output
      HTTP/1.1 200 OK
      Date: Tue, 08 Nov 2016 02:31:04 GMT
      Content-Type: application/json; charset=UTF-8
      Content-Length: 77
      Server: Jetty(9.2.16.v20160414)
      
      {"id":0,"owner":null,"proxyUser":null,"state":"idle","kind":"spark","log":[]}
      
  4. Submit job to Livy
    1. Curl Command
      curl -H "Content-Type: application/json" -X POST -d '{"code":"println(sc.parallelize(1 to 5).collect())"}' -i http://localhost:8998/sessions/0/statements
      
    2. Output
      HTTP/1.1 201 Created
      Date: Tue, 08 Nov 2016 02:31:29 GMT
      Content-Type: application/json; charset=UTF-8
      Location: /sessions/0/statements/0
      Content-Length: 40
      Server: Jetty(9.2.16.v20160414)
      
      {"id":0,"state":"running","output":null}
      
  5. Get Job Status and Output
    1. Curl Command
      curl -H "Content-Type: application/json" -i http://localhost:8998/sessions/0/statements/0
      
    2. Output
      HTTP/1.1 200 OK
      Date: Tue, 08 Nov 2016 02:32:15 GMT
      Content-Type: application/json; charset=UTF-8
      Content-Length: 109
      Server: Jetty(9.2.16.v20160414)
      
      {"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"[I@6270e14a"}}}
      
  6. Delete Session
    1. Curl Command
      curl -H "Content-Type: application/json" -X DELETE -i http://localhost:8998/sessions/0
      
    2. Output
      {"msg":"deleted"}
      

 


Installing SparkR on the Cloudera Quickstart VM

Purpose

SparkR is an extension to Apache Spark that allows you to run Spark jobs from the R programming language. In the case of both Cloudera and MapR, SparkR is not supported and needs to be installed separately. This blog post describes how you can install SparkR on the Cloudera Quickstart VM.

For more information on how to get started with the Cloudera Quickstart VM you can view this blog post:

https://softwaresanders.wordpress.com/2016/10/24/getting-started-with-the-cloudera-quickstart-vm/

Steps

Install R and dependent packages

  1. Install R and dependencies
    yum install R R-devel libcurl-devel openssl-devel
    
  2. Test
    R -e "print(1+1)"
    
Note: R installs to /usr/lib64/R.

Install SparkR

  1. Run R Console
    R
    
  2. Install the Required R Packages
    install.packages("devtools")
    install.packages("roxygen2")
    install.packages("testthat")
    
  3. Close out of the R shell
    quit()
    
  4. Get the Spark Version
    spark-submit --version
    
    1. Example output: 1.6.0
    2. Use the output of this command as the {SPARK_VERSION} placeholder
  5. Run R Console
    R
    
  6. Install the SparkR Packages
    devtools::install_github('apache/spark@v{SPARK_VERSION}', subdir='R/pkg')
    install.packages('sparklyr')
    
  7. Close out of the R shell
    quit()
    
  8. Copy the SparkR files from the Spark source distribution
    cd /tmp/
    wget https://github.com/apache/spark/archive/v{SPARK_VERSION}.zip
    unzip v{SPARK_VERSION}.zip
    cd spark-{SPARK_VERSION}
    cp -r R /usr/lib/spark/
    cd bin
    cp sparkR /usr/lib/spark/bin/
    
  9. Run Dev Install
    cd /usr/lib/spark/R/
    sh install-dev.sh
    
  10. Create a new file “/usr/bin/sparkR” and set the contents to be:
    #!/bin/bash
    # Autodetect JAVA_HOME if not defined
    . /usr/lib/bigtop-utils/bigtop-detect-javahome
    exec /usr/lib/spark/bin/sparkR "$@"
    
  11. Finish install
    sudo chmod 755 /usr/bin/sparkR
    
  12. You’re done!

Test SparkR

  • Test from R Console
    1. Open the R Console
      R
      
    2. Execute the following commands:
      library(SparkR)
      library(sparklyr)
      Sys.setenv(SPARK_HOME='/usr/lib/spark')
      Sys.setenv(SPARK_HOME_VERSION='1.6.0')
      sc = spark_connect(master = "yarn-client")
      
    3. If everything runs without errors, you know it’s working!
  • Test from SparkR Console
    1. Open the SparkR Console
      sparkR
      
    2. Verify the Spark Context is available with the following command:
      sc
      
    3. If the sc variable is listed then you know it’s working!
  • Sample code you can run to test more
    rdd = SparkR:::parallelize(sc, 1:5)
    SparkR:::collect(rdd)
    

Upgrading from Java7 to Java8 on the Cloudera Quickstart VM

Purpose

The purpose of this post is to describe how to set Java8 as the version of Java used by the Cloudera Quickstart VM and by Hadoop. The reason you might want to do this is so that you can run Spark jobs that use Java8 libraries and features (like lambda expressions).

For more information you can follow the Guide to the Cloudera Quickstart VM post:

https://softwaresanders.wordpress.com/2016/10/24/getting-started-with-the-cloudera-quickstart-vm/

High Level Steps

  1. Shut down services
  2. Upgrade to Java8
  3. Update configurations to use Java8
  4. Restart services

In depth Steps

  1. SSH to the Cloudera Quickstart VM
  2. Stop Hadoop Services
    • If you’re using Cloudera Manager:
      1. Login to the Cloudera Manager
      2. Stop the Cloudera Management Services
      3. Stop the Cluster
      4. SSH into the machine
      5. Login as root
        sudo su
        
      6. Stop the Cloudera SCM Services from the command line
        service cloudera-scm-agent stop
        service cloudera-scm-server stop
        
    • If you’re not using Cloudera Manager:
      1. SSH into the machine
      2. Login as root
        sudo su
        
      3. Execute the stop service commands
        service hadoop-hdfs-datanode stop
        service hadoop-hdfs-journalnode stop
        service hadoop-hdfs-namenode stop
        service hadoop-hdfs-secondarynamenode stop
        service hadoop-httpfs stop
        service hadoop-mapreduce-historyserver stop
        service hadoop-yarn-nodemanager stop
        service hadoop-yarn-proxyserver stop
        service hadoop-yarn-resourcemanager stop
        service hbase-master stop
        service hbase-regionserver stop
        service hbase-rest stop
        service hbase-solr-indexer stop
        service hbase-thrift stop
        service hive-metastore stop
        service hive-server2 stop
        service impala-catalog stop
        service impala-server stop
        service impala-state-store stop
        service oozie stop
        service solr-server stop
        service spark-history-server stop
        service sqoop2-server stop
        service sqoop-metastore stop
        service zookeeper-server stop
        
  3. Install and Configure JDK 1.8
    1. SSH into the machine
    2. Login as root
    3. Change directory to where you want to place the JDK resources
      1. We’ll assume it’s under /usr/java/
        cd /usr/java/
        
    4. Download the JDK
      wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u60-b27/jdk-8u60-linux-x64.tar.gz
      
    5. Unzip the JDK
      gunzip jdk-8u60-linux-x64.tar.gz
      tar -xvf jdk-8u60-linux-x64.tar jdk1.8.0_60/
      
    6. Edit the /etc/bashrc file to set Java8 as the primary Java
      1. Set the bottom of the file to resemble the following:
        export JAVA_HOME=/usr/java/jdk1.8.0_60
        export JRE_HOME=${JAVA_HOME}
        export JDK_HOME=${JAVA_HOME}
        export ANT_HOME=/usr/local/apache-ant/apache-ant-1.9.2
        export M2_HOME=/usr/local/apache-maven/apache-maven-3.0.4
        export PATH=/usr/local/firefox:/sbin:$JAVA_HOME/bin:$ANT_HOME/bin:$M2_HOME/bin:$PATH
        
    7. Apply the changes (the user ~/.bashrc sources /etc/bashrc, so sourcing it picks up the edit)
      source ~/.bashrc
      
    8. Validate that the correct version of Java is active
      1. Check Java Version
        java -version
        
      2. Expected Output:
        java version "1.8.0_60"
        
  4. Set Configs
    1. Edit the following files
      nano /etc/default/cloudera-scm-server
      nano /etc/default/hadoop
      nano /etc/default/hadoop-0.20-mapreduce
      nano /etc/default/hadoop-hdfs-datanode
      nano /etc/default/hadoop-hdfs-journalnode
      nano /etc/default/hadoop-hdfs-namenode
      nano /etc/default/hadoop-hdfs-secondarynamenode
      nano /etc/default/hadoop-yarn-nodemanager
      nano /etc/default/spark
      nano /etc/default/impala
      nano /etc/default/zookeeper
      nano /etc/default/solr
      
    2. Add the following line to the bottom of each (a loop that appends it to every file in one pass is sketched after these steps)
      export JAVA_HOME=/usr/java/jdk1.8.0_60
      
  5. Restart the Hadoop Services
    • If you’re using Cloudera Manager:
      1. SSH into the machine
      2. Login as root
      3. Start the Cloudera SCM Services from the command line
        service cloudera-scm-agent start
        service cloudera-scm-server start
        
      4. Login to Cloudera Manager
      5. Start the Cluster
      6. Start the Cloudera Management Services
    • If you’re not using Cloudera Manager:
      • A fast way to do this is to simply restart the Virtual Machine. The services are set up to start when the Virtual Machine boots.
  6. Validate that all the services are running with Java8
    1. Print the Hadoop processes
      ps -eaf | grep hadoop
      
    2. You should see all the lines resemble the following:
      • hdfs      5288     1 27 21:53 ?        00:00:04 /usr/java/jdk1.8.0_60/bin/…
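
For step 4 above, the same export line can be appended to all of the /etc/default files in one pass instead of editing each by hand. A minimal sketch; adjust the list to match the files actually present on your VM:

for f in cloudera-scm-server hadoop hadoop-0.20-mapreduce hadoop-hdfs-datanode \
         hadoop-hdfs-journalnode hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode \
         hadoop-yarn-nodemanager spark impala zookeeper solr; do
    # append the Java8 home to each service's defaults file
    echo 'export JAVA_HOME=/usr/java/jdk1.8.0_60' >> /etc/default/$f
done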

Getting Started with the Cloudera Quickstart VM

Purpose

The purpose of this post is to provide instructions on how to get started with the Cloudera Quickstart VM and to cover some of the main things to know about it, including where to find certain configuration files and how to set up a few things that will make your life easier.

About the Cloudera Quickstart VM

Overview

The Cloudera Quickstart VM is a Virtual Machine that comes with a pseudo-distributed installation of Hadoop along with the main services offered by Cloudera, most notably Cloudera Manager and Impala.

Some Requirements

  • Make sure your computer is set up to allow virtualization. This can be enabled in your BIOS on startup.
  • To use the Cloudera Manager, you will need to allocate 10GB of RAM and 2 virtual CPU cores to your VM.
    • The Cloudera Manager comes disabled by default, and all the Hadoop daemons start on boot and run just fine without it, so you don’t absolutely need the Cloudera Manager.

Getting Started

Downloads

General Downloads

http://www.cloudera.com/downloads.html

Latest Quickstart VM

http://www.cloudera.com/downloads/quickstart_vms.html

Official Documentation

https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cloudera_quickstart_vm.html

Importing into VirtualBox

  1. Download the Quickstart VM with the above links
  2. Open VirtualBox
  3. Click on File -> Import Appliance
  4. Select the Quickstart VM you just downloaded
  5. Click Continue
  6. Optional: Double click on the name, and change it to whatever you want.
  7. Click Import
  8. Wait for the machine to import; when it is done, it will be listed in the window, ready to start up

Recommended VirtualBox Configurations

  1. Right click on the Virtual Machine and click Settings
  2. Set up the VM to allow you to copy and paste from that machine to your local machine and vice-versa
    1. Click on General -> Advanced
    2. Set Shared Clipboard to Bidirectional
  3. Set up port forwarding from host port 2222 to guest port 22 to allow SSH to the machine
    1. Click on Network -> Advanced -> Port Forwarding
    2. Add a new entry
      1. Name: 2222
      2. Host Port: 2222
      3. Guest Port: 22

Accessing the VM

SSH’ing to the Machine

Default SSH Credentials: cloudera/cloudera

Host to connect to: localhost

Because of the recommended VirtualBox configuration above, connections to host port 2222 are forwarded to guest port 22, so you connect on port 2222.

Linux/Mac
  1. Open a command line terminal
  2. Use the ssh command to login
    ssh -p 2222 cloudera@localhost
    
  3. Enter the password
Windows
  1. Open putty
  2. Set localhost as the Host Name
  3. Set 2222 as the port
  4. Connection Type: SSH
  5. Click open
  6. Enter the password

Setup password-less SSH (Optional)

  1. Generate a public and private key locally (an ssh-copy-id shortcut for steps 3 to 6 is sketched after this list)
  2. Login to the machine with the instructions above
  3. Create the ~/.ssh directory
    mkdir ~/.ssh
    
  4. Create the file ~/.ssh/authorized_keys
    1. Open file
      nano ~/.ssh/authorized_keys
      
    2. Add your public key to the authorized_keys file
    3. Save the authorized_keys file
  5. Change permissions of .ssh
    chmod 700 ~/.ssh
    
  6. Change permissions of the ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    
  7. Change permissions of the home directory
    chmod 740 ~/
    
  8. Now if you try SSH’ing to the machine, you shouldn’t have to provide the password
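
As referenced in step 1, steps 3 through 6 can also be done in one shot from your local machine with ssh-copy-id, which creates the remote ~/.ssh directory and authorized_keys entry with the right permissions. A sketch using the forwarded SSH port:

# generate an RSA key pair locally if you don't already have one
ssh-keygen -t rsa
# copy the public key to the VM over the forwarded port (enter the cloudera password when prompted)
ssh-copy-id -p 2222 cloudera@localhost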

Copying Files to the VM

SCP
  1. Open a command line terminal
  2. Use the following command:
    scp -P 2222 {PATH_TO_FILE_ON_LOCAL} cloudera@localhost:{DESTINATION_PATH_ON_VM}
    
FileZilla or another FTP App
  1. Open your desired FTP Application
  2. Create a new connection
    1. Host: localhost
    2. Username: cloudera
    3. Password: cloudera
    4. Port: 2222
  3. Connect

Optional Setup Tasks

Configure Apache Spark to Connect to Hive

If you’re intending to use Apache Spark, you will probably also want to connect to Hive using Spark SQL so you can interact with that relational store. To do this you need to include the hive-site.xml file in the Spark configuration so Spark knows how to interact with Hive. If you don’t do this, the app will still run, but you won’t be able to view the same tables you have in Hive and you won’t be able to store data in them. A quick way to verify the link is shown after the steps below.

  1. SSH into the Machine
  2. Login as root
  3. Create a symlink to Link the hive-site.xml in the spark conf directory
    ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
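
To verify the link is picked up, you can run a quick query through the Spark shell. A minimal check, assuming the Quickstart VM’s Spark 1.6 shell exposes a Hive-enabled sqlContext once hive-site.xml is on the conf path:

# list the Hive tables from a non-interactive Spark shell
echo 'sqlContext.sql("SHOW TABLES").show()' | spark-shell --master local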
    

Configure the Apache Spark History Server to allow you to view previously run Spark jobs

If you’re intending to use Apache Spark, you may end up trying to view past runs via the Apache Spark History Server. There is a small issue out of the box with the Quickstart VM where you can’t view past runs because of a permissions issue with the applicationHistory directory in HDFS (/user/spark/applicationHistory): the spark user is not able to read the contents of the directory. You can follow these steps to fix this:

  1. SSH into the Machine
  2. Login as hdfs user
    1. Run “$ sudo su” to login as root, then “$ su hdfs”
  3. Change the permissions of the applicationHistory directory under the spark home directory in hdfs
    hadoop fs -mkdir -p /user/spark/applicationHistory
    hadoop fs -chown spark:spark /user/spark/applicationHistory
    hadoop fs -chgrp spark /user/spark/applicationHistory/*
    
  4. Now when you visit the Apache Spark History Server you will see any past jobs that have run

Using Services

Using Beeline to connect to Hive

Beeline is a command line shell that works with HiveServer2. It is recommended over the legacy hive shell since it supports better security and functionality.

Credentials

cloudera/cloudera

Starting Shell with beeline Command
beeline

This will start the beeline shell.

Note: If you run a command such as “show tables” to list the Hive tables in the currently selected database at this point, you will get the following error:
No current connection

This is because you haven’t actually connected to HiveServer2 yet, so you can’t run Hive commands.

To connect you can run the following command. This will prompt you for credentials.

beeline> !connect jdbc:hive2://localhost:10000

To avoid having to enter credentials each time, you can include the username and password in the connect statement like so:

 
beeline> !connect jdbc:hive2://localhost:10000 cloudera cloudera
Starting Shell with beeline Command and arguments

Instead of having to use the connect command upon starting the beeline shell, you can automatically connect to the HiveServer2 using command line arguments.

beeline -u jdbc:hive2://localhost:10000/default -n cloudera -p cloudera
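
You can also run a single statement non-interactively with Beeline’s -e flag, which is handy for quick checks and scripts:

beeline -u jdbc:hive2://localhost:10000/default -n cloudera -p cloudera -e "SHOW TABLES;"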
Shutting down the Shell
beeline> !quit

URLs and Credentials

Cloudera Manager

URL: http://quickstart.cloudera:7180/cmf/home

Credentials: cloudera/cloudera

Hue

URL: http://quickstart.cloudera:8888/accounts/login/

Credentials: cloudera/cloudera

Resource Manager

URL: http://quickstart.cloudera:8088/cluster

Credentials: None

Job History

URL: http://quickstart.cloudera:19888/jobhistory

Credentials: None

HBase Master UI

URL: http://quickstart.cloudera:60010/master-status

Credentials: None

Oozie UI

URL: http://quickstart.cloudera:11000/oozie/

Credentials: None

Apache Solr

URL: http://quickstart.cloudera:8983/solr/#/

Credentials: None

Apache Spark History

URL: http://quickstart.cloudera:18088/

Credentials: None

MySQL

Host: localhost

Credentials: root/cloudera

Example Connection

$ mysql -u root -p

Enter cloudera when prompted for the password.
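
You can also pass the password and a statement inline for a quick non-interactive check (note there is no space after -p):

$ mysql -u root -pcloudera -e "SHOW DATABASES;"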

Beeline

Host: localhost

Port: 10000

Credentials: cloudera/cloudera

Example Connection

$ beeline -u jdbc:hive2://localhost:10000/default -n cloudera -p cloudera

Useful File System Paths

Configuration Files:

/usr/lib/{SERVICE}/conf (for example /usr/lib/hive/conf and /usr/lib/spark/conf) and /etc/hadoop/conf

 

Command Line Arguments for Java Programs

Introduction

When building Java projects you may need to pass arguments into a program so you can control certain parts of its execution. There is a simple way to do this, and a more complete way that lets you specify arguments dynamically in any order. This post includes examples of how you can support passing arguments into a Java program.

Simple Example

The simplest method of accepting arguments in your application is to use the args String array that the main method of your Main class receives.

If you want to use this method, it’s as simple as assuming that each desired argument is at a specific position within the array.

Code

com.example.args_example.java.Main.java

package com.example.args_example.java;
public class Main {
    public static void main(String[] args) {
        // args holds the positional arguments in the order they were passed,
        // e.g. args[0] would be the first argument.
        System.out.println(java.util.Arrays.toString(args));
    }
}

Running

$ java -cp args-example.jar com.example.args_example.java.Main arg1 arg2 arg3

[arg1, arg2, arg3]

Disadvantages

The simple case is great because it doesn’t require a lot of code and you get exactly what you need from what you’re given right off the bat. However, there are limitations to this approach. The arguments that you provide need to be in a specific order and you can’t optionally drop any arguments without severely complicating your arguments parser.

Complete Example

If you come across the case where you would like to allow the user to provide optional arguments, supply the arguments in any order, and have them validated, then this approach is more your speed.

Here, we’re building out a separate class to store the arguments and provide specialized functions for its validation and retrieval.

com.example.args_example.java.Main.java

package com.example.args_example.java;
import org.kohsuke.args4j.CmdLineException;
import java.util.List;
public class Main {
    public static void main(String[] args) throws CmdLineException {
        List<String> argsList = java.util.Arrays.asList(args);
        if(argsList.contains("-help") || argsList.contains("--help")) {
            MainArgs.printUsage();
            System.exit(0);
        }
        MainArgs mainArgs = new MainArgs(args);
        System.out.println(mainArgs.toString());
    }
}

com.example.args_example.java.MainArgs.java

package com.example.args_example.java;

import org.kohsuke.args4j.CmdLineException;
import org.kohsuke.args4j.CmdLineParser;
import org.kohsuke.args4j.Option;

public class MainArgs {

    @Option(name="-strArg", usage="Example of a String Argument. (Required)", required = true)
    private String strArg;

    @Option(name="-intArg", usage="Example of an Integer argument.")
    private int intArg;

    @Option(name="-bolArg", usage="Example of a Boolean argument.")
    private boolean bolArg;

    public MainArgs() {}

    public MainArgs(String... args) throws CmdLineException {
        CmdLineParser parser = getCmdLineParser();
        try {
            parser.parseArgument(args);
        } catch (CmdLineException e) {
            System.err.println(e.getMessage());
            printUsage();
            throw e;
        }
    }

    public CmdLineParser getCmdLineParser() {
        return new CmdLineParser(this);
    }

    public static void printUsage() {
        CmdLineParser parser = new MainArgs().getCmdLineParser();
        parser.printUsage(System.err);
    }

    public String getStrArg() {
        return strArg;
    }

    public void setStrArg(String strArg) {
        this.strArg = strArg;
    }

    public int getIntArg() {
        return intArg;
    }

    public void setIntArg(int intArg) {
        this.intArg = intArg;
    }

    public boolean isBolArg() {
        return bolArg;
    }

    public void setBolArg(boolean bolArg) {
        this.bolArg = bolArg;
    }

    @Override
    public String toString() {
        return "MainArgs{" +
                "strArg='" + strArg + '\'' +
                ", intArg=" + intArg +
                ", bolArg=" + bolArg +
                '}';
    }
}

Running

$ java -cp args-example.jar com.example.args_example.java.Main -help

-bolArg : Example of a Boolean argument. (default: false)
-intArg N : Example of an Integer argument. (default: 0)
-strArg VAL : Example of a String Argument. (Required)

$ java -cp args-example.jar com.example.args_example.java.Main

Option “-strArg” is required
-bolArg : Example of a Boolean argument. (default: false)
-intArg N : Example of an Integer argument. (default: 0)
-strArg VAL : Example of a String Argument. (Required)

$ java -cp args-example.jar com.example.args_example.java.Main -strArg test

MainArgs{strArg=’test’, intArg=0, bolArg=false}

$ java -cp args-example.jar com.example.args_example.java.Main -strArg test -intArg 100

MainArgs{strArg=’test’, intArg=100, bolArg=false}

$ java -cp args-example.jar com.example.args_example.java.Main -strArg test -intArg 100 -bolArg

MainArgs{strArg=’test’, intArg=100, bolArg=true}
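
Note: the complete example depends on the args4j library, so if args-example.jar is not built as a fat/shaded jar, the args4j jar must also be on the classpath. A sketch, assuming args4j 2.33 has been downloaded next to the example jar:

$ java -cp args-example.jar:args4j-2.33.jar com.example.args_example.java.Main -strArg test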

Equivalent in other Languages

Scala

https://softwaresanders.wordpress.com/2016/10/11/command-line-arguments-for-scala-programs/

Python

To Be Created

Open Data Sets

Introduction

If you’re like me, whenever you’re working with a new piece of software or product that works with data, you like to actually get your hands on it and work with it. Sadly, this sometimes requires that you have data. You can create your own data, but that takes needless time away from your discovery work. Not only that, but if you want to practice data analytics or machine learning then you certainly need a pre-existing dataset. This is what led me to search for free data sets. Here is a list of some of the best places I’ve found to get open data sets.

Datasets

US Government Data

Description

The US Government provides data about business, agriculture, education, energy, and other topics. Some datasets are specific to a city and some are more general.

URL

https://data.gov

Kaggle Datasets

Description

Kaggle is a data science competition website where people compete to create the best predictive model on a variety of datasets. Along with this, they provide a repository of their datasets.

URL

https://www.kaggle.com/datasets

Sean Lahman – Baseball Database

Description

A free relational database of individual and team baseball statistics covering the game from 1871 to the early 2000s.

URL

http://www.seanlahman.com/baseball-archive/statistics/

Command Line Arguments for Scala Programs

Introduction

When building Scala projects you may need to pass arguments into a program so you can control certain parts of its execution. There is a simple way to do this, and a more complete way that lets you specify arguments dynamically in any order. This post includes examples of how you can support passing arguments into a Scala program.

Simple Example

The simplest method of accepting arguments in your application is to use the args String array that the main method of your Main object receives.

If you want to use this method, it’s as simple as assuming that each desired argument is at a specific position within the array.

Code

com.example.args_example.scala.Main.scala

package com.example.args_example.scala
object Main {
    def main(args: Array[String]): Unit = {
        // args holds the positional arguments in the order they were passed,
        // e.g. args(0) would be the first argument.
        println(args.mkString(", "))
    }
}

Running

$ scala -classpath args-example.jar com.example.args_example.scala.Main arg1 arg2 arg3

arg1, arg2, arg3

Disadvantages

The simple case is great because it doesn’t require a lot of code and you get exactly what you need from what you’re given right off the bat. However, there are limitations to this approach. The arguments that you provide need to be in a specific order and you can’t optionally drop any arguments without severely complicating your arguments parser.

Complete Example

If you come across the case where you would like to allow the user to provide optional arguments, supply the arguments in any order, and have them validated, then this approach is more your speed.

Here, we’re building out a separate class to store the arguments and provide specialized functions for its validation and retrieval.

com.example.args_example.scala.Main.scala

package com.example.args_example.scala
object Main {
    def main(args: Array[String]): Unit = {
        if (args.contains("-help") || args.contains("--help")) {
            println(MainArgs.argsUsage)
            System.exit(0)
        }
        val mainArgs = MainArgs.parseJobArgs(args.toList)
        if (mainArgs == null) {
            println(MainArgs.argsUsage)
            System.exit(-1)
        }
        mainArgs.validate()
        println(mainArgs)
    }
}

com.example.args_example.scala.MainArgs.scala

package com.example.args_example.scala
import java.security.InvalidParameterException
object MainArgs {
    val argsUsage = s"MainArgs Usage: \n" +
    s"\t[-strArg string (description=Example of a String Argument. (Required))]\n" +
    s"\t[-intArg integer (description=Example of a Integer Argument.)]\n" +
    s"\t[-bolArg (description=Example of a Boolean Argument.)]\n" +
    s"\n"
    case class JobArgs(strArg: String = null, intArg: Int = 0, bolArg: Boolean = false) {
        override def toString(): String = {
            s"MainJobArgs(\n" +
            s"\tstrArg=$strArg, \n" +
            s"\tintArg=$intArg, \n" +
            s"\tbolArg=$bolArg \n" +
            s")"
        }
        def validate(): Unit = {
            val invalidMessageList = new java.util.ArrayList[String]()
            //code to ensure that strArg is required
            if(strArg == null) {
                 invalidMessageList.add("-strArg needs to be provided")
            }
            if (invalidMessageList.size() > 0) {
                 throw new InvalidParameterException("Invalid Arguments: " + invalidMessageList + "\n" + argsUsage)
            }
        }
    }
    def parseJobArgs(args: List[String], jobArgs: JobArgs = JobArgs()): JobArgs = {
        args.toList match {
            case Nil => jobArgs
            case "-strArg" :: value :: otherArgs => parseJobArgs(otherArgs, jobArgs.copy(strArg = value))
            case "-intArg" :: value :: otherArgs => parseJobArgs(otherArgs, jobArgs.copy(intArg = value.toInt))
            case "-bolArg" :: otherArgs => parseJobArgs(otherArgs, jobArgs.copy(bolArg = true))
            case option :: tail => println("Unknown option " + option); return null;
        }
    }
}

Running

$ scala -classpath args-example.jar com.example.args_example.scala.Main -help

MainArgs Usage:
[-strArg string (description=Example of a String Argument. (Required))]
[-intArg integer (description=Example of a Integer Argument.)]
[-bolArg (description=Example of a Boolean Argument.)]

$ scala -classpath args-example.jar com.example.args_example.scala.Main

Exception in thread “main” java.security.InvalidParameterException: Invalid Arguments: [-strArg needs to be provided]
MainArgs Usage:
[-strArg string (description=Example of a String Argument. (Required))]
[-intArg integer (description=Example of a Integer Argument.)]
[-bolArg (description=Example of a Boolean Argument.)]

$ scala -classpath args-example.jar com.example.args_example.scala.Main -strArg test

MainJobArgs(
strArg=test,
intArg=0,
bolArg=false
)

$ scala -classpath args-example.jar com.example.args_example.scala.Main -strArg test -intArg 100

MainJobArgs(
strArg=test,
intArg=100,
bolArg=false
)

$ scala -classpath args-example.jar com.example.args_example.scala.Main -strArg test -intArg 100 -bolArg

MainJobArgs(
strArg=test,
intArg=100,
bolArg=true
)

Equivalent in other Languages

Java

To Be Created

Python

To Be Created