
R Programming – Useful Functions

Description

The R programming language is very useful for anyone who does a lot of data science related work. That work can range from reading in and analyzing data, to plotting it, to manipulating it into the form you want.

The purpose of this blog entry is simply to present some of the most commonly used and most useful functions in R.

 

Useful Functions

General

Comments
#Comments can be created with the '#' character
Set Working Directory
setwd("/path/to/dir")
Summarize Data
#Summarize Entire Data Set
summary(data)

#Summarize Single Field in Data Set
summary(data$field)
Print to Console
print("Print String")
Install Package
install.packages("package_name")
Import Package
library(package_to_import_no_quotes)
Clear Console
Ctrl + L

Reading in Data

CSV
data = read.csv("file.csv")

Writing out Data

CSV
write.csv(data, file = "fileName.csv")

String Operations

String Contains
grepl("regex", string)
String Replace
gsub('regex', 'value_to_replace_with', string)
String toLowerCase
tolower(string)
String Concatenate
paste("value1", "value2")
"value1 value2"

paste("value1", "value2", sep="-")
"value1-value2"

Convert Data Types

To String
output = as.character(value)
To Numeric
output = as.numeric(value)
To Integer
output = as.integer(value)

Plotting

General Plot
plot(data$x, data$y)
Histogram
hist(data$x)

Data Manipulation

Filter
filteredData = data[data$test_field == "wanted_value", ]
Create New Field in Data Frame based on existing fields
data$newField = data$field1 + data$field2
Group By (see the worked example at the end of this section)
groupedByData = aggregate(FIELD_TO_HAVE_FUN_EXECUTE_ON ~ GROUP_BY_FIELD, data=dataToGroupBy, FUN=sum)
Merge Data
mergedData = merge(xData, yData, by.x="xField", by.y="yField")
Combine 2 Lists
list1 = c(1,2,3)
list2 = c(4,5,6)
newList = c(list1, list2)
Combine by Column
cbind(1, 1:7)

[,1] [,2]
[1,]   1   1
[2,]   1   2
[3,]   1   3
[4,]   1   4
[5,]   1   5
[6,]   1   6
[7,]   1   7

Combine by row
rbind(1, 1:7)

[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   1   1   1   1   1   1   1
[2,]   1   2   3   4   5   6   7

Rename Column
names(data)[which(names(data) %in% c("old_field_name"))] = "new_field_name"
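
To make the group-by and merge snippets above concrete, here is a minimal worked example; the data frames and field names (salesData, regionInfo, region, sales, manager) are hypothetical and used only for illustration:

salesData = data.frame(region = c("east", "east", "west"), sales = c(10, 20, 5))
regionInfo = data.frame(regionName = c("east", "west"), manager = c("Ann", "Bob"))

#Sum the sales field within each region
salesByRegion = aggregate(sales ~ region, data = salesData, FUN = sum)

#Attach the manager for each region by merging on the region fields
mergedData = merge(salesByRegion, regionInfo, by.x = "region", by.y = "regionName")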

Guide to Apache HBase

Description

Apache HBase is a column-family database used to provide structure to data stored in Apache Hadoop HDFS. It is based on Google Bigtable.

 

Start Shell

$ hbase shell

 

Useful Commands

Help

hbase> help

Show All Tables

hbase> list

Describe Table

hbase> describe '{table_name}'

Create HBase Table

hbase> create '{table_name}', '{column_family_name}'

List ALL contents of Table

hbase> scan '{table_name}'

Manually Scan Table with Row Count Filter

hbase> scan '{table_name}', {FILTER => org.apache.hadoop.hbase.filter.PageFilter.new({number_of_rows_count})}

Count Number of Rows in Table

hbase> count '{table_name}'

Insert Item into Table

hbase> put '{table_name}', '{row_key}', '{column_family}:{column}', '{value}'
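
For example, assuming a table named 'users' with a column family 'info' (both names are hypothetical), the following puts the value 'Alice' into the 'name' column of the row with row key 'row1':

hbase> put 'users', 'row1', 'info:name', 'Alice'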

Truncate Table

hbase> truncate '{table_name}'

Delete Table

hbase> disable '{table_name}'
hbase> drop '{table_name}'

View Snapshots

hbase> list_snapshots

Take a Snapshot of a Table

hbase> snapshot '{table_name}', '{snapshot_name}'

Restore a Snapshot

hbase> disable '{table_name}'
hbase> restore_snapshot '{snapshot_name}'
hbase> enable '{table_name}'

Stop Command Line Shell

hbase> quit

OR

hbase> exit

OR

Ctrl + Z

Guide to Apache Hive

Description

Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.

This technology is primarily used on top of Apache Hadoop.
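
For example, HiveQL can project structure onto delimited files already sitting in HDFS with a statement along these lines (the table, field, and path names are hypothetical):

hive> create external table web_logs (log_ts string, message string) row format delimited fields terminated by ',' location '/path/to/logs';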

 

Start Command Line Shell

$ hive

 

Useful Commands

Show Databases/Schemas

hive> show databases;

Use Database/Schema

hive> use {database_name};

Show Tables in Database/Schema

hive> show tables;

Show Table Partitions

hive> show partitions {table_name};

 

Describe Table

hive> desc {table_name};

Run Hive Query

hive> select * from {schema}.{table_name} where hive_entry_timestamp > "{starting_timestamp}" and hive_entry_timestamp <= "{ending_timestamp}" limit 100;

Stop Command Line Shell

hive> quit;

OR

hive> exit;

OR

Ctrl + Z

Guide to Apache Oozie

Description

Apache Oozie is a workflow scheduler and manager for an Apache Hadoop cluster.

 

Example Use Cases

  • You want to extract log data to get usage statistics on how people are using a Web App.
  • You want to call a service and load that data into a Data Store.
  • You want to transform data in one table into another form and load that data into a new table.

 

Structure of an Oozie Job

An Oozie job has three levels of structure (at least at the time of writing this blog entry): a Bundle (which contains one to many Coordinators), a Coordinator (which contains one to many Workflows), and a Workflow (which contains the actual code that performs some action).

The base level is the Workflow. As mentioned above, this element contains the actual commands and scripts for doing some operation. This can be anything from running HDFS commands to running Hive queries to running Pig queries. You can also run other Workflows from within a Workflow to accomplish whatever you need to do. You should also think of a Workflow as a one-time run; in other words, you can't schedule it on its own to run every hour or so. That type of functionality is handled at the Coordinator level.

Going up a level we get to the Coordinator. As mentioned above, this element contains one to many Workflows. A Coordinator is useful for scheduling Workflows to run at certain times. For example, you can configure a Coordinator to run a Workflow every hour if you have data that needs to be refreshed or recalculated on that schedule.

At the final level there is the Bundle. As mentioned above, a Bundle contains one to many Coordinators and is useful for grouping Coordinators that are related in some way. For example, you may have one Coordinator that pulls data into Hadoop from somewhere, another that transforms it in some way, and a final one that pushes the result out to some external data source.
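
To make these levels more concrete, here is a minimal sketch of what a Workflow definition and a Coordinator definition can look like. The names, paths, and schema versions below are illustrative assumptions, not files from a real job:

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="make-dir"/>
    <action name="make-dir">
        <!-- fs action: runs HDFS commands such as mkdir or delete -->
        <fs>
            <mkdir path="${nameNode}/path/to/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>fs action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

<!-- A Coordinator points at a Workflow and adds a schedule, here hourly -->
<coordinator-app name="example-coord" frequency="${coord:hours(1)}"
    start="2015-01-01T00:00Z" end="2015-12-31T00:00Z" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${nameNode}/path/to/workflow/dir</app-path>
        </workflow>
    </action>
</coordinator-app>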

 

Oozie Documentation

https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html

 

Useful Commands (from command line)

Help

$ oozie help

Validate Workflow (Ensure that there aren’t any errors in a Workflow file)

$ oozie validate /path/to/workflow.xml

Start Job (will return a job_id)

$ oozie job -config /path/to/job.properties -run
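
The job.properties file referenced above is a plain Java properties file. A minimal sketch might look like the following; the host names, ports, and application path are placeholders, and nameNode/jobTracker are simply conventional parameter names referenced from the Workflow:

nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8032
oozie.wf.application.path=${nameNode}/path/to/workflow/dir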

Job Status

$ oozie job -info {job_id}

Logs

$ oozie job -log {job_id}

Suspend Job

$ oozie job -suspend {job_id}

Restart Suspended Job

$ oozie job -resume {job_id} 

Kill Job

$ oozie job -kill {job_id}

 

Oozie Error Codes Explained

While running Oozie jobs, I’ve noticed that certain error codes come up in the logs without much explanation of what the problem actually is. Here is what I’ve found some of these error codes to mean after debugging.

Error 9: A SQL error occurred, such as trying to create a table that already exists.
Error 10: There are undefined parameter variables in your Workflow, or a SQL runtime exception occurred, such as a field not existing in a temporary table.
Error 11: There are comments (/* */) in the SQL file being executed, or a parameter used in the SQL was not passed in through the Workflow.

 

Useful Links

https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html

Linux RPM Installer

What is RPM

RPM (Red Hat Package Manager) is a tool used to help manage the programs installed on your machine and provide an interface to install more. It is used to install .rpm files.

Verify Installation

Run the following command. RPM is ready to use if the output shows a path to an executable file; if the output is blank, RPM still needs to be installed.


$ which rpm

How to Install RPM Installer

Ubuntu/Debian


$ sudo apt-get install rpm

Red Hat/CentOS/Fedora


$ sudo yum install rpm.x86_64

If the above package isn’t compatible with your machine, you can search for one that might work using the command:


$ sudo yum search rpm

 

How To Use

List installed packages


$ rpm -qa

Install Package


$ sudo rpm -Uhv nginx-release-rhel-5-0.el5.ngx.noarch.rpm

Uninstall package


$ sudo rpm -ev {name of package}

Install and Setup MongoDB

How to Install

General installation documentation:

http://docs.mongodb.org/manual/installation/

Linux

Ubuntu/Debian

$ sudo apt-get install mongodb-server

Red Hat/CentOS/Fedora

$ sudo yum install mongo-10gen mongo-10gen-server

Windows

  1. Download from: http://www.mongodb.org/downloads
  2. Unzip and place somewhere
    • I recommend placing it at the root of the C drive:  C:\
  3. Add to environment variables
    • For help on how to do this see: https://softwaresanders.wordpress.com/2013/12/19/windows-environment-variables/

Mac

$ brew install mongodb


How To Use (from command line)

Start MongoDB

To start up MongoDB

$ mongod

To start up MongoDB with a specific db path specified

$ mongod --dbpath /path/to/file

To start up MongoDB on a specific port

$ mongod --port 12345

Stop MongoDB

In the terminal where you started MongoDB, press:

CTRL + C

If mongod is running in the background instead, kill the process:

$ kill <mongod-process-id>
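
One way to find the mongod process id, assuming pgrep is available on your system:

$ pgrep mongod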

Start MongoDB Command Line

After you have started MongoDB, run the following command in another window:

$ mongo

Show Databases

First run “Start MongoDB Command Line”

> show dbs
> show databases

Switch To/Create Database

First run “Start MongoDB Command Line”

> use <database-name>

Delete Database

First run “Start MongoDB Command Line”

> use <database-name>
> db.dropDatabase()

Show Collections

First run “Start MongoDB Command Line” and switch to a specific DB

> show collections

Drop a Collection

First run “Start MongoDB Command Line” and switch to a specific DB

> db.<collection>.drop()

Insert Document into a Collection

First run “Start MongoDB Command Line” and switch to a specific DB

> db.<collection>.save( <some-json> )

Note: if the collection you specified doesn’t exist it will be created automatically
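
As a concrete sketch (the collection name and fields here are hypothetical):

> db.users.save( { "name" : "Alice", "age" : 30 } )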

Query a Collection

First run “Start MongoDB Command Line” and switch to a specific DB

> db.<collection>.find()

To have the response look nice:

> db.<collection>.find().pretty()
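
A query document can also be passed to find() to filter the results; for example, with a hypothetical field name:

> db.<collection>.find( { "name" : "some_value" } ).pretty()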

Remove Document from a Collection

First run “Start MongoDB Command Line” and switch to a specific DB
To remove all documents from a collection:

> db.<collection>.remove()

To remove only the documents matching a query:

> db.<collection>.remove( <some-json> )

Example:
> db.<collection>.remove( { "_id" : ObjectId("123456789101112131415171") })

Helpful Links:

http://docs.mongodb.org/manual/

http://docs.mongodb.org/v2.2/tutorial/getting-started-with-the-mongo-shell/

Finding out who is logged in to a Linux machine

Use the “who” command:


[username@1111111~]$ who

{logged-in username 1} pts/6 {date logged in} (user 1 host)

{logged-in username 2} pts/6 {date logged in} (user 2 host)