Sunday, March 20, 2016

Getting started with Spark in Python

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).
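
To make the MapReduce idea concrete, here is a minimal word count sketched with Spark's RDD API in Scala (Spark's native language; the PySpark version is line-for-line analogous). The input path and app name are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Run locally with all cores; on a real cluster the master URL
    // would point at the cluster manager instead.
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    val counts = sc.textFile("input.txt")      // read from distributed (or local) storage
      .flatMap(line => line.split("\\s+"))     // the "map" side: emit one record per word
      .map(word => (word, 1))
      .reduceByKey(_ + _)                      // the "reduce" side: sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}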

Tuesday, March 8, 2016

Apache Spark: Cluster manager and jobs


Cluster manager and jobs vs Databricks

This video gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved. It also compares managing, scheduling, and scaling Spark nodes yourself with running on Databricks.
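
As a rough sketch (names and master URL are illustrative), the part of an application that talks to the cluster manager is the SparkContext: the master URL you give it decides whether the job runs locally, on a standalone cluster, on YARN, or on Mesos.

import org.apache.spark.{SparkConf, SparkContext}

object ClusterDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ClusterDemo")
      // "local[*]" = run in-process; "spark://host:7077" = standalone
      // cluster manager; YARN and Mesos master URLs also work here.
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The driver turns this job into tasks that the cluster manager
    // schedules onto executors.
    val total = sc.parallelize(1 to 1000000).sum()
    println(s"total = $total")

    sc.stop()
  }
}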

Monday, March 7, 2016

Create a Scala project on Spark using Scala IDE

Scala is one of the most exciting languages for programming Big Data. It is a multi-paradigm language that fully supports functional, object-oriented, imperative, and concurrent programming. It is also strongly typed, which makes the types a convenient form of self-documenting code.
Apache Spark is written in Scala, and any library that purports to work on distributed runtimes should at the very least be able to interface with Spark.
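
As a tiny illustration of that multi-paradigm, strongly typed style (the example itself is just a sketch, not Spark code):

// Case classes give lightweight, immutable objects (object-oriented),
// and collection transforms read like the problem statement (functional).
case class Reading(sensor: String, value: Double)

object Demo {
  def main(args: Array[String]): Unit = {
    val readings = List(Reading("a", 1.5), Reading("b", 2.0), Reading("a", 3.5))

    // The type annotation documents the result; the compiler checks it.
    val totalBySensor: Map[String, Double] = readings
      .groupBy(_.sensor)
      .map { case (sensor, rs) => sensor -> rs.map(_.value).sum }

    println(totalBySensor) // e.g. Map(a -> 5.0, b -> 2.0)
  }
}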

Sunday, March 6, 2016

Scala setup on Windows - Spark Scala

Scala is the primary language of Apache Spark, so knowing Scala is a good way to learn Apache Spark.
The Scala language can be installed on any UNIX-like or Windows system. To install Scala on Windows for use with Spark, follow the steps below.


Before installing Scala on your computer, you must install Java. 

Step 1: set up Java on your Windows machine.
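
Once both Java and Scala are installed and on the PATH, a quick smoke test (a hypothetical two-line script, not part of the official steps) is to save the lines below as hello.scala and run them from a command prompt with the Scala script runner:

// Prints which Java and Scala versions the script runner picked up.
println("Java:  " + System.getProperty("java.version"))
println("Scala: " + scala.util.Properties.versionString)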

Saturday, March 5, 2016

Stream data processing by using Apache Spark



The idea: much real-world data arrives sequentially over time, whether as messages from social media users or as time series from wearable sensors. In these settings, rather than waiting for all the data to be acquired before running our analysis, we can use streaming algorithms to identify patterns over time and make more timely forecasts and decisions. One approach is to build a machine learning model on static data and then apply what the model has learned to make predictions on the incoming data stream. In this post we look at one of the streaming machine learning algorithms in Spark MLlib: streaming k-means clustering.
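
As a minimal sketch of how that looks with Spark MLlib's Scala API (directory paths, batch interval, and parameters are illustrative): StreamingKMeans keeps a set of cluster centers and updates them as each new micro-batch arrives, while predictions run on a second stream.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingKMeansDemo").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10)) // one micro-batch every 10 s

    // Training stream: text files dropped into this directory,
    // one vector per line, e.g. "[1.0, 2.0, 3.0]".
    val trainingData = ssc.textFileStream("/tmp/training").map(Vectors.parse)

    // Test stream: "(label, [features])" lines, e.g. "(1.0, [1.0, 2.0, 3.0])".
    val testData = ssc.textFileStream("/tmp/test").map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(3)                   // number of clusters
      .setDecayFactor(1.0)       // 1.0 = weight all batches equally
      .setRandomCenters(3, 0.0)  // data dimension, initial center weight

    model.trainOn(trainingData)  // update the cluster centers on every batch
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}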
