Hadoop is the standard tool for distributed computing across very large data sets, and it is the reason you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that let you use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas published by Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (the Google File System), implemented in Hadoop as HDFS, and a framework for distributed computing (MapReduce).
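To make the MapReduce idea concrete, here is a minimal word-count sketch in Spark's Scala API (the language used throughout this blog); the HDFS paths are placeholders, not part of any real cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Word count: the canonical MapReduce example.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    sc.textFile("hdfs:///input/books")      // read from distributed storage (HDFS)
      .flatMap(_.split("\\s+"))             // "map" phase: emit one record per word
      .map(word => (word, 1))               // key each word with a count of 1
      .reduceByKey(_ + _)                   // "reduce" phase: sum counts per key across the cluster
      .saveAsTextFile("hdfs:///output/wordcounts")

    sc.stop()
  }
}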
Sunday, March 20, 2016
Tuesday, March 8, 2016
Apache Spark: Cluster manager and jobs
Cluster manager and jobs vs Databricks
This video gives a short overview of how Spark runs on clusters, making it easier to understand the components involved: how to manage, schedule, and scale Spark nodes, compared with Databricks.
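As a rough sketch of the pieces involved: the driver program creates a SparkContext, which asks the cluster manager for executors on worker nodes, and work is then scheduled onto those executors as tasks. The master URL and resource settings below are placeholders for a standalone cluster (YARN and Mesos follow the same pattern):

import org.apache.spark.{SparkConf, SparkContext}

object ClusterHello {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ClusterHello")
      .setMaster("spark://master-host:7077")  // standalone cluster manager (placeholder URL)
      .set("spark.executor.memory", "2g")     // resources requested per executor

    val sc = new SparkContext(conf)             // driver registers with the cluster manager
    println(sc.parallelize(1 to 1000000).sum()) // work runs as tasks on the executors
    sc.stop()
  }
}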
Labels:
apache spark,
apache spark scale,
big data,
cloud scale,
cluster,
cluster and jobs,
cluster management,
cluster manager and jobs,
databricks,
manager,
scale,
scaling spark,
schedule
Monday, March 7, 2016
Create a Scala project on Spark using Scala IDE
Scala is one of the most exciting languages for programming Big Data. It is a multi-paradigm language that fully supports functional, object-oriented, imperative, and concurrent programming. It is also strongly typed, which makes the types a convenient form of self-documenting code.
Apache Spark is written in Scala, and any library that purports to work on distributed runtimes should at the very least be able to interface with Spark.
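As a small, self-contained taste of those styles (a plain Scala sketch, not Spark code): a case class gives object-oriented modelling, collection combinators give the functional style, and the static types read as documentation:

case class Reading(sensor: String, value: Double)

object Taste {
  def main(args: Array[String]): Unit = {
    val readings = List(Reading("a", 1.5), Reading("b", -0.2), Reading("a", 3.1))

    // The explicit type, Map[String, Double], documents the data flow.
    val maxBySensor: Map[String, Double] =
      readings
        .groupBy(_.sensor)  // Map[String, List[Reading]]
        .map { case (sensor, rs) => sensor -> rs.map(_.value).max }

    println(maxBySensor) // e.g. Map(a -> 3.1, b -> -0.2)
  }
}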
Sunday, March 6, 2016
Scala setup on Windows - Spark Scala
Scala is the primary language of Apache Spark, so learning Scala is a good way into Apache Spark.
The Scala language can be installed on any UNIX-like or Windows system. To install Scala on Windows for use with Spark, follow the steps below.
Before installing Scala on your computer, you must install Java.
Step 1: Set up Java on your Windows machine.
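Once Java is installed and on your PATH (and Scala after it), a quick sanity check from the Scala REPL, started with the scala command, might look like this; the exact version strings will vary with your install:

// Paste into the Scala REPL to confirm both installations are visible.
println(s"Java version:  ${System.getProperty("java.version")}")
println(s"Scala version: ${scala.util.Properties.versionString}")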
Saturday, March 5, 2016
Stream data processing using Apache Spark
The idea. Much real-world data arrives sequentially over time, whether messages from social media users or time series from wearable sensors. In these conditions we do not have to wait until all the data has been acquired to carry out our analysis: streaming algorithms can identify patterns over time and support more targeted forecasts and decisions. The approach in this case is to create a machine learning model on static data, and then use the learned knowledge of the model to make predictions on the incoming data stream. In this post we look at one of the streaming machine learning algorithms in Spark MLlib: streaming k-means clustering.
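Spark MLlib ships this as StreamingKMeans. Here is a minimal sketch of how it is wired up, assuming vectors arrive as text files dropped into placeholder directories, one vector per line in the "[1.0,2.0,3.0]" format that Vectors.parse reads:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeans")
    val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

    val trainingData = ssc.textFileStream("data/train").map(Vectors.parse)
    val testData     = ssc.textFileStream("data/test").map(Vectors.parse)

    // Cluster centers are updated incrementally as each batch arrives.
    val model = new StreamingKMeans()
      .setK(3)                  // number of clusters
      .setDecayFactor(1.0)      // 1.0 = weight all past batches equally
      .setRandomCenters(3, 0.0) // 3-dimensional data, random initial centers

    model.trainOn(trainingData)       // update the model on the training stream
    model.predictOn(testData).print() // print a cluster index for each test point

    ssc.start()
    ssc.awaitTermination()
  }
}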