Stream data processing using Apache Spark
The idea. Much real-world data arrives sequentially over time, whether messages from social media users or time series from wearable sensors. Under these conditions, instead of waiting until all the data have been acquired before carrying out our analysis, we can use streaming algorithms to identify patterns as they emerge and make more timely forecasts and decisions. A common approach is to train a machine learning model on static data and then use the learned model to make predictions on the incoming data stream. In this paper we study one of the streaming machine learning algorithms within Spark MLlib: streaming k-means clustering.
Analysis platform Apache Spark. Apache Spark is an open-source framework designed for distributed processing of large data sets using parallel computing on a cluster. Spark Streaming is an extension of Apache Spark for scalable, fault-tolerant processing of data streams in real time. Spark Streaming ingests data over fixed time intervals from sources such as Kafka, Flume, ZeroMQ, Kinesis, and TCP sockets. Processed data can be written to a file system or a database, or displayed on real-time dashboards; machine learning and graph processing algorithms can also be applied to the data stream. MLlib is a machine learning library that implements a number of common machine learning and statistical algorithms to simplify large-scale machine learning pipelines. A key advantage of Spark is that its machine learning library (MLlib) and its streaming library (Spark Streaming) are built on the same Spark core architecture for distributed computation. This makes it easy to add extensions that use and combine the components in new ways.
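As a minimal illustration of this processing model (a sketch only; the host, port, and batch interval below are placeholders, not values from the paper), a Spark Streaming application can read text from a TCP socket in fixed micro-batches and count words in each batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketStreamExample {
  def main(args: Array[String]): Unit = {
    // One batch of input is collected every 5 seconds.
    val conf = new SparkConf().setAppName("SocketStreamExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Ingest lines of text from a TCP socket (host and port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Count words in each micro-batch and print the result.
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The same program could write each batch to a file system or database instead of printing it, since a DStream exposes output operations analogous to those of an RDD.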
K-means algorithm. The objective of k-means is to partition the data points into K clusters. The algorithm can be described as follows: choose initial centroids, i.e. points that serve as the centers of the clusters; then iterate: assign each point to the nearest centroid, recompute the centroids, and calculate the distance between the old and new cluster centers; if this distance is greater than a convergence threshold (a constructor parameter), return to the beginning of the loop; finally, group the input data by the nearest cluster center and return the result.
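The following self-contained Scala sketch of this loop (illustrative only; the initialization strategy and the names kMeans, dist, and mean are our own, not from the paper) makes the stopping rule explicit:

import scala.util.Random

object KMeansSketch {
  type Point = Array[Double]

  // Euclidean distance between two points.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Component-wise mean of a set of points.
  def mean(pts: Seq[Point]): Point =
    pts.transpose.map(col => col.sum / pts.length).toArray

  def kMeans(data: Seq[Point], k: Int, epsilon: Double): Array[Point] = {
    // Choose initial centroids: k random points from the data.
    var centers = Random.shuffle(data.toList).take(k).toArray
    var moved = Double.MaxValue
    while (moved > epsilon) {
      // Assign each point to its nearest centroid.
      val clusters =
        data.groupBy(p => centers.indices.minBy(i => dist(p, centers(i))))
      // Recompute centroids; keep the old one if a cluster is empty.
      val newCenters = centers.indices.map(i =>
        clusters.get(i).map(mean).getOrElse(centers(i))).toArray
      // Stop once no center moves farther than the threshold epsilon.
      moved = centers.zip(newCenters).map { case (c, n) => dist(c, n) }.max
      centers = newCenters
    }
    centers
  }
}

A call such as kMeans(points, k = 3, epsilon = 1e-4) returns the final centroids, after which each input point can be grouped by its nearest center.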
K-means on Spark Streaming and MLlib. Streaming k-means provides methods for configuring a streaming k-means analysis, training the model on streaming data, and using the model to make streaming predictions. K-means clustering on streaming data supports both a mini-batch update and a forgetful update. The basic assumption is that all streaming data points belong to one of several clusters, and we want to learn the identity of these clusters (the k-means model) as new data become available; given this assumption, all data points must have the same dimension. In the mini-batch algorithm, we update the cluster identities for each batch of data and keep a running count of the number of data points per cluster, so that all data points are weighted equally; the number of data points per batch can be arbitrary. With forgetfulness, each new batch of data contributes with a weight, so that more recent data are weighted more heavily. Because the weighting is applied per batch (i.e. per time window) rather than per data point, the number of data points in each batch should be approximately constant.
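MLlib expresses this with a decay factor a: for each batch, a center c_t with running point count n_t is combined with the batch mean x_t over m_t points as c_{t+1} = (c_t·n_t·a + x_t·m_t) / (n_t·a + m_t), with n_{t+1} = n_t + m_t; a = 1 weights all data equally, while a = 0 uses only the most recent batch. A minimal Scala sketch of training and prediction with MLlib's StreamingKMeans follows (the directories, K, data dimension, and decay value are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingKMeansSketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Training vectors arrive as text files, one "[1.2,0.4]" per line.
    val trainingData = ssc.textFileStream("/tmp/kmeans/train").map(Vectors.parse)
    // Test points arrive as labeled points, e.g. "(1.0,[1.3,0.5])".
    val testData = ssc.textFileStream("/tmp/kmeans/test").map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(3)                  // number of clusters
      .setDecayFactor(0.5)      // forgetfulness: weight given to older batches
      .setRandomCenters(2, 0.0) // 2-dimensional data, zero initial weight

    // Update the cluster centers on each training batch,
    // and print predicted cluster indices for the test stream.
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Because trainOn and predictOnValues operate on ordinary DStreams, the same model can consume any of the stream sources listed above in place of textFileStream.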
Choice of technology. To solve the problem, the following technologies were selected: Apache Spark (Spark Streaming, MLlib), and the Scala and Python languages for the implementation of data processing and the visualization of results.
Conclusion. In this article we examined stream processing on the Apache Spark framework, using Spark Streaming and Spark MLlib to build a k-means clustering model. Testing the example, implemented in the Scala language with the Scala IDE on Apache Spark 1.6.0, demonstrated the performance of Apache Spark in stream processing.