Wednesday, August 17, 2016

Deep Learning for Everyone

Summary:  The most important developments in Deep Learning and AI in the last year may not be technical at all, but rather a major change in business model.  In the space of about six months all the majors have made their Deep Learning IP open source, hoping to gain on the competition from the power of the broader developer base and wide adoption.
To say that the last year has been big for Deep Learning is an understatement. There have been some spectacular technical innovations, like Microsoft winning the ImageNet competition with a neural net comprising 152 layers (where 6 or 7 layers is more the norm). But the big action, especially in the last six months, has been in the business model for Deep Learning.

Sunday, March 20, 2016

Getting started with Spark in python

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).
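To make the MapReduce idea concrete, here is the canonical word count as a minimal sketch in the Spark shell's Scala API, which this blog uses elsewhere (the same few lines translate directly to PySpark; README.md is just a placeholder for any text file):

// Word count: flatMap/map play the "map" role, reduceByKey the "reduce" role.
val words  = sc.textFile("README.md").flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)
counts.take(5).foreach(println)   // print a few (word, count) pairs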

Tuesday, March 8, 2016

Apache Spark: Cluster manager and jobs


Cluster manager and jobs vs Databricks

This video gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved, and of how to manage, schedule and scale Spark nodes versus Databricks.
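As a rough illustration (not taken from the video), the cluster manager an application runs on is chosen by the master URL passed to SparkConf; the host name below is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

// "spark://master-host:7077" targets a standalone cluster manager;
// "yarn-client" / "yarn-cluster" target YARN; "local[*]" runs without a cluster.
val conf = new SparkConf()
  .setAppName("ClusterManagerDemo")
  .setMaster("spark://master-host:7077")   // placeholder master URL

val sc = new SparkContext(conf)
println(s"Default parallelism: ${sc.defaultParallelism}")
sc.stop()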

Monday, March 7, 2016

Create a Scala project on Spark using Scala IDE

Scala is one of the most exciting languages for programming Big Data. It is a multi-paradigm language that fully supports functional, object-oriented, imperative and concurrent programming. It is a strongly typed language, which makes for a convenient form of self-documenting code.
Apache Spark is written in Scala, and any library that purports to work on distributed run times should at the very least be able to interface with Spark. 
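One common way to define such a project (not necessarily the exact setup used in this post) is a minimal build.sbt; the Scala and Spark versions below are assumptions matching the 1.x setup used elsewhere on this blog, so adjust them to your cluster:

// build.sbt - minimal Spark project definition (version numbers are placeholders)
name := "spark-scala-example"

version := "0.1.0"

scalaVersion := "2.10.6"

// "provided" keeps Spark itself out of the packaged jar, since the cluster supplies it
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"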

Sunday, March 6, 2016

Scala setup on Windows - Spark Scala

Scala is the primary language of Apache Spark, so learning Scala is a good way to learn Apache Spark.
The Scala language can be installed on any UNIX-like or Windows system. To install Scala on Windows for Spark, follow the steps below.


Before installing Scala on your computer, you must install Java. 

Step 1: Set up Java on your Windows machine.

Saturday, March 5, 2016

Stream data processing by using Apache Spark



The idea: much real-world data arrives sequentially over time, whether as messages from social media users or as readings from wearable sensors. In these conditions we do not have to wait for all the data to be collected before carrying out our analysis; we can use streaming algorithms to identify patterns over time and make more targeted forecasts and decisions. The approach in this case is to build a machine learning model on static data and then use the model's learned knowledge to make predictions on the incoming data stream. In this post we look at the family of streaming machine learning algorithms in Spark MLlib, focusing on streaming k-means clustering.
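A minimal sketch of streaming k-means with Spark MLlib is shown below (the directory paths, k, and the batch interval are placeholders; training and test points are picked up as new files land in the two directories):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingKMeansSketch")
val ssc = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches

// Each training line is a vector like "[1.0,2.0]"; each test line is a labeled point.
val trainingData = ssc.textFileStream("hdfs:///streaming/train").map(Vectors.parse)
val testData = ssc.textFileStream("hdfs:///streaming/test").map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(3)                          // number of clusters
  .setDecayFactor(1.0)              // 1.0 = weight all past data equally
  .setRandomCenters(2, 0.0)         // 2-dimensional data, zero initial weight

model.trainOn(trainingData)          // update cluster centers on every batch
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()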

Saturday, February 27, 2016

Installation Spark Cluster on Windows

How to install Spark Cluster on Windows?

Installing a Spark cluster on Windows is not the same as on Unix. In a Unix terminal you start a standalone master with:

./sbin/start-master.sh
But on Windows it is more involved. Here is how to do it correctly, without mistakes and without spending too much time.
Do it step by step and carefully.
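One way that works (assuming Spark is extracted to c:\spark-1.6.0, since the .sh scripts do not run in the Windows command prompt) is to launch the standalone master and worker classes directly through spark-class:

cd c:\spark-1.6.0
bin\spark-class org.apache.spark.deploy.master.Master
:: in a second command prompt, register a worker with the master started above
bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077

Once the worker registers, it should appear in the master's web UI at http://localhost:8080 (the default port).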

Friday, February 26, 2016

Spark MLlib: From quick start to Scikit-Learn

Presentation by Joseph Bradley from Databricks.
He was speaking about Spark’s distributed Machine Learning Library - MLlib.
We will start off with a quick primer on machine learning, Spark MLlib, and a quick overview of some Spark machine learning use cases. We will continue with multiple Spark MLlib quick start demos. Afterwards, the talk will transition toward the integration of common data science tools like Python pandas, scikit-learn, and R with MLlib.
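For readers who want to try something before watching, a minimal MLlib quick start in the Spark shell might look like the sketch below (it uses the sample libsvm file that ships with the Spark distribution; the 70/30 split and seed are arbitrary):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// Uses the spark-shell's built-in SparkContext `sc`
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 11L)

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

val predictionAndLabels = test.map(p => (model.predict(p.features), p.label))
val accuracy = predictionAndLabels.filter { case (pred, label) => pred == label }.count().toDouble / test.count()
println(s"Test accuracy = $accuracy")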

Watch the video of the webinar: Spark MLlib: From quick start to Scikit-Learn

Attachments:






Interactive Analysis with the Spark Shell

Basics

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:

Scala

./bin/spark-shell
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:
scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

Video Tutorial


Thursday, February 25, 2016

Hadoop - Eclipse configuration for Hadoop/MapR on Ubuntu

You have installed Hadoop on your computer and want to write code with Eclipse, but you don't know how to configure Eclipse for Hadoop or how to import the plugin jar file into Eclipse.
You could read many articles on the internet, but you would spend too much time on these problems, and when you make a mistake you keep searching and searching, whether or not it gets resolved.

This short article gives you the information you need.

It does not cover topics like "How do I install Hadoop on ***?" or "How do I set up Eclipse on ***?"; you can search the internet for those, or you already know them.


Now, configure Eclipse for Hadoop

Requirements:  

  1.  Installed Java 
  2.  Installed Hadoop
  3.  Have Eclipse IDE
  4.  Download Hadoop plugin for Eclipse: here 

Configuration:

1. Compile the source code and build the plugin jar file with the command below:
$ ant jar -Dversion=2.6.0 -Declipse.home=/opt/eclipse/ -Dhadoop.home=/usr/lib/hadoop/hadoop-2.6.0/
(my version is 2.6.0; if yours is, for example, 2.4.0, change it to your version: -Dversion=2.4.0).
2. Copy the plugin jar file from hadoop-eclipse-plugin/build/contrib/eclipse-plugin/hadoop-eclipse-plugin-2.6.0.jar to Eclipse's plugins directory.

3. After restarting Eclipse, the MapReduce perspective will be available.



Video tutorial



Comment and share if you find it helpful.

Thank you for reading!

Wednesday, February 24, 2016

Setup Apache Spark on Windows.

It’s great to try Spark on Windows. Here is how you can set up Spark in standalone mode on Windows without mistakes.

To install Spark in a Windows-based environment, the following prerequisites must be fulfilled first.



Requirements:


1. Java 6+
2. Scala 2.10.x
3. sbt (Simple Build Tool)
4. Git
5. Spark 1.*.*
6. IntelliJ IDEA or Scala IDE for programming


Installations:


Step 1: Download and install Java
Note: Java 8 can cause errors when building Spark on Windows, as noted in the troubleshooting section below.


Step 2: Download and install Scala

Scala 2.10.x works better. Also note: make sure there is no space in your installation path (e.g., you may install it in C:\scala, but you cannot install it in C:\Program Files\scala).


Step 3: Download and install Git

Make sure Git is installed and that “git” can be executed from the command line (this is an option in the Git installer).

Step 4: Download and install sbt

The latest version of sbt is compatible with Java 8 and Scala 2.10.x.

Step 5: Download and extract Spark

You can extract Spark anywhere, but c:\spark-1.6.0 is recommended.

Step 6: Build Spark 

  •  press WIN + R and run “cmd”
  •  go to your Spark root directory: cd c:\spark-1.6.0
  •  type “sbt assembly” and run it

Video tutorial:



Note:

The build will take about 10-20 minutes.
Then run spark-shell; Spark has been built successfully if you see the Spark welcome screen and the scala> prompt.



Troubleshooting:
1. java/javac/git/sbt is not recognized as an internal or external command
You did not add the software paths to the PATH environment variable, so they are not executable from the command line.
2. [error]: not a valid command: package/assembly
Make sure you installed the software versions indicated above.
3. [error]: java.lang.OutOfMemoryError: Java heap space
This means sbt is not allocated enough memory. Go to your sbt/conf directory and find the file “sbtconfig”.
Change -Xmx512M to -Xmx2048M, -XX:MaxPermSize=256m to -XX:MaxPermSize=1024m, and set -XX:ReservedCodeCacheSize=512m.


Sunday, February 21, 2016

Fraud analysis with Apache Spark

Many of today's online applications have grown beyond the traditional, basic ACID (atomic, consistent, isolated, durable) transactions of the relational era and have expanded so that they can be used across widely distributed systems, with more than one interaction per transaction, which can involve real-time and near-time analytics as well as past events.


Once completed, these transactions are then used to trigger other events and to make decisions that genuinely influence the next transaction the user creates, or internal activities such as business intelligence decision-making processes.





Examples of applications that are increasingly becoming transactional-analytical include fraud detection systems, a domain in which incoming purchase requests must be analyzed against many separate details such as purchase location, frequency, amount and more. Another use case well suited to transactional analytics is online recommendation engines, which continuously consume and analyze user activity and then quickly come back with suggestions for other recommended items to buy, additional news and stories to read, and so on.



Analyst firms such as Gartner classify these extended transactional systems as hybrid transactional/analytical processing, or HTAP. In addition, Gartner argues that the analytics needed in many applications will come at a variety of "tempos", meaning that the speed at which an analysis is performed will sometimes need to be real or near real time, while other situations are best handled by analyses that take longer to run.



DataStax Enterprise provides built-in integration with Spark to supply what is needed for transactional analytics. At the Cassandra Summit, Pat McDonough, Director of Customer Solutions at Databricks, gave a talk covering the SDKs of all the big data platforms and took a deep dive into Spark and its integration with Cassandra. Watch the video of the presentation here.



You can use DataStax Enterprise to experience transactional analytics, to better understand how it works, how fraud detection operates, and how well Spark performs as a tool for real-time big data analytics.


Saturday, February 20, 2016

Next-generation fraud analytics: Machine Learning on Hadoop

Fraud represents the largest source of losses for banks, accounting for over $1.744 billion in losses annually. The banking industry spends millions of dollars a year on technologies intended to reduce fraud and retain customers, yet still spends no small amount protecting the banks themselves. Let's focus on why current fraud detection methods don't work as well as desired and how machine learning on big data can help.


Most current approaches to fraud detection are largely static, based on signature patterns derived from a subset of past transactions. Banks typically use complex mathematical models, built from known historical fraud, to determine whether a transaction occurring in real time is fraudulent. Little if any attention is paid to detecting first-time fraud, for which no known signature exists. Moreover, the resulting signatures are not comprehensive enough, since they are created from only a subset of the data. As a result, banks are always playing catch-up, and first-time fraud frequently slips through undetected.
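As a rough sketch of the machine-learning alternative, in keeping with the Spark MLlib examples elsewhere on this blog (the input path, features and parameters below are hypothetical), a classifier can be trained on labeled historical transactions and then used to score new ones:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Hypothetical input: libsvm-formatted historical transactions with
// label 1.0 = confirmed fraud, 0.0 = legitimate, and features such as
// amount, purchase location code and transaction frequency.
val transactions = MLUtils.loadLibSVMFile(sc, "hdfs:///fraud/labeled_transactions")
val Array(train, holdout) = transactions.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = RandomForest.trainClassifier(
  train,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 50,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 8,
  maxBins = 32,
  seed = 42)

// Score held-out transactions; in production the same model would score live events.
val scored = holdout.map(t => (model.predict(t.features), t.label))
val missedFraud = scored.filter { case (pred, label) => pred == 0.0 && label == 1.0 }.count()
println(s"Missed fraud cases in holdout: $missedFraud")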

Friday, February 19, 2016

50 IoT (Internet of Things) Predictions for 2016

IoT and data science are intertwined, as sensor data from wearables, transportation, healthcare systems, manufacturing and engineering needs to be collected, refined, aggregated and processed by data science systems to deliver value and insight.

We posted a list of interesting articles on this topic late last year, and some IoT data sets earlier in 2015. For more information, visit IoTCentral.io. Below is a list of predictions for 2016 by top IoT experts.

Wednesday, February 17, 2016

50 IoT (Internet of Things) Predictions for 2016

IoT and data science are intertwined, as sensor data from wearables, transportation or healthcare systems, manufacturing and engineering, needs to be collected, refined, aggregated and processed by automated data science systems to deliver insights and value.
We posted a list of interesting articles late last year on this topic, and some IoT data sets earlier in 2015. More can be found here and on IoTCentral.io. Below is a list of 2016 predictions by top IoT experts.
To view full table, click here.
  • Nathaniel Borenstein, inventor of the MIME email protocol and chief scientist at Mimecast - “The maturation of the IoT will cause entirely new business models to emerge, just as the Internet did. We will see people turning to connected devices to sell things, including items that are currently "too small" to sell, thus creating a renewed interest in micropayments and alternate currencies. Street performers, for example, might find they are more successful if a passerby had the convenience of waving a key fob at their "donate here" sign. The IoT will complicate all aspects of security and privacy, causing even more organizations to outsource those functions to professional providers of security and privacy services.”
  • Adam Wray, CEO, Basho - "The deluge of Internet of Things data represents an opportunity, but also a burden for organizations that must find ways to generate actionable information from (mostly) unstructured data. Organizations will be seeking database solutions that are optimized for the different types of IoT data and multi-model approaches that make managing the mix of data types less operationally complex.”
  • Geoff Zawolkow, CEO, Lab Sensor Solutions - “Sensors are changing the face of medicine. Mobile sensors are used to automatically diagnosis disease and suggest treatment, bringing us closer to having a Star Trek type Tricorder. Also mobile sensors will ensure the quality of our drugs, diagnostic samples and other biologically sensitive materials through remote monitoring, tracking and condition correction.”
  • Zach Supalla, CEO, Particle - “2016 isn't the Year of IoT (yet)- It's A Bump in the Road. The industry has been claiming it’s the year of IoT for the last ​five years - let’s stop calling it the year of the IoT and let's start to call it the year of experimentation. 2016 will be the year that we recognize the need for investment, but we’re still deeply in the experimental phase. 2016 will be the bump in the road year - but at the end of it, we’ll have a much better idea of how experiments should be run, and how organizations can “play nicely” within their own walls to make IoT a reality for the business.”
  • Borys Pratsiuk, Ph.D, Head of R&D Engineering, Ciklum - "The IoT in medicine in 2016 will be reflected in deeper consumption of the biomedical features for non-invasive human body diagnostics. Key medical IoT words for next year are the following: image processing, ultrasound, blood analysis, gesture detection, integration with smart devices. Bluetooth and WiFi will be the most used protocols in the integration with mobile."
  • Laurent Philonenko, CTO, Avaya - “Surge in connected devices will flood the network – the increasing volume of data and need for bandwidth for a growing number of IoT connected devices such as healthcare devices, security systems and appliances will drive traditional networks to the breaking point. Mesh topologies and Fabric-based technologies will quickly become adopted as cost-effective solutions that can accommodate the need for constant changes in network traffic.”
To read all the 50 predictions, click here.


Time Series IoT applications in Railroads



Authors: Vinay Mehendiratta, PhD, Director of Research and Analytics at Eka Software
and Ajit Jaokar, Data Science for IoT course  

This blog post is part of a series of blogs exploring Time Series data and IoT.
The content and approach are part of the Data Science for Internet of Things practitioners course.  
Please contact info@futuretext.com for more details.
Only for this month, we have a special part-payment pricing for the course (which begins in November).
We plan to develop these ideas more – including an IoT toolkit in the R programming language for IoT datasets. You can sign up for more posts from us HERE
Introduction 
Over the last fifteen years, railroads in the US, Europe and other countries have been using RFID devices on their locomotives and railcars. Typically, this information is stored in traditional (i.e. mostly relational) databases. The RFID scanner reading provides the railcar number and locomotive number; the railcar number is then mapped to an existing railcar and train schedule. Timestamps on the scanned data also give us the sequence of cars on that train. Scanning the RFID on a locomotive tells us the number of locomotives and the total horsepower assigned to the train, and whether the locomotive is coupled at the front or the rear of the train.
The scanned data requires cleansing. Often, readings from a railcar RFID are missing at a particular scanner. In that case, the missing value is estimated from the scanner readings immediately before and after the problematic scanner, interpolating the time of arrival.
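A minimal sketch of that estimation step (the data types here are invented for illustration; real scanner records would carry more fields) is simple linear interpolation between the neighbouring scanners:

// Estimate a missing arrival time by interpolating between the readings
// immediately before and after the problematic scanner.
case class ScannerReading(scannerId: String, milepost: Double, arrivalMillis: Long)

def estimateArrival(before: ScannerReading, after: ScannerReading, missingMilepost: Double): Long = {
  val fraction = (missingMilepost - before.milepost) / (after.milepost - before.milepost)
  before.arrivalMillis + (fraction * (after.arrivalMillis - before.arrivalMillis)).toLong
}

// Example: the missing scanner sits halfway between two good readings
val before = ScannerReading("S1", milepost = 100.0, arrivalMillis = 1000000L)
val after  = ScannerReading("S3", milepost = 110.0, arrivalMillis = 1600000L)
val estimated = estimateArrival(before, after, missingMilepost = 105.0)   // 1300000
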
Major railroads have also defined their territory using links, where a link is the directional connection between two nodes, and have put RFID scanners at major links.
An RFID scan gives information on the railcar sequence in a train, the locomotive consist, and the track in real time. Railroads store this real-time and historical data for analysis.
Figure 1: Use cases of Rail Time Series Data

 
Figure 1 above shows use cases of time series data in the railroad industry. We believe that all of these use cases apply to freight railroads, and with some changes they can also be used for passenger railroads. They involve the use of analytics and RFID.
Uses of Real-Time Time Series Data
Here are some ways that time series data is/can be used in railroads in real-time.
  1. Dispatching: Scanner data has been used for dispatching decisions for many years now. It is used to display the latest location of trains. Dispatchers use this information, together with track type, train type and timetable information, to determine the priority that should be assigned to various trains.
  2. Information for Passengers: Passengers can use train arrival and departure estimates for planning their journey.

Uses of Historical Time Series Data:
Here are some ways that historical time series data is/can be used in railroads.
  • Schedule Adherence (identify trains that are consistently delayed): We can identify trains that are on schedule, delayed or early, as well as trains that consistently occupy tracks longer than the schedule permits. These are the trains that should be considered for a schedule change and that are candidates for root cause analysis.
  • Better Planning: We would be able to determine whether planned ‘sectional running times’ are accurate or need to be checked. Sectional run times are generally determined from experience and are network-level estimates that don’t consider local infrastructure (signal, track type). Sectional running time is used in developing train schedules and maintenance schedules at the network and local level.
  • Infrastructure Improvement - Track Utilization: We can identify the sections of track where trains have the highest occupancy. This leads us to tracks that are being operated near or above track capacity. The assumption here is that utilization above track capacity results in delays. We can identify the sets of trains, tracks, times of day and days of the week when occupancy is high and low. This provides insight into train movement and perhaps suggests train schedule changes. We might be able to determine whether trains are held up at stations/yards or on the mainline. An in-depth and careful analysis can help us determine whether attention needs to be paid to yard operations or mainline operations.
  • Simulation Studies: RFID scan data provides the actual time of arrival and departure for every car (and hence every train). Modelers do create hypothetical trains to feed into simulation studies. This information (actual train arrival/departure time at every scanner, train consist, locomotive consist) is used in infrastructure expansion projects.
  • Maintenance Planning: Historical occupancy of tracks would enable us to identify time windows when maintenance should be scheduled in the future. Railroads use inspection cars to inspect and record track condition regularly. Some railroads face the challenge of getting the right geo-coordinates for each segment of track. Careful analysis of this geo and time series data can measure track health and deterioration. Satellite imagery is also becoming available more frequently; a combination of these two sources can help inspect tracks, schedule maintenance, predict track failures, and move maintenance gangs.
  • Statistical Analysis of Railroad Behavior
  1. We can map train behavior with train definition (train type, schedule, train speed, train length) and track definition (signal type, track class, grade, curve, authority type) and identify patterns.
  2. Passenger trains do affect the operations of freight trains. Scanner data can be used to determine the delay imposed on freight trains.
  3. Time series information on railcars can be used to identify misrouted or lost cars.
  4. Locomotive consist information and time-series-based performance data can be used together to determine, historically, the best locomotive consist (make, horsepower) for every track segment.
  5. Locomotives are a costly asset for any railroad. Time series data can easily be used to determine locomotive utilization.
  •  Demand Forecasting: Demand for empty railroad cars is a known indicator of a country’s economy. While demand for railroad cars varies with car type and macro-economic factors, it is worth working to gain insight from the historical record. The number of cars by car type can be estimated and forecast for every major origin-destination pair. The number of train starts and train ends at every origin and destination can be used to forecast the number of trains for a future month. The forecast number of trains would help a railroad determine the number of crew and locomotives required, and the load the tracks would be subjected to. Forecast train counts can also be used in infrastructure studies.

  • Safety: Safety is the most important feature of railroad culture. Track maintenance and track wear and tear (track utilization) are all related to safety. Time series data on railcars, signal type, track type, train type, accident type and train schedule can all be analyzed together to identify potential relationships (if any) between the various relevant factors.

  • Train Performance Calculations: What is the unopposed running speed on a track with a given grade, curve, locomotive consist, car type, and wind direction and speed? These factors were determined by Davis [1] in 1926. Could time series data help us calibrate the coefficients of Davis’s equation for railcars with new designs? (The general form of the equation is sketched after this list.)
  • Planning and Optimization: All findings above can be used to develop smarter optimization models for train schedule, maintenance planning, locomotive planning, crew scheduling, and railcar assignment.
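For reference, the Davis equation mentioned above has the general quadratic form R = A + B·V + C·V², where R is the unit train resistance and V the speed. The sketch below evaluates that form; the coefficient values are illustrative placeholders only, not Davis's published values:

// General form of the Davis resistance equation: R = A + B*V + C*V^2.
// Calibrating A, B and C for modern railcar designs is the question posed above.
case class DavisCoefficients(a: Double, b: Double, c: Double)

def resistance(coeffs: DavisCoefficients, speed: Double): Double =
  coeffs.a + coeffs.b * speed + coeffs.c * speed * speed

val illustrative = DavisCoefficients(a = 1.5, b = 0.03, c = 0.0005)
println(resistance(illustrative, speed = 60.0))   // resistance at 60 mph with placeholder coefficients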

Conclusion:
In this article, we have highlighted some use cases of time series data for railroads. There are many more factors that could be considered, especially in the choice of technology for implementing these time series algorithms. In subsequent posts, we will show how some of these use cases could be implemented in the R programming language.
To learn more about the Data Science for Internet of Things practitioners course, please contact info@futuretext.com for more details. You can sign up for more posts from us HERE.
Reference:
  1. Davis, W. J., Jr.: The tractive resistance of electric locomotives and cars, General Electric Review, vol. 29, October 1926.
