Introduction to Machine Learning with Apache Spark & Apache Zeppelin
Apache Spark is a unified framework for big data analytics. It gives developers, data scientists, and analysts one integrated API for diverse tasks, such as batch analytics, stream processing, and statistical modeling, that previously required separate processing engines. Spark supports a wide range of popular languages, including Python, R, Scala, SQL, and Java.
Spark can read from diverse data sources and scale to thousands of nodes.
There will be a lecture on Spark’s Machine Learning (ML) module, covering both the older MLlib library and the newer Spark ML library for pipelining ML jobs. The lecture will be followed by a demo of Machine Learning examples in Apache Zeppelin.
Zeppelin provides a notebook-style environment for data exploration, analytics, and more: it’s a modern Data Science notebook.
If you would like to run the ML examples shown during the meetup yourself, set up one of the following:
- Hortonworks Sandbox (preconfigured HDP 2.4) on a Virtual Machine (VM) where you have full control of the environment. No data center, no cloud service, and no internet connection needed! http://hortonworks.com/products/sandbox/#install
- Hortonworks Sandbox (preconfigured HDP 2.4) in the cloud (on Microsoft Azure). It’s FREE for the first month, and there’s no need to download the VM! http://hortonworks.com/hadoop-tutorial/deploying-hortonworks-sandbox-on-microsoft-azure/
- Hortonworks Cloud on Amazon AWS for more control and pre-configured multi-node cluster deployments (slightly more advanced). Includes the latest bits, such as Spark 2.0: http://hortonworks.github.io/hdp-aws/