Scalable Data Processing with Apache Spark

BD-360

Scalable Data Processing with Apache Spark introduces you to the popular, open-source processing framework that has taken over the Big Data landscape. From basic concepts all the way to configuration and operations, you will learn how to model data processing algorithms using Spark’s APIs, how to monitor, analyze and optimize Spark’s performance, how to build and deploy Spark applications, and how to use Spark’s various APIs (RDD, SQL, DataFrame and Dataset).

Audience

Target Audience:
This course is intended for individuals responsible for designing and implementing solutions using Apache Spark, namely Solutions Architects, SysOps Administrators, Data Scientists and Data Engineers interested in learning about Apache Spark.

Prerequisites:
We recommend that attendees of this course have the following prerequisites:
— Proficiency in one of the following languages: Java 8 (including Lambdas), Scala, Python
— Basic familiarity with the JVM: classes, JARs, and memory management
— Basic familiarity with big data technologies, including Apache Hadoop, MapReduce, and HDFS
— Basic understanding of data warehousing, relational database systems, and database design

Topics

Module 1 – Introduction to Apache Spark

— Overview of Apache Spark
— Basic concepts: distributed processing and Map/Reduce
— Word Count example explained (see the sketch after this list)
— Apache Spark application components: Driver, Master, Executor
— Apache Spark deployment modes and local environment setup
— Serialization and Shuffling
— Spark internal processing model: Jobs, Stages and Tasks
— Apache Spark libraries: Core, SQL, MLlib, GraphX and Streaming
— Supported Languages: API comparison
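
To give a taste of the Word Count example covered in this module, below is a minimal sketch of the classic computation using the RDD API in Scala; the application name, input path and local master setting are illustrative assumptions, not course material.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // The driver creates the SparkSession; executors run the distributed tasks.
        val spark = SparkSession.builder()
          .appName("word-count")   // placeholder application name
          .master("local[*]")      // local deployment mode, convenient for experimentation
          .getOrCreate()
        val sc = spark.sparkContext

        // "Map" phase: split lines into words; "reduce" phase: sum the counts per word.
        val counts = sc.textFile("input.txt")      // placeholder input path
          .flatMap(line => line.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)                      // triggers a shuffle across executors

        counts.take(10).foreach(println)           // action: ships results back to the driver

        spark.stop()
      }
    }

Note how reduceByKey forces a shuffle between executors, while take is an action that returns results to the driver, tying together the driver/executor and Map/Reduce concepts listed above.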

Module 2 – Programming and Optimizing Apache Spark Jobs

— RDD API: transformations vs. actions
— Modeling computations using RDD API
— Caching, Broadcasting and Checkpointing
— Common performance pitfalls: groupBy, collect, join
— Accumulators as alternatives to actions
— Word Count example revisited – using the SQL, DataFrame, and Dataset APIs (see the sketch after this list)
— Supported Data Formats: CSV, JSON, Parquet
— API Comparison – which one should I use?
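
To hint at how the Word Count example can be revisited with the higher-level APIs, here is a rough Scala sketch covering the DataFrame, SQL and Dataset variants; the input path, view name and column aliases are placeholders chosen for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, explode, split}

    object WordCountRevisited {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("word-count-revisited")   // placeholder application name
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val lines = spark.read.text("input.txt")   // placeholder input path

        // DataFrame API: split each line into words, then aggregate per word.
        val words = lines.select(explode(split(col("value"), "\\s+")).as("word"))
        val countsDf = words.groupBy("word").count()

        // SQL API: the same aggregation expressed over a temporary view.
        words.createOrReplaceTempView("words")
        val countsSql = spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")

        // Dataset API: a typed variant of the same computation.
        val countsDs = lines.as[String]
          .flatMap(_.split("\\s+"))
          .groupByKey(identity)
          .count()

        countsDf.show(10)
        countsSql.show(10)
        countsDs.show(10)

        spark.stop()
      }
    }

All three variants express the same computation; which API fits best is exactly the kind of trade-off the comparison topic above addresses.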

Module 3 – Apache Spark Operations: Deployment, Monitoring and Integrations

— Managing Spark Sessions: best practices
— Apache Spark class loading: common mistakes and best practices
— Monitoring Spark applications: Web UI, metric sinks
— Resiliency: retried tasks and stages, and where resiliency fails
— Logging and viewing history
— Data storage integration: HDFS, S3, Cassandra and JDBC (see the sketch after this list)
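
The sketch below illustrates one plausible pattern for session management and storage integration of the kind covered in this module; the S3 bucket, JDBC connection details and column names are hypothetical placeholders, not values from the course.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SessionAndStorage {
      def main(args: Array[String]): Unit = {
        // getOrCreate() reuses an existing session instead of creating a second one,
        // a common practice when several components share the same application.
        val spark = SparkSession.builder()
          .appName("session-and-storage")   // placeholder application name
          .getOrCreate()

        // Read Parquet data from object storage (placeholder bucket and path).
        val events = spark.read.parquet("s3a://example-bucket/events/")

        // Write an aggregate to a relational database over JDBC
        // (placeholder URL, table and credentials; "event_type" is an assumed column).
        events.groupBy("event_type").count()
          .write
          .mode(SaveMode.Overwrite)
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/analytics")
          .option("dbtable", "event_counts")
          .option("user", "spark")
          .option("password", "secret")
          .save()

        // Stop the session when the application is done to release cluster resources.
        spark.stop()
      }
    }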
