Big Data Analytics

This course provides a basic introduction to big data and related quantitative research methods. Its aim is to familiarize students with big data analysis as a tool for addressing substantive research questions. The course begins with an introduction to big data and discusses what analyzing such data entails, along with the technical, conceptual, and ethical challenges it raises. The strengths and limits of big data research are examined in detail using practical examples. Students then work on case-study exercises in which small groups develop and present a big data concept for a specific practical case. This includes hands-on exercises that familiarize students with the formats big data comes in and gives them initial practical experience in handling and analyzing large, complex data structures.

Course Details

Ask most people about big data and they will tell you only this: it is a collection of data so large that it cannot be processed unless it is stored and handled in unconventional ways. But big data is about much more than storing and extracting data. The ecosystem encompasses so many technologies that it can be hard to know where to start. Among the technologies that make up big data are Hadoop, MapReduce, Apache Pig, Hive, Flume, Sqoop, ZooKeeper, Oozie, and Spark.

 

Companies are urgently looking for qualified big data analysts, and as data is collected and stored faster than ever before, demand for such professionals continues to grow. Before getting into big data, we encourage you to review the full big data curriculum so that you know which major topics a course should cover. Allsoft Solutions covers every module, up to and including data streaming.

Course Information

1. Provide an overview of the exciting and growing field of big data analytics.

2. Introduce the tools required to manage and analyze big data, such as Hadoop, NoSQL databases, and MapReduce.

3. Teach the basic techniques and principles for scalable, streaming-capable big data analytics.

4. Provide students with skills that will help them solve complex, real-world decision-support problems.

 

Introduction to Big Data

Limitations of the existing solutions for Big Data problems.

How Hadoop solves the Big Data problem

IBM’s 4 V’s

Types of Data

Installation of Cloudera and VMware

Setup of a single-node Hadoop cluster

Describing the functions and features of HDP

Listing the IBM value-add components

Explaining what IBM Watson Studio is

Giving a brief description of the purpose of each of the value-add components

Exploring the lab environment

Describing and comparing the open-source languages Pig and Hive

Listing the characteristics of programming languages typically used by Data Scientists: R and Python

Understanding the challenges posed by distributed applications and how ZooKeeper is designed to handle them.

Explaining the role of ZooKeeper within the Apache Hadoop infrastructure and the realm of Big Data management.

Exploring generic use cases and some real-world scenarios for ZooKeeper.

Defining the ZooKeeper services that are used to manage distributed systems.

Exploring and using the ZooKeeper CLI to interact with ZooKeeper services (a small programmatic sketch follows this list).

Understanding how Apache Slider works in conjunction with YARN to deploy distributed applications and to monitor them.
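To make the ZooKeeper topics above more concrete, here is a minimal sketch of a client interacting with ZooKeeper programmatically. It uses the third-party kazoo Python client rather than the ZooKeeper CLI covered in class, and it assumes a ZooKeeper server on localhost:2181; the znode paths and payloads are illustrative placeholders.

```python
# A minimal sketch of talking to ZooKeeper from Python with the third-party
# "kazoo" client (pip install kazoo). Assumes a server on localhost:2181;
# znode paths and payloads are illustrative placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()  # open the session with the ZooKeeper ensemble

# znodes are the small, hierarchical records ZooKeeper uses for coordination
zk.ensure_path("/demo")
if not zk.exists("/demo/config"):
    zk.create("/demo/config", b"replication=3")

# Read the znode back and list its siblings, like "get" and "ls" in the CLI
data, stat = zk.get("/demo/config")
print("data:", data.decode(), "version:", stat.version)
print("children of /demo:", zk.get_children("/demo"))

# Watches are how distributed applications learn about changes
@zk.DataWatch("/demo/config")
def on_change(data, stat):
    print("config is now:", data)

zk.set("/demo/config", b"replication=5")  # triggers the watch callback
zk.stop()
```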

 

HDFS Architecture

Hadoop Ecosystem

Linux-based commands (how to work with the local file system)

Hadoop Commands

Sqoop

Sqoop intro

How to list the tables in a MySQL database using Sqoop?

How to list the databases in MySQL using Sqoop?

How to import all tables from a specific MySQL database into HDFS (Hadoop)?

How to import data from MySQL into HDFS (Hadoop)?

How to export data from HDFS to MySQL?

How to import part of a table from MySQL into HDFS? (See the command sketch after this list.)
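The how-to items above all come down to invocations of the sqoop command-line tool. As a minimal sketch, the Python script below simply drives that CLI via subprocess; the host, database name, credentials, table names, and HDFS paths are illustrative placeholders, and it assumes Sqoop and the MySQL JDBC driver are already installed on the node.

```python
# Minimal sketch: driving the Sqoop CLI from Python via subprocess.
# Assumes Sqoop and the MySQL JDBC driver are installed and on the PATH;
# host, database, credentials, tables and HDFS paths are placeholders.
import subprocess

MYSQL_URL = "jdbc:mysql://localhost/retail_db"
AUTH = ["--username", "student", "--password", "training"]

def run(args):
    print("$ sqoop", " ".join(args))
    subprocess.run(["sqoop"] + args + AUTH, check=True)

# List databases / tables in MySQL
run(["list-databases", "--connect", "jdbc:mysql://localhost"])
run(["list-tables", "--connect", MYSQL_URL])

# Import a single table from MySQL into HDFS
run(["import", "--connect", MYSQL_URL,
     "--table", "orders", "--target-dir", "/user/student/orders", "-m", "1"])

# Import only part of a table (selected columns plus a WHERE filter)
run(["import", "--connect", MYSQL_URL, "--table", "orders",
     "--columns", "order_id,order_status", "--where", "order_status='CLOSED'",
     "--target-dir", "/user/student/orders_closed", "-m", "1"])

# Import every table in the database
run(["import-all-tables", "--connect", MYSQL_URL,
     "--warehouse-dir", "/user/student/retail_db", "-m", "1"])

# Export data from HDFS back into a MySQL table
run(["export", "--connect", MYSQL_URL,
     "--table", "order_summary", "--export-dir", "/user/student/summary"])
```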

Hive

Hive concepts (see the HiveQL sketch after this list)

Hive Data types

Hive Background

About Hive

Hive Architecture and Components

Metastore in Hive

Limitations of Hive

Comparison with Traditional Databases
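As a small illustration of the Hive topics above, the sketch below runs a few HiveQL statements against HiveServer2. It uses the third-party PyHive package purely for illustration (in class the same statements would be typed into the Hive CLI or Beeline); the host, port, database, table, and file path are placeholders.

```python
# Minimal sketch: running HiveQL from Python through HiveServer2,
# using the third-party PyHive package (pip install "pyhive[hive]").
# Host, port, database, table and file names are illustrative placeholders.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()

# Hive is schema-on-read: the table definition only describes how files
# in HDFS should be parsed when they are queried.
cur.execute("""
    CREATE TABLE IF NOT EXISTS employees (
        id INT,
        name STRING,
        salary DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# Move a file that already sits in HDFS into the table's directory
cur.execute("LOAD DATA INPATH '/user/student/employees.csv' INTO TABLE employees")

# A HiveQL query is compiled into MapReduce/Tez jobs behind the scenes
cur.execute("SELECT name, salary FROM employees WHERE salary > 50000")
for name, salary in cur.fetchall():
    print(name, salary)

conn.close()
```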

 

Pig

What is Pig?

Pig Run Modes

Pig Latin Concepts

Pig Data Types

Pig Example

Group Operator

COGROUP Operator

Joins

COGROUP

 

HBase

What is HBase

HBase Model

HBase Read

HBase Write

HBase MemStore

RDBMS vs HBase

HBase Commands

HBase Example (see the sketch below)
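As a companion to the HBase commands and example above, here is a minimal sketch of the same reads and writes done from Python. It uses the third-party happybase package, which talks to the HBase Thrift server (an assumption about the lab setup); the table, column-family, and row names are placeholders.

```python
# Minimal sketch: basic HBase reads and writes through the HBase Thrift
# server, using the third-party happybase package (pip install happybase).
# Assumes the Thrift server runs on localhost; all names are placeholders.
import happybase

conn = happybase.Connection("localhost")

# Create a table with one column family (roughly: create 'users', 'info')
if b"users" not in conn.tables():
    conn.create_table("users", {"info": dict()})

table = conn.table("users")

# Writes go to the WAL and the in-memory MemStore before being flushed to HFiles
table.put(b"row1", {b"info:name": b"Alice", b"info:city": b"Pune"})
table.put(b"row2", {b"info:name": b"Bob"})

# Point read (roughly: get 'users', 'row1')
print(table.row(b"row1"))

# Range scan (roughly: scan 'users')
for row_key, data in table.scan():
    print(row_key, data)

conn.close()
```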

 

MapReduce

Input Splits in MapReduce

Combiner & Partitioner

What are the file input formats in Hadoop MapReduce?

What type of key-value pair is generated when the file format is KeyValueTextInputFormat?

Can we set the required number of mappers and reducers?

Difference between the old and new APIs in MapReduce

What is the importance of the RecordReader in Hadoop?

MapReduce word count example (see the sketch after this list)
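Below is a sketch of the word count example referenced above. Since the native MapReduce API is Java, this version uses Hadoop Streaming so that the mapper and reducer can be plain Python scripts reading stdin and writing stdout; the input/output paths and the streaming jar location are placeholders.

```python
#!/usr/bin/env python3
# Word count via Hadoop Streaming: the same script acts as mapper or reducer
# depending on its first argument. Test locally with:
#   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
# On a cluster (jar path and HDFS paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /user/student/input -output /user/student/wc_out \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Emit one (word, 1) pair per word; the framework sorts pairs by key
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives grouped by key, so counts can be summed in one pass
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```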

 

Advanced Course

Big SQL

Overview of Big SQL

Understanding how Big SQL fits in the Hadoop architecture

Starting and stopping Big SQL using Ambari and the command line

Connecting to Big SQL using the command line

Connecting to Big SQL using IBM Data Server Manager

Configuring images

Starting Hadoop components

Starting up the Big SQL and DSM services

Connecting to Big SQL using JSqsh

Executing basic Big SQL statements (see the connection sketch below)
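The sketch below illustrates connecting to Big SQL and executing basic statements from a program rather than from JSqsh or DSM. Because Big SQL is Db2-compatible, it uses the ibm_db Python driver; the host name, port (32051 is a common Big SQL default), credentials, and the sample table are all assumptions for illustration.

```python
# Minimal sketch: connecting to Big SQL from Python with the ibm_db driver
# (pip install ibm_db). Big SQL is Db2-compatible, so standard Db2 drivers work.
# Host name, port, schema, credentials and the sample table are placeholders.
import ibm_db

dsn = (
    "DATABASE=bigsql;"
    "HOSTNAME=bigsql-head.example.com;"
    "PORT=32051;"
    "PROTOCOL=TCPIP;"
    "UID=bigsql;"
    "PWD=changeme;"
)
conn = ibm_db.connect(dsn, "", "")

# A Big SQL table over files in HDFS, using the Hive-style HADOOP TABLE DDL
ibm_db.exec_immediate(conn, """
    CREATE HADOOP TABLE IF NOT EXISTS sales (
        id INT, product VARCHAR(40), amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# Run a basic query and iterate over the result set
stmt = ibm_db.exec_immediate(
    conn, "SELECT product, SUM(amount) AS total FROM sales GROUP BY product"
)
row = ibm_db.fetch_assoc(stmt)
while row:
    print(row["PRODUCT"], row["TOTAL"])
    row = ibm_db.fetch_assoc(stmt)

ibm_db.close(conn)
```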

 

IBM Watson Studio

Explain what IBM Watson Studio is.

Identify industry use cases. 

List Watson Studio offerings. 

Create Watson Studio projects. 

Describe Watson Studio and the Apache Spark environment.

Describe Watson Studio and Cloud Object Storage. 

Prepare and analyze data. 

Use Jupyter Notebooks.

Describe Apache Spark environment options.

List Watson Studio default Apache Spark environment definitions. 

Create machine learning (ML) models with an Apache Spark runtime (see the sketch after this list).

Describe cloud storage and its features.

Define various types of cloud storage (object storage, block storage, and file storage).
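To illustrate creating machine learning models with an Apache Spark runtime, here is a generic PySpark ML pipeline of the kind you might run in a Watson Studio notebook backed by a Spark environment. The toy data, column names, and choice of logistic regression are purely illustrative and are not a Watson Studio-specific API.

```python
# Minimal sketch: training a machine learning model with a Spark runtime,
# as you might do inside a Watson Studio notebook. The toy data, column
# names and model choice are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("watson-studio-ml-sketch").getOrCreate()

# Tiny in-memory training set: two numeric features and a binary label
df = spark.createDataFrame(
    [(0.5, 1.2, 0), (1.5, 0.3, 0), (3.1, 4.0, 1), (2.8, 3.5, 1)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into the single vector column Spark ML expects
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(df)

# Score the training data and show the predictions
model.transform(df).select("f1", "f2", "label", "prediction").show()

spark.stop()
```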

 

Scala and Spark

What is Scala?

Why Scala for Spark?

Scala in other frameworks

Introduction to Scala REPL

Basic Scala operations

Variable Types in Scala

Control Structures in Scala

Foreach loop

Functions

Procedures

Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more

What is Spark?

Spark Ecosystem

Modes of Spark

Spark installation demo

Overview of Spark on a cluster

Spark Standalone cluster

Spark Web UI

Basic Spark configuration

Components of Spark Unified stack

Spark Streaming

MLlib

Core

Spark SQL

RDD - The core concept of Spark

RDDs

Transformations in RDDs

Actions in RDDs

Loading data into RDDs

Saving data through RDDs

Key-Value Pair RDDs

MapReduce and Pair RDD Operations

Scala and Python shell

Word count example

Shared Variables with examples

Submitting jobs in cluster

Hands-on examples (see the word count sketch below)
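The sketch below pulls together several of the RDD topics listed above (loading data, transformations, actions, key-value pairs, shared variables) as the classic word count, written for the Python (PySpark) shell; the same steps work in the Scala shell. The HDFS paths are placeholders.

```python
# Minimal PySpark sketch of the RDD word count covered above. The same
# transformations and actions exist in the Scala shell; the input and
# output paths are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# Loading data into an RDD (lazy: nothing is read yet)
lines = sc.textFile("hdfs:///user/student/input.txt")

# Transformations build a lineage of RDDs
counts = (
    lines.flatMap(lambda line: line.split())      # one record per word
         .map(lambda word: (word, 1))             # key-value pair RDD
         .reduceByKey(lambda a, b: a + b)         # MapReduce-style aggregation
)

# Actions trigger the actual computation
print(counts.take(10))

# Saving data through an RDD
counts.saveAsTextFile("hdfs:///user/student/wordcount_out")

# Shared variables: a broadcast variable and an accumulator
stopwords = sc.broadcast({"the", "a", "and"})
skipped = sc.accumulator(0)

def keep(word_count):
    word, _ = word_count
    if word.lower() in stopwords.value:
        skipped.add(1)
        return False
    return True

filtered = counts.filter(keep)
print("kept:", filtered.count(), "skipped:", skipped.value)

sc.stop()
```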

Future Opportunity

Big data is not only an important part of the future, it can be the future itself. The way companies, organizations and the IT professionals who support them approach their jobs will continue to be shaped by developments in the way we store, move and understand data.

Tutor Information

- Databases (Oracle)
- ETL tools (Informatica, ODI)
- Analytical reporting tools (OBIEE, Tableau)
- Big Data ecosystem (HDFS, Kafka, Sqoop, Spark, Scala, Python, Hive, Impala, Oozie, HBase)

Industry-Driven Projects

Proof-of-concept (POC) projects using Hadoop technologies, for example:

1. Zero Copy Shared Memory Framework in KVM for Host Guest Data Sharing

2. Sampling Based Network Traffic Measurement Algorithm for Big Network Data

3. Novel Privacy Preserving and Efficient Protocols for Human Activity Recognition Based on Sensor

4. Credit Card Fraud Detection Project 

5. Twitter dataset analysis