Introduction

The aim of every machine learning algorithm is to build a classification or regression model from a dataset. One option is a parametric model, whose parameters are adjusted according to the observed data. In many real-life scenarios such models are not feasible, because there is no expert knowledge to supply a functional form to begin with; non-parametric techniques are therefore applied to derive a model directly from the data. Supervised learning refers to deriving models for which the target variable is already known in the training data. Gradient Boosting is one such supervised learning technique, and it has gained a great deal of popularity over the past decade.

Ensemble Methods

Ensemble methods aggregate many models, each trained on the same dataset, into a single model that produces more accurate estimates and more reliable decisions. Because multiple learners are trained to solve the same problem, ensemble methods are also called committee-based learning or multiple classifier systems. The individual learners in an ensemble are called base learners and are produced by running a base learning algorithm on the training data; the base learning algorithm could be a decision tree, a neural network, or any other learning algorithm. An ensemble generally has better generalization ability than any single base learner. Ensembles can boost weak learners, which perform only slightly better than random guessing, into strong learners, which are able to make accurate predictions. The two most commonly used ensemble methods are "Bagging" and "Boosting". Bagging randomizes the samples through bootstrapping, while boosting re-weights the samples, giving extra weight to wrongly predicted observations.

Boosting

The main idea behind boosting is to build several models and add them to the ensemble one after the other. A family of algorithms that converts weak learners into strong learners is called boosting. A weak learner may be only slightly better than a random guess, while a strong learner gives predictions close to ideal performance. Boosting is based on the observation that finding many rough rules of thumb is much easier than finding a single highly accurate prediction rule. It refers to the repeated process of feeding the learner a different subset of the data, that is, a different distribution of weights over the training examples. In boosting the assigned weights change over time: when a sample's weight is increased, the probability of that sample being picked increases. As we proceed from one round to the next, larger weights are assigned to the examples that were wrongly predicted in the previous round, so their probability of being picked increases; the weights are reassigned so that every observation eventually gets a fair chance of being picked, leading to a fair representation of the data in the final model. A long-standing question about boosting was whether it would outperform every other method or instead lead to severe overfitting; the challenge is to find the exact number of iterations at which to stop so that the model neither overfits nor underfits.

Gradient Boosting

Gradient boosting can be viewed as an enhanced form of boosting. Friedman derived a gradient-descent-based formulation of the boosting method, and the resulting family of algorithms became known as gradient boosting machines. The main idea is to build each new base learner of the ensemble to be maximally correlated with the negative gradient of the loss function evaluated at the current model.

Dataset: {(x_i, y_i)}, i = 1, ..., N, where the x_i are the explanatory input variables and y_i is the response variable.

The goal is to construct the functional dependence of y on x with an estimate F(x) that minimizes the loss function Psi(y, F(x)). We restrict the search to a parametric family of functions F(x; theta).

If the estimate is built in M steps, the parameter estimates can be summed as theta_hat = sum over t = 1, ..., M of theta_hat_t.

Steepest gradient descent is the most commonly used parameter estimation procedure. It decreases the empirical loss function J(theta) = sum over i of Psi(y_i, F(x_i; theta)).

At the t-th iteration the estimate and the optimization rule are: theta_hat_t = theta_hat_{t-1} + delta_theta_t, where the increment is taken along the negative gradient, delta_theta_t = -rho_t * grad_theta J(theta_hat_{t-1}), and rho_t is the step size.

Gradient Boost Algorithm

With an arbitrary loss function and a custom base learner, it is difficult to obtain the parameter estimates directly. Instead, at each step we choose a new function that is most parallel to the negative gradient observed on the data; that is, we look for the new base learner h(x, theta) that is most highly correlated with the negative gradient.

Summary:

Inputs:

§ dataset {(x_i, y_i)}, i = 1, ..., N

§ number of iterations M

§ loss function chosen, Psi(y, F)

§ base-learner model chosen, h(x, theta)

Algorithm:

Step 1: Initialise the function estimate with a constant, F_0(x) = argmin_c sum_i Psi(y_i, c).

Step 2: for t = 1 to M:

Step 3: calculate the negative gradient g_t(x_i) = -[d Psi(y_i, F(x_i)) / d F(x_i)], evaluated at F = F_{t-1}.

Step 4: fit a new base learner function h(x, theta_t) to the negative gradient values.

Step 5: calculate the best gradient descent step size rho_t = argmin_rho sum_i Psi(y_i, F_{t-1}(x_i) + rho * h(x_i, theta_t)).

Step 6: update the function estimate: F_t(x) = F_{t-1}(x) + rho_t * h(x, theta_t).

Step 7: end the loop started in Step 2.
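To make the steps concrete, the following is a minimal sketch in Python, assuming a squared-error loss (whose negative gradient is simply the residual y - F(x)) and scikit-learn regression trees as base learners; for this loss a fixed shrinkage factor stands in for the exact line-search step size of Step 5, and the function names and parameter values are purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, learning_rate=0.1, max_depth=1):
    # Step 1: initialise the estimate with a constant (the mean minimises squared error).
    F0 = float(np.mean(y))
    F = np.full(len(y), F0)
    learners = []
    for t in range(M):                              # Step 2: for t = 1 to M
        residuals = y - F                           # Step 3: negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                      # Step 4: fit a new base learner to the negative gradient
        F = F + learning_rate * tree.predict(X)     # Steps 5-6: step size (shrinkage) and update of the estimate
        learners.append(tree)
    return F0, learners                             # Step 7: end of the loop

def gradient_boost_predict(X, F0, learners, learning_rate=0.1):
    # Predictions are the constant start value plus the shrunken contributions of all trees.
    pred = np.full(X.shape[0], F0)
    for tree in learners:
        pred += learning_rate * tree.predict(X)
    return pred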

Loss Function

A loss function quantifies how far a prediction is from the correct, desired value. Since the aim is to achieve the highest possible prediction accuracy, the loss function is the object we want to minimize. Loss functions are categorized according to the type of response variable:

1) Loss functions for a continuous response variable

a) Gaussian loss function: obtained by assuming the data follow a Gaussian distribution and taking the negative log of the Gaussian likelihood. Minimizing it amounts to squared-error loss and estimates the conditional mean.

b) Laplace loss function: obtained analogously by assuming a Laplace distribution for the response variable; it corresponds to absolute-error loss and estimates the conditional median.

c) Huber loss function: the loss function used in robust regression. Its characteristic is that it is less sensitive to outliers in the data than the squared-error loss, behaving quadratically for small residuals and linearly for large ones.

d) Quantile loss function: the loss function used for quantile regression. It is robust to outliers in the response measurements and allows a chosen conditional quantile to be estimated rather than the mean.

2) Loss functions for a categorical response variable

a) Binomial loss function: used when the response variable takes the values 0 and 1. It is obtained by taking the negative log-likelihood of the binomial model and minimizing it.

b) AdaBoost loss function: the exponential loss, so called because minimizing it reproduces the AdaBoost algorithm.

3) Other response variable families

a) Loss functions for survival models

b) Loss functions for count data

c) Custom loss functions
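For reference, these loss families map onto scikit-learn's gradient boosting roughly as in the sketch below; the exact loss names vary with the library version (for example the squared-error loss is 'ls' in older releases and 'squared_error' in newer ones), so treat the strings as indicative rather than definitive.

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

# Continuous response: Huber loss is less sensitive to outliers than squared error;
# quantile loss targets a chosen quantile (here the 90th percentile, set via alpha).
huber_model = GradientBoostingRegressor(loss="huber", alpha=0.9)
quantile_model = GradientBoostingRegressor(loss="quantile", alpha=0.9)

# Categorical 0/1 response: the binomial deviance (log loss) is the default;
# loss="exponential" gives AdaBoost-style boosting.
adaboost_like = GradientBoostingClassifier(loss="exponential")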

Base-Learners

Base-learners are the components of the ensemble that are combined together to form the final model. Also referred to as weak learners, they focus at each round on correctly classifying the most heavily weighted samples. The most commonly used base-learners fall into three categories: linear models, smooth models and decision trees.

1) Additive base-learners: at each iteration several additive base-learners, typically one per explanatory variable, are fitted at the same time, and the best of these models is chosen based on the residual sum of squares. Additive models are encouraged when a sparse model is required, since choosing one base-learner per iteration leads to an easy variable selection procedure.

2) Decision tree base-learners: the decision tree approach partitions the space of the input variables into rectangular regions arranged in the form of a tree. A decision tree with only one split is a special case known as a tree stump. Decision tree base-learners are the most widely used in real-life scenarios.
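In scikit-learn the decision-tree base learner is controlled through the tree parameters of the ensemble itself; a minimal illustration (parameter values are arbitrary) is:

from sklearn.ensemble import GradientBoostingClassifier

# max_depth=1 gives one-split trees ("tree stumps") as base learners;
# deeper trees let each base learner capture feature interactions.
stump_model = GradientBoostingClassifier(max_depth=1, n_estimators=200)
deeper_model = GradientBoostingClassifier(max_depth=3, n_estimators=200)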

Regularization

The usefulness of a model depends on its generalization capability. If the learning algorithm is not applied carefully it may overfit the data, and overfitting degrades the model's ability to generalize; this is why regularization techniques exist. Increasing the number of iterations reduces the error on the training set, but doing so in excess leads to overfitting. The optimal number of iterations is therefore determined by monitoring the error on a separate validation dataset.
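One way to do this in scikit-learn, sketched below on synthetic data, is to fit a model with a generous number of iterations and then use staged_predict to score the partial model after every boosting round on a held-out validation set; the dataset and parameter values are illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)

# Validation error after each boosting iteration; its minimum gives the optimal iteration count.
val_errors = [mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)]
best_iteration = int(np.argmin(val_errors)) + 1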

· Subsampling: Subsampling is the simplest form of regularization. It not only improves the generalization ability of the model but also reduces the computation considerably, and it introduces randomness into the model. Subsampling requires a bag fraction to be defined: a positive number less than 1 giving the proportion of the sample used at each iteration. Subsampling works well for large datasets, which are handled well when subsampled, but care is needed that the subsample size does not become too low, as that results in a poorly fit model due to a lack of degrees of freedom. Higher efficiency and accuracy are generally obtained with a larger number of base-learners and a lower bag fraction than with a higher bag fraction and fewer base-learners. (A sketch combining this and the other regularization knobs below appears at the end of this section.)

· Shrinkage: The most familiar use of shrinkage is ridge regression, where the regression coefficients are shrunk towards zero; shrinking coefficients that contribute little reduces their potential to destabilize the model. In gradient boosting, shrinkage reduces the size of the incremental steps: the idea is to take many small steps rather than a few large ones, since a single wrong large step affects all subsequent iterations. Shrinkage is applied in the final update step of the gradient boosting algorithm, which becomes F_t(x) = F_{t-1}(x) + lambda * rho_t * h(x, theta_t), with 0 < lambda <= 1. The shrinkage parameter lambda (the learning rate) should be lowered for better generalization. With an extremely small step size, shrinkage also helps decision-tree ensembles handle large datasets better.

· Stochastic Gradient Boosting: After introducing gradient boosting, Friedman proposed a minor modification inspired by Breiman's method of 'bagging', introducing an element of randomness into the algorithm: at each iteration the base learner is fitted on a randomly selected subsample of the training set, drawn without replacement. In 1999 Breiman suggested a hybrid bagging-boosting procedure known as adaptive bagging, intended for least-squares fitting, in which the base learner is replaced by a bagged base learner and the residuals are replaced by out-of-bag residuals at each iteration of boosting.

· Number of observations in leaves: Regularization can also be done by limiting the minimum number of observations in the terminal nodes of the tree. A split at a node is not made if it would leave fewer observations in a resulting node than this defined number. Setting such a minimum-observation limit reduces the variance of the predictions at the leaves.

· Penalize complexity of the tree: Penalizing the complexity of the learned model is another regularization technique for gradient boosted trees. The model complexity of a tree is usually taken to be proportional to the number of leaves in the learned tree. Optimizing the loss and the model complexity jointly corresponds to pruning: branches that fail to reduce the loss enough to offset the added complexity are removed.
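The sketch below shows how the regularization devices above map onto scikit-learn's gradient boosting parameters (the values are illustrative): subsample is the bag fraction, learning_rate is the shrinkage parameter lambda, min_samples_leaf limits the observations per terminal node, and max_depth constrains the complexity of each tree.

from sklearn.ensemble import GradientBoostingRegressor

regularized_model = GradientBoostingRegressor(
    n_estimators=1000,      # many small steps ...
    learning_rate=0.05,     # ... each shrunk by a small factor (shrinkage)
    subsample=0.5,          # bag fraction: half of the training rows per iteration (stochastic gradient boosting)
    min_samples_leaf=10,    # minimum number of observations in each leaf
    max_depth=3,            # limit the complexity of each tree
    random_state=0,
)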

Gradient Boosting in Python

Following are the packages and methods used (scikit-learn):

Classification

Package: from sklearn.ensemble import GradientBoostingClassifier

Method: GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)

Regression

Package: from sklearn.ensemble import GradientBoostingRegressor

Method: GradientBoostingRegressor(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0, loss='squared_error').fit(X_train, y_train)

(The squared-error regression loss is named 'ls' in older scikit-learn releases and 'squared_error' in newer ones.)
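A complete, minimal usage example on synthetic data (the dataset and parameter values are illustrative) might look as follows:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0,
                                 max_depth=1, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))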

Applications of Gradient Boosting

Feature Selection: Feature selection is important in machine learning because it results in classifiers that occupy less memory, makes training and testing faster, leads to better generalization, and reduces feature acquisition costs. Gradient Boosted Feature Selection (GBFS) combines an l1-norm-style feature cost with gradient boosted regression trees: the trees are built within the gradient boosting framework using the greedy CART algorithm, and by changing the impurity function so that splitting on a not-yet-used feature incurs an extra cost, features are selected sparsely. The time and memory complexity of GBFS depends on d, the feature dimensionality, and n, the number of data points, and GBFS is fast in practice. GBFS helps discover nonlinear interactions between features, combines feature selection and classification into a single optimization, and can handle pre-specified cost structures.

Algorithm:

Step 1: Start from linear classification combined with lasso (l1) regularization of the weight vector.

Step 2: Because the l1 term both introduces sparsity and regularizes against overfitting, a capped l1 norm is introduced so that the feature-selection cost can be controlled separately.

Step 3: The feature selection and regularization terms of the objective are adjusted accordingly.

Step 4: The linear classifier is then generalized beyond purely linear functions.

Step 5: The resulting weight vector is a sparse linear vector over h, where h represents the set of all possible regression trees; the final classifier is the corresponding sparse linear combination of trees.
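GBFS itself is not part of the standard libraries; a much simpler, related idea that is easy to try is to rank features by the impurity-based importances of a fitted gradient boosted model and keep only the strongest ones, as sketched below on synthetic data with an illustrative threshold.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

# Fit a gradient boosted model, then keep only the features with above-median importance.
gbt = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(gbt, threshold="median", prefit=True)
X_selected = selector.transform(X)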

High Dimensional Sparse Output: In multi-label classification the output space is an L-dimensional 0/1 vector, and L can reach millions in several applications. Vanilla Gradient Boosted Decision Trees (GBDT) run out of memory in this setting, so a new variant of GBDT called GBDT-SPARSE is used. Sparsity is exploited throughout GBDT training, node splitting and residual computation, and prediction runs in sublinear time. Using GBDT-SPARSE, both the prediction quality and the model size are improved, and the approach is preferred because it takes up very little memory.

In GBDT for binary classification we have training data X with labels Y, and the aim is to choose a classification function that minimizes the sum of the loss over the training examples. The estimation function is taken to be in additive form, and every function in the sum belongs to a parameterized set of learners; as the name suggests, Gradient Boosted Decision Trees use decision trees as these base-learners.

Algorithm:

Step 1: Start from the general loss function for the high-dimensional output, summed over the training samples and combined with a regularization term.

Step 2: The regularization term in the Step 1 objective is R(F); it is defined over the tree structure, that is, over the prediction vectors attached to the M leaves of the m-th tree, with one vector for the j-th leaf node of the m-th tree.

Step 3: The loss function should be differentiable and satisfy the following properties:

a) it decomposes as a summation of per-label losses, one for each label q;

b) its derivative is zero when the prediction and the target label are both zero, which is what keeps the residuals sparse.

Step 4: Compute the L-dimensional gradient for the i-th sample.

Step 5: Minimize the objective function for each tree to find the best cut value at every split.
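GBDT-SPARSE is not available in scikit-learn; the naive baseline below, which fits one gradient boosted model per label, illustrates the vanilla approach the method improves upon: memory and training time grow linearly with L, which is workable for a handful of labels but hopeless when L reaches the millions. The data and parameters are illustrative.

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=200, n_features=20, n_classes=5, random_state=0)

# One boosted ensemble per label: cost scales linearly with the number of labels L.
vanilla = MultiOutputClassifier(GradientBoostingClassifier(n_estimators=20, random_state=0))
vanilla.fit(X, Y)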

Sentiment Analysis: This application is taken from a research paper on Greek sentiment analysis. Sentiment analysis plays a pivotal role in text classification. Modern Greek posed complications because pre-processing tools for Greek are not easily available, so pre-processing was done through the Google translation API. The gradient boosting ensemble classification algorithm was then used to handle the large number of features. Basic feature reduction techniques such as Principal Component Analysis were tried, but gradient boosting gave the best performance. It was also observed that gradient boosting could handle the imbalanced dataset: gradient boosting is said to perform well on high-dimensional data because it performs variable selection and assigns a variable amount of degrees of freedom, while the class imbalance issue was addressed using a bootstrapping method.

Ranking in Search Engines: Gradient boosting is used for learning to rank the results in a search engine. The purpose of the ranking model is to produce a permutation of the results that surfaces the relevant ones and suppresses the irrelevant ones. AltaVista was the first search engine to use a gradient-boosting-trained ranking function; later it was also used by Yahoo.

Example

Smart City Mobility Application: Developments in communication technologies have had a large impact on smart cities. People's location information and their origin-destination data can be obtained from mobile phone data. A model was implemented to provide personalised route suggestions to people through a mobile application. To design this model, gradient boosted trees were used to predict the target (destination) value based on the selected origin. Gradient boosted trees are best known for predictive analytics; depending on the characteristics of the target variable, classification or regression is chosen. Transportation mode is a categorical variable in this case, so a GBT classifier was used. Given explanatory variables and a response variable with samples of known pairs, the goal is to predict the response by minimizing the loss function, and gradient boosting is used to increase the stability of the model.

Advantages

· A major advantage of gradient boosting is that it handles high-dimensional feature spaces and very large numbers of training examples.

·

It is robust to outliers and scalable.

·

Gradient Boosting is a very flexible

nonlinear regression procedure and it helps to increase the accuracy of

decision trees tremendously.

· Gradient boosted regression trees can handle heterogeneous data and automatically detect feature interactions.

· Gradient boosting supports several types of loss functions and hence has a wide scope.

· In Gradient Boosted Feature Selection the algorithm is able to quickly learn a very accurate classifier, which helps select high-quality features. The iterative nature of GBFS also permits biasing the probability of sampling a feature using its split values from previous iterations, helping to avoid selecting unimportant features.

· For high-dimensional sparse output, using the sparse variant of Gradient Boosted Decision Trees reduces both the model size and the prediction time.

Disadvantages

· Gradient boosting can fail to perform when data are insufficient. It is entirely data driven, and the derived classifier is generated from the training data itself; in applications where data are scarce and restricted, human knowledge is needed to supplement the data and reach a decision.

· Gradient boosting is a non-explicit model: it does not produce a single explicit, easily interpretable formula.

· If the base classifiers are overly complex, or conversely too weak, gradient boosting might not be of much use.

· Gradient boosting can also be susceptible to noise.

· Unlike a single decision tree, gradient boosting does not yield one displayable tree, because many trees are incorporated together into the model.

· When unconstrained, individual trees become prone to overfitting, which needs to be controlled.

· Gradient boosting requires careful tuning.

· Gradient boosted regression trees cannot extrapolate beyond the range of the training data, and due to the sequential nature of boosting the training is hard to parallelize.

Conclusion

Gradient Boosting Machines are widely applied to real-life data. GBMs give excellent performance in terms of both generalization and accuracy, and they also give detailed insight into the fitted model. With the ability to predict efficiently and outperform a number of other methods, GBMs have an edge over many alternatives. The essential stages of designing a model using gradient boosting are as described above, and the framework allows detailed investigation and analysis of the effects captured by the model. Gradient boosting also has promising applications in robotics for pattern recognition, one of the ultimate uses of machine learning.
