Introduction problem. This is a reason ensemble methods



The aim for every machine learning algorithm is to
build a classification or regression model from the dataset. When trying to
come up with a model, a parametric model is possible by adjusting the
parameters according to the observed data. In most real life scenarios such
models are not possible as we do not have any expert guess to begin. Hence
non-parametric techniques are applied to derive a model directly from the data.
Supervised learning is done when we derive models for which the target variable
is already known. Gradient Boosting is also one such techniques for supervised
learning which has gained lot of fame over the past decade.

We Will Write a Custom Essay Specifically
For You For Only $13.90/page!

order now

Ensemble Methods


Ensemble methods are used to aggregate many models,
each which have been for the same dataset. Ensemble then comes up with a single
model out of the many which has better accuracy, estimates and reliable
decision. Multiple learners are trained by ensemble methods to find solution to
a problem. This is a reason ensemble methods are called
committee-based-learning or learning multiple classifier system. An ensemble method
has many learners called as base learners which are base learning algorithm on the
training data. Base learning algorithm could be decision tree, neural network
or any other learning algorithm. Ensemble has an ability of generalization
which is better than a base learner. Ensemble boosts weak learners which would
be slightly better or strong learners. Strong learners are those which are able
to make accurate predictions. Two most commonly used ensemble methods are “Bagging” and “Boosting”. Bagging helps to make the samples random by bootstrapping.
Boosting works on re-weight the samples and gives extra weight to wrongly
predicted predictors.



The main idea behind boosting is to build several
models and then add it to ensemble one after the other. Family of algorithm
converting weak learners to strong learners is called Boosting. Weak learner
might be just a bit better than a random guess, while a strong learner is the
prediction which would give us ideal performance. Boosting is based on
observations to find rough rules of thumb which would be easier than directly
finding a single highly accurate prediction rule. Boosting refers to the
repeated process of feeding a different subset which is different distribution
of weights over the training examples. In boosting the assigned weights change
over time. When we increase the weight, the probability of that sample being
picked increases. Hence when we proceed from one round to another we assign
larger weight to those which were not picked so that their probability of being
picked increases. Weights are reassigned so that each observation is given an
equal chance for being picked leading to fair representation of the data in the
model formed. Boosting technique face a problem which was to decide whether it
outperformed every other method or it led to severe overfitting. The challenge
was to find the exact iterations to stop where it would not overfit or underfit
the model.

Gradient Boosting

boosting can be called as an enhanced form of boosting. A gradient descent
based formulation of boosting method was derived by Friedman. These
customizations made were then known as gradient boosting machines. It uses the
main idea of an ensemble to build new base learners to be maximally correlated
to the loss function’s negative gradient.


where   explanatory input variables and y is the
response variable.

construct functional dependence  with estimate to minimize the loss function.


We restrict the function to
parametric function :

there are M steps then parameter estimates can be summed to the following:

Gradient Descent is the most commonly used parameter estimation procedure. Decrease
Empirical Loss Function:

is the estimate at t-th iteration and optimization rule:


Gradient Boost


the specific loss function and custom base learner  its difficult to get to a parameter estimate.
Thus we choose a new function which would be most parallel to the negative
gradient for the observed data.


we look for new function highly correlated with negative gradient:




§  Dataset

§  number of iterations

§  loss function chosen

§  base learner model




1: Initialising estimate function with a constant.

2: for t=1 to M

3:       Calculate negative gradient

4:       fit new base learner function

5:       Calculate best gradient descent
step size:


6:       Update function estimate

7:  end the loop in step 2







loss function quantifies how far the prediction is from the correct desired
value. The aim is achieve highest accuracy in prediction so loss function is
the object we want to minimize.

functions are segregated according to the type of response variable:


Functions for Continuous Response variable

loss function: The loss function is finding the mean parameter when the data is
assumed to follow Gaussian distribution. Then the log of the Gaussian
likelihood is taken.

loss function: When the loss function is taken for Laplace Distribution of
response variable.

loss function: The loss function used in robust regression is called as Huber loss.
The characteristic of Huber loss is that it is less sensitive to outliers in
the data and the squared error loss.

loss function: The loss function used for quantile regression is referred to as
quantile loss function. The characteristic of quantile loss is that it is robust
against outliers in case of response measurements.

Functions for Categorical Response variable

loss function: When the response variable has values 0, 1 the binomial loss
function comes into picture. It can be estimated by takes the negative log
likelihood and finding minimum of it.

loss function: Exponential loss is referred to as Adaboost loss.

      3) Other response variable families 

           a) Loss function for survival Models

           b) Loss function for counts data

           c) Custom loss functions







It is a component of the ensemble which are combined
together to be  known as base-learners. Base
learners also referred as weak learners would focus on correctly classifying the
highest weighted sample. The most commonly used base-learners are categorised
as linear models, smooth models and decision trees.

Additive Base Learners: At each iteration many additive
base-learners build random variables and are fitted at the same time. Based on
the residual sum of squares the best out of these models is chosen. Additive
Model’s usage is encouraged when we require to fit a sparse model. When using
additive base-learner it leads to easy variable selection procedure.

Decision Tree Base Learners: Decision tree approach
is used to partition the space of the input variables into rectangular spaces
in form of a tree. When a decision tree has only one split it is a special case
known as tree stump. Decision tree base learners are widely used in real life




The importance of a model depends on its
generalization capabilities. In case the learning algorithm is not applied
properly it may lead to overfitting the data. If overfitting is done it
degrades the ability of the model for further generalization. This is the
reason why regularization techniques came into existence. Increasing the number
of iterations would reduce the error on the training set but if done in excess
leads to overfitting. Optimal value of number of iteration is determined by
monitoring the error on a separate validation dataset.

is the simplest form of regularization. Subsampling not only improve the
generalization ability of a model but also reduces the computations considerably.
Subsampling introduces randomness in the model. Subsampling requires a bag
fraction to be defined. Bag fraction is a positives number less than 1.Whatever
bag fraction is defined that much percentage of the sample is used at each
iteration. Subsampling works well for large datasets as they are dealt well
when subsampled. But a concern still needs to be taken care of, the sample size
should not become too low as it would result in poorly fit model due to lack of
degrees of freedom. Higher efficiency and accuracy is obtained with larger
number of base-learners with lower bag rate when compared to higher bag rate
with less number of base-learners.


coefficient is shrunk to zero in case of ridge regression. This is the most
common regularization using shrinkage. When the regression coefficients become zero,
the potential of affecting stability reduces. Shrinkage reduces the size of
incremental steps. The aim of this technique is to take several small steps instead
of taking few large steps. The only thing which needs to be seen is the no
iteration is wrong as it will affect all subsequent iterations. Regularization
through shrinkage is applied as the final step in the gradient boosting
algorithm which is as follows:


parameter lamda needs to be lower for better generalization.

Shrinking makes decision tree
handle large datasets better with extremely small step size.

Gradient Boosting: After Friedman introduced gradient boosting,
a minor modification was made by Breiman with his method of ‘bagging’. This was
done to introduce an element of randomness in the algorithm. It was suggested
that a base learner should be fitted on randomly selected subsample of the
training set which is drawn without replacement. Then later in 1999 Breiman
suggested a bagging-boosting procedure which was known as adaptive bagging.
Adaptive bagging looked for least square fitting. Base learner was replaced by
bagged base learner and residuals would be replaced by out-of-bag residuals at
each iteration of boosting.

of observations in leaves: Regularization also is done in
form of limiting the minimum number of observation in the terminal nodes of the
tree. Split at any node is stopped or not further continued if the number of
node would become less than this defined number. The variance in prediction at
leaves  is reduced if limit of minimum
observation is set.

Complexity of the Tree: Penalizing model complexity of the
learned model is another regularization technique for a gradient boosted tree. Proportional
number of leaves in the learned tree is called the model complexity of the
tree. Removing the branches which fail to reduce the loss after the pruning is
known as optimization of loss and also of model complexity.

Boosting in Python

Following are the packages and methods used:


Package:      from sklearn.ensemble import



Package:      from sklearn.ensemble import

Method:    GradientBoostingRegressor(n_estimator=50,learning_rate=1,max_depth=1,random_state=0,loss=’ls’).fit(X_train,y_train) 

of Gradient Boosting

Selection: Feature selection is important in
machine learning as it results in classifiers which occupy less memory. Data
becomes faster to train and test. It leads to better generalization and also
reduces feature selection costs. Feature selection cost with  an 
l1-norm ,gradient boosted regression trees are used. Gradient Boosting
feature selection uses gradient boosting framework where trees are built using
greedy CART Algorithm. Following the change 
in the impurity function, features are selected sparsely. Splitting on
new set of features is incurred by a cost . Gradient Boosting In
case of GBFS  is the complexity of its time and memory, d is the number of features
dimentionality and n is the number of
data points.GBFS has high speed in real life.

Feature Selection(GBFS) 
helps to discover nonlinear interaction between features. GBFS also
makes feature selection and classification into one single optimization. GBFS
has pre specified  cost structures.


Step 1:Linear classification and regularization combined
by Lasso is as follows

Step 2:  regularization
introduces sparsity and also regularizes against overfitting, we introduce
capped  norm which is

Step 3: Feature selection and regularization is adjusted as

Step 4:Linear Classifier

Step 5:  is a sparse linear vector. Where h represents all possible regression trees .Classifier is as follows:

Dimensional Sparse Output: As in case of multi label
classification ,there is a L-dimensional
0/1 vector as the output space. L is
a number which could go to million is several applications. Gradient Boosted
Decision Trees(GBDT) runs out of memory in case of vanilla GBDT. Therefore a
new variant of GBDT is used called GBDT-SPARSE using .Sparsity is used to
conduct GBDT training,splitting the nodes,sparse residual computation and sub
linear time prediction. An algorithm is applied to improve model prediction and
its size using GBDT-SPARSE. GBDT is preferred because it takes up very small
memory size.

GBDT for binary classification we have X as the training data, Y
as their labels with an aim that a classification function is chosen to
minimize the summation of the loss function. The estimation function is taken
to be in additive form. Every function which sums up to give the estimation
function belongs to the parameterized set of learners. As the name suggests
Gradient Boosted Decision Trees use decision trees as the base-learners.


Step 1:Firstly we have the General loss function having high
dimensional output

Step 2: The regularization function in the first step equation
is R(F),which is

 represents the tree structure, which means M
leaves of m tree.For jth leaf node in
m tree.

Step 3:The loss function should be differentiable and follow the
following properties:

a)Loss function is summation of loss function for each q.

b)the derivative of the loss function is equal to zero.

Step 4: is the L-dimensional
gradient for the i-th sample,

Step 5: Objective function is minimized  for each tree for which we want to find the
cut value



Analysis: It is taken from a research paper for Greek
Sentiment Analysis. Sentiment analysis plays a pivotal role in text classification.
Modern Greek language was posing complication in sentiment analysis and pre-processing
for Greek are not easily available. Hence pre-processing was done by Google API
translation. Then ensemble classification algorithm of Gradient Boosting was
used to handle large number of features. Basic feature reduction technique like
Principal Component Analysis was tried but Gradient Boosting gave the best performance.
It was also seen that gradient boosting could also handle imbalanced dataset.
Gradient boosting is said to perform well in case of high dimensional data
because it does variable selection. It also assigns variable amount of degrees
of freedom. Class imbalance issue is looked after using bootstrapping method.

in Search Engine: Gradient Boosting is used for learning
to rank the results in a search engine. The purpose of the ranking model is to
produce a permutations of result showing up the relevant and not showing up the
irrelevant. AltaVista was first search engine to use gradient boosting-trained
ranking function. Later it was also used by Yahoo.


City Mobility Application: Development in communication
technologies has impacted smart cities a lot. Location information of people
and their origin-destination data can be obtained by mobile phone data. A model
was implemented to provide personalised route suggestion to the people using a
mobile application. To design this model gradient boosted trees were used to
predict the target or destination value based on the selection of the origin.
Gradient Boosted Trees are best known for predictive analytics. Depending on
the characteristic of the target variable, classification or regression is
decided .Transportation mode is a categorical variable in this case and GBT
classifier would be used. Random explanatory variables and random response
variables are taken and having samples of known pairs, our goal is to predict. Then
loss function is minimized. Gradient boosting is used to increase the stability
of the model.


Major Advantage of Gradient Boosting is
that it handles high dimensional spaces and a very large number of training

It is robust to outliers and scalable.

Gradient Boosting is a very flexible
nonlinear regression procedure and it helps to increase the accuracy of
decision trees tremendously.

Gradient Boosted Regression Trees can
handle heterogeneous data and automatically detects feature interactions.

Gradient boosting support several types
of loss functions, hence has a wide scope.

In Gradient Boosting Feature selection
the algorithm is able to quickly classify very accurate classifier which helps
to select high quality features. The iterative nature of GBFS would permit us
to bias the probability of sampling of a feature by splitting value from previous
iteration helping to avoid unimportant feature selection.

In High Dimensional Sparse Output using
Gradient Boosted Decision Tree reduce model size and also the prediction time.



Gradient Boosting could fail to perform
in case of insufficient data. It is entirely data driven and the classifier
derived is generated from the training data itself. In some application if the
data is very less and restricted then human knowledge would be needed to
suffice the data and derive to a decision.

Gradient Boosting is a non-explicit

If there are overly complex
base-classifiers which are too weak then gradient boosting might not be of much

Gradient boosting might be also
susceptible to noise.

At times no tree is displayed while
using gradient boosting. This might be because several trees are incorporated
together into the model.

When unconstrained, individual tree can
become prone to overfitting which needs to be controlled.

Gradient Boosting require careful tuning.

Gradient Boosted Regression Trees cannot
extrapolate, due to it sequential nature it is hard to parallelize.



Gradient Boosting  Machines are widely applied in real life
data. GBMs give excellent performance in terms of  generalization of a model and its accuracy. GBM
also gives us detailed insight of the model. Having the ability to predict efficiently
and outperform a number of other methods, GBM has an edge over others. The
essential stages of designing a model using gradient boosting is as described
above. The deep knowledge which gradient boosting allows investigation and
analysis of the effects of the model. Gradient boosting also has promising
robotics application for pattern recognition, which would  be the ultimate use of machine learning.