XGBoost and Boosted Decision Trees in Python and sklearn

XGBoost is an extremely well-tested and popular supervised machine learning algorithm.

Fundamental Idea Behind XGBoost

The first thing we need to talk about is decision tree ensembles. XGBoost is designed with the principles of ensemble modeling in mind: the idea is to train multiple "weak learner" classifiers and combine their outputs to make predictions.

Let's start with a simple example. Say we want to predict whether someone likes computer games. This decision may depend on many different features, such as age, gender, and so on. To make a prediction, we score each person based on the outcome from the tree(s). This is done by assigning a value y to each output leaf and letting y > 0 result in a positive prediction (e.g. the person likes computer games). A single tree could look something like this:

[Figure: example decision tree predicting whether a person likes computer games]
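
To make the leaf-scoring idea concrete, here is a minimal Python sketch of a single tree like the one above. Since the figure's exact splits and leaf values are not reproduced here, the age threshold and the scores below are made up purely for illustration:

# A single decision tree written as nested if/else statements.
# Splits and leaf values are hypothetical, chosen only for illustration.
def tree(person: dict) -> float:
    if person["age"] < 15:
        if person["gender"] == "male":
            return 2.0    # leaf score: strongly positive
        return 0.1        # leaf score: weakly positive
    return -1.0           # leaf score: negative

person = {"age": 10, "gender": "male"}
print(tree(person) > 0)   # True -> predicted to like computer games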

Using just one tree, we see that all the children are likely to like computer games while the adults won't. Let's try to improve this by adding more trees; in general, the results improve significantly when multiple trees are combined. Our model now looks like this:
[Figure: ensemble of two decision trees, Tree 1 and Tree 2]

By combining the results from Tree 1 and Tree 2, we can already make more sophisticated predictions by summing up the output from each tree. The person most likely to like computer games is then the one with the highest summed score across the trees.
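
Here is a sketch of the two-tree version. The splits and leaf values are again hypothetical, and the second tree's feature (daily computer use) is invented for the example; the point is only that the final score is the sum of the two leaf scores:

def tree_1(person: dict) -> float:
    # Hypothetical first tree: splits on age, then gender.
    if person["age"] < 15:
        return 2.0 if person["gender"] == "male" else 0.1
    return -1.0

def tree_2(person: dict) -> float:
    # Hypothetical second tree: splits on daily computer use.
    return 0.9 if person["uses_computer_daily"] else -0.9

people = [
    {"age": 10, "gender": "male", "uses_computer_daily": True},
    {"age": 40, "gender": "female", "uses_computer_daily": False},
]
# Ensemble score = sum of the scores from each tree.
print([tree_1(p) + tree_2(p) for p in people])  # [2.9, -1.9]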

Mathematically, the above model can be described as follows:

 \hat{y}_i = \sum_{k=1}^{K} f_k(x_i)

where K is the number of trees and f_k(x_i) is the score the k-th tree assigns to the i-th person's features x_i.
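
The formula only defines the additive model, not how the trees are trained. As a preview of the boosting idea, here is a minimal sketch that builds such a sum with scikit-learn's DecisionTreeRegressor, fitting each new tree to the residuals of the current ensemble; the toy data and all parameters are made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: 200 people with two features, target is a "likes games" score.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] < 0.5, 2.0, -1.0) + 0.3 * rng.standard_normal(200)

K = 5                       # number of trees f_1, ..., f_K
pred = np.zeros(len(y))     # running ensemble prediction y_hat
for _ in range(K):
    f_k = DecisionTreeRegressor(max_depth=2)
    f_k.fit(X, y - pred)    # fit the k-th tree to the current residuals
    pred += f_k.predict(X)  # y_hat = sum over k of f_k(x)

print(np.mean((y - pred) ** 2))  # training error shrinks as trees are added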

Now we are ready to move on to boosting trees and XGBoost.
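
Before looking at the algorithm itself, here is a minimal sketch of what fitting such a boosted tree ensemble looks like with XGBoost's scikit-learn-compatible API (assuming the xgboost package is installed); the data and hyperparameters are again only illustrative:

import numpy as np
from xgboost import XGBClassifier

# Toy data: 1 means the person likes computer games.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))          # e.g. [age, gender] after encoding
y = (X[:, 0] < 0.5).astype(int)

# An ensemble of 50 shallow trees, combined as in the formula above.
model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))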

The XGBoost Algorithm and How it Works

Content on the way!

Summary

What do you think about XGBoost after this tutorial? XGBoost is an important algorithm and the go-to choice for many in both education and industry due to its proven results and robustness. If you enjoyed this guide, you may also like some of these.