Over the past few years, Machine Learning has taken a leading role in the discovery of data-driven solutions. Among these, classification is one of the most commonly used areas of Machine Learning, widely applied in fraud detection, image classification, ad click-through rate prediction, identification of medical conditions, and many other fields. There is a range of classification algorithms, but over the years the single-model approach has been giving way to ensemble methods, which combine several different algorithms and produce more accurate results than the individual models. If you have ever applied an ensemble method to a big data set, you have almost certainly run into a very common problem: the computation takes hours, sometimes even days or weeks, unless you have a powerful machine.
At the Higgs Boson Data Science competition, everyone’s attention was caught by XGBoost, a new classification algorithm which outperformed all the other Machine Learning algorithms used in the competition and won its developers first place. By its nature, XGBoost is similar to GBM, because it is a tree-based approach, but its flexibility, scalability, and exceptional accuracy are superior to GBM and other classification methods.
Here are some of the main reasons why you should consider using XGBoost for your next classification problem:
- Out-of-core computation. The ability to use parallel and distributed computation is key when working with Big Data. It considerably reduces computation time and the load on your machine, so the algorithm runs faster and uses fewer computing resources.
- Built-in Cross-Validation. Makes it easy to tune the model and check for overfitting without any additional iterations or tools like GridSearch (see the sketch after this list).
- Flexibility. XGBoost provides a number of parameters which you can adjust based on the problem you are working on, for example, different objectives or evaluation functions.
- Ability to handle sparse data. This makes it easier and faster to work with huge amounts of data.
- Ability to handle missing data and imbalanced classes.
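To make the last few points concrete, here is a minimal sketch of the Python API, with made-up data and illustrative parameter values (not recommendations): xgb.cv runs the built-in cross-validation, the missing argument of DMatrix tells XGBoost how missing values are encoded, and scale_pos_weight is the usual knob for imbalanced classes.

import numpy as np
import xgboost as xgb

# Made-up data purely for illustration: 1,000 rows, 10 features, binary labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# DMatrix is XGBoost's internal data structure; np.nan values are treated as missing.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)

params = {
    'objective': 'binary:logistic',   # one of several built-in objectives
    'eval_metric': 'auc',
    'max_depth': 4,
    'eta': 0.1,
    'scale_pos_weight': 5,            # up-weights the positive class when it is rare
}

# Built-in cross-validation: no GridSearch or hand-rolled folds required.
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    early_stopping_rounds=10, seed=42)
print(cv_results.tail())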
Recently, XGBoost has become a winning solution in many Kaggle Data Science competitions. Predicting Red Hat Business Value was the first Kaggle competition I participated in myself, a few months ago. Without a doubt, my initial choice for that competition was XGBoost, which brought me into the top 20% of best-performing solutions. XGBoost outperformed all the other classification algorithms I applied, such as GBM, Random Forests, and SVM, and for that reason it became my usual preference for dealing with classification problems.
XGBoost is implemented in Python and a wide range of other programming languages; it is also compatible with Hadoop and will soon be usable with Spark. But if you are a Windows user who wants to use this algorithm in Python, you are very likely to bump into installation problems, because the usual package installation methods, such as pip install or running the setup.py file, don’t work for this package. Below is a detailed guide on how to install XGBoost on a Windows machine without getting into the XGBoost installation nightmare.
XGBoost installation on a Windows machine
XGBoost installation may look like a complicated process, but it is actually quite straightforward once you take care of a few crucial steps:
- Make sure you have Git installed. You can check whether you already have it by typing ‘git’ in your Command Prompt. If it prints the list of Git commands, you already have it on your machine. Otherwise, simply download and install it from here.
- Install Mingw64 (version 4.9.1 or later is required). Mingw64 contains the make and gcc compilers, which are necessary to build XGBoost. The newest Mingw64 build can be downloaded from here (if you would like to use another version, you can find the list of all builds by following this link). All you have to do is download the version you want, unzip it, open the unzipped folder, copy the Mingw64 subfolder, and place it in your preferred location (the C: drive is my suggestion).
- Add Mingw64 to the PATH Environment Variable.
1. Open the Mingw64 folder which you just copied and go to the bin subfolder. Here you will see a number of compilers and tools which come with Mingw64. Check that you can find the gcc and mingw32-make files. If you do, right-click on any of the files in the folder, select ‘Properties’, and copy the path shown next to ‘Location:’.
2. Open your System Properties (you can do this by typing ‘environment variables’ in your Start menu search box, or in Cortana if you are using Windows 10, and clicking ‘Edit the system environment variables’ in the list of results). On the System Properties window, select the ‘Advanced’ tab and click ‘Environment Variables’. In the ‘User variables for’ section, double-click PATH.
3. Once the Edit environment variable window opens, select ‘New’, paste the path you copied earlier, and close all windows by clicking ‘OK’.
4. (For Windows 7 users) If you use Windows 7, you can add the PATH environment variable by following the same procedure. The only difference is that after opening the Environment Variables window you should double-click ‘Path’ in the System variables section and paste the copied path at the end of ‘Variable value’ (don’t forget to add a semicolon before pasting it).
- Check that the compilers are accessible. Open your Command Prompt and type ‘mingw32-make’. This should return the message ‘mingw32-make: *** No targets specified and no makefile found. Stop.’ You can also type ‘gcc --version’ to confirm that the gcc compiler is reachable.
Once everything is ready, we can start the installation:
- Open Git Bash (this terminal comes with Git installation). You can also use any other bash terminal, for example, Cygwin.
- Navigate to the directory where you want XGBoost to be downloaded. For example:
cd C:/Users/XGBoost_User/Anaconda3/Lib
- Download the XGBoost package:
git clone --recursive https://github.com/dmlc/xgboost
- Once the download is done, navigate to the downloaded package:
cd xgboost
- Initialise the local configuration file and fetch the submodule data:
git submodule init
git submodule update
- Make the build (this may take a couple of minutes to run):
cp make/mingw64.mk config.mk
mingw32-make -j4
- Once the build is done, change directory to the python-package folder:
cd python-package
- Install XGBoost:
python setup.py install
- Congratulations! The installation is done. Launch Python in your Command Prompt or in whichever Python IDE you use for writing Python code and start using XGBoost:
import xgboost
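To check that the build works end to end, you can train a toy model. This is just a sanity-check sketch with made-up data:

import numpy as np
import xgboost as xgb

# Tiny made-up dataset: four samples, two features, binary labels.
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)
y = np.array([0, 0, 1, 1])

dtrain = xgb.DMatrix(X, label=y)
model = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=5)
print(model.predict(dtrain))  # prints four probabilities, one per sample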
If you would like to learn more about the XGBoost package, you can read the official XGBoost documentation: https://xgboost.readthedocs.io/en/latest/
Machine Learning is an incredibly powerful and exciting part of Data Science. Strong and flexible tools are what we need to uncover the unknown side of data, and there is no doubt that XGBoost is one of them. I hope this blog post has helped you prepare to take your Machine Learning applications to the next level.