XGBoostLM

Suraj Parmar
3 min read · Oct 26, 2023


Somewhere between N-grams and Neural Probabilistic Language Model

Just give me the code

Why?

> https://x.com/parmarsuraj99/status/1716839925859881212?s=20

the reason, the inspiration, the Grand Master, the xgboost, on X 🙏

Language modelling is […]

the task of predicting the next word given a sequence of previous words. In this case, we try to predict the next character; specifically, the next character in a corpus containing only “xgboost0”. Since large language models are able to learn language, which is a human construct (an encoding) of the underlying real world, the question is: can XGBoost learn something similar, but much, much simpler, using simple models?

From Words

During my recent presentation on language models (mainly encoders) at the Numerai Council of Elders meet-up in Toronto, I began with the basic idea behind predicting the next word: calculate the probabilities of N words appearing in a sequence, then use the probability of the last word appearing given the previous N-1 words.

slide from Numerai Toronto meet-up. Link in the references
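As a rough illustration of that counting idea, here is a minimal bigram sketch (N = 2). The toy corpus and helper names are mine, not from the slides:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat sat ."  # toy corpus for illustration
words = corpus.split()

# Count how often each word follows each previous word (a bigram model)
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next | prev) from the bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.67, 'mat': 0.33} on this toy corpus
```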

To Neural models

For neural models, we need to transform words into numbers, starting with one-hot encoding.

slide from Numerai Toronto meet-up. Link in the references
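A minimal sketch of that one-hot step, using a made-up toy vocabulary (not the one from the slides):

```python
import numpy as np

vocab = list("xgbost0")                      # toy character vocabulary for illustration
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    """Turn a character into a one-hot vector of length len(vocab)."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[char_to_idx[ch]] = 1.0
    return vec

print(one_hot("g"))  # [0. 1. 0. 0. 0. 0. 0.]
```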

XGBoostLM

This fun experiment tries to predict the next character in a corpus given a sequence of previous characters. Here, the objective was to overfit on the word “xgboost” with as few labels as possible.

The easiest scheme to transform characters into numbers was to assign an index to each unique character and then encode that index as bits.

XGBoost can predict multiple targets, so this can be set up as a binary classifier with 8 targets, one per bit of a byte. Here, I kept only 5 bits to reduce the number of classifiers needed, since the dictionary size is 26 (letters) + 1 (eos token) = 27. With 5 bits we get 32 combinations, of which we use 27.

encoding characters to bytes
encoding steps
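A sketch of that encoding, assuming a vocabulary of the 26 lowercase letters plus an end-of-sequence token (the exact index assignments in the original code may differ):

```python
import numpy as np

# 26 letters + one end-of-sequence token -> 27 symbols, which fit in 5 bits (2**5 = 32)
EOS = "<eos>"
vocab = list("abcdefghijklmnopqrstuvwxyz") + [EOS]
char_to_idx = {c: i for i, c in enumerate(vocab)}
N_BITS = 5

def encode(ch):
    """Map a character to its vocabulary index, then to a 5-bit 0/1 vector (LSB first)."""
    idx = char_to_idx[ch]
    return np.array([(idx >> b) & 1 for b in range(N_BITS)], dtype=np.uint8)

def decode(bits):
    """Map a 5-bit 0/1 vector back to a character; unused indices fall back to EOS."""
    idx = int(sum(int(b) << i for i, b in enumerate(bits)))
    return vocab[idx] if idx < len(vocab) else EOS

print(encode("x"))          # index 23 -> [1 1 1 0 1]
print(decode(encode("x")))  # 'x'
```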

You can check out the code to see the specific details.
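For a rough idea of how such a multi-target setup could be trained, here is a sketch reusing the encode helper above; the context length and corpus here are my assumptions, not the author’s exact code, and it relies on XGBoost 1.6+ supporting a 2-D binary target with the hist tree method:

```python
import numpy as np
from xgboost import XGBClassifier

CONTEXT = 3  # number of previous characters used as features (my choice, not from the post)

# Toy stand-in for the "xgboost0" corpus, with EOS playing the role of the "0" separator
corpus = (list("xgboost") + [EOS]) * 200

# Build (previous-character indices -> next-character bits) training pairs
X, y = [], []
for i in range(CONTEXT, len(corpus)):
    X.append([char_to_idx[c] for c in corpus[i - CONTEXT:i]])
    y.append(encode(corpus[i]))
X, y = np.array(X), np.array(y)

# With a 2-D 0/1 target, XGBClassifier fits one binary classifier per bit,
# i.e. 5 independent targets
model = XGBClassifier(n_estimators=50, max_depth=3, tree_method="hist")
model.fit(X, y)
```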

Results

Does it work? Of course, it’s overfit.

Since the model was overfit on a corpus containing “xgboost0” repeated over and over, it is expected to memorize it.

results
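To see the memorization in action, a generation loop could look roughly like this, reusing the helpers and model from the sketches above (again an illustrative reconstruction, not the original code):

```python
import numpy as np

def generate(seed, n_chars=24):
    """Autoregressively predict characters, feeding each prediction back into the context."""
    context = list(seed)               # seed characters must be in the vocabulary
    out = list(seed)
    for _ in range(n_chars):
        x = np.array([[char_to_idx[c] for c in context[-CONTEXT:]]])
        bits = model.predict(x)[0]     # one 0/1 prediction per target bit
        ch = decode(bits)
        context.append(ch)
        out.append("0" if ch == EOS else ch)  # render EOS like the "0" in the corpus
    return "".join(out)

print(generate("xgb"))  # ideally continues with "oost0xgboost0..."
```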

This is not a neural approach, so the multiple target predictors in XGBoost are independent of each other: the predictor for one bit has no context from the predictors for the other bits. This is where neural networks with fully connected layers shine. So this can be thought of as something between N-grams and the Neural Probabilistic Language Model.

Overall, this was just a fun experiment. Thanks to Bojan Tunguz for the inspiration. How about training your own statistical language model on your own corpus with the code provided? I hope it helps.

XGBoost is (almost) all you need.

References
