Evaluating Financial Machine Learning Models on Numerai

Don’t just submit and wait, evaluate!

Suraj Parmar
6 min read · Sep 4, 2020
Glowing numerai

Update — DEC 19, 2020: The notebook has been updated according to the new target “Nomi”. TARGET_NAME is now simply “target” instead of “target_kazutsugi”.

Just Give me the code

Note: This isn't a 'Run all' and submit notebook. I have tried to make this flexible so feel free to experiment and customize according to your style and workflow.

See this post on Model Diagnostics; it also has links to community-written posts on the metrics.

Also, check out A guide to “The hardest data science tournament on the planet” if you want to get started with submitting your predictions for the tournament.


Now, having already submitted your predictions, you might be wondering how to improve your models and get better results in the tournament. For a typical machine learning problem, you start with Exploratory Data Analysis, then build a validation pipeline along with a baseline model, and then optimize your model(s).

The first step would normally be to clean and normalize the data, but Numerai already provides cleaned and normalized data. So here we’ll focus on building a validation pipeline using various metrics.

Validation Pipeline

Usually you would set aside a part of the training data (a.k.a. ‘validation data’) to evaluate your trained model. Numerai already provides validation data (the val1 and val2 subsets) on which we can evaluate our experiments.

We can use appropriate metric(s) to check how training is progressing and to compare different models’ performance. The same is true for Numerai data: we have a lot of metrics to evaluate our models. Before we dive into models and evaluation, a very important step is to explore the data.


The data Numerai provides is divided into two files:

  • Training data
  • Tournament data

The catch is that the data is obfuscated: we don’t know which row refers to which stock. However, there is a column, era, which represents a certain time period, and we can certainly use this information. The training data has 310 different features.

Numerai training data sample

The Numerai tournament data contains three different types of data: validation, test and live. It has eras and the same 310 features, and you’ll notice the data is already cleaned and normalized! Our goal is to perform well on this data. We’ll use the entire training data to train and the validation split to evaluate our models.

Note: There are two splits in the validation data, val1 and val2, from different eras. It is suggested that we can train on them too.
Numerai Tournament data sample with data types
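To make those splits concrete, here is a minimal sketch of pulling the validation rows out of the tournament file by filtering on the data_type column. The tiny DataFrame below is made up; the real file has an id, an era, and 310 feature columns.

```python
import pandas as pd

# Toy stand-in for the tournament file (made-up rows).
tournament = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "era": ["era121", "era122", "era150", "eraX"],
    "data_type": ["validation", "validation", "test", "live"],
    "feature_intelligence1": [0.25, 0.5, 0.75, 1.0],
})

# Only the validation rows have targets we can score against offline.
validation = tournament[tournament["data_type"] == "validation"]
print(len(validation))  # 2
```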

In the notebook, we’ll first use these to evaluate two models (Linear Regression vs. Neural Network) and then add more metrics to compare between a simple Neural Network and CatBoost Regressor.

Objective or scoring function

Numerai scoring function

Your predictions are scored based on the Spearman Correlation.
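A minimal sketch of that score, assuming the widely shared community formulation: Spearman is the ordinary Pearson correlation between the targets and the percentile-ranked predictions. The toy series here are made up.

```python
import numpy as np
import pandas as pd

def spearman_corr(targets: pd.Series, predictions: pd.Series) -> float:
    # Rank the predictions first (percentile ranks), then take the
    # plain Pearson correlation of targets vs. those ranks.
    ranked = predictions.rank(pct=True, method="first")
    return float(np.corrcoef(targets, ranked)[0, 1])

targets = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0])
preds = pd.Series([0.1, 0.3, 0.35, 0.6, 0.9])  # same ordering as targets
print(spearman_corr(targets, preds))  # ≈ 1.0: perfect rank agreement
```

Note that only the ordering of your predictions matters to this score, not their absolute values.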

If you have submitted predictions to Numerai before, you may have noticed your predictions being evaluated against these metrics.

This is what the new model diagnostics look like. They will suggest some improvements you can make to your predictions.

It uses stats from historical submissions that performed well and color-codes where your model could improve.

Some Basic Metrics

Some metrics that will be used first

This is an example of what the predictions might look like. They must be in the range (0, 1).

Sample predictions assigned to DataFrame

Mean and Std. Dev. of Predictions: The very first metrics I use to evaluate the distribution of a new model’s predictions. These are just some quick qualitative stats I check while experimenting with new models.
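For instance (toy numbers; with real submissions you would call this on your prediction column):

```python
import pandas as pd

# Made-up predictions from a hypothetical model.
preds = pd.Series([0.2, 0.4, 0.5, 0.6, 0.8])

# Quick distribution sanity check: the mean should sit near 0.5, and the
# spread should not collapse toward zero (a sign of near-constant output).
print(preds.mean())  # 0.5
print(preds.std())   # ~0.2236 (sample std. dev.)
```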

Correlation with example predictions: The Numerai dataset comes with predictions from the integration_test model, in example_predictions_target_kazutsugi.csv, which has shown very good performance.

Per-era Metrics

As eras represent certain periods of time, there might be some correlation between features within the same era. Thus, our aim should be to perform well across these subsets, which represent different time spans of different sizes.

As there is no specific loss function defined, the optimization strategy is very flexible. I used the CatBoost Regressor in the previous notebook, but it is not necessary to use regression alone. You can (and should) experiment with different losses and modeling techniques. Just make sure your predictions are in the range (0, 1). Remember: your predictions are scored based on their correlation with the targets, and the higher the correlation, the better your score.

Mean correlation across eras: The mean of the Spearman correlations between targets and predictions, grouped by era. This is a very important metric, and you should focus on getting a higher mean correlation across eras on unseen data.

Per-era validation correlation
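A sketch of the per-era computation, grouping on the era column as in the community evaluation snippets (the four-row eras below are made up):

```python
import numpy as np
import pandas as pd

def era_correlations(df: pd.DataFrame) -> pd.Series:
    # One Spearman correlation (target vs. ranked prediction) per era.
    scores = {}
    for era, group in df.groupby("era"):
        ranked = group["prediction"].rank(pct=True, method="first")
        scores[era] = np.corrcoef(group["target"], ranked)[0, 1]
    return pd.Series(scores)

# Toy frame: era1 predictions agree with targets, era2 predictions invert them.
df = pd.DataFrame({
    "era": ["era1"] * 4 + ["era2"] * 4,
    "target": [0.0, 0.25, 0.5, 1.0] * 2,
    "prediction": [0.1, 0.2, 0.3, 0.9, 0.9, 0.3, 0.2, 0.1],
})
per_era = era_correlations(df)
print(per_era)         # one score per era
print(per_era.mean())  # ≈ 0 for this symmetric toy example
```

A model can look great on the pooled data while failing badly in a few individual eras, which is exactly what this view exposes.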

Std. Dev. and Sharpe of correlations:

Sharpe = mean / Std.Dev.

Your models should not only have a higher mean correlation but also a lower standard deviation across eras; that suggests a more consistent, well-performing model. Focus on a higher Sharpe ratio along with a higher mean correlation on unseen data, as it is very easy to overfit.
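Given the per-era correlations, the Sharpe is a one-liner (the numbers here are made up):

```python
import pandas as pd

# Hypothetical per-era validation correlations.
per_era_corrs = pd.Series([0.03, 0.05, 0.02, 0.04, 0.06])

# Sharpe: mean per-era correlation divided by its standard deviation.
sharpe = per_era_corrs.mean() / per_era_corrs.std()
print(round(sharpe, 2))  # 2.53
```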

Let’s add more metrics

Here are the metrics you can use to evaluate your predictions. I have covered the very basic and important ones already; let’s explore some more. Here’s the first one that can get you started easily (don’t forget to check out the notebook).

All the metrics you can use

Feature Exposure: The standard deviation of the predictions’ correlation with each feature. Good models usually have low feature exposure.

Feature exposure calculation function
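A simplified sketch of the idea, using plain Pearson correlations (the feature frame below is randomly generated, not real Numerai data):

```python
import numpy as np
import pandas as pd

def feature_exposure(df: pd.DataFrame, feature_cols) -> float:
    # Std. dev. of the prediction's correlation with each single feature.
    corrs = [np.corrcoef(df[col], df["prediction"])[0, 1] for col in feature_cols]
    return float(np.std(corrs))

rng = np.random.default_rng(0)
features = ["feature1", "feature2", "feature3"]
df = pd.DataFrame(rng.random((100, 3)), columns=features)
# A model leaning almost entirely on feature1 -> uneven, high exposure.
df["prediction"] = 0.8 * df["feature1"] + 0.2 * rng.random(100)
print(feature_exposure(df, features))
```

Intuitively, a prediction dominated by a handful of features is fragile: if those features stop working in a future regime, the whole model goes down with them.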

Drawdown: The maximum peak-to-trough drop in cumulative per-era performance, i.e. an estimate of the worst losing streak. You want it to be as close to zero as possible. This is more of a metric related to the risk of loss.

Drawdown calculation
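One common way to compute it, as a sketch: compound the per-era correlations into a cumulative value and measure the largest relative fall from a running peak (the toy correlations below are made up):

```python
import pandas as pd

def max_drawdown(per_era_corrs: pd.Series) -> float:
    # Treat each era's correlation as a period return and compound it.
    cumulative = (per_era_corrs + 1).cumprod()
    running_peak = cumulative.cummax()
    # Largest relative fall from any running peak; 0.0 means no losing streak.
    return float(((running_peak - cumulative) / running_peak).max())

corrs = pd.Series([0.05, 0.03, -0.04, -0.02, 0.06])
print(round(max_drawdown(corrs), 4))  # 0.0592, from the two negative eras
```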

Which one to optimize for?

I think this is one of the reasons why Numerai is “The hardest data science tournament on the planet”! You can stake on the goodness (correlation) of your predictions as well as on uniqueness (MMC) plus goodness.

My suggestion would be to initially focus on per-era mean correlation, Sharpe ratio and feature exposure, and try to beat (or at least get comparable to) example_predictions on these metrics. After that, you can experiment with other metrics, or perhaps create a custom one to share. But do keep in mind that your predictions will be scored based on Spearman correlation.

What’s next? 💭

  1. Read about MMC
  2. Accompanying Colab
  3. Connect on RocketChat or Forum
  4. Read the forum topic on metrics: “More Metrics for ya”
  5. Read about and join the weekly Office Hours (I was interviewed in OHwA S02E10)


My two models at their highest on the leaderboard 😃

On August 26th, 2020

Thanks to Natasha-Jade for feedback and Michael Phillips for suggestions on metrics. Also, Jon Taylor for some intuitive explanation.