From r/WSB to Numerai Signals

Suraj Parmar
3 min readJan 28, 2021

--

Using r/WallStreetBets data for Numerai Signals submission.

I stumbled upon Arjun Rohlfing-Das ‘s excellent post on Sentiment Analysis for Trading with Reddit Text Data that uses r/wallstreetbets data for sentiment analysis which seems to be holding predictive power.

Edit: https://twitter.com/parmarsuraj99/status/1358718493261201409?s=20 :)

This project was discussed in the Lex Fridman’s podcast with Richard Craib (ep. 159).

Just give me the code

This notebook is built upon the work of Arjun Rohlfing-Das’s notebook. It predicts for entire market. I have modified it to work with all listed symbols.

This is a ‘Run All’ notebook. Once you have setup the PRAW credentials, all you have to do is, just click Run all from colab and it will grenerate a .csv that you can submit to the tournament.

Since, sentiments are quantified using ML models, what about using this as a feature for Numerai Signal’s tournament. This can be combined with a strategy you are currently using or submit these scores directly as I did.

Workflow

  1. Symbols in the Signal universe
  2. Collect Reddit data using PRAW
  3. Symbol filtering
  4. To the moon? (Sentiment analysis)
  5. Rolling average of daily scores for top 400 extreme symbols
  6. Submit

Symbols in the Signals universe

Latest tickers(Symbols) in the universe can be downloaded using NumerAPI with Bloomberg to Yahoo mapping.

Tickers mapping and universe

Collecting Reddit data using PRAW

You’ll need to setup credentials for PRAW. Check this article on Scrapping Reddit data. “Daily Discussion” data is scrapped with all comments. You can filter the comments by the number of up votes it has as not every comment will be useful.

List of comments
Scrapping using PRAW

Symbol filtering

If you explore r/WSB and look at the symbols, you’ll find some ambiguity in the way stocks are mentioned. Terminology is different so I decided to split the symbols by space. i.e, TSLA US -> TSLA and only consider new symbols with length ≥2 .

Another criteria for filtering symbols is stopwords. You might want to use appropriate stopwords that reddit users use which are also in the symbol list. This gives false impression of stock being discussed.

The symbols with only numbers are also removed because they may create ambiguity.

To the moon?

  1. Score all comments for a day based on sentiment

I have used VADER sentiment analysis model from NLTK. A better model can be used. Or, You can take historical Signals targets and historical comments data and train a simple classification model on that. So, It is trained on reddit data optimized for Signal’s targets.

2. Log all tickers mentioned in those comments

3. Assign the daily sentiment to tickers involved

Sentiment analysis for all stocks

Finding extreme stocks and scoring

Not all of 5k stocks will be discussed there. Here, the stocks having most positive and most negative sentiments across all the days are selected.

A rolling window of 14 days is applied and the scores for last day and will be used for submission.

Scoring

Submission

Since the submission need tickers in the Bloomberg format, we need to re-map the filtered tickers back to original Bloomberg tickers. This may cause a hash collision so I have used the first occurrence of Bloomberg ticker to be used.

Submission

As usual, this is a Run all Colab notebook once you setup PRAW. This will create accept-worthy submissions.

What’s next

  1. Try different model
  2. Better cleaning of comments (use upvotes)
  3. Combine the daily sentiment scores with your current Numerai strategy.
  4. Predict for weeks in the validation data to see diagnostics.

Below is a plot of Fourier transform of market sentiment vs. SPY ticker. This shows, there is some predictive power in the r/WSB.

Market sentiment vs. S&P500

--

--