Jane Street Market Prediction 🎯
Jane Street Market Prediction Kaggle Competition
Scored 9443.499 (249th place out of 3616 competitors) using an MLP.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout, Concatenate, Lambda, GaussianNoise, Activation
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers.experimental.preprocessing import Normalization
import tensorflow as tf
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
from random import choices
!pip install datatable > /dev/null
import datatable as dt
from sklearn import impute
import gc
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)
As discussed in my EDA notebook, we have a couple of options for handling null values:
- Drop all nans
- Impute with median or mean
- Forward fill / backward fill
- KNN imputer
- Be creative!
In this notebook, I used a KNN imputer with 5 nearest neighbors to fill the NaNs. This takes a long time to run, so I suggest downloading the imputed data files from here by louise2001. Note that he also uploaded soft and iterative imputes.
Here we just load the imputed data instead of running the imputation in this notebook, since it is very time-consuming and takes a lot of RAM.
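For reference, here is a minimal sketch of what 5-nearest-neighbor imputation looks like on a tiny toy matrix. It uses scikit-learn's `KNNImputer`; the toy array `X_toy` is made up for illustration and I am assuming (not claiming) that the downloaded files were produced in a similar way.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with missing values (a stand-in for the real feature columns).
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
    [2.0, np.nan, 1.0],
    [5.0, 5.0, 4.0],
])

# Each NaN is replaced by the mean of that column over the 5 nearest rows
# that have the column observed. Distances ignore missing coordinates.
imputer = KNNImputer(n_neighbors=5)
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

On the real data the neighbor search over millions of rows is what makes this so slow, which is why loading precomputed files is the practical choice.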
imputed_df = dt.fread('../input/data-wrangling/imputed.csv')
imputed_df = imputed_df.to_pandas()
train = dt.fread('../input/data-wrangling/one_on_top.csv')
train = train.to_pandas()
df = pd.concat([train, imputed_df], axis=1, ignore_index=False)
del train, imputed_df
gc.collect()
We first do two preprocessing steps right off the bat.
- We drop any rows whose 'weight' column equals 0. A zero weight means the trade contributes nothing to the overall gain, so training on it would be like asking the model to just guess.
- We drop all dates up to day 85. The reason can be shown visually below: before day 85, the trend is clearly and drastically different from the rest of the data.
df = df.query('date > 85').reset_index(drop = True)
df = df[df['weight'] != 0]
Note that we only have 130 features compared to over 2 million rows of data. We can easily add more features without running into the curse of dimensionality.
# Add action column (this is our target)
df['action'] = ((df['resp'].values) > 0).astype(int)
# feature names
features = [c for c in df.columns if "feature" in c]
# resp names
resp_cols = ['resp_1', 'resp_2', 'resp_3', 'resp', 'resp_4']
df = df.loc[:, df.columns.str.contains('feature|resp', regex=True)]
Let us apply a log transform and add the results as new columns to the dataframe. Since transforming all features gives me an out-of-memory error, let's do this only on group_0, the features that have tag_0 in features.csv. For more information, check out my EDA notebook.
tag_0_group = [9, 10, 19, 20, 29, 30, 56, 73, 79, 85, 91, 97, 103, 109, 115, 122, 123]
for col in tag_0_group:
    # shift the column so its minimum becomes 1, then take the log
    df[f'log_{col}'] = np.log(df[f'feature_{col}'] - df[f'feature_{col}'].min() + 1)
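One thing to watch with this shifted log: the shift depends on the column minimum, so the same minimum has to be reused on new data or the train and test columns will not match. Below is a small sketch of one way to do that. The helper names `fit_log_shift` and `apply_log_shift` are hypothetical, and the clipping step is an extra safeguard I am adding here, not something the notebook does.

```python
import numpy as np
import pandas as pd

def fit_log_shift(train_df, cols):
    """Record the per-column minimum seen in the training data."""
    return {col: train_df[f"feature_{col}"].min() for col in cols}

def apply_log_shift(df, cols, train_min):
    """Add log_<col> columns using the stored training minimum."""
    out = df.copy()
    for col in cols:
        shifted = out[f"feature_{col}"] - train_min[col] + 1
        # New data can fall below the training minimum, which could push the
        # argument to zero or below, so clip at a small positive number.
        out[f"log_{col}"] = np.log(shifted.clip(lower=1e-6))
    return out

toy_train = pd.DataFrame({"feature_9": [-2.0, 0.0, 3.0]})
toy_test = pd.DataFrame({"feature_9": [1.0]})

mins = fit_log_shift(toy_train, [9])
train_out = apply_log_shift(toy_train, [9], mins)
test_out = apply_log_shift(toy_test, [9], mins)
print(train_out)
print(test_out)
```

The stored minimums are tiny (one number per column), so they are cheap to keep around for the inference loop.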
Other ideas for feature engineering:
- aggregating categorical columns by 'tags' on features.csv
- count above mean, mean abs change, abs energy
- log transform, kurt transform and other transforms
- get creative!
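To make the second idea concrete, here is a rough sketch of the three statistics on a plain 1-D sequence. The function names follow the wording in the list; how you would group the anonymized features into sequences is left open and is an assumption on my part.

```python
import numpy as np

def count_above_mean(x):
    """Number of values strictly greater than the mean of x."""
    x = np.asarray(x, dtype=float)
    return int(np.sum(x > x.mean()))

def mean_abs_change(x):
    """Average absolute difference between consecutive values."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(np.diff(x))))

def abs_energy(x):
    """Sum of squared values."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

values = [1.0, 2.0, 3.0, 4.0]
print(count_above_mean(values), mean_abs_change(values), abs_energy(values))
# prints: 2 1.0 30.0
```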
Reasons not to do more feature engineering:
- We have no idea what the features represent, so new features might be meaningless or even harmful
- The dataset is really big, so adding even a couple more columns makes me run out of memory
- Much slower computation
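We are going to use approximately 20000 rows as the test set (test_size=0.01). Our target is the action we defined above: 1 when resp is above 0 (a positive trade) and 0 otherwise. We compute this label for each of the five resp columns, so every row has five binary targets.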
# Train test split
from sklearn.model_selection import train_test_split
X = df.loc[:, df.columns.str.contains('feature|log')]
y = np.stack([(df[c] > 0).astype('int') for c in resp_cols]).T
del df
gc.collect()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42, shuffle=True)
del X, y
gc.collect()
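To see what the stacked target looks like, here is the same `np.stack` expression applied to a made-up three-row dataframe (the numbers are invented for illustration only).

```python
import numpy as np
import pandas as pd

resp_cols = ['resp_1', 'resp_2', 'resp_3', 'resp', 'resp_4']

# Invented values, only to show the shape and contents of the target.
toy = pd.DataFrame({
    'resp_1': [0.01, -0.02, 0.00],
    'resp_2': [0.03, -0.01, 0.02],
    'resp_3': [-0.01, -0.03, 0.04],
    'resp':   [0.02, -0.02, 0.01],
    'resp_4': [0.05, 0.01, -0.01],
})

# One binary column per resp column; rows stay aligned with the features.
y_toy = np.stack([(toy[c] > 0).astype('int') for c in resp_cols]).T
print(y_toy.shape)
print(y_toy)
```

Note that a resp of exactly 0 is labeled 0, since the comparison is strict.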
Implementation #2
Algorithms & Techniques
For technique, we already applied a lot of what we learned in the EDA to our dataset (feature engineering, imputing nulls, dropping the first 85 days, etc.). For the algorithm, we are going to use machine learning.
Now we have our data ready for training. There are hundreds of classifier models we can choose from and explore. However, after studying the Kaggle notebooks other participants have submitted, all the high-scoring models seem to use neural networks. I am going to try a random forest classifier and an MLP here. Random forests are always a good early choice because they are easy to build and evaluate. Neural networks are good at learning complicated functions given the right hyperparameter tuning.
Metrics
Since this is a multi-label classification problem (the 5 'resp' columns give us 5 positive-vs-negative target variables), for performance metrics we are going to use AUC (area under the ROC curve) as well as plain accuracy. Evaluating these metrics on unseen data shows how well the model generalizes, helps us catch overfitting, and points to areas for improvement. Sklearn and Seaborn provide great tools for computing and plotting these metrics as well.
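As a small sketch of how the two metrics behave on multi-label output, here they are on a made-up example with two labels and four rows. I am assuming macro-averaged AUC via scikit-learn's `roc_auc_score`; the accuracy line mirrors the element-wise accuracy used later in this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities: 4 rows, 2 labels.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_prob = np.array([[0.90, 0.2], [0.40, 0.7], [0.35, 0.8], [0.30, 0.1]])

# AUC is computed per label from the ranking of the scores, then averaged.
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

# Element-wise accuracy after thresholding at 0.5.
y_pred = (y_prob >= 0.5).astype(int)
accuracy = np.sum(y_pred == y_true) / y_true.size

print(macro_auc, accuracy)
# prints: 0.875 0.875
```

The two numbers happen to match here, but they measure different things: AUC only cares about ranking, while accuracy depends on the 0.5 threshold.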
Complications
Note that the worst complication I faced throughout the rest of this notebook was the size of the data. Depending on your computer's RAM and GPU speed, your experience will vary. In my case, I ran out of memory hundreds of times. To avoid this, try training in the cloud. If that is not an option, make sure to save your computed data frequently and free RAM with del and gc.collect() as much as possible.
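Here is a minimal sketch of the save-then-free pattern described above, on a small random array. The file name, the temporary directory, and the use of `mmap_mode="r"` when reloading are my own choices for illustration, not something the notebook prescribes.

```python
import gc
import os
import tempfile

import numpy as np

# Stand-in for an expensive intermediate result.
big = np.random.default_rng(0).normal(size=(1000, 130)).astype(np.float32)
checksum = float(big.sum())

# Save it to disk so it never has to be recomputed.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, big)

# Drop the in-memory copy and ask the garbage collector to reclaim it.
del big
gc.collect()

# Reload lazily: with mmap_mode the data stays on disk until it is accessed.
restored = np.load(path, mmap_mode="r")
print(restored.shape)
```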
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=32, n_jobs=-1, verbose=2)
rnd_clf.fit(X_train, y_train)
test_pred = rnd_clf.predict(X_test)
test_pred = np.rint(test_pred)
test_acc = np.sum(test_pred == y_test)/(y_test.shape[0]*5)
print("test accuracy: " + str(test_acc))
So we got about 52.4% accuracy with random forest.
From the confusion matrix, we can tell that the model has a harder time predicting 0's correctly. It is actually doing a good job of classifying 1's, though! So with this model, we can expect to take lots of good trades but also fail to avoid bad trades.
Result 1 implementation
This was our first-pass solution. Although we got an accuracy of 52.4%, when submitted to Jane Street for evaluation it returned a score of 0, meaning we lost more than we gained. (The competition didn't return negative scores and only counted positive gains.) This suggests that although we got more trades right than wrong, the scale of the trades we predicted incorrectly outweighed our correct predictions.
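To show how accuracy and the competition score can disagree, here is a sketch of the utility as I understand it from the competition's evaluation page: the daily profit is the sum of weight × resp × action, a Sharpe-like ratio t is computed over days and clipped to the range 0 to 6, and the score is t times the total profit. The function name `utility_score` and the toy numbers are mine.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Sketch of the utility: clip(t, 0, 6) * total profit."""
    pnl = (np.asarray(weight, dtype=float)
           * np.asarray(resp, dtype=float)
           * np.asarray(action, dtype=float))
    # Sum the profit of every trade within each day.
    _, day_idx = np.unique(np.asarray(date), return_inverse=True)
    p = np.bincount(day_idx, weights=pnl)
    denom = np.sqrt(np.sum(p ** 2))
    if denom == 0:
        return 0.0
    t = p.sum() / denom * np.sqrt(250 / len(p))
    return float(min(max(t, 0.0), 6.0) * p.sum())

# Three small winning trades taken correctly, one heavy losing trade taken too.
date = [0, 0, 0, 0]
weight = [1, 1, 1, 10]
resp = [0.01, 0.01, 0.01, -0.02]
action = [1, 1, 1, 1]  # 3 of 4 decisions are right (75% accuracy)
print(utility_score(date, weight, resp, action))
```

Three out of four decisions are correct, yet the single heavy losing trade makes the total profit negative, so the clipped score is zero. That is the same effect I saw with the random forest submission.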
A classic multilayer perceptron with AUC (area under the curve) as the metric. After looking at many notebooks on Kaggle, MLPs seem to perform the best with a short run time. Let us build one ourselves.
def create_mlp(
    num_columns, num_labels, hidden_units, dropout_rates, label_smoothing, learning_rate
):
    # input block: normalize the raw features, then apply the first dropout
    inp = tf.keras.layers.Input(shape=(num_columns,))
    x = tf.keras.layers.BatchNormalization()(inp)
    x = tf.keras.layers.Dropout(dropout_rates[0])(x)
    # hidden blocks: Dense -> BatchNorm -> swish -> Dropout
    for i in range(len(hidden_units)):
        x = tf.keras.layers.Dense(hidden_units[i])(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(tf.keras.activations.swish)(x)
        x = tf.keras.layers.Dropout(dropout_rates[i + 1])(x)
    # one sigmoid output per label (multi-label, not softmax)
    x = tf.keras.layers.Dense(num_labels)(x)
    out = tf.keras.layers.Activation("sigmoid")(x)
    model = tf.keras.models.Model(inputs=inp, outputs=out)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.BinaryCrossentropy(label_smoothing=label_smoothing),
        metrics=[tf.keras.metrics.AUC(name="AUC")],
    )
    return model
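The `label_smoothing` argument deserves a quick illustration. As far as I know, Keras' binary cross-entropy squeezes the targets toward 0.5 (a target of 1 becomes 1 - ls/2 and a target of 0 becomes ls/2) before computing the loss. Here is a NumPy sketch of that idea; the function `smoothed_bce` is my own stand-in, not the Keras implementation.

```python
import numpy as np

def smoothed_bce(y_true, y_prob, label_smoothing=1e-2, eps=1e-7):
    """Binary cross-entropy with targets squeezed toward 0.5."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so the logs stay finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_smooth = y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing
    loss = -(y_smooth * np.log(y_prob) + (1 - y_smooth) * np.log(1 - y_prob))
    return float(loss.mean())

# Very confident, correct predictions.
y = [1, 0]
p = [0.999, 0.001]
print(smoothed_bce(y, p, label_smoothing=0.0))
print(smoothed_bce(y, p, label_smoothing=1e-2))
```

With smoothing, extremely confident predictions are penalized a little even when they are right, which discourages the network from pushing its outputs all the way to 0 or 1 on noisy labels.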
batch_size = 4096
hidden_units = [150, 150, 150]
dropout_rates = [0.20, 0.20, 0.20, 0.20]
label_smoothing = 1e-2
learning_rate = 3e-3
#with tpu_strategy.scope():
clf = create_mlp(
    X_train.shape[1], 5, hidden_units, dropout_rates, label_smoothing, learning_rate
)
clf.fit(X_train, y_train, epochs=100, batch_size=batch_size)
models = []
models.append(clf)
test_pred = clf.predict(X_test)
test_pred = np.rint(test_pred)
test_acc = np.sum(test_pred == y_test)/(y_test.shape[0]*5)
print("test accuracy: " + str(test_acc))
This is actually good! Although one could say that the machine is only doing slightly better than I would if I went to Jane Street and randomly decided to 'action' on trades.
It is important to note that even though we are only getting around 55% accuracy, this is actually considered good for trading markets. To explain: since Jane Street has billions of dollars, as long as they have a positive expected return, individual losses don't matter much, because in the long run they will gain more than they lose. It is like going to a casino knowing you have a better chance of winning than losing. The more time you spend there, the more you gain!
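The casino argument can be put into numbers with a deliberately simplified model: every trade is independent and wins or loses the same fixed stake. That is a strong assumption (real trades differ a lot in size, as the zero score above shows), and the function `expected_edge` is only a hypothetical helper for the arithmetic.

```python
import math

def expected_edge(p_win, n_trades, stake=1.0):
    """Mean and standard deviation of total profit for +/- stake bets."""
    # Each trade earns +stake with probability p_win, otherwise -stake.
    mean = n_trades * stake * (2 * p_win - 1)
    std = stake * math.sqrt(n_trades * 4 * p_win * (1 - p_win))
    return mean, std

for n in (100, 10_000):
    mean, std = expected_edge(0.55, n)
    print(n, round(mean, 2), round(std, 2))
```

The expected profit grows linearly with the number of trades while the spread only grows with its square root, so with a 55% hit rate the edge dominates the noise once there are enough trades.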
from sklearn.model_selection import RandomizedSearchCV
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.callbacks import EarlyStopping
batch_size = 5000
hidden_units = [(150, 150, 150), (100,100,100), (200,200,200)]
dropout_rates = [(0.25, 0.25, 0.25, 0.25), (0.3,0.3,0.3,0.3)]
epochs = 100
num_columns = X_train.shape[1]  # raw features plus the log_ columns, not just len(features)
num_labels = 5
#num_columns, num_labels, hidden_units, dropout_rates, label_smoothing, learning_rate
mlp_CV = KerasClassifier(build_fn=create_mlp, epochs=epochs, batch_size=batch_size, verbose=1)
param_distributions = {'hidden_units': hidden_units, 'learning_rate': [1e-3, 1e-4],
                       'label_smoothing': [1e-2, 1e-1], 'dropout_rates': dropout_rates,
                       'num_columns': [num_columns], 'num_labels': [num_labels]}
random_cv = RandomizedSearchCV(estimator=mlp_CV,
                               param_distributions=param_distributions, n_iter=5,
                               n_jobs=-1, cv=3, random_state=42)
# no validation data is passed to fit, so early stopping monitors the training loss
random_cv.fit(X_train, y_train, callbacks=[EarlyStopping(monitor='loss', patience=10)])#, epochs=200, batch_size=5000)
models = []
models.append(random_cv)
RandomizedSearchCV and GridSearchCV easily run out of memory.
So from trial and error, I've learned that with a learning rate of 1e-3 and a batch size of around 5000, the model starts to overfit quickly, at around epoch 10. However, the model wasn't able to learn much with fewer than 100 epochs. One solution is to add more layers and units, which is what I did; result 2 is the outcome of this manual hyperparameter tuning. Before, the model ran for around 200 epochs with the same learning rate and a batch size of 5000, giving me an accuracy of only 51%. After manual hyperparameter tuning (running a few different parameter combinations myself), I was able to increase accuracy by about 3.5 percentage points!
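Before launching a search like this, it can help to preview how much work it will be. The sketch below uses scikit-learn's `ParameterSampler` to list the candidates a random search with `n_iter=5` would draw from the same lists, without training anything. The dictionary repeats the values from the cell above; the bookkeeping around it is my own.

```python
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'hidden_units': [(150, 150, 150), (100, 100, 100), (200, 200, 200)],
    'dropout_rates': [(0.25, 0.25, 0.25, 0.25), (0.3, 0.3, 0.3, 0.3)],
    'learning_rate': [1e-3, 1e-4],
    'label_smoothing': [1e-2, 1e-1],
}

# Draw 5 distinct combinations out of the 3 * 2 * 2 * 2 = 24 possible ones.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for params in candidates:
    print(params)

cv_folds = 3
n_fits = len(candidates) * cv_folds
print("model fits before the final refit:", n_fits)
```

Fifteen full training runs on this dataset, several of them in parallel when `n_jobs=-1`, goes a long way toward explaining the memory problems.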
For my final review and conclusion, check out my blog post
Other things to try/explore:
- Weighted training. We know that sometimes we will encounter 'monster' deals. It is crucial for the Kaggle competition to get these right, since they will probably outweigh most other trades. So we could build a model that focuses more on these heavy trades (rows with a high weight × resp).
- Split the data and train multiple models. The idea is that we could split the data in two by feature_0 and train one model optimized for the rows where feature_0 is 1 and another for the rows where it is -1.
- Make many more features and explore more of the data (requires time and big-data machines).
- One interesting thing I learned is that, apparently, in finance it is sometimes good to heavily overfit the model; it has something to do with volatility. I've experimented with this, and indeed my utility score for the competition went really high when the model was heavily overfitted with more than 200 epochs.
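As a starting point for the weighted-training idea, here is one possible way to turn weight and resp into per-row sample weights. The formula (weight times the absolute value of resp, rescaled to mean 1) and the helper name `trade_sample_weights` are assumptions of mine; I have not tested this in the competition.

```python
import numpy as np

def trade_sample_weights(weight, resp):
    """Per-row weights proportional to weight * |resp|, rescaled to mean 1."""
    raw = np.asarray(weight, dtype=float) * np.abs(np.asarray(resp, dtype=float))
    mean = raw.mean()
    if mean == 0:
        # Nothing to emphasize: fall back to uniform weights.
        return np.ones_like(raw)
    return raw / mean

# Invented example: the third trade is the 'monster' deal.
sw = trade_sample_weights([1, 1, 10], [0.01, -0.01, 0.02])
print(sw)
```

Both Keras' `Model.fit` and scikit-learn's `RandomForestClassifier.fit` accept a `sample_weight` argument, so an array like this could be passed straight into training. It has to be computed before the weight and resp columns are dropped from the dataframe.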
th = 0.5
f = np.median
models = models[-3:]
def feature_engineering(df):
    tag_0_group = [9, 10, 19, 20, 29, 30, 56, 73, 79, 85, 91, 97, 103, 109, 115, 122, 123]
    for col in tag_0_group:
        df['log_'+str(col)] = (df['feature_'+str(col)]-df['feature_'+str(col)].min()+1).transform(np.log)
    return df
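import janestreet
# env has to be created before the loop; the line below is commented out in the original
#env = janestreet.make_env()
for (test_df, pred_df) in tqdm(env.iter_test()):
    if test_df['weight'].item() > 0:
        # copy so that adding the log_ columns does not modify test_df
        x_tt = test_df.loc[:, features].copy()
        x_tt = feature_engineering(x_tt).values
        if np.isnan(x_tt[:, 1:].sum()):
            # f_mean is not defined in this notebook; it is assumed to hold per-column training means
            x_tt[:, 1:] = np.nan_to_num(x_tt[:, 1:]) + np.isnan(x_tt[:, 1:]) * f_mean
        # average the models, then reduce the five outputs to one score
        pred = np.mean([model(x_tt, training = False).numpy() for model in models],axis=0)
        pred = f(pred)
        pred_df.action = np.where(pred >= th, 1, 0).astype(int)
    else:
        # zero-weight trades do not affect the score, so skip them
        pred_df.action = 0
    env.predict(pred_df)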
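The decision rule inside the loop is easier to follow in isolation: average the five sigmoid outputs across models, take the median across the five outputs, and compare it with the threshold. Here is a small sketch with two fake "models" (plain functions returning fixed probabilities); `ensemble_action` and the fake models are made up for illustration.

```python
import numpy as np

def ensemble_action(models, x, th=0.5, reduce=np.median):
    """Average model outputs, reduce the outputs to one score, threshold it."""
    per_model = [np.asarray(m(x)) for m in models]  # each has shape (1, 5)
    avg = np.mean(per_model, axis=0)                # average across models
    score = reduce(avg)                             # median across the 5 outputs
    return int(score >= th)

# Stand-ins for trained networks: each returns five probabilities for one row.
fake_models = [
    lambda x: np.array([[0.6, 0.4, 0.7, 0.55, 0.3]]),
    lambda x: np.array([[0.5, 0.6, 0.5, 0.65, 0.5]]),
]
x_row = np.zeros((1, 4))  # dummy input, ignored by the fake models

print(ensemble_action(fake_models, x_row))          # averaged median is 0.55
print(ensemble_action(fake_models, x_row, th=0.6))
```

Using the median rather than the output for resp alone means the trade is taken only when most of the five horizons agree it is positive.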