Having fun with Scikit-Learn

Reading time ~12 minutes

Scikit-Learn is a great library to start machine learning with, because it combines a powerful API, solid documentation, and a large variety of methods with lots of different options and sensible defaults. For example, if we have a classification problem of predicting whether a sentence is about New-York, London or both, we can create a pipeline including tokenization with case folding and stop-word removal, bigram extraction, tf-idf weighting and support for multiple labels, train and apply it in merely 8 lines of code (well, excluding imports and input specification).

vec = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=15)
clf = OneVsRestClassifier(LogisticRegressionCV())
pipeline = make_pipeline(vec, clf)
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(zip(X_test, mlb.inverse_transform(predicted)))

Furthermore, we can easily add cross-validation and parameter tuning, etc., in just a few more lines. Check out this great introductory video series for all the cool features you can use right from the beginning. But lets go back a bit! Assume we have the following training data:

[("new york is a hell of a town", ["New York"]),
 ("new york was originally dutch", ["New York"]),
 ("the big apple is great", ["New York"]),
 ("new york is also called the big apple", ["New York"]),
 ("nyc is nice", ["New York"]),
 ("people abbreviate new york city as nyc", ["New York"]),
 ("the capital of great britain is london", ["London"]),
 ("london is in the uk", ["London"]),
 ("london is in england", ["London"]),
 ("london is in great britain", ["London"]),
 ("it rains a lot in london", ["London"]),
 ("london hosts the british museum", ["London"]),
 ("new york is great and so is london", ["London", "New York"]),
 ("i like london better than new york", ["London", "New York"])]

Our model can easily predict the following:

[("nice day in nyc", ["New York"]),
 ("welcome to london", ["London']),
 ("hello simon welcome to new york. enjoy it here and london", ["London", "New York"])]

But how? Well, we know that the model’s decision function computes a dot product between the input features and the trained weights and we can actually see the computed numbers:

[[ -7.96273351   1.04803743]
 [ 22.19686347  -1.39109585]
 [  5.48828931   1.28660432]]

So, a positive number indicates positive classification ([London, New-York]), and we can convert this to some sort of “probability” estimation:

print(np.vectorize(lambda x: 1 / (1 + exp(-x)))(pipeline.decision_function(X_test)))
[[  3.48078806e-04   7.40397853e-01]
 [  1.00000000e+00   1.99232868e-01]
 [  9.95882115e-01   7.83571880e-01]]

However, it would be nice if we could know exactly which words have contributed to the final score. And guess what, there is an awesome library to explain you this like are five years old:

import eli5
eli5.show_prediction(clf, X_test[0], vec=vec, target_names=mlb.classes_)

y=London (probability 0.996, score 5.488)top features

Weight Feature
+8.631 Highlighted in text (sum)

hello simon welcome to new york. enjoy it here and london too

y=New York (probability 0.784, score 1.287)top features

Weight Feature
+1.084 Highlighted in text (sum)

hello simon welcome to new york. enjoy it here and london too

Note that the numbers here are exactly what we have seen above, but in addition we get a visual explanation of which words have trigged positive and negative signals. Moreover, with a bit of trickery to work around this issue, we can also dump the features computed for our model as a neatly looking heat map:

setattr(clf.estimator, 'classes_', clf.classes_)
setattr(clf.estimator, 'coef_', clf.coef_)
setattr(clf.estimator, 'intercept_', clf.intercept_)
eli5.show_weights(clf.estimator, vec=vec, target_names=mlb.classes_)
y=London top features y=New York top features
Weight Feature
+25.340 london
+4.706 great
+2.079 lot london
+2.079 lot
+1.266 great britain
+1.266 britain
+0.896 museum
-1.682 new
-1.682 new york
-1.682 york
-2.723 nice
-3.143 <BIAS>
-3.875 apple
-3.875 big
-3.875 big apple
-4.219 nyc
Weight Feature
+1.154 york
+1.154 new york
+1.154 new
+0.661 nyc
+0.546 nice
+0.526 big apple
+0.526 big
+0.526 apple
+0.202 <BIAS>
+0.048 great
-0.500 lot
-0.500 lot london
-0.632 museum
-0.755 britain
-0.755 great britain
-1.594 london

So far, working with scikit-learn and eli5 is as fun as Transformers were when I was five! And working with a relatively large dataset is just as easy as the toy example shown here in. Man, this is great, but what about using it in a production system you say? Well, lets talk about that next time ;)

Software Engineering at Cxense

A year ago I wrote [a post](http://s-j.github.io/ten-things-i-wish-i-knew-before-starting-on-a-phd/) about the things I have learned duri...… Continue reading

A few thoughts about Spark

Published on November 21, 2016

HyperLogLog - StreamLib vs Java-HLL

Published on August 21, 2016