Scikit-Learn is a great library to start machine learning with, because it combines a powerful API, solid documentation, and a large variety of methods with lots of different options and sensible defaults. For example, for a classification problem of predicting whether a sentence is about New York, London, or both, we can create a pipeline that includes tokenization with case folding and stop-word removal, bigram extraction, tf-idf weighting, and support for multiple labels, and then train and apply it in merely 8 lines of code (well, excluding imports and input specification).
Furthermore, we can easily add cross-validation, parameter tuning, and so on in just a few more lines. Check out this great introductory video series for all the cool features you can use right from the beginning. But let's go back a bit! Assume we have the following training data:
[("new york is a hell of a town", ["New York"]), ("new york was originally dutch", ["New York"]), ("the big apple is great", ["New York"]), ("new york is also called the big apple", ["New York"]), ("nyc is nice", ["New York"]), ("people abbreviate new york city as nyc", ["New York"]), ("the capital of great britain is london", ["London"]), ("london is in the uk", ["London"]), ("london is in england", ["London"]), ("london is in great britain", ["London"]), ("it rains a lot in london", ["London"]), ("london hosts the british museum", ["London"]), ("new york is great and so is london", ["London", "New York"]), ("i like london better than new york", ["London", "New York"])]
Our model can easily predict the following:
[("nice day in nyc", ["New York"]), ("welcome to london", ["London']), ("hello simon welcome to new york. enjoy it here and london", ["London", "New York"])]
But how? Well, we know that the model's decision function computes a dot product between the input features and the trained weights, and we can actually look at the computed numbers:
[[ -7.96273351   1.04803743]
 [ 22.19686347  -1.39109585]
 [  5.48828931   1.28660432]]
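Assuming the setup from the earlier sketch, these raw scores come straight out of the pipeline's decision function (the exact values depend on the fitted model, so a re-run of the sketch will not reproduce them digit for digit):

```python
# one row per test sentence, one column per label (here: London, New York)
scores = classifier.decision_function(test_sentences)
print(scores)
```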
So, a positive number indicates a positive classification for the corresponding label (the columns are [London, New York]), and we can convert this into some sort of "probability" estimate:
[[ 3.48078806e-04   7.40397853e-01]
 [ 1.00000000e+00   1.99232868e-01]
 [ 9.95882115e-01   7.83571880e-01]]
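These values are simply the decision scores from above squashed through a logistic sigmoid, which is a quick way to get a probability-like number per label when the classifier itself offers no predict_proba. A sketch, reusing the `scores` array from the snippet above:

```python
import numpy as np

# squash each decision score into (0, 1); these are not calibrated probabilities,
# just a monotone mapping that keeps score 0 at 0.5
pseudo_probabilities = 1.0 / (1.0 + np.exp(-scores))
print(pseudo_probabilities)
```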
However, it would be nice if we knew exactly which words contributed to the final score. And guess what, there is an awesome library, eli5, that explains it to you like you're five years old:
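A hedged sketch of how eli5 could be asked to explain one of the predictions above, reusing the names from the earlier pipeline sketch (the exact arguments, and whether OneVsRestClassifier is handled out of the box, depend on your eli5 version):

```python
import eli5

# highlight which n-grams pushed the per-label scores up or down for one sentence;
# `classifier` and `mlb` are the (assumed) objects from the earlier sketch
eli5.show_prediction(
    classifier.named_steps["clf"],
    "hello simon welcome to new york. enjoy it here and london",
    vec=classifier.named_steps["tfidf"],
    target_names=list(mlb.classes_),
)
```

In a Jupyter notebook, `show_prediction` renders the explanation as HTML with the contributing words highlighted.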
Note that the numbers here are exactly what we saw above, but in addition we get a visual explanation of which words triggered positive and negative signals. Moreover, with a bit of trickery to work around this issue, we can also dump the features computed by our model as a neat-looking heat map:
So far, working with scikit-learn and eli5 is as much fun as Transformers were when I was five! And working with a relatively large dataset is just as easy as the toy example shown here. Man, this is great, but what about using it in a production system, you say? Well, let's talk about that next time ;)