Overview of scikit-learn   

Python has a mature, widely used machine learning toolkit called scikit-learn, or sklearn for short. sklearn is useful for most of the machine learning projects we do today, with one notable gap: it does not support, and does not plan to support, reinforcement learning.

So for the large majority of machine learning tasks, we can employ the open source, BSD-licensed sklearn library. Internally it builds on NumPy and SciPy, and it works well alongside pandas and Python’s other well-known data science toolkits (some of which were covered in the previous blog). The Python interpreter is not particularly fast; it is generally slower than a JavaScript engine such as the one behind Node.js. However, for machine learning, it is still the language of choice due to its excellent support for advanced mathematical functions, as well as the scikit-learn toolkit, the most complete one-stop shop for ML using Python. In this article, we only scratch the surface of sklearn, as it is a massive topic that cannot be treated thoroughly within the scope of one article.


Machine Learning Techniques

To get familiar with sklearn, we must first get to know some basics of machine learning theory. 

Though ML is a very complex topic that takes a great deal of study and experience, we will briefly touch upon some generic techniques here and show some sample source code just to get a feel for ML.

Usually, with machine learning, we must build a model from the sample data we have available. We estimate the model's parameters so that it fits that data; when the goal is to predict a numeric value, this fitting is what is referred to in mathematical terms as regression. Instead of spelling out rules for each and every aspect of the input data, we use mathematical derivations, approximations, and extrapolations to arrive at an understanding of the characteristics that go towards classifying or predicting something.
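To make the idea of regression concrete, here is a minimal sketch using scikit-learn's LinearRegression. The hours and scores values are made up for illustration (they follow score = 50 + 2 × hours exactly), so treat this as a toy under those assumptions rather than a realistic dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample data: hours studied versus exam score (made up for illustration).
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores = np.array([52.0, 54.0, 56.0, 58.0, 60.0])  # exactly 50 + 2 * hours

# Fitting estimates the slope and intercept of the line that best matches the data.
model = LinearRegression().fit(hours, scores)

# Extrapolate to an input the model has never seen.
predicted = model.predict(np.array([[6.0]]))[0]
print(model.coef_[0], model.intercept_, predicted)
```

The model never saw the value for 6 hours; it extrapolates from the line it fitted to the other points.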

For instance, let us say we wish to tell a cat from a dog in a picture, or make out whether a man or a woman is speaking on the phone. Here the input data varies: an image is a matrix of RGB or grayscale values, while audio is a waveform represented as a time series of energy levels at various frequencies.
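As a rough sketch of what such input looks like in Python, the snippet below builds a random stand-in for a small picture and a synthetic one-second tone with NumPy. The sizes (32x32 pixels, 8000 samples per second, a 440 Hz tone) are arbitrary assumptions chosen only to show the array shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 32x32 colour picture: height x width x 3 colour channels (R, G, B).
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Averaging the channels gives a rough grayscale version: a plain 32x32 matrix.
grayscale = image.mean(axis=2)

# One second of hypothetical audio sampled 8000 times per second: a 1-D time series.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)  # a pure 440 Hz tone

# Most scikit-learn estimators expect one flat row of numbers per sample.
image_features = grayscale.reshape(1, -1)
print(image.shape, grayscale.shape, audio.shape, image_features.shape)
```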

Using such input data, we have to teach the computer how to classify different input combinations, in our example, say male or female voice or cat or dog picture. We obviously cannot feed the computer with all dog and cat pictures. Instead, we train the computer software in such a way that it can identify specific telltale signs and approximate the rest.

We divide the machine learning techniques employed into three groups:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

As I mentioned above, scikit-learn can handle supervised and unsupervised learning, and it puts a plethora of sample datasets and mathematical primitives at our disposal.

Supervised learning trains the software on examples whose correct answers (labels) are already known, and uses the resulting feedback to assess how it fares; without that feedback, the software cannot tweak its internal parameters to identify and classify input data. To do this, we split our sample data into two sets:

  • Training data
  • Test data
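The split into training and test data can be sketched with scikit-learn's train_test_split helper. The 75/25 ratio and the random seed below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows as test data; the ratio and the seed are arbitrary choices.
# stratify=y keeps the mix of flower species similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```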

These are some common supervised learning algorithms used for classification and regression:

  • Linear regression
  • Logistic regression
  • Decision trees
  • Support-vector machine
  • Naive Bayes classifier
  • Random forest
  • k-nearest neighbors

For now, I am just listing these cryptic names, some of which I am sure you have already heard of, and I will try to explain them further in the next blog. Irrespective of which one we use, ultimately, we are using it to train our software. After training, we measure performance on the test data, which the model has not seen, to check how often it guesses correctly, and we go back and adjust the model if the results are poor. In other words, we already know the answer, and we teach the software to arrive at the correct answer.
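As a hedged sketch of that train-then-test cycle, the snippet below fits three of the algorithms listed above on the Iris data and scores each one on held-back test data. The parameter values (max_iter, n_neighbors, the seed) are assumptions picked for the example, and the exact scores will depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                 # learn from data with known answers
    scores[name] = clf.score(X_test, y_test)  # fraction of test rows guessed correctly
    print(f"{name}: {scores[name]:.2f}")
```

Every scikit-learn classifier exposes the same fit and score methods, which is what makes swapping algorithms this easy.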

Unsupervised learning is mainly an open system: it has no known answers to use as feedback and instead works by association or clustering. Without a feedback mechanism, we rely on certain empirical methods to find structure, and it is harder to tell whether the software has learned the right thing than it is with supervised learning. Unsupervised learning still relies on mathematics, such as distance measures between data points, but it needs no labeled feedback data. A real-life example is when you want to cluster and analyze groups of similar visitors to your website.
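The website visitor example can be sketched with scikit-learn's KMeans. The visitor data here is synthetic and the two columns (pages viewed, minutes on site) are hypothetical; a real site would have messier groups and the right number of clusters would not be known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical visitor data: [pages viewed per visit, minutes on site].
casual = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))
engaged = rng.normal(loc=[10.0, 15.0], scale=1.0, size=(50, 2))
visitors = np.vstack([casual, engaged])

# No labels are passed in: KMeans groups rows purely by how close they are to each other.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(visitors)
labels = kmeans.labels_
print(kmeans.cluster_centers_)
```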

Reinforcement learning uses a different approach: instead of learning from labeled examples, it uses a reward system to teach the software how it should behave. Reinforcement learning is not as common as supervised learning and is applied in scenarios where the software must learn by trial and error, guided mostly by heuristics and empirical rules of thumb. It is an adaptive approach suited to situations like playing games, avoiding obstacles, and self-driving cars. Often the issue is achieving a practical running time, for which we may need powerful GPUs and similar hardware. Moreover, this field is less mature, so, for the most part, we end up doing supervised learning, a topic we will go into in more detail in the future.
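Since scikit-learn does not cover this area, here is a tiny reward-driven sketch in plain NumPy instead. It uses an epsilon-greedy strategy on a made-up three-action problem; the reward values, the noise level, and the 10% exploration rate are all assumptions for illustration, and real reinforcement learning problems are far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: three actions with hidden average rewards.
true_rewards = np.array([0.2, 0.5, 0.8])

estimates = np.zeros(3)   # the agent's current guess of each action's value
counts = np.zeros(3)      # how often each action has been tried
epsilon = 0.1             # fraction of steps spent exploring at random

for step in range(2000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))        # explore: try a random action
    else:
        action = int(np.argmax(estimates))   # exploit: use the best guess so far
    reward = rng.normal(true_rewards[action], 0.1)  # noisy reward from the environment
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = int(np.argmax(estimates))
print(best_action, estimates.round(2))
```

Nobody tells the agent which action is correct; it only sees rewards and gradually favors the action that pays best.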

Prefer freedom or prefer supervision?

Supervised learning needs a lot of hand-holding: filtering, analysis of results, multiple models, corrections, dimensions, and so on, to do true learning. Unsupervised learning, by contrast, works without any labeled training data, albeit with little way to measure how effective its approach is at solving the problem at hand. One might think that unsupervised learning is the more “self-guided” and smart way. However, unsupervised learning is like a new piece of furniture: without feedback to correct it, it does not readily get better with age. It is only as effective as it is today. In that respect, unsupervised learning falls short.

For AI and machine learning, we want to “teach” the computer how to think and act like humans. That’s why we mostly care about supervised learning and making machines better and better. In the supervised learning world, things are a lot more dynamic and pliable. We teach the computer how to think, work, and learn (like babies or animals learn), and that is a significant bonus because using a plethora of data and math formulas, we could someday attain an AI world that is somewhat close to the biotic world.

In general, the approach taken in supervised learning involves filtering the input, reducing noise, rescaling values so that outsized numbers do not lead us down the garden path (normalization), and so on. We can also always add more testing data and vary the test data samples to tweak our code. Our code can also be adapted to use any of the popular techniques, and over time, by sheer hard work and trying out various approaches, we can hopefully succeed. Supervised learning gets better with age. Of course, this depends on how generic and appropriate our model, strategies in code, and other processes are.

Unsupervised learning relies only on certain assumptions, and when those are wrong, it falls by the wayside. By employing supervised learning, we protect ourselves from that. But we may not even have that choice if there is no data on known outcomes to train and test with.
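Normalization in particular is easy to try out. The sketch below chains scikit-learn's StandardScaler with a support vector classifier using make_pipeline; the choice of SVC and the split parameters are assumptions made for the example, and other scalers or models would slot in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler learns each column's mean and spread from the training data only,
# then the same scaling is reused on the test data.
pipeline = make_pipeline(StandardScaler(), SVC())
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)

# After scaling, every training column is centred on zero.
scaled_train = pipeline.named_steps["standardscaler"].transform(X_train)
print(accuracy, scaled_train.mean(axis=0).round(6))
```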

Python source code samples and simple scripts

Let’s look at this simple Python script employing the scikit-learn toolkit to get a basic feel for what it takes to write some machine learning code using Python. Don’t worry about understanding everything in the code below; instead, treat it as a short piece of code you can try out. Machine learning, in general, is an art, since we have to keep trying various approximations and figure out which approach works best for the problem at hand. The examples presented here can only get you started on a long journey. To become an ML expert, you must get lots of experience applying the mathematical principles of ML in real-life scenarios.

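from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

# Load the Iris measurements (X) and the species labels (y).
X, y = load_iris(return_X_y=True)
print(X.shape)  # original shape: 150 rows, 4 columns

# Fit a linear support vector classifier; the L1 penalty pushes some weights to zero.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)

# Keep only the columns the fitted model still gives weight to.
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)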

A simple explanation of the above code sample is that we load the freely available Iris dataset and use it to fit a model, or classifier. This is the Iris (flower species) data set with petal lengths and other plant characteristics, not to be confused with Query.AI IRIS, although the name similarity is what made me choose this! SVC stands for support vector classifier, from the support vector machine family of models often used in supervised learning. Fitting LinearSVC with an L1 penalty pushes the weights of less useful columns to zero, and SelectFromModel then transforms the data by keeping only the columns whose weights remain. Usually, we fit a model first and then transform the data with it. In this way we reduce the dimensions, shrinking a big matrix of values to focus on only the specific columns that matter for the prediction.

If you run the above script, its final line of output is (150, 3), the shape of the transformed data array (not a property of the flower petals). This gives a quick idea of how many instances (rows, 150 here) and how many attributes (columns, 3 here) the data contains after the model transformation, down from the 4 attributes in the original Iris data.
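If you are curious which attributes survive, a small sketch like the following can list them by name. It repeats the same model settings as the script above and uses the selector's get_support() method; which columns are kept may depend on your scikit-learn version, so the output is not shown here.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
selector = SelectFromModel(lsvc, prefit=True)

# get_support() returns one True/False flag per original column.
kept = selector.get_support()
kept_names = [name for name, keep in zip(iris.feature_names, kept) if keep]
print(kept_names)
```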

When we are dealing with time series, our main concern is applying the right function to the values so that they separate and classify correctly. Simple examples of time series are an audio file, climate measurements over time, or a time-lapse from your home security camera. Even a football match is one.

pandas is the Python toolkit commonly used for time series manipulation, as it lets us organize values into data frames so that we can apply a rolling-window mean or standard deviation to them. Such ideas are very commonly used in machine learning, as long as the program does not take too long to compute.

There are plenty of code samples for sklearn, but to follow them, you need a background in the various components and steps we use in machine learning. Covering those would make this blog too long, but we will continue to cover them in later blogs.
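A rolling window is quick to try with pandas. The daily temperature readings below are made up for illustration, and the three-day window is an arbitrary choice.

```python
import pandas as pd

# Hypothetical daily temperature readings (values made up for illustration).
index = pd.date_range("2024-01-01", periods=8, freq="D")
temps = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0], index=index)

# Each output value summarizes the current reading and the two before it.
rolling_mean = temps.rolling(window=3).mean()
rolling_std = temps.rolling(window=3).std()

frame = pd.DataFrame({"temp": temps, "mean_3d": rolling_mean, "std_3d": rolling_std})
print(frame)
```

The first two rows of the rolling columns are empty (NaN) because a full three-day window is not yet available.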

Just having the mathematics is not enough; we must also be able to get answers in a practical amount of time. In high-frequency trading bots or the cryptocurrency market, the next price must be computed within tiny fractions of a second. Since we are talking about matrix manipulation and a lot of training and test samples, we require a great deal of data and computational power, which is made possible by today’s cloud computing and the growth in storage at reduced prices.

But most applications are not so demanding, and we can get by with a few minutes or even hours as long as we get the right answers. And with the growth of reinforcement learning, the need for large datasets may change, since reinforcement learning learns from rewards gathered by trial and error rather than from a prepared set of labeled training data.

Summary

We scratched the surface of different machine learning techniques and saw some supervised learning Python code too. In the next blog, we will expand on the different supervised learning algorithms listed above. While we do that, we will continue using Python’s scikit-learn as our library of choice to try things out.

Part 3: Working Under Supervision is coming soon.