Machine Learning Models and When to Use Them

With the recent explosion of artificial intelligence into popular culture, machine learning has become central to discourse about technology. As more and more people begin to feel the impact of AI, understanding the basics of machine learning (ML) is only going to become more important.
In this article, I will be going through the basics of machine learning, as well as outlining some of the most popular models used for analysis and predictions.
While it may be tempting to think of machine learning and artificial intelligence as the same thing, they are actually different. Machine learning is considered a subset of the artificial intelligence field. It tends to be less user-facing, and is used more for predicting an outcome from data. For example, you could train a machine learning model on a dataset of salaries and a variety of employee attributes, such as gender, age, years of experience, and more, in order to predict a salary. On the other hand, the AI tools most people interact with, such as chatbots, are built to handle open-ended questions, like “How do I write an effective article?” (😃). Overall, machine learning and AI are closely related, but not the same. However, learning one can help you with the other, which is why many people in the space tend to call themselves “AI/ML” researchers.
Now that we have cleared up the difference between AI and ML, we can dive into the different types of machine learning.
Supervised Learning
Supervised learning involves the use of labeled data, meaning that every example pairs a set of inputs with the known correct output. This makes it easier for the model to learn, because it can compare its predictions against the right answers during training. The name refers to those labels, which "supervise" the model, although the engineer is also heavily involved in preparing the data and checking the results.
The process for a supervised learning model tends to be as follows:
- Split your data into training and testing sets, in whatever proportion suits your dataset
- Choose a model (we will get into the types later) and the variables on which to train it
- Check the model's performance on the testing set using an error or accuracy metric
- Use it to predict results that haven’t happened yet!
Here’s an example, written in R using the tidymodels packages:
library(tidymodels) # Provides initial_split(), linear_reg(), and yardstick; assumes the package is installed
set.seed(73123) # Any fixed seed makes the split reproducible; I just put the date
split <- initial_split(your_dataset, prop = 0.75) # Uses 75% of the data to train your model, and the rest to test
training <- training(split)
testing <- testing(split)
lm_fit <- linear_reg() |> # Linear regression is the model used here; more on regression below
  set_engine("lm") |>
  set_mode("regression") |>
  fit(variable_you_want_to_predict ~ correlated_variable1 + correlated_variable2, data = training) # Add further predictors with +
lm_fit$fit # Prints the fitted coefficients
summary(lm_fit$fit) # Standard errors, p-values, and R-squared for the underlying lm model
results <- testing
results$lm_pred <- predict(lm_fit, testing)$.pred # Predictions for the held-out rows
yardstick::mae(results, variable_you_want_to_predict, lm_pred) # Mean absolute error on the test set
yardstick::rmse(results, variable_you_want_to_predict, lm_pred) # Root mean squared error on the test set
As we can see, the programmer drives the entire process: creating the training and testing split, supplying labeled data, and checking the predictions against known answers. Supervised learning tends to work best on clean, consistently labeled datasets, and a model will usually improve as it is given more representative data to train on.
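The example above uses placeholder names, so it will not run until you substitute your own dataset. As a rough sketch of the same steps that runs as-is, here is a version using only base R and the built-in mtcars dataset. The choice of mtcars, of mpg as the target, and of wt and hp as predictors are my own assumptions for illustration, and base R's lm() is swapped in for the tidymodels wrapper.

```r
# Runnable sketch of the supervised workflow using base R only.
# Assumption: mtcars (built into R) stands in for your own labeled dataset.
set.seed(73123)

# 75/25 train/test split by row index
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))
training <- mtcars[train_idx, ]
testing  <- mtcars[-train_idx, ]

# Fit a linear regression predicting fuel economy from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = training)

# Predict on the held-out rows and compute the two error metrics by hand
pred <- predict(fit, newdata = testing)
mae  <- mean(abs(testing$mpg - pred))       # mean absolute error
rmse <- sqrt(mean((testing$mpg - pred)^2))  # root mean squared error

print(c(MAE = mae, RMSE = rmse))
```

Both metrics are in the same units as the target (miles per gallon here). RMSE penalizes large misses more heavily than MAE, so a big gap between the two suggests a few predictions are far off.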
There are two main ways in which supervised learning can be used:
- Regression: This is the type of model shown above. Regression is the use of machine learning to predict a numeric value. It can be used to predict things like quarterly revenue for a company.
- Classification: Classification assigns each data point to one of a set of known categories. You can think of it as sorting apples from oranges. A common use is separating spam from normal mail in your email inbox.
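To make the classification case concrete, here is a minimal sketch using logistic regression from base R. The dataset (the built-in iris data, cut down to two species), the choice of predictors, and the 0.5 cutoff are all assumptions I picked for illustration, not the only way to set this up.

```r
# Sketch of binary classification with logistic regression in base R.
# Assumption: two iris species stand in for any two-category labeled dataset.
set.seed(73123)

# Keep two species so the problem is binary; droplevels() removes the unused level
two_species <- droplevels(subset(iris, Species != "setosa"))

# 75/25 train/test split by row index
n <- nrow(two_species)
train_idx <- sample(n, size = floor(0.75 * n))
training <- two_species[train_idx, ]
testing  <- two_species[-train_idx, ]

# glm() with family = binomial models the probability of the second
# factor level, which is "virginica" here
clf <- glm(Species ~ Petal.Length + Petal.Width,
           data = training, family = binomial)

# Turn predicted probabilities into class labels with a 0.5 cutoff
prob_virginica <- predict(clf, newdata = testing, type = "response")
pred_class <- ifelse(prob_virginica > 0.5, "virginica", "versicolor")

# Accuracy: the share of test rows classified correctly
accuracy <- mean(pred_class == testing$Species)
print(table(predicted = pred_class, actual = testing$Species))
print(accuracy)
```

Notice that the workflow is the same as for regression. Only the model and the metric change: the output is a category rather than a number, so we measure accuracy instead of MAE or RMSE.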
Unsupervised Learning
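Unsupervised learning involves finding structure in unlabeled datasets, with clustering being the best-known example. The main difference here is that the data fed into the model is not labeled, and thus requires far less human involvement up front. However, the tradeoff is that there are no known answers to check the results against, so these models tend to be harder to evaluate and less accurate.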
Unsupervised learning tends to be used in different ways than supervised learning, such as:
- Clustering: Grouping unlabeled data points based on their similarities
- Association: Finding associations and correlations in the data
- Dimensionality Reduction: Reducing the number of features in a dataset to a manageable amount (used when a dataset has a very large number of variables)
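Two of these can be sketched in a few lines of base R. The example below is only an illustration under my own assumptions: it uses the built-in iris dataset with the species column left out, k-means with three clusters for the clustering step, and principal component analysis (PCA) for dimensionality reduction. Association is left out because it typically needs an additional package.

```r
# Sketch of clustering and dimensionality reduction in base R.
# Assumption: the four numeric iris columns stand in for an unlabeled dataset.
set.seed(73123)

# Leave out the Species column so the algorithms never see the labels,
# and standardize so no single feature dominates because of its scale
features <- scale(iris[, 1:4])

# Clustering: ask k-means for 3 groups. The number of clusters is a
# choice we make up front, not something the algorithm learns.
km <- kmeans(features, centers = 3, nstart = 25)

# Only for illustration: compare the clusters with the labels we hid
print(table(cluster = km$cluster, species = iris$Species))

# Dimensionality reduction: PCA re-expresses the 4 features as
# uncorrelated components ordered by how much variance they capture
pca <- prcomp(features)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# Keep only the first two components as a smaller 2-column dataset
reduced <- pca$x[, 1:2]
```

The cluster numbers themselves are arbitrary, so cluster 1 does not "mean" any particular species. In a real unsupervised problem there would be no labels to compare against at all, which is exactly why these results are harder to validate.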
In short, you should use supervised learning when you have labeled data, where each example comes with a known outcome, and want to predict that outcome for new data. You should use unsupervised learning when you have large, unlabeled data that you want to group based on similarities or find associations in.
Overall, the field of machine learning is huge and still growing, despite being relatively new in the world of computing. It is essential that anyone in the AI/ML field keep learning, as both AI and ML are growing and developing every day.
I hope this primer has been helpful to anyone looking to dive into machine learning. My name is Arnav Mahendra, and I am a high schooler interested in all things data science and finance. You can keep up with me here, at my LinkedIn, or my website.