This tutorial uses the Lightwood integration that requires the mindsdb/mindsdb:lightwood
Docker image. Learn more here .
Introduction
In this tutorial, we’ll create and train a machine learning model, or as we call
it, an AI Table
or a predictor
. By querying the model, we’ll predict the
probability of churn for new customers of a telecoms company.
Install MindsDB locally via Docker or Docker Desktop .
Let’s get started.
Data Setup
Connecting the Data
There are a couple of ways you can get the data to follow through with this
tutorial.
Connecting as a database Connecting as a file You can connect to a demo database that we’ve prepared for you. It contains the data used throughout this tutorial (the example_db.demo_data.customer_churn
table).
CREATE DATABASE example_db
WITH ENGINE = "postgres" ,
PARAMETERS = {
"user" : "demo_user" ,
"password" : "demo_password" ,
"host" : "samples.mindsdb.com" ,
"port" : "5432" ,
"database" : "demo"
};
Now you can run queries directly on the demo database. Let’s preview the data that we’ll use to train our predictor.
SELECT *
FROM example_db . demo_data .customer_churn
LIMIT 10 ;
Pay Attention to the Queries
From now on, we’ll use the
files.churn
file as a table. Make sure you replace it with
example_db.demo_data.customer_churn
if you connect the data as a database.
Understanding the Data
We use the customer churn dataset, where each row is one customer, to predict
whether the customer is going to stop using the company products.
Below is the sample data stored in the files.churn
table.
+ ----------+------+-------------+-------+----------+------+------------+----------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------------+----------------+-------------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen| Partner |Dependents|tenure|PhoneService|MultipleLines |InternetService|OnlineSecurity|OnlineBackup|DeviceProtection|TechSupport|StreamingTV|StreamingMovies| Contract |PaperlessBilling|PaymentMethod |MonthlyCharges|TotalCharges|Churn|
+ ----------+------+-------------+-------+----------+------+------------+----------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------------+----------------+-------------------------+--------------+------------+-----+
| 7590 - VHVEG|Female| 0 |Yes | No | 1 | No | No phone service |DSL | No |Yes | No | No | No | No | Month - to - month |Yes |Electronic check | 29 . 85 | 29 . 85 | No |
| 5575 - GNVDE|Male | 0 | No | No | 34 |Yes | No |DSL |Yes | No |Yes | No | No | No |One year | No |Mailed check | 56 . 95 | 1889 . 5 | No |
| 3668 - QPYBK|Male | 0 | No | No | 2 |Yes | No |DSL |Yes |Yes | No | No | No | No | Month - to - month |Yes |Mailed check | 53 . 85 | 108 . 15 |Yes |
| 7795 - CFOCW|Male | 0 | No | No | 45 | No | No phone service |DSL |Yes | No |Yes |Yes | No | No |One year | No |Bank transfer ( automatic )| 42 . 3 | 1840 . 75 | No |
| 9237 - HQITU|Female| 0 | No | No | 2 |Yes | No |Fiber optic | No | No | No | No | No | No | Month - to - month |Yes |Electronic check | 70 . 7 | 151 . 65 |Yes |
+ ----------+------+-------------+-------+----------+------+------------+----------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------------+----------------+-------------------------+--------------+------------+-----+
Where:
Column Description Data Type Usage CustomerId
The identification number of a customer. character varying
Feature Gender
The gender of a customer. character varying
Feature SeniorCitizen
It indicates whether the customer is a senior citizen (1
) or not (0
). integer
Feature Partner
It indicates whether the customer has a partner (Yes
) or not (No
). character varying
Feature Dependents
It indicates whether the customer has dependents (Yes
) or not (No
). character varying
Feature Tenure
Number of months the customer has been staying with the company. integer
Feature PhoneService
It indicates whether the customer has a phone service (Yes
) or not (No
). character varying
Feature MultipleLines
It indicates whether the customer has multiple lines (Yes
) or not (No
, No phone service
). character varying
Feature InternetService
Customer’s internet service provider (DSL
, Fiber optic
, No
). character varying
Feature OnlineSecurity
It indicates whether the customer has online security (Yes
) or not (No
, No internet service
). character varying
Feature OnlineBackup
It indicates whether the customer has online backup (Yes
) or not (No
, No internet service
). character varying
Feature DeviceProtection
It indicates whether the customer has device protection (Yes
) or not (No
, No internet service
). character varying
Feature TechSupport
It indicates whether the customer has tech support (Yes
) or not (No
, No internet service
). character varying
Feature StreamingTv
It indicates whether the customer has streaming TV (Yes
) or not (No
, No internet service
). character varying
Feature StreamingMovies
It indicates whether the customer has streaming movies (Yes
) or not (No
, No internet service
). character varying
Feature Contract
The contract term of the customer (Month-to-month
, One year
, Two year
). character varying
Feature PaperlessBilling
It indicates whether the customer has paperless billing (Yes
) or not (No
). character varying
Feature PaymentMethod
Customer’s payment method (Electronic check
, Mailed check
, Bank transfer (automatic)
, Credit card (automatic)
). character varying
Feature MonthlyCharges
The monthly charge amount. money
Feature TotalCharges
The total amount charged to the customer. money
Feature Churn
It indicates whether the customer churned (Yes
) or not (No
). character varying
Label
Labels and Features
A label is a column whose values will be predicted (the y variable in simple
linear regression).
A feature is a column used to train the model (the x variable in simple
linear regression).
Training a Predictor
Let’s create and train the machine learning model. For that, we use the
CREATE MODEL
statement and specify the
input columns used to train FROM
(features) and what we want to
PREDICT
(labels).
CREATE MODEL mindsdb . customer_churn_predictor
FROM files
( SELECT * FROM churn)
PREDICT Churn;
We use all of the columns as features, except for the Churn
column, whose
values will be predicted.
Status of a Predictor
A predictor may take a couple of minutes for the training to complete. You can
monitor the status of the predictor by using this SQL command:
DESCRIBE customer_churn_predictor;
If we run it right after creating a predictor, we get this output:
+ ------------+
| status |
+ ------------+
| generating |
+ ------------+
A bit later, this is the output:
+ ----------+
| status |
+ ----------+
| training |
+ ----------+
And at last, this should be the output:
+ ----------+
| status |
+ ----------+
| complete |
+ ----------+
Now, if the status of our predictor says complete
, we can start making
predictions!
Making Predictions
Making a Single Prediction
You can make predictions by querying the predictor as if it were a table. The
SELECT
statement lets you make predictions for the label
based on the chosen features.
SELECT Churn, Churn_confidence, Churn_explain
FROM mindsdb . customer_churn_predictor
WHERE SeniorCitizen = 0
AND Partner = 'Yes'
AND Dependents = 'No'
AND tenure = 1
AND PhoneService = 'No'
AND MultipleLines = 'No phone service'
AND InternetService = 'DSL' ;
On execution, we get:
+ -------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Churn | Churn_confidence | Churn_explain |
+ -------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Yes | 0 . 7752808988764045 | { "predicted_value" : "Yes" , "confidence" : 0 . 7752808988764045 , "anomaly" : null , "truth" : null , "probability_class_No" : 0 . 4756 , "probability_class_Yes" : 0 . 5244 } |
+ -------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
To get more accurate predictions, we should provide as much data as possible in
the WHERE
clause. Let’s run another query.
SELECT Churn, Churn_confidence, Churn_explain
FROM mindsdb . customer_churn_predictor
WHERE SeniorCitizen = 0
AND Partner = 'Yes'
AND Dependents = 'No'
AND tenure = 1
AND PhoneService = 'No'
AND MultipleLines = 'No phone service'
AND InternetService = 'DSL'
AND Contract = 'Month-to-month'
AND MonthlyCharges = 29 . 85
AND TotalCharges = 29 . 85
AND OnlineBackup = 'Yes'
AND OnlineSecurity = 'No'
AND DeviceProtection = 'No'
AND TechSupport = 'No'
AND StreamingTV = 'No'
AND StreamingMovies = 'No'
AND PaperlessBilling = 'Yes'
AND PaymentMethod = 'Electronic check' ;
On execution, we get:
+ -------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Churn | Churn_confidence | Churn_explain |
+ -------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Yes | 0 . 8202247191011236 | { "predicted_value" : "Yes" , "confidence" : 0 . 8202247191011236 , "anomaly" : null , "truth" : null , "probability_class_No" : 0 . 4098 , "probability_class_Yes" : 0 . 5902 } |
+ -------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
MindsDB predicted the probability of this customer churning with confidence of
around 82%. The previous query predicted it with confidence of around 79%. So
providing more data improved the confidence level of predictions.
Making Batch Predictions
Also, you can make bulk predictions by joining a data table with your predictor
using JOIN
.
SELECT t . customerID , t . Contract , t . MonthlyCharges , m . Churn
FROM files . churn AS t
JOIN mindsdb . customer_churn_predictor AS m
LIMIT 100 ;
On execution, we get:
+ ----------------+-------------------+------------------+---------+
| customerID | Contract | MonthlyCharges | Churn |
+ ----------------+-------------------+------------------+---------+
| 7590 - VHVEG | Month - to - month | 29 . 85 | Yes |
| 5575 - GNVDE | One year | 56 . 95 | No |
| 3668 - QPYBK | Month - to - month | 53 . 85 | Yes |
| 7795 - CFOCW | One year | 42 . 3 | No |
| 9237 - HQITU | Month - to - month | 70 . 7 | Yes |
+ ----------------+-------------------+------------------+---------+
What’s Next?
Have fun while trying it out yourself!
If this tutorial was helpful, please give us a GitHub star
here .