Calorie prediction using DecisionTreeRegressor Model

Scenario
  You are running a company that needs to predict the calories athletes need, based on the duration and intensity (pulse, maxpulse) of their workouts.
  You need to develop an ML model that is trained on historical (Duration, Pulse, Maxpulse, Calories) records and predicts the calories burned during an athlete's workout.

Inputs
  You are provided with a data.csv file that contains historical data from athletes' workout sessions.

# data.csv
Duration,Pulse,Maxpulse,Calories
    60,  110,   130,    409.1
    60,  117,   145,    479.0
    60,  103,   135,    340.0
    45,  109,   175,    282.4
            
Requirement
  You need to write a machine learning model that takes (Duration, Pulse, Maxpulse) as input and predicts the calories burned by an athlete.
  This way athletes can plan an appropriate calorie intake before or during their workouts.

Model doing prediction

Code Description

# data.csv
Duration,Pulse,Maxpulse,Calories
   60,    110,   130,   409.1
   60,    117,   145,   479.0
   60,    103,   135,   340.0
   45,    109,   175,   282.4
   45,    117,   148,   406.0
   60,    102,   127,   300.5
   60,    110,      ,   374.0
   45,    104,   134,
                

import pandas as pd                                     # 1
from sklearn.tree import DecisionTreeRegressor          # 2
from sklearn.metrics import mean_absolute_error         # 2

# This function reads the raw data and cleans it
def clean_the_data(file):
	df = pd.read_csv(file)                              # 3
	print("Raw data\n", df.to_string())
	clean_df = df.dropna(axis=0)                        # 4 Clean the data: drop rows with missing values
	print("Cleaned data:\n", clean_df.to_string())
	return clean_df

# Split the input data into features and target
# From the features ['Duration','Pulse','Maxpulse'], Calories will be predicted
def split_data(data):
	X_df_training_data = data[['Duration','Pulse','Maxpulse']]              # 5
	print("Traning Data(Removed Calories):\n", X_df_training_data.to_string())
	y_pred_calories_trainingdata = data['Calories']                         # 6
	return X_df_training_data, y_pred_calories_trainingdata
	
def get_model(X_train, y_prediction):
	model = DecisionTreeRegressor(random_state=1)                           # 7
	model.fit(X_train, y_prediction)                                        # 8
	return model

def model_predict(model, X):
	return model.predict(X)                                                 # 9

# prediction_from_data1: Prediction from Training data
# prediction_from_data2: Prediction from Real world data
def get_mean_absolute_error(prediction_from_data1, prediction_from_data2):
	return mean_absolute_error(prediction_from_data1, prediction_from_data2)
	
print("---------- Traning Begin ----------")
clean_df = clean_the_data("data.csv")
X_df_training_data, y_pred_calories_trainingdata = split_data(clean_df)
model = get_model(X_df_training_data, y_pred_calories_trainingdata)
calories_prediction_from_traningData = model_predict(model, X_df_training_data) # 10
print("Model Predicted: Calories:\n", calories_prediction_from_traningData)
print("---------- Traning End ----------")

print("\n---------- Real World Start ----------")
real_data = pd.DataFrame({
    'Duration':[55, 45, 60, 80, 55, 45],
    'Pulse':[120, 125, 117, 150, 120, 125],
    'Maxpulse':[150, 165, 145, 200, 150, 165]
})

print("Real World Data:\n", real_data.to_string())
calories_prediction_from_realWorldData = model_predict(model, real_data)        # 11
print("Model Predicted: Calories:\n", calories_prediction_from_realWorldData)
print("\n---------- Real World End ----------") 

print("\nMean Absolute Error:",                                                 # 12
    get_mean_absolute_error(calories_prediction_from_traningData, 
                            calories_prediction_from_realWorldData))

Output:
---------- Traning Begin ----------
Raw data
    Duration  Pulse  Maxpulse  Calories
0        60    110     130.0     409.1
1        60    117     145.0     479.0
2        60    103     135.0     340.0
3        45    109     175.0     282.4
4        45    117     148.0     406.0
5        60    102     127.0     300.5
6        60    110       NaN     374.0
7        45    104     134.0       NaN
Cleaned data:
    Duration  Pulse  Maxpulse  Calories
0        60    110     130.0     409.1
1        60    117     145.0     479.0
2        60    103     135.0     340.0
3        45    109     175.0     282.4
4        45    117     148.0     406.0
5        60    102     127.0     300.5
Traning Data(Removed Calories):
    Duration  Pulse  Maxpulse
0        60    110     130.0
1        60    117     145.0
2        60    103     135.0
3        45    109     175.0
4        45    117     148.0
5        60    102     127.0
Model Predicted: Calories:
 [409.1 479.  340.  282.4 406.  300.5]
---------- Traning End ----------

---------- Real World Start ----------
Real World Data:
    Duration  Pulse  Maxpulse
0        55    120       150
1        45    125       165
2        60    117       145
3        80    150       200
4        55    120       150
5        45    125       165
Model Predicted: Calories:
 [479. 406. 479. 479. 479. 406.]

---------- Real World End ----------

Mean Absolute Error: 109.5
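
The 109.5 figure can be reproduced directly from the two prediction arrays printed above. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not part of the script. It shows that the number is the average gap between the predictions on the training rows and the predictions on the real world rows, not an error measured against known calorie values.

```python
import numpy as np

# Prediction arrays copied from the printed output above.
pred_train = np.array([409.1, 479.0, 340.0, 282.4, 406.0, 300.5])
pred_real = np.array([479.0, 406.0, 479.0, 479.0, 479.0, 406.0])

# Mean of the element-wise absolute differences between the two arrays.
gap = np.mean(np.abs(pred_train - pred_real))
print(round(gap, 1))  # 109.5
```

Because the two arrays come from different workouts, this value says little about how accurate the model is.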
                
1. import pandas: we convert the data from CSV to a DataFrame using pandas.
2. Import DecisionTreeRegressor from scikit-learn, and import mean_absolute_error for calculating the MAE later in the code.
3. read_csv() (From Pandas): Read the CSV data into a DataFrame

4. dropna() (From Pandas): Clean the data by removing rows that contain missing values

We only have one set of data, from which we separate the features (inputs) and the target (the value to predict)
5. Features (training inputs): 'Duration','Pulse','Maxpulse'
6. Target (value to predict): 'Calories'

7. Initialize the DecisionTreeRegressor model
8. fit() (From scikit-learn): Feed the features and the target to the model. This is supervised learning
9. predict() (From scikit-learn): Returns the model's output for the given inputs

10. Predict the calories from the training data itself. The model returns exactly the calories it was fed, because an unconstrained decision tree memorizes the training rows. This is overfitting
11. The model is now ready: feed it real world data, and predict() returns the calories for each row

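12. Compute the mean absolute error. Note that the code compares the predictions on the training data with the predictions on the real world data; a true validation compares predictions with known actual calories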
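
A more conventional way to validate is to hold back some rows with known calories and measure the error on them. The sketch below is one possible approach, not the script's method: it assumes scikit-learn's train_test_split, inlines the six complete rows from data.csv, and holds back two rows (an arbitrary choice for illustration). With so few rows the resulting number is only a demonstration of the procedure.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# The six complete rows from data.csv, inlined so the sketch is self-contained.
df = pd.DataFrame({
    'Duration': [60, 60, 60, 45, 45, 60],
    'Pulse': [110, 117, 103, 109, 117, 102],
    'Maxpulse': [130, 145, 135, 175, 148, 127],
    'Calories': [409.1, 479.0, 340.0, 282.4, 406.0, 300.5],
})
X = df[['Duration', 'Pulse', 'Maxpulse']]
y = df['Calories']

# Hold back two rows (assumed split size) that the model never sees in training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=2, random_state=1)

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# Compare predictions with known calories on both subsets.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
val_mae = mean_absolute_error(y_val, model.predict(X_val))
print("Training MAE:", train_mae)
print("Validation MAE:", val_mae)
```

The training MAE comes out as zero because the tree memorizes its training rows, while the validation MAE is measured on rows the model has not seen, which is what makes it a meaningful estimate.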

Model Validation (MAE)

Predictive accuracy: what is the quality of the predictions the model made? How close are the model's predictions to the actual results?
How to measure?
  Compare the model's predicted values with the actual (known) values. There will be a mix of good and bad predictions.
  Looking through a list of 10,000 predicted and actual values would be pointless, so we summarize them in a single metric.
Mean Absolute Error (MAE): the average of the absolute value of (actual - predicted)
Actual Calories   Predicted Calories
200               190
300               310
MAE = (abs(200-190) + abs(300-310))/2 = 10
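
The same arithmetic can be checked in code. This is a small sketch using the two rows above; the variable names are illustrative. It computes the MAE once by hand and once with scikit-learn's mean_absolute_error.

```python
from sklearn.metrics import mean_absolute_error

actual = [200, 300]
predicted = [190, 310]

# By hand: average of the absolute differences.
mae_by_hand = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# The same computation via scikit-learn.
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

Both values are 10. The absolute value matters: the signed differences here are +10 and -10, which would average to 0 and hide the error.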