A Hands-On Guide to Predictive Analytics with Python: Titanic Dataset Example
Learn the step-by-step process of building a predictive model with Python, using the Titanic dataset to explore data preprocessing, machine learning, and interpreting predictions effectively.
Predictive Analytics is at the forefront of data science, helping industries make data-driven decisions by forecasting outcomes based on historical data. This article provides a practical introduction to Predictive Analytics using Python and the Titanic dataset. You’ll learn how to preprocess data, build a machine learning model, and interpret predictions effectively.
Whether you're a beginner or looking to enhance your data science skills, this hands-on guide will help you understand the essentials of Predictive Analytics and metrics like precision, recall, and F1-score. By the end, you’ll be ready to explore predictive modeling in your own projects.
Applications of Predictive Analytics
Healthcare: Predicting disease outbreaks, patient readmissions, and treatment outcomes.
Finance: Fraud detection, credit scoring, and risk management.
Retail: Personalized marketing, demand forecasting, and inventory optimization.
Manufacturing: Predictive maintenance and supply chain optimization.
Education: Identifying at-risk students and optimizing learning paths.
Customer Service: Churn prediction and improving customer satisfaction.
Steps of the Predictive Analytics Process
Define the Objective:
Clearly define the problem to solve or the prediction to make.
Data Collection:
Gather historical data relevant to the objective.
Data Preprocessing:
Clean and preprocess data (handle missing values, outliers, normalization).
Feature Engineering:
Select or create features that best represent the data for prediction.
Model Selection:
Choose the most suitable predictive algorithm (e.g., Linear Regression, Random Forest).
Model Training:
Train the model on a subset of the data (training set).
Model Testing and Validation:
Evaluate the model's performance on unseen data (test set).
Deployment and Monitoring:
Deploy the model into production and monitor its performance.
Types of Predictive Algorithms
Regression Algorithms:
Predict continuous outcomes (e.g., Linear Regression, Lasso Regression).
Classification Algorithms:
Predict categorical outcomes (e.g., Logistic Regression, Random Forest, SVM).
Time Series Forecasting:
Predict future values based on temporal data (e.g., ARIMA, Prophet).
Ensemble Methods:
Combine multiple algorithms for better performance (e.g., Gradient Boosting, XGBoost).
Neural Networks:
Complex pattern recognition and prediction (e.g., Deep Learning, LSTMs).
Most predictive algorithms fall under the category of Supervised Learning because they require labeled data (known inputs and corresponding outputs) to train the model.
Titanic Dataset Overview
The Titanic disaster of 1912 is one of the most infamous maritime tragedies in history. But did you know this event has also inspired data scientists and machine learning enthusiasts to build predictive models? Using the Titanic dataset, we can predict whether a passenger might survive based on their characteristics like age, gender, and travel class.
In this article, we’ll walk through a beginner-friendly approach to building a predictive model for Titanic survival using Python. We’ll also explain how to give new inputs to the trained model and interpret the outputs.
Size of the Dataset:
The Titanic dataset contains 891 rows (passengers) and 12 columns (attributes about passengers).
Column Names and Explanations:
PassengerId: Unique ID for each passenger.
Survived: Survival status (1 = Survived, 0 = Did not survive) - Target variable.
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
Name: Name of the passenger.
Sex: Gender of the passenger (male/female).
Age: Age of the passenger.
SibSp: Number of siblings/spouses aboard.
Parch: Number of parents/children aboard.
Ticket: Ticket number.
Fare: Ticket fare.
Cabin: Cabin number (often missing).
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
Understanding the Dataset
The Titanic dataset contains information about the passengers:
Features (Inputs):
Pclass: Ticket class (1st, 2nd, 3rd).
Age: Age of the passenger.
Sex: Gender of the passenger.
Fare: Ticket price.
Target (Output):
Survived: Whether the passenger survived (1) or not (0).
Steps to Build the Predictive Model
Set Up Your Environment
Import libraries like pandas, sklearn, and numpy for data manipulation and modeling.
Data Preprocessing
Handle missing values (e.g., fill Age with the mean). Encode categorical columns (e.g., Sex to 0 for male and 1 for female).
Feature Selection
Use relevant features like Pclass, Age, Sex, and Fare.
Model Training
Use a supervised learning algorithm like Random Forest to train the predictive model.
Prediction and Evaluation
Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
Making Predictions
Feed new passenger data into the trained model and interpret the survival predictions.
Titanic Dataset Implementation in Python
Setting Up the Environment
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
train_test_split: Splits the dataset into training and testing subsets. This ensures the model can be trained on one part of the data and tested on another to evaluate its performance.
RandomForestClassifier: A machine learning algorithm that builds multiple decision trees and combines their outputs for robust classification results.
accuracy_score: Calculates the proportion of correct predictions made by the model.
classification_report: Provides a detailed breakdown of the model's performance, including metrics like precision, recall, and F1-score.
Loading and Exploring the Dataset
import pandas as pd
# Load dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
# Display dataset details
print(f"Dataset Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
print("First 5 rows:")
print(data.head())
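Before filling anything in, it helps to see which columns are actually incomplete. A quick check (in this dataset, Age, Cabin, and Embarked are the usual culprits):
# Count missing values in each column
print(data.isnull().sum())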
Data Preprocessing
# Data Cleaning without inplace=True
data['Age'] = data['Age'].fillna(data['Age'].mean())    # Fill missing ages with the mean
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())  # Fill missing fares with the mean
# Encoding Gender (0 = male, 1 = female)
data['Gender'] = data['Sex'].map({'male': 0, 'female': 1})
# Select features (X) and target (y) -- these are used in the training step below
X = data[['Pclass', 'Age', 'Gender', 'Fare']]
y = data['Survived']
Training and Evaluating the Model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
Inputs:
X: Features (e.g., passenger attributes like Pclass, Age, etc.).
y: Target variable (the Survived column indicating survival status).
test_size=0.3: Reserves 30% of the data for testing and uses 70% for training.
random_state=42: Ensures reproducibility by fixing the random seed for splitting.
Outputs:
X_train: Training features (70% of X).
X_test: Testing features (30% of X).
y_train: Training target variable (70% of y).
y_test: Testing target variable (30% of y).
RandomForestClassifier Parameters (explained in detail at the end of the article):
n_estimators=100: Creates a forest of 100 decision trees.
random_state=42: Ensures reproducibility by fixing the random seed.
Training:
The fit method trains the model using the training data (X_train and y_train). During training, the model learns patterns in the data to predict survival.
predict:
Generates predictions for the test dataset (X_test) based on the patterns learned during training. Outputs y_pred, an array of predicted values (e.g., survival status 0 or 1).
Accuracy:
Measures the proportion of correct predictions: Accuracy = Number of Correct Predictions / Total Number of Predictions.
Outputs a score between 0 and 1 (e.g., Accuracy: 0.85 means 85% accuracy).
Classification Report:
Provides a detailed breakdown of the model's performance for each class (e.g., Survived = 1 and Did Not Survive = 0):
Precision: Proportion of correct positive predictions out of all positive predictions made.
Recall: Proportion of actual positives correctly identified.
F1-Score: Harmonic mean of precision and recall.
Support: Number of true instances for each class in the test data.
Output
Accuracy: 0.80
Classification Report:
precision recall f1-score support
0 0.80 0.88 0.84 157
1 0.80 0.68 0.74 111
accuracy 0.80 268
macro avg 0.80 0.78 0.79 268
weighted avg 0.80 0.80 0.80 268
Making Predictions with New Input
Once the model is trained, you can use it to predict survival for new passengers.
Example: Input New Passenger Details
Imagine you are trying to predict the survival of a new passenger:
Ticket Class (Pclass): 3rd class.
Age: 28 years.
Gender (Sex): Female.
Fare: 7.25.
import pandas as pd
# Create a DataFrame for the new passenger
new_passenger = pd.DataFrame([[3, 28, 1, 7.25]], columns=['Pclass', 'Age', 'Gender', 'Fare'])
# Predict survival for the new passenger
prediction = model.predict(new_passenger)
# Interpret output
if prediction[0] == 1:
    print("Prediction: Survived")
else:
    print("Prediction: Did Not Survive")
Output
Prediction: Survived
Output Explanation:
Prediction = 1: The model predicts that this passenger survived.
Prediction = 0: The model predicts that this passenger did not survive.
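If you want a confidence estimate rather than only the hard 0/1 label, RandomForestClassifier also provides predict_proba, which returns the estimated class probabilities. A minimal sketch reusing new_passenger from above:
# Column 0 = probability of class 0 (did not survive), column 1 = class 1 (survived)
probabilities = model.predict_proba(new_passenger)
print(f"Estimated survival probability: {probabilities[0][1]:.2f}")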
Understanding Metrics
Precision (Detecting Survivors)
Out of all passengers predicted as survivors, how many were truly survivors?
Example: If the model predicted 100 passengers as survivors and 85 of them actually survived, the precision is 85%.
Recall (Finding Survivors)
Out of all the actual survivors, how many did the model correctly identify?
Example: If there were 120 actual survivors and the model identified 90 of them, the recall is 75%.
F1-Score (Balancing Precision and Recall)
Combines precision and recall into a single score, balancing false positives (predicting survival when not true) against false negatives (missing actual survivors).
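To make these definitions concrete, scikit-learn can compute each metric directly. A minimal sketch reusing y_test and y_pred from the evaluation step above:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
# Metrics for the positive class (Survived = 1)
print(f"Precision: {precision_score(y_test, y_pred):.2f}")  # correct survivor predictions / all survivor predictions
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")     # identified survivors / all actual survivors
print(f"F1-Score:  {f1_score(y_test, y_pred):.2f}")         # harmonic mean of precision and recall
# Confusion matrix: rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred))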
Explanation of RandomForestClassifier Parameters
1. n_estimators=100: Creates a forest of 100 decision trees
What it does:
This parameter specifies the number of decision trees the random forest will create during training. In this case, it builds 100 decision trees.
Why it matters:
Each tree in the forest independently predicts the target variable (e.g., Survived in the Titanic example). The final prediction is determined by combining the outputs of all these trees, typically using:
Majority Voting (for classification problems): The class predicted by the majority of trees is chosen.
Averaging (for regression problems): The average of all tree predictions is used.
Effect of increasing/decreasing n_estimators (a quick experiment follows the analogy below):
Increasing n_estimators:
Generally improves accuracy and stability, because averaging over more trees reduces the variance of the predictions.
Increases training time and memory usage.
Decreasing n_estimators:
Reduces computational cost but may lower accuracy due to insufficient trees.
Analogy:
Imagine a panel of 100 judges making a decision. Each judge votes independently, and the majority wins. Adding more judges generally improves the fairness of the decision, but it also takes more time and resources.
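To see this trade-off on the Titanic data, you can retrain with different forest sizes. A minimal sketch reusing the X_train/X_test split and imports from the training step (the exact numbers depend on your split):
# Compare test accuracy as the number of trees grows
for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    print(f"n_estimators={n}: accuracy={accuracy_score(y_test, rf.predict(X_test)):.3f}")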
2. random_state=42: Ensures reproducibility by fixing the random seed
What it does:
Sets a seed for the random number generator used in various stages of the Random Forest algorithm (e.g., random sampling of data, selection of features for each tree).
Why it matters:
Without a fixed seed, the random processes in Random Forest can lead to slightly different results each time the model is trained, even with the same data and parameters.
Setting random_state ensures the same sequence of random numbers is generated each time, producing consistent results.
Effect of changing random_state:
Changing the value of random_state results in a different random sequence and may slightly change the model's performance due to differences in tree construction.
Why use 42?
The number 42 is often used as a humorous reference to The Hitchhiker's Guide to the Galaxy, where 42 is the "Answer to the Ultimate Question of Life, the Universe, and Everything." Any integer can be used as the seed.
Analogy:
Think of random_state as a shuffle function with a fixed order. Imagine shuffling a deck of cards the same way every time: this ensures reproducibility of the shuffled order, no matter how many times you shuffle.
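A small sketch of this guarantee, reusing the training split from earlier: two forests built with the same seed make identical predictions.
# Two forests with the same seed learn identical trees
m1 = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
m2 = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print((m1.predict(X_test) == m2.predict(X_test)).all())  # True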
In summary:
n_estimators=100: Creates 100 decision trees to enhance model accuracy and robustness.
random_state=42: Ensures reproducible results by fixing the randomness in data sampling and tree-building processes.
These parameters help control the behavior and performance of the Random Forest model.
Next Steps
Predictive Analytics is a powerful tool to transform raw data into actionable insights. Using this hands-on guide, you’ve built your first predictive model with Python. Now, it's your turn to apply these techniques to other datasets, such as sales forecasting, customer churn prediction, or fraud detection.
What You Can Explore Next
Enhance Your Model:
Add new features, such as a combined "Family Size" from SibSp and Parch (see the first sketch after this list).
Experiment with advanced techniques like feature scaling and encoding.
Try Different Algorithms:
Explore Logistic Regression, Gradient Boosting, or Support Vector Machines (SVM) to see if they outperform Random Forest on this dataset.
Hyperparameter Tuning:
Use GridSearchCV or RandomizedSearchCV to find the optimal parameters for your Random Forest model (see the second sketch after this list).
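For the "Family Size" idea, a minimal sketch (the column name FamilySize is my own choice; the +1 counts the passenger themselves):
# Combine siblings/spouses and parents/children into a single family-size feature
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
# Retrain the model with the extra feature
X = data[['Pclass', 'Age', 'Gender', 'Fare', 'FamilySize']]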
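And for hyperparameter tuning, a minimal GridSearchCV sketch (this small parameter grid is only an illustrative starting point):
from sklearn.model_selection import GridSearchCV
# Search a small grid with 5-fold cross-validation on the training data
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.3f}")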
By taking these steps, you’ll develop a deeper understanding of predictive analytics and build more accurate models for a variety of real-world problems.
For more in-depth technical insights and articles, feel free to explore:
LinkTree: LinkTree - Ebasiq
Substack: ebasiq by Girish
YouTube Channel: Ebasiq YouTube Channel
Instagram: Ebasiq Instagram
Technical Blog: Ebasiq Blog
GitHub Code Repository: Girish GitHub Repos
LinkedIn: Girish LinkedIn
Personal Blog: Girish - BlogBox