A Hands-On Guide to Predictive Analytics with Python: Titanic Dataset Example
Learn the step-by-step process of building a predictive model with Python, using the Titanic dataset to explore data preprocessing, machine learning, and interpreting predictions effectively.
Predictive Analytics is at the forefront of data science, helping industries make data-driven decisions by forecasting outcomes based on historical data. This article provides a practical introduction to Predictive Analytics using Python and the Titanic dataset. You’ll learn how to preprocess data, build a machine learning model, and interpret predictions effectively.
Whether you're a beginner or looking to enhance your data science skills, this hands-on guide will help you understand the essentials of Predictive Analytics and metrics like precision, recall, and F1-score. By the end, you’ll be ready to explore predictive modeling in your own projects.
Applications of Predictive Analytics
Healthcare: Predicting disease outbreaks, patient readmissions, and treatment outcomes.
Finance: Fraud detection, credit scoring, and risk management.
Retail: Personalized marketing, demand forecasting, and inventory optimization.
Manufacturing: Predictive maintenance and supply chain optimization.
Education: Identifying at-risk students and optimizing learning paths.
Customer Service: Churn prediction and improving customer satisfaction.
Steps of the Predictive Analytics Process
Define the Objective:
Clearly define the problem to solve or the prediction to make.
Data Collection:
Gather historical data relevant to the objective.
Data Preprocessing:
Clean and preprocess data (handle missing values, outliers, normalization).
Feature Engineering:
Select or create features that best represent the data for prediction.
Model Selection:
Choose the most suitable predictive algorithm (e.g., Linear Regression, Random Forest).
Model Training:
Train the model on a subset of the data (training set).
Model Testing and Validation:
Evaluate the model's performance on unseen data (test set).
Deployment and Monitoring:
Deploy the model into production and monitor its performance.
Types of Predictive Algorithms
Regression Algorithms:
Predict continuous outcomes (e.g., Linear Regression, Lasso Regression).
Classification Algorithms:
Predict categorical outcomes (e.g., Logistic Regression, Random Forest, SVM).
Time Series Forecasting:
Predict future values based on temporal data (e.g., ARIMA, Prophet).
Ensemble Methods:
Combine multiple algorithms for better performance (e.g., Gradient Boosting, XGBoost).
Neural Networks:
Complex pattern recognition and prediction (e.g., Deep Learning, LSTMs).
Most predictive algorithms fall under the category of Supervised Learning because they require labeled data (known inputs and corresponding outputs) to train the model.
Titanic Dataset Overview
The Titanic disaster of 1912 is one of the most infamous maritime tragedies in history. But did you know this event has also inspired data scientists and machine learning enthusiasts to build predictive models? Using the Titanic dataset, we can predict whether a passenger might survive based on their characteristics like age, gender, and travel class.
In this article, we’ll walk through a beginner-friendly approach to building a predictive model for Titanic survival using Python. We’ll also explain how to give new inputs to the trained model and interpret the outputs.
Size of the Dataset:
The Titanic dataset contains 891 rows (passengers) and 12 columns (attributes about passengers).
Column Names and Explanations:
PassengerId: Unique ID for each passenger.
Survived: Survival status (1 = Survived, 0 = Did not survive) - Target variable.
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
Name: Name of the passenger.
Sex: Gender of the passenger (male/female).
Age: Age of the passenger.
SibSp: Number of siblings/spouses aboard.
Parch: Number of parents/children aboard.
Ticket: Ticket number.
Fare: Ticket fare.
Cabin: Cabin number (often missing).
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
Understanding the Dataset
The Titanic dataset contains information about the passengers:
Features (Inputs):
Pclass: Ticket class (1st, 2nd, 3rd).
Age: Age of the passenger.
Sex: Gender of the passenger.
Fare: Ticket price.
Target (Output):
Survived: Whether the passenger survived (1) or not (0).
Steps to Build the Predictive Model
Set Up Your Environment
Import libraries like pandas, sklearn, and numpy for data manipulation and modeling.
Data Preprocessing
Handle missing values (e.g., fill Age with the mean). Encode categorical columns (e.g., Sex to 0 for male and 1 for female).
Feature Selection
Use relevant features like Pclass, Age, Sex, and Fare.
Model Training
Use a supervised learning algorithm like Random Forest to train the predictive model.
Prediction and Evaluation
Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
Making Predictions
Feed new passenger data into the trained model and interpret the survival predictions.
Titanic Dataset Implementation in Python
Setting Up the Environment
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
train_test_split: Splits the dataset into training and testing subsets. This ensures the model can be trained on one part of the data and tested on another to evaluate its performance.
RandomForestClassifier: A machine learning algorithm that builds multiple decision trees and combines their outputs for robust classification results.
accuracy_score: Calculates the proportion of correct predictions made by the model.
classification_report: Provides a detailed breakdown of the model's performance, including metrics like precision, recall, and F1-score.
Loading and Exploring the Dataset
import pandas as pd
# Load dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
data = pd.read_csv(url)
# Display dataset details
print(f"Dataset Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
print("First 5 rows:")
print(data.head())
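Before filling anything in, it helps to see which columns are actually incomplete. A quick check (in this dataset, Age, Cabin, and Embarked are the usual culprits):
# Count missing values in each column
print(data.isnull().sum())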
Data Preprocessing
# Data Cleaning without inplace=True
data['Age'] = data['Age'].fillna(data['Age'].mean())    # Fill missing ages with the mean
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())  # Fill missing fares with the mean
# Encoding Gender (0 = male, 1 = female)
data['Gender'] = data['Sex'].map({'male': 0, 'female': 1})
# Select features (X) and target (y) -- these are used in the training step below
X = data[['Pclass', 'Age', 'Gender', 'Fare']]
y = data['Survived']
Training and Evaluating the Model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
Inputs:
X: Features (e.g., passenger attributes like Pclass, Age, etc.).
y: Target variable (the Survived column indicating survival status).
test_size=0.3: Reserves 30% of the data for testing and uses 70% for training.
random_state=42: Ensures reproducibility by fixing the random seed for splitting.
Outputs:
X_train: Training features (70% of X).
X_test: Testing features (30% of X).
y_train: Training target variable (70% of y).
y_test: Testing target variable (30% of y).
RandomForestClassifier Parameters (explained in detail at the end of the article):
n_estimators=100: Creates a forest of 100 decision trees.
random_state=42: Ensures reproducibility by fixing the random seed.
Training:
The fit method trains the model using the training data (X_train and y_train). During training, the model learns patterns in the data to predict survival.
predict:
Generates predictions for the test dataset (X_test) based on the patterns learned during training. Outputs y_pred, an array of predicted values (e.g., survival status 0 or 1).
Accuracy:
Measures the proportion of correct predictions: Accuracy = Number of Correct Predictions / Total Number of Predictions.
Outputs a score between 0 and 1 (e.g., Accuracy: 0.85 means 85% accuracy).
Classification Report:
Provides a detailed breakdown of the model's performance for each class (e.g., Survived = 1 and Did Not Survive = 0):
Precision: Proportion of correct positive predictions out of all positive predictions made.
Recall: Proportion of actual positives correctly identified.
F1-Score: Harmonic mean of precision and recall.
Support: Number of true instances for each class in the test data.
Output
Accuracy: 0.80
Classification Report:
precision recall f1-score support
0 0.80 0.88 0.84 157
1 0.80 0.68 0.74 111
accuracy 0.80 268
macro avg 0.80 0.78 0.79 268
weighted avg 0.80 0.80 0.80 268
Making Predictions with New Input
Once the model is trained, you can use it to predict survival for new passengers.
Example: Input New Passenger Details
Imagine you are trying to predict the survival of a new passenger:
Ticket Class (Pclass): 3rd class.
Age: 28 years.
Gender (Sex): Female.
Fare: 7.25.
import pandas as pd
# Create a DataFrame for the new passenger
new_passenger = pd.DataFrame([[3, 28, 1, 7.25]], columns=['Pclass', 'Age', 'Gender', 'Fare'])
# Predict survival for the new passenger
prediction = model.predict(new_passenger)
# Interpret output
if prediction[0] == 1:
    print("Prediction: Survived")
else:
    print("Prediction: Did Not Survive")
Output
Prediction: Survived
Output Explanation:
Prediction = 1: The model predicts that this passenger survived.
Prediction = 0: The model predicts that this passenger did not survive.
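If you want a confidence estimate rather than only the hard 0/1 label, RandomForestClassifier also provides predict_proba, which returns the estimated class probabilities. A minimal sketch reusing new_passenger from above:
# Column 0 = probability of class 0 (did not survive), column 1 = class 1 (survived)
probabilities = model.predict_proba(new_passenger)
print(f"Estimated survival probability: {probabilities[0][1]:.2f}")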
Understanding Metrics
Precision (Detecting Survivors)
Out of all passengers predicted as survivors, how many were truly survivors?
Example: If the model predicted 100 passengers as survivors and 85 of them actually survived, the precision is 85%.
Recall (Finding Survivors)
Out of all the actual survivors, how many did the model correctly identify?
Example: If there were 120 actual survivors and the model identified 90 of them, the recall is 75%.
F1-Score (Balancing Precision and Recall)
Combines precision and recall into a single score, balancing false positives (predicting survival when not true) against false negatives (missing actual survivors).
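To make these definitions concrete, scikit-learn can compute each metric directly. A minimal sketch reusing y_test and y_pred from the evaluation step above:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
# Metrics for the positive class (Survived = 1)
print(f"Precision: {precision_score(y_test, y_pred):.2f}")  # correct survivor predictions / all survivor predictions
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")     # identified survivors / all actual survivors
print(f"F1-Score:  {f1_score(y_test, y_pred):.2f}")         # harmonic mean of precision and recall
# Confusion matrix: rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred))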
Explanation of RandomForestClassifier Parameters
1. n_estimators=100: Creates a forest of 100 decision trees
What it does:
This parameter specifies the number of decision trees the random forest will create during training. In this case, it builds 100 decision trees.
Why it matters:
Each tree in the forest independently predicts the target variable (e.g., Survived in the Titanic example). The final prediction is determined by combining the outputs of all these trees, typically using:
Majority Voting (for classification problems): The class predicted by the majority of trees is chosen.
Averaging (for regression problems): The average of all tree predictions is used.
Effect of increasing/decreasing n_estimators (a quick experiment follows the analogy below):
Increasing n_estimators:
Generally improves accuracy and stability, because averaging over more trees reduces the variance of the predictions.
Increases training time and memory usage.
Decreasing n_estimators:
Reduces computational cost but may lower accuracy due to insufficient trees.
Analogy:
Imagine a panel of 100 judges making a decision. Each judge votes independently, and the majority wins. Adding more judges generally improves the fairness of the decision, but it also takes more time and resources.
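To see this trade-off on the Titanic data, you can retrain with different forest sizes. A minimal sketch reusing the X_train/X_test split and imports from the training step (the exact numbers depend on your split):
# Compare test accuracy as the number of trees grows
for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    print(f"n_estimators={n}: accuracy={accuracy_score(y_test, rf.predict(X_test)):.3f}")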
2. random_state=42: Ensures reproducibility by fixing the random seed
What it does:
Sets a seed for the random number generator used in various stages of the Random Forest algorithm (e.g., random sampling of data, selection of features for each tree).
Why it matters:
Without a fixed seed, the random processes in Random Forest can lead to slightly different results each time the model is trained, even with the same data and parameters.
Setting random_state ensures the same sequence of random numbers is generated each time, producing consistent results.
Effect of changing random_state:
Changing the value of random_state results in a different random sequence and may slightly change the model's performance due to differences in tree construction.
Why use 42?
The number 42 is often used as a humorous reference to The Hitchhiker's Guide to the Galaxy, where 42 is the "Answer to the Ultimate Question of Life, the Universe, and Everything." Any integer can be used as the seed.
Analogy:
Think of random_state as a shuffle function with a fixed order. Imagine shuffling a deck of cards the same way every time: this ensures reproducibility of the shuffled order, no matter how many times you shuffle.
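A small sketch of this guarantee, reusing the training split from earlier: two forests built with the same seed make identical predictions.
# Two forests with the same seed learn identical trees
m1 = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
m2 = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print((m1.predict(X_test) == m2.predict(X_test)).all())  # True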
In summary:
n_estimators=100: Creates 100 decision trees to enhance model accuracy and robustness.
random_state=42: Ensures reproducible results by fixing the randomness in data sampling and tree-building processes.
These parameters help control the behavior and performance of the Random Forest model.
Next Steps
Predictive Analytics is a powerful tool to transform raw data into actionable insights. Using this hands-on guide, you’ve built your first predictive model with Python. Now, it's your turn to apply these techniques to other datasets, such as sales forecasting, customer churn prediction, or fraud detection.
What You Can Explore Next
Enhance Your Model:
Add new features, such as a combined "Family Size" from SibSp and Parch (see the first sketch after this list).
Experiment with advanced techniques like feature scaling and encoding.
Try Different Algorithms:
Explore Logistic Regression, Gradient Boosting, or Support Vector Machines (SVM) to see if they outperform Random Forest on this dataset.
Hyperparameter Tuning:
Use GridSearchCV or RandomizedSearchCV to find the optimal parameters for your Random Forest model (see the second sketch after this list).
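For the "Family Size" idea, a minimal sketch (the column name FamilySize is my own choice; the +1 counts the passenger themselves):
# Combine siblings/spouses and parents/children into a single family-size feature
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
# Retrain the model with the extra feature
X = data[['Pclass', 'Age', 'Gender', 'Fare', 'FamilySize']]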
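And for hyperparameter tuning, a minimal GridSearchCV sketch (this small parameter grid is only an illustrative starting point):
from sklearn.model_selection import GridSearchCV
# Search a small grid with 5-fold cross-validation on the training data
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation score: {grid.best_score_:.3f}")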
By taking these steps, you’ll develop a deeper understanding of predictive analytics and build more accurate models for a variety of real-world problems.
For more in-depth technical insights and articles, feel free to explore:
LinkTree: LinkTree - Ebasiq
Substack: ebasiq by Girish
YouTube Channel: Ebasiq YouTube Channel
Instagram: Ebasiq Instagram
Technical Blog: Ebasiq Blog
GitHub Code Repository: Girish GitHub Repos
LinkedIn: Girish LinkedIn
Personal Blog: Girish - BlogBox