Python File Handling: An In-Depth Guide with Code Examples

Mastering the Art of Reading, Writing, and Manipulating Files in Python for Text, Images, CSV, Excel, Word, PDFs, and More

Dec 17, 2024

File handling is a cornerstone of Python programming, especially in Data Analysis, Machine Learning, and Automation tasks. Python provides built-in functions and modules, plus external libraries, to handle various file formats, enabling you to read, write, append, extract, and manipulate data efficiently.

This guide covers the following:

Why file handling is essential.
Libraries required and installation.
Working with different file formats: use cases, real-world scenarios, and advanced examples.

1. Text Files

Why Use Text Files?

Text files are lightweight, human-readable, and often used for:

Log Analysis: Reading server logs to monitor system performance or detect issues.
Configuration Files: Parsing settings stored in .txt or .ini files.
Data Preprocessing: Reading and cleaning raw text data for NLP (Natural Language Processing).
Reports and Summaries: Generating or reading plain text-based reports.

Libraries Required

No external libraries are required; the built-in open() function is enough.

Basic Example: Reading a Text File

# Read a text file line by line
with open("example.txt", "r", encoding="utf-8") as file:
    for line in file:
        print(line.rstrip("\n"))  # Drop the trailing newline; print() adds its own

Advanced Use Cases

1. Checking If a File Exists Before Reading

import os

if os.path.exists("example.txt"):
    with open("example.txt", "r") as file:
        print(file.read())
else:
    print("File does not exist.")
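Checking with os.path.exists() and then opening the file leaves a small window in which the file can disappear between the two calls. A common alternative is to attempt the read and handle FileNotFoundError. The sketch below shows that approach; the helper name read_text_or_default is hypothetical and chosen only for this example.

```python
from pathlib import Path


def read_text_or_default(path, default=""):
    """Return the file's text, or `default` if the file is missing.

    Hypothetical helper for illustration only.
    """
    try:
        # Attempt the read directly instead of checking existence first
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return default


content = read_text_or_default("example.txt", default=None)
if content is None:
    print("File does not exist.")
else:
    print(content)
```

Passing default=None makes a missing file distinguishable from an empty one.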

2. Writing Custom Logs

from datetime import datetime

# Write logs with timestamps
with open("logs.txt", "a") as file:
    file.write(f"{datetime.now()}: Process Completed Successfully.\n")
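For anything beyond a quick script, the standard-library logging module can write the same kind of timestamped lines and also gives you log levels. This is a minimal sketch, assuming a hypothetical helper named make_file_logger; the logger name and the format string are illustrative choices rather than anything prescribed above.

```python
import logging


def make_file_logger(path, name="file_handling_demo"):
    """Return a logger that appends timestamped lines to `path`.

    Hypothetical helper; the logger name and format string are
    illustrative choices.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # Avoid adding a second handler on repeated calls
        handler = logging.FileHandler(path, mode="a", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling make_file_logger("logs.txt").info("Process Completed Successfully.") would append one timestamped INFO line to logs.txt.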

3. Custom Line Filtering

Suppose you want to remove lines containing a specific word.

# Remove lines containing the word "ERROR"
with open("logs.txt", "r") as file:
    lines = file.readlines()

with open("cleaned_logs.txt", "w") as file:
    for line in lines:
        if "ERROR" not in line:
            file.write(line)
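The readlines() version above loads the whole file into memory, which can be a problem for very large logs. A streaming variant can read and write one line at a time instead. The following is a sketch under that assumption; filter_lines is a hypothetical helper name.

```python
def filter_lines(src_path, dst_path, unwanted="ERROR"):
    """Copy src_path to dst_path, skipping lines that contain `unwanted`.

    Returns the number of lines that were dropped.
    Hypothetical helper for illustration only.
    """
    dropped = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # Iterating the file object reads one line at a time
            if unwanted in line:
                dropped += 1
            else:
                dst.write(line)
    return dropped
```

Returning the count of dropped lines makes it easy to report how much was filtered out.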

2. Image Files

Why Use Image Files?

Images are often used in Machine Learning (image classification, object detection) or data visualization pipelines.

Machine Learning: Preprocessing images for training classification models (e.g., image resizing, augmentation).
Computer Vision: Detecting objects in images for security systems.
Data Extraction: Reading and analyzing images for text (OCR - Optical Character Recognition).

Libraries Required

Pillow (PIL): pip install pillow
OpenCV: pip install opencv-python

Basic Example: Displaying Images

Using Pillow (PIL)

from PIL import Image

# Open and display an image
img = Image.open("example.jpg")
img.show()

# Convert to grayscale
gray_img = img.convert("L")
gray_img.save("grayscale_example.jpg")

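Using OpenCV

import cv2

# Read the image; cv2.imread returns None (it does not raise) if the file cannot be read
img = cv2.imread("example.jpg")
if img is None:
    raise FileNotFoundError("Could not read example.jpg")

# Display the image
cv2.imshow("Image", img)

# Resize the image to 100x100 pixels (width, height)
resized = cv2.resize(img, (100, 100))
cv2.imshow("Resized Image", resized)

cv2.waitKey(0)  # Wait for a key press before closing the windows
cv2.destroyAllWindows()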

Advanced Example: Resizing and Pixel Manipulation

from PIL import Image

# Open an image
img = Image.open("example.jpg")

# Extract image properties
print(f"Image Format: {img.format}")
print(f"Image Size: {img.size}")
print(f"Image Mode: {img.mode}")

# Resize the image
resized_img = img.resize((200, 200))
resized_img.save("resized_example.jpg")

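# Access and modify specific pixels
# Convert to RGB first so that a 3-value tuple is valid for any source mode
rgb_img = img.convert("RGB")
pixels = rgb_img.load()
pixels[10, 10] = (255, 0, 0)  # Change pixel (10,10) to red
# Save as PNG: JPEG compression is lossy and would alter the edited pixel
rgb_img.save("modified_pixel_example.png")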

3. CSV Files

Why Use CSV Files?

CSV files are the go-to format for storing and sharing tabular data because they are lightweight and supported across platforms.

Data Analytics: Importing CSV datasets for analysis in tools like pandas.
Financial Reports: Automating the generation and analysis of revenue or expense CSV files.
ETL Pipelines: Extracting and transforming CSV data for databases.

Libraries Required

Built-in csv module
pandas: pip install pandas

Basic Example: Reading and Writing CSV

Using `csv` Module

import csv

# Writing to CSV
data = [["Name", "Age"], ["Girish", 25], ["Chandra", 30]]
with open("output.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Reading CSV (every value comes back as a string, e.g. "25")
with open("output.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
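When a CSV file has a header row, csv.DictWriter and csv.DictReader let you work with column names instead of list positions. The sketch below also converts Age back to an integer, since the csv module returns every value as a string. The function names write_people and read_people are hypothetical.

```python
import csv


def write_people(path, people):
    """Write a list of {"Name": ..., "Age": ...} dicts to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Age"])
        writer.writeheader()
        writer.writerows(people)


def read_people(path):
    """Read the CSV back, converting Age from string to int."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return [
            {"Name": row["Name"], "Age": int(row["Age"])}
            for row in csv.DictReader(f)
        ]
```

Doing the type conversion in one place keeps the rest of the code free of repeated int() calls.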

Using `pandas`

import pandas as pd

# Read CSV
df = pd.read_csv("example.csv")
print(df)

# Write CSV
df.to_csv("output.csv", index=False)

Advanced Example: Large File Processing with `pandas`

import pandas as pd

# Read CSV in chunks (for large files); each chunk is a DataFrame of up to chunk_size rows
chunk_size = 1000
for chunk in pd.read_csv("large_data.csv", chunksize=chunk_size):
    print(chunk.head())

4. Excel Files

Why Use Excel Files?

Excel files (.xlsx) are widely used for structured data storage and analysis in business and data science.

Financial Modeling: Analyzing and automating financial data stored in Excel.
Inventory Management: Reading and updating inventory data in Excel sheets.
Data Reporting: Automating Excel-based reports.

Libraries Required

openpyxl: pip install openpyxl
pandas: pip install pandas

Basic Example: Reading and Writing Excel

Using `pandas`

import pandas as pd

# Write a DataFrame to Excel
data = {"Name": ["Alice", "Bob"], "Age": [25, 30]}
df = pd.DataFrame(data)
df.to_excel("output.xlsx", index=False)

# Read from Excel
df = pd.read_excel("output.xlsx")
print(df)

Reading Multiple Sheets

import pandas as pd

# Read all sheets
data = pd.read_excel("example.xlsx", sheet_name=None)  # Returns a dict mapping sheet names to DataFrames
for sheet, content in data.items():
    print(f"Sheet: {sheet}")
    print(content)

Advanced Example: Applying Conditional Formatting

from openpyxl import load_workbook
from openpyxl.styles import PatternFill

# Open workbook and apply formatting
wb = load_workbook("output.xlsx")
sheet = wb.active

# Apply a red fill to Age cells (column 2) where Age > 25.
# Note: this is a static fill based on current values, not an Excel conditional-formatting rule.
red_fill = PatternFill(start_color="FF0000", end_color="FF0000", fill_type="solid")
for row in sheet.iter_rows(min_row=2, min_col=2, max_col=2):
    for cell in row:
        if isinstance(cell.value, int) and cell.value > 25:
            cell.fill = red_fill

wb.save("formatted_output.xlsx")

5. Word Files

Why Use Word Files?

Word files (.docx) are useful for generating reports and other documents, and for storing structured content.

Report Extraction: Extracting text and tables from Word documents for further analysis.
Automated Document Generation: Creating templated reports, invoices, or resumes.
Data Processing: Extracting embedded data like tables and images.

Libraries Required

python-docx: pip install python-docx

Reading Paragraphs and Tables

from docx import Document

# Open Word file
doc = Document("example.docx")

# Read paragraphs
for paragraph in doc.paragraphs:
    print(paragraph.text)

def extract_tables_from_word(word_file):
    # Open the Word document
    doc = Document(word_file)
    table_data = []

    # Iterate through each table
    for table_index, table in enumerate(doc.tables):
        print(f"--- Table {table_index + 1} ---")
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)
            print("\t".join(row_data))  # Print each row in a table-like format

    print("Table extraction completed.")
    return table_data

# Example usage
word_file = "example.docx"  # Replace with your Word file path
tables = extract_tables_from_word(word_file)

Extracting Images from a Word Document

Purpose

Word documents often include images embedded within paragraphs or media sections. These images can be extracted for further use.

Libraries Required

Built-in zipfile (Word documents are essentially ZIP archives).

Code Example: Extract Images from Word Document

import zipfile
import os

def extract_images_from_word(docx_file, output_folder):
    # Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Open the .docx file as a ZIP archive
    with zipfile.ZipFile(docx_file, 'r') as docx_zip:
        for file in docx_zip.namelist():
            # Check for image files in the Word media folder
            if file.startswith("word/media/"):
                docx_zip.extract(file, output_folder)
                print(f"Extracted: {file}")

    print("Image extraction completed.")

# Example usage
docx_file = "example.docx"  # Replace with your Word file path
output_folder = "word_images"
extract_images_from_word(docx_file, output_folder)
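Note that ZipFile.extract() keeps the archive's internal path, so the function above writes files to word_images/word/media/. If you would rather have the images directly in the output folder, one option is to read each member's bytes and write them under the base file name. This is a sketch only; extract_images_flat and the IMAGE_EXTENSIONS set are assumptions made for this example, and the extension list is not exhaustive.

```python
import os
import zipfile

# Assumed set of extensions to treat as images; extend as needed
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".emf", ".wmf"}


def extract_images_flat(docx_file, output_folder):
    """Copy images from word/media/ straight into output_folder.

    Returns the list of file paths written. Hypothetical helper.
    """
    os.makedirs(output_folder, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_file, "r") as docx_zip:
        for name in docx_zip.namelist():
            ext = os.path.splitext(name)[1].lower()
            if name.startswith("word/media/") and ext in IMAGE_EXTENSIONS:
                # Use only the base name so no nested folders are created
                target = os.path.join(output_folder, os.path.basename(name))
                with open(target, "wb") as out:
                    out.write(docx_zip.read(name))
                saved.append(target)
    return saved
```

Filtering by extension also skips any non-image entries that may be stored in the media folder.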

6. PDF Files

Why Use PDF Files?

PDFs are used for sharing documents with fixed formatting.

Invoice Processing: Extracting text, tables, or images from invoices for accounting systems.
Document Search: Searching and analyzing large collections of PDFs.
Legal Document Analysis: Extracting sections of legal or policy documents.

Libraries Required

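pypdf (the maintained successor to PyPDF2): pip install pypdf

Extracting Text from a PDF

from pypdf import PdfReader

# Open PDF and extract text
reader = PdfReader("example.pdf")
for page in reader.pages:
    # extract_text() can return an empty string for scanned (image-only) pages
    print(page.extract_text())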

Extracting Images from a PDF

Purpose

PDF files often contain embedded images that need to be extracted for analysis, reporting, or visualization purposes.

Libraries Required

PyMuPDF (fitz): pip install pymupdf

Code Example: Extract Images from PDF

import fitz  # PyMuPDF
import os

def extract_images_from_pdf(pdf_file, output_folder):
    # Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Open the PDF file
    pdf = fitz.open(pdf_file)

    # Loop through all the pages
    for page_num in range(len(pdf)):
        page = pdf[page_num]
        images = page.get_images(full=True)  # Extract all images on the page

        for img_index, img in enumerate(images):
            xref = img[0]  # Image reference number
            base_image = pdf.extract_image(xref)
            image_bytes = base_image["image"]  # Image binary content
            image_ext = base_image["ext"]  # Image extension (e.g., jpg, png)

            # Save the image
            image_filename = os.path.join(output_folder, f"page{page_num+1}_img{img_index+1}.{image_ext}")
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)
                print(f"Saved: {image_filename}")

    pdf.close()  # Release the file handle once all pages are processed
    print("Image extraction completed.")

# Example usage
pdf_file = "example.pdf"  # Replace with your PDF file path
output_folder = "extracted_images"
extract_images_from_pdf(pdf_file, output_folder)

7. JSON Files

Why Use JSON Files?

JSON is one of the most common formats for exchanging structured data, especially in APIs and web development.

APIs: Reading data returned from APIs (e.g., weather, stock prices).
Configuration Management: Storing settings in JSON format for applications.
Data Storage: Reading and writing structured data for web or mobile apps.

Libraries Required

Built-in json module.

Basic Example: Reading and Writing JSON

import json

# Writing to JSON
data = {"name": "Alice", "age": 25}
with open("output.json", "w") as file:
    json.dump(data, file, indent=4)

# Reading JSON
with open("output.json", "r") as file:
    data = json.load(file)
    print(data)
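Real-world JSON files are sometimes missing or malformed, so it can help to handle both cases explicitly. The sketch below catches FileNotFoundError and json.JSONDecodeError; load_json_or_none is a hypothetical helper name, and returning None on failure is just one possible design.

```python
import json


def load_json_or_none(path):
    """Return the parsed JSON content of `path`, or None on failure.

    Hypothetical helper for illustration only.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        print(f"File not found: {path}")
    except json.JSONDecodeError as exc:
        # lineno and colno point at where parsing failed
        print(f"Invalid JSON in {path}: line {exc.lineno}, column {exc.colno}")
    return None
```

If None is itself a valid value in your data (a file containing just null), raising an exception instead would be the safer choice.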

Conclusion

Python's file handling is versatile, efficient, and essential for working with data in various formats. This guide covered everything from basic operations to advanced use cases across:

Text
Images
CSV
Excel
Word
PDF
JSON

By mastering these, you’ll unlock powerful capabilities for Data Analysis, Automation, and Machine Learning workflows.

Explore More:

If you enjoyed this article, dive deeper into the fascinating world of mathematics, technology, and ancient wisdom through my blogs:

🌟 Ganitham Guru – Discover the beauty of mathematics and its applications rooted in ancient and modern insights.
🌐 Girish Blog Box – A hub of thought-provoking articles on technology, personal growth, and more.
💡 Ebasiq – Simplifying complex concepts in AI, data science, and beyond for everyone.
🌐 Linktr.ee - Discover all my profiles and resources in one place.

Stay inspired and keep exploring the endless possibilities of knowledge! ✨

ebasiq by Girish
