Python File Handling: An In-Depth Guide with Code Examples
Mastering the Art of Reading, Writing, and Manipulating Files in Python for Text, Images, CSV, Excel, Word, PDFs, and More
File handling is a cornerstone of Python programming, especially in Data Analysis, Machine Learning, and Automation tasks. Python provides built-in methods and external libraries to handle various file formats, enabling you to read, write, append, extract, and manipulate data efficiently.
This comprehensive guide will take you through the following concepts:
Why file handling is essential.
Libraries required and installation.
Working with different file formats: use cases, real-world scenarios, and advanced examples.
1. Text Files
Why Use Text Files?
Text files are lightweight, human-readable, and often used for:
Log Analysis: Reading server logs to monitor system performance or detect issues.
Configuration Files: Parsing settings stored in .txt or .ini files.
Data Preprocessing: Reading and cleaning raw text data for NLP (Natural Language Processing).
Reports and Summaries: Generating or reading plain text-based reports.
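As an illustration of the configuration-file use case above, here is a minimal sketch using the standard library's configparser module. The section name, keys, and the load_settings helper are hypothetical and exist only for this example; note that configparser returns every value as a string.

```python
import configparser

# Hypothetical INI content used only for this illustration
SAMPLE_INI = """
[database]
host = localhost
port = 5432
"""

def load_settings(ini_text):
    """Parse INI-formatted text into a nested dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}

settings = load_settings(SAMPLE_INI)
print(settings["database"]["host"])
# Values are strings; convert explicitly when a number is needed
print(int(settings["database"]["port"]) + 1)
```

To read from a file on disk instead of a string, parser.read("settings.ini") can be used in place of read_string (the file name here is an assumption).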
Libraries Required
No external libraries required (the built-in open() function is enough).
Basic Example: Reading a Text File
# Read a text file line by line
with open("example.txt", "r") as file:
    for line in file:
        print(line.strip())  # Strips surrounding whitespace, including the trailing newline
Advanced Use Cases
1. Checking If a File Exists Before Reading
import os
if os.path.exists("example.txt"):
    with open("example.txt", "r") as file:
        print(file.read())
else:
    print("File does not exist.")
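An alternative to checking for the file first is to attempt the read and handle the failure, which avoids the small window in which a file could disappear between the check and the open. The sketch below assumes UTF-8 text; the helper name read_text_or_default is hypothetical.

```python
from pathlib import Path

def read_text_or_default(path, default=""):
    """Return the file's text, or `default` if the file is missing."""
    try:
        # An explicit encoding avoids platform-dependent defaults
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return default

print(read_text_or_default("example.txt", default="File does not exist."))
```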
2. Writing Custom Logs
from datetime import datetime
# Write logs with timestamps
with open("logs.txt", "a") as file:
    file.write(f"{datetime.now()}: Process Completed Successfully.\n")
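For anything beyond a one-off message, the standard library's logging module can produce the same kind of timestamped, appended log lines. This is a minimal sketch; the helper name make_file_logger, the logger name "process", and the message format are assumptions chosen for illustration.

```python
import logging

def make_file_logger(name, log_path):
    """Build a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger

logger = make_file_logger("process", "logs.txt")
logger.info("Process Completed Successfully.")
```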
3. Custom Line Filtering
Suppose you want to remove lines containing a specific word.
# Remove lines containing the word "ERROR"
with open("logs.txt", "r") as file:
    lines = file.readlines()
with open("cleaned_logs.txt", "w") as file:
    for line in lines:
        if "ERROR" not in line:
            file.write(line)
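The example above loads every line into memory with readlines(), which is fine for small files. For large logs, a streaming variant that reads and writes one line at a time may be preferable. The following is a sketch under the assumption that the file is UTF-8 text; the function name filter_lines is hypothetical.

```python
def filter_lines(src_path, dst_path, keyword="ERROR"):
    """Copy src_path to dst_path, skipping lines that contain keyword.

    Returns the number of lines written.
    """
    kept = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # the file object yields one line at a time
            if keyword not in line:
                dst.write(line)
                kept += 1
    return kept

# Example usage (assumes logs.txt exists):
# filter_lines("logs.txt", "cleaned_logs.txt")
```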
2. Image Files
Why Use Image Files?
Images are often used in Machine Learning (image classification, object detection) or data visualization pipelines.
Machine Learning: Preprocessing images for training classification models (e.g., image resizing, augmentation).
Computer Vision: Detecting objects in images for security systems.
Data Extraction: Reading and analyzing images for text (OCR - Optical Character Recognition).
Libraries Required
Pillow (PIL):
pip install pillow
OpenCV:
pip install opencv-python
Basic Example: Displaying Images
Using Pillow (PIL)
from PIL import Image
# Open and display an image
img = Image.open("example.jpg")
img.show()
# Convert to grayscale
gray_img = img.convert("L")
gray_img.save("grayscale_example.jpg")
Using OpenCV:
import cv2
# Read and display the image
img = cv2.imread("example.jpg")
cv2.imshow("Image", img)
# Resize the image
resized = cv2.resize(img, (100, 100))
cv2.imshow("Resized Image", resized)
cv2.waitKey(0)
cv2.destroyAllWindows()
Advanced Example: Resizing and Pixel Manipulation
from PIL import Image
# Open an image
img = Image.open("example.jpg")
# Extract image properties
print(f"Image Format: {img.format}")
print(f"Image Size: {img.size}")
print(f"Image Mode: {img.mode}")
# Resize the image
resized_img = img.resize((200, 200))
resized_img.save("resized_example.jpg")
# Access and modify specific pixels
pixels = img.load()
pixels[10, 10] = (255, 0, 0) # Change pixel (10,10) to red
img.save("modified_pixel_example.jpg")
3. CSV Files
Why Use CSV Files?
CSV files are the go-to format for storing and sharing tabular data because they are lightweight and supported across platforms.
Data Analytics: Importing CSV datasets for analysis in tools like pandas.
Financial Reports: Automating the generation and analysis of revenue or expense CSV files.
ETL Pipelines: Extracting and transforming CSV data for databases.
Libraries Required
Built-in csv module
pandas:
pip install pandas
Basic Example: Reading and Writing CSV
Using the csv Module
import csv
# Writing to CSV
data = [["Name", "Age"], ["Girish", 25], ["Chandra", 30]]
with open("output.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)
# Reading CSV (newline="" lets the csv module handle line endings itself)
with open("output.csv", "r", newline="") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
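When a CSV has a header row, csv.DictWriter and csv.DictReader let you work with column names instead of positional indexes. The sketch below is one way to do this; the helper names write_records and read_records and the file name people.csv are assumptions. Note that the csv module reads every value back as a string.

```python
import csv

def write_records(path, records, fieldnames):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

def read_records(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

people = [{"Name": "Girish", "Age": 25}, {"Name": "Chandra", "Age": 30}]
write_records("people.csv", people, fieldnames=["Name", "Age"])
for record in read_records("people.csv"):
    # Values come back as strings, so convert where numbers are needed
    print(record["Name"], int(record["Age"]))
```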
Using pandas:
import pandas as pd
# Read CSV
df = pd.read_csv("example.csv")
print(df)
# Write CSV
df.to_csv("output.csv", index=False)
Advanced Example: Large File Processing with pandas
import pandas as pd
# Read CSV in chunks (for large files)
chunk_size = 1000
for chunk in pd.read_csv("large_data.csv", chunksize=chunk_size):
    print(chunk.head())  # Each chunk is a DataFrame of up to chunk_size rows
4. Excel Files
Why Use Excel Files?
Excel files (.xlsx) are essential for structured data storage and analysis in the business and data science world.
Financial Modeling: Analyzing and automating financial data stored in Excel.
Inventory Management: Reading and updating inventory data in Excel sheets.
Data Reporting: Automating Excel-based reports.
Libraries Required
openpyxl:
pip install openpyxl
pandas:
pip install pandas
Basic Example: Reading and Writing Excel
Using pandas
import pandas as pd
# Write a DataFrame to Excel
data = {"Name": ["Alice", "Bob"], "Age": [25, 30]}
df = pd.DataFrame(data)
df.to_excel("output.xlsx", index=False)
# Read from Excel
df = pd.read_excel("output.xlsx")
print(df)
# Reading multiple sheets
# sheet_name=None reads all sheets and returns a dictionary of DataFrames
data = pd.read_excel("example.xlsx", sheet_name=None)
for sheet, content in data.items():
    print(f"Sheet: {sheet}")
    print(content)
Advanced Example: Applying Conditional Formatting
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
# Open workbook and apply formatting
wb = load_workbook("output.xlsx")
sheet = wb.active
# Apply red color fill to cells where Age > 25
red_fill = PatternFill(start_color="FF0000", fill_type="solid")
for row in sheet.iter_rows(min_row=2, max_col=2):
    for cell in row:
        if cell.value and isinstance(cell.value, int) and cell.value > 25:
            cell.fill = red_fill
wb.save("formatted_output.xlsx")
5. Word Files
Why Use Word Files?
Word files (.docx) are useful for generating reports, documents, and storing structured content.
Report Extraction: Extracting text and tables from Word documents for further analysis.
Automated Document Generation: Creating templated reports, invoices, or resumes.
Data Processing: Extracting embedded data like tables and images.
Libraries Required
python-docx:
pip install python-docx
Reading Paragraphs and Tables
from docx import Document
# Open Word file
doc = Document("example.docx")
# Read paragraphs
for paragraph in doc.paragraphs:
    print(paragraph.text)
def extract_tables_from_word(word_file):
    # Open the Word document
    doc = Document(word_file)
    table_data = []
    # Iterate through each table
    for table_index, table in enumerate(doc.tables):
        print(f"--- Table {table_index + 1} ---")
        for row in table.rows:
            row_data = [cell.text.strip() for cell in row.cells]
            table_data.append(row_data)
            print("\t".join(row_data))  # Print each row in a table-like format
    print("Table extraction completed.")
    return table_data
# Example usage
word_file = "example.docx" # Replace with your Word file path
tables = extract_tables_from_word(word_file)
Extracting Images from a Word Document
Purpose
Word documents often include images embedded within paragraphs or media sections. These images can be extracted for further use.
Libraries Required
Built-in
zipfile
(Word documents are essentially ZIP archives).
Code Example: Extract Images from Word Document
import zipfile
import os
def extract_images_from_word(docx_file, output_folder):
    # Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    # Open the .docx file as a ZIP archive
    with zipfile.ZipFile(docx_file, 'r') as docx_zip:
        for file in docx_zip.namelist():
            # Check for image files in the Word media folder
            if file.startswith("word/media/"):
                docx_zip.extract(file, output_folder)
                print(f"Extracted: {file}")
    print("Image extraction completed.")
# Example usage
docx_file = "example.docx" # Replace with your Word file path
output_folder = "word_images"
extract_images_from_word(docx_file, output_folder)
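Note that ZipFile.extract() preserves the archive's internal path, so the example above saves images under word_images/word/media/. If you would rather have the images directly in the output folder, one option is to read each member's bytes and write them yourself. This is a sketch, and the function name extract_media_flat is hypothetical.

```python
import os
import zipfile

def extract_media_flat(docx_file, output_folder):
    """Save files from word/media/ directly into output_folder.

    Returns the list of paths that were written.
    """
    os.makedirs(output_folder, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_file, "r") as docx_zip:
        for name in docx_zip.namelist():
            # Skip directory entries and anything outside the media folder
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(output_folder, os.path.basename(name))
                with open(target, "wb") as out:
                    out.write(docx_zip.read(name))
                saved.append(target)
    return saved

# Example usage (assumes example.docx exists):
# extract_media_flat("example.docx", "word_images")
```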
6. PDF Files
Why Use PDF Files?
PDFs are used for sharing documents with fixed formatting.
Invoice Processing: Extracting text, tables, or images from invoices for accounting systems.
Document Search: Searching and analyzing large collections of PDFs.
Legal Document Analysis: Extracting sections of legal or policy documents.
Libraries Required
PyPDF2:
pip install PyPDF2
Extracting Text from a PDF
from PyPDF2 import PdfReader
# Open PDF and extract text
reader = PdfReader("example.pdf")
for page in reader.pages:
    print(page.extract_text())
Extracting Images from a PDF
Purpose
PDF files often contain embedded images that need to be extracted for analysis, reporting, or visualization purposes.
Libraries Required
PyMuPDF (fitz):
pip install pymupdf
Code Example: Extract Images from PDF
import fitz # PyMuPDF
import os
def extract_images_from_pdf(pdf_file, output_folder):
    # Create output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)
    # Open the PDF file
    pdf = fitz.open(pdf_file)
    # Loop through all the pages
    for page_num in range(len(pdf)):
        page = pdf[page_num]
        images = page.get_images(full=True)  # List all images on the page
        for img_index, img in enumerate(images):
            xref = img[0]  # Image reference number
            base_image = pdf.extract_image(xref)
            image_bytes = base_image["image"]  # Image binary content
            image_ext = base_image["ext"]  # Image extension (e.g., jpg, png)
            # Save the image
            image_filename = f"{output_folder}/page{page_num+1}_img{img_index+1}.{image_ext}"
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)
            print(f"Saved: {image_filename}")
    print("Image extraction completed.")
# Example usage
pdf_file = "example.pdf" # Replace with your PDF file path
output_folder = "extracted_images"
extract_images_from_pdf(pdf_file, output_folder)
7. JSON Files
Why Use JSON Files?
JSON is the most common format for storing structured data, especially in APIs and web development.
APIs: Reading data returned from APIs (e.g., weather, stock prices).
Configuration Management: Storing settings in JSON format for applications.
Data Storage: Reading and writing structured data for web or mobile apps.
Libraries Required
Built-in json module.
Basic Example: Reading and Writing JSON
import json
# Writing to JSON
data = {"name": "Alice", "age": 25}
with open("output.json", "w") as file:
    json.dump(data, file, indent=4)
# Reading JSON
with open("output.json", "r") as file:
    data = json.load(file)
print(data)
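In practice, JSON files such as application settings may be missing or malformed. A defensive loader can fall back to a default value in both cases. The sketch below assumes UTF-8 files; the helper name load_json_or_default and the file name settings.json are hypothetical.

```python
import json

def load_json_or_default(path, default=None):
    """Load JSON from path, returning `default` if the file is missing or invalid."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default
    except json.JSONDecodeError as exc:
        # Report where parsing failed, then fall back to the default
        print(f"Invalid JSON in {path}: {exc}")
        return default

config = load_json_or_default("settings.json", default={})
print(config)
```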
Conclusion
Python's file handling is versatile, efficient, and essential for working with data in various formats. This guide covered everything from basic operations to advanced use cases across:
Text
Images
CSV
Excel
Word
PDF
JSON
By mastering these, you’ll unlock powerful capabilities for Data Analysis, Automation, and Machine Learning workflows.