Python Web Scraping: Create Custom Datasets

The Problem

As a data science student, I often hit a wall: I have a question, but I don’t have the data.

In most practicals and tutorials, we are usually given clean .csv files, like the famous iris.csv or titanic.csv datasets. But in the real world, the data you need isn’t always neatly stored in an Excel file. It is often locked inside websites, hidden behind HTML tags.

I realized that if I wanted to analyze real-world trends, like comparing laptop prices in Sri Lanka or tracking news sentiment, I needed to build my own datasets. Collecting data directly from web pages like this is called web scraping.

The Tech Stack

To scrape a website, we don’t need complex tools. We need three Python libraries:

requests: sends a request to the website and receives its HTML
BeautifulSoup (bs4): reads the HTML received from the website to find the data we want
pandas: saves the extracted data into a structured .csv file

Install all three with pip:

pip install requests beautifulsoup4 pandas

The Project: Scrape a Bookstore

For starters, we will scrape Books to Scrape, a test website built for practising scraping. Our goal is to get the title and price of every book on the front page and save them to a CSV file.

Step 1: The Setup

First, we have to import our tools and set our target URL.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL we want to scrape
url = 'https://books.toscrape.com/'

# Identify as a common browser (the reasoning is explained below)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

Many websites block or restrict bots, scrapers, and other automated traffic. When your scraper sends a request without setting a User-Agent header, requests fills in its own, which looks like this:

User-Agent: python-requests/2.31.0

To a website, this is a huge red flag. Some sites instantly reject that request. So we replace it with a common browser signature, like Chrome on Windows:

Mozilla/5.0 (Windows NT 10.0; Win64; x64)

With this header, the server sees a request that looks like it came from an ordinary browser.

A site that decides a request comes from a bot may:

Block the IP
Return empty HTML
Show a CAPTCHA
Send you a 403 Forbidden
Redirect you to an error page

Any of these will break your scraper.

The User-Agent itself is just a short piece of text that tells the website which client is making the request, for example:

“Hi, I’m Chrome version X running on Windows”

It does not give you any extra access to the website; it only makes your request less likely to be rejected outright. Setting a User-Agent is common practice for scrapers, because even harmless sites often block the default Python signature on the assumption that a bot is aggressively collecting data.

Setting a User-Agent is not a way to bypass security. It only prevents unnecessary blocking when you scrape public data. If a website intentionally blocks scraping, you must respect that.
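
To see the difference for yourself, the short sketch below prints the User-Agent that requests would send by default and the one it sends once we override it. Nothing goes over the network here; the request is only prepared, not sent. The variable name browser_ua is my own, and the string is the same Chrome signature used in the setup.

```python
import requests

# The User-Agent that requests sends when we don't set one ourselves.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# The browser signature from the setup step (variable name is illustrative).
browser_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
)

# Build the request without sending it, so we can inspect its headers.
session = requests.Session()
request = requests.Request(
    'GET',
    'https://books.toscrape.com/',
    headers={'User-Agent': browser_ua},
)
prepared = session.prepare_request(request)
print(prepared.headers['User-Agent'])
```

The second print should show the browser string, which confirms that the header we pass replaces the library default.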

Step 2: Fetching the HTML

When we scrape a website, the first step is to download its HTML. This is done using requests.get().

response = requests.get(url, headers=headers)

# Check whether the request succeeded (status code 200)
if response.status_code == 200:
    print("We are connected")
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print(f'Failed to connect and retrieve the page (status {response.status_code})')

Here is what happens:

requests.get(url, headers=headers)

This sends an HTTP GET request to the website asking for its HTML. The response carries the page’s HTML along with the status code and the response headers.

If the request is successful, BeautifulSoup parses the HTML and returns a structure we can search.
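
The check above only looks for status 200 and would wait indefinitely on a server that never answers. A slightly more defensive version, sketched below, adds a timeout and lets requests raise an error for 4xx and 5xx responses. The helper name fetch_soup and the 10-second timeout are my own choices, not part of the original script.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and return the parsed HTML, or None if the request fails.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        print(f'Failed to retrieve {url}: {error}')
        return None
    return BeautifulSoup(response.content, 'html.parser')
```

With this helper, soup = fetch_soup(url, headers) gives you either a searchable structure or None, so the rest of the script can check for None before it starts extracting.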

Step 3: The Inspector Trick - How we Find the right HTML tags

Before writing any scraping code, we must inspect the website (Books to Scrape) to understand where the data is located.

Open the Webpage
Right-click on a book
Click Inspect

This opens the browser’s Developer tools and highlights the exact HTML of the selected book.

When you inspect one book, you will notice:

Every book is inside an

<article class="product_pod">

The book title is inside

<h3><a title="Book Name"></a></h3>

The Price is inside

<p class="price_color">£51.77</p>

This is why we target these tags in our Python code.

If you don’t inspect the HTML first, you won’t know which tags to extract.

Now that we know the HTML structure, we can scrape all the books on the page.

# Find all book containers
books = soup.find_all('article', class_='product_pod')

book_data = []

for book in books:
    # Extract the title (stored in the 'title' attribute of the 'a' tag inside 'h3')
    title = book.h3.a['title']

    # Extract the price
    price = book.find('p', class_='price_color').text

    #Store the extracted title and price
    book_data.append({
        'Book Title': title,
        'Price': price
    })

print(f"Scraped {len(book_data)} books")

This builds a list like:

[
  {"Book Title": "A Light in the Attic", "Price": "£51.77"},
  {"Book Title": "Tipping the Velvet", "Price": "£53.74"},
  ...
]
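
The loop can also be tried without touching the network. The sketch below runs the same extraction against a small hand-written fragment that mimics the structure seen in the inspector. The fragment is my own sample, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mirrors the tags found in the inspector.
sample_html = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="#" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

book_data = []
for book in soup.find_all('article', class_='product_pod'):
    book_data.append({
        # The full title lives in the 'title' attribute; the link text may be truncated.
        'Book Title': book.h3.a['title'],
        'Price': book.find('p', class_='price_color').text,
    })

print(book_data)
```

Reading the title attribute rather than the link text matters here, because the visible text of the first link is cut short while the attribute holds the full name.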

Step 4: Saving your Data to a CSV File

Now that we’ve extracted all the book titles and prices into a Python list, the next step is to convert it into a dataset. This is where pandas comes in.

df = pd.DataFrame(book_data)

# Clean the data: Remove the Currency '£' symbol
df['Price'] = df['Price'].str.replace('£', '')

# Save to CSV
df.to_csv('my_books_dataset.csv', index=False)
print("Dataset saved successfully!")

You now have a real dataset created entirely from scratch. You can open the CSV in any data analysis tool.
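
One thing to watch: after the '£' symbol is removed, the Price column still holds text, not numbers. If you want to compute averages or sort by price, one option is to convert the column to floats, as in this sketch. The two sample rows are the ones shown earlier; the astype(float) step is my addition.

```python
import pandas as pd

# Two sample rows in the same shape as the scraped list.
book_data = [
    {'Book Title': 'A Light in the Attic', 'Price': '£51.77'},
    {'Book Title': 'Tipping the Velvet', 'Price': '£53.74'},
]

df = pd.DataFrame(book_data)

# Strip the currency symbol as plain text (no regex), then convert to float.
df['Price'] = df['Price'].str.replace('£', '', regex=False).astype(float)

print(df.dtypes)
print(df['Price'].mean())
```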

This is how real Data Science projects start. You find the data, you collect it, and you prepare it for Analysis.

A Note on Ethics

Scraping is powerful, but don’t abuse it. Keep it clean:

Check robots.txt. Visit the site’s robots.txt file (for example, https://example.com/robots.txt) to see which pages are allowed for bots. If scraping is disallowed, respect it.
Don’t overload the server. Avoid sending hundreds of requests per second. Use time.sleep() between requests to pace your scraper.
Respect data privacy. Only scrape public information. Never collect personal or sensitive data without permission.

Following these rules keeps your scraping respectful and goes a long way toward keeping it ethical and legal.
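
The first two rules can be checked in code with Python's standard library. The sketch below parses a made-up robots.txt and asks whether two URLs may be fetched, then defines a small generator that pauses between requests. The rules, the URLs and the helper name paced are illustrative assumptions; a real script would load the site's actual file with parser.set_url(...) and parser.read().

```python
import time
from urllib.robotparser import RobotFileParser

# Example rules, made up for illustration.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic bot ('*') may fetch each URL.
allowed = parser.can_fetch('*', 'https://example.com/catalogue/page-1.html')
blocked = parser.can_fetch('*', 'https://example.com/private/data.html')
delay = parser.crawl_delay('*')
print(allowed, blocked, delay)


def paced(urls, delay_seconds):
    """Yield URLs one at a time, sleeping between consecutive items."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        yield url
```

A loop such as for url in paced(page_urls, delay or 1) then spaces the requests out, falling back to one second when the site declares no crawl delay.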

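The Catch: Static vs. Dynamic Websites

Before you try to scrape your favourite social media feed, you need to know a critical limitation: this method (requests + BeautifulSoup) only works for static websites, where the data is already present in the HTML the server sends. Dynamic websites load their data afterwards with JavaScript, which requests never runs.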

How to Test your site:

Open the website in Chrome.
Press F12 to open Developer Tools.
Press Ctrl + Shift + P (Cmd + Shift + P on macOS) and type “Disable JavaScript”.
Reload the Page

If the data disappears, it is a Dynamic Site. You will need advanced tools like Selenium or Playwright to scrape it. (Which I will cover in a future article!)
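
The same test can be approximated in code. Since requests never runs JavaScript, the HTML it downloads is effectively the page with JavaScript disabled, so you can simply check whether the elements you want are present in it. The sketch below does this for two made-up pages; the helper name has_static_books and both HTML strings are hypothetical.

```python
from bs4 import BeautifulSoup


def has_static_books(html):
    """Return True if book containers are present in the raw HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all('article', class_='product_pod')) > 0


# A static page ships its data inside the HTML itself.
static_page = (
    '<section><article class="product_pod">'
    '<h3><a title="Book Name"></a></h3>'
    '</article></section>'
)

# A dynamic page ships an empty shell plus a script that fills it in later.
dynamic_page = '<div id="app"></div><script src="bundle.js"></script>'

print(has_static_books(static_page))
print(has_static_books(dynamic_page))
```

If the check fails on HTML fetched with requests but the data is visible in your browser, the page is most likely dynamic.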

Building your own dataset isn’t complicated; you need a target website, a scraping plan, and a few lines of Python. With requests, BeautifulSoup, and pandas, you can collect, clean, and save structured data in minutes. Once you master the basics, you can scale up, automate, and create datasets tailored to any project. This skill is a force multiplier in data science, letting you work with information that isn’t available anywhere else.
