Learn Web Scraping with Python: Build Your Own Datasets from Scratch
Don't wait for Kaggle. If your data exists on the web, you can get it.

The Problem
As a data science student, I often hit a wall: I have a question, but I don’t have the data.
In most practicals and tutorials, we are usually given clean .csv files, like the famous iris.csv or titanic.csv datasets. But in the real world, the data you need isn’t always neatly stored in an Excel file. It is often locked inside websites, hidden behind HTML tags.
I realized that if I wanted to analyze real-world trends, like comparing laptop prices in Sri Lanka or tracking news sentiment, I needed to build my own datasets. This technique is called web scraping.
The Tech Stack
To scrape a website, we don’t need complex tools. We need three Python libraries:
requests: sends a request to the website and retrieves the HTML.
BeautifulSoup (bs4): parses the HTML received from the website so we can find the data we want.
pandas: saves the extracted data into a structured .csv file.
Install all three with pip:
pip install requests beautifulsoup4 pandas
The Project: Scrape a Bookstore
For starters, we will scrape Books to Scrape, a test website. Our goal is to get the title and price of every book on the front page and save them to a CSV file.
Step 1: The Setup
First, we have to import our tools and set our target URL.
import requests
from bs4 import BeautifulSoup
import pandas as pd
# URL we want to scrape
url = 'https://books.toscrape.com/'
# Pretend to be a real browser (this is the best practice)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
Many websites block or restrict bots, scrapers, automated scripts, and suspicious traffic. When your scraper sends a request without setting a User-Agent header, requests fills in its own default, which looks like this:
User-Agent: python-requests/2.31.0
To a website, this is a huge red flag. Some sites instantly reject that request. So we replace it with a common browser signature, like Chrome on Windows:
Mozilla/5.0 (Windows NT 10.0; Win64; x64)
This makes the server treat your scraper like a normal human visitor.
Without it, some websites will:
Block the IP
Return empty HTML
Show a CAPTCHA
Send you a 403 Forbidden
Redirect you to an error page
All of these will break your scraper.
What the User-Agent actually does
The User-Agent is a simple piece of text that tells the website:
“Hi, I’m Chrome version X running on Windows”
It does not give you full access to the website, but it reduces suspicion. Setting a User-Agent is safe, expected, and considered good practice for every scraper. Even harmless sites may block Python-style requests because they assume you are a bot aggressively collecting data.
The point of setting a User-Agent is not to bypass security; it is only to prevent unnecessary blocking when scraping public data. If a website intentionally blocks scraping, you must respect that.
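To see the difference between the default and the custom header without contacting any server, here is a small sketch. It builds a request but never sends it, and it reuses the headers dictionary from the setup step. The variable names default_ua and prepared are just illustrative.

```python
import requests

# Same header dictionary as in the setup step.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# What requests sends when no User-Agent is set, e.g. "python-requests/2.31.0".
default_ua = requests.utils.default_user_agent()
print("Default:", default_ua)

# Build (but do not send) a request to inspect the headers it would carry.
session = requests.Session()
request = requests.Request('GET', 'https://books.toscrape.com/', headers=headers)
prepared = session.prepare_request(request)
print("Custom: ", prepared.headers['User-Agent'])
```

The exact version number in the default string depends on which release of requests you have installed.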
Step 2: Fetching the HTML
When we scrape a website, the first step is to download its HTML. This is done using requests.get().
response = requests.get(url, headers=headers)
# Check whether the request succeeded (status code == 200)
if response.status_code == 200:
    print("We are connected")
    soup = BeautifulSoup(response.content, 'html.parser')
else:
    print('Failed to connect and retrieve page')
What this does:
requests.get(url, headers=headers)
This sends an HTTP GET request (over HTTPS for this URL) to the website, asking for its HTML. The response includes the page’s HTML, the status code, and the response headers.
If the request is successful, BeautifulSoup parses the HTML and returns a structure we can search.
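A slightly more defensive variant of the same fetch step might look like the sketch below. The helper name fetch_soup and the 10-second timeout are assumptions for illustration, not part of the walkthrough above. It uses raise_for_status() so that any 4xx or 5xx response is treated as a failure, and a timeout so the script does not hang on a slow server.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers=None, timeout=10):
    """Download a page and return it parsed, or None if the request fails."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Failed to retrieve {url}: {error}")
        return None
    return BeautifulSoup(response.content, 'html.parser')
```

Calling fetch_soup(url, headers=headers) would then give you either a parsed page or None, which you can check before moving on to extraction.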
Step 3: The Inspector Trick - How We Find the Right HTML Tags
Before writing any scraping code, we must inspect the website (Books to Scrape) to understand where the data is located.
Open the Webpage
Right-click on a book
Click Inspect
This opens the browser’s Developer tools and highlights the exact HTML of the selected book.
When you inspect one book, you will notice:
- Every book is inside an
<article class="product_pod">
- The book title is inside
<h3><a title="Book Name"></a></h3>
- The Price is inside
<p class="price_color">£51.77</p>
This is why we target these tags in our Python code.
If you don’t inspect the HTML, you won’t know what to extract.
Now that we know the HTML structure, we can scrape all the books on the page.
# Find all book containers
books = soup.find_all('article', class_='product_pod')
book_data = []
for book in books:
    # Extract the title (it's in the 'title' attribute of the 'a' tag within 'h3')
    title = book.h3.a['title']
    # Extract the price
    price = book.find('p', class_='price_color').text
    # Store the extracted title and price
    book_data.append({
        'Book Title': title,
        'Price': price
    })
print(f"Scraped {len(book_data)} books")
This builds a list like:
[
    {"Book Title": "A Light in the Attic", "Price": "£51.77"},
    {"Book Title": "Tipping the Velvet", "Price": "£53.74"},
    ...
]
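If you want to try the selectors without touching the network, you can run them against a small hand-written snippet. The HTML below is an assumption that mimics one book card, not a copy of the real page. It also shows why the code reads the title attribute: the visible link text may be shortened, while the attribute holds the full name.

```python
from bs4 import BeautifulSoup

# A hand-written snippet that mimics one book card on the page.
sample_html = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
book = soup.find('article', class_='product_pod')

title = book.h3.a['title']   # full title stored in the attribute
link_text = book.h3.a.text   # visible text, possibly shortened
price = book.find('p', class_='price_color').text

print(title, '|', link_text, '|', price)
```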
Step 4: Saving your Data to a CSV File
Now that we’ve extracted all the book titles and prices into a Python list, the next step is to convert it into a dataset. This is where pandas comes in.
df = pd.DataFrame(book_data)
# Clean the data: Remove the Currency '£' symbol
df['Price'] = df['Price'].str.replace('£', '')
# Save to CSV
df.to_csv('my_books_dataset.csv', index=False)
print("Dataset saved successfully!")
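One thing to keep in mind: after the currency symbol is removed, the Price column still holds text, so numeric operations such as averages will not work on it directly. Here is a minimal sketch of converting it to numbers, using two sample rows in place of the scraped list. The regex=False argument is an addition that makes the replacement literal.

```python
import pandas as pd

# Two rows in the same shape as book_data from the scraping step.
book_data = [
    {"Book Title": "A Light in the Attic", "Price": "£51.77"},
    {"Book Title": "Tipping the Velvet", "Price": "£53.74"},
]

df = pd.DataFrame(book_data)

# Strip the currency symbol literally, then convert the text to floats.
df['Price'] = df['Price'].str.replace('£', '', regex=False).astype(float)

print(df.dtypes)
print(df['Price'].mean())
```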
You now have a real dataset created entirely from scratch. You can open the CSV in any data analysis tool.
This is how real data science projects start. You find the data, you collect it, and you prepare it for analysis.
A Note on Ethics
Scraping is powerful—but don’t abuse it. Keep it clean:
Check robots.txt. Visit https://example.com/robots.txt to see which pages are allowed for bots. If scraping is disallowed, respect it.
Don’t overload the server. Avoid sending hundreds of requests per second. Use time.sleep() between requests to pace your scraper.
Respect data privacy. Only scrape public information. Never collect personal or sensitive data without permission.
Follow these rules, and your scraping stays legal, ethical, and respectful.
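The first two rules can be sketched in code with Python's built-in urllib.robotparser and time.sleep(). Everything specific here is an assumption for illustration: the robots.txt content, the example.com URLs, the "my-scraper" agent name, and the polite_loop helper. For a real site you would call set_url() and read() on the parser to load the live file instead of passing lines by hand.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is fetched from the site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

urls = [
    "https://example.com/catalogue/page-1.html",
    "https://example.com/private/report.html",
]

# Keep only the URLs that the rules allow our scraper to fetch.
allowed = [u for u in urls if parser.can_fetch("my-scraper", u)]


def polite_loop(urls, delay_seconds=1.0):
    """Yield each URL, pausing between them so requests are spaced out."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        yield url


for url in polite_loop(allowed, delay_seconds=1.0):
    print("Would fetch:", url)  # replace with requests.get(url, ...)
```

A one-second delay is only a starting point; a slower pace is kinder to small sites.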
The Catch: Static vs. Dynamic Websites
Before you try to scrape your favourite social media feed, you need to know a critical limitation. This method (requests + BeautifulSoup) only works for Static Websites.
How to test your site:
Open the website in Chrome.
Press F12 to open Developer Tools.
Press Ctrl + Shift + P (Command + Shift + P on macOS) and type “Disable JavaScript.”
Reload the page.
If the data disappears, it is a Dynamic Site. You will need advanced tools like Selenium or Playwright to scrape it. (Which I will cover in a future article!)
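You can run a similar check from Python: if the elements you want are missing from the raw HTML, the page is probably filling them in with JavaScript. The sketch below uses two made-up HTML strings in place of real responses, and looks_static is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def looks_static(html, tag, class_name):
    """Return True if the raw HTML already contains the target elements."""
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.find_all(tag, class_=class_name)) > 0


# Hypothetical responses: one server-rendered, one filled in later by JavaScript.
static_html = '<article class="product_pod"><h3><a title="Book">Book</a></h3></article>'
dynamic_html = '<div id="app"></div><script src="bundle.js"></script>'

print(looks_static(static_html, 'article', 'product_pod'))   # True
print(looks_static(dynamic_html, 'article', 'product_pod'))  # False
```

In practice you would pass in response.content from requests.get() and the tag and class you found with the inspector.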
Building your own dataset isn’t complicated; you need a target website, a scraping plan, and a few lines of Python. With requests, BeautifulSoup, and pandas, you can collect, clean, and save structured data in minutes. Once you master the basics, you can scale up, automate, and create datasets tailored to any project. This skill is a force multiplier in data science, letting you work with information that isn’t available anywhere else.



