This project demonstrates a Python script that extracts data about the U.S. states from a Wikipedia page listing them. For each state, the script fetches its name, population, and area.
- Language: Python
- Libraries: requests, BeautifulSoup, pandas, re
- Data Extraction: The script navigates through a Wikipedia page, parsing the HTML content to extract relevant data.
- Data Points: For each state, the script retrieves the state's name, population size, and area.
- Special Cases Handling: The script adjusts its extraction logic for states whose capital is also the largest city, since this changes the HTML structure of the row.
- Data Cleaning: Extracted values are cleaned (e.g., commas are removed from numbers) and converted to appropriate data types, as sketched after this list.
- Data Representation: The final output is a pandas DataFrame, showcasing the data in a structured format.
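The cleaning step can be sketched roughly as follows; the helper name `clean_number` and the exact characters it strips are illustrative assumptions, since the script itself is not reproduced here:

```python
import re

def clean_number(raw: str) -> int:
    """Convert a scraped figure such as '5,024,279[3]' to the integer 5024279."""
    without_footnotes = re.sub(r"\[.*?\]", "", raw)  # drop footnote markers like [3]
    return int(without_footnotes.replace(",", "").strip())

clean_number("52,420")  # -> 52420
```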
- Importing Libraries: The script begins by importing necessary Python libraries.
- Fetching the Web Page: Using the `requests` library, the script fetches the Wikipedia page.
- Parsing HTML: The `BeautifulSoup` library is used to parse the HTML content.
- Inspecting Row Structure: The script examines each row of the HTML table to locate and extract the state name, population, and area.
- Handling Layout Shifts: For states whose row layout differs (the capital is also the largest city), the script adjusts its extraction logic accordingly.
- Creating DataFrame: Extracted data is then structured into a pandas DataFrame for easy viewing and analysis.
- Displaying Data: The script finally displays the data for the first five states as an example.
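A minimal sketch of this workflow is shown below. The page URL, the `colspan`-based layout check, and the column indices are assumptions made for illustration rather than code taken from the original script; verify them against the live table's HTML.

```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed source page; the original project's exact URL is not shown here.
URL = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"


def clean_number(raw: str) -> int:
    """Same cleaning helper sketched earlier: '5,024,279[3]' -> 5024279."""
    return int(re.sub(r"\[.*?\]", "", raw).replace(",", "").strip())


def scrape_states() -> pd.DataFrame:
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")  # assumed: first wikitable holds the state rows

    records = []
    for row in table.find_all("tr"):
        header = row.find("th")  # state rows keep the name in a row-header cell
        cells = row.find_all("td")
        if header is None or not cells:
            continue  # skip column-header and footer rows

        # When the capital is also the largest city, the two city cells are merged
        # into one (colspan="2"), so the later columns shift left by one position.
        merged = any(td.has_attr("colspan") for td in cells)
        offset = 0 if merged else 1

        try:
            population = clean_number(cells[3 + offset].get_text())  # assumed population column
            area = clean_number(cells[4 + offset].get_text())        # assumed area (sq mi) column
        except (IndexError, ValueError):
            continue  # row does not match the expected state layout

        records.append(
            {"State": header.get_text(strip=True), "Population": population, "Area": area}
        )

    return pd.DataFrame(records)


if __name__ == "__main__":
    print(scrape_states().head())  # first five states, as in the sample below
```

In practice you would confirm the column order and the 50-row count against the rendered page before trusting the indices above.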
The output is a DataFrame with columns for state name, population, and area. Here's a sample for the first five states:
| State | Population | Area (sq mi) |
|---|---|---|
| Alabama | 5024279 | 52420 |
| Alaska | 733391 | 665384 |
| Arizona | 7151502 | 113990 |
| Arkansas | 3011524 | 53179 |
| California | 39538223 | 163695 |
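Because the figures are stored as integers rather than strings, the DataFrame supports numeric operations directly. A small, self-contained illustration using the sample rows above (the density column is added here purely as an example):

```python
import pandas as pd

# Sample rows from the table above.
df = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "Population": [5024279, 733391, 7151502, 3011524, 39538223],
        "Area": [52420, 665384, 113990, 53179, 163695],
    }
)

# Cleaned integer columns allow arithmetic without further conversion,
# e.g. people per square mile.
df["Density"] = (df["Population"] / df["Area"]).round(1)
print(df.sort_values("Density", ascending=False))
```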
This project is a practical example of using web-scraping techniques to gather data from a public webpage, and it shows how Python and its libraries simplify data extraction and manipulation.