This project demonstrates a Python script that extracts data about the U.S. states from a Wikipedia page listing them. For each state, the script fetches its name, population, and area.
- Language: Python
- Libraries: requests, BeautifulSoup, pandas, re
- Data Extraction: The script navigates through a Wikipedia page, parsing the HTML content to extract relevant data.
- Data Points: For each state, the script retrieves the state's name, population size, and area.
- Special Cases Handling: The script adjusts its extraction logic for states whose capital is also the largest city, since this changes the HTML structure of the row.
- Data Cleaning: Extracted values are cleaned (e.g., commas are removed from numbers) and converted to appropriate data types, as sketched after this list.
- Data Representation: The final output is a pandas DataFrame, showcasing the data in a structured format.
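The cleaning step can be sketched roughly as follows; the helper name `clean_number` and the exact characters it strips are illustrative assumptions, since the script itself is not reproduced here:

```python
import re

def clean_number(raw: str) -> int:
    """Convert a scraped figure such as '5,024,279[3]' to the integer 5024279."""
    without_footnotes = re.sub(r"\[.*?\]", "", raw)  # drop footnote markers like [3]
    return int(without_footnotes.replace(",", "").strip())

clean_number("52,420")  # -> 52420
```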
- Importing Libraries: The script begins by importing necessary Python libraries.
- Fetching the Web Page: Using the `requests` library, the script fetches the Wikipedia page.
- Parsing HTML: The `BeautifulSoup` library is used to parse the HTML content.
- Inspecting Row Structure: The script examines each row of the HTML table to locate and extract the state name, population, and area.
- Handling Layout Shifts: For states whose row layout differs (the capital is also the largest city), the script adjusts its extraction logic accordingly.
- Creating DataFrame: Extracted data is then structured into a pandas DataFrame for easy viewing and analysis.
- Displaying Data: The script finally displays the data for the first five states as an example.
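A minimal sketch of this workflow is shown below. The page URL, the `colspan`-based layout check, and the column indices are assumptions made for illustration rather than code taken from the original script; verify them against the live table's HTML.

```python
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed source page; the original project's exact URL is not shown here.
URL = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"


def clean_number(raw: str) -> int:
    """Same cleaning helper sketched earlier: '5,024,279[3]' -> 5024279."""
    return int(re.sub(r"\[.*?\]", "", raw).replace(",", "").strip())


def scrape_states() -> pd.DataFrame:
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")  # assumed: first wikitable holds the state rows

    records = []
    for row in table.find_all("tr"):
        header = row.find("th")  # state rows keep the name in a row-header cell
        cells = row.find_all("td")
        if header is None or not cells:
            continue  # skip column-header and footer rows

        # When the capital is also the largest city, the two city cells are merged
        # into one (colspan="2"), so the later columns shift left by one position.
        merged = any(td.has_attr("colspan") for td in cells)
        offset = 0 if merged else 1

        try:
            population = clean_number(cells[3 + offset].get_text())  # assumed population column
            area = clean_number(cells[4 + offset].get_text())        # assumed area (sq mi) column
        except (IndexError, ValueError):
            continue  # row does not match the expected state layout

        records.append(
            {"State": header.get_text(strip=True), "Population": population, "Area": area}
        )

    return pd.DataFrame(records)


if __name__ == "__main__":
    print(scrape_states().head())  # first five states, as in the sample below
```

In practice you would confirm the column order and the 50-row count against the rendered page before trusting the indices above.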
The output is a DataFrame with columns for state name, population, and area. Here's a sample for the first five states:
| State | Population | Area (sq mi) |
|---|---|---|
| Alabama | 5024279 | 52420 |
| Alaska | 733391 | 665384 |
| Arizona | 7151502 | 113990 |
| Arkansas | 3011524 | 53179 |
| California | 39538223 | 163695 |
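Because the figures are stored as integers rather than strings, the DataFrame supports numeric operations directly. A small, self-contained illustration using the sample rows above (the density column is added here purely as an example):

```python
import pandas as pd

# Sample rows from the table above.
df = pd.DataFrame(
    {
        "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
        "Population": [5024279, 733391, 7151502, 3011524, 39538223],
        "Area": [52420, 665384, 113990, 53179, 163695],
    }
)

# Cleaned integer columns allow arithmetic without further conversion,
# e.g. people per square mile.
df["Density"] = (df["Population"] / df["Area"]).round(1)
print(df.sort_values("Density", ascending=False))
```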
This project is a practical example of using web-scraping techniques to gather data from a public webpage, and it shows how Python and its libraries simplify data extraction and manipulation.