The Web Page Topic Extractor is a Python tool that fetches content from specified URLs and identifies the most frequent topics in the text of each page. It uses natural language processing techniques to filter and analyze the text, which makes it useful for quick content summarization.
- Web Scraping: Fetches web page content using the `requests` library.
- HTML Parsing: Parses HTML content to extract text using `BeautifulSoup`.
- Text Analysis: Processes text to remove common stopwords and non-alphabetic characters, and identifies common topics using the `nltk` library.
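The three stages above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `fetch_text` and `top_words` are hypothetical, and a simple regular expression stands in for NLTK tokenization so that the analysis step can be tried without any downloads.

```python
import re
from collections import Counter


def fetch_text(url, timeout=10):
    """Download a page and return its text (needs requests and beautifulsoup4)."""
    # Imported here so the analysis helper below works without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ")


def top_words(text, stop_words, n=5):
    """Return the n most frequent alphabetic words that are not stopwords."""
    # Assumption: regex tokenization replaces NLTK's tokenizer in this sketch.
    words = re.findall(r"[a-z]+", text.lower())
    stop_words = set(stop_words)
    counts = Counter(w for w in words if w not in stop_words)
    return [word for word, _ in counts.most_common(n)]
```

With the NLTK stopwords downloaded, the two helpers could be combined as `top_words(fetch_text(url), stopwords.words('english'))`, where `stopwords` comes from `nltk.corpus`.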
This project requires Python 3.6 or higher, along with several external libraries. Below are the necessary dependencies:
- Python 3.6+
- `requests`
- `beautifulsoup4`
- `nltk`
To set up the project on your local machine, follow these steps:
- Clone the repository or download the source code.

- Install the required Python libraries using pip:

      pip install requests beautifulsoup4 nltk

- Ensure that the required NLTK resources are downloaded by running the following commands in your Python environment:

      import nltk
      nltk.download('punkt')
      nltk.download('stopwords')
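If the download commands are run on every start, NLTK has to check its data each time. One possible way to download only what is missing is sketched below. The helper `ensure_nltk_resources` and the `REQUIRED` mapping are hypothetical names and not part of the project; the sketch relies on `nltk.data.find` raising `LookupError` for a resource that is not installed.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource only if it is not already installed.

    `resources` maps a download name (e.g. 'punkt') to its data path
    (e.g. 'tokenizers/punkt'). `find` and `download` default to
    nltk.data.find and nltk.download; they are parameters only so the
    helper can be exercised without network access.
    """
    if find is None or download is None:
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name, path in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# The two resources named in the setup step, with their NLTK data paths.
REQUIRED = {"punkt": "tokenizers/punkt", "stopwords": "corpora/stopwords"}
```

Calling `ensure_nltk_resources(REQUIRED)` returns the names that had to be downloaded, which is an empty list once everything is in place.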
To run the script as a standalone application:
- Open your terminal or command prompt.

- Navigate to the directory containing the script.

- Modify the `urls` list in the script to include the URLs you want to analyze.

- Execute the script by running:

      python topic_extractor.py
You can also import the functionality into your own Python projects:
- Ensure the script `topic_extractor.py` is in your project directory.

- Import the `common_topics` function:

      from topic_extractor import common_topics

- Use the `common_topics` function to analyze URLs:

      topics = common_topics('http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/')
      print("Identified Topics:", topics)
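When several URLs are analyzed from your own code, one unreachable page should not abort the whole run. A minimal sketch of such a loop follows. The helper `analyze_urls` is hypothetical, and it assumes that the extraction function raises an exception when a page cannot be processed, which this README does not confirm for `common_topics`.

```python
def analyze_urls(urls, extract):
    """Apply `extract` to each URL, collecting results and failures separately."""
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = extract(url)
        except Exception as exc:  # broad on purpose: one bad page should not stop the batch
            errors[url] = str(exc)
    return results, errors
```

Passing `common_topics` as the `extract` argument would give a dictionary of topics per URL and a second dictionary of error messages for the pages that failed.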
Below is an example of how the Web Page Topic Extractor processes a given URL and outputs the identified topics:
- URL: http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/
- Identified Topics: ['cnn', 'ad', 'video', 'snowden', 'nsa']