Web Page Topic Extractor

The Web Page Topic Extractor is a Python tool that fetches the content of specified URLs and identifies the most frequent relevant terms in each page's text. It uses natural language processing techniques (stopword removal and word-frequency counting) to filter and analyze the text, which makes it useful for quick content summarization.

Features

Web Scraping: Fetches web page content using the requests library.
HTML Parsing: Parses HTML content to extract text using BeautifulSoup.
Text Analysis: Processes text to remove common stopwords and non-alphabetic characters, and identifies common topics using the nltk library.
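
The parsing and analysis steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `html.parser` stands in for BeautifulSoup, a small hard-coded `STOPWORDS` set stands in for the nltk stopword list, and the names `html_to_text` and `top_topics` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword set (assumption); the real tool uses nltk's list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "on", "for", "was", "that", "by", "with", "as", "at"}


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the page text with markup, scripts, and styles removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)


def top_topics(text, n=5):
    """Return the n most frequent alphabetic words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return [word for word, _ in counts.most_common(n)]
```

Calling `top_topics(html_to_text(page_source))` on a downloaded page would give a ranked word list comparable in spirit to what the tool reports, although the exact results depend on the stopword list and tokenizer used.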

Requirements

This project requires Python 3.6 or higher, along with several external libraries. Below are the necessary dependencies:

Python 3.6+
requests
beautifulsoup4
nltk

Installation

To set up the project on your local machine, follow these steps:

Clone the repository or download the source code.
Install the required Python libraries using pip:
```
pip install requests beautifulsoup4 nltk
```
Ensure that the required NLTK resources are downloaded by running the following commands in your Python environment:
```
import nltk
nltk.download('punkt')
nltk.download('stopwords')
```
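
To avoid downloading the resources again on every run, a check-then-download helper is one option. The sketch below is an assumption rather than part of the project: `ensure_nltk_resources` and its `find` and `download` parameters are hypothetical names. By default it relies on `nltk.data.find`, which raises `LookupError` when a resource is missing, and on `nltk.download`.

```python
# Resource path (as looked up by nltk.data.find) -> name passed to nltk.download.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "corpora/stopwords": "stopwords",
}


def ensure_nltk_resources(resources=None, find=None, download=None):
    """Download only the missing resources; return the names that were fetched.

    `find` and `download` default to nltk.data.find and nltk.download; they are
    parameters only so the helper can be exercised without network access.
    """
    if resources is None:
        resources = REQUIRED
    if find is None or download is None:
        import nltk  # imported lazily so the module loads without nltk installed
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, name in resources.items():
        try:
            find(path)
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_resources()` once at startup would then be enough; it returns an empty list when everything is already installed.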

Usage

Running as a Standalone Script

To run the script as a standalone application:

Open your terminal or command prompt.
Navigate to the directory containing the script.
Modify the urls list in the script to include the URLs you want to analyze.
Execute the script by running:
```
python topic_extractor.py
```
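
The script's handling of the `urls` list is not shown here, so the following is only a guess at its general shape: a loop that runs an extractor on each URL and keeps going if one page fails. The names `analyze_urls` and `print_report` are hypothetical, and `extract` is a placeholder for whatever function does the work (for example `common_topics`).

```python
def analyze_urls(urls, extract):
    """Run `extract` on each URL and return (url, topics, error) tuples."""
    results = []
    for url in urls:
        try:
            results.append((url, extract(url), None))
        except Exception as exc:  # one failing page should not stop the batch
            results.append((url, [], str(exc)))
    return results


def print_report(results):
    """Print one short block per URL."""
    for url, topics, error in results:
        print("URL:", url)
        if error:
            print("Failed:", error)
        else:
            print("Identified Topics:", topics)
```

With the real module available, `print_report(analyze_urls(urls, common_topics))` would be the equivalent of a script-level main loop.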

Importing in Your Project

You can also import the functionality into your own Python projects:

Ensure the script topic_extractor.py is in your project directory.

Import the common_topics function:

```
from topic_extractor import common_topics
```

Use the common_topics function to analyze a URL:

```
topics = common_topics('http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/')
print("Identified Topics:", topics)
```

Example Output

Below is an example of how the Web Page Topic Extractor processes a given URL and outputs the identified topics:

```
URL: http://www.cnn.com/2013/06/10/politics/edward-snowden-profile/
Identified Topics: ['cnn', 'ad', 'video', 'snowden', 'nsa']
```
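
The sample output includes terms such as 'cnn', 'ad', and 'video', which probably come from site navigation and advertising rather than from the article itself. One possible way to suppress them is a post-filter over the returned list. This is a sketch under that assumption, not a feature of the tool; `filter_topics` and `SITE_NOISE` are hypothetical names.

```python
# Site-specific words to drop (assumption; tune per site).
SITE_NOISE = {"cnn", "ad", "video"}


def filter_topics(topics, noise=SITE_NOISE):
    """Return topics in their original order, minus any noise words."""
    return [t for t in topics if t.lower() not in noise]


print(filter_topics(['cnn', 'ad', 'video', 'snowden', 'nsa']))
# prints ['snowden', 'nsa']
```

Because filtering happens after the top terms are selected, the result can be shorter than the original list; adding the noise words to the stopword set before counting would avoid that.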

rohitbpatil27/Web-Page-Topic-Extractor
