A machine learning project aimed at detecting phishing websites based on various features extracted from URLs.
- Phishing Site Detection
- Introduction
- Project Structure
- Installation
- Dataset
- Preprocessing
- Modeling
- Evaluation
- Results
- How to Use
- Contributing
- License
- Acknowledgments
Phishing is a fraudulent attempt to obtain sensitive information by posing as a trustworthy entity in electronic communication. This project uses machine learning to detect phishing websites by analyzing features extracted from their URLs.
├── data
│ ├── raw # Raw data from the source
│ ├── processed # Processed data after preprocessing
├── notebooks
│ ├── data_preprocessing.ipynb # Jupyter notebook for data preprocessing
│ ├── model_training.ipynb # Jupyter notebook for model training and evaluation
├── src
│ ├── data_preprocessing.py # Python script for data preprocessing
│ ├── model.py # Python script for model creation and training
│ ├── evaluation.py # Python script for model evaluation
├── models
│ ├── model_checkpoint.pth # Saved model checkpoint
├── requirements.txt # List of dependencies
├── README.md # Project documentation
To set up the project locally, follow these steps:
git clone https://github.com/mungekarkiran/phishing_site_detection.git
cd phishing_site_detection
pip install -r requirements.txt
The dataset used for this project consists of URLs labeled as phishing or legitimate. Features were extracted from these URLs to train the model.
- Source: PhishTank Dataset
- Size: Approximately 10,000 URLs
- Features: URL length, presence of special characters, domain age, and others.
The preprocessing steps include:
- Extracting features from URLs.
- Handling missing data.
- Encoding categorical features.
- Normalizing numerical features.
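
As a rough illustration of the first and last steps above, the sketch below derives a few lexical features from a URL and min-max scales a numeric column. It is not the project's actual preprocessing code: the function names, the feature names, and the `SPECIAL_CHARS` set are assumptions made for this example, and features such as domain age (which need an external lookup) are left out.

```python
from urllib.parse import urlparse

# Hypothetical set of "special" characters; the project's actual list is not documented.
SPECIAL_CHARS = set("@-_?=&%#~")


def extract_url_features(url: str) -> dict:
    """Return a few simple lexical features for one URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "num_dots_in_host": host.count("."),
        "uses_https": int(parsed.scheme == "https"),
        "has_at_symbol": int("@" in url),
    }


def min_max_normalize(values):
    """Scale a list of numbers to the [0, 1] range; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    urls = ["http://example.com", "https://example.com/login?user=a@b"]
    rows = [extract_url_features(u) for u in urls]
    scaled_lengths = min_max_normalize([r["url_length"] for r in rows])
    for url, row, scaled in zip(urls, rows, scaled_lengths):
        print(url, row, round(scaled, 3))
```

In practice the scaling parameters should be computed on the training split only and then reused for validation and test data, so that no information leaks from the held-out sets.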
Various machine learning models were explored, including:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- Gradient Boosting

Hyperparameter tuning was performed to optimize the models.
The models were evaluated using the following metrics:
- Accuracy: The proportion of all sites (phishing and legitimate) that were classified correctly.
- Precision: The percentage of correctly identified phishing sites among all sites identified as phishing.
- Recall: The percentage of actual phishing sites that were correctly identified.
- F1-Score: The harmonic mean of precision and recall.
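
The four definitions above can be written out directly from the confusion-matrix counts. The following is a minimal sketch, assuming labels are encoded as `1` for phishing and `0` for legitimate; the function name `classification_metrics` and the sample labels are made up for illustration and are not taken from the project's `evaluation.py`.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1, treating phishing (1) as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels for illustration only; these are not project results.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(classification_metrics(y_true, y_pred))
```

Returning `0.0` when a denominator is zero is one common convention for the undefined case (for example, when the model never predicts phishing); other conventions exist.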
The best-performing model achieved:
- Accuracy: 95%
- Precision: 93%
- Recall: 92%
- F1-Score: 92.5%

These results demonstrate the model's ability to effectively identify phishing sites.
To use the trained model for predicting whether a URL is phishing or legitimate:
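# load_model and predict are assumed to be helpers provided by this project's code under src/.
model = load_model('models/model_checkpoint.pth')
url = 'http://example.com'
# predict is expected to return a truthy value when the URL is classified as phishing.
prediction = predict(model, url)
print(f'The URL is {"Phishing" if prediction else "Legitimate"}')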
Contributions are welcome! Please follow these guidelines:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature-name).
- Make your changes.
- Commit your changes (git commit -m 'Add feature').
- Push to the branch (git push origin feature/your-feature-name).
- Open a pull request.
This project is licensed under the MIT License.
- The creators of the PhishTank dataset.
- Inspiration from various machine learning tutorials and resources.
- Thanks to the contributors who helped in making this project possible.