Plagiarism Detection

A plagiarism detector, following this paper, that examines a text file and performs binary classification: it labels the file as either plagiarized or not, depending on how similar the file is to a provided source text.

Problem Statement and Analysis

Plagiarism is defined as “the appropriation of another person's ideas, processes, results, or words without giving appropriate credit”. Our goal is to detect it by comparing an original (source) text with a target text. Both texts are preprocessed first, then similarity features are computed and fed into a machine learning model that classifies the target text as plagiarized or not. Following the paper mentioned above, the similarity features are containment and longest common subsequence (LCS), the latter computed with a dynamic programming algorithm.
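The two similarity measures can be sketched as follows. This is a minimal illustration, not the code from the notebooks: the function names (`ngrams`, `containment`, `lcs_norm_word`) are hypothetical, and lowercasing plus whitespace splitting is an assumed stand-in for whatever preprocessing the project actually applies. Containment is taken here as the share of the answer's n-grams that also appear in the source, and the LCS length is normalized by the number of words in the answer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(answer_text, source_text, n=1):
    """Fraction of the answer's n-grams that also occur in the source.

    Counts are clipped: an n-gram seen twice in the answer but once in
    the source contributes only one match.
    """
    # Assumed preprocessing: lowercase and split on whitespace.
    answer_counts = Counter(ngrams(answer_text.lower().split(), n))
    source_counts = Counter(ngrams(source_text.lower().split(), n))
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((answer_counts & source_counts).values())
    return shared / total


def lcs_norm_word(answer_text, source_text):
    """Word-level LCS length divided by the answer's word count."""
    a = answer_text.lower().split()
    s = source_text.lower().split()
    if not a:
        return 0.0
    # table[i][j] holds the LCS length of a[:i] and s[:j].
    table = [[0] * (len(s) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, start=1):
        for j, word_s in enumerate(s, start=1):
            if word_a == word_s:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(s)] / len(a)


answer = "the cat sat on the mat"
source = "the cat lay on the red mat"
print(containment(answer, source, n=1))
print(containment(answer, source, n=2))
print(lcs_norm_word(answer, source))
```

Both values lie between 0 and 1, and higher values mean the answer overlaps more with the source. Larger n makes containment stricter, since longer word sequences have to match exactly.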

Created Features

Before preparing the final dataset, I created multiple features by computing containment over several n-gram sizes together with the longest common subsequence, then calculated a correlation matrix so that very highly correlated columns could be dropped.

Correlation Matrix.
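As a rough illustration of the correlation-based filtering step, the sketch below keeps a feature only if its absolute Pearson correlation with every feature already kept stays below a threshold. The 0.95 cutoff, the feature names, and the sample values are all assumptions made for the example; the README does not state which threshold or library the notebooks use.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.95):
    """Return the names of features to keep, in their original order.

    `features` maps a column name to its list of values. A column is
    dropped when it correlates too strongly with one kept earlier.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns: c_2 is an exact multiple of c_1.
features = {
    "c_1": [0.2, 0.4, 0.6, 0.8],
    "c_2": [0.1, 0.2, 0.3, 0.4],
    "lcs_word": [0.9, 0.1, 0.8, 0.3],
}
print(drop_correlated(features))
```

Because columns are visited in order, the first of two highly correlated features is the one that survives, so the ordering of the feature columns affects which ones are kept.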

DataSet

This data is a slightly modified version of a dataset created by Paul Clough (Information Studies) and Mark Stevenson (Computer Science) at the University of Sheffield. You can read all about the data collection and corpus on their university webpage.

Citation for data: Clough, P. and Stevenson, M. Developing A Corpus of Plagiarised Short Answers, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis,

Project Flow

Data Exploration
Defining Features
Train and Deploy the Model on AWS SageMaker
