Gianni2437-p0

This project takes in multiple text documents and calculates word counts and TF-IDF scores for key words within the documents. File paths to the documents and stopwords directories must be supplied on the command line using the '-d' and '-s' flags. The project runs in four stages: raw input word counts (sp1), word frequencies with the specified stopwords removed (sp2), word frequencies with punctuation also parsed out (sp3), and finally TF-IDF calculations for each relevant word in each document (sp4). The sp1, sp2, and sp3 JSON files contain the top 40 words from the entire corpus, while the sp4 output contains the combination of the top 5 words from each input document.
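As an illustration of the sp4 stage, here is a minimal sketch of a TF-IDF calculation followed by the "top 5 words per document" merge. It is not the code in gianni2437-p0.py: the function names `tf_idf` and `top_words_per_doc` are hypothetical, and the formula (raw term count times the natural log of N over document frequency) is an assumption, since the exact TF-IDF variant is not stated here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every word in every document.

    docs maps a document name to its list of already-cleaned tokens.
    Assumed formula: tf(word, doc) * ln(N / df(word)), with raw counts as tf.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores[name] = {
            word: count * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
    return scores


def top_words_per_doc(scores, k=5):
    """Merge the k best-scoring words of each document into one dict.

    Ties are broken alphabetically. If a word is in the top k of several
    documents, the score from the last document processed is kept.
    """
    merged = {}
    for per_doc in scores.values():
        best = sorted(per_doc.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        merged.update(best)
    return merged
```

A word that appears in every document gets a score of zero under this formula, which is why common words drop out of the sp4 output even without a stopword list.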

Problems:

sp1 and sp2 received imperfect scores: the parsing did not remove empty strings, which ended up in the JSON output
imperfect preprocessing also led to lower scores on sp3 and sp4
sp4 output includes terms such as "don\u2019t", which contain an escaped right single quotation mark
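One possible way to address these problems is sketched below. It is not the project's code: `clean_tokens` and `QUOTE_MAP` are hypothetical names, and lowercasing, splitting on whitespace, and mapping curly quotes to ASCII quotes are all assumptions about the desired preprocessing.

```python
import string

# Curly quotes are not in string.punctuation, so map them to ASCII first.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def clean_tokens(text, stopwords=frozenset()):
    """Split text into lowercase tokens with edge punctuation removed.

    Tokens that become empty after stripping (for example a bare "--")
    are dropped, so no empty string can reach the JSON output.
    """
    tokens = []
    for raw in text.translate(QUOTE_MAP).lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in stopwords:
            tokens.append(word)
    return tokens
```

For example, `clean_tokens("Don\u2019t stop -- now!", {"stop"})` yields `["don't", "now"]`. Separately, `json.dump` escapes non-ASCII characters by default, so any curly apostrophe that does survive is written as `\u2019`; passing `ensure_ascii=False` writes the character itself.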
