Updates for requirements.txt and special characters #3

jimmynotjames (Contributor) commented Apr 26, 2024

This PR contains two commits.

The first commit updates requirements.txt to what's necessary to get a clean run of scraper.py. It also makes a minor update to .gitignore to exclude virtual environments (hope that's okay).

The second commit refines the handling of some special characters:

\u0435 is the Cyrillic small letter "е" (U+0435), which looks identical to the Latin "e".
Example: "I park\u0435d my car right between the Methodist"
lyrics.json currently has 410 of these.

\u200b is a zero-width space (U+200B), and it shows up in two song titles in lyrics.json:
"l\u200bong story short"
"r\u200bight where you left me"

I also added a couple of bits of code here and there for the sake of consistency and extra safeguarding.
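
The two character fixes described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the names `CHAR_REPLACEMENTS`, `clean_text`, and `count_special_chars` are hypothetical and do not come from scraper.py.

```python
# Hypothetical helpers; names are illustrative and not taken from scraper.py.

# Map each problematic character to its replacement.
CHAR_REPLACEMENTS = {
    "\u0435": "e",  # Cyrillic small letter ie (U+0435) -> Latin small letter e
    "\u200b": "",   # zero-width space (U+200B) -> removed entirely
}

# Build the translation table once so it can be reused for every string.
_TRANSLATION_TABLE = str.maketrans(CHAR_REPLACEMENTS)


def clean_text(text: str) -> str:
    """Return text with the known problematic characters replaced or removed."""
    return text.translate(_TRANSLATION_TABLE)


def count_special_chars(text: str) -> dict:
    """Count how often each problematic character occurs in text."""
    return {char: text.count(char) for char in CHAR_REPLACEMENTS}


if __name__ == "__main__":
    print(clean_text("I park\u0435d my car right between the Methodist"))
    print(clean_text("l\u200bong story short"))
    print(count_special_chars("r\u200bight where you left me"))
```

Running something like `count_special_chars` over the loaded contents of lyrics.json before and after a scrape is one way to confirm that neither character remains.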

Lastly, I did not include the resulting output data files: the diff was rather large, and the console messages changed on every run (probably due to timeouts from Genius?). I'm also not sure whether you run any post-processing or manual sanity checks on the output, but I presume that if you merge my PR, you can easily run the scraper yourself to produce the new files.

jimmynotjames (Contributor, Author) commented

@shaynak : Ready for your review!

.gitignore (review thread: outdated, resolved)
scraper.py (review thread: outdated, resolved)
shaynak (Owner) commented Apr 26, 2024

As a note - once I approve & merge this request, it'll take me roughly a week to make updates to the dataset because I'm away from my computer.

@jimmynotjames jimmynotjames requested a review from shaynak April 26, 2024 22:04
jimmynotjames (Contributor, Author) commented

@shaynak: Friendly reminder about this! Hope you've had a good weekend!

@shaynak shaynak merged commit fdcef8e into shaynak:main May 19, 2024

2 participants