Data-Wrangling

Data wrangling on the WeRateDogs data set from Twitter

introduction

WeRateDogs is a Twitter account that rates dogs and has millions of followers. I wrangled the data in wrangle_act.ipynb, and the wrangling process included the following steps:

1-data gathering

I downloaded the twitter_archive_enhanced.csv data set provided in the workspace,
then gathered the tweets data set using the Twitter API through tweepy,
then gathered the image_predictions.tsv file provided in the workspace.
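
The gathering step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the tweets gathered through tweepy were saved with one JSON object per line, and that only the `id`, `retweet_count` and `favorite_count` fields are needed. The helper name `load_tweet_json` and the sample values are hypothetical.

```python
import io
import json

import pandas as pd


def load_tweet_json(lines):
    """Build a DataFrame from an iterable of JSON strings, one tweet per line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet.get("retweet_count"),
            "favorite_count": tweet.get("favorite_count"),
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# Toy stand-in for the saved tweet file (values are made up for illustration).
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweets = load_tweet_json(sample)

# With the real files, loading might look like this (paths are assumptions):
# twEnArch = pd.read_csv("twitter_archive_enhanced.csv")
# predictions = pd.read_csv("image_predictions.tsv", sep="\t")
# with open("tweet_json.txt") as fh:
#     tweets = load_tweet_json(fh)
```

Note that the .tsv file needs `sep="\t"`, since `pd.read_csv` splits on commas by default.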

2-data assessing

-quality

Twitter enhanced archive data set (twEnArch, loaded from twitter_archive_enhanced)

some ratings are decimal, but the numerator column is int instead of float
wrong rating numerators were extracted from the text column
the denominator column is not needed (the scale can be added to the numerator column header)
null values in the retweet columns
the source column needs reformatting
the timestamp column should have a date data type
the in_reply_to status id and user id columns need to be converted to an appropriate data type
the dog breeds are inaccurate
the name column has null values and misleading values like a
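
Some of the quality issues above can be detected programmatically. The sketch below runs on a toy frame with made-up rows; the column names follow the archive as described here, and the regular expressions are assumptions chosen for illustration, not necessarily the checks used in the notebook.

```python
import pandas as pd

# Toy rows that mimic the archive's columns (made-up values).
archive = pd.DataFrame({
    "text": ["This is Bella. 13.5/10", "Here is a pupper. 12/10"],
    "rating_numerator": [5, 12],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "retweeted_status_id": [None, 8.9e17],
    "name": ["Bella", "a"],
})

# Decimal ratings in the text that an integer numerator cannot hold.
has_decimal_rating = archive["text"].str.contains(r"\d+\.\d+/\d+")

# Names starting with a lowercase letter are usually ordinary words, not names.
suspicious_name = archive["name"].str.match(r"^[a-z]")

# Rows that are retweets have a non-null retweeted_status_id.
is_retweet = archive["retweeted_status_id"].notna()

# dtypes shows the int numerator and the timestamp stored as plain strings.
print(archive.dtypes)
print(has_decimal_rating.sum(), suspicious_name.sum(), is_retweet.sum())
```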

-tidiness

there are three data sets instead of one master data set
the dog stage can be represented in one column instead of four (doggo, floofer, pupper, puppo)
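
One way the two tidiness issues could be resolved is sketched below on toy data. It assumes the stage columns hold the stage name when present and the string "None" otherwise, and it uses a row-wise `apply` as a simple stand-in technique; the notebook may combine the columns differently. The helper `combine_stages` and all values are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Toy versions of the three data sets (made-up values).
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "chow"]})
tweets = pd.DataFrame({"tweet_id": [1, 2, 3], "retweet_count": [10, 3, 0]})


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ",".join(found) if found else None


# Collapse the four stage columns into a single dog_stages column.
archive["dog_stages"] = archive.apply(combine_stages, axis=1)
archive = archive.drop(columns=STAGES)

# Inner joins keep only the tweets present in all three tables.
df = archive.merge(predictions, on="tweet_id").merge(tweets, on="tweet_id")
```

Joining multiple stages with a comma keeps tweets that mention more than one stage, rather than silently picking one.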

note :

there are more quality issues, but I chose to address only the ones listed above

3-cleaning data

some ratings are decimal, but the numerator column is int instead of float: I converted the data type to float
wrong rating numerators were extracted from the text column: I re-extracted the rating from the text using a regex
the denominator column is not needed (the scale can be added to the numerator column header): I added the scale to the numerator column header, then removed the rating_denominator column
null values in the retweet columns: I deleted the rows whose retweet columns are not null, so that the analysis covers only original tweets
the source column needs reformatting: I extracted the information from the source column and replaced the original value with it
the timestamp column should have a date data type: I converted the timestamp to a date data type
the in_reply_to status id and user id columns need to be converted to an appropriate data type: I converted these columns to string after converting their content from exponential format to int format
the dog breeds are inaccurate: I removed the rows that have inaccurate breed types
the name column has null values and misleading values like 'a': I re-extracted the dog's name from the text using a regex, then replaced the old value
there are three data sets instead of one master data set: I used the merge function to build the master data set df by joining the three data sets twEnArch, predictions, and tweets
the dog stage can be represented in one column, dog_stages, instead of four (doggo, floofer, pupper, puppo): I used the merge function to compute the dog_stages column from the four columns
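
Several of the quality fixes above can be sketched with pandas string methods. This is an illustration on made-up rows, not the notebook's code: the regular expressions, the name patterns ("This is", "Meet") and the renamed column `rating_numerator_out_of_10` are all assumptions.

```python
import pandas as pd

# Toy rows that mimic the archive's columns (made-up values).
archive = pd.DataFrame({
    "text": ["This is Bella. 13.5/10 would pet", "Meet Rex. 12/10"],
    "rating_numerator": [5, 12],
    "rating_denominator": [10, 10],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "retweeted_status_id": [None, None],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
})

# Keep original tweets only: retweets carry a retweeted_status_id.
archive = archive[archive["retweeted_status_id"].isna()].copy()

# Re-extract the numerator from the text, allowing a decimal part.
rating = archive["text"].str.extract(r"(\d+(?:\.\d+)?)/\d+", expand=False)
archive["rating_numerator"] = rating.astype(float)

# Put the scale in the header (hypothetical name) and drop the denominator.
archive = archive.rename(columns={"rating_numerator": "rating_numerator_out_of_10"})
archive = archive.drop(columns=["rating_denominator"])

# Parse the timestamp strings into datetimes.
archive["timestamp"] = pd.to_datetime(archive["timestamp"])

# Keep only the text between the anchor tags of the source column.
archive["source"] = archive["source"].str.extract(r">([^<]+)<", expand=False)

# Re-extract a capitalized name after a known lead-in phrase.
archive["name"] = archive["text"].str.extract(
    r"(?:This is|Meet) ([A-Z][a-z]+)", expand=False
)
```

Rows whose text matches none of the name patterns end up with a missing name, which is easier to handle downstream than a misleading word such as 'a'.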

Shots

https://github.com/mostafaGwely/Data-Wrangling/blob/master/act_report.pdf
