mostafaGwely/Data-Wrangling

Data-Wrangling

Data wrangling on the WeRateDogs data set from Twitter.

Introduction

WeRateDogs is a Twitter account that rates dogs and has millions of followers. I wrangled the data in wrangle_act.ipynb, and the wrangling process included the following steps:

1-data gathering

  • I downloaded the twitter_archive_enhanced.csv data set provided in the workspace
  • then I gathered the tweets data set through the Twitter API using tweepy
  • then I loaded the image_predictions.tsv file provided in the workspace
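Loading the two provided files can be sketched as below. This is a minimal example, not the notebook's code: the function name `load_provided_files` is hypothetical, and it assumes the .tsv file is tab-separated, as its extension suggests. The tweepy step is left out because it needs API credentials.

```python
import pandas as pd


def load_provided_files(archive_path, predictions_path):
    """Read the archive CSV and the image-predictions TSV into DataFrames.

    Assumption: the predictions file is tab-separated (based on its .tsv extension).
    """
    archive = pd.read_csv(archive_path)
    predictions = pd.read_csv(predictions_path, sep="\t")
    return archive, predictions


# Example call with the file names mentioned above (not executed here):
# twEnArch, predictions = load_provided_files(
#     "twitter_archive_enhanced.csv", "image_predictions.tsv"
# )
```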

2-data assessing

-quality

Twitter enhanced archive data set (twEnArch, loaded from twitter_archive_enhanced):

  • some ratings are decimal, but the numerator column is int instead of float
  • wrong rating numerators were extracted from the text column
  • the denominator column is not needed (the scale can be added to the numerator column header)
  • null values in the retweet columns
  • the source column needs reformatting
  • the timestamp column should have a date data type
  • the in_reply_to status id and user id columns need to be converted to an appropriate data type
  • the dog breeds are inaccurate
  • the name column has null values and misleading values like 'a'
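Some of these quality issues can be checked programmatically. The sketch below flags decimal ratings in the text and missing or suspicious names. It is an illustration only: the function name `flag_quality_issues` is hypothetical, the column names `text` and `name` are assumed, and treating an all-lowercase name such as 'a' as misleading is a heuristic of this example.

```python
import pandas as pd


def flag_quality_issues(archive: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per suspected quality issue."""
    flags = pd.DataFrame(index=archive.index)
    # A decimal numerator such as "13.5/10" appears in the tweet text.
    flags["decimal_rating"] = archive["text"].str.contains(
        r"\d+\.\d+/\d+", regex=True, na=False
    )
    name = archive["name"]
    flags["missing_name"] = name.isna()
    # Heuristic (assumption): real names are capitalized, so "a" is not a name.
    flags["lowercase_name"] = name.fillna("").str.islower()
    return flags
```

Summing each flag column, for example with `flag_quality_issues(twEnArch).sum()`, gives a quick count of affected rows per issue.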

-tidiness

  • there are three data sets instead of one master data set
  • dog type can be represented in one column instead of four

note :

there are more quality issues, but I chose just 8

3-cleaning data

  • some ratings are decimal, but the numerator column is int instead of float: I converted the data type to float

  • wrong rating numerators were extracted from the text column: I re-extracted the rating from the text using regex

  • the denominator column is not needed (the scale can be added to the numerator column header): I added the scale to the numerator column header, then removed the rating_denominator column

  • null values in the retweet columns: I deleted the rows whose retweet columns are not null, so that I analyze only the original tweets

  • the source column needs reformatting: I extracted the information from the source column and replaced the original value with it

  • the timestamp column should have a date data type: I converted the timestamp to a date data type

  • the in_reply_to status id and user id columns need to be converted to an appropriate data type: I converted these columns to string after converting their contents from exponent format to int format

  • the dog breeds are inaccurate: I removed the rows that have inaccurate breed types

  • the name column has null values and misleading values like 'a': I re-extracted the name of the dog from the text using regex, then replaced the old value with it

  • there are three data sets instead of one master data set: I used the merge function to build the master data set df by joining the three data sets twEnArch, predictions, and tweets

  • dog type can be represented in one column dog_stages instead of four (doggo, floofer, pupper, puppo): I used the merge function to compute the dog_stages column from the four columns
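The two regex steps above (ratings and names) can be sketched as follows. The patterns are assumptions for illustration, not the ones used in the notebook: the rating pattern accepts an optional decimal part, and the name pattern only looks for a capitalized word after "This is", "Meet", or "named". The helpers `extract_rating` and `extract_name` are hypothetical.

```python
import re
from typing import Optional

# Numerator with an optional decimal part, followed by "/denominator".
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")
# Assumed phrasing for introducing a dog; the name must start with a capital.
NAME_RE = re.compile(r"(?:This is|Meet|named)\s+([A-Z][a-z]+)")


def extract_rating(text: str) -> Optional[float]:
    """Return the first rating numerator found in the text as a float."""
    match = RATING_RE.search(text)
    return float(match.group(1)) if match else None


def extract_name(text: str) -> Optional[str]:
    """Return a capitalized name, or None when no plausible name is found."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None
```

On a DataFrame these could be applied with something like `twEnArch["text"].apply(extract_rating)`. Taking the first match is a simplification: a tweet that contains another fraction before the rating would need extra handling.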
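The retweet filter and the data type conversions above might look like the sketch below. The column names `retweeted_status_id`, `timestamp`, `in_reply_to_status_id`, and `in_reply_to_user_id` are assumed here, and both function names are hypothetical.

```python
import pandas as pd

# Assumed names of the reply id columns.
REPLY_ID_COLUMNS = ["in_reply_to_status_id", "in_reply_to_user_id"]


def keep_original_tweets(archive: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose retweet id is null, i.e. original tweets."""
    return archive[archive["retweeted_status_id"].isna()].copy()


def fix_dtypes(archive: pd.DataFrame) -> pd.DataFrame:
    """Convert the timestamp to datetime and the reply ids to strings."""
    out = archive.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    for col in REPLY_ID_COLUMNS:
        # Float ids print in exponent format; go through a nullable integer
        # first so the string has no exponent or trailing ".0".
        out[col] = out[col].astype("Int64").astype("string")
    return out
```

Ids that were already read as floats may have lost precision before this step; reading them as strings at load time would avoid that.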
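The two tidiness fixes above can be sketched like this. Note that this example collapses the stage columns with a row-wise join rather than with merge, so it is an alternative illustration and not the notebook's method. It assumes each stage column holds its own stage name when the stage applies, that the three data sets share a `tweet_id` key, and that an inner join is wanted. The function names are hypothetical.

```python
import pandas as pd

# Assumed names of the four stage columns.
STAGES = ["doggo", "floofer", "pupper", "puppo"]


def collapse_stages(archive: pd.DataFrame) -> pd.DataFrame:
    """Replace the four stage columns with a single dog_stages column."""

    def join_stages(row):
        # A stage applies when the cell contains the stage's own name.
        found = [stage for stage in STAGES if row[stage] == stage]
        return ", ".join(found) if found else None

    out = archive.copy()
    out["dog_stages"] = out[STAGES].apply(join_stages, axis=1)
    return out.drop(columns=STAGES)


def build_master(archive, predictions, tweets):
    """Join the three data sets on tweet_id (assumed key, inner join)."""
    master = archive.merge(predictions, on="tweet_id", how="inner")
    return master.merge(tweets, on="tweet_id", how="inner")
```

A tweet that mentions two stages ends up with a combined value such as "doggo, pupper", which keeps that information visible instead of silently picking one.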

Shots

https://github.com/mostafaGwely/Data-Wrangling/blob/master/act_report.pdf
