This is a repository for Udacity Data Analyst Project 1 (Investigate a Dataset). The dataset used in the project is also included in this repository.
The libraries used in this project include:
- Pandas – For storing and manipulating structured data. Pandas functionality is built on NumPy (upgrade Pandas to version 0.25.1)
- NumPy – For multi-dimensional arrays and matrix data structures, and for performing mathematical operations
- Matplotlib – For all visualizations (including maps and graphs)
I analyzed the dataset, which contains information on about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The analysis focuses on answering the following questions:
- Which movie title had the highest budget?
- Which movie title had the highest revenue?
- Which movies are the most popular of all time?
- Is there a correlation between vote_count and revenue?
- What kinds of properties are associated with movies that have high revenues?
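As a rough illustration of how questions like these could be approached with Pandas, here is a minimal sketch. The tiny DataFrame is made up for the example and is not the TMDb data; the column names `original_title`, `budget` and `popularity` are assumptions, while `vote_count` and `revenue` are the names used in the questions above.

```python
import pandas as pd

# Made-up illustrative rows, NOT values from the TMDb dataset.
# Column names other than vote_count and revenue are assumed.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [10, 50, 30, 5],
    "revenue": [40, 120, 200, 8],
    "popularity": [1.2, 3.4, 2.8, 0.5],
    "vote_count": [100, 900, 700, 20],
})

# Title of the row with the largest budget and the largest revenue.
top_budget_title = movies.loc[movies["budget"].idxmax(), "original_title"]
top_revenue_title = movies.loc[movies["revenue"].idxmax(), "original_title"]

# Titles of the two most popular movies, highest popularity first.
most_popular = movies.nlargest(2, "popularity")["original_title"].tolist()

# Pearson correlation coefficient between vote_count and revenue.
vote_revenue_corr = movies["vote_count"].corr(movies["revenue"])

print(top_budget_title, top_revenue_title, most_popular, vote_revenue_corr)
```

With the real dataset, the DataFrame would be loaded from the CSV file included in the repository instead of being built by hand.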
The main steps for this project can be summarized as follows:
- Data Wrangling
- Data Assessment
- Data Cleaning
- Exploratory Analysis
- Conclusions/Results
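The assessment and cleaning steps above can be sketched roughly as follows. This is only an illustration on made-up rows: the column names are assumed, and treating a budget or revenue of 0 as missing is an assumption made for the example, not necessarily the rule used in this project.

```python
import pandas as pd

# Made-up rows containing one exact duplicate and some zero values.
raw = pd.DataFrame({
    "original_title": ["A", "A", "B", "C", "D"],
    "budget": [10, 10, 0, 30, 5],
    "revenue": [40, 40, 120, 0, 8],
})

# Assessment: count fully duplicated rows and zero values per column.
n_duplicates = int(raw.duplicated().sum())
zero_counts = (raw[["budget", "revenue"]] == 0).sum()

# Cleaning: drop duplicate rows, then keep only rows where both
# budget and revenue are positive (assumption: 0 means "unknown").
deduped = raw.drop_duplicates()
clean = deduped[(deduped["budget"] > 0) & (deduped["revenue"] > 0)]
clean = clean.reset_index(drop=True)

print(n_duplicates, zero_counts.to_dict(), clean)
```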
Based on the data and the analysis carried out:
- The most popular movies of all time are Jurassic World, Mad Max: Fury Road, Interstellar, Guardians of the Galaxy and Insurgent.
- The scatter plot shows no correlation between vote_count and revenue.
- High popularity ratings are associated with movies that generate high revenue.
- The budget of a movie that generates low revenue is about 5 million, while that of a high-revenue movie is over 52 million. This shows that the budget of a movie is correlated with its revenue, but the result has limitations, since factors such as the year the movie was released (release_year) and the director of the movie may also play a role.
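One possible way to compare budgets between low-revenue and high-revenue movies is sketched below. The split at the median revenue is an assumption for the example (the grouping used in the actual analysis may differ), and the numbers are made up rather than taken from the dataset.

```python
import pandas as pd

# Made-up budget and revenue values, NOT taken from the TMDb dataset.
movies = pd.DataFrame({
    "budget": [2, 5, 8, 40, 60, 55],
    "revenue": [1, 10, 6, 150, 300, 220],
})

# Assumption: movies above the median revenue count as "high",
# the rest as "low".
median_revenue = movies["revenue"].median()
movies["revenue_level"] = (
    movies["revenue"].gt(median_revenue).map({True: "high", False: "low"})
)

# Mean budget within each revenue group.
mean_budget = movies.groupby("revenue_level")["budget"].mean()

# Pearson correlation coefficient between budget and revenue.
budget_revenue_corr = movies["budget"].corr(movies["revenue"])

print(mean_budget, budget_revenue_corr)
```

A comparison like this does not control for other columns, so it says nothing about how release_year or the director affect the result.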