Scrape data on movie certifications from CBFC site

Installation

Run the following command in your command prompt or terminal (a Python environment must already be present):

$ pip install -r requirements.txt

Usage

Step1:

Download the repo to a folder of your choice on your system.

Step2:

Open movies.csv (or any input .json file) in a text editor or Excel. movies.csv contains a large list of 5 lakh+ (500,000+) movies; each entry holds two pieces of information, the movie id and the lang_id used by the CBFC site. A .json input file contains only the movie ids. Decide which movies you want to scrape, that is, the start and end indices.
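If the file is too large to open comfortably in an editor, a short script can show the first few rows instead. This is a minimal sketch: the helper name preview_rows is hypothetical, and it makes no assumption about column names or whether the file has a header row.

```python
import csv
from itertools import islice


def preview_rows(path, n=5):
    """Return the first n rows of a CSV file as lists of strings.

    Hypothetical helper for inspection only; it is not part of this repo.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(islice(csv.reader(f), n))


if __name__ == "__main__":
    # Assumes movies.csv is in the current working directory.
    for row in preview_rows("movies.csv", 5):
        print(row)
```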

Step3:

Once you have decided the start and end indices, run the following command in the terminal / command prompt (assuming the dependencies are installed). If you supply batch-size, the movies are processed in batches of that many movies at a time, and an output JSON file is created for each batch.

$ python download.py --range <start-index>-<end-index> --batch-size=<batch-size>

For example, if you want to scrape details of movies from movie ID 2 to movie ID 102 in "movies.csv", and save them in batches of 100 movies at a time, run the following:

$ python download.py --range 2-102 --batch-size=100
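To see how a range might be divided into batches, here is a rough sketch. It assumes both ends of each batch are inclusive, which is consistent with output file names in the repo such as movies-2-1001.json and movies-1002-2001.json, but it is not taken from download.py itself. The function name batch_ranges is hypothetical.

```python
def batch_ranges(start, end, batch_size=1000):
    """Split the inclusive range [start, end] into consecutive batches.

    Assumption: batch boundaries are inclusive on both ends. This is an
    illustration, not the logic used by download.py.
    """
    ranges = []
    lo = start
    while lo <= end:
        hi = min(lo + batch_size - 1, end)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


if __name__ == "__main__":
    print(batch_ranges(2, 102, 100))    # [(2, 101), (102, 102)]
    print(batch_ranges(2, 3001, 1000))  # [(2, 1001), (1002, 2001), (2002, 3001)]
```

Under this assumption, a range of 2-102 with a batch size of 100 would give one full batch and one batch holding a single movie.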

Note: By default, the input is "movies.csv".

Alternatively, if you want to scrape details of movies from movie ID 2 to movie ID 102 in "anand.json", and save them in batches of 100 movies at a time, run the following:

$ python download.py --range 2-102 --batch-size=100 --input="anand.json"

An output JSON file is created for each batch in the specified range, but it includes only the movies that are listed in "anand.json".
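The filtering described here can be pictured roughly as follows. This sketch assumes the .json input is a flat JSON array of movie ids, which the README does not confirm, and the function name ids_in_range is hypothetical.

```python
import json


def ids_in_range(json_text, start, end):
    """Return the movie ids from a JSON array that fall within [start, end].

    Assumption: the input is a flat JSON array of ids (numbers or numeric
    strings). The real input format used by download.py may differ.
    """
    ids = json.loads(json_text)
    return [i for i in ids if start <= int(i) <= end]


if __name__ == "__main__":
    sample = "[1, 5, 50, 200]"  # made-up ids for illustration
    print(ids_in_range(sample, 2, 102))  # [5, 50]
```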

Note1: By default, the batch-size is 1000.

Note2: By default, this process will work in parallel and consume all cores on your computer. If you want to allocate only a specific number of cores to this task, add another argument --n-jobs to the command, as follows:

$ python download.py --range 2-102 --n-jobs 2  # Use only two cores
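For reference, the options and defaults described above could be parsed along these lines. This is only a sketch built with the standard argparse module; it is not the parser in download.py, and the function name parse_args and the use of os.cpu_count() to represent "all cores" are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse options that mirror the flags documented in this README.

    Hypothetical sketch; download.py may define its arguments differently.
    """
    parser = argparse.ArgumentParser(description="Scrape CBFC movie certifications")
    parser.add_argument("--range", dest="id_range", required=True,
                        help="<start-index>-<end-index>, e.g. 2-102")
    parser.add_argument("--batch-size", type=int, default=1000)
    parser.add_argument("--input", default="movies.csv")
    # Assumption: "all cores" is represented here by os.cpu_count().
    parser.add_argument("--n-jobs", type=int, default=os.cpu_count())
    args = parser.parse_args(argv)
    # Split "2-102" into integer start and end values.
    start, end = (int(part) for part in args.id_range.split("-"))
    args.start, args.end = start, end
    return args


if __name__ == "__main__":
    args = parse_args(["--range", "2-102", "--n-jobs", "2"])
    print(args.start, args.end, args.batch_size, args.input, args.n_jobs)
```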
