Install the dependencies by running the following command in your command prompt or terminal (a Python environment must be present):
$ pip install -r requirements.txt
Download the repo to a folder of your choice on your system.
Open movies.csv (or any input .json file) in a text editor or Excel, as you prefer. movies.csv contains a list of more than 5 lakh (500,000) movies. Each entry holds two pieces of information: the movie id and the lang_id used by the CBFC site. If the input is a .json file, each entry contains only the movie id. Decide which movies you want to scrape, i.e. the start and end indices.
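If the file is too large to open comfortably in an editor, a short script can preview a slice of it. The sketch below is only an illustration: the helper name `preview_rows` is hypothetical, the sample rows are made up, and it assumes nothing about whether movies.csv has a header row or how the tool maps its indices to rows.

```python
import csv
import io
from itertools import islice


def preview_rows(handle, first, last):
    """Return rows first..last (0-based, inclusive) from an open CSV handle."""
    reader = csv.reader(handle)
    return list(islice(reader, first, last + 1))


# Stand-in data for illustration only. With the real file, use:
#     with open("movies.csv", newline="") as handle: ...
sample = io.StringIO("100,1\n101,1\n102,2\n103,1\n")
for row in preview_rows(sample, 1, 2):
    print(row)
```

Because `islice` reads lazily, only the requested rows are kept in memory, which helps with a file of this size.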
Once you have decided the start and end indices, and with the dependencies installed, run the following command in the terminal or command prompt. If you supply batch-size, the movies are processed in batches of that size, and an output JSON file is created for each batch.
$ python download.py --range <start-index>-<end-index> --batch-size=<batch-size>
For example, if you want to scrape details of movies from movie ID 2 to movie ID 102 in "movies.csv", and save them in batches of 100 movies at a time, run the following:
$ python download.py --range 2-102 --batch-size=100
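To get a feel for how a range and a batch size interact, here is a minimal sketch of one possible way to split a range into batches. It is not taken from download.py: the function names `parse_range` and `make_batches` are hypothetical, and it assumes the end index is inclusive, which the tool may or may not do.

```python
def parse_range(text):
    """Split a 'start-end' string such as '2-102' into two integers."""
    start, end = text.split("-", 1)
    return int(start), int(end)


def make_batches(start, end, batch_size):
    """Yield (first, last) pairs that cover start..end, assuming an inclusive end."""
    for first in range(start, end + 1, batch_size):
        yield first, min(first + batch_size - 1, end)


start, end = parse_range("2-102")
print(list(make_batches(start, end, 100)))  # [(2, 101), (102, 102)]
```

Under the inclusive-end assumption, the range 2-102 covers 101 entries, so a batch size of 100 would give one full batch and a second batch with a single entry.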
Note1: By default, the input is "movies.csv".
Alternatively, if you want to scrape details of movies from movie ID 2 to movie ID 102 in "anand.json", and save them in batches of 100 movies at a time, run the following:
$ python download.py --range 2-102 --batch-size=100 --input="anand.json"
Output JSON files will be created for each batch in the specified range, but will include only those movies that are contained in "anand.json".
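The filtering described above can be pictured with the following sketch. It rests on assumptions that this README does not confirm: that the .json input is a flat list of numeric movie ids, and that the range is compared against the id values. The function name `ids_in_range` and the sample ids are hypothetical.

```python
import json


def ids_in_range(json_text, start, end):
    """Keep only the movie ids from the JSON list that fall in start..end."""
    movie_ids = json.loads(json_text)  # assumed shape: [id, id, ...]
    return [movie_id for movie_id in movie_ids if start <= int(movie_id) <= end]


# Made-up ids for illustration; a real input file would be read with open(...).read()
sample = "[1, 5, 50, 150, 101]"
print(ids_in_range(sample, 2, 102))  # [5, 50, 101]
```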
Note2: By default, the batch-size is 1000.
Note3: By default, this process runs in parallel and uses all cores on your computer. If you want to allocate only a specific number of cores to this task, add the --n-jobs argument to the command, as follows:
$ python download.py --range 2-102 --n-jobs 2 # Use only two cores
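As a summary of the options and defaults described above, here is a hedged sketch of how such a command line could be declared with `argparse`. This is an illustration, not the actual contents of download.py: the helper names `build_parser` and `resolve_jobs` are hypothetical, and falling back to `os.cpu_count()` is only one way "use all cores" might be implemented.

```python
import argparse
import os


def build_parser():
    """Declare the four options mentioned in this README with their stated defaults."""
    parser = argparse.ArgumentParser(prog="download.py")
    parser.add_argument("--range", required=True, help="start-end, e.g. 2-102")
    parser.add_argument("--batch-size", type=int, default=1000)
    parser.add_argument("--input", default="movies.csv")
    parser.add_argument("--n-jobs", type=int, default=None)
    return parser


def resolve_jobs(n_jobs):
    """Use the requested job count, or every available core if none was given."""
    return n_jobs if n_jobs else (os.cpu_count() or 1)


args = build_parser().parse_args(["--range", "2-102", "--n-jobs", "2"])
print(args.range, args.batch_size, args.input, resolve_jobs(args.n_jobs))
# 2-102 1000 movies.csv 2
```

Note that `argparse` accepts both `--batch-size=100` and `--batch-size 100`, which matches the mixed styles used in the commands above.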