This project is aimed at downloading previous years' question papers. I wanted it to update automatically each semester so it always has the latest question papers.
- Short URL: https://t.ly/PuJA
- Original Drive URL: This is (probably) maintained by the Examination branch. I got it from somewhere, and they regularly upload old question papers after each semester.
It was tough to search for old question papers there because Google Drive is hard to navigate.
You could not search question papers branch-wise or semester-wise, only year-wise.
At first I downloaded each question paper manually, but that took a lot of space (4 GB). It wasn't the best solution, though it did let me search for question papers by paper ID (I wrote a Python script for that but lost it somewhere).
Then I got the idea to scrape the Drive directly using the Google Drive API. I pulled all the data, but it was humongous (about 10k records), so I needed a database to store it. I couldn't hardcode it somewhere since that would be inefficient, so I used my favourite database (PostgreSQL) to store the file_name, download_link and folder_name (the folder each file came from).
1. Setting up the database
Docs pending
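In the meantime, here is a minimal sketch of the tables the rest of this README implies. The paper_ids columns come from the COPY command in section 4, and file_name/download_link/folder_name from the overview above; the drive_files table name, the column types and the DATABASE_URL environment variable are my assumptions, not necessarily what the project uses.

```python
# Sketch only: table/column names are taken from this README, types are assumptions.
import os

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var, see section 2
with conn, conn.cursor() as cur:
    # Files pulled from the Drive by drive_scraper.py (section 3)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS drive_files (
            id            SERIAL PRIMARY KEY,
            file_name     TEXT NOT NULL,
            download_link TEXT NOT NULL,
            folder_name   TEXT
        );
    """)
    # Subject metadata imported from the datesheet CSVs (section 4)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS paper_ids (
            branch_id    TEXT,
            semester     TEXT,
            subject_name TEXT,
            subject_code TEXT,
            paper_id     TEXT,
            m_code       TEXT,
            scheme       TEXT
        );
    """)
conn.close()
```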
2. Setting up the Environment Variables
Environment variables are dynamic settings external to the code, used to configure software behaviour without altering the code itself. Long story short, you don't want to edit your credentials in the code every time; environment variables are useful in that case.
- Check the .env.example
- Copy it and rename the copy to .env
- Put your credentials in it.
⚠️ WARNING: Never push environment variables to a public repository. They are meant to be private.
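For example, the Python scripts can load these values with python-dotenv. The variable names below are placeholders, not necessarily what .env.example actually lists:

```python
# Sketch of how a script might read the .env file (variable names are placeholders).
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the current directory into os.environ

db_url = os.getenv("DATABASE_URL")  # e.g. postgresql://user:pass@host:5432/dbname
key_path = os.getenv("SERVICE_ACCOUNT_KEY", "service_account_key.json")

if db_url is None:
    raise SystemExit("DATABASE_URL is not set -- did you copy .env.example to .env?")
```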
3. Drive Scraping
The drive_scraper.py script scrapes the data from Google Drive and stores it in a PostgreSQL database (hosted on my home server).
A service account is a special kind of account typically used by an application or compute workload, such as a Compute Engine instance, rather than a person. A service account is identified by its email address, which is unique to the account. Read more here.
This is how to do it step by step:
- Go here and make an account on Google Cloud.
- Make a new project.
- Go to APIs and Services on https://console.cloud.google.com/
- Go to Enable APIs and Services.
- Search for and enable the Google Drive API.
- Create credentials for that API. You make a service account to use the Google Drive API.
- Rename the API key to `service_account_key.json` and save it in the same folder as the `drive_scraper.py` script.
- You need to make a virtual environment for Python (read up online on what it does and how to create one on your operating system). On the Windows CMD terminal, you make it with `python -m venv venv`.
- Activate the virtual environment. On Windows, type `venv\scripts\activate` in your terminal.
- Run the script with `python drive_scraper.py` (a sketch of what it roughly does follows this list).
- At the end, it should store the list of file names, download links and folder names in the database.
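The real logic lives in drive_scraper.py; the sketch below only shows the general shape of it. The ROOT_FOLDER_ID placeholder, the drive_files table, the DATABASE_URL variable and the download-link format are my assumptions, not necessarily what the script does.

```python
# Rough sketch of what drive_scraper.py roughly does -- see the actual script for details.
# ROOT_FOLDER_ID, the drive_files table and the download-link format are assumptions.
import os

import psycopg2
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
ROOT_FOLDER_ID = "PUT_THE_SHARED_DRIVE_FOLDER_ID_HERE"  # taken from the Drive URL

creds = service_account.Credentials.from_service_account_file(
    "service_account_key.json", scopes=SCOPES
)
drive = build("drive", "v3", credentials=creds)


def list_children(folder_id):
    """Return every item inside a Drive folder, following pagination."""
    items, page_token = [], None
    while True:
        resp = drive.files().list(
            q=f"'{folder_id}' in parents and trashed = false",
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        items.extend(resp.get("files", []))
        page_token = resp.get("nextPageToken")
        if page_token is None:
            return items


conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var, see section 2
with conn, conn.cursor() as cur:
    # Walk each year folder and record every file inside it.
    for folder in list_children(ROOT_FOLDER_ID):
        if folder["mimeType"] != "application/vnd.google-apps.folder":
            continue
        for f in list_children(folder["id"]):
            link = f"https://drive.google.com/uc?id={f['id']}&export=download"
            cur.execute(
                "INSERT INTO drive_files (file_name, download_link, folder_name) "
                "VALUES (%s, %s, %s)",
                (f["name"], link, folder["name"]),
            )
conn.close()
```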
4. Subject scraping
Till now we got:
- File name (usually the subject code)
- Download link
Now, it's not user-friendly to open your syllabus, look up the subject code, and then use that subject code to search for a question paper.
To make it better, we will need Subject Code, Semester, Branch, etc. Luckily for us, https://academics.gndec.ac.in/datesheet is a really good source for all that data. This is what I mean:
Now, we could just copy-paste that data into Excel directly, but that would be too tedious for a lot of branches. Why do manual labour? That's what we are engineers for: to reduce manual labour where it's not necessary.
This is my approach to getting that data:
- First, inspect how the request goes to https://academics.gndec.ac.in/ in the browser console. This is how you do it:
- Now we need to get that "value" field which is being sent with the request. I found that it was in the Program dropdown menu. Here:
- Now do the same process programmatically. I used Python for this (see scrape_subjects.py); a rough sketch follows after the COPY command below.
- Extracted all the <table> elements from the response and used BeautifulSoup and pandas to convert them into .csv format, and saved them.
- Merged all those CSVs into one and, after some processing on the CSV files, imported it into the database using this command:
COPY paper_ids(branch_id, semester, subject_name, subject_code, paper_id, m_code, scheme)
FROM '/path/to/your/csv/file.csv'
WITH (FORMAT csv, HEADER true);
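As promised above, here is a rough sketch of the scraping idea. The endpoint, the request method and the "program" form field name are assumptions based on what the browser console shows; scrape_subjects.py has the real logic.

```python
# Rough sketch of the scrape_subjects.py idea -- the endpoint, request method and
# the "program" field name are assumptions; check the real script / browser console.
import glob
from io import StringIO

import pandas as pd
import requests

DATESHEET_URL = "https://academics.gndec.ac.in/datesheet"
PROGRAM_VALUES = ["1", "2", "3"]  # placeholder "value"s from the Program dropdown

for value in PROGRAM_VALUES:
    # Replay the same request the browser sends when a Program is selected.
    resp = requests.post(DATESHEET_URL, data={"program": value}, timeout=30)
    resp.raise_for_status()

    # pandas pulls every <table> element out of the returned HTML
    # (it uses an HTML parser such as BeautifulSoup/lxml under the hood).
    for i, table in enumerate(pd.read_html(StringIO(resp.text))):
        table.to_csv(f"program_{value}_table_{i}.csv", index=False)

# Merge everything into one CSV, ready for the COPY command above.
merged = pd.concat(pd.read_csv(p) for p in glob.glob("program_*_table_*.csv"))
merged.to_csv("all_subjects.csv", index=False)
```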
5. The API
Now, users can't just access the database directly; we need an API to send the data to the user's browser. server.js shows how the API works, and there's more info on it in api.md.
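For illustration only, querying the API from a client might look something like this. The port, route and query parameters here are made up; see server.js and api.md for the actual ones.

```python
# Hypothetical client call -- the port, route and parameters are illustrative only;
# see server.js and api.md for the actual API.
import requests

resp = requests.get(
    "http://localhost:3000/api/papers",         # assumed local server address/route
    params={"branch": "CSE", "semester": "5"},  # assumed query parameters
    timeout=10,
)
resp.raise_for_status()
for paper in resp.json():
    print(paper)  # expected to include the file name and download link
```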
6. The Frontend
Now the hard part is done. The API is complete and the database is set up. All that's left is the frontend for the project. May it be someone else, not me, who appends to this file.