PDF Data Cleaner

I did this project for a client back when I was still learning GCP. Its purpose is to extract tabular data from PDFs, clean it, and store it in BigQuery.

Getting Environment Ready

  1. Turn on the following APIs in the Google Cloud Console:
  • Vision API
  • BigQuery API
  • Cloud Run API
  • Cloud Functions API
  • Cloud Storage API
  2. Create three buckets in GCS as shown in the figure below.
  • The first one is where the user will upload the PDF files. I named it dark-foundry-340620-companydata.
  • The second one is where the processed files will be stored. I named it dark-foundry-340620-companydata-processed.
  • The third one is used to store Cloud Vision results temporarily (it must always be empty). I named it dark-foundry-340620-tmp.

(Figure: the three buckets listed in the Cloud Storage console.)

  3. Create a BigQuery dataset to store the results. I named it companydata.
  4. Create a new service account, give it the Owner role, and download its key. Keep the key file in your project folder. (I have since learned that this step was not necessary.) If you prefer to script the bucket and dataset creation instead of clicking through the console, see the sketch after this list.
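
A minimal sketch of that scripted setup, using the google-cloud-storage and google-cloud-bigquery client libraries. The bucket and dataset names are the ones used in this README; the location is an assumption, so adjust both for your own project.

```python
# Optional: script the bucket and dataset creation instead of using the console.
# Assumes you are already authenticated (e.g. `gcloud auth application-default login`).
from google.cloud import bigquery, storage

storage_client = storage.Client()
for bucket_name in (
    "dark-foundry-340620-companydata",            # user uploads PDFs here
    "dark-foundry-340620-companydata-processed",  # processed files land here
    "dark-foundry-340620-tmp",                    # temporary Vision API output (keep empty)
):
    storage_client.create_bucket(bucket_name, location="US")  # location is an assumption

bq_client = bigquery.Client()
dataset = bigquery.Dataset(f"{bq_client.project}.companydata")
dataset.location = "US"
bq_client.create_dataset(dataset, exists_ok=True)
```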

Making Necessary Code Changes

  1. In extract.py, replace the following values with your own info (a filled-in example is sketched after this list):
  • BUCKET="your bucket name where the user will upload PDF files"
  • PROCESSED_BUCKET="your bucket name where files will be stored after processing"
  • TABLE_ID="your project name.your BQ dataset name.your BQ table name"
  • LOGIN_FILE="the key file that you generated in the previous section" (make sure it is in the same folder as extract.py)
  2. In target.py, change the following values:
  • gcs_destination_uri="name of the temporary bucket to be used by the Vision API, prefixed with gs://"
  • LOGIN_FILE="name of your JSON key file generated in the previous section"
  • TMP_BUCKET_NAME="name of the temporary bucket to be used by the Vision API"
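
For reference, this is roughly what those settings could look like once filled in, using the bucket and dataset names from this README. The project ID is inferred from the bucket names, and the table name and key file name are placeholders, so substitute your own.

```python
# extract.py -- example values (placeholders; substitute your own names)
BUCKET = "dark-foundry-340620-companydata"                      # bucket where the user uploads PDFs
PROCESSED_BUCKET = "dark-foundry-340620-companydata-processed"  # bucket for processed files
TABLE_ID = "dark-foundry-340620.companydata.extracted_rows"     # project.dataset.table (table name is a placeholder)
LOGIN_FILE = "service-account-key.json"                         # key file kept in the same folder as extract.py

# target.py -- example values
gcs_destination_uri = "gs://dark-foundry-340620-tmp"            # temporary Vision API bucket, with gs:// prefix
LOGIN_FILE = "service-account-key.json"                         # same JSON key file
TMP_BUCKET_NAME = "dark-foundry-340620-tmp"                     # temporary bucket name without the prefix
```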

Deployment

  1. Install the gcloud CLI on your local PC. Open a command prompt and navigate to the folder where you made your code changes. Type this command: gcloud run deploy. Press return.
  2. Log in to your Google account if prompted. Allow unauthenticated invocations. When the deployment succeeds, you'll receive a URL that will run the code when it is hit by a GET request.
  3. Copy that URL.
  4. Go to main.py in the function-source folder and replace the URL in it with the URL you got from Cloud Run. It should look like requests.get('your url'). (A sketch of this trigger function is shown after this list.)
  5. Create a new Cloud Function and select a Python 3.x runtime. Select Trigger Type: Cloud Storage and Event Type: Finalize/Create.
  6. Select the bucket where the user will upload PDF files. Click Save.
  7. Copy the code from function-source/main.py into the Cloud Function's main.py.
  8. Copy the code from function-source/requirements.txt into the Cloud Function's requirements.txt.
  9. Deploy the function. Test it by uploading a PDF file to your PDF bucket. Refresh the page after a couple of minutes: the file will be gone and new rows will have been added to your BigQuery table.
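
As a rough reference for steps 4 to 7, a minimal sketch of what the Cloud Storage-triggered function amounts to might look like this. The real code is in function-source/main.py; the function name, URL, and timeout below are placeholders.

```python
import requests

# Placeholder URL: use the one printed by `gcloud run deploy`.
CLOUD_RUN_URL = "https://your-cloud-run-service-xxxxxxxx-uc.a.run.app"

def on_pdf_upload(event, context):
    """Background function fired when a file is finalized in the PDF upload bucket."""
    print(f"New object: gs://{event['bucket']}/{event['name']}")
    # Kick off the extraction/cleaning service on Cloud Run with a GET request.
    response = requests.get(CLOUD_RUN_URL, timeout=300)
    response.raise_for_status()
```

The Cloud Function's requirements.txt would then need at least the requests package.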
