Skip to content

mediarain/website-scraper-with-amazon-comprehend

 
 

Repository files navigation

Amazon Comprehend demo

This project lets you analyze websites using Amazon Comprehend.

You can deploy this project using SAM. Once deployed, you're able to query the service by including a header value of 'url', which indicates the url of the website that you want to "understand/comprehend". The Lambda function parses the website represented by the url and runs the body of this website through the Amazon Comprehend service.

NOTE 1: Make sure the owner of the site you're scraping allows you to do so! This is very important. If your use-case is to scrape your customers' sites to provide better recommendations, analysis, etc, and they know about that then you're good to go, otherwise make sure you do your due-dilligence!

NOTE 2: For the most part your test usage should fall under the Free Tier, but check this page to make sure, as well as your current billing levels.

TO-DO

Additions you might want to implement (I might add some of this in the future)

  • Add another property to define an ElasticSearch endpoint and automatically pipe the output there
  • Save the data in Amazon S3 and analyze the data using Amazon Athena and Amazon Quicksight
  • Use Amazon Rekognition to add additional analysis by analyzing images found on the website
  • Understand the websites that you're parsing and extract only the content that makes the most sense. I'm using JSoup so it should make the additional parsing easier
  • Use Amazon Translate for additional insight of international, non-English, websites
  • Use a crawler instead of a scraper - more website coverage

Build and deploy

# Build the code
./gradlew jar

# Package it
aws cloudformation package --template-file sam.yaml --s3-bucket YOUR_BUCKET_NAME --output-template-file /tmp/UpdatedSAMTemplate.yaml

# Deploy it (change the parameter value as needed)
aws cloudformation deploy --template-file /tmp/UpdatedSAMTemplate.yaml --stack-name comprehend-stack --parameter-overrides RegionParameter=us-east-1 --capabilities CAPABILITY_IAM

Test it

# Get the Comprehend output
curl -XGET 'https://$API_GATEWAY_ID.execute-api.$REGION.amazonaws.com/Prod/token/valid' --header "url: url-to-parse"

Releases

No releases published

Packages

No packages published

Languages

  • Kotlin 100.0%