v1.0
Data Science Team’s MVP for Story Squad Release Canvas 2
Functional features by category:
Transcription and Moderation
- Transcription
- Google Cloud Vision OCR (#10)
- Connects to Google Cloud Vision API and uses their Optical Character Recognition model to transcribe the handwritten stories uploaded by users.
- Low confidence flag (#20)
- During transcription, returns Google Cloud Vision’s confidence for each transcribed submission. Raises a flag if that confidence is below 85%, which signals poor image or handwriting quality and, consequently, possibly inaccurate evaluation metrics.
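A minimal sketch of the low confidence flag described above. It assumes the confidences reported by the OCR step have already been collected into a list and that the submission-level confidence is their mean. The function name, the aggregation, and the handling of empty input are illustrative assumptions, not the team's actual code.

```python
from statistics import mean

CONFIDENCE_THRESHOLD = 0.85  # the 85% cutoff stated above


def low_confidence_flag(confidences, threshold=CONFIDENCE_THRESHOLD):
    """Return (overall_confidence, flag) for one transcribed submission.

    `confidences` is a list of floats in [0, 1]. Averaging them is an
    assumption; the source does not say how the values are aggregated.
    """
    if not confidences:
        # Nothing was transcribed, so treat the submission as low confidence.
        return 0.0, True
    overall = mean(confidences)
    # Flag only when the confidence is strictly below the threshold.
    return overall, overall < threshold
```

With this sketch, a submission whose confidences average 0.80 would be flagged, while one averaging exactly 0.85 would not.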
- Text Moderation
- Bad/Inappropriate words filter (#31)
- Adds a method to the Google API service that checks the transcription's word tokens against a list of words known to be inappropriate.
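The word check above could look roughly like the following sketch. The word list here holds harmless placeholders, and the tokenization rule and function name are assumptions; the actual list and method live in the project's Google API service.

```python
import re

# Placeholder entries; the real list of inappropriate words is not shown here.
BAD_WORDS = {"darn", "heck"}


def find_bad_words(text, bad_words=BAD_WORDS):
    """Return the sorted list of flagged words that appear in `text`."""
    # Lowercase and split into alphabetic tokens so that punctuation and
    # capitalization do not hide a match.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return sorted(set(tokens) & set(bad_words))
```

A non-empty result would mark the submission for moderation. Matching whole tokens rather than substrings avoids flagging innocent words that merely contain a listed word.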
- Image Moderation
- Safe Search (#10)
- Connects to Google Cloud Vision API and uses its built-in Safe Search service to flag a user’s uploaded illustration if it contains racy, adult, or violent content.
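As a rough illustration of the flagging step, the sketch below takes Safe Search results that have already been converted to a plain dictionary of likelihood names and decides whether to flag the illustration. The likelihood names follow Cloud Vision's scale; the `LIKELY` cutoff, the dictionary shape, and the function name are assumptions, since the source does not state them.

```python
# Likelihood levels used by Cloud Vision, ordered from least to most likely.
LIKELIHOOD_ORDER = [
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
]

# The three categories named in the feature description.
FLAGGED_CATEGORIES = ("adult", "racy", "violence")


def flag_illustration(annotation, min_level="LIKELY"):
    """Flag an illustration if any watched category reaches `min_level`.

    `annotation` maps category names to likelihood names, for example
    {"adult": "VERY_UNLIKELY", "racy": "LIKELY", "violence": "POSSIBLE"}.
    The cutoff `min_level` is an assumption, not a documented setting.
    """
    cutoff = LIKELIHOOD_ORDER.index(min_level)
    reasons = [
        category
        for category in FLAGGED_CATEGORIES
        if LIKELIHOOD_ORDER.index(annotation.get(category, "UNKNOWN")) >= cutoff
    ]
    return {"is_flagged": bool(reasons), "reasons": reasons}
```

Returning the triggering categories alongside the flag lets a moderator see why an image was held back.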
Complexity Analysis
- Complexity Metric - “Squad Score” (#18)
- Cleans transcribed text and returns a custom complexity score
- This baseline implementation includes four features generated only with Python and Pandas. It is intended to be iterated upon.
- Given the limited number of labels available to train a model or fit a formula, this formula uses only features that either appear in validated complexity models or were requested by stakeholders, and it uses them in the form least susceptible to errors in children's writing, handwriting, and transcription (e.g., measuring length in characters rather than in syllables or words).
- Features:
- sl: story length (in characters)
- awl: average word length (in characters)
- qn: number of quotes
- uw: count of unique words (longer than two characters)
- Weights:
- Squad Score is initialized with a weight of 1 for each feature, as there were not enough labeled examples to tune the weights in a generalizable way.
- A constant “range scaler” of 30 is also applied to every term to bring the overall Squad Score closer to a 0-100 range, purely to make the metric easier to read.
- Formula: Squad Score = (sl × 1 × 30) + (awl × 1 × 30) + (qn × 1 × 30) + (uw × 1 × 30)
- Range: the score has a lower bound of 0 and no upper bound.
- Metrics:
- The only labels available at the time of development were a 1-25 ranking of 25 stories from the training set. Applying the Squad Score formula to these 25 stories yielded a correlation coefficient of -0.60 between scores and rankings.
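The feature definitions and formula above can be sketched in plain Python as follows. Several details are assumptions because the source does not specify them: the cleaning step is reduced to lowercasing and extracting alphabetic tokens, "quotes" are counted as double quotation mark characters, and no feature normalization is applied before weighting. The function and constant names are illustrative.

```python
import re

RANGE_SCALER = 30                                 # the "range scaler" above
WEIGHTS = {"sl": 1, "awl": 1, "qn": 1, "uw": 1}   # all weights start at 1
QUOTE_CHARS = '"\u201c\u201d'                     # straight and curly double quotes


def squad_features(text):
    """Compute the four Squad Score features for one transcribed story."""
    # Assumed cleaning: lowercase, keep alphabetic tokens (with apostrophes).
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return {
        "sl": len(text),                                       # story length in characters
        "awl": sum(map(len, words)) / len(words) if words else 0.0,
        "qn": sum(text.count(q) for q in QUOTE_CHARS),         # quotation marks (assumed)
        "uw": len({w for w in words if len(w) > 2}),           # unique words over two characters
    }


def squad_score(text):
    """Apply the formula: each feature times its weight times the range scaler."""
    features = squad_features(text)
    return sum(features[name] * WEIGHTS[name] * RANGE_SCALER for name in WEIGHTS)
```

For the sentence `The cat said "hi" to the dog.`, this sketch gives sl = 29, awl = 20/7, qn = 2, and uw = 4. Keeping the weights in a dictionary makes it straightforward to tune them later if more labels become available.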
Deployment
- API Endpoints
- Submission/text (#19)
- REST API endpoint that transcribes a submission, computes its Squad Score, and returns the results to the web backend.
- Submission/illustration (#19)
- REST API endpoint that submits the illustration to the Google Cloud Vision API's Safe Search service to flag inappropriate content in user submissions.
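The flow behind the Submission/text endpoint might be sketched, independent of any web framework, as below. Everything named here is hypothetical: the handler name, the response field names, and the `transcribe` and `score` callables, which stand in for the OCR and Squad Score steps and are passed in so the sketch runs without external services.

```python
def handle_text_submission(image_bytes, transcribe, score, threshold=0.85):
    """Hypothetical handler: transcribe, score, and build the response body.

    `transcribe(image_bytes)` must return (text, confidence) and
    `score(text)` must return a number. Both are stand-ins for the
    real services, and the response keys are illustrative only.
    """
    text, confidence = transcribe(image_bytes)
    return {
        "transcription": text,
        "confidence": confidence,
        "low_confidence": confidence < threshold,
        "squad_score": score(text),
    }


# Example with stub services in place of the real OCR and scoring code.
response = handle_text_submission(
    b"fake-image-bytes",
    transcribe=lambda _image: ("Once upon a time", 0.8),
    score=len,
)
```

Injecting the two steps as parameters keeps the handler testable without network access, which is one possible design choice rather than a description of the deployed code.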
- GitHub Actions
- Header Security Token Checking
- AuthRouteHandler (#27)
- Checks a request’s headers against a known security token before allowing access to the API endpoints.
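A minimal sketch of such a header check, assuming the token arrives in an `Authorization` header and the expected value is supplied by the caller (for example from an environment variable). The header name and function name are assumptions; the source only says that headers are compared against a known token.

```python
import hmac


def is_authorized(headers, expected_token, header_name="Authorization"):
    """Return True only if the request carries the expected security token.

    `headers` is a plain dict of request headers. The lookup here is
    case-sensitive, which a real web framework would normally handle.
    """
    if not expected_token:
        # Refuse everything if no token is configured, rather than
        # letting an empty header match an empty secret.
        return False
    supplied = headers.get(header_name, "")
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the token were correct.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

An endpoint would call this before doing any work and reject the request (for instance with a 403 response) when it returns False.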