keyphrases in sustained complaints #1
Comments
We want to ignore field headers that are likely to congest the common string detection, so my thinking was to combine Both are technically written by the DPA, but the I think the Open to suggestions! Maybe it's more reasonable to just use the allegations only, since that's more connected to the language of the original complaint? Zac was looking for ways to improve the language of submitted complaints based on what has historically received more traction in terms of sustained allegations.
May also be worth noting here that when allegations are added by the DPA (or OCC, since we also have those older records), they tend to be sustained more often because of how they come to exist. It might be best to exclude these if our interest is in features of the original complaint/allegation language, though I'm not sure, so I will process them all together for now.
My initial thought is we should treat allegations and findings of fact separately rather than combining them. Although both are written by DPA, in terms of strategizing how to word complaints we submit in the future, we'll want to focus on the language in allegations. That said, it might be the case that there is language in the findings that points to specific ways of presenting important evidence, which we can adopt when we submit complaints. I just think it's worth trying it with just the allegations for this purpose (in addition to the combined version, which we can also set up; the code should not be much different).

Ideally, we structure this as a classification problem, where we classify whether the complaint is sustained or not. We use the extracted keyphrases as features, along with: type of alleged misconduct, OCC vs. DPA, and whether the complaint was original or added by the agency. Those are the three major features that I imagine we'd want to be able to "control" for in some way (maybe there are others?). From there we can examine variable importance for specific keyphrases, and given your observation here, we should stratify those by whether the complaint was added or not (and perhaps focus on DPA vs. OCC).

In addition to auto-extracted keyphrases, make sure we include those that Zac has already come up with as features.
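To make the feature setup above concrete, here is a minimal sketch of how keyphrase indicators and the three control variables could be assembled per allegation, plus a descriptive sustained-rate check stratified by agency and by whether the allegation was added. The record keys (`allegation`, `misconduct_type`, `agency`, `added_by_agency`, `sustained`) are hypothetical names chosen for illustration, not the project's actual schema, and this stops short of fitting a classifier.

```python
from collections import defaultdict


def build_features(record, keyphrases):
    """Turn one allegation record into a flat feature dict.

    `record` keys are assumed for illustration only; adapt to the real schema.
    Keyphrase features are simple presence indicators (case-insensitive).
    """
    text = record["allegation"].lower()
    row = {f"kp:{kp}": int(kp.lower() in text) for kp in keyphrases}
    # The three controls named in the thread.
    row["misconduct_type"] = record["misconduct_type"]
    row["agency"] = record["agency"]  # e.g. "OCC" or "DPA"
    row["added_by_agency"] = int(record["added_by_agency"])
    return row


def sustained_rate_by_stratum(records, keyphrase):
    """Sustained rate among allegations containing `keyphrase`,
    split by (agency, added_by_agency)."""
    counts = defaultdict(lambda: [0, 0])  # stratum -> [sustained, total]
    kp = keyphrase.lower()
    for r in records:
        if kp in r["allegation"].lower():
            key = (r["agency"], bool(r["added_by_agency"]))
            counts[key][0] += int(r["sustained"])
            counts[key][1] += 1
    return {k: s / t for k, (s, t) in counts.items()}
```

The stratified rates are only a descriptive first look; variable importance from a fitted model would still be needed to account for the controls jointly. Manually suggested phrases can be passed in the same `keyphrases` list as the auto-extracted ones.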
That makes sense, I'll work with just the I did remember the goal to make it a classification problem and agree that I'm fiddling with the model parameters to make sure the extracted keyphrases are useful (top two results are consistently "officer" and "complainant") and I'll go back to confirm which of the existing phrases we indicate were suggested by Zac so we can pull those in. Thanks for the feedback!
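One possible way to deal with the uninformative top results, apart from tuning the extractor itself, is to post-filter the extracted phrases. The sketch below assumes the extractor returns `(phrase, score)` pairs and drops any phrase made up entirely of generic terms. The `GENERIC_TERMS` set holds only the two words mentioned above, and `drop_generic` is a hypothetical helper name.

```python
# Terms reported in the thread as dominating the results; extend as needed.
GENERIC_TERMS = {"officer", "complainant"}


def drop_generic(phrases, generic=GENERIC_TERMS):
    """Remove (phrase, score) pairs whose tokens are all generic terms.

    Matching is exact on lowercased whitespace tokens, so plural or
    inflected forms (e.g. "officers") would need to be added explicitly.
    """
    kept = []
    for phrase, score in phrases:
        tokens = phrase.lower().split()
        if tokens and all(t in generic for t in tokens):
            continue  # nothing but generic words: not useful as a feature
        kept.append((phrase, score))
    return kept
```

Phrases that merely contain a generic word, such as "officer failed", are kept, since the surrounding words may still carry signal.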
Zac asked last week if we could check whether there are common phrases present in the allegations that contribute to whether an allegation is sustained.
There's an open source Python library, pke, that could be useful for identifying common phrases. I took a pass at it with the TopicRank extractor but found that we may be incorrectly separating allegations that trail onto the next page, so I'll need to fix that issue before continuing to process phrases from the allegations.
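For reference, a minimal sketch of running TopicRank through pke on each allegation might look like the following. It follows the usage pattern in pke's README (`load_document`, `candidate_selection`, `candidate_weighting`, `get_n_best`) and assumes pke and an English spaCy model are installed. The function names and the `{allegation_id: text}` input shape are assumptions for illustration, not the code used in this project.

```python
def top_keyphrases(text, n=10):
    """Return the n best (phrase, score) pairs from pke's TopicRank."""
    import pke  # imported lazily so this module loads without pke installed

    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language="en")
    extractor.candidate_selection()   # default candidates: noun/adjective sequences
    extractor.candidate_weighting()   # cluster candidates into topics and rank them
    return extractor.get_n_best(n=n)


def phrases_by_allegation(allegations, extract=top_keyphrases, n=10):
    """Map each allegation id to its extracted phrases, dropping scores.

    `allegations` is assumed to be a dict of {allegation_id: full_text},
    with text that trails onto the next page already joined.
    `extract` can be swapped out, e.g. for a different pke extractor.
    """
    return {
        aid: [phrase for phrase, _score in extract(text, n)]
        for aid, text in allegations.items()
    }
```

Passing the extractor as a parameter keeps the per-allegation loop independent of pke, so the same loop can be reused if a different extractor works better once the page-break issue is fixed.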