Improve parser detection of unhandled content #80
base: master
pyscraper/new_hansard.py
Outdated
@@ -617,6 +617,15 @@ def parse_question(self, question):
    p.text = re.sub('\n', ' ', text)
    tag.append(p)

    if len(para) > 1:
        for p in para:
Does this not double the output in the cases it's trying to catch? e.g.
<Question><hs_Para><Number>Q2</Number>.
<Uin>[908984]</Uin>
<Member><B>Mr
Steve Reed</B> (Croydon North) (Lab):</Member>
<QuestionText></QuestionText>I
add my condolences to those already expressed about the former Father
of the House, and I welcome
my<hs_TimeCode time="2017-03-01T12:21:31"></hs_TimeCode> new hon.
Friend the Member for Stoke-on-Trent Central (Gareth Snell) to his
place.</hs_Para><hs_Para>Young
black men who use mental health services are more likely than other
people to be subject to detention, extreme forms of medication and
severe physical restraint, and, in extreme cases, this has led to
death, including that of my constituent Seni Lewis. Too many black
people with mental ill health are afraid to seek treatment from a
service they fear will not treat them fairly. Will the Prime Minister
meet me and some of the affected families to discuss the need for an
inquiry into institutional racism in the mental health
service<hs_TimeCode time="2017-03-01T12:22:18"></hs_TimeCode>?</hs_Para></Question>
The following-sibling would catch the first para after QuestionText, and then this loop would catch it again.
Fix for the parser failing to pick up all the text if there is more than one hs_Para element inside a Question tag.
Store the UID and HRSContentID of handled tags so we can later compare to a list of all IDs in the document
Get a list of all tag IDs in the document and compare to the list we've processed and throw an exception if they don't match.
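The check described in these two commits can be sketched roughly as follows. This is a minimal illustration using the standard library's `xml.etree.ElementTree`, not the code in `new_hansard.py`: the `SeenTracker` class, its method names, the `UnhandledTagError` exception and the sample document are all hypothetical. Only a `UID` attribute is tracked here for brevity, and `iter()` stands in for the XPath lookup the PR uses.

```python
import xml.etree.ElementTree as ET


class UnhandledTagError(Exception):
    """Raised when the document contains IDs the parser never marked as seen."""


class SeenTracker:
    """Hypothetical helper: records the IDs of handled tags."""

    def __init__(self):
        self.seen = set()

    def mark_seen(self, element):
        # Assumption: each tag carries its ID in a "UID" attribute.
        uid = element.get("UID")
        if uid is not None:
            self.seen.add(uid)

    def check(self, root):
        # Collect every ID in the document, then compare with what was handled.
        all_ids = {el.get("UID") for el in root.iter() if el.get("UID") is not None}
        missing = all_ids - self.seen
        if missing:
            raise UnhandledTagError("unhandled tag IDs: " + ", ".join(sorted(missing)))


# Made-up document: the parser "handles" Doc and Para but never sees Note.
doc = ET.fromstring('<Doc UID="1"><Para UID="2">a</Para><Note UID="3">b</Note></Doc>')
tracker = SeenTracker()
tracker.mark_seen(doc)
tracker.mark_seen(doc.find("Para"))

message = None
try:
    tracker.check(doc)
except UnhandledTagError as exc:
    message = str(exc)
```

Failing loudly on the set difference means a tag type the parser has never encountered is reported instead of being silently dropped.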
9f0e0a4 to d511b3e
Copes with tags that are mostly processed from inside another tag
c4476de to f96f8f3
There are lots of tags that we don't parse directly, either because we're only interested in their sub-tags or because they are parsed as part of the parent. Mark these as seen.
We didn't use namespaces before so they weren't being parsed properly. Correct this and track the tags.
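As a rough illustration of why un-namespaced lookups miss elements, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. The namespace URI, the sample document and the `tag_name_no_ns` helper are hypothetical; the helper is only a guess at the kind of thing `get_tag_name_no_ns` in the diff might do.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this example.
NS = {"h": "http://example.org/hansard"}


def tag_name_no_ns(element):
    """Return the element's tag with any leading '{namespace}' removed."""
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag.split("}", 1)[1]
    return tag


root = ET.fromstring(
    '<h:Question xmlns:h="http://example.org/hansard">'
    "<h:Number>Q2</h:Number>"
    "</h:Question>"
)

# A lookup without the namespace does not match the namespaced child.
plain_lookup = root.find("Number")

# Supplying the prefix mapping finds it.
ns_lookup = root.find("h:Number", NS)

# Bare tag names, e.g. for dispatching to a per-tag handler.
names = [tag_name_no_ns(el) for el in root.iter()]
```

Here `plain_lookup` is `None` while `ns_lookup` is the `Number` element, which is the kind of silent miss that namespace-aware parsing avoids.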
Make sure we are coping with questions where part of the question isn't in the tail of QuestionText but is in following tags. Also cope with oddities like multiple question number tags.
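One way to picture the "tail plus following tags" case is the sketch below. It is an assumption-laden illustration with `xml.etree.ElementTree`, not the parser's actual logic: the `question_text` function is hypothetical, namespaces are omitted, and the sample paragraph is a shortened version of the one quoted in the review comment.

```python
import xml.etree.ElementTree as ET


def question_text(para):
    """Collect the text after QuestionText: its tail plus the text and
    tails of every sibling element that follows it."""
    parts = []
    found = False
    for child in para:
        if found:
            # Text inside the following tag, then the text trailing it.
            parts.append("".join(child.itertext()))
            parts.append(child.tail or "")
        elif child.tag == "QuestionText":
            found = True
            parts.append(child.tail or "")
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join("".join(parts).split())


para = ET.fromstring(
    "<hs_Para><Number>Q2</Number><QuestionText></QuestionText>I welcome\n"
    'my<hs_TimeCode time="2017-03-01T12:21:31"></hs_TimeCode> new hon.\n'
    "Friend <B>the Member</B> to his place.</hs_Para>"
)
text = question_text(para)
```

Reading only `QuestionText.tail` would stop at `my`, because the time code element splits the sentence; walking the later siblings recovers the rest.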
Clause tags actually relate to the text after them so ignore them at the top level and then go back and parse them as part of the following heading tag. Then add them as the first part of the first speech under the heading. Fixes #53
If there's more than one heading or procedure in a new debate tag then make those into paragraphs in the first speech of the debate.
Rather than just parsing it all into a single line of text, parse all the paragraphs and indents so that we retain a bit more structure.
Scans the list of seen files, picks out the latest one and re-parses it. Assumes that the files in the list are in date order.
f96f8f3 to 97d679c
)
for t in following_tags:
    tag_name = self.get_tag_name_no_ns(t)
    self.handle_tag(tag_name, t)
I've adapted part of this commit in master to fix a recent issue. Note this doesn't fully work, in that any subsequent paragraphs would become a new no-speaker speech. What I've done in e8acc13 is make sure this uses new_speech() so current_speech is set and then they'll be attached correctly. This simplifies the function a bit too.
403ee7b to 0c4983b
bc05e4e to cf4da9e
The parser now tracks all the tags it sees as it goes using tag IDs and then compares those to a list of IDs extracted using XPath. If there is a difference between the lists it throws an Exception.
There are also a number of parser improvements in here, found in the process of making sure that it parsed things correctly:
It also adds a script to make re-parsing easier.
Fixes #54
Fixes #66