Add IL Israeli Knesset Corpus sample #881
base: data
@GiliGoldin, nice to see that our ParlaMint family is about to grow. Your initial commit failed on the file size limit (100 MB is the maximum allowed); your file size is […]. I will send you an invitation to this repository with limited access, but you will be able to see the action logs.
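A quick local check before committing can catch files that would trip the 100 MB limit mentioned above. This is only a minimal sketch: the function name `oversized_files` is hypothetical, the limit is interpreted as 100 MiB (an assumption), and the actual check run by the repository's actions may differ.

```python
from pathlib import Path

# Assumption: the "100 MB" limit is interpreted here as 100 MiB.
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit=LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files under root larger than limit.

    Files inside a .git directory are skipped. Results are sorted largest first.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if ".git" in path.relative_to(root).parts:
            continue
        size = path.stat().st_size
        if size > limit:
            hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in oversized_files("."):
        print(f"{size / (1024 * 1024):.1f} MiB  {path}")
```

Running it from the root of the working copy lists any file that would need to be split or compressed before pushing.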
<seg xml:id="seg.id-8005043c-61d6-48d2-9674-c5ff64c438e6">אני פותחת את ישיבת ועדת החינוך, התרבות והספורט של הכנסת.</seg>
<seg xml:id="seg.id-997bd044-dd0c-46d3-aaf4-cca2871481d6">קודם כל, אנחנו נפתח בהקמות שתי ועדות משנה.</seg>
<seg xml:id="seg.id-145ec7a4-00e8-4797-a646-e34a01239a91">הקמת ועדת משנה ראשונה היא לחוק חינוך ממלכתי דתי.</seg>
<seg xml:id="seg.ec44704c-bf27-4135-85ef-90a2ec33938e">אבקש להביא לידיעת</seg>
The <seg> element corresponds to paragraphs in the source transcription: https://clarin-eric.github.io/ParlaMint/#sec-uterrance
The utterances are then segmented using the <seg> element, which encodes the paragraphs of the source transcription. Even if the source files do not contain paragraph markings, each speech should contain at least one segment.
You are using seg for some sub-sentence segmentation, which causes wrong linguistic annotations, because אבקש להביא לידיעת is a sentence fragment (I believe).
This utterance should contain two paragraphs according to the source HTML: (https://oknesset.org/meetings/2/1/2166933.html#speech-2166933-7)
BTW, the documentation recommends a different ID structure. It is better to use a structure like this: ParlaMint-CZ_2020-01-22-ps2017-040-02-005-012.u1.p1
This makes debugging more manageable, and you can be sure that the ID is unique. (I don't know how your hashes are constructed, but I guess the "good morning" sentence repeats in the corpus, so if the ID is built only from the text content, it is not unique.)
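To illustrate the point about uniqueness, here is a minimal sketch of positional IDs in the style of the example above. The helper names (`utterance_id`, `segment_id`, `content_hash_id`) are hypothetical, and the content-hash function is only an assumed stand-in for an ID derived purely from text; it is not a claim about how the IDs in this PR were actually generated.

```python
import hashlib


def utterance_id(doc_id, u_index):
    """Build an utterance ID such as '<doc_id>.u1' (1-based index)."""
    return f"{doc_id}.u{u_index}"


def segment_id(doc_id, u_index, p_index):
    """Build a paragraph ID such as '<doc_id>.u1.p1' (1-based indices)."""
    return f"{utterance_id(doc_id, u_index)}.p{p_index}"


def content_hash_id(text):
    """Hypothetical ID derived only from the text content."""
    return "seg." + hashlib.sha1(text.encode("utf-8")).hexdigest()


doc = "ParlaMint-CZ_2020-01-22-ps2017-040-02-005-012"

# Positional IDs differ even when the text is identical.
first = segment_id(doc, 1, 1)
second = segment_id(doc, 2, 1)
assert first == "ParlaMint-CZ_2020-01-22-ps2017-040-02-005-012.u1.p1"
assert first != second

# IDs built only from text collide whenever a sentence repeats.
assert content_hash_id("Good morning.") == content_hash_id("Good morning.")
```

Because a positional ID encodes the document, utterance, and paragraph, it also tells you where to look in the corpus when a validation error mentions it.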
I'm converting my already processed corpus to ParlaMint. The texts I have are already segmented into sentences, and I do not have the information regarding the paragraphs. I would like to keep each sentence independently, both as a sentence and as its own paragraph in the speech. Is that OK?
Are the texts segmented correctly? Isn't this a bug in the source Knesset Corpus?
Converting your corpus into the ParlaMint format is an excellent opportunity to discover bugs in the source corpus, so that you can fix them both in the source corpus (in the next release) and in the ParlaMint version.
From ParlaMint's point of view (my opinion), it makes better sense to create one paragraph (seg) per utterance (u) if you decide to ignore paragraphs in the original transcriptions (because you are using an existing corpus which does not care about paragraphs).
But considering that Hebrew is written right-to-left, I am not a hundred per cent sure that it is a good idea, because I don't see all the consequences that could arise...
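The one-paragraph-per-utterance suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `merge_segs` is hypothetical, each seg is assumed to hold plain text with no child elements, the TEI namespace is assumed, sentences are joined with a single space in document order, and the merged seg is given an ID of the form `<utterance-id>.p1`.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def merge_segs(u):
    """Collapse all direct <seg> children of one <u> element into a single <seg>.

    Assumes every seg contains only text. The first seg is kept and receives
    the joined text; the remaining segs are removed from the utterance.
    """
    segs = u.findall(f"{{{TEI}}}seg")
    if not segs:
        return
    parts = [(s.text or "").strip() for s in segs]
    first = segs[0]
    first.text = " ".join(p for p in parts if p)
    for extra in segs[1:]:
        u.remove(extra)
    u_id = u.get(XML_ID)
    if u_id:
        # Assumed ID pattern: '<utterance-id>.p1' for the single paragraph.
        first.set(XML_ID, f"{u_id}.p1")


sample = (
    '<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="doc.u1">'
    '<seg xml:id="a">First sentence.</seg>'
    '<seg xml:id="b">Second sentence.</seg>'
    "</u>"
)
u = ET.fromstring(sample)
merge_segs(u)
merged = u.findall(f"{{{TEI}}}seg")
assert len(merged) == 1
assert merged[0].text == "First sentence. Second sentence."
```

The strings are concatenated in their stored order, so the display direction of the script does not change the join itself; whether downstream annotation tools handle the merged paragraph as expected would still need to be checked on real data.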
Thank you for your feedback and suggestions. I have merged all the sentences back into one paragraph per utterance, as recommended.
Additionally, I have fixed the ud logic in my code that caused some of the issues, as well as improved the fallback logic.
Regarding the segmentation, you are correct that there are some bugs, as the process was done automatically. However, the error rate is relatively low. Since the data is already segmented into sentences, addressing these issues at this stage would be quite challenging. I do plan to address this in the future as part of ongoing improvements.
Please let me know if anything else is needed for the data to meet the ParlaMint standards or to proceed with merging this pull request.
Please let me know if anything else is needed for the data to meet the ParlaMint standards or to proceed with merging this pull request.
Tonight or tomorrow, I will give you more comprehensive feedback, including bug reports and suggestions for improvement, to make your corpus more comparable with other ParlaMint corpora.
No description provided.