Add IL Israeli Knesset Corpus sample #881

Open: wants to merge 15 commits into base: data

Conversation

GiliGoldin
Collaborator

No description provided.

@matyaskopp
Collaborator

@GiliGoldin, nice to see that our ParlaMint family is about to grow.

Your initial commit failed the file-size check. The maximum allowed is 100 MB, but Samples/ParlaMint-IL is 366 MB: https://github.com/clarin-eric/ParlaMint/actions/runs/11895256085/job/33144542662?pr=881#step:4:30

I will send you an invitation to this repository with limited access, but you will be able to see the action logs.
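
For anyone who wants to check a sample directory locally before pushing, here is a minimal sketch of such a size check. It is not the repository's actual workflow script: the function names are hypothetical, and treating "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
import os

# Assumption: the 100 MB limit is counted in binary megabytes.
MAX_SAMPLE_BYTES = 100 * 1024 * 1024


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def within_limit(path, limit=MAX_SAMPLE_BYTES):
    """Return True if the directory's total size does not exceed limit."""
    return directory_size(path) <= limit
```

Calling `within_limit("Samples/ParlaMint-IL")` from the repository root would then tell you whether the sample is likely to pass, under the assumptions above.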

<seg xml:id="seg.id-8005043c-61d6-48d2-9674-c5ff64c438e6">אני פותחת את ישיבת ועדת החינוך, התרבות והספורט של הכנסת.</seg>
<seg xml:id="seg.id-997bd044-dd0c-46d3-aaf4-cca2871481d6">קודם כל, אנחנו נפתח בהקמות שתי ועדות משנה.</seg>
<seg xml:id="seg.id-145ec7a4-00e8-4797-a646-e34a01239a91">הקמת ועדת משנה ראשונה היא לחוק חינוך ממלכתי דתי.</seg>
<seg xml:id="seg.ec44704c-bf27-4135-85ef-90a2ec33938e">אבקש להביא לידיעת</seg>
Collaborator

<seg> element corresponds to paragraphs in source transcription: https://clarin-eric.github.io/ParlaMint/#sec-uterrance

> The utterances are then segmented using the <seg> element, which encodes the paragraphs of the source transcription. Even if the source files do not contain paragraph markings, each speech should contain at least one segment.

You are using seg for sub-sentence segmentation, which causes incorrect linguistic annotation, because אבקש להביא לידיעת ("I would like to bring to the attention of") is a sentence fragment (I believe).

This utterance should contain two paragraphs according to the source HTML: (https://oknesset.org/meetings/2/1/2166933.html#speech-2166933-7)

BTW, the documentation recommends a different ID structure. It is better to use something like this: ParlaMint-CZ_2020-01-22-ps2017-040-02-005-012.u1.p1
It makes debugging more manageable, and you can be sure that the ID is unique (I don't know how your hashes are constructed, but I guess the "good morning" sentence repeats in the corpus, so if a hash is built only from the text content, it is not unique).
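
A minimal sketch of how positional IDs in the style of the example above could be generated. The function name `assign_ids` and the document ID used in the usage note are hypothetical, and the input is assumed to be a list of utterances, each a list of paragraph strings.

```python
def assign_ids(doc_id, utterances):
    """Pair each paragraph with an ID of the form <doc_id>.u<N>.p<M>.

    utterances is a list of utterances, each a list of paragraph strings.
    IDs depend only on position, so repeated text still gets distinct IDs.
    """
    result = []
    for u_idx, paragraphs in enumerate(utterances, start=1):
        for p_idx, text in enumerate(paragraphs, start=1):
            result.append((f"{doc_id}.u{u_idx}.p{p_idx}", text))
    return result
```

For example, `assign_ids("ParlaMint-IL_example", [["good morning"], ["good morning"]])` gives the two identical sentences the IDs `ParlaMint-IL_example.u1.p1` and `ParlaMint-IL_example.u2.p1`, whereas a hash built only from the text would give both the same ID.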

Collaborator Author

I'm converting my already processed corpus to ParlaMint. The texts I have are already segmented into sentences, and I do not have any information about the paragraphs. I would like to keep each sentence independently, both as a sentence and as its own paragraph in the speech. Is that OK?

Collaborator

Are the texts segmented correctly? Isn't this a bug in the source Knesset Corpus?
Converting your corpus into the ParlaMint format is an excellent opportunity to discover bugs in the source corpus, so that you can fix them both in the source corpus (in the next release) and in the ParlaMint version.

From ParlaMint's point of view (my opinion), it makes more sense to create one paragraph (seg) per utterance (u) if you decide to ignore the paragraphs in the original transcriptions (because you are using an existing corpus that does not track paragraphs).
But considering that Hebrew is written right-to-left, I am not a hundred per cent sure that it is a good idea, because I don't see all the consequences that could arise...
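
A rough sketch of what collapsing an utterance into a single seg could look like with Python's standard `xml.etree.ElementTree`. It assumes plain-text seg elements in the TEI namespace without linguistic annotation inside them, joins the sentences with a single space, and uses a hypothetical `.p1` suffix for the merged ID. It is an illustration, not the converter used in this PR.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def merge_segments(u):
    """Replace all <seg> children of an utterance with one merged <seg>."""
    segs = u.findall(f"{{{TEI_NS}}}seg")
    if len(segs) <= 1:
        return  # nothing to merge
    # Assumption: segs hold plain text only; any inner markup is flattened.
    text = " ".join("".join(s.itertext()).strip() for s in segs)
    # Keep the merged seg where the first original seg was.
    first_pos = list(u).index(segs[0])
    for s in segs:
        u.remove(s)
    merged = ET.Element(f"{{{TEI_NS}}}seg")
    u_id = u.get(XML_ID)
    if u_id:
        merged.set(XML_ID, f"{u_id}.p1")  # hypothetical ID scheme
    merged.text = text
    u.insert(first_pos, merged)
```

The function modifies the `u` element in place. Text is concatenated in document order, so the logical order of the Hebrew sentences is preserved regardless of how they are displayed.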

Collaborator Author

Thank you for your feedback and suggestions. I have merged all the sentences back into one paragraph per utterance, as recommended.
Additionally, I have fixed the UD logic in my code that caused some of the issues, and improved the fallback logic.
Regarding the segmentation, you are correct that there are some bugs, as the process was done automatically. However, the error rate is relatively low. Since the data is already segmented into sentences, addressing these issues at this stage would be quite challenging. I do plan to address this in the future as part of ongoing improvements.
Please let me know if anything else is needed for the data to meet the ParlaMint standards or to proceed with merging this pull request.

Collaborator

> Please let me know if anything else is needed for the data to meet the ParlaMint standards or to proceed with merging this pull request.

Tonight or tomorrow, I will give you more comprehensive feedback, including bug reports and suggestions for improvement to make your corpus more comparable with other ParlaMint corpora.

GiliGoldin marked this pull request as draft on November 21, 2024 08:24
GiliGoldin marked this pull request as ready for review on November 21, 2024 08:28
2 participants