Add IL Israeli Knesset Corpus sample #881

GiliGoldin · 2024-11-18T15:07:36Z

No description provided.

matyaskopp · 2024-11-18T15:23:17Z

@GiliGoldin, nice to see that our ParlaMint family is about to grow.

Your initial commit failed the file size check: the maximum allowed size is 100 MB, but Samples/ParlaMint-IL is 366 MB: https://github.com/clarin-eric/ParlaMint/actions/runs/11895256085/job/33144542662?pr=881#step:4:30

I will send you an invitation to this repository with limited access, but you will be able to see the action logs.
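
To check the sample size locally before pushing, something like the following could be used. This is only a minimal sketch and not the repository's actual CI action; the function names are hypothetical, and treating "100 MB" as 100 * 1024 * 1024 bytes is an assumption.

```python
import os

# Limit mentioned in the comment above; reading "MB" as MiB is an assumption.
MAX_BYTES = 100 * 1024 * 1024


def directory_size(path: str) -> int:
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def exceeds_limit(path: str, limit: int = MAX_BYTES) -> bool:
    """Return True if the directory is larger than the limit."""
    return directory_size(path) > limit
```

For example, `exceeds_limit("Samples/ParlaMint-IL")` would report whether the sample directory is over the limit.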

Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml

matyaskopp · 2024-11-20T13:44:15Z

Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml

+               <seg xml:id="seg.id-8005043c-61d6-48d2-9674-c5ff64c438e6">אני פותחת את ישיבת ועדת החינוך, התרבות והספורט של הכנסת.</seg>
+               <seg xml:id="seg.id-997bd044-dd0c-46d3-aaf4-cca2871481d6">קודם כל, אנחנו נפתח בהקמות שתי ועדות משנה.</seg>
+               <seg xml:id="seg.id-145ec7a4-00e8-4797-a646-e34a01239a91">הקמת ועדת משנה ראשונה היא לחוק חינוך ממלכתי דתי.</seg>
+               <seg xml:id="seg.ec44704c-bf27-4135-85ef-90a2ec33938e">אבקש להביא לידיעת</seg>


The <seg> element corresponds to paragraphs in the source transcription: https://clarin-eric.github.io/ParlaMint/#sec-uterrance

The utterances are then segmented using the <seg> element, which encodes the paragraphs of the source transcription. Even if the source files do not contain paragraph markings, each speech should contain at least one segment.

You are using <seg> for sub-sentence segmentation, which leads to wrong linguistic annotations, because אבקש להביא לידיעת is a sentence fragment (I believe).

This utterance should contain two paragraphs according to the source HTML: (https://oknesset.org/meetings/2/1/2166933.html#speech-2166933-7)

BTW, the documentation recommends a different ID structure. It is better to use something like this: ParlaMint-CZ_2020-01-22-ps2017-040-02-005-012.u1.p1
It makes debugging more manageable, and you can be sure that the ID is unique. (I don't know how your hashes are constructed, but I guess the "good morning" sentence repeats in the corpus, so if an ID is built only from the text content, it is not unique.)
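
To illustrate the uniqueness point, here is a small sketch. It is an assumption-laden example, not a description of how the IDs in this PR are actually generated: `content_hash_id` is a hypothetical content-only scheme, `positional_id` follows the `.uN.pM` pattern of the example above, and the utterance texts are made up.

```python
import hashlib


def content_hash_id(text: str) -> str:
    """Hypothetical content-only ID: identical text always yields the same ID."""
    return "seg." + hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]


def positional_id(doc_id: str, u_index: int, p_index: int) -> str:
    """Position-based ID in the style of the example above: <doc>.u<N>.p<M>."""
    return f"{doc_id}.u{u_index}.p{p_index}"


doc_id = "ParlaMint-IL_2021-12-21"  # taken from the sample file name
# Made-up utterances; the greeting repeats, as it likely does in a real corpus.
utterances = [["Good morning."], ["Good morning.", "Thank you."]]

hashed = [content_hash_id(p) for u in utterances for p in u]
positional = [
    positional_id(doc_id, ui, pi)
    for ui, u in enumerate(utterances, start=1)
    for pi, _ in enumerate(u, start=1)
]

# Content-only hashes collide on the repeated sentence; positional IDs do not.
print(len(set(hashed)), len(hashed))
print(positional)
```

The positional scheme also makes it possible to locate an element from its ID alone, which is what helps with debugging.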

I'm converting my already processed corpus to ParlaMint. The texts I have are already segmented into sentences, and I do not have any information about the paragraphs. I would like to keep each sentence both as a sentence and as its own paragraph in the speech. Is that OK?

Are the texts segmented correctly? Isn't this a bug in the source Knesset Corpus?
Converting your corpus into the ParlaMint format is an excellent opportunity to discover bugs in the source corpus, so that you can fix them both in the source corpus (in the next release) and in the ParlaMint version.

From ParlaMint's point of view (my opinion), it makes more sense to create one paragraph (seg) per utterance (u) if you decide to ignore the paragraphs in the original transcriptions (because you are using an existing corpus that does not preserve paragraphs).
But considering that Hebrew is written right to left, I am not a hundred per cent sure that it is a good idea, because I cannot foresee all the consequences that may arise...
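
A rough sketch of the "one seg per utterance" idea for the plain-text (non-.ana) files, assuming the standard TEI namespace and that each <seg> holds only text. The function `merge_segs` and the `.p1` suffix are hypothetical and not part of any ParlaMint tooling; .ana files, where <seg> contains <s> and <w> children, would need a different treatment.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def merge_segs(u: ET.Element) -> None:
    """Collapse all text-only <seg> children of one <u> into a single <seg>."""
    segs = u.findall(f"{{{TEI_NS}}}seg")
    if len(segs) < 2:
        return
    first = segs[0]
    texts = [(s.text or "").strip() for s in segs]
    # Strings are joined in logical (storage) order; how this renders for
    # right-to-left text is untested here.
    first.text = " ".join(t for t in texts if t)
    u_id = u.get(XML_ID)
    if u_id is not None:
        # Hypothetical position-based ID derived from the utterance ID.
        first.set(XML_ID, f"{u_id}.p1")
    for extra in segs[1:]:
        u.remove(extra)


# Made-up sample utterance with two sentence-level segments.
sample = (
    f'<u xmlns="{TEI_NS}" xml:id="doc.u1">'
    '<seg xml:id="a">First sentence.</seg>'
    '<seg xml:id="b">Second fragment</seg>'
    "</u>"
)
u = ET.fromstring(sample)
merge_segs(u)
```

After the call, the utterance holds a single <seg> whose text is the two sentences joined by a space.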

Thank you for your feedback and suggestions. I have merged all the sentences back into one paragraph per utterance, as recommended.
Additionally, I have fixed the UD logic in my code that caused some of the issues, and improved the fallback logic.
Regarding the segmentation, you are correct that there are some bugs, as the process was done automatically. However, the error rate is relatively low. Since the data is already segmented into sentences, addressing these issues at this stage would be quite challenging. I do plan to address this in the future as part of ongoing improvements.
Please let me know if anything else is needed for the data to meet the ParlaMint standards or to proceed with merging this pull request.

Please let me know if anything else is needed for the data to meet the ParlaMint standards or to proceed with merging this pull request.

Tonight or tomorrow, I will give you more detailed feedback, including bug reports and suggestions for improvement, to make your corpus more comparable with other ParlaMint corpora.

Add IL Israeli Knesset Corpus sample

5b98772

GiliGoldin added 10 commits November 18, 2024 22:07

only sample speakers

4474855

ids fix

3bebb74

protocol 2021 deleted

f7d9518

faction names fix

c69da10

faction names fix and roles fixed

5d131ab

ud-syn placeholder links

2d683d0

taxonomies

534a157

parlamint2conllu add he

5175081

ud link changes

9d39b03

ud link changes

19c9928

matyaskopp reviewed Nov 20, 2024

Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml (outdated)

matyaskopp reviewed Nov 20, 2024

reference to s instead of seg and fix msds

300799b

GiliGoldin marked this pull request as draft November 21, 2024 08:24

GiliGoldin marked this pull request as ready for review November 21, 2024 08:28

GiliGoldin force-pushed the data branch from 4459fb4 to e8e7564 Compare November 21, 2024 08:32

merge sentences to one segment

f46960f

GiliGoldin force-pushed the data branch from e8e7564 to f46960f Compare November 21, 2024 09:21

fix fallback ud tree treatment

332bb31

matyaskopp added a commit that referenced this pull request Nov 21, 2024

add Hebrew into list of languages to be translated (related to #881)

9946040

Merge branch 'data' into data

27a4fa7
