How to handle inappropriate word segmentation in English data? #1273

xmgwzcn · 2024-09-05T11:33:44Z

xmgwzcn
Sep 5, 2024

Version number of KH Coder

3.Beta.07b

Your Operating System
macOS Sonoma 14.5

Your Question
Thank you for reading this question.

When I generate a frequency word list from my English data, I notice that some word segments are incorrect because they are not complete words. For example, the list contains fragments like “r”, “ek”, and “Ho” instead of the correct words “your”, “week”, and “house”.

I have already tried using the function [Select Word for Analysis]. I correctly added “your”, “week”, and “house” in the [Force Pick-up] column, then [Run Pre-Processing]. However, the incorrect forms, such as “r”, “ek”, and “Ho”, still appear.

Screenshots

What language of text are you trying to analyze?
English

Answered by ko-ichi-h

Sep 5, 2024

ko-ichi-h · 2024-09-05T12:22:03Z

ko-ichi-h
Sep 5, 2024
Maintainer

Hmm, would you click “r”, “ek”, and “Ho” in the word list screen to open KWIC?
Then, would you attach screenshots of the KWIC?

We will be able to see where the "r" comes from.

Or, if you can attach the input data file here, it would be very helpful.

1 reply

xmgwzcn Sep 5, 2024
Author

Thank you for your prompt reply.

Screenshots of the KWIC are attached below.

Actually, "r," "ek," and "Ho" are just examples and not the only cases. There are other similar fragments that are incorrect as well. Due to data confidentiality reasons, I am unable to upload the entire file. However, I have attached the file containing data related to "r," "ek," and "Ho," which I hope will be helpful. From the file, we can see that "your," "week," and "Houston" are all complete words in my data file.
Sample.xlsx

Additionally, we can observe that "r" comes from "your," as shown in the screenshot. Interestingly, some instances of the word "your" are identified correctly and appear in the frequency list as "your," while others are incorrectly identified as "r."

ko-ichi-h · 2024-09-05T13:05:27Z

ko-ichi-h
Sep 5, 2024
Maintainer

> However, I have attached the file containing data related to "r," "ek," and "Ho," which I hope will be helpful. From the file, we can see that "your," "week," and "Houston" are all complete words in my data file.

Sorry for my poor English. But, can you reproduce the problem with the Sample.xlsx?

It would be helpful if you could provide a file that reproduces the problem.

Also, how about cleaning your data? A good place to start is the CLEAN function in Excel. Please try creating a new Excel file with CLEANed text and making a new KH Coder project with that new file.
https://www.educba.com/clean-in-excel/
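
For anyone who prefers to clean the text outside Excel, here is a minimal Python sketch of the same idea. It is not part of KH Coder, and `clean_text` is a hypothetical helper name. It assumes the broken words are caused by invisible control or formatting characters inside the text, which has not been confirmed in this thread. Unlike Excel's CLEAN, it turns tabs and line breaks into spaces instead of deleting them, and it also drops Unicode format characters such as zero-width spaces.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Remove invisible control/format characters from a string.

    Tabs and line breaks become a single space so that neighbouring
    words are not glued together; every other character in Unicode
    category Cc (control) or Cf (format) is dropped.
    """
    cleaned = []
    for ch in text:
        if ch in "\t\n\r":
            cleaned.append(" ")
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # invisible character: skip it
        else:
            cleaned.append(ch)
    return "".join(cleaned)


# Hypothetical sample with hidden characters inside three words.
sample = "Enjoy you\u200br we\x0bek in Ho\x00uston"
print(clean_text(sample))  # Enjoy your week in Houston
```

The cleaned strings would then go into a new spreadsheet, and a new KH Coder project would be created from that file, as suggested above.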

6 replies

xmgwzcn Sep 5, 2024
Author

However, when I enter the following words in the [Force Pick-up] column and then [Run Pre-Processing], the same issue occurs again. I also tried this with the sample data I attached, and the same problem appeared in the frequency list.
Sample.xlsx

we
us
NASA
you
your
'you r'
everyone
skywatchers
digital creatoars
content creators
members of the media
metalheads
listeners
U.S.
American
US
United States

ko-ichi-h Sep 5, 2024
Maintainer

Yes, if you enter "you", [Force pick up] will split "your" into "you" and "r". It will also split "yourself" into "you" and "rself". That is simply how this function works.

However, as stated in the manual, [Force pick up] gives priority to entries that appear higher in the list.

So, if you enter the words in this order:

yourself
your
you r
you

the problem will be alleviated.
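
The effect of the ordering can be illustrated with a toy model in Python. This is only a sketch and not KH Coder's actual implementation: `segment` is a hypothetical function that tries the forced entries in the listed order at the start of each word, and otherwise takes a plain run of letters.

```python
import re

# Fallback token: a run of letters/apostrophes, or any single character.
WORD = re.compile(r"[A-Za-z']+|.")


def segment(text, force_list):
    """Toy tokenizer: forced entries are tried in listed order."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for entry in force_list:
            # The first listed entry that matches here wins.
            if text.startswith(entry, i):
                tokens.append(entry)
                i += len(entry)
                break
        else:
            match = WORD.match(text, i)
            tokens.append(match.group())
            i = match.end()
    return tokens


text = "thank you for your time and enjoy yourself"

# Only "you" listed: "your" and "yourself" are cut after "you".
print(segment(text, ["you"]))
# ['thank', 'you', 'for', 'you', 'r', 'time', 'and', 'enjoy', 'you', 'rself']

# Longer entries listed first: the full words survive.
print(segment(text, ["yourself", "your", "you r", "you"]))
# ['thank', 'you', 'for', 'your', 'time', 'and', 'enjoy', 'yourself']
```

In this model, listing "you" above "your" gives the same broken result as listing "you" alone, which is why the longer entries have to come first.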

But why on earth would we need such forced extraction? I think there should be a better solution than [Force pick up] to achieve your goal.

xmgwzcn Sep 5, 2024
Author

Oh, I see your point now.
You’re right, forced extraction was a poor idea.
I think you’ve inspired me to move in the right direction for a solution.

ko-ichi-h Sep 5, 2024
Maintainer

If you need to use [Force pick up], please ignore the "r" and any other fragments. You can use the [force ignore] function to remove them from results and word lists.
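
As a rough illustration of what ignoring the fragments means for a word list, here is a small Python sketch. It does not use KH Coder; `frequency_list` is a hypothetical helper, and the token list is made-up example data.

```python
from collections import Counter


def frequency_list(tokens, ignore):
    """Count tokens, leaving out every word in the ignore list."""
    ignored = set(ignore)
    return Counter(t for t in tokens if t not in ignored)


# Made-up tokens, including the fragments "r" and "rself".
tokens = ["you", "r", "your", "you", "rself", "week"]
counts = frequency_list(tokens, ignore=["r", "rself"])
print(counts["you"], counts["your"], counts["r"])  # 2 1 0
```

The fragments are only hidden from the counts here. The text itself is still split in the same way.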

xmgwzcn Sep 6, 2024
Author

Got it. Thank you so much!
