Is it possible to calculate TF-IDF using KH Coder? If yes, then can you please tell me how? #869
Replies: 15 comments
-
Dear PiyushKyushu, Thank you for your post. Do you need a TF-IDF matrix, that is, a document-word matrix with a TF-IDF value in each cell? Or do you need one value for each word?

Matrix
d <- read.csv("C:/khcoder3/tf.csv", fileEncoding="UTF-8-BOM", check.names=FALSE)

# calculate tf
# find the "length_w" column (document length in words) among the leading columns
n_cut <- NULL
for (i in 1:12){
  if ( colnames(d)[i] == "length_w" ){
    n_cut <- i
    break
  }
}
doc_length_mtr <- d[,(n_cut-1):n_cut]
leng <- as.numeric(doc_length_mtr[,2])
leng[leng == 0] <- 1   # avoid division by zero for empty documents
d <- d / leng

# delete unnecessary part (the columns up to and including "length_w")
n_cut <- n_cut * -1
d <- d[,-1:n_cut]

# prepare a function: binary term frequency
lw_bintf <- function(m) {
  return( (m > 0) * 1 )
}

# prepare a function: inverse document frequency
# from Dumais (1992), Nakov (2001) uses log not log2
gw_idf <- function(m) {
  df <- rowSums(lw_bintf(m), na.rm=TRUE)
  return( log2(ncol(m) / df) + 1 )
}

# calculate tf-idf
d <- t(d)                  # rows = words, columns = documents
d <- d[rowSums(d) > 0, ]   # drop words that never occur
d <- d * gw_idf(d)         # weight each word (row) by its IDF
d <- t(d)

# save (write a UTF-8 BOM first, then append the CSV)
fh <- file("C:/khcoder3/tf-idf.csv", "wb")
writeBin(as.raw(c(0xef, 0xbb, 0xbf)), fh)
close(fh)
fh <- file("C:/khcoder3/tf-idf.csv", "at", encoding="UTF-8")
write.csv(d, fh, quote=FALSE, row.names=FALSE)
close(fh)

One value for each word

Go to [Project] [Export] [Word Frequency List (Excel)] in the menu. There you can select "TF" or "DF". This command gives you both TF and DF, so you can calculate TF-IDF yourself in Excel.
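To make the "Matrix" calculation easier to follow, here is a minimal sketch of the same weighting, log2(N / df) + 1 applied to length-normalized counts, on a tiny made-up matrix. The document names, words, counts, and the `doc_length` vector are all hypothetical and stand in for what the exported file would contain; this is an illustration, not the script KH Coder itself runs.

```r
# Toy document-word matrix: 3 documents (rows) x 3 words (columns).
# All names and counts are made up for illustration.
tf <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("apple", "banana", "cherry"))
)
doc_length <- c(10, 5, 6)  # hypothetical number of words in each document

# Relative term frequency: each row is divided by its document length
rel_tf <- tf / doc_length

# Document frequency: number of documents that contain each word
df <- colSums(tf > 0)

# IDF in the form used by the matrix script: log2(N / df) + 1
idf <- log2(nrow(tf) / df) + 1

# Multiply each word's column by its IDF weight
tfidf <- sweep(rel_tf, 2, idf, `*`)
print(round(tfidf, 3))
```

With this weighting a word that appears in every document still keeps a weight of 1, because of the `+ 1` term, so its value is never forced to zero.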
-
Dear Professor Higuchi, Thank you.
-
You should consult the literature to see if the above formula is correct.
-
Dear Professor Higuchi, Thank you.
-
Suppose the main window shows H5 = 100, meaning there are 100 documents in one Excel file. I want to confirm: when KH Coder counts term frequency (TF), does it give the frequency of a particular term across the whole Excel file (the total frequency of that word in all 100 documents)?
-
KH Coder tries to count the lemma of each word. To detect lemmas, KH Coder uses the Stanford POS Tagger in its default settings. So, if the Stanford POS Tagger recognizes the lemma of "huawei" as "huaweus", KH Coder counts "huaweus". If you want to unite variations, see #96.
That's correct.
I am not sure. For the matrix I described above (TF-IDF for each word in each document), I know the formula. For the "one value for each word" style of TF-IDF, I am not sure about the formula. If you find the formula and show it to me, I can tell you how to calculate it. BTW, you can get the TF of each word in each document using the [Project] [Export] [Document-Word Matrix] [CSV] function of KH Coder. Best.
-
Thank you, Professor Higuchi. I got the document-word matrix. Thank you once again.
-
Please confirm the formula by consulting the literature. I don't recommend using the above formula without confirming it. I would use Excel to get the top 30 words.
-
Dear Professor Higuchi,
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
TF-IDF = TF(t) * IDF(t)
Can you guide me through the steps for calculating TF-IDF using KH Coder and Excel?
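For reference, the formula quoted in this post could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the post does not define TF(t), so the sketch assumes it is the raw count of term t in each document, and the function name `tfidf_ln` and the toy counts are hypothetical.

```r
# Sketch only: TF-IDF with the natural-log IDF quoted in the post.
# Assumption: TF(t) is the raw count of term t in each document.
tfidf_ln <- function(tf) {
  n_docs <- nrow(tf)        # total number of documents
  df <- colSums(tf > 0)     # number of documents containing each term
  idf <- log(n_docs / df)   # natural log, as in the quoted formula
  idf[df == 0] <- 0         # terms that never occur get weight 0
  sweep(tf, 2, idf, `*`)    # multiply each term's column by its IDF
}

# Made-up counts: 2 documents (rows) x 2 terms (columns)
tf <- matrix(c(3, 0,
               1, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("doc1", "doc2"), c("huawei", "phone")))
result <- tfidf_ln(tf)
print(result)
```

Because TF(t) differs from document to document, the result is a matrix with one value per word per document. Note also that under this definition a term occurring in every document ("huawei" in the toy data) gets a weight of log(1) = 0.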
-
I do not understand.
By this definition, each document has a different value of TF(t). So it is a definition of the "Matrix" type of TF-IDF I described above, not of the "One value for each word" type.
-
Dear Professor Higuchi, Thank you for your reply.
-
In the first place, why do you want to get TF-IDF?
-
Dear Professor Higuchi, Thank you.
-
Where did you find this claim? In any case, you need to find literature that contains this claim and the TF-IDF formula used for this purpose. I think that would be the first step.
-
This seems abandoned, so I am closing this issue for now. Please reopen it if necessary.
-
Dear Professor Higuchi,
First of all, thank you for this software; it is really wonderful. I have been using it for more than a year.
For my research I need TF-IDF, and I was wondering whether it is possible to calculate it using KH Coder.
There is a past question that talks about TF and DF, but not TF-IDF.
Could you kindly tell me the procedure for calculating TF-IDF?
Thanks once again.