Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kstudy translator #1308

Merged
merged 19 commits into from
Jan 4, 2018
Merged

Kstudy translator #1308

merged 19 commits into from
Jan 4, 2018

Conversation

yunusong
Copy link
Contributor

Translator for the Kstudy database.

Help appreciated for attachments, changing authors to a single field, and opening a javascript url at "journal/list_name.asp" detect. But any suggestions to improve the translator is welcome.

Thank you.

@fbennett
Copy link
Contributor

I've raised my hand, so ... a question first for the translator gurus listening in. The translator is based on the Framework, but it's going to need some out-of-the-ordinary post-processing of name field content, at least one intermediate xmlhttprequest call to acquire a ToC for one form of "multiple" content, and another to acquire attachments.

Is it possible to shoehorn arbitrary JS code into a Framework translator, or should things be recoded from scratch?

@avram
Copy link
Contributor

avram commented May 27, 2017

So FW does have some capacity to hook into the processing and run arbitrary JS, but I suspect it's better to rewrite than to do much complex work in those hooks

@zuphilip
Copy link
Contributor

Is it possible to shoehorn arbitrary JS code into a Framework translator, or should things be recoded from scratch?

One can add more JS to the functions detectWeb, doWeb for example

function detectWeb(doc, url) {
if(!ZU.xpath(doc, '//article/h1').length && !ZU.xpath(doc, '//div[@class="inner"]//div[contains(@class, "text_wrapper")]/h2/a')) {
return;
}
return FW.detectWeb(doc, url);
}
function doWeb(doc, url) { return FW.doWeb(doc, url); }
; or define a new function and call it somewhere in the FW chain.

Personally, I prefer the non-FW translators and here it seems that anything else would be pretty complex as well. @yunusong I can rewrite what you have as a normal translator without the FW, if this helps.

@fbennett
Copy link
Contributor

fbennett commented May 27, 2017

@zuphilip if you have time, and if @yunusong agrees, that would be great (Philipp is much more skilled with translators than I am).

With a non-FW translator as a base, I could then splice in conditional operations that would wake up in Juris-M, for transliterated names and whatnot.

@zuphilip
Copy link
Contributor

@yunusong
Copy link
Contributor Author

yunusong commented May 27, 2017

Thanks @zuphilip @fbennett !

Author field is not that big of a deal, actually. I tried adding ".remove(/(/).remove()/)" after ".split(/,/)" so that it won't take ")" as the last name. Then it took the romanized last name as the last name, and that was good enough for me. Separating last name and first name is almost never necessary in Korean academia.

On the other hand, I don't mind you guys rewriting the code at all. If it helps improving the code, I am all for it. What I would really like is the translator to automatically fetch the pdf files. The xpath for link to pdf files is "//div[@Class="choice"]/span[1]/a/@href" and an example link is
"javascript:SiteSelect1('http://210.101.116.18/kiss10/inFTP_Journal.asp', '87300332.pdf', 2201, 25169, 1);"
I tried rewriting the code without the framework myself, but I didn't have any better luck with the attachments. If you can help me with the attachments, that would be wonderful. Thank you.

@yunusong
Copy link
Contributor Author

yunusong commented May 27, 2017 via email

@zuphilip
Copy link
Contributor

@yunusong I created a PR towards your fork and branch yunusong#1 . After accepting this, we should see it here as well.

Rewrite KStudy.js as Non-FW translator
I figured out how to unravel the javascripted links in Journal Issue type of "multiple". I tested on my zotero and confirmed that it works.
KStudy.js Outdated
{
"translatorID": "b298ca93-0010-48f5-97fb-e9923519a380",
"label": "KStudy",
"creator": "Yunwoo Song","Philipp Zumstein"
Copy link
Contributor

@zuphilip zuphilip May 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thank you, but this needs to be one string containing both names, i.e.

"Yunwoo Song, Philipp Zumstein"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@yunusong
Copy link
Contributor Author

I located the JS file which contains the function for downloading pdf.
The function goes as
function SiteSelect1(ftproot, a_imag, inst_key, publ_key,isDownLoad) {
var url = "../search/download.asp?ftproot=" + ftproot + "&inst_key=" + inst_key + "&a_imag=" + a_imag + "&isDownLoad=" + isDownLoad + "&publ_key=" + publ_key
window.open(url, "", "toolbar=0,location=0,directories=0,status=0,menubar=0,scrollbars=1,resizable=1,copyhistory=1,top=170,left=250,width=540,height=500");
}

So I coded item.attachments as follows.

var pdfJSurl = ZU.xpathText(doc, '//div[@class="search_box"]/div[@class="choice"]/span[1]/a/@href');
var pdfurlKeys = pdfJSurl.match(/Select1\(\'(\S+)\'\,\s?\'(\S+)\'\,\s?(\S+)\,\s?(\S+)\,\s?(\S+)\)\;/);
var pdfurl = "../search/download.asp?ftproot=" + pdfurlKeys[1] + "&inst_key=" + pdfurlKeys[3] + "&a_imag=" + pdfurlKeys[2] + "&isDownLoad=" + pdfurlKeys[5] + "&publ_key=" + pdfurlKeys[4];
item.attachments.push({
	title : item.title,
	url : pdfurl,
	mimeType : "application/pdf",
	snapshot : false
})

But Zotero still won't fetch the pdf.

I tried fetching the resulting url by assigning to an unused slot. eg. issn. as
item.ISSN = pdfurl;
An example of a resulting url is as follows.
../search/download.asp?ftproot=http://210.101.116.17/kiss8/inFTP_Journal.asp&inst_key=7074&a_imag=40323420.pdf&isDownLoad=1&publ_key=25408

I tried pasting this address directly on a browser and it will start downloading the pdf.
So I am wondering what is still blocking Zotero from storing the pdf.

The odd thing I noticed about the original function was that it does not end the "var url =" statement with a semicolon. The semicolon only appears at the end of the second line. Would this have something to do with the problem? Thank you guys for the comments.

KStudy.js Outdated
@@ -1,7 +1,7 @@
{
"translatorID": "b298ca93-0010-48f5-97fb-e9923519a380",
"label": "KStudy",
"creator": "Yunwoo Song","Philipp Zumstein"
"creator": "Yunwoo Song, Philipp Zumstein"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not valid JSON. You need to add a comma at the end of this line.

@zuphilip
Copy link
Contributor

Can you try snapshot : true in your code above?

@fbennett
Copy link
Contributor

I'll take a look on Monday (I need to be in the University net to get access). As an offhand thought in the meantime, assuming that the translator is loading properly, there may be a redirect that the translator is not able to follow without some further jiggery-pokery. (You can follow the transactions by opening Developer Tools in the browser and then refreshing on the download URL.)

@yunusong
Copy link
Contributor Author

Thank you @fbennett !
@zuphilip I just tried "snapshot : true" but nothing changed. Zotero still shows a big red X for attachments, and the test page at scaffold will not show any value for "url" under attachments. I also tried "snapshot : false," but no pdf either.

@adam3smith
Copy link
Collaborator

Thanks everyone -- please let me know once this is ready for final review. As for FW, given #1277, I'd prefer us avoiding new FW translators.

@yunusong
Copy link
Contributor Author

yunusong commented May 28, 2017

@fbennett I have one more question.
When I erased all other attributes of the attachment and just left the url as "url : pdfurl" (which gives a long url that would have been the result of the function SiteSelect1), Zotero saves the snapshot of the page that leads to downloading, and the browser downloads the pdf file to my download folder in my computer.

I think I need the translator to access this final url directly. The gist of the html code that calls the url is as follows.

<body onLoad="document.down.submit();">
<form name="down" method="post" action="http://210.101.116.17/kiss8/inftp_journal.asp">
<input type="hidden" name="a_code" value="40323420.pdf">
<input type="hidden" name="code" value="7624112825580073101">
<input type="hidden" name="isDownLoad" value="1">
</form>
</body>

If you paste this html code into a new text and open it in a browser, you will see that it downloads a pdf file even without requiring proxy. Do you think you can work around this code and have the zotero to fetch the final url? I really appreciate your help.

@fbennett
Copy link
Contributor

fbennett commented May 28, 2017 via email

@yunusong
Copy link
Contributor Author

Can you guys tell me what is wrong with my code? I don't think I quite understand how to use ZU.doPost. But I am quite certain that I have to use it because the actual page that leads to pdf has method="post".

So in the scrape function I added the code for attachments as follows.

`var` pdfJSurl = ZU.xpathText(doc, '//div[@class="search_box"]/div[@class="choice"]/span[1]/a/@href');
`var` pdfurlKeys = pdfJSurl.match(/Select1\(\'(\S+)\'\,\s?\'(\S+)\'\,\s?(\S+)\,\s?(\S+)\,\s?(\S+)\)\;/);
`var` pdflinkurl = "../search/download.asp?ftproot=" + pdfurlKeys[1] + "&inst_key=" + pdfurlKeys[3] + "&a_imag=" + pdfurlKeys[2] + "&isDownLoad=" + pdfurlKeys[5] + "&publ_key=" + pdfurlKeys[4];

`item.attachments.push({`
	url : ZU.processDocuments(pdflinkurl, getpdf),
	snapshot : true,
})

I was hoping that this would process the getpdf function on the page that opens. And for getpdf function, I coded it as follows.

function getpdf(doc, url) {
var theurl = ZU.xpathText(doc, '//form[@name="down"]/@action');
var a_code = ZU.xpathText(doc, '//input[@name="a_code"]/@value');
var code = ZU.xpathText(doc, '//input[@name="code"]/@value');
var isDownload = ZU.xpathText(doc, '//input[@name="isDownload"]/@value');
var postData = [a_code, code, isDownload];
var headerData = {"a_code": a_code, "code": code, "isDownload": isDownload};
ZU.doPost(theurl, a_code + code + isDownload, function(pdfurl2) {});
}

I checked how other translators use the doPost, and they were very different from how I wrote it, but I don't quite understand how to fix it.

@zuphilip
Copy link
Contributor

That is a random guess, but I think you have to switch the two things, i.e. something like

ZU.doPost(pdflinkurl, postdata, function(pdfdoc) {
   item.attachments.push({
	doc: pdfdoc,
	title: "Fulltext",
        mimetype: "pplication/pdf"
   });
   item.complete();
}

But I don't really know if this can work, because usually the function has to work with text...

@fbennett
Copy link
Contributor

Picking up now (fell asleep last night, but I turned on the office proxy yesterday, so I'm "at work" this morning). Will fiddle around a bit and see what I can come up with.

@yunusong
Copy link
Contributor Author

Ok, I think I figured out how to use ZU.doPost. But I still couldn't get Zotero to save pdf. This is the code I added at the end of the scrape function. The weird thing is, though, when I run test in Scaffold, the result will show "Translation Successful" AND my browser will download the pdf to my computer, even though I only ran test. And when I actually test it with Zotero, it will not fetch pdf. Any thoughts on what I should change?

var pdfJSurl = ZU.xpathText(doc, '//div[@class="search_box"]/div[@class="choice"]/span[1]/a/@href');
var pdfurlKeys = pdfJSurl.match(/Select1\(\'(\S+)\'\,\s?\'(\S+)\'\,\s?(\S+)\,\s?(\S+)\,\s?(\S+)\)\;/);
var pdflinkurl = "../search/download.asp?ftproot=" + pdfurlKeys[1] + "&inst_key=" + pdfurlKeys[3] + "&a_imag=" + pdfurlKeys[2] + "&isDownLoad=" + pdfurlKeys[5] + "&publ_key=" + pdfurlKeys[4];
ZU.processDocuments(pdflinkurl, function(doc, url) {
	var actionurl = ZU.xpathText(doc, '//form[@name="down"]/@action');
	var inputData = "a_code=" + ZU.xpathText(doc, '//input[@name="a_code"]/@value') + "&code=" + ZU.xpathText(doc, '//input[@type="code"]/@value') + "&isDownload=" + ZU.xpathText(doc, '//input[@name="isDownload"]/@value');
	ZU.doPost(actionurl, inputData, function(pdfdoc) {
		item.attachments.push({
			url : pdfdoc,
		});
	});
});

item.complete();

@fbennett
Copy link
Contributor

fbennett commented May 29, 2017

Cracked it. I have code that attaches a readable PDF. As it turned out, no further call (to doGet, doPost or processDocuments) was necessary. Tried various things, but what did the trick was to use the viewer.asp URL that is the ultimate target of a view request. It's a POST call on the site, but setting the post data as a query string on the URL (which I presume the translator calls with GET) worked anyway. The only post data needed is code_num, which is just the server-side filename of the PDF. So it turns out to be super simple, after all our guesswork.

I'll file a pull request in a bit.

@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

Scaffold has its own browser now and you can simply copy and paste whichever URL you wanted to see into the Current URL field. After copy and paste you have to hit enter and wait for the page to load. Does this work?

scaffold-url

@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

🌥 Okay, I am continuing to work here now. I will let you know when I have a new version...

@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

I generalized the translator for the example(s) found by @adam3smith but choosed in the end journalArticle for that documented in new test cases. However, it seems for me, that the PDF download we see in the public part is maybe tailored for that and could be fragile at all:

kiss-public-download

Thus, I decided to skip the PDF download for the moment. Maybe, we can start with the translator as it is now. Any improvement for PDF download could be added later by someone with access to the database.

Deleted also the comments for downloading PDFs,
which we do not yet handle.
@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

Please have a look at the new version.

@socheres
Copy link
Contributor

socheres commented Jan 2, 2018

@zuphilip below some more test cases and remarks.
author in Korean should import in a "single field" in Zotero., because the first letters(Silben) of Korean names are usually surname and the other two letters are first name. Also, there is no middle name in korean.
Authors in latinized form with comma is separated correctly.
one more suggestions for issn.
//e.g. issn = ISSN(Print) var issn = ZU.xpathText(doc, '//li[label[text()="ISSN(Print)"]]'); if(issn){ item.ISSN = ZU.cleanISSN(issn); }

test cases:

http://kiss.kstudy.com/public/public2-article.asp?key=50789039

18:30:10 Returned item:
           {
             "itemType": "journalArticle"
             "creators": [
               {
                 "firstName": "Yoon Tae"
                 "lastName": "Kim"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyun Suk"
                 "lastName": "Park"
                 "creatorType": "author"
               }
             ]
             "notes": []
             "tags": []
             "seeAlso": []
             "attachments": []
             "title": "KOLMOGOROV DISTANCE FOR MULTIVARIATE NORMAL APPROXIMATION"
             "language": "ko-KR"
             "publicationTitle": "Korean Journal of mathematics (강원경기수학회)"
             "volume": "23"
             "issue": "1"
             "date": "2015"
             "pages": "1–10"
             "abstractNote": "초록 보기"
             "libraryCatalog": "kiss"
           }
18:30:10 Translation successful

http://kiss.kstudy.com/thesis/thesis-view.asp?key=3550330

18:31:51 Returned item:
           {
             "itemType": "journalArticle"
             "creators": [
               {
                 "lastName": "( Yoon Tae Kim ) "
                 "fieldMode": true
                 "creatorType": "author"
               }
               {
                 "lastName": " ( Hyun Suk Park )"
                 "fieldMode": true
                 "creatorType": "author"
               }
             ]
             "notes": []
             "tags": []
             "seeAlso": []
             "attachments": []
             "title": "Convergence rate of a test statistics observed by the longitudinal data with long memory"
             "language": "ko-KR"
             "publicationTitle": "CSAM(Communications for Statistical Applications and Methods)"
             "volume": "24"
             "issue": "5"
             "date": "2017-09"
             "pages": "481–492"
             "abstractNote": "초록 보기 This paper investigates a convergence rate of a test statistics given by two scale sampling method based on Ait-Sahalia and Jacod (Annals of Statistics, 37, 184-222, 2009). This statistics tests for longitudinal data having the existence of long memory dependence driven by fractional Brownian motion with Hurst parameter H ∈ (1/2, 1). We obtain an upper bound in the Kolmogorov distance for normal approximation of this test statistic. As a main tool for our works, the recent results in Nourdin and Peccati (Probability Theory and Related Fields, 145, 75-118, 2009; Annals of Probability, 37, 2231-2261, 2009) will be used. These results are obtained by employing techniques based on the combination between Malliavin calculus and Stein`s method for normal approximation."
             "libraryCatalog": "kiss"
           }
18:31:51 Translation successful

http://kiss.kstudy.com/public/public2-article.asp?key=50375984

18:33:10 Returned item:
           {
             "itemType": "journalArticle"
             "creators": [
               {
                 "firstName": "Ja-Son Y."
                 "lastName": "Kim"
                 "creatorType": "author"
               }
               {
                 "firstName": "Tae-Joon"
                 "lastName": "Park"
                 "creatorType": "author"
               }
               {
                 "firstName": "Jin-Sol"
                 "lastName": "Lee"
                 "creatorType": "author"
               }
               {
                 "firstName": "Ji-Yong"
                 "lastName": "Chun"
                 "creatorType": "author"
               }
               {
                 "firstName": "Joon-Seol"
                 "lastName": "Bae"
                 "creatorType": "author"
               }
               {
                 "firstName": "Byung-Lae"
                 "lastName": "Park"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyun-Sub"
                 "lastName": "Cheong"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyo-Suk"
                 "lastName": "Lee"
                 "creatorType": "author"
               }
               {
                 "firstName": "Yoon-Jun"
                 "lastName": "Kim"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyoung-Doo"
                 "lastName": "Shin"
                 "creatorType": "author"
               }
             ]
             "notes": []
             "tags": []
             "seeAlso": []
             "attachments": []
             "title": "Association Analysis of SERPINB5 Polymorphisms with HBV Clearance and HCC Occurrence in a Korean Population"
             "language": "ko-KR"
             "publicationTitle": "Genomics & informatics (한국유전체학회)"
             "volume": "8"
             "issue": "1"
             "date": "2010"
             "pages": "1–8"
             "abstractNote": "초록 보기"
             "libraryCatalog": "kiss"
           }
18:33:10 Translation successful

I added an itemType "report" for one type of new formats and added some minor details.

I think we need to add "url.indexOf('public2-jrd-sub.asp')>-1" as one of the options for "multiple", but for some reason I keep getting errors. Can someone take a look at this?

The example address is: http://kiss.kstudy.com/public/public2-jrd-sub.asp?key1=100171&key2=10017&selYEAR=2012&selVOL1=1&selNUM1=2&selMon=
KStudy.js Outdated
@@ -131,11 +133,23 @@ function scrape(doc, url) {
var containerParts = containerValue.match(/(.*)\s+(\d+)\D\s*(\d+)/);
if (containerParts) {
item.publicationTitle = containerParts[1];
item.volume = containerParts[2];
if (/\d{4}/.test(containerParts[2])) {
containerParts[2] = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment why you are doing this. Moreover, I guess you don't need to delete the value in the first case, but rather simply want to concentrate on the second case. Is this correct? Then you could instead try something like

if (containerParts[2].length<4) { // or the regexp if really needed
   item.volume = containerParts[2];
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay thank you for pointing that out.
The reason I added that line was that I found that sometimes the pages have published year in place of the volume number. So sometimes it would appear as Vol. 2013, and I didn't want that to happen, but your correction makes much more sense. Thank you. I'll fix that.

@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

@yunusong For your question about the multiple in the public part: You need to tweak also the CSS selector for the different entries, i.e. this line

var rows = ZU.xpath(doc, '//div[contains(@class, "thesis-info")]/h5/a[contains(@href, "/thesis")]');

(The last check is not true contains(@href, "/thesis"); but change it wisely/carefully in order not to influence the other multiples negatively.)

@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

@socheres Thank you for the review! The ISSN is a good idea to add as well. I am not sure that I understand the naming correctly. Did you see currently an error in the splitting of author names? Example?

@socheres
Copy link
Contributor

socheres commented Jan 2, 2018

@zuphilip authors name splitting works only correct in case 1, not in 2 or 3.
1.Kim, Moon Tae , Park, Hyun Suk
2.( Yoon Tae Kim ) , ( Hyun Suk Park )
3.김윤태 ( Yuntae Kim )

(Kim|Park|김) = last name

In case 2,3 they should be - if possible - in a full name field, like in Zotero where i can switch author field from double in single field.
3. 김 = last name
윤태 = first name
Kim = last name
Yuntae = first name

@socheres
Copy link
Contributor

socheres commented Jan 2, 2018

i think the website structure - authors - is too fragile and i think the effort is not worth it, because all the metadata on http://kiss.kstudy.com/ can be downloaded - third button below 인용하기 - in RIS.

@zuphilip
Copy link
Contributor

zuphilip commented Jan 2, 2018

Uh... RIS, yes we should use that in the translator (it is the blue button with the arrow to the right). I will look into this now...

@zuphilip
Copy link
Contributor

zuphilip commented Jan 3, 2018

Please have a look at the new version, which is pretty much a rewrite, but much simpler by using the RIS data. 🚀

Copy link
Collaborator

@adam3smith adam3smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small questions & requests.

KStudy.js Outdated

function scrape(doc, url) {
var key = url.match(/\bkey=(\d+)\b/)[1];
var risURL = "http://kiss.kstudy.com/p-common/export_endnote.asp";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets make this relative in case the site starts offering https



function scrape(doc, url) {
var key = url.match(/\bkey=(\d+)\b/)[1];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check for this in detect, then, to make sure we have a key

KStudy.js Outdated
}
item.language = "ko-KR";
if (item.pages) {
item.pages = item.pages.replace('-', '–');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't usually do this -- do we actually prefer en-dashes over hyphens in the database?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any harm in doing so, though we should probably just be automatically converting hyphens in page ranges to en dashes on save.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... my mistake then. How is the page range then shown in citation styles? Maybe the replacement could also happens at some later step... I will delete it on the translator code, beause this would then be a larger issue.

KStudy.js Outdated
"items": [
{
"itemType": "journalArticle",
"title": "KOLMOGOROV DISTANCE FOR MULTIVARIATE NORMAL APPROXIMATION",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's title case all caps titles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to think about this before. I don't know how to check easily whether the title is all-uppercase but not a Korean title. Any ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to just use ZU.capitalizeTitle(title, true).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. This also changes the title 치과용 콘빔 CT를 이용한 상악 정중과잉치의 3차원 분석, which I wanted to avoid. I guess some test like [\u0000-\u00FF]* should work for Latin alphabet only titles...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it change it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will be great if the English title can be normalized, taking capitalization e.g. proper noun into account, and other words such as USA, EU in capitalize form. There are some similar projects (https://github.com/nlp-compromise/compromise), but I haven't found a "killer Javascript function" yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a site gives a title in all-caps, as seems to be the case here, we generally just title case it according to some basic rules and leave the fixing of proper nouns and acronyms up to users. (Along other things, we don’t even necessarily know the language in order to reliably exclude words.)

The issue here just seems to be (I haven’t looked at this closely) that some Korean titles include English characters, and we want to avoid changing the case of those, either in the translator or in a more generalized way in capitalizeTitle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dstillman Okay, my mistake. But the code pattern can still make sense then, to prevent making title case, when sentence case is correctly given.

I test that the majority of the title is from Latin alphabet. But your suggestion for capitalizeTitle on the words

not change words if they contain any characters outside [\u0000-\u00FF]

sounds reasonable for me.

@socheres The function ZU.capitalizeTitle can be used to "normalize" English titles into title case in translators, see https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/utilities.js#L921 . However, it will add a capital letter to more or less every word with only a small number of exceptions cf. https://github.com/zotero/zotero/blob/2baa537542fcd5bc6e2826eb0eef6b34c571dcfc/chrome/content/zotero/xpcom/utilities.js#L922-L924 . To do anything similar e.g. for German and thereby differentiating between nouns and verbs etc. might be out of scope for Zotero.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another example http://kiss.kstudy.com/thesis/thesis-view.asp?key=3480580 with the title

Analysis of Enzymes related to Lignin Modification of Phanerochaete chrysosporium ATCC20696 through the Transcriptomic and Proteomic Approaches

which is fine, but not title cased.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @avram @zuphilip for explaining. i totally understand that fixing crappy data records is out of scope for zotero. it was just feedback - nice to have feature - from the users/scientists. publisher who create metadata have to deal it.

KStudy.js Outdated
"creatorType": "author",
"fieldMode": 1
}
],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

romanized korean names with comma create two last names:

저자 : Kim, Yoon Tae, Park, Hyun Suk

Korean names are correct imported in "filedMode 1", but it will be nice to have, if korean names can be separated in last and first name like the function "ZU.cleanAuthor" for korean names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, you are right. The RIS data contains wrongly four authors in the AU tags. But I still can reuse some of the previous code for scraping it directly in these cases. Stay tuned..

KStudy.js Outdated
"lastName": "Hyun Suk",
"creatorType": "author",
"fieldMode": 1
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line403-423

@zuphilip
Copy link
Contributor

zuphilip commented Jan 3, 2018

I did changes according to the reviews of @adam3smith and @socheres . Sometimes it is needed to scrape the information for the authors directly from the page. I hope that we have the author information now correctly.

Please have a look at the new version.

@socheres
Copy link
Contributor

socheres commented Jan 3, 2018

it looks 👍
source: > ( Chang-young Hong ) , ( Su-yeon Lee ) , ( Myungkil Kim ) , ( In-gyu Choi )
{ "itemType": "journalArticle" "creators": [ { "firstName": "Chang-young" "lastName": "Hong" "creatorType": "author" } { "firstName": "Su-yeon" "lastName": "Lee" "creatorType": "author" } { "firstName": "Myungkil" "lastName": "Kim" "creatorType": "author" } { "firstName": "In-gyu" "lastName": "Choi" "creatorType": "author" }

source > 저자 : Kim, Yoon Tae, Park, Hyun Suk
{ "itemType": "journalArticle" "creators": [ { "firstName": "Yoon Tae" "lastName": "Kim" "creatorType": "author" } { "firstName": "Hyun Suk" "lastName": "Park" "creatorType": "author" } ]
source > 김윤태 ( Yuntae Kim )
{ "itemType": "journalArticle" "creators": [ { "lastName": "김윤태" "creatorType": "author" "fieldMode": 1 }👏

@yunusong
Copy link
Contributor Author

yunusong commented Jan 3, 2018

I checked the update, and things look really great. Thanks for the work.

@adam3smith adam3smith merged commit f4a0d8e into zotero:master Jan 4, 2018
@adam3smith
Copy link
Collaborator

Great, thanks everyone!

@zuphilip
Copy link
Contributor

zuphilip commented Jan 4, 2018

✨ Thank you all @yunusong, @fbennett, @avram, @dstillman, @socheres, @adam3smith for the work and the review! That was a massive collaboration on a Zotero translator!

GuyAglionby pushed a commit to GuyAglionby/translators that referenced this pull request Jan 28, 2018
Using RIS data and author tweaks
jonschz pushed a commit to jonschz/translators that referenced this pull request Feb 8, 2018
Using RIS data and author tweaks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
New Translator Pull requests for new translators
Development

Successfully merging this pull request may close these issues.

7 participants