Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make EM the generic translator. Addresses #1092 #1434

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

adomasven
Copy link
Member

As per this comment:

  • If EM only contains webpage data, run DOI translator's detectWeb within EM and if there are DOIs present, return undefined, otherwise return webpage(Gray?).
  • Change special-case code within the Connector to always allow running the EM translator. If EM returns undefined then list EM options with and without snapshot as the final two options in the context menu, otherwise list the EM translator according to its priority.

If there are DOIs present, EM will not overshadow the DOI translator and otherwise will overtake. If both rich EM data and DOI are present then both can co-exist. This way we can avoid any changes or special translation handling within zotero and translation server and have a translator for every page. It sacrifices code clarity in the intermediate term, but is a workable solution for the short-term until someone has the time and spirit to commit to the bigger change.

The changes in the connector are coming soon, but we should get this EM translator in circulation early. The consequence of this update is that connector will detect EM translator for all previous pages that only displayed webpage saving options.

Alternatively, we should at least merge the part in importRDF() that will allow the EM translator to save even when init() doesn't detect a viable item type, such that when push out the updated Connector and it tries to save with EM translation doesn't break.

Awaiting comments from @dstillman regarding connector related stuff.

@dstillman
Copy link
Member

My main concern here is duplicating the page scan for DOI. It's probably quite fast, but we obviously want to be very careful with translators that run on every page load, so we should at least do some profiling to see how many milliseconds it would be adding.

But I agree that we can merge the importRDF part first (and that will let us start using it in translation-server too, which is a priority). Have you tested the saving here? Any unexpected differences between the data from EM and the existing webpage option for a generic page?

@adomasven
Copy link
Member Author

Have you tested the saving here? Any unexpected differences between the data from EM and the existing webpage option for a generic page?

I have tested it for functionality and it works! Otherwise, yes, there are somewhat significant, but expected differences between EM and the current code we have.

@adomasven adomasven force-pushed the feature/generic-EM-translator branch from 21dbb16 to 72f5ceb Compare October 9, 2017 11:41
@adomasven
Copy link
Member Author

adomasven commented Oct 9, 2017

Okay, so I've pushed just the change that allows EM to save even when it doesn't detect any significant tags. Good to review and merge.

@adomasven
Copy link
Member Author

My main concern here is duplicating the page scan for DOI. It's probably quite fast, but we obviously want to be very careful with translators that run on every page load, so we should at least do some profiling to see how many milliseconds it would be adding.

See the scan code. It is an XPath query and then running a regexp for each returned result. The load time we're adding for each page is proportional to the length of the page, at O(n log n). On this webpage (which is one of the testcases for DOI detectWeb() takes 0.9ms in Chrome. I think we're okay.

@adomasven
Copy link
Member Author

@adam3smith please take a look at this when you have time

@adam3smith
Copy link
Collaborator

Should I be seeing this?

The changes in the connector are coming soon, but we should get this EM translator in circulation early. The consequence of this update is that connector will detect EM translator for all previous pages that only displayed webpage saving options.

Because that doesn't appear to be the case. e.g. still getting the grey Webpage icon on
http://www.example.com/

@dstillman
Copy link
Member

No, we took that out for the moment — the only thing necessary for this version is make sure we're OK applying the EM low-quality save to all webpages, in place of the current special-case webpage code (which just saves document.title, URL, and access date), noting that these would include pages where no effort was made to embed metadata. I haven't looked at the whole function, but I would say, for example, that we don't want to assign the hostname as Library Catalog for all webpages (or at all for EM saves). And I'm not sure if including the keywords field as tags is a good idea.

@zuphilip
Copy link
Contributor

And I'm not sure if including the keywords field as tags is a good idea.

The tags will be emptied if EM is not called from another translator, see https://github.com/adomasven/translators/blob/21dbb16f0f9e6a54ef5109a8fa58f75a3ba8a0c6/Embedded%20Metadata.js#L821-L823

@adomasven
Copy link
Member Author

adomasven commented Oct 11, 2017

The tags will be emptied if EM is not called from another translator, see https://github.com/adomasven/translators/blob/21dbb16f0f9e6a54ef5109a8fa58f75a3ba8a0c6/Embedded%20Metadata.js#L821-L823

We should probably do the same thing for libraryCatalog.

Otherwise the function adds:

  • Title
  • Creators (new for generic saves)
  • Abstract note from <meta name="description"/> (new)
  • Url
  • Access date

Creators and abstract note quality will vary wildly. I'm not sure if even the worst possible abstract is worse than no abstract. A bad creator, on the other hand is worse than no creator when citing. So maybe add creators only for saves from another translator too? (Although if creators come from highwire metadata, then they're okay to include. This is specifically applicable to low quality fields.)

@dstillman
Copy link
Member

We can look around, but I think we want to keep the author meta tag — I’m sure it’s often wrong, but it’s a legit field that, unlike keywords, doesn’t have a long history of (worthless) SEO abuse. Seems like people should be able to use it to specify the author of a generic webpage.

@adomasven adomasven force-pushed the feature/generic-EM-translator branch from 72f5ceb to e7ee6c3 Compare October 11, 2017 09:32
@adomasven
Copy link
Member Author

I've updated this to exclude libraryCatalog for non-child-EM-translator saves. There will be a follow up after we merge some changes into the connector.

Awaiting approval.

@adam3smith
Copy link
Collaborator

I'm sorry I'm dense here (and by all means feel free to merge this if it's blocking other things). But if you want me to look at it more closely, could someone give me a page where to see the changed behavior at work?

@adomasven
Copy link
Member Author

Ugh, sorry. The main change was supposed to be this, although it seems that new Zotero.Item(undefined) actually defaults to 'webpage' either way, so aside from removing "libraryCatalog" (which, FWIW, doesn't even apply to 'webpage' items, i think?), the PR actually doesn't change anything. And it means that it doesn't block the connector changes either.

I'll push the initial changes to this PR and we can start over, although we need to merge the connector code first, so it will have to wait. Once again, sorry to bother you.

@adam3smith
Copy link
Collaborator

no worries, thanks for clarifying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants