[REVIEW]: MassMine: Your Access To Data #50

Closed
15 of 16 tasks
whedon opened this issue Aug 16, 2016 · 42 comments

@whedon

whedon commented Aug 16, 2016

Submitting author: @n3mo (Nicholas Van Horn)
Repository: https://github.com/n3mo/massmine
Version: v1.0.1
Editor: @mgymrek
Reviewer: @julianmcauley
Archive: 10.5281/zenodo.193078

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7"><img src="http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7/status.svg)](http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7)

Reviewer questions

Conflict of interest

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Version: Does the release version given match the GitHub release (v1.0.1)?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: Have any performance claims of the software been confirmed?

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

Paper PDF: 10.21105.joss.00050.pdf

  • Authors: Does the paper.md file include a list of authors with their affiliations?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • References: Do all archival references that should have a DOI list one (e.g. papers, datasets, software)?
@whedon whedon added the review label Aug 16, 2016
@arfon
Member

arfon commented Aug 16, 2016

@n3mo thanks for this submission! Before we can proceed would you mind extracting the references into a paper.bib file. We won't be able to process this submission until we can compile the paper using pandoc.

@n3mo

n3mo commented Aug 16, 2016

@arfon thanks for the quick response. I've extracted the references as requested, as well as migrated both files to a directory named "paper". Let me know if there's anything else I can do.

@arfon
Member

arfon commented Aug 16, 2016

/ cc @openjournals/joss-reviewers - would anyone be willing to review this submission?

If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this will suffice.

Reviewer instructions

  • Please work through the checklist at the start of this issue.
  • If you need any further guidance/clarification take a look at the reviewer guidelines here http://joss.theoj.org/about#reviewer_guidelines
  • Please make a publication recommendation at the end of your review

Any questions, please ask for help by commenting on this issue! 🚀

@aabeveridge

Hello, I am the second author on MassMine and I have a question:

May we suggest or invite an outside reviewer, or do you have an internal list that you prefer for JOSS?

@labarba
Member

labarba commented Sep 19, 2016

You can do both. For example, I'm lead author of this paper (now accepted) and I pinged someone on Twitter asking for a second reviewer:
#43

@labarba
Member

labarba commented Sep 19, 2016

To make the process transparent, do post here when you request a reviewer's help, and mention the reviewer by GitHub handle.

@aabeveridge

aabeveridge commented Sep 19, 2016

I just posted a message on Twitter and I provided the direct link to this page. Here is a link to the tweet: https://twitter.com/aaronbeveridge/status/777959812784984064

Thank you @labarba!

@arfon arfon changed the title Submission: MassMine: Your Access To Data [REVIEW]: MassMine: Your Access To Data Sep 20, 2016
@arfon
Member

arfon commented Sep 23, 2016

@whedon list editors

@whedon
Author

whedon commented Sep 23, 2016

Current JOSS editors:

@acabunoc
@arfon
@biorelated
@cMadan
@danielskatz
@jakevdp
@karthik
@katyhuff
@Kevin-Mattheus-Moerman
@kyleniemeyer
@labarba
@mgymrek
@pjotrp
@tracykteal

@arfon
Member

arfon commented Sep 23, 2016

@whedon assign @mgymrek as editor

@whedon
Author

whedon commented Sep 23, 2016

OK, the editor is @mgymrek

@arfon
Member

arfon commented Sep 23, 2016

@mgymrek 👋 I'm happy to help edit this one with you.

@mgymrek

mgymrek commented Sep 25, 2016

@mbfhunzaker @ptwobrussell @dnmilne is https://github.com/n3mo/massmine your cup of tea? Are you willing to sign up as a reviewer and review this submission?

@mgymrek

mgymrek commented Oct 6, 2016

@julianmcauley has agreed to review. Julian can you confirm that here?

@julianmcauley

Yes, happy to review

@arfon
Member

arfon commented Oct 7, 2016

@whedon commands

@whedon
Author

whedon commented Oct 7, 2016

Here are some things you can ask me to do:

# List all of Whedon's capabilities
@whedon commands

# Assign a GitHub user as the reviewer of this submission
@whedon assign @username as reviewer

# List the GitHub usernames of the JOSS editors
@whedon list editors

# List of JOSS reviewers together with programming language preferences and domain expertise
@whedon list reviewers

# Change editorial assignment
@whedon assign @username as editor

# Set the software archive DOI at the top of the issue e.g.
@whedon set 10.0000/zenodo.00000 as archive

# Open the review issue
@whedon start review

🚧 Important 🚧

This is all quite new. Please make sure you check the top of the issue after running a @whedon command (you might also need to refresh the page to see the issue update).

@arfon
Member

arfon commented Oct 7, 2016

@whedon assign @julianmcauley as reviewer

@whedon
Author

whedon commented Oct 7, 2016

OK, the reviewer is @julianmcauley

@whedon whedon assigned julianmcauley and mgymrek and unassigned mgymrek Oct 7, 2016
@arfon
Member

arfon commented Oct 7, 2016

👋 @julianmcauley. Thanks for agreeing to review this submission. Please take a look at the reviewer guidelines here: http://joss.theoj.org/about#reviewer_guidelines and update the checklist at the top of the issue as you progress through your review.

@mgymrek

mgymrek commented Nov 1, 2016

@julianmcauley have you had a chance to look at this submission? Let us know if you have questions about the review process.

@julianmcauley

Comments on the Checklist:

Installation: Does installation proceed as outlined in the documentation?
Installation instructions aren't really provided in the release, but rather on a link in the readme. Following that link leads to clear instructions.

Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
The dependencies are clearly stated on the webpage linked from the readme.

Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Somewhat. Examples are given in the form of YouTube videos containing walked-through examples. Personally, I find this a very difficult-to-navigate means of introducing people to the software. I would like to see such videos in addition to (rather than instead of) actual code examples.

Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
Documentation is okay. Mostly it's in the form of comments within the .scm code. There's not really organized documentation in a single document (that I can find?) though. Having no experience with Scheme, I'll admit that I find this a little difficult to parse.

Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
No (not that I can find). Automated tests here would be useful -- some of the functionality being implemented here depends on APIs from external websites that may change from time to time, causing the software to break. As it is I'm not sure how users will figure out that this has happened (other than the software failing).

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
No, I was not able to find this documentation.

References: Do all archival references that should have a DOI list one (e.g. papers, datasets, software)?
N/A

General comments:
Crawling large datasets is a cumbersome and time-consuming task, and one that it's valuable to build tools to automate. This software focuses on a few specific websites (Google, Tumblr, Twitter, Wikipedia), though potentially more could be added following the same framework in the future. The software seems easy enough to use, and the documentation is okay, but depends heavily on reading code and watching videos rather than (say) providing code examples directly, which might be easier.

Positively, I think this is a well put-together project that contains features that people would be interested in. Negatively, the focus so far is on websites that already have strong API support, so really this code is adding an additional layer on top of an easy-to-use API. To somebody like me (who has never before read Scheme code), it would be easier to follow the documentation from these websites' APIs directly rather than following this code. But maybe I'm in the minority. Certainly this wouldn't be an issue once more "hard to crawl" websites are added into the mix, if that's the plan.

@n3mo

n3mo commented Nov 5, 2016

Julian,

Thanks for the thoughtful feedback on the project. It may be helpful if I clarify an important detail. Although the code base is managed on GitHub, we envision the website located at www.massmine.org as the entry point for our typical end-user. Additional comments, with respect to this observation are detailed in-line below.

Installation: Does installation proceed as outlined in the documentation?
Installation instructions aren't really provided in the release, but rather on a link in the readme. Following that link leads to clear instructions.

The www.massmine.org website is intended to be the definitive source for installation files and documentation. The software is distributed as a download-able binary purposely to free the user from managing dependencies and compiling the software themselves.

Build instructions, as well as a link to the GitHub repository, are provided for advanced users, but we expect these users to be the exception rather than the rule.

Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Somewhat. Examples are given in the form of YouTube videos containing walked-through examples. Personally, I find this a very difficult-to-navigate means of introducing people to the software. I would like to see such videos in addition to (rather than instead of) actual code examples.

I believe you have revealed a weakness in the design of our online documentation at www.massmine.org. There are in fact examples throughout the documentation. These resources are available in a sidebar (see screenshot below) that is revealed by clicking on the menu icon in the top left of each web page. However, this sidebar is hidden by default to accommodate smaller screens. It seems that this has led to the undesirable side effect of causing it to go unnoticed.

[Screenshot: the documentation sidebar on www.massmine.org]

Here are a few examples of what I'm referring to: The documentation provides a broad overview of how to use MassMine, as well as a separate detailed example analysis of Twitter. Both of these resources contain copious code snippets that give new users copy-and-paste access to their first data set. Also, MassMine has further built-in help and examples for users, as documented on the broad overview page.

Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
Documentation is okay. Mostly it's in the form of comments within the .scm code. There's not really organized documentation in a single document (that I can find?) though. Having no experience with Scheme, I'll admit that I find this a little difficult to parse.

Once again, the website's navigation sidebar has hidden the full documentation. The sidebar indeed contains complete documentation to all of MassMine's options, with a separate page for each data source. Further, example usage code snippets are provided with each documented function. We certainly don't intend our typical user to ever have to look at the raw source code.

Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
No (not that I can find). Automated tests here would be useful -- some of the functionality being implemented here depends on APIs from external websites that may change from time to time, causing the software to break. As it is I'm not sure how users will figure out that this has happened (other than the software failing).

Automated tests are deliberately missing from the application itself. As explained above, our expected end-user will use the pre-built software tool, and we've intentionally shielded them from having to participate in the code-test-compile process.

As it stands, the software is written to be agnostic about the data returned by the various APIs. As such, if Twitter, for example, changes the data returned by a given API endpoint, MassMine should continue to work. That is, under the hood it makes no assumptions about the data it receives. This behavior has already made it robust to several upstream changes. For example, Twitter recently increased the number of trends returned from its REST API from 10 to 50. This change required no update to the MassMine code base.
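To make the "agnostic" point concrete, here is a minimal sketch in Python (not MassMine's actual Chicken Scheme code). The function name `dump_response` and both payloads are hypothetical, invented only for illustration. The idea is that a pass-through writer never inspects the fields it receives, so a response with 10 trends and one with 50 are handled identically.

```python
import io
import json


def dump_response(raw: str, out) -> None:
    """Write one raw API response to `out` as a single line.

    Hypothetical helper: the payload is treated as an opaque string,
    so no assumption is made about which fields it contains.
    """
    out.write(raw.rstrip("\n") + "\n")


# Two made-up trend payloads of different sizes (10 vs. 50 entries).
ten_trends = json.dumps([{"name": f"trend{i}"} for i in range(10)])
fifty_trends = json.dumps([{"name": f"trend{i}"} for i in range(50)])

buffer = io.StringIO()
dump_response(ten_trends, buffer)
dump_response(fifty_trends, buffer)

# Each response becomes exactly one output line, whatever its size.
lines = buffer.getvalue().splitlines()
```

Because the writer only ever sees a string, a change in the number or shape of records upstream needs no change here. A change to the endpoint URI is a different kind of failure, since the request itself would break.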

It is possible that more extensive (and rare) changes, such as adjustments to the underlying URIs of the APIs, will lead to errors. We are typically aware of such impending changes, which are often publicized in advance, and work to facilitate a fix prior to any problems.

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
No, I was not able to find this documentation.

This was indeed missing--thanks! I've added language to the readme on GitHub. Links to the GitHub repo are available at the top of the www.massmine.org website. In the future, we could consider adding information for contributors to the documentation website as well. We also plan to enable disqus comments throughout the documentation to provide support to users unaccustomed to GitHub.

General comments:
Crawling large datasets is a cumbersome and time-consuming task, and one that it's valuable to build tools to automate. This software focuses on a few specific websites (Google, Tumblr, Twitter, Wikipedia), though potentially more could be added following the same framework in the future. The software seems easy enough to use, and the documentation is okay, but depends heavily on reading code and watching videos rather than (say) providing code examples directly, which might be easier.

Positively, I think this is a well put-together project that contains features that people would be interested in. Negatively, the focus so far is on websites that already have strong API support, so really this code is adding an additional layer on top of an easy-to-use API. To somebody like me (who has never before read Scheme code), it would be easier to follow the documentation from these websites' APIs directly rather than following this code. But maybe I'm in the minority. Certainly this wouldn't be an issue once more "hard to crawl" websites are added into the mix, if that's the plan.

Hopefully, my comments above address most of your concerns. It seems the biggest problem was the perceived lack of documentation. Full documentation does indeed exist, albeit hidden from view by default. We plan to make it visible by default to avoid the extra step of clicking on the menu button which seems easy to miss.

Regarding your point about already-existing strong support for the current data sources: the purpose of this NEH grant project was to fill a missing technology gap for a specific class of users. Indeed, there are both (1) pre-made applications that analyze networked data sources, and (2) many packages for most major programming languages that provide low-level access to popular web APIs.

Existing options in category #1 are either proprietary and expensive, and/or only provide pre-defined analyses rather than raw data. Further, provided functionality is typically geared toward brand management and advertising applications, rendering it ineffective to open-ended research questions.

Existing options in category #2 require programming experience to use in any substantive way. For many researchers, the learning curve required makes data acquisition difficult. MassMine makes these data sources available to non-programmers interested in performing research on such data.

Additionally, MassMine does many things behind the scenes for the user, such as quietly managing rate limits imposed by the APIs, handling authentication credentials, and (responsibly) reconnecting dropped connections to prevent data loss due to common network hiccups. Also, we believe that providing a toolchain-agnostic application that provides a common user interface across many different APIs is of great benefit to users.
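As a rough illustration of the behind-the-scenes handling described above, here is a minimal Python sketch (again, not MassMine's Scheme implementation). The names `RateLimited`, `fetch_with_retry`, and `make_flaky_fetch` are hypothetical, as are the delay values. It assumes the API signals a rate limit with a wait time, and that dropped connections surface as `ConnectionError`.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal that the API asked us to wait `retry_after` seconds."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is exhausted.

    Rate limits are waited out for the time the server asked for;
    dropped connections are retried with exponential backoff.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited as err:
            last_error, delay = err, err.retry_after
        except ConnectionError as err:
            last_error, delay = err, base_delay * (2 ** attempt)
        if attempt < max_attempts - 1:
            sleep(delay)
    raise last_error


def make_flaky_fetch():
    """Fake data source: one dropped connection, one rate limit, then data."""
    outcomes = [ConnectionError("dropped"), RateLimited(15.0), '{"ok": true}']

    def fetch():
        outcome = outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return fetch


# Record the requested waits instead of actually sleeping.
waits = []
result = fetch_with_retry(make_flaky_fetch(), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. The user-facing call stays a single line while the waiting and reconnecting happen inside.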

If you have further questions or concerns after examining the documentation, please don't hesitate to let me know.

Best,
Nick

@mgymrek

mgymrek commented Nov 5, 2016

Thanks @julianmcauley! And thanks @n3mo for the quick response.

Regarding documentation: @julianmcauley does this clarification address your concerns? I agree it'd be helpful to move the documentation to a place more obviously visible from the home page. The documentation itself seems pretty thorough from glancing through it.

For tests: although the tool is indeed built for the end user, that does not exclude the possibility of adding any automated tests for developers. Indeed, one could argue that most software is intended for end-users to just run, but it should still be tested even if users don't run those tests. This is also a listed requirement for JOSS, so I would love to see some tests added unless there is a good reason not to.

@julianmcauley

Yes, that answers my concerns about documentation. I didn't see the sidebar, but rather I clicked the "docs" tab, and assumed that was it. It seems the docs tab links to only the "getting started" page of the usage instructions but doesn't contain a link to the remaining pages. This seems like an easy fix. But the usage documentation seems sufficiently thorough.

@n3mo

n3mo commented Nov 7, 2016

@mgymrek, thanks for the quick response. I agree that testing is useful for developers, especially those eager to contribute to unfamiliar code bases. One source of complexity for this project is the particulars of connecting to multiple external APIs. These APIs require users and developers alike to set up login credentials before using their services. This prevents MassMine from shipping with truly automatic testing, as we don't have spare OAuth credentials for each service to ship with the test suite. To date, this has kept us from distributing tests with MassMine. Therefore, we would prefer not to add testing, but if JOSS feels that this is a strong requirement, I'm sure a suitable compromise could be reached. Please advise.

@gymreklab

I see how it is tricky to make automated tests given the issue of credentials. Do any of these APIs have test credentials that can be used for testing purposes? I am also looping in @arfon to this conversation to see what he thinks.

@n3mo

n3mo commented Nov 8, 2016

To my knowledge they do not offer test credentials.

@arfon
Member

arfon commented Nov 9, 2016

I am also looping in @arfon to this conversation to see what he thinks.

Thanks for flagging this @gymreklab. Testing external APIs is always a little tricky but there are some language-specific tools (such as Webmock, VCR in Ruby-land) that achieve this. I'm not sure if there are similar tools for Scheme.

As an alternative, what about having some fixture data with example requests/responses from some of these external services and making sure that the MassMine package can process these responses as expected? This would then at least help someone who is trying to understand what the software is actually doing to view sample inputs and outputs.
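The fixture idea above could look roughly like the following. This is a minimal sketch in Python rather than Scheme, purely for illustration. The fixture contents, the `statuses` key, and the `process_response` function are all hypothetical stand-ins, not MassMine's real data or API.

```python
import json

# Hypothetical recorded response, standing in for a fixture file
# that would normally live alongside the tests.
FIXTURE_RESPONSE = (
    '{"statuses": [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]}'
)


def process_response(raw: str) -> list:
    """Hypothetical processing step: one JSON line per returned record."""
    payload = json.loads(raw)
    return [json.dumps(record, sort_keys=True) for record in payload["statuses"]]


def test_process_fixture():
    """Check that the recorded response is processed as expected, offline."""
    lines = process_response(FIXTURE_RESPONSE)
    assert len(lines) == 2
    assert json.loads(lines[0])["text"] == "hello"
    assert json.loads(lines[1])["id"] == 2


test_process_fixture()
```

No credentials or network access are needed, and the fixture doubles as a readable sample of input and output for anyone trying to understand what the software does.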

What do you think @n3mo?

@n3mo

n3mo commented Nov 11, 2016

Thanks for your thoughts @arfon. I agree that using fakes/mocks for external services is a reasonable compromise. I've begun adding testing to the software and will notify everyone once the update is available for review.

@n3mo

n3mo commented Nov 20, 2016

Tests are now available! They are included in a separate directory in the repo, but can be run directly with massmine. The installation instructions have been updated with details. Running the tests is simple once all build dependencies are installed:

./massmine.scm --test ./tests/run.scm

@mgymrek

mgymrek commented Nov 21, 2016

Great! @julianmcauley would you be able to take a look at the added tests?

@julianmcauley

Yes, certainly tests have been added, though to tell the truth I don't quite follow what functionality they really test. For twitter for instance the tests are:

(test-begin "Twitter Module")

(test-assert "Twitter task descriptions" (list? twitter-task-descriptions))
(test-assert "Twitter task options" (list? twitter-task-options))
(test-assert "Search rate limit" (list? search-rate-limit))
(test-assert "Trends rate limit" (list? trends-rate-limit))
(test-assert "Timeline rate limit" (list? timeline-rate-limit))
(test-assert "Friends rate limit" (list? friends-rate-limit))
(test-assert "Followers rate limit" (list? followers-rate-limit))

(test-end "Twitter Module")

Doesn't this just test that the methods exist but not actually test them for functionality? If so then I'm not sure the tests are all that valuable, though certainly they meet the basic requirement of "having tests". Apologies if I misunderstood.
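To illustrate the distinction, here is a small Python sketch (not the project's Scheme). The shape of `search_rate_limit`, its values, and the helper `seconds_between_requests` are all hypothetical. The first assertion mirrors a `(list? ...)` check; the second exercises behaviour and verifies a result.

```python
# Hypothetical rate-limit entry: [requests allowed, window length in seconds].
search_rate_limit = [180, 900]


def seconds_between_requests(rate_limit):
    """Hypothetical helper: minimum spacing that stays within the limit."""
    allowed, window = rate_limit
    return window / allowed


# Existence/shape check: passes for any list at all, even an empty one.
assert isinstance(search_rate_limit, list)

# Behavioural check: the value is actually used and the outcome verified.
assert seconds_between_requests(search_rate_limit) == 5.0
```

The first kind of check would still pass if the values were wrong; the second would not.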

@mgymrek

mgymrek commented Nov 29, 2016

@n3mo I have also taken a look at the tests and am not totally sure what is being tested. Could you provide a brief description here?

@n3mo

n3mo commented Nov 30, 2016

@julianmcauley and @mgymrek, thanks again for the continued feedback.

The various tests ensure a mixture of goals. In the simplest case, they ensure that the methods exist and that they exist in the proper format. Tests for the Twitter and Tumblr modules amount essentially to this for reasons that I'll return to. The remaining modules (Google, Web URL, & Wikipedia) provide full tests of the various tasks (i.e., data requests) that massmine provides. For these modules the tests target, in addition to the simple existence checks and helper procedures, the top-level functions that are called when massmine is run by the user, and thus fully assess the underlying functionality. That is, they make actual data requests and ensure successful retrieval.

For Twitter and Tumblr we are back to the previous conversation in this thread. Without API credentials for developers (which are not provided by these services), we cannot have fully automated tests. We have previously discussed utilizing mocks where possible. This is complicated for two reasons: First, the remaining methods make calls to functions provided by other imported packages not part of this code base, making it difficult or impossible to inject mock data directly for return. Second, were we able to set up a simulated host behind OAuth to make calls to during our tests (which would require a substantial amount of tangential work for this project, given that an existing framework such as Webmock or VCR does not already exist for Chicken Scheme), the reward would be minimal. The reason is that, as it stands, the API endpoints targeted by the Twitter and Tumblr modules simply catch the returned JSON data as a string, dumping the value either to stdout or to a file. Thus, in the end this substantial effort would amount to producing mock data that our functions are essentially agnostic to. So long as they are strings, everything will work, which casts doubt on the value of such a large undertaking.

@mgymrek

mgymrek commented Dec 1, 2016

Thanks @n3mo for the explanation. This sounds reasonable to me. @arfon, I think we can accept this. What is the next step?

@arfon
Member

arfon commented Dec 1, 2016

@n3mo - At this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.

@n3mo

n3mo commented Dec 6, 2016

I've archived the software with Zenodo. The DOI is: 10.5281/zenodo.193078

@arfon
Member

arfon commented Dec 6, 2016

@whedon set 10.5281/zenodo.193078 as archive

@whedon
Author

whedon commented Dec 6, 2016

OK. 10.5281/zenodo.193078 is the archive.

@arfon
Member

arfon commented Dec 7, 2016

Many thanks for reviewing this one @julianmcauley and @mgymrek for editing this paper.

@n3mo - your paper is now accepted into JOSS and your DOI is http://dx.doi.org/10.21105/joss.00050 ⚡️ 🚀 💥

@arfon arfon closed this as completed Dec 7, 2016
@arfon arfon added the accepted label Dec 7, 2016
@n3mo

n3mo commented Dec 7, 2016

Thanks @arfon, @julianmcauley, and @mgymrek for your valuable feedback throughout this process.

@whedon whedon added published Papers published in JOSS recommend-accept Papers recommended for acceptance in JOSS. labels Mar 2, 2020