[REVIEW]: MassMine: Your Access To Data #50

whedon · 2016-08-16T15:10:39Z

Submitting author: @n3mo (Nicholas Van Horn)
Repository: https://github.com/n3mo/massmine
Version: v1.0.1
Editor: @mgymrek
Reviewer: @julianmcauley
Archive: 10.5281/zenodo.193078

Status

Status badge code:

HTML: <a href="http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7"><img src="http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7/status.svg)](http://joss.theoj.org/papers/bcbd89b81c517e5123fc1cfa80501ae7)

Reviewer questions

Conflict of interest

As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

General checks

Repository: Is the source code for this software available at the repository url?
License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Version: Does the release version given match the GitHub release (v1.0.1)?

Functionality

Installation: Does installation proceed as outlined in the documentation?
Functionality: Have the functional claims of the software been confirmed?
Performance: Have any performance claims of the software been confirmed?

Documentation

A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

Paper PDF: 10.21105.joss.00050.pdf

Authors: Does the paper.md file include a list of authors with their affiliations?
A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
References: Do all archival references that should have a DOI list one (e.g. papers, datasets, software)?


arfon · 2016-08-16T15:11:46Z

@n3mo thanks for this submission! Before we can proceed would you mind extracting the references into a paper.bib file. We won't be able to process this submission until we can compile the paper using pandoc.

n3mo · 2016-08-16T16:02:03Z

@arfon thanks for the quick response. I've extracted the references as requested, as well as migrated both files to a directory named "paper". Let me know if there's anything else I can do.

arfon · 2016-08-16T16:12:24Z

/cc @openjournals/joss-reviewers - would anyone be willing to review this submission?

If you would like to review this submission then please comment on this thread so that others know you're doing a review (so as not to duplicate effort). Something as simple as :hand: I am reviewing this will suffice.

Reviewer instructions

Please work through the checklist at the start of this issue.
If you need any further guidance/clarification take a look at the reviewer guidelines here http://joss.theoj.org/about#reviewer_guidelines
Please make a publication recommendation at the end of your review

Any questions, please ask for help by commenting on this issue! 🚀

aabeveridge · 2016-09-19T19:34:48Z

Hello, I am the second author on MassMine and I have a question:

May we suggest or invite an outside reviewer, or do you have an internal list that you prefer for JOSS?

labarba · 2016-09-19T19:52:43Z

You can do both. For example, I'm lead author of this paper (now accepted) and I pinged someone on Twitter asking for a second reviewer:
#43

labarba · 2016-09-19T19:54:52Z

To make the process transparent, do post here when you request a reviewer's help, and mention the reviewer by GitHub handle.

aabeveridge · 2016-09-19T20:04:54Z

I just posted a message on Twitter and I provided the direct link to this page. Here is a link to the tweet: https://twitter.com/aaronbeveridge/status/777959812784984064

Thank you @labarba!

arfon · 2016-09-23T18:25:38Z

@whedon list editors

whedon · 2016-09-23T18:25:41Z

Current JOSS editors:

@acabunoc
@arfon
@biorelated
@cMadan
@danielskatz
@jakevdp
@karthik
@katyhuff
@Kevin-Mattheus-Moerman
@kyleniemeyer
@labarba
@mgymrek
@pjotrp
@tracykteal

arfon · 2016-09-23T18:25:57Z

@whedon assign @mgymrek as editor

whedon · 2016-09-23T18:26:00Z

OK, the editor is @mgymrek

arfon · 2016-09-23T18:27:41Z

@mgymrek 👋 I'm happy to help edit this one with you.

mgymrek · 2016-09-25T19:52:03Z

@mbfhunzaker @ptwobrussell @dnmilne is https://github.com/n3mo/massmine your cup of tea? Are you willing to sign up as a reviewer and review this submission?

mgymrek · 2016-10-06T23:20:16Z

@julianmcauley has agreed to review. Julian can you confirm that here?

julianmcauley · 2016-10-06T23:23:23Z

Yes, happy to review

arfon · 2016-10-07T04:37:53Z

@whedon commands

whedon · 2016-10-07T04:38:00Z

Here are some things you can ask me to do:

# List all of Whedon's capabilities
@whedon commands

# Assign a GitHub user as the reviewer of this submission
@whedon assign @username as reviewer

# List the GitHub usernames of the JOSS editors
@whedon list editors

# List of JOSS reviewers together with programming language preferences and domain expertise
@whedon list reviewers

# Change editorial assignment
@whedon assign @username as editor

# Set the software archive DOI at the top of the issue e.g.
@whedon set 10.0000/zenodo.00000 as archive

# Open the review issue
@whedon start review

🚧 Important 🚧

This is all quite new. Please make sure you check the top of the issue after running a @whedon command (you might also need to refresh the page to see the issue update).

arfon · 2016-10-07T04:38:17Z

@whedon assign @julianmcauley as reviewer

whedon · 2016-10-07T04:38:20Z

OK, the reviewer is @julianmcauley

arfon · 2016-10-07T04:39:52Z

👋 @julianmcauley. Thanks for agreeing to review this submission. Please take a look at the reviewer guidelines here: http://joss.theoj.org/about#reviewer_guidelines and update the checklist at the top of the issue as you progress through your review.

mgymrek · 2016-11-01T02:32:52Z

@julianmcauley have you had a chance to look at this submission? Let us know if you have questions about the review process.

julianmcauley · 2016-11-04T03:49:09Z

Comments on the Checklist:

Installation: Does installation proceed as outlined in the documentation?
Installation instructions aren't really provided in the release, but rather on a link in the readme. Following that link leads to clear instructions.

Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
The dependencies are clearly stated on the webpage linked from the readme.

Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Somewhat. Examples are given in the form of YouTube videos containing walked-through examples. Personally I find videos a difficult-to-navigate way of introducing people to the software. I would like to see such videos in addition to (rather than instead of) actual code examples.

Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
Documentation is okay. Mostly it's in the form of comments within the .scm code. There's not really organized documentation in a single document (that I can find?) though. Having no experience with Scheme, I'll admit that I find this a little difficult to parse.

Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
No (not that I can find). Automated tests here would be useful -- some of the functionality being implemented here depends on APIs from external websites that may change from time to time, causing the software to break. As it is I'm not sure how users will figure out that this has happened (other than the software failing).

Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
No, I was not able to find this documentation.

References: Do all archival references that should have a DOI list one (e.g. papers, datasets, software)?
N/A

General comments:
Crawling large datasets is a cumbersome and time-consuming task, and one that is valuable to automate with tools. This software focuses on a few specific websites (Google, Tumblr, Twitter, Wikipedia), though potentially more could be added following the same framework in the future. The software seems easy enough to use, and the documentation is okay, but depends heavily on reading code and watching videos rather than (say) providing code examples directly, which might be easier.

Positively, I think this is a well put-together project that contains features that people would be interested in. Negatively, the focus so far is on websites that already have strong API support, so really this code is adding an additional layer on top of an easy-to-use API. To somebody like me (who has never before read Scheme code), it would be easier to follow the documentation from these websites' APIs directly rather than following this code. But maybe I'm in the minority. Certainly this wouldn't be an issue once more "hard to crawl" websites are added into the mix, if that's the plan.

n3mo · 2016-11-05T16:00:53Z

Julian,

Thanks for the thoughtful feedback on the project. It may be helpful if I clarify an important detail. Although the code base is managed on GitHub, we envision the website located at www.massmine.org as the entry point for our typical end-user. Additional comments with respect to this observation are detailed inline below.

> Installation: Does installation proceed as outlined in the documentation?
> Installation instructions aren't really provided in the release, but rather on a link in the readme. Following that link leads to clear instructions.

The www.massmine.org website is intended to be the definitive source for installation files and documentation. The software is distributed as a downloadable binary specifically to free users from managing dependencies and compiling the software themselves.

Build instructions, as well as a link to the GitHub repository, are provided for advanced users, but we expect these users to be the exception rather than the rule.

> Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
> Somewhat. Examples are given in the form of YouTube videos containing walked-through examples. Personally I find videos a difficult-to-navigate way of introducing people to the software. I would like to see such videos in addition to (rather than instead of) actual code examples.

I believe you have revealed a weakness in the design of our online documentation at www.massmine.org. There are in fact examples throughout the documentation. These resources are available in a sidebar (see screenshot below) that is revealed by clicking on the menu icon in the top left of each web page. However, this sidebar is hidden by default to accommodate smaller screens. It seems that this has led to the undesirable side effect of causing it to go unnoticed.

Here are a few examples of what I'm referring to: The documentation provides a broad overview of how to use MassMine, as well as a separate detailed example analysis of Twitter. Both of these resources contain copious code snippets that give new users copy-and-paste access to their first data set. Also, MassMine has further built-in help and examples for users, as documented on the broad overview page.

> Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g. API method documentation)?
> Documentation is okay. Mostly it's in the form of comments within the .scm code. There's not really organized documentation in a single document (that I can find?) though. Having no experience with Scheme, I'll admit that I find this a little difficult to parse.

Once again, the website's navigation sidebar has hidden the full documentation. The sidebar indeed contains complete documentation to all of MassMine's options, with a separate page for each data source. Further, example usage code snippets are provided with each documented function. We certainly don't intend our typical user to ever have to look at the raw source code.

> Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
> No (not that I can find). Automated tests here would be useful -- some of the functionality being implemented here depends on APIs from external websites that may change from time to time, causing the software to break. As it is I'm not sure how users will figure out that this has happened (other than the software failing).

Automated tests are deliberately missing from the application itself. As explained above, our expected end-user will use the pre-built software tool, and we've intentionally shielded them from having to participate in the code-test-compile process.

As it stands, the software is written to be agnostic about the data returned by the various APIs. As such, if Twitter, for example, changes the data returned by a given API endpoint, MassMine should continue to work. That is, under the hood it makes no assumptions about the data it receives. This behavior has already made it robust to several upstream changes. For example, Twitter recently increased the number of trends returned from its REST API from 10 to 50. This change required no update to the MassMine code base.

It is possible that more extensive (and rare) changes, such as adjustments to the underlying URIs of the APIs, will lead to errors. We are typically aware of such impending changes, which are often publicized in advance, and work to facilitate a fix prior to any problems.

> Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
> No, I was not able to find this documentation.

This was indeed missing--thanks! I've added language to the readme on GitHub. Links to the GitHub repo are available at the top of the www.massmine.org website. In the future, we could consider adding information for contributors to the documentation website as well. We also plan to enable Disqus comments throughout the documentation to provide support to users unaccustomed to GitHub.

> General comments:
> Crawling large datasets is a cumbersome and time-consuming task, and one that is valuable to automate with tools. This software focuses on a few specific websites (Google, Tumblr, Twitter, Wikipedia), though potentially more could be added following the same framework in the future. The software seems easy enough to use, and the documentation is okay, but depends heavily on reading code and watching videos rather than (say) providing code examples directly, which might be easier.
>
> Positively, I think this is a well put-together project that contains features that people would be interested in. Negatively, the focus so far is on websites that already have strong API support, so really this code is adding an additional layer on top of an easy-to-use API. To somebody like me (who has never before read Scheme code), it would be easier to follow the documentation from these websites' APIs directly rather than following this code. But maybe I'm in the minority. Certainly this wouldn't be an issue once more "hard to crawl" websites are added into the mix, if that's the plan.

Hopefully, my comments above address most of your concerns. It seems the biggest problem was the perceived lack of documentation. Full documentation does indeed exist, albeit hidden from view by default. We plan to make it visible by default to avoid the extra step of clicking on the menu button which seems easy to miss.

Regarding your point about already-existing strong support for the current data sources: the purpose of this NEH grant project was to fill a technology gap for a specific class of users. Indeed, there are both (1) pre-made applications that analyze networked data sources, and (2) many packages for most major programming languages that provide low-level access to popular web APIs.

Existing options in category #1 are proprietary and expensive, provide only pre-defined analyses rather than raw data, or both. Further, the functionality they provide is typically geared toward brand management and advertising applications, rendering it ineffective for open-ended research questions.

Existing options in category #2 require programming experience to use in any substantive way. For many researchers, the learning curve required makes data acquisition difficult. MassMine makes these data sources available to non-programmers interested in performing research on such data.

Additionally, MassMine does many things behind the scenes for the user, such as quietly managing rate limits imposed by the APIs, handling authentication credentials, and (responsibly) reconnecting dropped connections to prevent data loss due to common network hiccups. Also, we believe that providing a toolchain-agnostic application that provides a common user interface across many different APIs is of great benefit to users.
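To make the "behind the scenes" work described above concrete, here is a minimal sketch of rate-limit handling combined with reconnecting after a dropped connection. It is written in Python purely for illustration and is not MassMine's Scheme implementation; the names `RateLimiter` and `fetch_with_retry`, the sliding-window approach, and the exponential backoff schedule are all assumptions made for the example.

```python
import time


class RateLimiter:
    """Allow at most `max_calls` calls per `window` seconds (illustrative only)."""

    def __init__(self, max_calls, window, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.calls = []  # timestamps of calls inside the current window

    def wait(self):
        now = self.clock()
        # Forget calls that have already left the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.max_calls:
            # Pause until the oldest recorded call expires.
            self.sleep(self.window - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.window]
        self.calls.append(now)


def fetch_with_retry(fetch, limiter, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` under the rate limit, retrying dropped connections."""
    for attempt in range(max_attempts):
        limiter.wait()
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # back off: 1s, 2s, 4s, ...
```

Passing `clock` and `sleep` as parameters is a design choice for the sketch: it lets the waiting logic be exercised without real delays.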

If you have further questions or concerns after examining the documentation, please don't hesitate to let me know.

Best,
Nick

mgymrek · 2016-11-05T18:18:59Z

Thanks @julianmcauley! And thanks @n3mo for the quick response.

Regarding documentation: @julianmcauley does this clarification address your concerns? I agree it'd be helpful to move the documentation to a place more obviously visible from the home page. The documentation itself seems pretty thorough from glancing through it.

For tests: although the tool is indeed built for the end user, that does not exclude the possibility of adding any automated tests for developers. Indeed, one could argue that most software is intended for end-users to just run, but it should still be tested even if users don't run those tests. This is also a listed requirement for JOSS, so I would love to see some tests added unless there is a good reason not to.

julianmcauley · 2016-11-06T22:57:22Z

Yes, that answers my concerns about documentation. I didn't see the sidebar, but rather I clicked the "docs" tab, and assumed that was it. It seems the docs tab links to only the "getting started" page of the usage instructions but doesn't contain a link to the remaining pages. This seems like an easy fix. But the usage documentation seems sufficiently thorough.

n3mo · 2016-11-07T03:27:31Z

@mgymrek, thanks for the quick response. I agree that testing is useful for developers, especially those eager to contribute to unfamiliar code bases. One source of complexity for this project is the particulars of connecting to multiple external APIs. These APIs require users and developers alike to set up login credentials before using their services. This prevents MassMine from shipping with truly automatic testing, as we don't have spare OAuth credentials for each service to ship with the test suite. To date, this has kept us from distributing tests with MassMine. Therefore, we would prefer not to add testing, but if JOSS feels that this is a strong requirement, I'm sure a suitable compromise could be reached. Please advise.
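One common way around the credentials problem described above is to split a suite into offline tests that always run and live tests that are skipped unless the developer supplies their own credentials. The following is a minimal Python sketch of that pattern, not something taken from MassMine; the environment variable names and the test contents are hypothetical.

```python
import os
import unittest

# Hypothetical variable names; the thread does not say how MassMine
# stores its credentials.
REQUIRED_VARS = ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET")


def have_credentials(env=os.environ):
    """Return True only if every required variable is set and non-empty."""
    return all(env.get(name) for name in REQUIRED_VARS)


class OfflineTests(unittest.TestCase):
    """Tests that need no network access and can run anywhere."""

    def test_task_descriptions_are_strings(self):
        tasks = {"twitter-search": "Search tweets by keyword"}  # made-up data
        self.assertTrue(all(isinstance(v, str) for v in tasks.values()))


@unittest.skipUnless(have_credentials(), "no API credentials in the environment")
class LiveTests(unittest.TestCase):
    """Tests that would contact the real service; skipped without credentials."""

    def test_credentials_present(self):
        # A real test would make an authenticated request here.
        self.assertTrue(have_credentials())
```

With this layout the suite still reports success on a machine without credentials, while making the skipped live tests visible in the test output.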

gymreklab · 2016-11-07T20:07:08Z

I see how it is tricky to make automated tests given the issue of credentials. Do any of these APIs have test credentials that can be used for testing purposes? I am also looping in @arfon to this conversation to see what he thinks.

n3mo · 2016-11-08T21:23:29Z

To my knowledge they do not offer test credentials.

arfon · 2016-11-09T17:53:28Z

> I am also looping in @arfon to this conversation to see what he thinks.

Thanks for flagging this @gymreklab. Testing external APIs is always a little tricky but there are some language-specific tools (such as WebMock and VCR in Ruby-land) that achieve this. I'm not sure if there are similar tools for Scheme.

As an alternative, what about having some fixture data with example requests/responses from some of these external services and making sure that the MassMine package can process these responses as expected? This would then at least help someone who is trying to understand what the software is actually doing to view sample inputs and outputs.

What do you think @n3mo?
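The fixture idea suggested above can be sketched roughly as follows. This is an illustrative Python example, not MassMine code: the fixture payload is made up, and `write_trends` is a hypothetical processing step standing in for whatever the software does with a response.

```python
import io
import json

# Hypothetical recorded response. A real fixture would be saved from an
# actual API call and stored in a file alongside the tests.
TRENDS_FIXTURE = json.dumps(
    [{"trends": [{"name": "#joss", "tweet_volume": 1200},
                 {"name": "#scheme", "tweet_volume": None}]}]
)


def write_trends(raw_response, out):
    """Write one JSON object per trend to `out`; return how many were written."""
    payload = json.loads(raw_response)
    count = 0
    for block in payload:
        for trend in block.get("trends", []):
            out.write(json.dumps(trend) + "\n")
            count += 1
    return count


def test_write_trends_from_fixture():
    out = io.StringIO()
    # The fixture replaces the network call, so no credentials are needed.
    assert write_trends(TRENDS_FIXTURE, out) == 2
    lines = out.getvalue().splitlines()
    assert json.loads(lines[0])["name"] == "#joss"
    assert json.loads(lines[1])["tweet_volume"] is None
```

Besides verifying behavior, a fixture like this doubles as documentation of a sample input and the output it produces.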

n3mo · 2016-11-11T13:53:45Z

Thanks for your thoughts @arfon. I agree that using fakes/mocks for external services is a reasonable compromise. I've begun adding testing to the software and will notify everyone once the update is available for review.

n3mo · 2016-11-20T22:22:25Z

Tests are now available! They are included in a separate directory in the repo, but can be run directly with massmine. The installation instructions have been updated with details. Running the tests is simple once all build dependencies are installed:

./massmine.scm --test ./tests/run.scm

mgymrek · 2016-11-21T17:13:05Z

Great! @julianmcauley would you be able to take a look at the added tests?

julianmcauley · 2016-11-25T04:48:06Z

Yes, certainly tests have been added, though to tell the truth I don't quite follow what functionality they really test. For twitter for instance the tests are:

```scheme
(test-begin "Twitter Module")

(test-assert "Twitter task descriptions" (list? twitter-task-descriptions))
(test-assert "Twitter task options" (list? twitter-task-options))
(test-assert "Search rate limit" (list? search-rate-limit))
(test-assert "Trends rate limit" (list? trends-rate-limit))
(test-assert "Timeline rate limit" (list? timeline-rate-limit))
(test-assert "Friends rate limit" (list? friends-rate-limit))
(test-assert "Followers rate limit" (list? followers-rate-limit))

(test-end "Twitter Module")
```

Doesn't this just test that the methods exist, rather than testing their functionality? If so, I'm not sure the tests are all that valuable, though they certainly meet the basic requirement of "having tests". Apologies if I misunderstood.
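The distinction being drawn here, between checking that a value has the right shape and checking that it behaves correctly, can be illustrated with a small Python sketch. Everything in it is hypothetical: `search_rate_limit` is a stand-in for a value like the one tested with `list?` above, the numbers are illustrative rather than MassMine's settings, and `seconds_between_calls` is an invented consumer of that value.

```python
# Hypothetical stand-in for a value such as `search-rate-limit`.
search_rate_limit = [180, 900]  # [requests allowed, window in seconds]


def seconds_between_calls(limit):
    """Return the minimum spacing between calls that respects `limit`."""
    requests, window = limit
    if requests <= 0 or window <= 0:
        raise ValueError("rate limit values must be positive")
    return window / requests


def test_shape_only():
    # Comparable to a list? assertion: it passes for any list,
    # including an empty one that could never be used.
    assert isinstance(search_rate_limit, list)
    assert isinstance([], list)


def test_behavior():
    # Checks what the value is used for, not only its type.
    assert seconds_between_calls(search_rate_limit) == 5.0
    try:
        seconds_between_calls([0, 900])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a zero request count")
```

The shape-only test would keep passing even if the value were replaced by an empty list, whereas the behavioral test would fail.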

mgymrek · 2016-11-29T00:18:50Z

@n3mo I have also taken a look at the tests and am not totally sure what is being tested. Could you provide a brief description here?

n3mo · 2016-11-30T04:26:33Z

@julianmcauley and @mgymrek, thanks again for the continued feedback.

The various tests serve a mixture of goals. In the simplest case, they ensure that the methods exist and that they exist in the proper format. Tests for the Twitter and Tumblr modules amount essentially to this, for reasons that I'll return to. The remaining modules (Google, Web URL, & Wikipedia) provide full tests of the various tasks (i.e., data requests) that massmine provides. For these modules the tests target, in addition to the simple existence checks and helper procedures, the top-level functions that are called when massmine is run by the user, and thus fully assess the underlying functionality. That is, they make actual data requests and ensure successful retrieval.

For Twitter and Tumblr we are back to the previous conversation in this thread. Without API credentials for developers (which are not provided by these services), we cannot have fully automated tests. We have previously discussed utilizing mocks where possible. This is complicated for two reasons. First, the remaining methods make calls to functions provided by other imported packages that are not part of this code base, making it difficult or impossible to inject mock data directly for return. Second, were we able to set up a simulated host behind OAuth to make calls to during our tests (which would require a substantial amount of tangential work for this project, given that no framework such as WebMock or VCR exists for Chicken Scheme), the reward would be minimal. The reason is that, as it stands, the API endpoints targeted by the Twitter and Tumblr modules simply catch the returned JSON data as a string, dumping the value either to stdout or to a file. Thus, in the end this substantial effort would amount to producing mock data that our functions are essentially agnostic to. So long as the responses are strings, everything will work, which casts doubt on the value of such a large undertaking.
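For readers weighing this argument, the following Python sketch shows what a test of a pass-through task looks like when the network call can be supplied as a parameter. That is an assumption of the sketch only; the comment above explains that such injection is not straightforward in the actual code base. `run_task` and the canned payload are hypothetical.

```python
import io


def run_task(fetch, out):
    """Fetch one response body and write it to `out` unchanged."""
    body = fetch()  # `fetch` is any zero-argument callable returning the body
    if not isinstance(body, str):
        raise TypeError("expected the response body as a string")
    out.write(body)
    out.write("\n")
    return len(body)


def test_run_task_passes_body_through():
    canned = '{"trends": [{"name": "#example"}]}'  # made-up payload
    out = io.StringIO()
    # The stand-in fetcher replaces the authenticated network request.
    assert run_task(lambda: canned, out) == len(canned)
    assert out.getvalue() == canned + "\n"


def test_run_task_rejects_non_strings():
    try:
        run_task(lambda: None, io.StringIO())
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```

As the comment above anticipates, such a test asserts very little about the payload itself, because the function treats the body as an opaque string.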

mgymrek · 2016-12-01T17:14:32Z

Thanks @n3mo for the explanation. This sounds reasonable to me. @arfon, I think we can accept this. What is the next step?

arfon · 2016-12-01T23:12:14Z

@n3mo - At this point could you make an archive of the reviewed software in Zenodo/figshare/other service and update this thread with the DOI of the archive? I can then move forward with accepting the submission.

n3mo · 2016-12-06T22:09:26Z

I've archived the software with Zenodo. The DOI is: 10.5281/zenodo.193078

arfon · 2016-12-06T23:03:02Z

@whedon set 10.5281/zenodo.193078 as archive

whedon · 2016-12-06T23:03:05Z

OK. 10.5281/zenodo.193078 is the archive.

arfon · 2016-12-07T00:52:41Z

Many thanks to @julianmcauley for reviewing this one and to @mgymrek for editing this paper.

@n3mo - your paper is now accepted into JOSS and your DOI is http://dx.doi.org/10.21105/joss.00050 ⚡️ 🚀 💥

n3mo · 2016-12-07T04:03:47Z

Thanks @arfon, @julianmcauley, and @mgymrek for your valuable feedback throughout this process.

whedon added the review label Aug 16, 2016

arfon changed the title ~~Submission: MassMine: Your Access To Data~~ [REVIEW]: MassMine: Your Access To Data Sep 20, 2016

whedon assigned mgymrek Sep 23, 2016

whedon assigned julianmcauley and mgymrek and unassigned mgymrek Oct 7, 2016

arfon closed this as completed Dec 7, 2016

arfon added the accepted label Dec 7, 2016

shahriariravanian mentioned this issue Apr 28, 2018

[PRE REVIEW]: fib-tf: A TensorFlow-based Cardiac Electrophysiology Simulator #702

Closed

whedon added the published and recommend-accept labels Mar 2, 2020
