Formalise the Structure of a CProject #10
Comments
Quickscrape
An example output folder after scraping two random papers, one open access and one not.
|
Getpapers
An example output folder showing two papers from EuPMC. Notice that in this case the results file
|
Agreed, this is important. The key thing is to find unique ids if possible, which is not always easy. Quickscrape uses URLs, and these will clearly not be unique. Getpapers is normally pointed at a repository, but I'm not sure whether, say, arXiv has unique ids (remember DOIs don't work for all documents). |
Quickscrape can now also use incrementing integers as the directory names. Arxiv does have unique ids, e.g. in http://arxiv.org/pdf/1601.00900v1.pdf the id is |
Norma Input
Norma can read input from files ending in the following extensions:
I think it can currently read from whatever file is specified with -i, but it may fail if the file doesn't meet a complex criterion called a reserved name. This means that the file must either be named one of the following:
or it must be a file of any name within a reserved directory with a name from this list:
I think this means that one could pass a command like |
Norma Output
In terms of nlm2html, Norma will happily output to wherever you ask it to, with whatever extension you desire. For example |
Ami
Just using the new This gives us many things as output in the CProject's CTrees:
We can use the 'old style' ami commands. For example
We don't normally get snippets or summaries running in this way. The way of generating summaries mentioned in the workshop notes no longer works ( This is then processed into summary files with the snippetsfiles as the input and typically a |
CProject python library
This simply looks for the following files: and however it uses the terminology 'type' rather than 'option'. |
Ami summarise (https://github.com/matthewgthomas/ami-summarise)
It seems to look for any XML file whose path contains either 'frequency' or 'binomial'. It writes to these files (to quote):
|
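A minimal sketch of the lookup described above for ami-summarise, assuming only what the comment states: any XML file whose path contains 'frequency' or 'binomial' is picked up. The function name `find_summary_inputs` is hypothetical and is not taken from ami-summarise itself, and matching against the path relative to the project root (rather than the absolute path) is an assumption.

```python
from pathlib import Path

# Keywords taken from the comment above; everything else is illustrative.
KEYWORDS = ("frequency", "binomial")


def find_summary_inputs(project_root):
    """Return XML files under project_root whose relative path contains a keyword."""
    root = Path(project_root)
    matches = []
    for path in sorted(root.rglob("*.xml")):
        # Match on the path relative to the project so that the location of
        # the project itself on disk cannot cause accidental matches.
        relative = path.relative_to(root).as_posix()
        if any(keyword in relative for keyword in KEYWORDS):
            matches.append(path)
    return matches
```

A substring match like this is fragile: any plugin that happens to use one of these words in a directory or file name would be swept into the summary, which is one argument for fixing the layout explicitly.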
Tom's Summary
I think throughout this we have found three different classes of files in the CProject/CTree as people currently use it. We have:
We should keep information that is tied to just one paper, and only depends on that one paper, in a single paper folder as much as possible. I think if possible we should include the bibliographic metadata that we get from quickscrape and getpapers in the paper folder rather than globally. The python CProject parses the scholarly html to extract this bibliographic metadata and could also save it here if wanted.
Information that is dependent on more than one paper should be kept outside of these paper folders (obviously). But it still needs to be ordered in some way. Peter suggests in #5 that we have a summary folder in the root. I think this is a good idea. The more we put into the root folder, the more we have to figure out what is and isn't a paper, or paper folder.
IMHO the logical rule is to have ami, or other tools that are (and only are) extracting facts from papers, placing data into project/paper/results/plugin[/option]/results.xml. To be honest I don't really see that we need to restrict it to results.xml (for example bag-of-words makes html). It could be JSON, plain text, whatever you want.
We should then have all CProject-wide summaries written to a summary folder in the root of the CProject. For example /project/summary/summarisername/projectWordCloud.html, /project/summary/summarisername2/full.dataTables.html and so on.
All that matters, I think, is that we don't allow people to have plugins with the same name (or they can trample on each other's data). Similarly we don't want a summarisername to be duplicated, for the same reason. We can then allow people to claim plugin names and state (to whatever level of precision they like) what the plugin will read and what it will output. This perhaps doesn't need to be done programmatically, but there should be somewhere we say what can do what. The same can be true of summarisers: they can only summarise the output of certain plugin(s).
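To make the proposed rule concrete, here is a rough sketch of how tools could build paths under it. It only encodes the two patterns named above (project/paper/results/plugin[/option]/results.xml and /project/summary/summarisername/...). The helper names `plugin_result_path` and `summary_path` and the `NameRegistry` class are hypothetical illustrations, not part of ami, norma, or the python CProject library.

```python
from pathlib import PurePosixPath


def plugin_result_path(project, paper, plugin, option=None, filename="results.xml"):
    """Path for per-paper plugin output: project/paper/results/plugin[/option]/filename."""
    parts = [project, paper, "results", plugin]
    if option:
        parts.append(option)
    parts.append(filename)
    return PurePosixPath(*parts)


def summary_path(project, summariser, filename):
    """Path for project-wide output: project/summary/summariser/filename."""
    return PurePosixPath(project, "summary", summariser, filename)


class NameRegistry:
    """Records who has claimed a plugin or summariser name (hypothetical helper)."""

    def __init__(self):
        self._owners = {}

    def claim(self, name, owner):
        # Refuse a second owner for the same name so that two tools cannot
        # write into the same results or summary directory.
        existing = self._owners.get(name)
        if existing is not None and existing != owner:
            raise ValueError(f"name {name!r} is already claimed by {existing!r}")
        self._owners[name] = owner
```

The `filename` argument is left open because, as noted above, output need not be restricted to results.xml. The registry could just as well be a documented list maintained by hand; the point is only that each name has one owner.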
Currently some bits of ami do summaries but I think that they should not and ought to be pulled out into another program. |
Also, I'm not sure if it will link him in but I noticed that @robintw seems to have been writing some python code to read cProjects. I will try and get in contact to find out what structure he currently relies on and so on / if he has suggestions. |
Peter and I had a chat today and we proposed something like this as a layout for the CProject
|
Thanks for looping me in on this. At the moment, my Python code just depends on the output folder structure produced by
It assumes that all folders beneath I'm a little confused by @tarrow's most recent post - I can't quite see how that links to the quickscrape output structure. I assume |
Typical console output from
Proposal: we should capture much of this in |
@robintw Thanks |
I think we're close to something workable. The naming of things needs some attention - it needs to be clear what's what if these are sitting in the normal filesystem. Also we should minimise the verbosity and the depth of the directory structure. Using the example you gave above @tarrow, and for now just focusing on 'metadata'. The original is:
First thing, I think we should eliminate the redundant directories - just name the files:
Secondly, what is supposed to be in these files? The names don't suggest anything meaningful to me. It seems to me that we only need to capture:
In both cases there may be multiple files, so how about just:
Where:
|
I like the most recent suggestion from @blahah, particularly the specific names for the json files and the |
Explanation of metadata:
I suggested directories rather than names as we may wish to group information. I would certainly want both |
OK, so I think
|
agree about agree about BibJSON. probably agree about the rest. I am mainly concerned about what we might get from repos other than EPMC and don't want to limit. |
regarding the pyCProject naming: I took |
There is also the question of multiple files of the same type - the
there should also be
Peter Murray-Rust |
There is a problem with empty directories created by |
I'm making this issue to try and formalise what should and shouldn't be in a CProject. Since the main interface between parts of the software is the filesystem tree of a CProject there needs to be a standard so that people can write other programs to interface with it.
This is also an exercise in trying to keep our development decisions more open; if we want to attract outside contributors, the decision process needs to be as transparent as possible, and we need to solicit input from anyone interested.
In the following comments I'm setting out what I think the current position is from the various bits of software, specifically, quickscrape, getpapers, norma, ami and the python CProject library. Then I'll make some suggestions as to how I think it should be laid out. Please make any suggestions you think are important. It's much better that we get this design right now rather than do it quickly.