read_refs() incorrectly splitting abstract over multiple fields #18

Open
nealhaddaway opened this issue Mar 11, 2022 · 9 comments
Labels: bug (Something isn't working)

@nealhaddaway
Collaborator

read_refs() is incorrectly splitting the abstract in this record across multiple fields:
10.1111/j.1469-8137.2004.01201.x

@nealhaddaway
Collaborator Author

I believe the cause is this line, x <- gsub(",(?=\\s[[:alpha:]]{2,})", " and ", x, perl = TRUE) (line 18 of 'clean_functions.R'):
https://github.com/mjwestgate/synthesisr/blob/master/R/clean_functions.R
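
To see what the quoted substitution does on its own, here is a minimal sketch in R that applies the same pattern to part of one sentence from the abstract of the affected record. The variable names (`pattern`, `sentence`, `rewritten`, `initial`) and the "Smith, J" input are illustrative and not part of synthesisr. The sketch does not show that this line causes the splitting; it only shows that commas inside free text match the pattern.

```r
# Pattern copied from the gsub() call quoted above: a comma that is followed
# by one whitespace character and at least two letters.
pattern <- ",(?=\\s[[:alpha:]]{2,})"

# Illustrative input: part of a sentence from the abstract of the affected record.
sentence <- "Management practices, climate and elevated CO2 strongly influence C sequestration rates"

# The comma is replaced by " and "; the original space after it is kept,
# so the result contains two consecutive spaces.
rewritten <- gsub(pattern, " and ", sentence, perl = TRUE)
print(rewritten)

# A comma followed by a single letter does not match and is left unchanged.
initial <- gsub(pattern, " and ", "Smith, J", perl = TRUE)
print(initial)
```

If a later step separates values on " and ", text rewritten this way could end up in more than one field, which would be consistent with the report. That later step is an assumption here, not something confirmed in this thread.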

@chriscpritchard

I'm looking into this, but I can't seem to replicate the issue. Would you be able to upload an RIS file where this occurs?

@nealhaddaway
Collaborator Author

I've emailed you a file; I don't think I can publish it here.

@chriscpritchard

I've had a look and I still can't replicate this.

The abstract is all in the abstract field:

r$> x <- read_refs("C:\\Users/chris/Downloads/references-problem.ris")
r$> View(x)

gives me:

"Contents I. Introduction 2 II. Carbon in temperate grasslands 2 III. The process of carbon sequestration in soils 4 IV. Tracking carbon movement 9 V. Models of soil carbon dynamics 10 VI. Management effects on carbon sequestration 11 VII. Climate-change effects on carbon sequestration 12 VIII. Response to elevated CO2 13 IX. Conclusions 14 References 14 Summary The substantial stocks of carbon sequestered in temperate grassland ecosystems are located largely below ground in roots and soil. Organic C in the soil is located in discrete pools, but the characteristics of these pools are still uncertain. Carbon sequestration can be determined directly by measuring changes in C pools, indirectly by using 13C as a tracer, or by simulation modelling. All these methods have their limitations, but long-term estimates rely almost exclusively on modelling. Measured and modelled rates of C sequestration range from 0 to > 8 Mg C ha−1 yr−1. Management practices, climate and elevated CO2 strongly influence C sequestration rates and their influence on future C stocks in grassland soils is considered. Currently there is significant potential to increase C sequestration in temperate grassland systems by changes in management, but climate change and increasing CO2 concentrations in future will also have significant impacts. Global warming may negate any storage stimulated by changed management and elevated CO2, although there is increasing evidence that the reverse could be the case."

It might be helpful if you could demo this for me or explain the exact steps to replicate the bug.

@nealhaddaway
Collaborator Author

nealhaddaway commented Mar 16, 2022 via email

@nealhaddaway
Collaborator Author

nealhaddaway commented Mar 16, 2022 via email

@chriscpritchard

Just checked: I get those columns when using the CRAN version. It appears to be fixed in master, perhaps in c406bc9.

@nealhaddaway
Collaborator Author

nealhaddaway commented Mar 16, 2022 via email

@nealhaddaway added the bug (Something isn't working) label on Apr 10, 2022
@nealhaddaway
Collaborator Author

This is still happening with a different file when using the GitHub version. See this example EMBASE file: https://gitlab.com/extending-the-earcheck/living-review/-/blob/master/search/literature_search_02/Embase_290521_N974.RIS?expanded=true&viewer=simple

It is read in across 30 columns, not all of which should be there.
