-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is134 get db cohort method data #136
Conversation
R/DataLoadingSaving.R
Outdated
tempEmulationSchema, | ||
targetId, | ||
maxCohortSize, | ||
sampled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does downSample()
need sampled
, which is guaranteed to be FALSE
?
Finally: If a function has more than 2 arguments I really prefer to use named arguments, even if it looks dumb, just to avoid correctly assigning the wrong value to the wrong parameter. So connection = connection,
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll remove the sampled
argument, and just specify it in downSample
as FALSE
at the start.
This is the bit of code from the develop branch that downSample
replaces DataLoadingSaving.R L196:238:
renderedSql <- SqlRender::loadRenderTranslateSql("CountCohorts.sql",
packageName = "CohortMethod",
dbms = connectionDetails$dbms,
tempEmulationSchema = tempEmulationSchema,
target_id = targetId
)
counts <- DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
preSampleCounts <- dplyr::tibble(dummy = 0)
idx <- which(counts$treatment == 1)
if (length(idx) == 0) {
preSampleCounts$targetPersons <- 0
preSampleCounts$targetExposures <- 0
} else {
preSampleCounts$targetPersons <- counts$personCount[idx]
preSampleCounts$targetExposures <- counts$rowCount[idx]
}
idx <- which(counts$treatment == 0)
if (length(idx) == 0) {
preSampleCounts$comparatorPersons <- 0
preSampleCounts$comparatorExposures <- 0
} else {
preSampleCounts$comparatorPersons <- counts$personCount[idx]
preSampleCounts$comparatorExposures <- counts$rowCount[idx]
}
preSampleCounts$dummy <- NULL
if (preSampleCounts$targetExposures > maxCohortSize) {
message("Downsampling target cohort from ", preSampleCounts$targetExposures, " to ", maxCohortSize)
sampled <- TRUE
}
if (preSampleCounts$comparatorExposures > maxCohortSize) {
message("Downsampling comparator cohort from ", preSampleCounts$comparatorExposures, " to ", maxCohortSize)
sampled <- TRUE
}
if (sampled) {
renderedSql <- SqlRender::loadRenderTranslateSql("SampleCohorts.sql",
packageName = "CohortMethod",
dbms = connectionDetails$dbms,
tempEmulationSchema = tempEmulationSchema,
max_cohort_size = maxCohortSize
)
DatabaseConnector::executeSql(connection, renderedSql)
}
R/DataLoadingSaving.R
Outdated
} | ||
DatabaseConnector::executeSql(connection, renderedSql) | ||
|
||
sampled <- FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend moving sampled <- FALSE
to an else
clause of if (maxCohortSize != 0) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 3056327.
Codecov Report
@@ Coverage Diff @@
## develop #136 +/- ##
===========================================
- Coverage 88.63% 87.02% -1.62%
===========================================
Files 22 23 +1
Lines 5172 5316 +144
===========================================
+ Hits 4584 4626 +42
- Misses 588 690 +102
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
R/DataLoadingSaving.R
Outdated
} | ||
return(covariateSettings) | ||
|
||
preSample <- function(idx, colType, counts, preSampleCounts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, function names should be verb + noun. So here, maybe call it countPreSample()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 3056327.
R/DataLoadingSaving.R
Outdated
DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE) | ||
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount)) | ||
preSampleCounts <- dplyr::tibble(dummy = 0) | ||
idx <- which(counts$treatment == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not move the computing of idx
to the preSample()
function? You can pass which treatment (0 or 1) as an argument, which would allow you to remove the colType
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: 3056327.
However, the implementation is a bit more verbose, let me know what you prefer.
R/DataLoadingSaving.R
Outdated
counts <- | ||
DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE) | ||
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount)) | ||
preSampleCounts <- dplyr::tibble(dummy = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of starting with a tibble with a dummy variable that preSample()
operates on, why not have preSample()
create a tibble with one row of variables, and then simply bind_cols()
those variables for the target and comparator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I miss read this comment.. I'll have a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: bbbb1d4.
Hi @mvankessel-EMC . I added some comments throughout. I also see you're using a very specific code style. It is not clear to me why you sometimes break up a line into multiple lines. As a rule-of-thumb I try to break up lines if they exceed 80 characters (although I'll go to maximum of 100 characters if it reads better). Also, this is my personal preference, but I prefer value <- computeValue(argument) instead of value <-
computeValue(argument) |
@mvankessel-EMC : let me know when I can review again |
Hi @schuemie, I pushed new updates. Latest commit: bbbb1d4. Please let me know what you think. |
Looks great! As discussed, further refactoring would probably require turning the meta-data into some nice object that can be passed around by reference. But I'll merge what you've done so far, and leave it to you if you want to work on that. |
Pull Request for issue: #134
This PR contains the following changes:
downSample
.removeDuplicateSubjects
, which now defaults to"keep all"
, andupdated test-simulation.R
to accommodate this change.studyStartDate
andstudyEndDate
NULL updates after the assertions.preSample
, which is re-run for target and comparator cohorts.