Is134 get db cohort method data #136

Merged
merged 17 commits into from
Apr 13, 2023

mvankessel-EMC
Collaborator

Pull Request for issue: #134

This PR contains the following changes:

  1. Moved the downsampling code to the function downSample.
  2. Deprecated the boolean (logical) support for removeDuplicateSubjects, which now defaults to "keep all", and updated test-simulation.R to accommodate this change.
  3. Moved the studyStartDate and studyEndDate NULL updates after the assertions.
  4. Moved the presampling code to the function preSample, which is run for both the target and comparator cohorts.
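
Change 2 can be illustrated with a minimal sketch of how a deprecated logical value might be mapped onto string options. The helper name `normalizeRemoveDuplicateSubjects`, the option names other than "keep all", and the mapping of `TRUE` to "remove all" are assumptions made for illustration, not code taken from this PR.

```r
# Hypothetical helper: maps a deprecated logical argument onto string options.
normalizeRemoveDuplicateSubjects <- function(removeDuplicateSubjects = "keep all") {
  if (is.logical(removeDuplicateSubjects)) {
    warning("Logical values for removeDuplicateSubjects are deprecated; use a string instead.")
    # Assumed mapping: TRUE -> "remove all", FALSE -> "keep all"
    removeDuplicateSubjects <- if (isTRUE(removeDuplicateSubjects)) "remove all" else "keep all"
  }
  # Assumed set of valid options; anything else raises an error.
  match.arg(removeDuplicateSubjects, c("keep all", "keep first", "remove all"))
}

normalizeRemoveDuplicateSubjects()              # default string option
normalizeRemoveDuplicateSubjects("keep first")  # string options pass through
```

Callers that still pass `TRUE` or `FALSE` keep working under this sketch but see a warning, which gives them time to switch to the string form.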

@mvankessel-EMC mvankessel-EMC added the enhancement New functionality that could be added label Apr 11, 2023
@mvankessel-EMC mvankessel-EMC requested a review from schuemie April 11, 2023 13:00
tempEmulationSchema,
targetId,
maxCohortSize,
sampled)
Member

@schuemie schuemie Apr 11, 2023

Why does downSample() need sampled, which is guaranteed to be FALSE?

Finally: if a function has more than 2 arguments, I really prefer to use named arguments, even if it looks dumb, just to avoid accidentally assigning a value to the wrong parameter. So connection = connection, etc.
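
The risk with positional arguments can be shown with a small sketch. `describeSample` is a hypothetical function, not part of CohortMethod; because both of its trailing arguments are numeric, swapping them raises no error.

```r
# Hypothetical function with two numeric arguments that are easy to swap.
describeSample <- function(connection, targetId, maxCohortSize) {
  paste0("target ", targetId, ", max size ", maxCohortSize)
}

connection <- NULL  # placeholder; no database is needed for this illustration

# Positional call with the last two values swapped: runs without error, wrong result.
wrong <- describeSample(connection, 5000, 1)

# Named call: order no longer matters and the intent is visible at the call site.
right <- describeSample(maxCohortSize = 5000, targetId = 1, connection = connection)
```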

Collaborator Author

Yes, I'll remove the sampled argument, and just specify it in downSample as FALSE at the start.

This is the code from the develop branch that downSample replaces (DataLoadingSaving.R, lines 196-238):

renderedSql <- SqlRender::loadRenderTranslateSql("CountCohorts.sql",
  packageName = "CohortMethod",
  dbms = connectionDetails$dbms,
  tempEmulationSchema = tempEmulationSchema,
  target_id = targetId
)
counts <- DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
preSampleCounts <- dplyr::tibble(dummy = 0)
idx <- which(counts$treatment == 1)
if (length(idx) == 0) {
  preSampleCounts$targetPersons <- 0
  preSampleCounts$targetExposures <- 0
} else {
  preSampleCounts$targetPersons <- counts$personCount[idx]
  preSampleCounts$targetExposures <- counts$rowCount[idx]
}
idx <- which(counts$treatment == 0)
if (length(idx) == 0) {
  preSampleCounts$comparatorPersons <- 0
  preSampleCounts$comparatorExposures <- 0
} else {
  preSampleCounts$comparatorPersons <- counts$personCount[idx]
  preSampleCounts$comparatorExposures <- counts$rowCount[idx]
}
preSampleCounts$dummy <- NULL
if (preSampleCounts$targetExposures > maxCohortSize) {
  message("Downsampling target cohort from ", preSampleCounts$targetExposures, " to ", maxCohortSize)
  sampled <- TRUE
}
if (preSampleCounts$comparatorExposures > maxCohortSize) {
  message("Downsampling comparator cohort from ", preSampleCounts$comparatorExposures, " to ", maxCohortSize)
  sampled <- TRUE
}
if (sampled) {
  renderedSql <- SqlRender::loadRenderTranslateSql("SampleCohorts.sql",
    packageName = "CohortMethod",
    dbms = connectionDetails$dbms,
    tempEmulationSchema = tempEmulationSchema,
    max_cohort_size = maxCohortSize
  )
  DatabaseConnector::executeSql(connection, renderedSql)
}

}
DatabaseConnector::executeSql(connection, renderedSql)

sampled <- FALSE
Member

I recommend moving sampled <- FALSE to an else clause of if (maxCohortSize != 0) {
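
A sketch of the suggested structure, with the database work replaced by a stand-in so it runs on its own. `downSampleStub` and `resolveSampled` are hypothetical names, and the stub's comparison against `maxCohortSize` only mimics the checks quoted above.

```r
# Hypothetical stand-in for the downsampling step; returns TRUE when any cohort
# exceeds maxCohortSize. The real downSample() would query the database.
downSampleStub <- function(exposureCounts, maxCohortSize) {
  any(exposureCounts > maxCohortSize)
}

resolveSampled <- function(exposureCounts, maxCohortSize) {
  if (maxCohortSize != 0) {
    sampled <- downSampleStub(exposureCounts, maxCohortSize)
  } else {
    # No size limit requested, so nothing was sampled.
    sampled <- FALSE
  }
  sampled
}

resolveSampled(exposureCounts = c(10, 20), maxCohortSize = 0)
resolveSampled(exposureCounts = c(10, 20), maxCohortSize = 15)
```

With the assignment in the else clause, every branch sets `sampled` exactly once, so the value never depends on an earlier default being left in place.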

Collaborator Author

Done: 3056327.

@codecov

codecov bot commented Apr 11, 2023

Codecov Report

Merging #136 (e53d78e) into develop (456537f) will decrease coverage by 1.62%.
The diff coverage is 58.04%.

❗ Current head e53d78e differs from pull request most recent head bbbb1d4. Consider uploading reports for the commit bbbb1d4 to get more accurate results

@@             Coverage Diff             @@
##           develop     #136      +/-   ##
===========================================
- Coverage    88.63%   87.02%   -1.62%     
===========================================
  Files           22       23       +1     
  Lines         5172     5316     +144     
===========================================
+ Hits          4584     4626      +42     
- Misses         588      690     +102     
| Impacted Files | Coverage Δ |
|---|---|
| R/HelperFunctions.R | 66.66% <0.00%> (-21.57%) ⬇️ |
| R/Viewer.R | 0.00% <0.00%> (ø) |
| R/Export.R | 89.44% <33.33%> (-0.42%) ⬇️ |
| R/PsFunctions.R | 81.63% <68.00%> (ø) |
| R/RunAnalyses.R | 92.91% <77.77%> (ø) |
| R/StudyPopulation.R | 94.41% <90.90%> (ø) |
| R/DataLoadingSaving.R | 93.10% <92.59%> (+0.86%) ⬆️ |
| R/Balance.R | 76.67% <100.00%> (ø) |
| R/OutcomeModels.R | 93.11% <100.00%> (ø) |
| R/Simulation.R | 99.52% <100.00%> (ø) |


}
return(covariateSettings)

preSample <- function(idx, colType, counts, preSampleCounts) {
Member

In general, function names should be verb + noun. So here, maybe call it countPreSample()?

Collaborator Author

Done: 3056327.

DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
preSampleCounts <- dplyr::tibble(dummy = 0)
idx <- which(counts$treatment == 1)
Member

Why not move the computing of idx to the preSample() function? You can pass which treatment (0 or 1) as an argument, which would allow you to remove the colType argument.
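
A minimal sketch of that suggestion, using a base R data frame in place of the query result. The function name follows the naming suggestion made earlier in this review; the signature and return shape are assumptions, not the code that was committed.

```r
# Sketch: the function takes the treatment value (1 = target, 0 = comparator)
# and looks up the matching row itself, so no colType argument is needed.
countPreSample <- function(counts, treatment) {
  idx <- which(counts$treatment == treatment)
  if (length(idx) == 0) {
    list(persons = 0, exposures = 0)
  } else {
    list(persons = counts$personCount[idx], exposures = counts$rowCount[idx])
  }
}

# Made-up counts standing in for the result of CountCohorts.sql;
# the comparator row is deliberately missing.
counts <- data.frame(treatment = 1, personCount = 80, rowCount = 100)
target <- countPreSample(counts, treatment = 1)
comparator <- countPreSample(counts, treatment = 0)
```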

Collaborator Author

Done: 3056327.

However, the implementation is a bit more verbose; let me know what you prefer.

counts <-
DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
preSampleCounts <- dplyr::tibble(dummy = 0)
Member

Instead of starting with a tibble with a dummy variable that preSample() operates on, why not have preSample() create a tibble with one row of variables, and then simply bind_cols() those variables for the target and comparator?
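
One way this could look, assuming dplyr is installed. The `prefix` argument and the made-up counts are assumptions for this illustration; the version committed in the PR may differ.

```r
library(dplyr)

# Sketch: returns a one-row tibble whose column names carry the cohort prefix,
# so no dummy column is needed to start from.
countPreSample <- function(counts, treatment, prefix) {
  idx <- which(counts$treatment == treatment)
  persons <- if (length(idx) == 0) 0 else counts$personCount[idx]
  exposures <- if (length(idx) == 0) 0 else counts$rowCount[idx]
  result <- tibble(persons = persons, exposures = exposures)
  names(result) <- paste0(prefix, c("Persons", "Exposures"))
  result
}

# Made-up counts standing in for the result of CountCohorts.sql.
counts <- tibble(treatment = c(1, 0), personCount = c(80, 60), rowCount = c(100, 70))

# Bind the two one-row tibbles side by side into a single row of four columns.
preSampleCounts <- bind_cols(
  countPreSample(counts, treatment = 1, prefix = "target"),
  countPreSample(counts, treatment = 0, prefix = "comparator")
)
```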

Collaborator Author

I misread this comment. I'll have a look.

Collaborator Author

Done: bbbb1d4.

@schuemie
Member

Hi @mvankessel-EMC. I added some comments throughout.

I also see you're using a very specific code style. It is not clear to me why you sometimes break up a line into multiple lines. As a rule of thumb, I try to break up lines if they exceed 80 characters (although I'll go to a maximum of 100 characters if it reads better).

Also, this is my personal preference, but I prefer

value <- computeValue(argument)

instead of

value <- 
  computeValue(argument)

@schuemie
Member

@mvankessel-EMC : let me know when I can review again

@mvankessel-EMC
Collaborator Author

mvankessel-EMC commented Apr 13, 2023

@mvankessel-EMC : let me know when I can review again

Hi @schuemie, I pushed new updates. Latest commit: bbbb1d4.

Please let me know what you think.

@schuemie
Member

Looks great! As discussed, further refactoring would probably require turning the meta-data into some nice object that can be passed around by reference. But I'll merge what you've done so far, and leave it to you if you want to work on that.
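
One way to read "passed around by reference" in R is an environment, since environments are not copied when passed to a function. The sketch below uses only base R; `newMetaData` and `recordSampling` are hypothetical names, the single `sampled` field is an assumption, and an R6 class would be another option.

```r
# Hypothetical constructor: an environment holding metadata fields.
newMetaData <- function() {
  metaData <- new.env(parent = emptyenv())
  metaData$sampled <- FALSE
  metaData
}

# Modifies the caller's object in place; nothing needs to be returned.
recordSampling <- function(metaData, sampled) {
  metaData$sampled <- sampled
  invisible(NULL)
}

metaData <- newMetaData()
recordSampling(metaData, sampled = TRUE)
metaData$sampled  # reflects the change made inside recordSampling()
```

A list or tibble passed the same way would be copied on modification, so helper functions would have to return the updated object instead.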

@schuemie schuemie merged commit 6a766e5 into develop Apr 13, 2023