High number of fitError when working with high level of missingness (scp) #63

leopoldguyot · 2024-08-26T08:02:28Z

The Issue

When working with msqrob2, particularly with the msqrobLm function on single-cell data, a significant number of features produce a fitError object. This issue arises from the high proportion of missing data. On investigation, the features that could not be estimated turned out to be those for which one of the reference levels of the factor variables in the formula had no observations. This is a problem because of how msqrob2 subsets the data before modeling: for each feature, it removes all rows and columns of the design matrix that have no observations. However, it cannot remove the reference level, since that level is "buried" within the model formulation and has no column of its own. Consequently, the resulting design matrix is rank-deficient and cannot be fitted with methods like MASS::rlm.
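The rank deficiency described above can be reproduced on a toy design. This is only a sketch in base R, assuming default treatment contrasts; the factor `condition`, the `observed` mask, and the column-dropping rule are made up for illustration and are not msqrob2's actual code.

```r
# Toy design: one factor with three levels, reference level "A".
condition <- factor(c("A", "A", "B", "B", "C", "C"))
X <- model.matrix(~ condition)

# Assumption: this feature was only quantified in the "B" and "C" samples.
observed <- condition != "A"
Xsub <- X[observed, , drop = FALSE]

# Drop columns without any observation (all-zero columns). Nothing is
# dropped here: the intercept is always 1, and conditionB and conditionC
# both have observed samples.
Xsub <- Xsub[, colSums(Xsub != 0) > 0, drop = FALSE]

nCols <- ncol(Xsub)      # 3 columns remain
rankX <- qr(Xsub)$rank   # but only 2 are linearly independent
```

In the remaining rows the intercept column equals conditionB + conditionC, so the three columns cannot be estimated separately.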

To understand how other tools handle this, we examined limma and found that it does not manage these special cases. Instead, limma returns all models, even if they are rank-deficient. This means that, for features with no observations in the reference level, the reference level changes during modeling. This shift leads to confusing and potentially misleading interpretations of the associated coefficients.

The Fix

To address the rank deficiency, we changed how the model matrix is subset. Instead of dropping columns of the design matrix based on missing observations, we propose keeping the first x columns in pivot order, where x is the rank of the model matrix. Both x and the pivot order of the columns are obtained from a QR decomposition. This approach allows us to fit features that lack observations in their reference levels.
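A minimal sketch of this column selection, using base R's `qr()` on the same kind of toy design (default treatment contrasts assumed). The variable names and the response values are invented for the example, and `lm.fit` stands in for the robust fit; this is not the code merged in the package.

```r
# Toy design where the reference level "A" has no observations.
condition <- factor(c("A", "A", "B", "B", "C", "C"))
X <- model.matrix(~ condition)
observed <- condition != "A"
Xsub <- X[observed, , drop = FALSE]

# The QR decomposition gives both the rank and the pivot order.
qrX <- qr(Xsub)
x <- qrX$rank                      # rank of the model matrix
keep <- qrX$pivot[seq_len(x)]      # first x columns in pivot order
Xfull <- Xsub[, keep, drop = FALSE]
keptCols <- colnames(Xfull)        # "(Intercept)" "conditionB"

# The reduced matrix has full column rank, so it can be fitted.
y <- c(12, 12, 10, 10)             # made-up intensities for B, B, C, C
fit <- lm.fit(Xfull, y)
fit$coefficients
```

With these made-up values the intercept is estimated as 10, the mean of the "C" samples, and conditionB as 2, the difference between "B" and "C". The model is estimable, but "C" has silently taken the place of "A" as the reference.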

However, the coefficients of these problematic features remain difficult to use: when a reference level has shifted, the parameters defined relative to it no longer have their usual interpretation. To manage this, we introduced a new element in the params slot of the StatModel object called "referencePresent". For each parameter, it indicates whether its reference level changed during modeling.
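One way to derive such a flag is to check, for every factor, whether its reference level still has at least one observed sample. The helper below is purely hypothetical (`checkReference` is not an msqrob2 function) and works per factor rather than per parameter, so it only approximates what "referencePresent" stores.

```r
# Hypothetical helper: for each factor column in `data`, is the reference
# (first) level present among the observed samples?
checkReference <- function(data, observed) {
  factors <- names(data)[vapply(data, is.factor, logical(1))]
  vapply(factors, function(f) {
    levels(data[[f]])[1] %in% data[[f]][observed]
  }, logical(1))
}

# Made-up sample annotation with two factors.
data <- data.frame(
  condition = factor(c("A", "A", "B", "B", "C", "C")),
  batch = factor(c("b1", "b2", "b1", "b2", "b1", "b2"))
)
observed <- data$condition != "A"  # assumed missingness pattern

flags <- checkReference(data, observed)
flags  # condition: FALSE, batch: TRUE
```

Parameters tied to a factor flagged FALSE would be the ones whose contrasts need to be filtered out or marked in the results.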

This information is then utilized by other functions in the package, such as getContrast and varContrast. These functions now include the option to return only the contrasts involving parameters whose reference levels did not change. This update is also reflected in higher-level functions like hypothesisTest or topFeatures, which can now be used to return only the feature contrasts that do not involve these reference changes. If users still wish to obtain contrasts for these problematic cases, a new column indicating whether the reference has changed will be included in the output table of the results.

Issue #63

cvanderaa · 2024-09-03T10:08:55Z

Léopold made important progress on this issue, which is currently available in branch:issue63 (see also #64).

Note to myself: I still need to pull the branch and see if I can break the code with additional unit tests 😈😉

cvanderaa · 2024-09-03T11:07:24Z

Also pasting a relevant section from Léopold's report:

This [PR] is not perfect as it remains limited; specifically, it does not apply when ridge regression is used. In addition, the solution could be further optimized, as it currently demands significant resources. An initial optimization effort was made, but further refinement is needed to deliver a solution ready for deployment to users.

Hence we still need to:

Extend to ridge estimation
Optimize for computational efficiency (speed)

leopoldguyot mentioned this issue Aug 26, 2024

Issue https://github.com/statOmics/msqrob2/issues/63 #64

Merged

statOmics deleted a comment Aug 26, 2024

cvanderaa added a commit that referenced this issue Sep 3, 2024

Merge pull request #64 from leopoldguyot/reference_drop_issue

285af81

Issue #63

cvanderaa self-assigned this Sep 3, 2024
