Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to read lipidblast-2022 data #114

Open
pratarora opened this issue Jul 11, 2024 · 9 comments
Open

Unable to read lipidblast-2022 data #114

pratarora opened this issue Jul 11, 2024 · 9 comments
Assignees

Comments

@pratarora
Copy link
Contributor

pratarora commented Jul 11, 2024

Hi

Thanks for such a great package!

I am trying to make a compounddb object from lipidblast-2022 data (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/d0899a6d-15b7-46ad-a836-f07313465fb8). But it keeps on failing, with an empty data frame as an output. I first thought it was memory issue, so I tried with bigger memory(256GB, it doesn't work).

I also tried to split the json to smaller chunks, in this case it can read the file but it gives an empty data frame.
I have checked the structure of the json file it seems to be fine as the original.

any idea what i might be missing?
Any place i could directly find the sqlite file?

fl <-("MoNA-export-LipidBlast_2022-split_100k/MoNA-export-LipidBlast_2022-split_1.json")
cmps <- compound_tbl_lipidblast(fl)
cmps

my session info is :

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] jsonlite_1.8.8 mzR_2.36.0 Rcpp_1.0.12
[4] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
[10] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0 CompoundDb_1.6.0 AnnotationFilter_1.26.0
[16] MetaboCoreUtils_1.10.0 MetaboAnnotation_1.6.1 Spectra_1.12.0
[19] ProtGenerics_1.34.0 BiocParallel_1.36.0 S4Vectors_0.40.2
[22] BiocGenerics_0.48.1

loaded via a namespace (and not attached):
[1] MultiAssayExperiment_1.28.0 magrittr_2.0.3
[3] fs_1.6.4 zlibbioc_1.48.2
[5] vctrs_0.6.5 memoise_2.0.1
[7] RCurl_1.98-1.14 base64enc_0.1-3
[9] htmltools_0.5.8.1 S4Arrays_1.2.1
[11] AnnotationHub_3.10.0 curl_5.2.1
[13] SparseArray_1.2.4 htmlwidgets_1.6.4
[15] cachem_1.0.8 uuid_1.2-0
[17] igraph_2.0.3 mime_0.12
[19] lifecycle_1.0.4 pkgconfig_2.0.3
[21] Matrix_1.6-5 R6_2.5.1
[23] fastmap_1.1.1 GenomeInfoDbData_1.2.11
[25] MatrixGenerics_1.14.0 shiny_1.8.0
[27] clue_0.3-65 digest_0.6.35
[29] rsvg_2.6.0 colorspace_2.1-0
[31] AnnotationDbi_1.64.1 GenomicRanges_1.54.1
[33] RSQLite_2.3.6 filelock_1.0.3
[35] fansi_1.0.6 timechange_0.3.0
[37] httr_1.4.7 abind_1.4-5
[39] compiler_4.3.2 bit64_4.0.5
[41] withr_3.0.0 DBI_1.2.2
[43] MASS_7.3-60 ChemmineR_3.54.0
[45] rappdirs_0.3.3 DelayedArray_0.28.0
[47] rjson_0.2.21 tools_4.3.2
[49] interactiveDisplayBase_1.40.0 httpuv_1.6.14
[51] glue_1.7.0 QFeatures_1.12.0
[53] promises_1.2.1 grid_4.3.2
[55] pbdZMQ_0.3-11 cluster_2.1.4
[57] generics_0.1.3 gtable_0.3.5
[59] tzdb_0.4.0 hms_1.1.3
[61] xml2_1.3.6 utf8_1.2.4
[63] XVector_0.42.0 BiocVersion_3.18.1
[65] pillar_1.9.0 IRdisplay_1.1
[67] later_1.3.2 BiocFileCache_2.10.1
[69] lattice_0.22-5 bit_4.0.5
[71] tidyselect_1.2.1 Biostrings_2.70.3
[73] gridExtra_2.3 IRanges_2.36.0
[75] SummarizedExperiment_1.32.0 Biobase_2.62.0
[77] matrixStats_1.3.0 DT_0.32
[79] stringi_1.8.4 lazyeval_0.2.2
[81] yaml_2.3.8 evaluate_0.23
[83] codetools_0.2-19 MsCoreUtils_1.14.1
[85] BiocManager_1.30.23 cli_3.6.2
[87] IRkernel_1.3.2 xtable_1.8-4
[89] repr_1.1.7 munsell_0.5.1
[91] GenomeInfoDb_1.38.8 dbplyr_2.5.0
[93] png_0.1-8 parallel_4.3.2
[95] ellipsis_0.3.2 blob_1.2.4
[97] bitops_1.0-7 scales_1.3.0
[99] ncdf4_1.22 crayon_1.5.2
[101] rlang_1.1.3 KEGGREST_1.42.0

I split the json file in python so that i can avoid loading on memory. The code i used to split the json:

import ijson
import json
from decimal import Decimal

def decimal_default(obj):
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError

def split_large_json(input_file, output_prefix, chunk_size):
    with open(input_file, 'r') as f:
        # Initialize the JSON item iterator
        objects = ijson.items(f, 'item')
        
        chunk = []
        chunk_count = 0
        
        for obj in objects:
            chunk.append(obj)
            if len(chunk) == chunk_size:
                # Write the current chunk to a new file
                with open(f'{output_prefix}_{chunk_count + 1}.json', 'w') as chunk_file:
                    json.dump(chunk, chunk_file, indent=4, default=decimal_default)
                chunk = []
                chunk_count += 1

        # Write any remaining items in the last chunk
        if chunk:
            with open(f'{output_prefix}_{chunk_count + 1}.json', 'w') as chunk_file:
                json.dump(chunk, chunk_file, indent=4, default=decimal_default)

# Replace 'large_file.json' with the path to your large JSON file
input_file = 'MoNA-export-LipidBlast_2022-json (1)/MoNA-export-LipidBlast_2022.json'
# Prefix for the output chunk files
output_prefix = 'MoNA-export-LipidBlast_2022-split_100k/MoNA-export-LipidBlast_2022-split'
# Size of each chunk
chunk_size = 100000

# Split the large JSON file into smaller pieces
split_large_json(input_file, output_prefix, chunk_size)
@jorainer
Copy link
Member

Thanks for adding this issue. I'm currently working on a solution to import/process large json files in a chunk-wise manner.

@jorainer jorainer self-assigned this Jul 17, 2024
jorainer added a commit that referenced this issue Jul 18, 2024
- Add parameter `n` to `compound_tbl_lipidblast()` to enable and support import
  and processing of the data in sets of lines to reduce memory demand (issue
  #114).
@jorainer
Copy link
Member

Hi @pratarora , I've added a new parameter n to compound_tbl_lipidblast that you can use to define the number of lines of a (large) json file that should be loaded and processed at a time. This should help avoiding the memory problems you reported.

Thus, the code below (with n = 100000) should work on the full file (it worked on my computer).

fl <-("MoNA-export-LipidBlast_2022.json")
cmps <- compound_tbl_lipidblast(fl, n = 100000, verbose = TRUE)

For now, you would need to install this version of the package from github using BiocManager::install("RforMassSpectrometry/CompoundDb") (assuming you have the BiocManager package installed - which you can with install.packages("BiocManager").

@pratarora
Copy link
Contributor Author

pratarora commented Jul 18, 2024

Hi Johannes

Thanks a lot for the quick fix.

I tried it, but it works partially. The exactmass column is converted to character while reading.
Inchi and inchikey is not read,
I can change exactmass manually to numeric and get the sqlite file.

cmps %>% str

tibble [1,346,798 × 7] (S3: tbl_df/tbl/data.frame)
 $ compound_id: chr [1:1346798] "LipidBlast2022_999999" "LipidBlast2022_999998" "LipidBlast2022_999997" "LipidBlast2022_999996" ...
 $ name       : chr [1:1346798] "HexCer 27:3;3O/15:0;(2OH)" "HexCer 27:3;3O/14:1;(2OH)" "HexCer 27:3;3O/14:0;(2OH)" "HexCer 27:3;3O/13:1;(2OH)" ...
 $ inchi      : chr [1:1346798] "" "" "" "" ...
 $ inchikey   : chr [1:1346798] NA NA NA NA ...
 $ formula    : chr [1:1346798] "C48H89NO10" "C47H85NO10" "C47H87NO10" "C46H83NO10" ...
 $ exactmass  : chr [1:1346798] "839.648648048" "823.6173479199999" "825.632997984" "809.601697856" ...
 $ synonyms   :List of 1346798
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  ..$ : chr NA
  .. [list output truncated]

@jorainer
Copy link
Member

oooooh, thanks for the report! I will check that. Should be numeric of course!

@pratarora
Copy link
Contributor Author

I am working on it, I will post the code for it soon.

jorainer added a commit that referenced this issue Jul 18, 2024
- fix `compound_tbl_lipidblast()` to return field *exactmass* as numeric (issue
  #114).
@jorainer
Copy link
Member

should be fixed now. thanks for reporting!

@pratarora
Copy link
Contributor Author

pratarora commented Jul 18, 2024

Hi Johannes,

I modified the code. I tried to make it a bit parallel as well. It can now read all the fields for both the old lipidblast
(https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/70d4b8c0-4863-4632-ade4-0ace697bdaa1) and the 2022 one (https://mona.fiehnlab.ucdavis.edu/rest/downloads/retrieve/d0899a6d-15b7-46ad-a836-f07313465fb8). I will create a pull request if it seems ok for you.

.parse_lipidblast_json_element  <- function(x) {
    id <- x$id
    cmp <- x$compound[[1]]
    
    # Helper function to extract metadata from multiple sources
    get_metadata <- function(name, sources) {
        for (source in sources) {
            value <- unlist(lapply(source, function(z) {
                if (z$name == name) z$value
            }))
            if (length(value) > 0 && !all(value == "")) {
                return(value[1])
            }
        }
        return(NA_character_)
    }
    
    # Define metadata sources
    metadata_sources <- list(x$metaData, cmp$metaData)
    
    # Get names
    nms <- vapply(cmp$names, `[[`, "name", FUN.VALUE = "character")
    snms <- if (length(nms) > 1L) nms[-1L] else NA_character_
    
    # Extract metadata
    mass <- get_metadata("total exact mass", metadata_sources)
    mass <- as.numeric(mass)
    if (is.na(mass)) mass <- NA_real_
    
    frml <- get_metadata("molecular formula", metadata_sources)
    inchi <- get_metadata("InChI", metadata_sources)
    inchikey <- get_metadata("InChIKey", metadata_sources)
    smiles <- get_metadata("SMILES", metadata_sources)
    compound_class <- get_metadata("compound class", metadata_sources)
    ionization_mode <- get_metadata("ionization mode", metadata_sources)
    precursor_mz <- as.numeric(get_metadata("precursor m/z", metadata_sources))
    precursor_type <- get_metadata("precursor type", metadata_sources)
    retention_time <- as.numeric(get_metadata("retention time", metadata_sources))
    ccs <- as.numeric(get_metadata("ccs", metadata_sources))
    
    # Extract spectrum data
    spectrum <- x$spectrum
    if (is.null(spectrum)) spectrum <- NA_character_
    
    list(
        compound_id = id,
        name = nms[1],
        inchi = inchi,
        inchikey = inchikey,
        smiles = smiles,
        formula = frml,
        exactmass = mass,
        compound_class = compound_class,
        ionization_mode = ionization_mode,
        precursor_mz = precursor_mz,
        precursor_type = precursor_type,
        retention_time = retention_time,
        ccs = ccs,
        spectrum = spectrum,
        synonyms = list(snms)
    )
}
.import_lipidblast <- function(file, n = -1L, verbose = FALSE, BPPARAM = bpparam()) {
    if (n < 0) {
        lipidb <- jsonlite::read_json(file)
        if (verbose)
            message("Processing ", length(lipidb), " elements ...", appendLF = FALSE)
        
        # Use BiocParallel to process elements in parallel
        res <- bplapply(lipidb, .parse_lipidblast_json_element, BPPARAM = BPPARAM)
        
        if (verbose) message(" done.")
    } else {
        res <- .import_lipidblast_json_chunk(file, n = n, verbose = verbose, BPPARAM = BPPARAM)
    }
    dplyr::bind_rows(res)
}

.import_lipidblast_json_chunk <- function(x, n = 10000, verbose = FALSE, BPPARAM = bpparam()) {
    con <- file(x, open = "r")
    on.exit(close(con))
    res <- list()
    while (length(ls <- readLines(con, n = n, warn = FALSE))) {
        if (length(grep("^\\[", ls[1L])))
            ls <- ls[-1L]
        if (length(grep("^\\]", ls[length(ls)])))
            ls <- ls[-length(ls)]
        ls <- sub(",$", "", ls)
        if (length(ls)) {
            chunk_res <- bplapply(ls, function(z) {
                .parse_lipidblast_json_element(jsonlite::fromJSON(z, simplifyVector = FALSE))
            }, BPPARAM = BPPARAM)
            res <- c(res, chunk_res)
        }
        if (verbose)
            message("Processed ", length(ls), " elements")
    }
    res
}
compound_tbl_lipidblast <- function(file, collapse = character(), n = -1L,
                                    verbose = FALSE, BPPARAM = bpparam()) {
    .check_parameter_file(file)
    res <- .import_lipidblast(file, n = n, verbose = verbose, BPPARAM = BPPARAM)
    if (length(collapse)) {
        ## collapse elements from lists.
        res$synonyms <- vapply(res$synonyms, paste0, collapse = collapse[1L],
                                FUN.VALUE = "character")
    }
    res
}

@pratarora
Copy link
Contributor Author

new lipidblast
image

old lipidblast
image

@pratarora
Copy link
Contributor Author

Hi Johannes,
I created the pull request:
#117

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants