Development #117

Merged
merged 49 commits into from
May 3, 2023


briney
Collaborator

@briney briney commented May 3, 2023

No description provided.

srgk26 and others added 30 commits December 23, 2022 03:03
Refactoring temporary json file concatenation
* Uncomment dask dataframe import.

* Remove json output type from list of parquet incompatible formats.

* Specify dtypes for dask dataframe read from json.

* Enable reading json files into dask dataframe and writing as parquet file.

* Enable files to be read in binary mode for all output formats.

* Change json output field for j gene from score to assigner_score.

* Revert change in line position to read file in binary mode.

* Convert IMGT positions from integers or floats to strings.

* Coerce `raw_position` to string type.

* Add schema for JSON fields datatypes to override when writing to parquet.

* Additional schema attributes.

* Convert schema to full pyarrow schema for full dataset.

* Add columns desired order and dtypes for dataframe metadata.

* Reorder dtype fields.

* Remove unneeded column and dtype information.

* Edit json reading and parquet writing code.

* Add additional schema attributes involved in BCR.

* Reorder schema fields.

* Reorder pyarrow schema.
* Replace string dtypes with object dtypes.

* Add function attribute to the `write_output` function to indicate whether parquet will be written.

* Write parquet files directly in place of temporary JSON files.

* Add flag to ignore datatype conversion errors when casting integer columns with NaNs, and change output file name.

* Edit `concat_outputs` to simply move the files when parquet files are generated from json output.

* Edit file path from string concatenation to os path join.

* Minor edit to `os.path.join`.

* Add if statement to check whether the temporary file exists before attempting to delete it.

* Add `.snappy` file extension to parquet files.

* Simplify file naming by moving files into the directory instead.

* Simplify specifying columns by changing `schema.names` to `dtypes`.

* Parse strings of dictionary into dictionary with `json.loads` before loading into dataframe.

* Read in temporary parquet files, repartition and write back parquet files.

* Remove the explicit `False` setting for writing the parquet metadata file, since that is already the default function argument.

* Remove unused imports.

* Remove if condition to check for temp files before deleting them.
* Fix chunking of fastq files

* ignore vscode
* Replace double quotation marks with single quotes for consistency with rest of codebase.

* Add empty line at EOF.

* Allow the latest matplotlib version to be installed, since scanpy has upgraded its matplotlib support.

* Add comments to better explain code edits.
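
Several of the commits above describe small, self-contained techniques: parsing dictionaries that were serialized as strings with `json.loads`, coercing `raw_position` to a string, building paths with `os.path.join` instead of string concatenation, and moving temporary files into the output directory rather than concatenating them. The following is a minimal sketch of those ideas using only the standard library. It is not the code from this pull request: the function names `normalize_record` and `move_outputs`, the example field values, and the file names are all hypothetical.

```python
import json
import os
import shutil


def normalize_record(record):
    '''Return a copy of `record` with dict-like strings parsed and the position stringified.

    Hypothetical helper, not taken from the pull request.
    '''
    normalized = {}
    for key, value in record.items():
        # A value such as '{"assigner_score": 12}' is a dictionary serialized
        # as a string; parse it so it becomes a real nested dictionary.
        if isinstance(value, str) and value.lstrip().startswith('{'):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                pass  # not valid JSON after all, so keep the original string
        normalized[key] = value
    # Positions may arrive as int, float, or str; store them as strings so
    # the column ends up with a single, predictable type.
    if normalized.get('raw_position') is not None:
        normalized['raw_position'] = str(normalized['raw_position'])
    return normalized


def move_outputs(temp_files, output_dir):
    '''Move temporary files into `output_dir` and return the new paths.

    Hypothetical helper, not taken from the pull request.
    '''
    os.makedirs(output_dir, exist_ok=True)
    moved = []
    for path in temp_files:
        # Skip files that are already gone instead of raising an error.
        if not os.path.exists(path):
            continue
        # os.path.join handles separators, unlike manual string concatenation.
        destination = os.path.join(output_dir, os.path.basename(path))
        shutil.move(path, destination)
        moved.append(destination)
    return moved
```

Under these assumptions, a record like `{'raw_position': 27.0}` would come back with `raw_position` as a string, and `move_outputs` returns only the paths of files that actually existed and were moved.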
@briney briney merged commit 00b3ad8 into master May 3, 2023
3 participants