Development #117

briney · 2023-05-03T23:48:21Z

No description provided.


* Uncomment dask dataframe import.
* Remove json output type from list of parquet incompatible formats.
* Specify dtypes for dask dataframe read from json.
* Enable reading json files into dask dataframe and writing as parquet file.
* Enable files to be read in binary mode for all output formats.
* Change json output field for j gene from score to assigner_score.
* Revert change in line position to read file in binary mode.
* Converting IMGT positions from integers or floats to string.
* Coerce raw_position to be stringtype.
* Add schema for JSON fields datatypes to override when writing to parquet.
* Additional schema attributes.
* Convert schema to full pyarrow schema for full dataset.
* Add columns desired order and dtypes for dataframe metadata.
* Reorder dtype fields.
* Remove unneeded column and dtype information.
* Edit json reading and parquet writing code.
* Add additional schema attributes involved in BCR.
* Reorder schema fields.
* Reorder pyarrow schema.


* Replace string dtypes with object dtypes.
* Add function attribute to indicate if parquet will be written to `write_output` function.
* Write parquet files directly in place of temporary JSON files.
* Add flag to ignore datatype conversion errors when casting integer columns with NaNs, and change output file name.
* Edit concat_outputs to simply move files instead for parquet files generated from json output.
* Edit file path from string concatenation to os path join.
* Minor edit to os.path.join.
* Added if statement to check if file exists before attempting to delete temporary file.
* Add `.snappy` file extension to parquet files.
* Simplified file name to simply moving to directory instead.
* Simplify specifying columns by changing `schema.names` to `dtypes`.
* Parse strings of dictionary into dictionary with `json.loads` before loading into dataframe.
* Read in temporary parquet files, repartition and write back parquet files.
* Remove setting writing metadata file in parquet to False as it's the default function argument.
* Remove unused imports.
* Remove if condition to check for temp files before deleting them.
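The `json.loads` step mentioned in these commits could look roughly like the sketch below. It is only an illustration: `parse_records` is a hypothetical helper, and treating `raw_position` as a top-level key is an assumption, since the real record layout is not shown on this page.

```python
import json


def parse_records(lines):
    """Hypothetical helper: turn JSON strings into dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: 'raw_position' sits at the top level of each record.
        # Coerce it to a string so integer, float and string positions
        # end up with a single column type.
        if record.get('raw_position') is not None:
            record['raw_position'] = str(record['raw_position'])
        records.append(record)
    return records
```

Parsing up front means a dataframe built from the result receives real dictionaries with a uniform position type, rather than opaque strings.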


* Replace double quotation marks with single quotes for consistency with rest of codebase.
* Add empty line at EOF.
* Allow matplotlib to be installed to the latest version since scanpy has upgraded their matplotlib support.
* Add comments to better explain code edits.


srgk26 and others added 30 commits December 23, 2022 03:03

Use shutil to concatenate file contents.

6943de1

Lower buffer size from 1GB to 16MB.

b5ca9ce

Add new line in bytemode after concatenating each temporary JSON file.

3930a46
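The three commits above describe concatenating temporary JSON files with `shutil`, a 16MB buffer, and a newline written in byte mode after each file. A minimal sketch of that idea follows; `concat_files` and its parameters are hypothetical names, not the function used in this repository.

```python
import shutil

BUFFER_SIZE = 16 * 1024 * 1024  # 16MB, the buffer size named in the commit above


def concat_files(temp_paths, output_path, buffer_size=BUFFER_SIZE):
    """Hypothetical helper: append each temp file to output_path in binary mode."""
    with open(output_path, 'wb') as out:
        for path in temp_paths:
            with open(path, 'rb') as src:
                # Copy in fixed-size chunks so a large file is never held in memory whole.
                shutil.copyfileobj(src, out, buffer_size)
            # Newline in byte mode so records from adjacent files do not run together.
            out.write(b'\n')
```

Opening both sides in binary mode avoids decoding and re-encoding the data, and the smaller buffer bounds memory use per worker.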

Edit write_output function to add new line when writing list to file.

d2c104f

Remove writing new line at EOF for IMGT, tabular, and AIRR formats.

dcdfd07

Merge pull request #2 from SyntenyBio/srgk26/concat_json_1

a0af51a

Refactoring temporary json file concatenation

by default, now force assignment of J genes to match the locus of the already assigned V gene

8be8588

If the input sequence is sufficiently long, force alignment to the 5' end of the V and/or the 3' end of the J

bb8a502

Update build_germline_dbs.py

47bd7c6

More verbose logging when the loci of top-scoring V and J genes don't match

7897e4f

force full-length alignment without using global_alignment()

d3628ff

new human germline database

d28e248

Exclude index when writing parquet from pandas. (#8)

d9959c5

Create __init__.py

2cc56e8

Create qc.py

96877e3

Create trimming.py

989c40c

Create umi.py

f80ef4b

Create pp.py

fd058bc

add preprocess directory

52368d2

new macaque germline database

d738f3e

temp reorg of preprocess folder

2940f66

add light chain V genes to macaque database

c3200ad

pin matplotlib (#9)

7a317bb

Fix chunk size (#10)

7ceb8ac

* Fix chunking of fastq files
* ignore vscode

Freeze numpy install to version 1.23.4. (#12)

f611712

Remove specifying maxtasksperchild when creating multiprocess pool resource. (#13)

9d60d97

Set matplotlib version at 3.6.3. (#14)

03fa169

briney and others added 19 commits March 2, 2023 08:58

fix preprocess imports

9407364

remove scikit-bio

f0c4475

update gapped IMGT alignment to use new abutils pairwise alignment functions (no need for a matrix)

4cafc98

remove _get_gapped_imgt_substitution_matrix

c63a09e

Merge remote-tracking branch 'upstream/development'

7296ae2

Merge pull request #115 from SyntenyBio/master

7fbe9fe

Add support to write in parquet format

Create umi.py

996194c

formatting

f429200

check to ensure query isn't empty before performing alignment

92bc36f

check to ensure query isn't empty before local_alignment

3bf71ce

formatting

a03b6fc

fix type hint

5a01fd3

get parasail matrix for gapped germline re-alignment

f1a1ff4

Update requirements.txt

2eb01aa

fix matrix creation

90ac224

Update requirements.txt

4d03bc5

bump version to 0.6.0

79119b2

Merge branch 'development' into preprocessing

2d2ca3e

Merge pull request #116 from briney/preprocessing

438812c

Preprocessing

briney merged commit 00b3ad8 into master May 3, 2023
