FL support #552

Closed
wants to merge 258 commits into from
258 commits
eb74c04
fix typo
hasan7n Mar 5, 2024
7a5ae3a
Merge remote-tracking branch 'upstream/main' into fl-poc
hasan7n Mar 5, 2024
dfe4084
tmp fix mlcube issue
hasan7n Mar 5, 2024
417424a
add FL integration test
hasan7n Mar 6, 2024
ca72d8e
debug
hasan7n Mar 6, 2024
c5ce774
debug
hasan7n Mar 6, 2024
77d099a
debug
hasan7n Mar 6, 2024
e1427de
debug
hasan7n Mar 6, 2024
b5edaff
add prep cube for FL tests
hasan7n Mar 12, 2024
a8f2f94
update and restructure fl example
hasan7n Mar 12, 2024
2ff534d
update cli training tests
hasan7n Mar 12, 2024
ee8d195
fix bugs
hasan7n Mar 12, 2024
9f2935a
TMP TO BE REVERTED
hasan7n Mar 21, 2024
294fd84
multiple datasets per collaborator cert
hasan7n Mar 25, 2024
710652a
fix training tests
hasan7n Mar 26, 2024
b3e4e52
empty
hasan7n Mar 26, 2024
fb860e1
Merge remote-tracking branch 'upstream/main' into fl-poc
hasan7n Mar 26, 2024
0181e47
post-merging-main fixes
hasan7n Mar 26, 2024
ff3c4ad
debug
hasan7n Mar 26, 2024
4131f78
Merge remote-tracking branch 'upstream/main' into fl-poc
hasan7n Apr 1, 2024
4d4c53e
fix testing util
hasan7n Apr 2, 2024
03ebfbb
fix training tests
hasan7n Apr 2, 2024
20ec858
use ip address for aggregtor
hasan7n Apr 2, 2024
f3910b8
aggregator changes
hasan7n Apr 18, 2024
36c4ddc
add ca server entity
hasan7n Apr 18, 2024
687f8e7
associations refactoring
hasan7n Apr 18, 2024
c7883b5
remove signing code
hasan7n Apr 18, 2024
1b356cb
add training event object
hasan7n Apr 22, 2024
9205d98
modify training event
hasan7n Apr 22, 2024
7b9b3a1
modify training event
hasan7n Apr 22, 2024
dc6cadd
changes to training entity
hasan7n Apr 22, 2024
acc67c1
traindataset association updates
hasan7n Apr 22, 2024
c122539
add aggregator_association
hasan7n Apr 23, 2024
09d49c1
add ca_association
hasan7n Apr 23, 2024
7cb1660
update state logic in benchmarks for consistency
hasan7n Apr 23, 2024
0febf6b
create migrations and admin views
hasan7n Apr 23, 2024
c479c3e
update fl mlcube interface
hasan7n Apr 26, 2024
7e45c55
Merge remote-tracking branch 'upstream/main' into fl-tmp
hasan7n Apr 27, 2024
48ccd60
refactor entities
hasan7n Apr 27, 2024
89f256b
client updates
hasan7n Apr 29, 2024
33ad968
update FL example
hasan7n Apr 29, 2024
45bedcc
update/create missing APIs
hasan7n Apr 29, 2024
2343ba1
update new migration files
hasan7n Apr 29, 2024
19c80d8
add mock crt mlcube
hasan7n Apr 29, 2024
ed2d33e
update test env
hasan7n Apr 29, 2024
bf69385
bug fixes
hasan7n Apr 30, 2024
c4b3fa7
use event's report within a subfolder
hasan7n Apr 30, 2024
0ee1b7f
use IP address instead of hostname in integration tests
hasan7n Apr 30, 2024
be5a0ba
fix tests
hasan7n Apr 30, 2024
ad93f1f
remove comment
hasan7n Apr 30, 2024
9f52cd3
add --overwrite flag for an easier life
hasan7n May 1, 2024
3c7e9ea
use abs path for training config
hasan7n May 1, 2024
ce20325
add missing call
hasan7n May 1, 2024
5b7da41
REVERT ME
hasan7n May 1, 2024
b21ef08
remove trailing equals
hasan7n May 1, 2024
fd80541
fix mock certs issue
hasan7n May 1, 2024
192767c
add download run files
hasan7n May 1, 2024
a3e3a54
Revert "REVERT ME"
hasan7n May 1, 2024
d0ac9ec
use ip address in integration tests
hasan7n May 1, 2024
cab620d
add option to choose interface for publishing port
hasan7n May 6, 2024
0b5e4c0
remove some TODOs from the code
hasan7n May 17, 2024
4f73035
add ca server and client code
hasan7n May 17, 2024
220afee
Merge remote-tracking branch 'upstream/main' into fl-tmp
hasan7n May 20, 2024
5a1256a
temporary measures for minimum UI friction
hasan7n May 21, 2024
3dd45ff
empty
hasan7n May 21, 2024
961e2bd
Merge remote-tracking branch 'upstream/main' into fl-tmp
hasan7n May 21, 2024
712bcf6
post-merge main
hasan7n May 21, 2024
e76bea6
skip approval in tests
hasan7n May 21, 2024
97b1bac
bug fix for association ls
hasan7n May 21, 2024
af650e5
better UI for data owners
hasan7n May 21, 2024
4777e60
config storage migration
hasan7n May 22, 2024
538ea4e
update medperf version
hasan7n May 22, 2024
6b42235
use check update again
hasan7n May 22, 2024
6d2e16e
Draft example of NNUNet Integration with OpenFL and MedPerf
msheller Jun 11, 2024
c10100d
Merge branch 'main' into fl-poc
hasan7n Jun 12, 2024
2615ee7
Merge remote-tracking branch 'origin/fl-poc' into fl-poc-nnunet-draft
hasan7n Jun 12, 2024
5c83c5b
use same setup files as fl/fl
hasan7n Jun 12, 2024
b44ab75
Added missing utils file and changed import path to read from the src…
msheller Jun 13, 2024
aa781fb
Merge branch 'fl-poc-nnunet-draft' of https://github.com/hasan7n/medp…
msheller Jun 13, 2024
76e11b4
remove buggy unused imports
hasan7n Jun 13, 2024
21bbd1d
modify build script
hasan7n Jun 14, 2024
107c1f1
Merge remote-tracking branch 'origin/fl-dev' into fl-poc-nnunet-draft
hasan7n Jun 14, 2024
046ad75
sync fl examples
hasan7n Jul 2, 2024
685c64b
draft update for the shape mismatch fix
hasan7n Jul 10, 2024
e902989
rename imports
hasan7n Jul 10, 2024
41b4fd9
bugfixes, add init_model task
hasan7n Jul 18, 2024
73f145f
change tests config scripts
hasan7n Jul 18, 2024
30d394f
modify test setup script
hasan7n Jul 18, 2024
8405bd2
WIP. About to merge openfl PR 996
msheller Jul 28, 2024
2fe0635
add fl-admin mlcube
hasan7n Jul 29, 2024
7cc3dd2
allow passing col list when starting an event
hasan7n Jul 29, 2024
3f62a0b
add fl admin mlcube to training exp object
hasan7n Jul 29, 2024
0c6a89f
some setup/tests changes
hasan7n Jul 29, 2024
65270f4
bugfix in tests
hasan7n Jul 29, 2024
251cdd1
tests bugfixes
hasan7n Jul 29, 2024
7525b35
Added admin config to test plan
msheller Jul 29, 2024
d7b7df9
Merge branch 'fl-poc-nnunet-draft' of https://github.com/hasan7n/medp…
msheller Jul 29, 2024
89664e2
Added admin endpoints to list of functions that can use the network. …
msheller Jul 29, 2024
e23bfae
add cutofftime task in cube.py
hasan7n Jul 29, 2024
2c4eb43
Successfully tested with admin endpoint to set straggler handler timeout
msheller Jul 29, 2024
b06cc3d
Merge branch 'fl-poc-nnunet-draft' of https://github.com/hasan7n/medp…
msheller Jul 29, 2024
e749925
update numpy version in requirements
hasan7n Jul 30, 2024
fe4bd92
update testing scripts
hasan7n Jul 30, 2024
0268960
support gpu drivers v470
hasan7n Aug 1, 2024
16bf408
aggregator copy files bugfix
hasan7n Aug 1, 2024
8dcbfba
singularity option for fl tests
hasan7n Aug 1, 2024
70e4e56
add test option for admin fl
hasan7n Aug 1, 2024
b82618b
enabled specification of batches per epoch for training and validation
brandon-edwards Aug 16, 2024
612b191
dummy loader now also uses the per collaborator partial_epoch param
brandon-edwards Aug 16, 2024
27f85e5
admin commands in medperf
hasan7n Aug 18, 2024
31f39ca
typo
hasan7n Aug 22, 2024
44ed99b
refactor steps in fl mlcube
hasan7n Aug 22, 2024
f82717f
modify tests scripts
hasan7n Aug 22, 2024
3d7b6b1
add error checking in step-ca client
hasan7n Aug 27, 2024
4e59c4a
some clean up, as well as backing off to only pass partial_epoch to
brandon-edwards Aug 28, 2024
a27ef7b
update commit hash for fl admin mlcube
hasan7n Sep 2, 2024
bc431ff
update step-ca client dockerfile
hasan7n Sep 2, 2024
0f7d43c
bugfix argument name in training.submit
hasan7n Sep 3, 2024
fb4aa4f
store status in a temporary file to avoid read-only problem
hasan7n Sep 3, 2024
d6cd39d
update cli integration tests
hasan7n Sep 3, 2024
e2a4ddc
update mock tokens
hasan7n Sep 3, 2024
dee5152
update fl testing scripts
hasan7n Sep 3, 2024
26b4337
update fl integration test mlcube example
hasan7n Sep 3, 2024
1868af9
recent changes
brandon-edwards Sep 4, 2024
89d2ca2
initial changes to support timeouts rather than partial_epoch values
brandon-edwards Sep 9, 2024
a49f892
now tracking time in train and val loop directly
brandon-edwards Sep 14, 2024
b38d9e4
moving new param change to the nnunet train function
brandon-edwards Sep 14, 2024
5c34957
had a typo in param and also unsued args check
brandon-edwards Sep 14, 2024
0ccf964
some clean up
brandon-edwards Sep 16, 2024
07ef451
corrected percent completed calculation
brandon-edwards Sep 19, 2024
0b412ec
supporting max_num_epochs as nnunet runner init parameter
brandon-edwards Sep 25, 2024
81c552b
changing previous commit using max_num_epochs to TOTAL_max_num_epochs
brandon-edwards Sep 25, 2024
eea4279
putting timeout check at beginning of train and val loops so that we can
brandon-edwards Sep 26, 2024
4e303d1
first pass at accounting for separate train and val tasks into config
brandon-edwards Sep 26, 2024
b7970cf
allowing val_cutoff to pass through to nnunet training function
brandon-edwards Sep 30, 2024
ad014db
Revert "allowing val_cutoff to pass through to nnunet training function"
brandon-edwards Sep 30, 2024
2530189
implementing separate train and val tasks
brandon-edwards Sep 30, 2024
b096e3d
correcting timeouts for train and val tasks
brandon-edwards Sep 30, 2024
e04cc15
five collaborators
brandon-edwards Sep 30, 2024
0829fb7
changing fl plan file to indicate which model to apply for local vs
brandon-edwards Sep 30, 2024
b59c2dc
more of my local files,and changing be_setup.. script to copy over
brandon-edwards Sep 30, 2024
c0b9f51
correcting training config
brandon-edwards Sep 30, 2024
c8c1b3c
some more test infrastructure
brandon-edwards Oct 1, 2024
f99a0d7
need to pass epochs value of 1 to nnunet train function used for
brandon-edwards Oct 1, 2024
502bd85
now copying over whole repo into raid homedir
brandon-edwards Oct 1, 2024
25ae1e4
another testing change
brandon-edwards Oct 1, 2024
bdaffbc
now using validation_only parameter
brandon-edwards Oct 1, 2024
ca80500
allowing at most 1 second over one batch of val (during training) to
brandon-edwards Oct 1, 2024
4b9d604
need to account for loaders being different when validate_only is used
brandon-edwards Oct 1, 2024
ac45d05
typo 'validate_only'
brandon-edwards Oct 2, 2024
a194f6b
local setup script changes
brandon-edwards Oct 2, 2024
e6fa758
moving back to not using validation_only in order that we still have
brandon-edwards Oct 2, 2024
1f05091
preping to move all local stuff over the files named with be_...
brandon-edwards Oct 3, 2024
7260321
putting back fl-poc versions of some files
brandon-edwards Oct 3, 2024
180e5f0
changes to make sure validate method in nnunet runner does not save a
brandon-edwards Oct 3, 2024
ac18c19
some debug statements
brandon-edwards Oct 4, 2024
0fc0aa0
some debug statements
brandon-edwards Oct 4, 2024
ff8eb55
some more debug statements
brandon-edwards Oct 4, 2024
daaad2b
New handling of epoch in checkpoint and new handling of lr scheduling,
brandon-edwards Oct 4, 2024
5fee48a
passing self.TOTAL_max_num_epochs to lr scheduler computation
brandon-edwards Oct 7, 2024
58cedc1
will try absolutely no train/val in val/train cases respectively
brandon-edwards Oct 7, 2024
21d0c0e
new changes to keep global results out of checkpoints, returning metrics
brandon-edwards Oct 7, 2024
47c7637
fixing syntax error
brandon-edwards Oct 8, 2024
7aaa31a
syntax
brandon-edwards Oct 8, 2024
3d2871b
function for local val now only grabs metrics from checkpoint
brandon-edwards Oct 8, 2024
2682d17
having now to designate both val_epoch and train_epoch booleans at the
brandon-edwards Oct 8, 2024
19c3dff
removing variable no longer used
brandon-edwards Oct 8, 2024
059d126
removing check for no val work done during train call, no longer the
brandon-edwards Oct 8, 2024
cda5de9
moving validation from checkpoint and not both under function 'validate'
brandon-edwards Oct 8, 2024
24fe1b2
adding single val funciton to training config
brandon-edwards Oct 8, 2024
d7d838e
syntax
brandon-edwards Oct 8, 2024
91a9523
replaced np.sum for torch.sum, and preparing for per_task data size
brandon-edwards Oct 8, 2024
9b2f8ee
was using np on tensors
brandon-edwards Oct 8, 2024
dc495ea
casting to numpy (some are already, and some are torch tensors)
brandon-edwards Oct 8, 2024
e65eeef
typo
brandon-edwards Oct 8, 2024
aaf61fa
setting checkpoint delta error to 0.1 for now as well as verbose
brandon-edwards Oct 8, 2024
766c177
correcting metric that had train loss instead of val metric
brandon-edwards Oct 8, 2024
67e3467
removing some testing output
brandon-edwards Oct 10, 2024
4788659
enabling task dependent data size
brandon-edwards Oct 10, 2024
75e14e1
commenting out debug print
brandon-edwards Oct 10, 2024
3ebfc6b
adding recent changes to intel_build for local openfl changes
brandon-edwards Oct 10, 2024
18764cc
ensuring get_data_size returns an int so as to satisfy proto schema
brandon-edwards Oct 10, 2024
aa2d300
changing test setting
brandon-edwards Oct 11, 2024
0dc6eea
only including model in task results in train task (not local or global
brandon-edwards Oct 11, 2024
3ecb078
inconsistency in new variable name
brandon-edwards Oct 11, 2024
c7cdd8f
now including per label DICE for global and local val
brandon-edwards Oct 11, 2024
5e10b47
correcting syntax
brandon-edwards Oct 11, 2024
efd4e40
nnunet training function should return more
brandon-edwards Oct 12, 2024
b4dc660
syntax
brandon-edwards Oct 13, 2024
6373de5
corrected return to be train and val completed as opposed to batches
brandon-edwards Oct 13, 2024
a715b15
typo
brandon-edwards Oct 13, 2024
a958151
small cleanup
brandon-edwards Oct 14, 2024
2dcd792
setting train and val cutoff in plan, and setting defaults to infinity
brandon-edwards Oct 16, 2024
f0e0170
Merge branch 'fl-poc' of https://github.com/hasan7n/medperf into be_e…
brandon-edwards Oct 16, 2024
a98b2ef
post review with Micah
brandon-edwards Oct 17, 2024
f99a88c
inserting old debug stuff as info (stdout) to have going forward
brandon-edwards Oct 18, 2024
9bafb04
now inserting info to stdout
brandon-edwards Oct 18, 2024
2bfea20
changing to 300 rounds for testing
brandon-edwards Oct 18, 2024
c8002ad
was not printing at the time info was properly populated
brandon-edwards Oct 18, 2024
1482e9f
now using round and not saving any checkpoints
brandon-edwards Oct 19, 2024
227cac2
I want to be explicit about where we are writing checkpoints now
brandon-edwards Oct 21, 2024
b90f732
round instead of current_epoch
brandon-edwards Oct 21, 2024
5d1c727
some clean up
brandon-edwards Oct 21, 2024
e774fe9
default argument at end
brandon-edwards Oct 21, 2024
7314eef
provide checkpoint path
brandon-edwards Oct 21, 2024
659d024
Nowing using fl_round rather than round to avoid collision with internal
brandon-edwards Oct 21, 2024
ccc5fca
missed another spot where checkpoint path was needed
brandon-edwards Oct 21, 2024
1161754
setting of epoch pre nnunet train/val corresponds to previous epoch
brandon-edwards Oct 22, 2024
4f998fd
inserting some debug prints
brandon-edwards Oct 22, 2024
d6e7f4d
indentation issues
brandon-edwards Oct 22, 2024
ee45767
removing some debug prints
brandon-edwards Oct 22, 2024
43a08ef
removing a print
brandon-edwards Oct 22, 2024
fb69b71
changes due to new signature for nnunet train function
brandon-edwards Oct 22, 2024
131dfd1
preparing to test on brain threshold data
brandon-edwards Oct 22, 2024
fc60385
changes to training config for testing
brandon-edwards Oct 22, 2024
1230382
another iteration of data creation (so changing data file names)
brandon-edwards Oct 23, 2024
b5b6fa4
changes new config and parameter definitions
brandon-edwards Oct 24, 2024
4e2aae1
enabling dampening of the train_completion with admin control
brandon-edwards Oct 26, 2024
0919308
left a stray idea in the training config that does not belong
brandon-edwards Oct 26, 2024
243f024
setting min train_completion_dampener to -1 to allow multiplication
brandon-edwards Oct 27, 2024
5bff371
removing be scripts from branch
brandon-edwards Oct 28, 2024
0ed94e9
removing be_Dockerfile
brandon-edwards Oct 28, 2024
317cf87
doc string
brandon-edwards Oct 28, 2024
e27aaf0
doc changes and removing validation_only param as we control that
brandon-edwards Oct 28, 2024
f16f011
remiving epochs variable since nnunet trainer run_training is hard coded
brandon-edwards Oct 28, 2024
f5ac88f
docstring
brandon-edwards Oct 28, 2024
da0b84b
dockstring and fl plan fixing rounds to train for both agg and task
brandon-edwards Oct 28, 2024
2167e7e
dockstring
brandon-edwards Oct 28, 2024
c5161c6
removing comment and test piece in config
brandon-edwards Oct 29, 2024
e05c01d
removing unused function and providing default args
brandon-edwards Oct 29, 2024
c396483
removing a line of print output
brandon-edwards Oct 29, 2024
128e28b
Merge pull request #7 from hasan7n/be_enable_partial_epochs
brandon-edwards Oct 30, 2024
eb28131
update testing scripts
hasan7n Oct 30, 2024
6923731
update test scripts again
hasan7n Oct 30, 2024
9e18f82
restart training on failure
hasan7n Oct 31, 2024
11fd6e7
Revert "Merge pull request #7 from hasan7n/be_enable_partial_epochs"
hasan7n Oct 31, 2024
68577a3
use fixed hashes for openfl and nnunet installations
hasan7n Oct 31, 2024
4b0d1b0
update openfl commit
hasan7n Nov 1, 2024
e35e722
Revert "Revert "Merge pull request #7 from hasan7n/be_enable_partial_…
hasan7n Nov 8, 2024
f150ac5
Modality change, fixed 250 epochs, and seeded splits logic (#10)
brandon-edwards Nov 8, 2024
ca2261b
use curl instead of wget in step-ca client
hasan7n Nov 8, 2024
98a2a6d
timestamp logs and weights outputs of aggregator
hasan7n Nov 8, 2024
6606ac7
update tests scripts
hasan7n Nov 8, 2024
726f690
update hashes, and update weights url
hasan7n Nov 11, 2024
9b3e850
update fladmin mlcube
hasan7n Nov 11, 2024
880c934
get status update command
hasan7n Nov 13, 2024
5908c24
Added first version of script to parse training status .yaml file int…
msheller Nov 13, 2024
24596f0
Switched to ipynb
msheller Nov 14, 2024
f812376
update cuda118 env in dockerfile
hasan7n Nov 14, 2024
667ffe4
Merge branch 'fl-poc' of https://github.com/hasan7n/medperf into fl-poc
hasan7n Nov 14, 2024
41 changes: 41 additions & 0 deletions .github/workflows/train-ci.yml
@@ -0,0 +1,41 @@
name: FL Integration workflow

on: pull_request

jobs:
  setup:
    name: fl-integration-test
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Python 3.9
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dependencies
        working-directory: .
        run: |
          python -m pip install --upgrade pip
          pip install -e cli/
          pip install -r cli/test-requirements.txt
          pip install -r server/requirements.txt
          pip install -r server/test-requirements.txt

      - name: Set server environment vars
        working-directory: ./server
        run: cp .env.local.local-auth .env

      - name: Run django server in background with generated certs
        working-directory: ./server
        run: sh setup-dev-server.sh & sleep 6

      - name: Run server integration tests
        working-directory: ./server
        run: python seed.py --cert cert.crt

      - name: Run client integration tests
        working-directory: .
        run: sh cli/cli_tests_training.sh -f
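Note on the "Run django server in background" step: it starts the server and then waits a fixed 6 seconds before the seeding step runs. A possible alternative, sketched below, is to poll until the server answers. `wait_for` is a hypothetical helper that is not part of this PR, and the URL in the commented usage line is an assumption about where the dev server listens.

```shell
# wait_for is a hypothetical helper, not part of this PR.
# Usage: wait_for <timeout_seconds> <command> [args...]
# Re-runs the command once per second until it succeeds or the timeout expires.
wait_for() {
    timeout_s=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1    # gave up: command never succeeded within the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Assumed usage in the workflow step (the URL is an assumption):
#   sh setup-dev-server.sh &
#   wait_for 30 curl -sk https://localhost:8000/
```

Polling would let the job continue as soon as the server is up and fail with a clear status if it never starts, rather than depending on 6 seconds being enough on a given runner.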
7 changes: 7 additions & 0 deletions .gitignore
@@ -147,3 +147,10 @@ cython_debug/
# Dev Environment Specific
.vscode
.venv
server/keys

# exclude fl example
!examples/fl/mock_cert/project/ca/root.key
!examples/fl/mock_cert/project/ca/cert/root.crt
!flca/dev_assets/intermediate_ca.crt
!flca/dev_assets/root_ca.crt
3 changes: 1 addition & 2 deletions cli/cli_tests.sh
@@ -5,7 +5,6 @@
################### Start Testing ########################
##########################################################


##########################################################
echo "=========================================="
echo "Printing MedPerf version"
@@ -195,7 +194,7 @@ echo "Running data submission step"
echo "====================================="
print_eval "medperf dataset submit -p $PREP_UID -d $DIRECTORY/dataset_a -l $DIRECTORY/dataset_a --name='dataset_a' --description='mock dataset a' --location='mock location a' -y"
checkFailed "Data submission step failed"
-DSET_A_UID=$(medperf dataset ls | grep dataset_a | tr -s ' ' | cut -d ' ' -f 1)
+DSET_A_UID=$(medperf dataset ls | grep dataset_a | tr -s ' ' | awk '{$1=$1;print}' | cut -d ' ' -f 1)
echo "DSET_A_UID=$DSET_A_UID"
##########################################################
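
Note on the `DSET_A_UID` change above: the added `awk '{$1=$1;print}'` stage presumably guards against leading whitespace in the `medperf dataset ls` output. A minimal sketch of the difference follows; the sample row is a made-up stand-in, and the real column layout of the command is an assumption.

```shell
# Hypothetical stand-in for one row of `medperf dataset ls`; the real column
# layout is an assumption. Note the leading spaces before the UID.
row='   12   dataset_a   mock location a'

# Old pipeline: tr squeezes each run of spaces into one but keeps the single
# leading space, so cut sees an empty first field.
old_uid=$(printf '%s\n' "$row" | tr -s ' ' | cut -d ' ' -f 1)

# New pipeline: assigning $1 to itself makes awk rebuild the record from its
# fields, which drops leading and trailing whitespace before cut runs.
new_uid=$(printf '%s\n' "$row" | tr -s ' ' | awk '{$1=$1;print}' | cut -d ' ' -f 1)

echo "old_uid='$old_uid'"
echo "new_uid='$new_uid'"
```

With this sample row the old pipeline yields an empty string and the new one yields `12`. Since awk's default field splitting already ignores leading whitespace, `awk '{print $1}'` alone would extract the same first column.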
