where is the file .h5? #10

Open
ouyangzhuzhu opened this issue Jan 9, 2019 · 5 comments


@ouyangzhuzhu

Hi friends!
I have installed all the tools mentioned in README.md, downloaded the ResNet-56 (10 MB) model, and ran the command below:

    mpirun -n 4 python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 \
    --model_file cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/model_300.t7 \
    --dir_type weights --xnorm filter --xignore biasbn --ynorm filter --yignore biasbn --plot

But 24 hours later nothing has changed, and I can't find any newly created .h5 file.
Where can I find the .h5 file, or did I miss something? Hope you can help. Thanks!

@ljk628
Collaborator

ljk628 commented Jan 10, 2019

Hi @ouyangzhuzhu,

Thanks for your question. The .h5 files are generated in the same folder as your model file. With that command, there should be two .h5 files in the folder cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/:

- model_300.t7_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5 is the direction file, which stores the directions.
- model_300.t7_weights_xignore=biasbn_xnorm=filter_yignore=biasbn_ynorm=filter.h5_[-1.0,1.0,51]x[-1.0,1.0,51].h5 is the surface file, which contains the surface values with respect to those directions at that resolution.

We provide our precomputed files, so if you want to generate your own result files, you can delete the provided ones or simply use a different resolution.
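The naming pattern of the two files described above can be sketched in a few lines of Python. This is only an illustration reconstructed from the file names quoted in this comment; the helper names `direction_file_name` and `surface_file_name` are hypothetical and are not functions from the repository.

```python
def direction_file_name(model_file, dir_type="weights",
                        xignore="biasbn", xnorm="filter",
                        yignore="biasbn", ynorm="filter"):
    """Hypothetical helper: rebuild the direction file name quoted in this thread."""
    return (f"{model_file}_{dir_type}"
            f"_xignore={xignore}_xnorm={xnorm}"
            f"_yignore={yignore}_ynorm={ynorm}.h5")


def surface_file_name(dir_file, x=(-1.0, 1.0, 51), y=(-1.0, 1.0, 51)):
    """Hypothetical helper: append the (min, max, num) grid of each axis."""
    xmin, xmax, xnum = x
    ymin, ymax, ynum = y
    return (f"{dir_file}_[{float(xmin)},{float(xmax)},{int(xnum)}]"
            f"x[{float(ymin)},{float(ymax)},{int(ynum)}].h5")


if __name__ == "__main__":
    d = direction_file_name("model_300.t7")
    print(d)
    print(surface_file_name(d))
```

Because the grid is part of the surface file name, a run with a different resolution (for example 21 points per axis instead of 51) would map to a different surface file name under this pattern.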

@ouyangzhuzhu
Author

Great, thanks @ljk628! Yes, after 4 hours I got the final .h5 files, just as you said.
But I got an error at the end. Could you take a look?

    Evaluating rank 0 2600/2601 (100.0%) coord=[1. 1.] train_loss= 17.668 train_acc=8.31 time=5.66 sync=0.00
    Rank 0 done! Total time: 14505.95 Sync: 2.20
    Traceback (most recent call last):
      File "plot_surface.py", line 298, in <module>
        plot_2D.plot_2d_contour(surf_file, 'train_loss', args.vmin, args.vmax, args.vlevel, args.show)
      File "/home/l00221575/Downloads/loss-landscape/plot_2D.py", line 18, in plot_2d_contour
        f = h5py.File(surf_file, 'r')
      File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 394, in __init__
        swmr=swmr)
      File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 170, in make_fid
        fid = h5f.open(name, flags, fapl=fapl)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "h5py/h5f.pyx", line 85, in h5py.h5f.open
    OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

I also tried the command below to produce and customize a contour plot with the plot_2D.py script:

    python plot_2D.py --surf_file path_to_surf_file --surf_name train_loss

That failed too :(

    (venv_openai-es) l00221575@F0817-S05:~/Downloads/loss-landscape$ python plot_2D.py --surf_file cifar10/trained_nets/resnet56_sgd_lr\=0.1_bs\=128_wd\=0.0005/ --surf_name train_loss
    Traceback (most recent call last):
      File "plot_2D.py", line 205, in <module>
        plot_2d_contour(args.surf_file, args.surf_name, args.vmin, args.vmax, args.vlevel, args.show)
      File "plot_2D.py", line 18, in plot_2d_contour
        f = h5py.File(surf_file, 'r')
      File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 394, in __init__
        swmr=swmr)
      File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/_hl/files.py", line 170, in make_fid
        fid = h5f.open(name, flags, fapl=fapl)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "h5py/h5f.pyx", line 85, in h5py.h5f.open
    OSError: Unable to open file (file read failed: time = Thu Jan 10 02:44:57 2019 , filename = 'cifar10/trained_nets/resnet56_sgd_lr=0.1_bs=128_wd=0.0005/', file descriptor = 4, errno = 21, error message = 'Is a directory', buf = 0x7ffdce19ddb0, total read size = 8, bytes this sub-read = 8, bytes actually read = 18446744073709551615, offset = 0)
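Note that the second traceback ends with errno 21 ('Is a directory'): the value given to --surf_file in that command is the folder, not one of the .h5 files inside it. A minimal, standard-library sketch of a check that could be run before plotting is shown here; `list_h5_files` and `check_surf_file` are hypothetical helpers written for this thread, not part of plot_2D.py.

```python
import os


def list_h5_files(path):
    """Hypothetical helper: return the .h5 files directly inside `path`, sorted."""
    return sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if name.endswith(".h5")
    )


def check_surf_file(path):
    """Return `path` if it is a regular file; otherwise raise with a hint."""
    if os.path.isdir(path):
        candidates = list_h5_files(path)
        raise IsADirectoryError(
            f"{path!r} is a directory; pass one of its .h5 files instead: {candidates}"
        )
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no such file: {path!r}")
    return path
```

Calling `check_surf_file` on the folder from the failing command would raise immediately and list the .h5 files found there, so the surface file can be picked and passed to --surf_file.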

@ljk628
Collaborator

ljk628 commented Jan 10, 2019

This is the same as #4, which can be temporarily worked around by downgrading h5py: pip install h5py==2.7.0.
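After changing the pinned version, it can help to confirm which h5py the active virtual environment actually picks up. The sketch below uses `importlib.metadata`, which is an assumption about the environment: it needs Python 3.8 or newer, whereas the tracebacks in this thread come from Python 3.5, where printing `h5py.__version__` serves the same purpose. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None when h5py is not installed in the active environment.
    print("h5py version:", installed_version("h5py"))
```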

@ouyangzhuzhu
Author

Great, thanks a lot! It worked!

@ouyangzhuzhu
Author

ouyangzhuzhu commented Jan 10, 2019

Hi @ljk628, I got an error when I tried ResNet-56-noshort (20 MB); the traceback is below.
I removed "mpirun -n 4" because something may be wrong with my mpirun. Without it, ResNet-56 (10 MB) still works, just more slowly. Please help. Thanks!
    (venv_openai-es) l00221575@F0817-S05:~/Downloads/loss-landscape$ python plot_surface.py --mpi --cuda --model resnet56 --x=-1:1:51 --y=-1:1:51 --model_file cifar10/trained_nets/resnet56_noshort_sgd_lr\=0.1_bs\=128_wd\=0.0005/model_300.t7
    /home/l00221575/venv_openai-es/lib/python3.5/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
      from ._conv import register_converters as _register_converters

    [[57286,1],0]: A high-performance Open MPI point-to-point messaging module
    was unable to find any relevant network interfaces:
    Module: OpenFabrics (openib)
    Host: F0817-S05
    Another transport will be used instead, although this may result in
    lower performance.

    Rank 0 use GPU 0 of 8 GPUs on F0817-S05
    Traceback (most recent call last):
      File "plot_surface.py", line 243, in <module>
        net = model_loader.load(args.dataset, args.model, args.model_file)
      File "/home/l00221575/Downloads/loss-landscape/model_loader.py", line 6, in load
        net = cifar10.model_loader.load(model_name, model_file, data_parallel)
      File "/home/l00221575/Downloads/loss-landscape/cifar10/model_loader.py", line 49, in load
        net.load_state_dict(stored['state_dict'])
      File "/home/l00221575/venv_openai-es/lib/python3.5/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
        self.__class__.__name__, "\n\t".join(error_msgs)))
    RuntimeError: Error(s) in loading state_dict for ResNet_cifar:
        Missing key(s) in state_dict: "layer2.0.shortcut.0.weight",
        "layer2.0.shortcut.1.running_var",
        "layer2.0.shortcut.1.bias",
        "layer2.0.shortcut.1.weight",
        "layer2.0.shortcut.1.running_mean",
        "layer3.0.shortcut.0.weight",
        "layer3.0.shortcut.1.running_var",
        "layer3.0.shortcut.1.bias",
        "layer3.0.shortcut.1.weight",
        "layer3.0.shortcut.1.running_mean".
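The thread records no reply to this last error. What the traceback itself shows is that every missing key belongs to a shortcut layer: the network built for --model resnet56 expects shortcut parameters that the ResNet-56-noshort checkpoint does not contain, which points to a mismatch between the --model argument and the checkpoint architecture. The comparison PyTorch performs here can be sketched without any framework; `diff_state_dict_keys` is a hypothetical helper and the key lists are shortened, illustrative examples rather than the real model's full key set.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, both sorted.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names the checkpoint has but the model does not define
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


if __name__ == "__main__":
    # Illustrative key lists only: a model with shortcut layers versus
    # a checkpoint saved from a network without them.
    model_keys = ["conv1.weight",
                  "layer2.0.shortcut.0.weight",
                  "layer2.0.shortcut.1.bias"]
    checkpoint_keys = ["conv1.weight"]
    missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

With PyTorch, the two inputs would be `net.state_dict().keys()` and `stored['state_dict'].keys()`, using the names that appear in the traceback. If the repository accepts a no-shortcut variant as a --model value (an assumption; check the model names the loader accepts), passing that instead of resnet56 would be the first thing to try.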
