Val_loss NaN for any training #58

Closed
louisPoulain opened this issue Oct 4, 2024 · 1 comment · Fixed by #59
Labels
bug Something isn't working

Comments

louisPoulain commented Oct 4, 2024

sum_vobs_nan.ref().deref()=<tf.Tensor: shape=(), dtype=int32, numpy=0>/500000
tf.reduce_sum(tf.cast(tf.math.is_nan(y_pred.distribution.normal.mean()), tf.int32))=<tf.Tensor: shape=(), dtype=int32, numpy=544>/500000
tf.reduce_sum(tf.cast(tf.math.is_nan(y_pred.distribution.normal.stddev()), tf.int32))=<tf.Tensor: shape=(), dtype=int32, numpy=544>/500000
tf.reduce_sum(tf.cast(tf.math.is_nan(y_pred.mean()), tf.int32))=<tf.Tensor: shape=(), dtype=int32, numpy=544>/500000
sum_samp1_nan.ref().deref()=<tf.Tensor: shape=(), dtype=int32, numpy=54400>/50000000
sum_samp2_nan.ref().deref()=<tf.Tensor: shape=(), dtype=int32, numpy=54400>/50000000
sum_e1_nan.ref().deref()=<tf.Tensor: shape=(), dtype=int32, numpy=544>/500000
sum_e2_nan.ref().deref()=<tf.Tensor: shape=(), dtype=int32, numpy=544>/500000
sum_twcrps_nan.ref().deref()=<tf.Tensor: shape=(), dtype=int32, numpy=544>/500000

For every training run, exactly one batch of data (size 500,000) produces exactly 544 NaNs in the predicted distribution.
The distribution is a doubly censored normal.
The actual loss, on the other hand, is never NaN.

Other info

  • No NaNs in the validation set or in the training set
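A check for NaNs alone can report a dataset as clean while it still contains infinite values. As a minimal sketch (plain Python, with a hypothetical helper name and made-up values, not the check that was actually run here), counting NaN and ±Inf separately makes the difference visible:

```python
import math


def count_non_finite(values):
    """Return (n_nan, n_inf) for an iterable of floats."""
    n_nan = sum(1 for v in values if math.isnan(v))
    n_inf = sum(1 for v in values if math.isinf(v))
    return n_nan, n_inf


# Made-up feature column: no NaNs, but two infinite entries.
feature = [0.1, float("inf"), 0.7, float("-inf"), 0.3]

n_nan, n_inf = count_non_finite(feature)
print(f"NaN: {n_nan}, Inf: {n_inf}")  # NaN: 0, Inf: 2
```

On array data the same idea would typically be expressed with elementwise `isnan` / `isinf` functions from the array library in use.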
Base file

features:
- ch1:cloud_area_fraction_ensavg
- ch1:cloud_area_fraction_ensstd
- ch2:air_temperature_ensavg
- ch2:cloud_area_fraction_ensavg
- ch2:cloud_area_fraction_ensstd
- ifs:air_temperature_ensavg
- ifs:cloud_area_fraction_ensavg
- ifs:cloud_area_fraction_ensstd
- terrain:distance_to_alpine_ridge
- terrain:valley_norm_2000m
- time:cos_dayofyear
- time:cos_hourofday
- time:sin_dayofyear
- time:sin_hourofday

normalizer:
fillvalue: -5
default: Standardizer

targets:

  • obs:cloud_area_fraction

sample_weights:

  • time:weight_leadtime

data_partitioning:
time_split:
train:
- 2020-02
- 2020-03
- 2020-05
- 2020-06
- 2020-07
- 2020-08
- 2020-09
- 2020-10
- 2020-11
- 2020-12
- 2021-02
- 2021-03
- 2021-04
- 2021-05
- 2021-06
- 2021-07
- 2021-08
- 2021-09
- 2021-11
- 2021-12
- 2022-01
- 2022-02
- 2022-03
- 2022-04
- 2022-05
- 2022-06
- 2022-08
- 2022-09
- 2020-10
- 2022-11
- 2022-12
- 2023-01
val:
- 2020-04
- 2021-01
- 2021-10
- 2022-07
test:
start: "2023-02-01"
end: "2023-09-30"
station_split:
train: 0.85
val: 0.15
seed: 42
time_dim_name: forecast_reference_time
batching:
event_dims: []
batch_size: 500_000
shuffle: True

model:
fully_connected_network:
hidden_layers: [512, 512, 512]
activations: relu
dropout: [0.3, 0.2, 0.1]
probabilistic_layer: IndependentDoublyCensoredNormal
mc_dropout: True

optimizer:
Adam:
learning_rate:
CosineDecayRestarts:
initial_learning_rate: 0.001
first_decay_steps: 20
t_mul: 2.0
m_mul: 1.025
alpha: 0.0
beta_1: 0.94
loss:
WeightedCRPSEnergy:
threshold: 0.0
n_samples: 100
epochs: 4
steps_per_epoch: 1

metrics:

  • MAEBusts:
    threshold: 0.25

callbacks:

  • EarlyStopping:
    monitor: val_loss
    mode: min
    patience: 20
    restore_best_weights: True
    verbose: 1
    min_delta: 0.000050000
  • EnsembleMetrics:
    thresholds: [0.9]

Env

Environment absl-py==2.1.0 aiohappyeyeballs @ file:///home/conda/feedstock_root/build_artifacts/aiohappyeyeballs_1724167852130/work aiohttp @ file:///home/conda/feedstock_root/build_artifacts/aiohttp_1727281375658/work aiosignal @ file:///home/conda/feedstock_root/build_artifacts/aiosignal_1667935791922/work alembic @ file:///home/conda/feedstock_root/build_artifacts/alembic_1727122811080/work aniso8601 @ file:///home/conda/feedstock_root/build_artifacts/aniso8601_1618789466884/work asciitree==0.3.3 asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work astunparse==1.6.3 async-timeout @ file:///home/conda/feedstock_root/build_artifacts/async-timeout_1691763562544/work attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1722977137225/work bcrypt @ file:///home/conda/feedstock_root/build_artifacts/bcrypt_1724960420580/work blinker @ file:///home/conda/feedstock_root/build_artifacts/blinker_1715091184126/work bokeh @ file:///home/conda/feedstock_root/build_artifacts/bokeh_1719324651922/work boto3 @ file:///home/conda/feedstock_root/build_artifacts/boto3_1727422381215/work botocore @ file:///home/conda/feedstock_root/build_artifacts/botocore_1727397771150/work Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1725267488082/work cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1724028158384/work certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1725278078093/work/certifi cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1725571112467/work cftime @ file:///home/conda/feedstock_root/build_artifacts/cftime_1725400455427/work charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1698833585322/work click @ file:///home/conda/feedstock_root/build_artifacts/click_1692311806742/work cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1697464713350/work contourpy @ 
file:///home/conda/feedstock_root/build_artifacts/contourpy_1727293517607/work cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography-split_1725443044072/work cycler @ file:///home/conda/feedstock_root/build_artifacts/cycler_1696677705766/work cytoolz @ file:///home/conda/feedstock_root/build_artifacts/cytoolz_1706897086113/work dask==2022.12.1 dask-expr @ file:///home/conda/feedstock_root/build_artifacts/dask-expr_1722982607046/work databricks-sdk @ file:///home/conda/feedstock_root/build_artifacts/databricks-sdk_1726835227694/work decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work Deprecated @ file:///home/conda/feedstock_root/build_artifacts/deprecated_1685233314779/work distributed @ file:///home/conda/feedstock_root/build_artifacts/distributed_1722982528621/work dm-tree==0.1.8 docker @ file:///home/conda/feedstock_root/build_artifacts/docker-py_1716508870406/work entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1720869315914/work executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1725214404607/work fasteners @ file:///home/conda/feedstock_root/build_artifacts/fasteners_1643971550063/work Flask @ file:///home/conda/feedstock_root/build_artifacts/flask_1712667726126/work flatbuffers==24.3.25 fonttools @ file:///home/conda/feedstock_root/build_artifacts/fonttools_1727206408738/work frozenlist @ file:///home/conda/feedstock_root/build_artifacts/frozenlist_1725395644230/work fsspec @ file:///home/conda/feedstock_root/build_artifacts/fsspec_1725543257300/work gast==0.4.0 gitdb @ file:///home/conda/feedstock_root/build_artifacts/gitdb_1697791558612/work GitPython @ file:///home/conda/feedstock_root/build_artifacts/gitpython_1711991025291/work google-auth @ file:///home/conda/feedstock_root/build_artifacts/google-auth_1726832896641/work 
google-auth-oauthlib==1.0.0 google-pasta==0.2.0 graphene @ file:///home/conda/feedstock_root/build_artifacts/graphene_1690379572063/work graphql-core @ file:///home/conda/feedstock_root/build_artifacts/graphql-core_1725549136655/work graphql-relay @ file:///home/conda/feedstock_root/build_artifacts/graphql-relay_1650134628625/work greenlet @ file:///home/conda/feedstock_root/build_artifacts/greenlet_1726922189413/work grpcio==1.66.1 gunicorn @ file:///home/conda/feedstock_root/build_artifacts/gunicorn_1713358040599/work h5py==3.12.1 idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1726459485162/work importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1726082825846/work importlib_resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1725921340658/work ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1701831663892/work itsdangerous @ file:///home/conda/feedstock_root/build_artifacts/itsdangerous_1713372668944/work jax==0.4.30 jaxlib==0.4.30 jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1715127149914/work jmespath @ file:///home/conda/feedstock_root/build_artifacts/jmespath_1655568249366/work joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1714665484399/work keras==2.12.0 kiwisolver @ file:///home/conda/feedstock_root/build_artifacts/kiwisolver_1725459266648/work libclang==18.1.1 llvmlite==0.43.0 locket @ file:///home/conda/feedstock_root/build_artifacts/locket_1650660393415/work lz4 @ file:///home/conda/feedstock_root/build_artifacts/lz4_1725089417274/work Mako @ file:///home/conda/feedstock_root/build_artifacts/mako_1715711344987/work Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1710435156458/work MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1724959465445/work matplotlib==3.9.2 matplotlib-inline @ 
file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1713250518406/work ml_dtypes==0.5.0 mlflow @ file:///home/conda/feedstock_root/build_artifacts/mlflow-split_1726566280547/work mlflow-skinny @ file:///home/conda/feedstock_root/build_artifacts/mlflow-split_1726566280547/work mlpp-lib==0.12.2 msgpack @ file:///home/conda/feedstock_root/build_artifacts/msgpack-python_1725975012026/work multidict @ file:///home/conda/feedstock_root/build_artifacts/multidict_1725953652790/work munkres==1.1.4 netCDF4 @ file:///home/conda/feedstock_root/build_artifacts/netcdf4_1725449927647/work numba @ file:///home/conda/feedstock_root/build_artifacts/numba_1718888028049/work numcodecs @ file:///home/conda/feedstock_root/build_artifacts/numcodecs_1715218778254/work numpy==1.24.3 oauthlib==3.2.2 opentelemetry-api @ file:///home/conda/feedstock_root/build_artifacts/opentelemetry-api_1676680662101/work opentelemetry-sdk @ file:///home/conda/feedstock_root/build_artifacts/opentelemetry-sdk_1676709164054/work opentelemetry-semantic-conventions @ file:///home/conda/feedstock_root/build_artifacts/opentelemetry-semantic-conventions_1676680479396/work opt_einsum==3.4.0 packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1718189413536/work pandas==1.5.3 paramiko @ file:///home/conda/feedstock_root/build_artifacts/paramiko_1726748051454/work parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1712320355065/work partd @ file:///home/conda/feedstock_root/build_artifacts/partd_1715026491486/work pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1726075067949/work prometheus_client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1726901976720/work prometheus_flask_exporter @ 
file:///home/conda/feedstock_root/build_artifacts/prometheus_flask_exporter_1720670279306/work prompt_toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1727341649933/work properscoring==0.1 protobuf==4.25.3 psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1725737916340/work ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl pure_eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1721585709575/work pyarrow==17.0.0 pyarrow-hotfix @ file:///home/conda/feedstock_root/build_artifacts/pyarrow-hotfix_1700596371886/work pyasn1 @ file:///home/conda/feedstock_root/build_artifacts/pyasn1_1726839225972/work pyasn1_modules @ file:///home/conda/feedstock_root/build_artifacts/pyasn1-modules_1726029546107/work pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1711811537435/work Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1714846767233/work PyNaCl @ file:///home/conda/feedstock_root/build_artifacts/pynacl_1725739244417/work pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1722587090966/work pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1724616129934/work PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1706886791323/work pyu2f @ file:///home/conda/feedstock_root/build_artifacts/pyu2f_1604248910016/work PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1725456176299/work querystring_parser @ file:///home/conda/feedstock_root/build_artifacts/querystring_parser_1723625595981/work requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1717057054362/work requests-oauthlib==2.0.0 rsa @ 
file:///home/conda/feedstock_root/build_artifacts/rsa_1658328885051/work s3transfer @ file:///home/conda/feedstock_root/build_artifacts/s3transfer_1719300139436/work scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1726082655509/work/dist/scikit_learn-1.5.2-cp39-cp39-linux_x86_64.whl#sha256=9bdea44be238844ca955b35fde2df3049752a843e3eb223cf91e68e25efefa5c scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy-split_1716470218293/work/dist/scipy-1.13.1-cp39-cp39-linux_x86_64.whl#sha256=e6696cb8683d94467891b7648e068a3970f6bc0a1b3c1aa7f9bc89458eafd2f0 six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work smmap @ file:///home/conda/feedstock_root/build_artifacts/smmap_1634310307496/work sortedcontainers @ file:///home/conda/feedstock_root/build_artifacts/sortedcontainers_1621217038088/work SQLAlchemy @ file:///home/conda/feedstock_root/build_artifacts/sqlalchemy_1726596200000/work sqlparse @ file:///home/conda/feedstock_root/build_artifacts/sqlparse_1721304206023/work stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work tblib @ file:///home/conda/feedstock_root/build_artifacts/tblib_1702066284995/work tensorboard==2.12.3 tensorboard-data-server==0.7.2 tensorflow==2.12.1 tensorflow-estimator==2.12.0 tensorflow-io-gcs-filesystem==0.37.1 tensorflow-probability==0.20.1 termcolor==2.4.0 threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1714400101435/work toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1706112571092/work tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1724955920300/work traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1713535121073/work typing_extensions==4.5.0 tzdata @ file:///home/conda/feedstock_root/build_artifacts/python-tzdata_1727140567071/work unicodedata2 @ file:///home/conda/feedstock_root/build_artifacts/unicodedata2_1695847984941/work urllib3 @ 
file:///home/conda/feedstock_root/build_artifacts/urllib3_1718728347128/work wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1713923384721/work Werkzeug @ file:///home/conda/feedstock_root/build_artifacts/werkzeug_1724330738730/work wrapt==1.14.1 xarray==2022.12.0 xyzservices @ file:///home/conda/feedstock_root/build_artifacts/xyzservices_1725366347586/work yarl @ file:///home/conda/feedstock_root/build_artifacts/yarl_1727422848961/work zarr @ file:///home/conda/feedstock_root/build_artifacts/zarr_1716779724722/work zict @ file:///home/conda/feedstock_root/build_artifacts/zict_1681770155528/work zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1726248574750/work
louisPoulain added the bug label Oct 4, 2024
louisPoulain (Collaborator, Author) commented

After reviewing our feature set, it seems that some of our variables contain Inf or -Inf values.
The cause could be that Dataset.drop_nans only checks for NaNs and not for infinite values.
I'll propose a fix.
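To illustrate the suspected gap, here is a minimal sketch in plain Python. The function names and data are hypothetical and this is not the implementation of `Dataset.drop_nans` or of the eventual fix; it only contrasts a NaN-only filter with one that rejects every non-finite value:

```python
import math


def drop_nan_rows(rows):
    """Drop rows containing NaN only; rows with +/-Inf are kept."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


def drop_non_finite_rows(rows):
    """Drop rows containing NaN, +Inf, or -Inf."""
    return [row for row in rows if all(math.isfinite(v) for v in row)]


# Made-up samples: one clean row, one with NaN, two with infinite values.
rows = [
    [0.2, 1.5],
    [float("nan"), 0.3],
    [float("inf"), 0.8],
    [0.9, float("-inf")],
]

print(len(drop_nan_rows(rows)))         # 3: the Inf rows slip through
print(len(drop_non_finite_rows(rows)))  # 1: only the clean row remains
```

Testing for finiteness instead of for NaN covers both cases in a single predicate, which is why it is a common choice for this kind of filter.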

louisPoulain linked a pull request on Oct 7, 2024 that will close this issue