TFJob failed to run behind proxy with IOError: Not a gzipped file #182

Closed
NohaIhab opened this issue Jul 22, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@NohaIhab
Contributor

Bug Description

The training operator UATs failed in a CKF deployment behind proxy, with TFJob in Failed status

To Reproduce

  1. Deploy CKF 1.9/beta behind proxy
  2. Run the training-operator UATs
  3. Describe the TFJob

Environment

microk8s 1.29-strict/stable
juju 3.4.4

Relevant Log Output

Traceback (most recent call last):
  File "/var/tf_mnist/mnist_with_summaries.py", line 212, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/var/tf_mnist/mnist_with_summaries.py", line 183, in main
    train()
  File "/var/tf_mnist/mnist_with_summaries.py", line 39, in train
    fake_data=FLAGS.fake_data)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 262, in read_data_sets
    train_images = extract_images(f)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 62, in extract_images
    magic = _read32(bytestream)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 43, in _read32
    return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
  File "/usr/lib/python2.7/gzip.py", line 268, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_gzip_header()
  File "/usr/lib/python2.7/gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
Successfully downloaded train-images-idx3-ubyte.gz 3687 bytes.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
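
The last frames of the traceback show the gzip module rejecting the file header. A minimal sketch of that failure mode follows, written for Python 3 rather than the image's Python 2.7 (where the same condition surfaces as `IOError`). The helper name `try_read_gzip` and both payloads are made up for illustration; they are not the bytes that were actually downloaded.

```python
import gzip
import io


def try_read_gzip(payload):
    """Return the first 4 decompressed bytes, or an error string if payload is not gzip."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(payload)) as f:
            return f.read(4)
    except OSError as exc:  # gzip.BadGzipFile (Python 3.8+) is a subclass of OSError
        return "error: %s" % exc


# Stand-in for a real archive: gzip data whose first 4 bytes are the
# MNIST image-file magic number (2051 == 0x00000803).
real_archive = gzip.compress(b"\x00\x00\x08\x03" + b"\x00" * 16)

# Stand-in for a small non-gzip response, e.g. an HTML page (an assumption;
# the issue does not say what the 3687-byte file contained).
not_gzip = b"<html>hypothetical proxy response</html>"

print(try_read_gzip(real_archive))
print(try_read_gzip(not_gzip))
```

The second call fails at the header check, which is the same point where `_read_gzip_header` raised in the log above.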

Additional Context

No response

@NohaIhab NohaIhab added the bug Something isn't working label Jul 22, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6042.

This message was autogenerated

@misohu
Member

misohu commented Aug 22, 2024

So I have inspected the TFJob image and the code it uses. First of all, there is a higher tag for that image, but it's the same code with Python 2.7, which still doesn't work. So I dug into the code and the library, and the important part, the piece that downloads the dataset into the image and causes the problem, is this:

from six.moves import urllib

def urlretrieve_with_retry(url, filename=None):
  return urllib.request.urlretrieve(url, filename)

name, r = urlretrieve_with_retry("https://raw.githubusercontent.com/golbin/TensorFlow-MNIST/master/mnist/data/train-images-idx3-ubyte.gz", "train-images-idx3-ubyte.gz")

So I created a sleeping image behind the proxy with Python 2.7, SSHed into it, and reran just that code with the proxy env variables set. To compare, I ran a simple curl from the same pod to fetch the same file. You won't believe it, but both the Python code and the curl succeed, yet the files are different sizes. The Python code gets just 3.7 KB of data, while curl gets the full 9.5 MB:

root@python-bash-pod:/# ls -lh train-images-idx3-ubyt*.gz
-rw-r--r-- 1 root root 9.5M Aug 22 06:20 train-images-idx3-ubyte-curl.gz
-rw-r--r-- 1 root root 3.7K Aug 22 06:20 train-images-idx3-ubyte.gz
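
A quick way to tell the two files apart without unpacking them is to look at the size and the first bytes, since every gzip stream starts with the two magic bytes `0x1f 0x8b`. The sketch below is only an illustration: `inspect_download` is a hypothetical helper, and the demo file is a stand-in written on the spot, because the thread does not show what the 3.7K file actually contained.

```python
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def inspect_download(path, preview_len=64):
    """Return (size_in_bytes, is_gzip, first_bytes) for a downloaded file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(preview_len)
    return size, head.startswith(GZIP_MAGIC), head


# Demo on a stand-in file; in the pod one would pass the real paths, e.g.
# "train-images-idx3-ubyte.gz" and "train-images-idx3-ubyte-curl.gz".
demo_body = b"<!DOCTYPE html><html>hypothetical non-gzip response</html>"
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    tmp.write(demo_body)
    demo_path = tmp.name

size, is_gzip, preview = inspect_download(demo_path)
print(size, is_gzip, preview)
os.remove(demo_path)
```

If `is_gzip` comes back `False`, the preview bytes usually make it obvious what was saved instead of the archive.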

To overcome this problem, in the TFJob command I first curl all the datasets into the image before running the code, so the code does not use urlretrieve_with_retry.
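
The workaround could be sketched roughly as below: build one shell command that fetches the files with curl and then starts the training script. This is not the actual change from the linked fix, only an illustration. Assumptions: `build_prefetch_command` is a hypothetical helper; the data directory and script path are taken from the log output in this issue; the three file names other than `train-images-idx3-ubyte.gz` are the standard MNIST names and are assumed to be hosted under the same base URL.

```python
import shlex

# Base URL taken from the snippet above; the data dir and script path come
# from the log output in this issue.
BASE_URL = "https://raw.githubusercontent.com/golbin/TensorFlow-MNIST/master/mnist/data/"
DATA_DIR = "/tmp/tensorflow/mnist/input_data"

# Standard MNIST archive names; only the first appears in this thread,
# the other three are assumed to live at the same base URL.
MNIST_FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def build_prefetch_command(train_cmd, files=MNIST_FILES,
                           base_url=BASE_URL, data_dir=DATA_DIR):
    """Build a shell command that curls each file into data_dir, then runs train_cmd."""
    steps = ["mkdir -p " + shlex.quote(data_dir)]
    for name in files:
        dest = data_dir + "/" + name
        # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
        steps.append("curl -fsSL -o %s %s"
                     % (shlex.quote(dest), shlex.quote(base_url + name)))
    steps.append(train_cmd)
    return " && ".join(steps)


command = build_prefetch_command("python /var/tf_mnist/mnist_with_summaries.py")
print(command)
```

Chaining with `&&` means the training script only starts if every download succeeded, and `-f` keeps curl from silently saving an HTTP error body in place of the archive.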

DnPlas added a commit to canonical/charmed-kubeflow-uats that referenced this issue Aug 27, 2024
DnPlas added a commit to canonical/charmed-kubeflow-uats that referenced this issue Aug 27, 2024
@misohu
Member

misohu commented Aug 28, 2024

Fixed here: canonical/charmed-kubeflow-uats#105

@misohu misohu closed this as completed Aug 28, 2024