
How to run this program on multiple GPUs #22

Open
sewellGUO opened this issue Dec 16, 2020 · 5 comments
@sewellGUO

Hello @georgesterpu,
Thank you for the open-source code. I have run into a problem:
when I run this program on multiple GPUs, only one GPU is fully utilised while the remaining GPUs stay idle.
I am new to TensorFlow, and the methods suggested by Google have not helped so far. How should I modify the code to solve this?

@georgesterpu
Owner

Hi @sewellGUO
Thanks for opening the issue.
I did not implement any multi-GPU training support in this project, as it was never my use case.

Have you taken a look at these two guides written for TensorFlow 1.x?
https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/distribute_strategy.ipynb
https://github.com/tensorflow/docs/blob/master/site/en/r1/tutorials/distribute/training_loops.ipynb

The guides written by the TensorFlow team at Google are certainly the best resources to start with.

I would also take a look at the state of the art in Distributed Training before deciding which is the right strategy.
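For intuition, the synchronous data-parallel idea behind those guides (split each batch across replicas, compute per-replica gradients, average them before a single weight update) can be sketched without any framework. Everything below is a toy illustration on a 1-D linear fit, not code from this repository:

```python
# Framework-free sketch of synchronous data parallelism, the idea behind
# tf.distribute.MirroredStrategy: split a batch across replicas, compute
# per-replica gradients, then average them before the weight update.
# The model here is a toy 1-D linear fit; all names are illustrative only.

def grad(w, xs, ys):
    """Gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. w, over one shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train_step(w, batch, n_replicas, lr=0.1):
    xs, ys = batch
    shard = len(xs) // n_replicas
    # Each "replica" works on its own shard of the batch.
    grads = [
        grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_replicas)
    ]
    # All-reduce: average the per-replica gradients, then update once.
    return w - lr * sum(grads) / n_replicas

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true weight is 2
w = 0.0
for _ in range(100):
    w = train_step(w, (xs, ys), n_replicas=2)
```

Because the per-replica gradients are averaged, the update is mathematically the same as a single-device step on the full batch; the strategies in the guides differ mainly in how that all-reduce is performed across devices.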

Please note that some of the models used in this project have been ported to TensorFlow 2 in the Taris repository:
https://github.com/georgesterpu/Taris
That repository does not include multi-GPU training either, but TF2 may be easier to work with than TF1 at this stage.

Do you find the two guides above useful for your use case?

@sewellGUO
Author

sewellGUO commented Dec 21, 2020

Thanks for the reply, I will try it.
In the meantime I have run into another problem, shown below:

Traceback (most recent call last):
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Decoder/decoder/my_dense/bias_0-grad
         [[{{node Decoder/decoder/my_dense/bias_0-grad}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_audiovisual.py", line 64, in <module>
    main()
  File "run_audiovisual.py", line 59, in main
    logfile=logfile,
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/experiment.py", line 112, in run_experiment
    try_restore_latest_checkpoint=True
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/avsr.py", line 272, in train
    ], **self.sess_opts)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Decoder/decoder/my_dense/bias_0-grad
         [[node Decoder/decoder/my_dense/bias_0-grad (defined at /data/cqx/gjw/git/moban/avsr-tf1/avsr/seq2seq.py:231) ]]

Caused by op 'Decoder/decoder/my_dense/bias_0-grad', defined at:
  File "run_audiovisual.py", line 64, in <module>
    main()
  File "run_audiovisual.py", line 59, in main
    logfile=logfile,
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/experiment.py", line 107, in run_experiment
    **kwargs
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/avsr.py", line 216, in __init__
    self._create_models()
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/avsr.py", line 526, in _create_models
    batch_size=self._hparams.batch_size[0])
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/avsr.py", line 570, in _make_model
    hparams=self._hparams
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/seq2seq.py", line 26, in __init__
    self._init_optimiser()
  File "/data/cqx/gjw/git/moban/avsr-tf1/avsr/seq2seq.py", line 231, in _init_optimiser
    summary = tf.summary.histogram("%s-grad" % variable.name, value)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/summary/summary.py", line 177, in histogram
    tag=tag, values=values, name=scope)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 312, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/data/cqx/anaconda3/envs/g_tf1.13/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: Decoder/decoder/my_dense/bias_0-grad
         [[node Decoder/decoder/my_dense/bias_0-grad (defined at /data/cqx/gjw/git/moban/avsr-tf1/avsr/seq2seq.py:231) ]]

I tried reducing the learning rate and the batch size, but it didn't help. Have you ever encountered this problem?

@georgesterpu
Owner

This looks like a tricky one. I have never encountered it in over 3 years.

  1. Can you please confirm your TensorFlow and Python versions? My conda tf1.x environment uses python==3.7.6 and tensorflow==1.13.1 from PyPI.
  2. Could you share more information about how you are launching the experiment?
  3. Decoder/decoder/my_dense/bias_0-grad should refer to the variable created at this line, which is the argument passed to tf.contrib.seq2seq.BasicDecoder. It is a linear transformation applied to the decoder LSTM cell output to produce the pre-softmax class activations. The vocabulary size is automatically inferred from the list of allowed output tokens. I have only used this repository for English, and the list of characters/tokens is stored here. Are you using a different set of output tokens? I can see on Stack Overflow that your error is sometimes triggered by this specific aspect.
  4. What if you pass use_bias=False to the instantiation of this tensorflow.python.layers.core.Dense layer at L112? Do you see a new variable name in the error message?
  5. The error seems related to the generation of the gradient histograms for visualisation in TensorBoard. I am wondering whether the error is specific to summary generation, or whether there is a NaN value in the gradient computed for that variable. As an ablation, you could remove all the tf.summary code in avsr.py that displays the training graph, its variables, and the gradient histograms.

I remember having difficulties debugging my code in TF1, which is one reason I migrated to TF2 last year.
If none of the above helps you identify the issue, you may have to open an issue in the TensorFlow repository, although it might be tricky to create an example that reproduces the error.
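To make point 3 concrete, the label check can be sketched in plain Python. The allowed character set below matches the default English token list; the label format and function names are illustrative assumptions, not the actual parsing code of avsr-tf1:

```python
# Hypothetical sketch: scan transcription labels for characters outside the
# default token list of this repository (letters a-z, space, apostrophe).
# A non-empty result means the vocabulary inferred at training time will not
# cover the labels, which can surface as downstream numerical errors.

ALLOWED = set("abcdefghijklmnopqrstuvwxyz '")

def out_of_vocabulary(label: str) -> set:
    """Return the set of characters in `label` not covered by ALLOWED."""
    return set(label.lower()) - ALLOWED

labels = [
    "the quick brown fox",
    "room 101 is closed",   # contains digits -> flagged below
]

for text in labels:
    oov = out_of_vocabulary(text)
    if oov:
        print(f"OOV characters {sorted(oov)} in: {text!r}")
```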

@sewellGUO
Author

Thank you for your prompt reply. I tested your suggestions one by one and found that my output tokens contained digits. After regenerating the labels, the error no longer occurs.
In my tests, the program runs on Python 3.6.9 and tensorflow-gpu 1.13.1.

@georgesterpu
Owner

Nice, really glad to see that it worked.
Yes, with that default character list file, the text is assumed to be normalised, including only the letters a-z, space, and apostrophe.
You can also extend the token vocabulary file with digits or other characters if you don't want to parse the labels.
Take care that the conversion from numbers to words is quite tricky, since there are multiple ways of pronouncing the same number.
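As a toy illustration of that ambiguity (a minimal sketch, not part of this repository): even the simplest normalisation, spelling out each digit individually, commits to just one of several valid spoken readings of the same number.

```python
# Toy sketch: spell out each digit individually, so "101" becomes
# "one zero one". "one hundred and one" or "one oh one" would be equally
# valid readings, which is exactly what makes number-to-words conversion
# tricky for speech transcriptions.

DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_digits(text: str) -> str:
    """Spell out all-digit words; leave every other word untouched."""
    words = []
    for token in text.split():
        if token.isdigit():
            words.append(" ".join(DIGITS[d] for d in token))
        else:
            words.append(token)
    return " ".join(words)

print(spell_digits("room 101 is closed"))  # room one zero one is closed
```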
