[chibi@centos8 ~]$ sudo nvidia-docker run --rm -ti nvcr.io/nvidia/tensorflow:19.04-py3
Unable to find image 'nvcr.io/nvidia/tensorflow:19.04-py3' locally
19.04-py3: Pulling from nvidia/tensorflow
[... per-layer pull progress elided ...]
Digest: sha256:aaebc136d5d50937362675c77afd908bd96cded68846f39163050a023c8a9851
Status: Downloaded newer image for nvcr.io/nvidia/tensorflow:19.04-py3

================
== TensorFlow ==
================

NVIDIA Release 19.04 (build 6132408)
TensorFlow Version 1.13.1

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
      insufficient for TensorFlow. NVIDIA recommends the use of the following flags:
      nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...
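Acting on that note is straightforward; a minimal sketch of the same command with the recommended flags added (not part of the recorded session):

    sudo nvidia-docker run --rm -ti \
        --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
        nvcr.io/nvidia/tensorflow:19.04-py3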
root@597e78370cf3:/workspace# ls
README.md  docker-examples  nvidia-examples
root@597e78370cf3:/workspace# cd nvidia-examples
root@597e78370cf3:/workspace/nvidia-examples# ls
NCF              bert                 cnn           ssdv1.2
OpenSeq2Seq      big_lstm             gnmt_v2       tensorrt
UNet_Industrial  build_imagenet_data  resnet50v1.5
root@597e78370cf3:/workspace/nvidia-examples# cd big_lstm
root@597e78370cf3:/workspace/nvidia-examples/big_lstm# ls
1b_word_vocab.txt  data_utils_test.py         language_model_test.py
README.md          download_1b_words_data.sh  model_utils.py
__init__.py        hparams.py                 run_utils.py
common.py          hparams_test.py            single_lm_train.py
data_utils.py      language_model.py          testdata
root@597e78370cf3:/workspace/nvidia-examples/big_lstm# ./download_1b_words_data.sh
Please specify root of dataset directory: data
Success: dataset root dir validated
--2020-07-03 18:33:59--  http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1792209805 (1.7G) [application/x-gzip]
Saving to: ‘1-billion-word-language-modeling-benchmark-r13output.tar.gz’

1-billion-word-lang 100%[===================>]   1.67G   972KB/s    in 41m 52s

2020-07-03 19:15:51 (697 KB/s) - ‘1-billion-word-language-modeling-benchmark-r13output.tar.gz’ saved [1792209805/1792209805]

1-billion-word-language-modeling-benchmark-r13output/
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/
[... tar extraction listing elided: the training shards news.en-00001-of-00100 through news.en-00099-of-00100, .svn repository metadata, the heldout shards news.en.heldout-00000-of-00050 through news.en.heldout-00049-of-00050, news.en-00000-of-00100, and the dataset README ...]
Success!
One billion words dataset ready at:
data/1-billion-word-language-modeling-benchmark-r13output/
Please pass this dir to single_lm_train.py via the --datadir option.
root@597e78370cf3:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=4 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

*****HYPER PARAMETERS*****
{'num_sampled': 8192, 'num_steps': 20, 'run_profiler': False, 'vocab_size': 793470, 'optimizer': 0, 'do_summaries': False, 'learning_rate': 0.2, 'num_shards': 8, 'emb_size': 512, 'keep_prob': 0.9, 'average_params': True, 'batch_size': 128, 'max_time': 180, 'projected_size': 512, 'num_gpus': 4, 'num_delayed_steps': 150, 'max_grad_norm': 10.0, 'num_layers': 1, 'state_size': 2048}
**************************
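For reference, here is the same invocation restated with annotations; the comments are mine, and the note about max_time is inferred from the hyperparameter dump above rather than from the script's documentation:

    # train the big_lstm example on all four GPUs; shard files are read from
    # --datadir and checkpoints/summaries are written to --logdir
    python single_lm_train.py \
        --mode=train \
        --logdir=./logs \
        --num_gpus=4 \
        --datadir=./data/1-billion-word-language-modeling-benchmark-r13output
    # 'max_time': 180 above appears to cap each run at ~180 s of training,
    # consistent with every run below exiting after roughly three minutes.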
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Current time: 1593803793.502863
ALL VARIABLES
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
model/emb_0:0 (99184, 512) /gpu:0
model/emb_1:0 (99184, 512) /gpu:0
model/emb_2:0 (99184, 512) /gpu:0
model/emb_3:0 (99184, 512) /gpu:0
model/emb_4:0 (99184, 512) /gpu:0
model/emb_5:0 (99184, 512) /gpu:0
model/emb_6:0 (99184, 512) /gpu:0
model/emb_7:0 (99184, 512) /gpu:0
model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0
model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0
model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0
model/softmax_w_0:0 (99184, 512) /gpu:0
model/softmax_w_1:0 (99184, 512) /gpu:0
model/softmax_w_2:0 (99184, 512) /gpu:0
model/softmax_w_3:0 (99184, 512) /gpu:0
model/softmax_w_4:0 (99184, 512) /gpu:0
model/softmax_w_5:0 (99184, 512) /gpu:0
model/softmax_w_6:0 (99184, 512) /gpu:0
model/softmax_w_7:0 (99184, 512) /gpu:0
model/softmax_b:0 (793470,) /gpu:0
model/global_step:0 ()
model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0
model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0
model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0
model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0
model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_b/Adagrad:0 (793470,) /gpu:0
model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0
model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0
model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0
TRAINABLE VARIABLES
model/emb_0:0 (99184, 512) /gpu:0
model/emb_1:0 (99184, 512) /gpu:0
model/emb_2:0 (99184, 512) /gpu:0
model/emb_3:0 (99184, 512) /gpu:0
model/emb_4:0 (99184, 512) /gpu:0
model/emb_5:0 (99184, 512) /gpu:0
model/emb_6:0 (99184, 512) /gpu:0
model/emb_7:0 (99184, 512) /gpu:0
model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0
model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0
model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0
model/softmax_w_0:0 (99184, 512) /gpu:0
model/softmax_w_1:0 (99184, 512) /gpu:0
model/softmax_w_2:0 (99184, 512) /gpu:0
model/softmax_w_3:0 (99184, 512) /gpu:0
model/softmax_w_4:0 (99184, 512) /gpu:0
model/softmax_w_5:0 (99184, 512) /gpu:0
model/softmax_w_6:0 (99184, 512) /gpu:0
model/softmax_w_7:0 (99184, 512) /gpu:0
model/softmax_b:0 (793470,) /gpu:0
LOCAL VARIABLES
model/model/state_0_0:0 (128, 2560) /gpu:0
model/model_1/state_1_0:0 (128, 2560) /gpu:1
model/model_2/state_2_0:0 (128, 2560) /gpu:2
model/model_3/state_3_0:0 (128, 2560) /gpu:3
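The printed shapes are enough to estimate the model size; a rough trainable-parameter count (my arithmetic, not printed by the session):

    embeddings:      8 shards × 99,184 × 512        = 406,257,664
    softmax weights: 8 shards × 99,184 × 512        = 406,257,664
    softmax bias:                                   =     793,470
    LSTM cell:       1024 × 8192 + 8192 + 2048 × 512 =   9,445,376
    total                                           ≈ 822.8M parameters

Almost all of the ~0.8B parameters sit in the sharded embedding and softmax matrices; the projected LSTM cell between them is comparatively small.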
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-07-03 19:16:34.181493: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900010000 Hz
2020-07-03 19:16:34.188066: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0xaddaed0 executing computations on platform Host. Devices:
2020-07-03 19:16:34.188107: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-03 19:16:34.663112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-03 19:16:34.694277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-03 19:16:34.701156: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-03 19:16:34.702171: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0xadda8f0 executing computations on platform CUDA. Devices:
2020-07-03 19:16:34.702193: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2020-07-03 19:16:34.702199: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (1): TITAN RTX, Compute Capability 7.5
2020-07-03 19:16:34.702204: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-03 19:16:34.702210: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-03 19:16:34.703412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
totalMemory: 23.65GiB freeMemory: 23.23GiB
2020-07-03 19:16:34.703441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:21:00.0
totalMemory: 23.65GiB freeMemory: 23.49GiB
2020-07-03 19:16:34.703465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:4a:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2020-07-03 19:16:34.703488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:4b:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2020-07-03 19:16:34.703510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3
2020-07-03 19:16:35.359708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:16:35.359748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 2 3
2020-07-03 19:16:35.359753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N N N N
2020-07-03 19:16:35.359756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N N N N
2020-07-03 19:16:35.359762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N N N N
2020-07-03 19:16:35.359766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N N N N
2020-07-03 19:16:35.359908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22508 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-07-03 19:16:35.360504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5)
2020-07-03 19:16:35.360674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:4a:00.0, compute capability: 7.5)
2020-07-03 19:16:35.360955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5)
Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00063-of-00100
Finished processing!
2020-07-03 19:16:54.643870: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally
Iteration 1, time = 9.96s, wps = 1028, train loss = 12.9980
Iteration 2, time = 7.91s, wps = 1295, train loss = 12.9379
Iteration 3, time = 0.10s, wps = 97919, train loss = 12.8231
Iteration 4, time = 0.10s, wps = 102391, train loss = 11.5177
Iteration 5, time = 0.10s, wps = 106192, train loss = 12.8015
Iteration 6, time = 0.10s, wps = 104471, train loss = 11.9737
Iteration 7, time = 0.09s, wps = 110239, train loss = 70.0958
Iteration 8, time = 0.09s, wps = 110940, train loss = 40.6221
Iteration 9, time = 0.10s, wps = 104315, train loss = 15.0274
Iteration 20, time = 1.02s, wps = 110836, train loss = 42.5500
Iteration 40, time = 1.85s, wps = 110415, train loss = 9.3836
Iteration 60, time = 1.85s, wps = 110512, train loss = 9.4714
Iteration 80, time = 1.87s, wps = 109353, train loss = 8.4355
Iteration 100, time = 1.89s, wps = 108357, train loss = 8.6085
Iteration 120, time = 1.90s, wps = 107574, train loss = 7.5886
Iteration 140, time = 1.86s, wps = 110374, train loss = 7.3507
Iteration 160, time = 1.88s, wps = 109073, train loss = 7.1262
Iteration 180, time = 1.87s, wps = 109699, train loss = 6.8167
Iteration 200, time = 1.87s, wps = 109235, train loss = 6.4489
Iteration 220, time = 1.88s, wps = 108726, train loss = 6.4779
Iteration 240, time = 1.86s, wps = 109887, train loss = 6.2938
Iteration 260, time = 1.86s, wps = 109924, train loss = 6.3366
Iteration 280, time = 1.89s, wps = 108625, train loss = 6.2328
Iteration 300, time = 1.88s, wps = 108971, train loss = 6.1755
Iteration 320, time = 1.88s, wps = 109078, train loss = 6.0751
Iteration 340, time = 1.87s, wps = 109503, train loss = 5.9710
Iteration 360, time = 1.88s, wps = 108788, train loss = 5.9520
Iteration 380, time = 1.88s, wps = 108994, train loss = 5.8914
Iteration 400, time = 1.89s, wps = 108398, train loss = 5.8950
Iteration 420, time = 1.89s, wps = 108161, train loss = 5.9174
Iteration 440, time = 1.88s, wps = 109201, train loss = 5.7866
Iteration 460, time = 1.86s, wps = 109921, train loss = 5.8666
Iteration 480, time = 1.87s, wps = 109230, train loss = 5.7077
Iteration 500, time = 1.86s, wps = 110046, train loss = 5.7194
Iteration 520, time = 1.87s, wps = 109540, train loss = 5.6760
Iteration 540, time = 1.88s, wps = 109009, train loss = 5.6580
Iteration 560, time = 1.88s, wps = 108887, train loss = 5.6510
Iteration 580, time = 1.86s, wps = 110089, train loss = 5.5753
Iteration 600, time = 1.87s, wps = 109397, train loss = 5.6717
Iteration 620, time = 1.87s, wps = 109230, train loss = 5.5207
Iteration 640, time = 1.88s, wps = 108879, train loss = 5.5245
Iteration 660, time = 1.89s, wps = 108263, train loss = 5.5972
Iteration 680, time = 1.89s, wps = 108135, train loss = 5.4818
Iteration 700, time = 1.89s, wps = 108405, train loss = 5.5303
Iteration 720, time = 1.88s, wps = 108909, train loss = 5.4665
Iteration 740, time = 1.89s, wps = 108593, train loss = 5.4456
Iteration 760, time = 1.88s, wps = 108751, train loss = 5.4600
Iteration 780, time = 1.88s, wps = 109065, train loss = 5.4073
Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00022-of-00100
Finished processing!
Iteration 800, time = 3.51s, wps = 58319, train loss = 5.4404
Iteration 820, time = 1.88s, wps = 109193, train loss = 5.4586
Iteration 840, time = 1.87s, wps = 109302, train loss = 5.3758
Iteration 860, time = 1.88s, wps = 108811, train loss = 5.3544
Iteration 880, time = 1.87s, wps = 109356, train loss = 5.3205
Iteration 900, time = 1.87s, wps = 109324, train loss = 5.2878
Iteration 920, time = 1.89s, wps = 108139, train loss = 5.3638
Iteration 940, time = 1.88s, wps = 108747, train loss = 5.2605
Iteration 960, time = 1.88s, wps = 109142, train loss = 5.2760
Iteration 980, time = 1.89s, wps = 108162, train loss = 5.2918
Iteration 1000, time = 1.88s, wps = 108919, train loss = 5.2150
Iteration 1020, time = 1.87s, wps = 109596, train loss = 5.1953
Iteration 1040, time = 1.87s, wps = 109478, train loss = 5.2023
Iteration 1060, time = 1.89s, wps = 108396, train loss = 5.2692
Iteration 1080, time = 1.88s, wps = 109007, train loss = 5.1498
Iteration 1100, time = 1.89s, wps = 108594, train loss = 5.1820
Iteration 1120, time = 1.88s, wps = 108792, train loss = 5.2077
Iteration 1140, time = 1.89s, wps = 108503, train loss = 5.1593
Iteration 1160, time = 1.89s, wps = 108489, train loss = 5.1783
Iteration 1180, time = 1.88s, wps = 108788, train loss = 5.0927
Iteration 1200, time = 1.89s, wps = 108607, train loss = 5.0954
Iteration 1220, time = 1.87s, wps = 109412, train loss = 5.0597
Iteration 1240, time = 1.87s, wps = 109545, train loss = 5.1182
Iteration 1260, time = 1.89s, wps = 108413, train loss = 5.0832
Iteration 1280, time = 1.87s, wps = 109477, train loss = 5.1334
Iteration 1300, time = 1.88s, wps = 108826, train loss = 5.0809
Iteration 1320, time = 1.88s, wps = 108895, train loss = 5.0608
Iteration 1340, time = 1.88s, wps = 108668, train loss = 5.0820
Iteration 1360, time = 1.90s, wps = 107798, train loss = 5.0008
Iteration 1380, time = 1.90s, wps = 107665, train loss = 5.0323
Iteration 1400, time = 1.89s, wps = 108191, train loss = 5.0376
Iteration 1420, time = 1.89s, wps = 108262, train loss = 4.9649
Iteration 1440, time = 1.88s, wps = 108705, train loss = 5.0356
Iteration 1460, time = 1.90s, wps = 107937, train loss = 4.9902
Iteration 1480, time = 1.89s, wps = 108267, train loss = 4.9792
Iteration 1500, time = 1.87s, wps = 109542, train loss = 4.9815
Iteration 1520, time = 1.92s, wps = 106699, train loss = 5.0326
Iteration 1540, time = 1.90s, wps = 107971, train loss = 5.0022
Iteration 1560, time = 1.89s, wps = 108266, train loss = 5.0133
Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00064-of-00100
Finished processing!
/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
  warnings.warn("Attempting to use a closed FileWriter. "

real    3m14.884s
user    23m40.457s
sys     4m53.393s
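Sanity-checking the steady-state throughput of the 4-GPU run (my arithmetic, not printed by the session):

    words per iteration = batch_size × num_steps × num_gpus = 128 × 20 × 4 = 10,240
    wps ≈ 10,240 × 20 iterations / 1.88 s ≈ 109,000   (matches the logged values)

The time output is consistent too: 23m40s of user CPU over 3m15s of wall clock means roughly 7.3 cores were busy on average feeding the four GPUs.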
root@597e78370cf3:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=3 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

*****HYPER PARAMETERS*****
{'num_sampled': 8192, 'num_shards': 8, 'vocab_size': 793470, 'state_size': 2048, 'num_delayed_steps': 150, 'do_summaries': False, 'num_steps': 20, 'projected_size': 512, 'run_profiler': False, 'batch_size': 128, 'learning_rate': 0.2, 'max_grad_norm': 10.0, 'optimizer': 0, 'keep_prob': 0.9, 'num_gpus': 3, 'emb_size': 512, 'average_params': True, 'num_layers': 1, 'max_time': 180}
**************************
[... the same deprecation warnings as in the 4-GPU run elided ...]
Current time: 1593804818.1887305
ALL VARIABLES
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
[... ALL VARIABLES and TRAINABLE VARIABLES listings identical to the 4-GPU run elided ...]
LOCAL VARIABLES
model/model/state_0_0:0 (128, 2560) /gpu:0
model/model_1/state_1_0:0 (128, 2560) /gpu:1
model/model_2/state_2_0:0 (128, 2560) /gpu:2
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-07-03 19:33:38.726489: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900010000 Hz
2020-07-03 19:33:38.732897: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x9cf9dd0 executing computations on platform Host. Devices:
2020-07-03 19:33:38.732941: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-03 19:33:39.222730: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-03 19:33:39.246306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-03 19:33:39.253308: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-03 19:33:39.254214: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x9cf97f0 executing computations on platform CUDA. Devices:
2020-07-03 19:33:39.254230: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2020-07-03 19:33:39.254235: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (1): TITAN RTX, Compute Capability 7.5
2020-07-03 19:33:39.254239: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-03 19:33:39.254245: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-03 19:33:39.255185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
totalMemory: 23.65GiB freeMemory: 23.22GiB
2020-07-03 19:33:39.255216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:21:00.0
totalMemory: 23.65GiB freeMemory: 23.49GiB
2020-07-03 19:33:39.255241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:4a:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2020-07-03 19:33:39.255265: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:4b:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2020-07-03 19:33:39.255291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3
2020-07-03 19:33:39.890951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:33:39.890988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 2 3
2020-07-03 19:33:39.890993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N N N N
2020-07-03 19:33:39.890996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   N N N N
2020-07-03 19:33:39.891002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2:   N N N N
2020-07-03 19:33:39.891006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3:   N N N N
2020-07-03 19:33:39.891138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22507 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-07-03 19:33:39.891704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5)
2020-07-03 19:33:39.891871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:4a:00.0, compute capability: 7.5)
2020-07-03 19:33:39.892129: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00072-of-00100
Finished processing!
2020-07-03 19:33:52.438746: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally
Iteration 1579, time = 7.86s, wps = 977, train loss = 5.5537
Iteration 1580, time = 5.92s, wps = 1297, train loss = 5.1107
Iteration 1581, time = 0.08s, wps = 91504, train loss = 4.9729
Iteration 1582, time = 0.08s, wps = 91408, train loss = 4.9435
Iteration 1583, time = 0.08s, wps = 99830, train loss = 4.9171
Iteration 1584, time = 0.08s, wps = 101344, train loss = 4.9854
Iteration 1585, time = 0.07s, wps = 102505, train loss = 4.9139
Iteration 1586, time = 0.08s, wps = 92469, train loss = 4.8910
Iteration 1587, time = 0.08s, wps = 100163, train loss = 4.9810
Iteration 1598, time = 0.85s, wps = 99640, train loss = 4.8773
Iteration 1618, time = 1.51s, wps = 101428, train loss = 4.9599
Iteration 1638, time = 1.52s, wps = 101278, train loss = 4.9465
Iteration 1658, time = 1.52s, wps = 101271, train loss = 4.8812
Iteration 1678, time = 1.52s, wps = 101110, train loss = 4.8537
Iteration 1698, time = 1.50s, wps = 102083, train loss = 4.9318
Iteration 1718, time = 1.51s, wps = 101666, train loss = 4.9131
Iteration 1738, time = 1.52s, wps = 101023, train loss = 4.9077
Iteration 1758, time = 1.52s, wps = 100986, train loss = 4.8953
Iteration 1778, time = 1.51s, wps = 101669, train loss = 4.8407
Iteration 1798, time = 1.51s, wps = 102051, train loss = 4.8164
Iteration 1818, time = 1.51s, wps = 101489, train loss = 4.8556
Iteration 1838, time = 1.52s, wps = 101003, train loss = 4.8904
Iteration 1858, time = 1.51s, wps = 101719, train loss = 4.8287
Iteration 1878, time = 1.52s, wps = 100849, train loss = 4.8711
Iteration 1898, time = 1.52s, wps = 101264, train loss = 4.8240
Iteration 1918, time = 1.51s, wps = 101511, train loss = 4.7784
Iteration 1938, time = 1.51s, wps = 101826, train loss = 4.8181
Iteration 1958, time = 1.52s, wps = 101364, train loss = 4.8680
Iteration 1978, time = 1.53s, wps = 100646, train loss = 4.8207
Iteration 1998, time = 1.51s, wps = 101604, train loss = 4.8724
Iteration 2018, time = 1.52s, wps = 101001, train loss = 4.7667
Iteration 2038, time = 1.53s, wps = 100358, train loss = 4.8414
Iteration 2058, time = 1.52s, wps = 101183, train loss = 4.8288
Iteration 2078, time = 1.53s, wps = 100464, train loss = 4.8820
Iteration 2098, time = 1.53s, wps = 100515, train loss = 4.8259
Iteration 2118, time = 1.52s, wps = 101041, train loss = 4.7697
Iteration 2138, time = 1.54s, wps = 100026, train loss = 4.7567
Iteration 2158, time = 1.52s, wps = 101135, train loss = 4.7607
Iteration 2178, time = 1.52s, wps = 101272, train loss = 4.7536
Iteration 2198, time = 1.53s, wps = 100399, train loss = 4.7824
Iteration 2218, time = 1.52s, wps = 100871, train loss = 4.7847
Iteration 2238, time = 1.51s, wps = 101889, train loss = 4.7074
Iteration 2258, time = 1.51s, wps = 101814, train loss = 4.7834
Iteration 2278, time = 1.51s, wps = 101776, train loss = 4.7296
Iteration 2298, time = 1.52s, wps = 101290, train loss = 4.7979
Iteration 2318, time = 1.52s, wps = 101178, train loss = 4.7232
Iteration 2338, time = 1.51s, wps = 101422, train loss = 4.6287
Iteration 2358, time = 1.51s, wps = 101439, train loss = 4.7152
Iteration 2378, time = 1.53s, wps = 100342, train loss = 4.7465
Iteration 2398, time = 1.51s, wps = 101655, train loss = 4.7107
Iteration 2418, time = 1.52s, wps = 101013, train loss = 4.7414
Iteration 2438, time = 1.53s, wps = 100134, train loss = 4.6817
Iteration 2458, time = 1.53s, wps = 100157, train loss = 4.7062
Iteration 2478, time = 1.52s, wps = 100999, train loss = 4.7028
Iteration 2498, time = 1.52s, wps = 101227, train loss = 4.7475
Iteration 2518, time = 1.52s, wps = 101248, train loss = 4.7537
Iteration 2538, time = 1.53s, wps = 100408, train loss = 4.7804
Iteration 2558, time = 1.52s, wps = 100913, train loss = 4.6445
Iteration 2578, time = 1.52s, wps = 101384, train loss = 4.7710
Iteration 2598, time = 1.52s, wps = 101053, train loss = 4.6987
Iteration 2618, time = 1.52s, wps = 101301, train loss = 4.7574
Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00081-of-00100
Finished processing!
Iteration 2638, time = 3.15s, wps = 48789, train loss = 4.7045 Iteration 2658, time = 1.51s, wps = 101424, train loss = 4.6030 Iteration 2678, time = 1.52s, wps = 101338, train loss = 4.6432 Iteration 2698, time = 1.51s, wps = 101390, train loss = 4.6316 Iteration 2718, time = 1.52s, wps = 101112, train loss = 4.5744 Iteration 2738, time = 1.53s, wps = 100371, train loss = 4.6487 Iteration 2758, time = 1.52s, wps = 100867, train loss = 4.6874 Iteration 2778, time = 1.52s, wps = 101243, train loss = 4.6121 Iteration 2798, time = 1.52s, wps = 101377, train loss = 4.5652 Iteration 2818, time = 1.52s, wps = 100729, train loss = 4.6431 Iteration 2838, time = 1.54s, wps = 99627, train loss = 4.6005 Iteration 2858, time = 1.52s, wps = 101065, train loss = 4.6142 Iteration 2878, time = 1.53s, wps = 100620, train loss = 4.5891 Iteration 2898, time = 1.52s, wps = 100916, train loss = 4.6390 Iteration 2918, time = 1.51s, wps = 101666, train loss = 4.6812 Iteration 2938, time = 1.52s, wps = 101069, train loss = 4.6531 Iteration 2958, time = 1.53s, wps = 100482, train loss = 4.6743 Iteration 2978, time = 1.52s, wps = 100785, train loss = 4.6259 Iteration 2998, time = 1.52s, wps = 101100, train loss = 4.6100 Iteration 3018, time = 1.52s, wps = 100813, train loss = 4.6069 Iteration 3038, time = 1.51s, wps = 101463, train loss = 4.5363 Iteration 3058, time = 1.52s, wps = 101227, train loss = 4.5576 Iteration 3078, time = 1.53s, wps = 100148, train loss = 4.5867 Iteration 3098, time = 1.52s, wps = 100818, train loss = 4.5534 Iteration 3118, time = 1.54s, wps = 99849, train loss = 4.6137 Iteration 3138, time = 1.52s, wps = 101230, train loss = 4.5812 Iteration 3158, time = 1.52s, wps = 101164, train loss = 4.5852 Iteration 3178, time = 1.51s, wps = 101709, train loss = 4.5923 Iteration 3198, time = 1.54s, wps = 99969, train loss = 4.5724 Iteration 3218, time = 1.52s, wps = 100856, train loss = 4.5565 Iteration 3238, time = 1.51s, wps = 101713, train loss = 4.6075 Iteration 3258, time = 1.53s, wps = 100318, train loss = 4.5752 Iteration 3278, time = 1.54s, wps = 99834, train loss = 4.5285 Iteration 3298, time = 1.52s, wps = 100765, train loss = 4.5045 Iteration 3318, time = 1.52s, wps = 100925, train loss = 4.5364 Iteration 3338, time = 1.53s, wps = 100513, train loss = 4.5821 Iteration 3358, time = 1.52s, wps = 101046, train loss = 4.5887 Iteration 3378, time = 1.53s, wps = 100393, train loss = 4.5052 Iteration 3398, time = 1.52s, wps = 100880, train loss = 4.5885 Iteration 3418, time = 1.52s, wps = 101229, train loss = 4.5018 Iteration 3438, time = 1.53s, wps = 100255, train loss = 4.5704 Iteration 3458, time = 1.52s, wps = 100882, train loss = 4.5531 Iteration 3478, time = 1.53s, wps = 100237, train loss = 4.5547 Iteration 3498, time = 1.52s, wps = 101029, train loss = 4.4756 Iteration 3518, time = 1.53s, wps = 100223, train loss = 4.6123 Iteration 3538, time = 1.53s, wps = 100302, train loss = 4.5621 Iteration 3558, time = 1.51s, wps = 101416, train loss = 4.4459 Iteration 3578, time = 1.54s, wps = 100055, train loss = 4.4796 Iteration 3598, time = 1.52s, wps = 100731, train loss = 4.5474 Iteration 3618, time = 1.54s, wps = 99844, train loss = 4.5311 Iteration 3638, time = 1.52s, wps = 100732, train loss = 4.5371 /usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. 
" real 3m11.944s user 20m38.696s sys 4m32.911s root@597e78370cf3:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word- language-modeling-benchmark-r13output WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue. *****HYPER PARAMETERS***** {'average_params': True, 'num_steps': 20, 'batch_size': 128, 'state_size': 2048, 'num_layers': 1, 'run_profiler': False, 'projected_size': 512, 'num_gpus': 2, 'do_summaries': False, 'num_shards': 8, 'learning_rate': 0.2, 'optimizer': 0, 'num_delayed_steps': 150, 'emb_size': 512, 'max_time': 180, 'num_sampled': 8192, 'vocab_size': 793470, 'max_grad_norm': 10.0, 'keep_prob': 0.9} ************************** WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. Current time: 1593805894.0171063 ALL VARIABLES WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Please use tf.global_variables instead. 
model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 model/global_step:0 () model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0 model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0 model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_b/Adagrad:0 (793470,) /gpu:0 model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0 TRAINABLE VARIABLES model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 LOCAL VARIABLES model/model/state_0_0:0 (128, 2560) /gpu:0 model/model_1/state_1_0:0 (128, 2560) /gpu:1 WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-07-03 19:51:34.425505: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900010000 Hz 2020-07-03 19:51:34.431739: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x8fb1480 executing computations on platform Host. 
Devices: 2020-07-03 19:51:34.431782: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2020-07-03 19:51:34.842064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-03 19:51:34.888682: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-03 19:51:34.900762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-03 19:51:34.901677: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x8fb0ea0 executing computations on platform CUDA. Devices: 2020-07-03 19:51:34.901707: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5 2020-07-03 19:51:34.901712: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): TITAN RTX, Compute Capability 7.5 2020-07-03 19:51:34.901717: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-07-03 19:51:34.901726: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-07-03 19:51:34.902741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:01:00.0 totalMemory: 23.65GiB freeMemory: 23.22GiB 2020-07-03 19:51:34.902770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 totalMemory: 23.65GiB freeMemory: 23.49GiB 2020-07-03 19:51:34.902792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:4a:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-07-03 19:51:34.902814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:4b:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-07-03 19:51:34.902837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3 2020-07-03 19:51:35.539739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-03 19:51:35.539779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 2020-07-03 19:51:35.539784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N N 2020-07-03 19:51:35.539788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N N N 2020-07-03 19:51:35.539793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N N N N 2020-07-03 19:51:35.539798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: N N N N 2020-07-03 19:51:35.539947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22507 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:01:00.0, compute capability: 7.5) 2020-07-03 19:51:35.540168: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5) 2020-07-03 19:51:35.540618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:4a:00.0, compute capability: 7.5) 2020-07-03 19:51:35.540777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5) WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00014-of-00100 Finished processing! 2020-07-03 19:51:44.867134: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally Iteration 3653, time = 5.34s, wps = 959, train loss = 4.7756 Iteration 3654, time = 3.48s, wps = 1473, train loss = 4.5576 Iteration 3655, time = 0.07s, wps = 74453, train loss = 4.4899 Iteration 3656, time = 0.06s, wps = 80713, train loss = 4.5061 Iteration 3657, time = 0.07s, wps = 75534, train loss = 4.4789 Iteration 3658, time = 0.06s, wps = 84420, train loss = 4.4402 Iteration 3659, time = 0.06s, wps = 85216, train loss = 4.5101 Iteration 3660, time = 0.06s, wps = 90407, train loss = 4.5272 Iteration 3661, time = 0.06s, wps = 89802, train loss = 4.5207 Iteration 3672, time = 0.64s, wps = 87726, train loss = 4.5275 Iteration 3692, time = 1.16s, wps = 88279, train loss = 4.5224 Iteration 3712, time = 1.16s, wps = 87971, train loss = 4.5066 Iteration 3732, time = 1.16s, wps = 87987, train loss = 4.5542 Iteration 3752, time = 1.15s, wps = 88860, train loss = 4.4935 Iteration 3772, time = 1.16s, wps = 88527, train loss = 4.5161 Iteration 3792, time = 1.16s, wps = 88554, train loss = 4.4771 Iteration 3812, time = 1.15s, wps = 89025, train loss = 4.5199 Iteration 3832, time = 1.17s, wps = 87734, train loss = 4.4939 Iteration 3852, time = 1.17s, wps = 87711, train loss = 4.4937 Iteration 3872, time = 1.17s, wps = 87337, train loss = 4.5844 Iteration 3892, time = 1.15s, wps = 89086, train loss = 4.4696 Iteration 3912, time = 1.16s, wps = 88391, train loss = 4.4883 Iteration 3932, time = 1.16s, wps = 88648, train loss = 4.5037 Iteration 3952, time = 1.17s, wps = 87659, train loss = 4.5307 Iteration 3972, time = 1.16s, wps = 88118, train loss = 4.5038 Iteration 3992, time = 1.15s, wps = 89097, train loss = 4.4403 Iteration 4012, time = 1.17s, wps = 87610, train loss = 4.5467 Iteration 4032, time = 1.17s, wps = 87340, train loss = 4.4343 Iteration 4052, time = 1.16s, wps = 88410, train loss = 
4.4333 Iteration 4072, time = 1.17s, wps = 87618, train loss = 4.5432 Iteration 4092, time = 1.15s, wps = 88685, train loss = 4.5087 Iteration 4112, time = 1.16s, wps = 88539, train loss = 4.4467 Iteration 4132, time = 1.17s, wps = 87431, train loss = 4.4145 Iteration 4152, time = 1.15s, wps = 88772, train loss = 4.5009 Iteration 4172, time = 1.17s, wps = 87409, train loss = 4.4889 Iteration 4192, time = 1.16s, wps = 88383, train loss = 4.3925 Iteration 4212, time = 1.17s, wps = 87818, train loss = 4.4492 Iteration 4232, time = 1.16s, wps = 88034, train loss = 4.3889 Iteration 4252, time = 1.17s, wps = 87564, train loss = 4.4661 Iteration 4272, time = 1.16s, wps = 87931, train loss = 4.4331 Iteration 4292, time = 1.16s, wps = 88307, train loss = 4.4545 Iteration 4312, time = 1.15s, wps = 88817, train loss = 4.5601 Iteration 4332, time = 1.16s, wps = 88073, train loss = 4.4938 Iteration 4352, time = 1.17s, wps = 87206, train loss = 4.4806 Iteration 4372, time = 1.15s, wps = 88668, train loss = 4.4696 Iteration 4392, time = 1.15s, wps = 88747, train loss = 4.4087 Iteration 4412, time = 1.17s, wps = 87722, train loss = 4.4440 Iteration 4432, time = 1.17s, wps = 87335, train loss = 4.3928 Iteration 4452, time = 1.17s, wps = 87847, train loss = 4.4463 Iteration 4472, time = 1.17s, wps = 87399, train loss = 4.4316 Iteration 4492, time = 1.17s, wps = 87728, train loss = 4.4249 Iteration 4512, time = 1.16s, wps = 87941, train loss = 4.4350 Iteration 4532, time = 1.18s, wps = 86452, train loss = 4.4960 Iteration 4552, time = 1.16s, wps = 88638, train loss = 4.4153 Iteration 4572, time = 1.17s, wps = 87333, train loss = 4.5179 Iteration 4592, time = 1.17s, wps = 87814, train loss = 4.3696 Iteration 4612, time = 1.17s, wps = 87666, train loss = 4.5361 Iteration 4632, time = 1.18s, wps = 86899, train loss = 4.4513 Iteration 4652, time = 1.17s, wps = 87613, train loss = 4.4511 Iteration 4672, time = 1.17s, wps = 87600, train loss = 4.4454 Iteration 4692, time = 1.17s, wps = 87451, train loss = 4.4334 Iteration 4712, time = 1.17s, wps = 87719, train loss = 4.5277 Iteration 4732, time = 1.17s, wps = 87290, train loss = 4.3750 Iteration 4752, time = 1.18s, wps = 86725, train loss = 4.4177 Iteration 4772, time = 1.17s, wps = 87352, train loss = 4.3706 Iteration 4792, time = 1.16s, wps = 88009, train loss = 4.3425 Iteration 4812, time = 1.18s, wps = 86988, train loss = 4.4742 Iteration 4832, time = 1.15s, wps = 88784, train loss = 4.3850 Iteration 4852, time = 1.17s, wps = 87884, train loss = 4.4209 Iteration 4872, time = 1.17s, wps = 87649, train loss = 4.4352 Iteration 4892, time = 1.17s, wps = 87384, train loss = 4.3977 Iteration 4912, time = 1.17s, wps = 87346, train loss = 4.5104 Iteration 4932, time = 1.18s, wps = 86639, train loss = 4.3605 Iteration 4952, time = 1.17s, wps = 87882, train loss = 4.3836 Iteration 4972, time = 1.17s, wps = 87545, train loss = 4.4097 Iteration 4992, time = 1.17s, wps = 87835, train loss = 4.3750 Iteration 5012, time = 1.18s, wps = 86914, train loss = 4.4514 Iteration 5032, time = 1.17s, wps = 87210, train loss = 4.4559 Iteration 5052, time = 1.18s, wps = 86846, train loss = 4.3849 Iteration 5072, time = 1.17s, wps = 87666, train loss = 4.3532 Iteration 5092, time = 1.16s, wps = 88059, train loss = 4.4372 Iteration 5112, time = 1.19s, wps = 85905, train loss = 4.3629 Iteration 5132, time = 1.18s, wps = 86944, train loss = 4.4283 Iteration 5152, time = 1.17s, wps = 87758, train loss = 4.5261 Iteration 5172, time = 1.18s, wps = 86932, train loss = 4.3602 Iteration 5192, 
time = 1.18s, wps = 87108, train loss = 4.4402 Iteration 5212, time = 1.17s, wps = 87225, train loss = 4.3446 Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00051-of-00100 Finished processing! Iteration 5232, time = 2.77s, wps = 37033, train loss = 4.4805 Iteration 5252, time = 1.18s, wps = 86913, train loss = 4.2759 Iteration 5272, time = 1.17s, wps = 87532, train loss = 4.4370 Iteration 5292, time = 1.15s, wps = 88675, train loss = 4.3531 Iteration 5312, time = 1.18s, wps = 86958, train loss = 4.4307 Iteration 5332, time = 1.17s, wps = 87439, train loss = 4.4507 Iteration 5352, time = 1.17s, wps = 87776, train loss = 4.3245 Iteration 5372, time = 1.18s, wps = 86993, train loss = 4.4167 Iteration 5392, time = 1.18s, wps = 86626, train loss = 4.4126 Iteration 5412, time = 1.18s, wps = 86940, train loss = 4.4627 Iteration 5432, time = 1.18s, wps = 86719, train loss = 4.4457 Iteration 5452, time = 1.17s, wps = 87395, train loss = 4.3689 Iteration 5472, time = 1.16s, wps = 88363, train loss = 4.3004 Iteration 5492, time = 1.17s, wps = 87818, train loss = 4.3892 Iteration 5512, time = 1.17s, wps = 87400, train loss = 4.4017 Iteration 5532, time = 1.17s, wps = 87517, train loss = 4.3807 Iteration 5552, time = 1.17s, wps = 87828, train loss = 4.3584 Iteration 5572, time = 1.18s, wps = 86630, train loss = 4.3685 Iteration 5592, time = 1.18s, wps = 86605, train loss = 4.3768 Iteration 5612, time = 1.17s, wps = 87462, train loss = 4.3516 Iteration 5632, time = 1.18s, wps = 86873, train loss = 4.3360 Iteration 5652, time = 1.17s, wps = 87857, train loss = 4.3361 Iteration 5672, time = 1.17s, wps = 87545, train loss = 4.3155 Iteration 5692, time = 1.18s, wps = 87093, train loss = 4.3372 Iteration 5712, time = 1.17s, wps = 87243, train loss = 4.4086 Iteration 5732, time = 1.17s, wps = 87674, train loss = 4.3631 Iteration 5752, time = 1.17s, wps = 87692, train loss = 4.4276 Iteration 5772, time = 1.18s, wps = 86693, train loss = 4.3342 Iteration 5792, time = 1.17s, wps = 87505, train loss = 4.3239 Iteration 5812, time = 1.17s, wps = 87388, train loss = 4.3058 Iteration 5832, time = 1.19s, wps = 86024, train loss = 4.3832 Iteration 5852, time = 1.17s, wps = 87464, train loss = 4.3853 Iteration 5872, time = 1.18s, wps = 87144, train loss = 4.2992 Iteration 5892, time = 1.17s, wps = 87415, train loss = 4.3495 Iteration 5912, time = 1.17s, wps = 87785, train loss = 4.3024 Iteration 5932, time = 1.18s, wps = 86725, train loss = 4.2271 Iteration 5952, time = 1.19s, wps = 86233, train loss = 4.3730 Iteration 5972, time = 1.18s, wps = 86944, train loss = 4.1784 Iteration 5992, time = 1.18s, wps = 86789, train loss = 4.3241 Iteration 6012, time = 1.17s, wps = 87501, train loss = 4.3282 Iteration 6032, time = 1.18s, wps = 86599, train loss = 4.3822 Iteration 6052, time = 1.19s, wps = 86119, train loss = 4.2120 Iteration 6072, time = 1.17s, wps = 87315, train loss = 4.2384 Iteration 6092, time = 1.18s, wps = 86669, train loss = 4.3881 Iteration 6112, time = 1.18s, wps = 87019, train loss = 4.2586 Iteration 6132, time = 1.18s, wps = 86744, train loss = 4.2890 Iteration 6152, time = 1.18s, wps = 86851, train loss = 4.3376 Iteration 6172, time = 1.18s, wps = 87023, train loss = 4.3335 Iteration 6192, time = 1.18s, wps = 86842, train loss = 4.3771 Iteration 6212, time = 1.17s, wps = 87663, train loss = 4.3581 Iteration 6232, time = 1.17s, wps = 87695, train loss = 4.3586 Iteration 6252, time = 1.17s, wps = 87160, train loss = 4.3241 
Iteration 6272, time = 1.17s, wps = 87210, train loss = 4.2938 Iteration 6292, time = 1.20s, wps = 85514, train loss = 4.3199 Iteration 6312, time = 1.18s, wps = 86566, train loss = 4.3607 Iteration 6332, time = 1.20s, wps = 85465, train loss = 4.3021 Iteration 6352, time = 1.20s, wps = 85432, train loss = 4.2807 Iteration 6372, time = 1.18s, wps = 86670, train loss = 4.3033 Iteration 6392, time = 1.18s, wps = 86674, train loss = 4.3420 Iteration 6412, time = 1.18s, wps = 86736, train loss = 4.3865 Iteration 6432, time = 1.19s, wps = 85908, train loss = 4.2495
/usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. "
real 3m10.391s
user 16m12.794s
sys 4m16.544s
root@597e78370cf3:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=1 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue.
*****HYPER PARAMETERS***** {'emb_size': 512, 'batch_size': 128, 'num_gpus': 1, 'learning_rate': 0.2, 'run_profiler': False, 'num_shards': 8, 'optimizer': 0, 'state_size': 2048, 'max_time': 180, 'average_params': True, 'num_steps': 20, 'max_grad_norm': 10.0, 'num_sampled': 8192, 'num_delayed_steps': 150, 'num_layers': 1, 'vocab_size': 793470, 'do_summaries': False, 'projected_size': 512, 'keep_prob': 0.9} **************************
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating: Use tf.cast instead. Current time: 1593808604.3147295 ALL VARIABLES WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Please use tf.global_variables instead. model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 model/global_step:0 () model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0 model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0 model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_b/Adagrad:0 (793470,) /gpu:0 model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0 TRAINABLE VARIABLES model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 LOCAL VARIABLES model/model/state_0_0:0 (128, 2560) /gpu:0 WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. 
Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-07-03 20:36:44.517509: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900010000 Hz 2020-07-03 20:36:44.523761: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x750ea70 executing computations on platform Host. Devices: 2020-07-03 20:36:44.523803: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2020-07-03 20:36:44.970803: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-03 20:36:44.975630: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-03 20:36:44.982691: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2020-07-03 20:36:44.983633: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x750d8d0 executing computations on platform CUDA. Devices: 2020-07-03 20:36:44.983666: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5 2020-07-03 20:36:44.983673: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): TITAN RTX, Compute Capability 7.5 2020-07-03 20:36:44.983678: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-07-03 20:36:44.983685: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-07-03 20:36:44.984778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:01:00.0 totalMemory: 23.65GiB freeMemory: 23.22GiB 2020-07-03 20:36:44.984807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 totalMemory: 23.65GiB freeMemory: 23.49GiB 2020-07-03 20:36:44.984830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:4a:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-07-03 20:36:44.984853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:4b:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-07-03 20:36:44.984877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3 2020-07-03 20:36:45.615322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-07-03 20:36:45.615371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 2020-07-03 20:36:45.615377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N N 2020-07-03 20:36:45.615382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N N N 2020-07-03 20:36:45.615386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N N N N 2020-07-03 20:36:45.615391: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: N N N N 2020-07-03 20:36:45.615533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22507 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:01:00.0, compute capability: 7.5) 2020-07-03 20:36:45.615943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5) 2020-07-03 20:36:45.616219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:4a:00.0, compute capability: 7.5) 2020-07-03 20:36:45.616500: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5) WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00059-of-00100 Finished processing! 
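Once the single-GPU run below reaches steady state it settles around 53k-55k wps, versus roughly 87k-89k wps for the two-GPU run above. A small sketch to put numbers on the scaling; the figures are representative values read off the two logs in this session, not script output.

# Approximate steady-state throughput per run in this session (read off the logs).
runs = {1: 54000, 2: 88000}   # num_gpus -> words per second

base = runs[1]
for gpus, wps in sorted(runs.items()):
    speedup = wps / base
    print(f"{gpus} GPU(s): ~{wps} wps, speedup {speedup:.2f}x, "
          f"parallel efficiency {speedup / gpus:.0%}")
# 2 GPUs give ~1.63x, i.e. about 81% scaling efficiency at this batch size.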
2020-07-03 20:36:52.451343: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally Iteration 6451, time = 3.50s, wps = 731, train loss = 4.6777 Iteration 6452, time = 1.77s, wps = 1450, train loss = 4.2731 Iteration 6453, time = 0.06s, wps = 45511, train loss = 4.4238 Iteration 6454, time = 0.06s, wps = 46203, train loss = 4.2444 Iteration 6455, time = 0.05s, wps = 53039, train loss = 4.4297 Iteration 6456, time = 0.05s, wps = 53224, train loss = 4.2471 Iteration 6457, time = 0.05s, wps = 53967, train loss = 4.3129 Iteration 6458, time = 0.05s, wps = 54593, train loss = 4.3913 Iteration 6459, time = 0.06s, wps = 40499, train loss = 4.4073 Iteration 6470, time = 0.52s, wps = 54272, train loss = 4.2758 Iteration 6490, time = 0.93s, wps = 55163, train loss = 4.3338 Iteration 6510, time = 0.93s, wps = 54766, train loss = 4.3048 Iteration 6530, time = 0.93s, wps = 54797, train loss = 4.3544 Iteration 6550, time = 0.95s, wps = 54151, train loss = 4.3956 Iteration 6570, time = 0.94s, wps = 54509, train loss = 4.3126 Iteration 6590, time = 0.95s, wps = 54103, train loss = 4.3722 Iteration 6610, time = 0.92s, wps = 55470, train loss = 4.3549 Iteration 6630, time = 0.94s, wps = 54210, train loss = 4.4096 Iteration 6650, time = 0.94s, wps = 54698, train loss = 4.3494 Iteration 6670, time = 0.94s, wps = 54432, train loss = 4.2923 Iteration 6690, time = 0.93s, wps = 54791, train loss = 4.4391 Iteration 6710, time = 0.94s, wps = 54739, train loss = 4.2547 Iteration 6730, time = 0.93s, wps = 54892, train loss = 4.2306 Iteration 6750, time = 0.94s, wps = 54363, train loss = 4.3636 Iteration 6770, time = 0.94s, wps = 54670, train loss = 4.3918 Iteration 6790, time = 0.94s, wps = 54705, train loss = 4.4057 Iteration 6810, time = 0.94s, wps = 54360, train loss = 4.3237 Iteration 6830, time = 0.94s, wps = 54236, train loss = 4.1814 Iteration 6850, time = 0.93s, wps = 54878, train loss = 4.3886 Iteration 6870, time = 0.94s, wps = 54210, train loss = 4.1603 Iteration 6890, time = 0.97s, wps = 52919, train loss = 4.3085 Iteration 6910, time = 0.95s, wps = 53727, train loss = 4.2235 Iteration 6930, time = 0.94s, wps = 54536, train loss = 4.3242 Iteration 6950, time = 0.93s, wps = 55015, train loss = 4.4212 Iteration 6970, time = 0.94s, wps = 54758, train loss = 4.2456 Iteration 6990, time = 0.93s, wps = 54810, train loss = 4.2334 Iteration 7010, time = 0.93s, wps = 54860, train loss = 4.3451 Iteration 7030, time = 0.94s, wps = 54717, train loss = 4.2675 Iteration 7050, time = 0.95s, wps = 53783, train loss = 4.4030 Iteration 7070, time = 0.93s, wps = 54777, train loss = 4.2852 Iteration 7090, time = 0.96s, wps = 53456, train loss = 4.3537 Iteration 7110, time = 0.95s, wps = 53742, train loss = 4.3489 Iteration 7130, time = 0.96s, wps = 53277, train loss = 4.2176 Iteration 7150, time = 0.95s, wps = 54134, train loss = 4.3151 Iteration 7170, time = 0.95s, wps = 53755, train loss = 4.3400 Iteration 7190, time = 0.94s, wps = 54634, train loss = 4.3765 Iteration 7210, time = 0.96s, wps = 53219, train loss = 4.3329 Iteration 7230, time = 0.95s, wps = 53796, train loss = 4.3129 Iteration 7250, time = 0.95s, wps = 54172, train loss = 4.3271 Iteration 7270, time = 0.95s, wps = 53781, train loss = 4.3494 Iteration 7290, time = 0.94s, wps = 54209, train loss = 4.2962 Iteration 7310, time = 0.94s, wps = 54372, train loss = 4.2353 Iteration 7330, time = 0.95s, wps = 53921, train loss = 4.3164 Iteration 7350, time = 0.93s, wps = 54963, train loss = 4.3208 Iteration 7370, time = 
0.95s, wps = 53924, train loss = 4.3438 Iteration 7390, time = 0.96s, wps = 53603, train loss = 4.2505 Iteration 7410, time = 0.95s, wps = 54116, train loss = 4.2347 Iteration 7430, time = 0.94s, wps = 54421, train loss = 4.4168 Iteration 7450, time = 0.93s, wps = 54854, train loss = 4.1881 Iteration 7470, time = 0.94s, wps = 54679, train loss = 4.2033 Iteration 7490, time = 0.96s, wps = 53527, train loss = 4.3865 Iteration 7510, time = 0.94s, wps = 54310, train loss = 4.3096 Iteration 7530, time = 0.95s, wps = 54170, train loss = 4.3442 Iteration 7550, time = 0.94s, wps = 54686, train loss = 4.3252 Iteration 7570, time = 0.94s, wps = 54347, train loss = 4.3073 Iteration 7590, time = 0.95s, wps = 54091, train loss = 4.3619 Iteration 7610, time = 0.95s, wps = 53685, train loss = 4.3391 Iteration 7630, time = 0.96s, wps = 53207, train loss = 4.3219 Iteration 7650, time = 0.94s, wps = 54180, train loss = 4.1954 Iteration 7670, time = 0.94s, wps = 54599, train loss = 4.4075 Iteration 7690, time = 0.95s, wps = 53923, train loss = 4.2891 Iteration 7710, time = 0.95s, wps = 53824, train loss = 4.3469 Iteration 7730, time = 0.94s, wps = 54410, train loss = 4.2807 Iteration 7750, time = 0.94s, wps = 54629, train loss = 4.2085 Iteration 7770, time = 0.98s, wps = 52247, train loss = 4.3223 Iteration 7790, time = 0.95s, wps = 53739, train loss = 4.2517 Iteration 7810, time = 0.95s, wps = 53939, train loss = 4.3064 Iteration 7830, time = 0.95s, wps = 53789, train loss = 4.3574 Iteration 7850, time = 0.97s, wps = 52701, train loss = 4.2633 Iteration 7870, time = 0.95s, wps = 54133, train loss = 4.3462 Iteration 7890, time = 0.96s, wps = 53375, train loss = 4.3864 Iteration 7910, time = 0.94s, wps = 54262, train loss = 4.2801 Iteration 7930, time = 0.95s, wps = 53840, train loss = 4.2710 Iteration 7950, time = 0.97s, wps = 52935, train loss = 4.3189 Iteration 7970, time = 0.95s, wps = 53733, train loss = 4.3592 Iteration 7990, time = 0.94s, wps = 54449, train loss = 4.3028 Iteration 8010, time = 0.96s, wps = 53204, train loss = 4.3420 Iteration 8030, time = 0.96s, wps = 53254, train loss = 4.2715 Iteration 8050, time = 0.94s, wps = 54257, train loss = 4.2917 Iteration 8070, time = 0.94s, wps = 54267, train loss = 4.2659 Iteration 8090, time = 0.94s, wps = 54352, train loss = 4.2719 Iteration 8110, time = 0.97s, wps = 52845, train loss = 4.2686 Iteration 8130, time = 0.94s, wps = 54502, train loss = 4.3949 Iteration 8150, time = 0.96s, wps = 53450, train loss = 4.3552 Iteration 8170, time = 0.94s, wps = 54527, train loss = 4.2444 Iteration 8190, time = 0.95s, wps = 54091, train loss = 4.2280 Iteration 8210, time = 0.95s, wps = 53792, train loss = 4.3938 Iteration 8230, time = 0.96s, wps = 53549, train loss = 4.3159 Iteration 8250, time = 0.95s, wps = 53871, train loss = 4.2705 Iteration 8270, time = 0.96s, wps = 53331, train loss = 4.3917 Iteration 8290, time = 0.96s, wps = 53465, train loss = 4.3766 Iteration 8310, time = 0.95s, wps = 53646, train loss = 4.3086 Iteration 8330, time = 0.97s, wps = 52948, train loss = 4.3424 Iteration 8350, time = 0.95s, wps = 53744, train loss = 4.3699 Iteration 8370, time = 0.96s, wps = 53333, train loss = 4.3843 Iteration 8390, time = 0.96s, wps = 53567, train loss = 4.1813 Iteration 8410, time = 0.94s, wps = 54294, train loss = 4.2914 Iteration 8430, time = 0.96s, wps = 53206, train loss = 4.4531 Iteration 8450, time = 0.96s, wps = 53365, train loss = 4.3233 Iteration 8470, time = 0.97s, wps = 52895, train loss = 4.1578 Iteration 8490, time = 0.96s, wps = 53230, train 
loss = 4.2882 Iteration 8510, time = 0.96s, wps = 53325, train loss = 4.3853 Iteration 8530, time = 0.94s, wps = 54242, train loss = 4.3186 Iteration 8550, time = 0.96s, wps = 53492, train loss = 4.3456 Iteration 8570, time = 0.94s, wps = 54409, train loss = 4.2145 Iteration 8590, time = 0.96s, wps = 53373, train loss = 4.3739 Iteration 8610, time = 0.96s, wps = 53397, train loss = 4.2908 Iteration 8630, time = 0.95s, wps = 53660, train loss = 4.2461 Iteration 8650, time = 0.96s, wps = 53547, train loss = 4.3158 Iteration 8670, time = 0.96s, wps = 53307, train loss = 4.4578 Iteration 8690, time = 0.96s, wps = 53192, train loss = 4.1869 Iteration 8710, time = 0.97s, wps = 52712, train loss = 4.1915 Iteration 8730, time = 0.96s, wps = 53317, train loss = 4.2524 Iteration 8750, time = 0.94s, wps = 54616, train loss = 4.3421 Iteration 8770, time = 0.96s, wps = 53476, train loss = 4.3493 Iteration 8790, time = 0.96s, wps = 53349, train loss = 4.3034 Iteration 8810, time = 0.96s, wps = 53184, train loss = 4.1477 Iteration 8830, time = 0.95s, wps = 53921, train loss = 4.1954 Iteration 8850, time = 0.96s, wps = 53561, train loss = 4.2916 Iteration 8870, time = 0.97s, wps = 52957, train loss = 4.3326 Iteration 8890, time = 0.95s, wps = 53622, train loss = 4.2821 Iteration 8910, time = 0.96s, wps = 53504, train loss = 4.3059 Iteration 8930, time = 0.96s, wps = 53121, train loss = 4.2231 Iteration 8950, time = 0.96s, wps = 53248, train loss = 4.3064 Iteration 8970, time = 0.96s, wps = 53268, train loss = 4.3420 Iteration 8990, time = 0.95s, wps = 53728, train loss = 4.2802 Iteration 9010, time = 0.96s, wps = 53448, train loss = 4.2714 Iteration 9030, time = 0.97s, wps = 52781, train loss = 4.2865 Iteration 9050, time = 0.96s, wps = 53412, train loss = 4.2792 Iteration 9070, time = 0.96s, wps = 53472, train loss = 4.3132 Iteration 9090, time = 0.97s, wps = 52631, train loss = 4.2233 Iteration 9110, time = 0.96s, wps = 53059, train loss = 4.2813 Iteration 9130, time = 0.96s, wps = 53338, train loss = 4.4175 Iteration 9150, time = 0.97s, wps = 52937, train loss = 4.3164 Iteration 9170, time = 0.97s, wps = 53017, train loss = 4.2114 Iteration 9190, time = 0.96s, wps = 53255, train loss = 4.2360 Iteration 9210, time = 0.96s, wps = 53154, train loss = 4.4008 Iteration 9230, time = 0.95s, wps = 53616, train loss = 4.3528 Iteration 9250, time = 0.98s, wps = 52467, train loss = 4.3540 Iteration 9270, time = 0.98s, wps = 52270, train loss = 4.2514 Iteration 9290, time = 0.98s, wps = 52296, train loss = 4.3603 Iteration 9310, time = 0.95s, wps = 53750, train loss = 4.1433 Iteration 9330, time = 0.96s, wps = 53146, train loss = 4.2757 Iteration 9350, time = 0.97s, wps = 52768, train loss = 4.2382 Iteration 9370, time = 0.97s, wps = 52657, train loss = 4.2875 Iteration 9390, time = 0.97s, wps = 52910, train loss = 4.4140 Iteration 9410, time = 0.96s, wps = 53273, train loss = 4.1057 Iteration 9430, time = 0.96s, wps = 53399, train loss = 4.2534 Iteration 9450, time = 0.98s, wps = 52321, train loss = 4.2354 Iteration 9470, time = 0.96s, wps = 53109, train loss = 4.2248 Iteration 9490, time = 0.97s, wps = 52890, train loss = 4.2501 Iteration 9510, time = 0.98s, wps = 52203, train loss = 4.1438 Iteration 9530, time = 0.96s, wps = 53223, train loss = 4.3459 Iteration 9550, time = 0.96s, wps = 53576, train loss = 4.2899 Iteration 9570, time = 0.96s, wps = 53587, train loss = 4.3883 Iteration 9590, time = 0.97s, wps = 52663, train loss = 4.2760 Iteration 9610, time = 0.96s, wps = 53156, train loss = 4.1458 Processing 
file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00091-of-00100 Finished processing! Iteration 9630, time = 2.56s, wps = 19962, train loss = 4.1909 Iteration 9650, time = 0.97s, wps = 52568, train loss = 4.2635 Iteration 9670, time = 0.96s, wps = 53077, train loss = 4.2646 Iteration 9690, time = 0.97s, wps = 52769, train loss = 4.2916 Iteration 9710, time = 0.97s, wps = 52691, train loss = 4.3301 Iteration 9730, time = 0.97s, wps = 52815, train loss = 4.3274 Iteration 9750, time = 0.97s, wps = 52826, train loss = 4.1663 Iteration 9770, time = 0.98s, wps = 52100, train loss = 4.2212 Iteration 9790, time = 0.98s, wps = 52123, train loss = 4.3171 Iteration 9810, time = 0.99s, wps = 51969, train loss = 4.2457 Iteration 9830, time = 0.98s, wps = 52337, train loss = 4.2302 Iteration 9850, time = 0.99s, wps = 51754, train loss = 4.2756 Iteration 9870, time = 1.00s, wps = 51096, train loss = 4.4026 Iteration 9890, time = 1.00s, wps = 51455, train loss = 4.2457 Iteration 9910, time = 0.97s, wps = 52886, train loss = 4.2137 Iteration 9930, time = 0.98s, wps = 52041, train loss = 4.2537 Iteration 9950, time = 1.00s, wps = 51172, train loss = 4.4090 Iteration 9970, time = 0.98s, wps = 52393, train loss = 4.3224 /usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. " real 3m8.664s user 9m18.069s sys 3m7.050s root@597e78370cf3:/workspace/nvidia-examples/big_lstm# cat /etc/os-release NAME="Ubuntu" VERSION="16.04.6 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.6 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial root@597e78370cf3:/workspace/nvidia-examples/big_lstm# nvcc -V nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2019 NVIDIA Corporation Built on Fri_Feb__8_19:08:17_PST_2019 Cuda compilation tools, release 10.1, V10.1.105 root@597e78370cf3:/workspace/nvidia-examples/big_lstm# cd data root@597e78370cf3:/workspace/nvidia-examples/big_lstm/data# ls 1-billion-word-language-modeling-benchmark-r13output root@597e78370cf3:/workspace/nvidia-examples/big_lstm/data# cd 1-billion-word-language-modeling-benchmark-r13output root@597e78370cf3:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output# ls 1b_word_vocab.txt heldout-monolingual.tokenized.shuffled README training-monolingual.tokenized.shuffled root@597e78370cf3:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output# cd training-monolingual.tokenized.shuffled root@597e78370cf3:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled# ls news.en-00001-of-00100 news.en-00034-of-00100 news.en-00067-of-00100 news.en-00002-of-00100 news.en-00035-of-00100 news.en-00068-of-00100 news.en-00003-of-00100 news.en-00036-of-00100 news.en-00069-of-00100 news.en-00004-of-00100 news.en-00037-of-00100 news.en-00070-of-00100 news.en-00005-of-00100 news.en-00038-of-00100 news.en-00071-of-00100 news.en-00006-of-00100 news.en-00039-of-00100 news.en-00072-of-00100 news.en-00007-of-00100 news.en-00040-of-00100 news.en-00073-of-00100 news.en-00008-of-00100 news.en-00041-of-00100 
news.en-00074-of-00100 news.en-00009-of-00100 news.en-00042-of-00100 news.en-00075-of-00100 news.en-00010-of-00100 news.en-00043-of-00100 news.en-00076-of-00100 news.en-00011-of-00100 news.en-00044-of-00100 news.en-00077-of-00100 news.en-00012-of-00100 news.en-00045-of-00100 news.en-00078-of-00100 news.en-00013-of-00100 news.en-00046-of-00100 news.en-00079-of-00100 news.en-00014-of-00100 news.en-00047-of-00100 news.en-00080-of-00100 news.en-00015-of-00100 news.en-00048-of-00100 news.en-00081-of-00100 news.en-00016-of-00100 news.en-00049-of-00100 news.en-00082-of-00100 news.en-00017-of-00100 news.en-00050-of-00100 news.en-00083-of-00100 news.en-00018-of-00100 news.en-00051-of-00100 news.en-00084-of-00100 news.en-00019-of-00100 news.en-00052-of-00100 news.en-00085-of-00100 news.en-00020-of-00100 news.en-00053-of-00100 news.en-00086-of-00100 news.en-00021-of-00100 news.en-00054-of-00100 news.en-00087-of-00100 news.en-00022-of-00100 news.en-00055-of-00100 news.en-00088-of-00100 news.en-00023-of-00100 news.en-00056-of-00100 news.en-00089-of-00100 news.en-00024-of-00100 news.en-00057-of-00100 news.en-00090-of-00100 news.en-00025-of-00100 news.en-00058-of-00100 news.en-00091-of-00100 news.en-00026-of-00100 news.en-00059-of-00100 news.en-00092-of-00100 news.en-00027-of-00100 news.en-00060-of-00100 news.en-00093-of-00100 news.en-00028-of-00100 news.en-00061-of-00100 news.en-00094-of-00100 news.en-00029-of-00100 news.en-00062-of-00100 news.en-00095-of-00100 news.en-00030-of-00100 news.en-00063-of-00100 news.en-00096-of-00100 news.en-00031-of-00100 news.en-00064-of-00100 news.en-00097-of-00100 news.en-00032-of-00100 news.en-00065-of-00100 news.en-00098-of-00100 news.en-00033-of-00100 news.en-00066-of-00100 news.en-00099-of-00100
root@597e78370cf3:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled# exit
exit
[chibi@centos8 ~]$ cat /etc/redhat-release
CentOS Linux release 8.2.2004 (Core)
[chibi@centos8 ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2020 NVIDIA Corporation Built on Wed_May__6_19:09:25_PDT_2020 Cuda compilation tools, release 11.0, V11.0.167 Build cuda_11.0_bu.TC445_37.28358933_0
[chibi@centos8 ~]$ sensors
iwlwifi-virtual-0 Adapter: Virtual device temp1: +33.0°C
eth0-pci-4400 Adapter: PCI adapter PHY Temperature: +47.0°C
k10temp-pci-00c3 Adapter: PCI adapter Tdie: +40.2°C (high = +70.0°C) Tctl: +40.2°C
[chibi@centos8 ~]$ sensors
iwlwifi-virtual-0 Adapter: Virtual device temp1: +33.0°C
eth0-pci-4400 Adapter: PCI adapter PHY Temperature: +46.7°C
k10temp-pci-00c3 Adapter: PCI adapter Tdie: +39.2°C (high = +70.0°C) Tctl: +39.2°C
[chibi@centos8 ~]$ sudo hddtemp /dev/sda
[sudo] password for chibi:
/dev/sda: TS128GSSD370S: 23°C
[chibi@centos8 ~]$ nvidia-smi nvlink -c
GPU 0: TITAN RTX (UUID: GPU-5a71d61e-f130-637a-b33d-4df555b0ed88)
GPU 1: TITAN RTX (UUID: GPU-7fb51c1d-c1e7-35cc-aad7-66971f05ddb7)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-1ac935c2-557f-282e-14e5-3f749ffd63ac)
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-13277ce5-e1e9-0cb1-8cee-6c9e6618e774)
[chibi@centos8 ~]$ lsmem
RANGE SIZE STATE REMOVABLE BLOCK 0x0000000000000000-0x000000000fffffff 256M online no 0-1 0x0000000010000000-0x0000000017ffffff 128M online yes 2 0x0000000018000000-0x000000003fffffff 640M online no 3-7 0x0000000100000000-0x000000010fffffff 256M online no 32-33 0x0000000110000000-0x0000001247ffffff 68.9G online yes 34-584 0x0000001248000000-0x000000125fffffff 384M online no 585-587 0x0000001260000000-0x000000127fffffff 512M online
yes 588-591 0x0000001280000000-0x0000001287ffffff 128M online no 592 0x0000001288000000-0x000000128fffffff 128M online yes 593 0x0000001290000000-0x0000001297ffffff 128M online no 594 0x0000001298000000-0x00000013afffffff 4.4G online yes 595-629 0x00000013b0000000-0x00000013bfffffff 256M online no 630-631 0x00000013c0000000-0x00000013cfffffff 256M online yes 632-633 0x00000013d0000000-0x00000013d7ffffff 128M online no 634 0x00000013d8000000-0x0000001417ffffff 1G online yes 635-642 0x0000001418000000-0x000000142fffffff 384M online no 643-645 0x0000001430000000-0x00000014afffffff 2G online yes 646-661 0x00000014b0000000-0x00000014b7ffffff 128M online no 662 0x00000014b8000000-0x000000159fffffff 3.6G online yes 663-691 0x00000015a0000000-0x00000015b7ffffff 384M online no 692-694 0x00000015b8000000-0x00000015f7ffffff 1G online yes 695-702 0x00000015f8000000-0x00000015ffffffff 128M online no 703 0x0000001600000000-0x0000001607ffffff 128M online yes 704 0x0000001608000000-0x000000161fffffff 384M online no 705-707 0x0000001620000000-0x000000163fffffff 512M online yes 708-711 0x0000001640000000-0x0000001677ffffff 896M online no 712-718 0x0000001678000000-0x0000001687ffffff 256M online yes 719-720 0x0000001688000000-0x000000168fffffff 128M online no 721 0x0000001690000000-0x00000016b7ffffff 640M online yes 722-726 0x00000016b8000000-0x00000016c7ffffff 256M online no 727-728 0x00000016c8000000-0x00000016cfffffff 128M online yes 729 0x00000016d0000000-0x00000016d7ffffff 128M online no 730 0x00000016d8000000-0x00000016dfffffff 128M online yes 731 0x00000016e0000000-0x00000016efffffff 256M online no 732-733 0x00000016f0000000-0x00000016ffffffff 256M online yes 734-735 0x0000001700000000-0x000000170fffffff 256M online no 736-737 0x0000001710000000-0x0000001717ffffff 128M online yes 738 0x0000001718000000-0x000000171fffffff 128M online no 739 0x0000001720000000-0x0000001787ffffff 1.6G online yes 740-752 0x0000001788000000-0x000000178fffffff 128M online no 753 0x0000001790000000-0x00000017d7ffffff 1.1G online yes 754-762 0x00000017d8000000-0x00000017dfffffff 128M online no 763 0x00000017e0000000-0x000000183fffffff 1.5G online yes 764-775 0x0000001840000000-0x0000001847ffffff 128M online no 776 0x0000001848000000-0x0000001867ffffff 512M online yes 777-780 0x0000001868000000-0x000000186fffffff 128M online no 781 0x0000001870000000-0x0000001877ffffff 128M online yes 782 0x0000001878000000-0x0000001887ffffff 256M online no 783-784 0x0000001888000000-0x00000018a7ffffff 512M online yes 785-788 0x00000018a8000000-0x00000018afffffff 128M online no 789 0x00000018b0000000-0x00000018b7ffffff 128M online yes 790 0x00000018b8000000-0x00000018bfffffff 128M online no 791 0x00000018c0000000-0x00000018c7ffffff 128M online yes 792 0x00000018c8000000-0x00000018d7ffffff 256M online no 793-794 0x00000018d8000000-0x00000018e7ffffff 256M online yes 795-796 0x00000018e8000000-0x00000018f7ffffff 256M online no 797-798 0x00000018f8000000-0x000000190fffffff 384M online yes 799-801 0x0000001910000000-0x0000001917ffffff 128M online no 802 0x0000001918000000-0x0000001927ffffff 256M online yes 803-804 0x0000001928000000-0x000000192fffffff 128M online no 805 0x0000001930000000-0x0000001937ffffff 128M online yes 806 0x0000001938000000-0x000000193fffffff 128M online no 807 0x0000001940000000-0x0000001967ffffff 640M online yes 808-812 0x0000001968000000-0x0000001977ffffff 256M online no 813-814 0x0000001978000000-0x0000001997ffffff 512M online yes 815-818 0x0000001998000000-0x000000199fffffff 128M online no 819 
0x00000019a0000000-0x00000019a7ffffff 128M online yes 820
0x00000019a8000000-0x00000019b7ffffff 256M online no 821-822
0x00000019b8000000-0x00000019e7ffffff 768M online yes 823-828
0x00000019e8000000-0x00000019efffffff 128M online no 829
0x00000019f0000000-0x00000019f7ffffff 128M online yes 830
0x00000019f8000000-0x00000019ffffffff 128M online no 831
0x0000001a00000000-0x0000001a0fffffff 256M online yes 832-833
0x0000001a10000000-0x0000001a17ffffff 128M online no 834
0x0000001a18000000-0x0000001a1fffffff 128M online yes 835
0x0000001a20000000-0x0000001a27ffffff 128M online no 836
0x0000001a28000000-0x0000001a2fffffff 128M online yes 837
0x0000001a30000000-0x0000001a3fffffff 256M online no 838-839
0x0000001a40000000-0x0000001a5fffffff 512M online yes 840-843
0x0000001a60000000-0x0000001a67ffffff 128M online no 844
0x0000001a68000000-0x0000001a77ffffff 256M online yes 845-846
0x0000001a78000000-0x0000001a7fffffff 128M online no 847
0x0000001a80000000-0x0000001a8fffffff 256M online yes 848-849
0x0000001a90000000-0x0000001a9fffffff 256M online no 850-851
0x0000001aa0000000-0x0000001ad7ffffff 896M online yes 852-858
0x0000001ad8000000-0x0000001adfffffff 128M online no 859
0x0000001ae0000000-0x0000001af7ffffff 384M online yes 860-862
0x0000001af8000000-0x0000001affffffff 128M online no 863
0x0000001b00000000-0x0000001b17ffffff 384M online yes 864-866
0x0000001b18000000-0x0000001b1fffffff 128M online no 867
0x0000001b20000000-0x0000001b6fffffff 1.3G online yes 868-877
0x0000001b70000000-0x0000001b77ffffff 128M online no 878
0x0000001b78000000-0x0000001b97ffffff 512M online yes 879-882
0x0000001b98000000-0x0000001b9fffffff 128M online no 883
0x0000001ba0000000-0x0000001bcfffffff 768M online yes 884-889
0x0000001bd0000000-0x0000001bd7ffffff 128M online no 890
0x0000001bd8000000-0x0000001bf7ffffff 512M online yes 891-894
0x0000001bf8000000-0x0000001c1fffffff 640M online no 895-899
0x0000001c20000000-0x0000001c27ffffff 128M online yes 900
0x0000001c28000000-0x0000001c2fffffff 128M online no 901
0x0000001c30000000-0x0000001c4fffffff 512M online yes 902-905
0x0000001c50000000-0x0000001c57ffffff 128M online no 906
0x0000001c58000000-0x0000001c5fffffff 128M online yes 907
0x0000001c60000000-0x0000001c6fffffff 256M online no 908-909
0x0000001c70000000-0x0000001d7fffffff 4.3G online yes 910-943
0x0000001d80000000-0x0000001da7ffffff 640M online no 944-948
0x0000001da8000000-0x0000001db7ffffff 256M online yes 949-950
0x0000001db8000000-0x0000001defffffff 896M online no 951-957
0x0000001df0000000-0x0000001df7ffffff 128M online yes 958
0x0000001df8000000-0x0000001dffffffff 128M online no 959
0x0000001e00000000-0x0000001e07ffffff 128M online yes 960
0x0000001e08000000-0x0000001e0fffffff 128M online no 961
0x0000001e10000000-0x0000001e1fffffff 256M online yes 962-963
0x0000001e20000000-0x0000001e27ffffff 128M online no 964
0x0000001e28000000-0x0000001e2fffffff 128M online yes 965
0x0000001e30000000-0x0000001e4fffffff 512M online no 966-969
0x0000001e50000000-0x0000001e57ffffff 128M online yes 970
0x0000001e58000000-0x0000001e5fffffff 128M online no 971
0x0000001e60000000-0x0000001e67ffffff 128M online yes 972
0x0000001e68000000-0x0000001eb7ffffff 1.3G online no 973-982
0x0000001eb8000000-0x0000001effffffff 1.1G online yes 983-991
0x0000001f00000000-0x0000001f07ffffff 128M online no 992
0x0000001f08000000-0x0000001f4fffffff 1.1G online yes 993-1001
0x0000001f50000000-0x0000001f57ffffff 128M online no 1002
0x0000001f58000000-0x0000001f7fffffff 640M online yes 1003-1007
0x0000001f80000000-0x0000001f87ffffff 128M online no 1008
0x0000001f88000000-0x0000001f9fffffff 384M online yes 1009-1011
0x0000001fa0000000-0x00000020bfffffff 4.5G online no 1012-1047
Memory block size: 128M
Total online memory: 128G
Total offline memory: 0B
[chibi@centos8 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD Ryzen Threadripper 3990X 64-Core Processor
Stepping: 0
CPU MHz: 3594.834
CPU max MHz: 2900.0000
CPU min MHz: 2200.0000
BogoMIPS: 5800.02
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
[chibi@centos8 ~]$ lstopo
Machine (126GB) Package L#0 L3 L#0 (16MB) L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#64) L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 PU L#2 (P#1) PU L#3 (P#65) L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 PU L#4 (P#2) PU L#5 (P#66) L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 PU L#6 (P#3) PU L#7 (P#67) L3 L#1 (16MB) L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 PU L#8 (P#4) PU L#9 (P#68) L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 PU L#10 (P#5) PU L#11 (P#69) L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 PU L#12 (P#6) PU L#13 (P#70) L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 PU L#14 (P#7) PU L#15 (P#71) L3 L#2 (16MB) L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 PU L#16 (P#8) PU L#17 (P#72) L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 PU L#18 (P#9) PU L#19 (P#73) L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 PU L#20 (P#10) PU L#21 (P#74) L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 PU L#22 (P#11) PU L#23 (P#75) L3 L#3 (16MB) L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 PU L#24 (P#12) PU L#25 (P#76) L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 PU L#26 (P#13) PU L#27 (P#77) L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 PU L#28 (P#14) PU L#29 (P#78) L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 PU L#30 (P#15) PU L#31 (P#79) L3 L#4 (16MB) L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 PU L#32 (P#16) PU L#33 (P#80) L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 PU L#34 (P#17) PU L#35 (P#81) L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 PU L#36 (P#18) PU L#37 (P#82) L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 PU L#38 (P#19) PU L#39 (P#83) L3 L#5 (16MB) L2 L#20 (512KB) + L1d L#20 (32KB)
+ L1i L#20 (32KB) + Core L#20 PU L#40 (P#20) PU L#41 (P#84) L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 PU L#42 (P#21) PU L#43 (P#85) L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 PU L#44 (P#22) PU L#45 (P#86) L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 PU L#46 (P#23) PU L#47 (P#87) L3 L#6 (16MB) L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 PU L#48 (P#24) PU L#49 (P#88) L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 PU L#50 (P#25) PU L#51 (P#89) L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 PU L#52 (P#26) PU L#53 (P#90) L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 PU L#54 (P#27) PU L#55 (P#91) L3 L#7 (16MB) L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 PU L#56 (P#28) PU L#57 (P#92) L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 PU L#58 (P#29) PU L#59 (P#93) L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 PU L#60 (P#30) PU L#61 (P#94) L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 PU L#62 (P#31) PU L#63 (P#95) L3 L#8 (16MB) L2 L#32 (512KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 PU L#64 (P#32) PU L#65 (P#96) L2 L#33 (512KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 PU L#66 (P#33) PU L#67 (P#97) L2 L#34 (512KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 PU L#68 (P#34) PU L#69 (P#98) L2 L#35 (512KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 PU L#70 (P#35) PU L#71 (P#99) L3 L#9 (16MB) L2 L#36 (512KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 PU L#72 (P#36) PU L#73 (P#100) L2 L#37 (512KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 PU L#74 (P#37) PU L#75 (P#101) L2 L#38 (512KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 PU L#76 (P#38) PU L#77 (P#102) L2 L#39 (512KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 PU L#78 (P#39) PU L#79 (P#103) L3 L#10 (16MB) L2 L#40 (512KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 PU L#80 (P#40) PU L#81 (P#104) L2 L#41 (512KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 PU L#82 (P#41) PU L#83 (P#105) L2 L#42 (512KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 PU L#84 (P#42) PU L#85 (P#106) L2 L#43 (512KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 PU L#86 (P#43) PU L#87 (P#107) L3 L#11 (16MB) L2 L#44 (512KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 PU L#88 (P#44) PU L#89 (P#108) L2 L#45 (512KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 PU L#90 (P#45) PU L#91 (P#109) L2 L#46 (512KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 PU L#92 (P#46) PU L#93 (P#110) L2 L#47 (512KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 PU L#94 (P#47) PU L#95 (P#111) L3 L#12 (16MB) L2 L#48 (512KB) + L1d L#48 (32KB) + L1i L#48 (32KB) + Core L#48 PU L#96 (P#48) PU L#97 (P#112) L2 L#49 (512KB) + L1d L#49 (32KB) + L1i L#49 (32KB) + Core L#49 PU L#98 (P#49) PU L#99 (P#113) L2 L#50 (512KB) + L1d L#50 (32KB) + L1i L#50 (32KB) + Core L#50 PU L#100 (P#50) PU L#101 (P#114) L2 L#51 (512KB) + L1d L#51 (32KB) + L1i L#51 (32KB) + Core L#51 PU L#102 (P#51) PU L#103 (P#115) L3 L#13 (16MB) L2 L#52 (512KB) + L1d L#52 (32KB) + L1i L#52 (32KB) + Core L#52 PU L#104 (P#52) PU L#105 (P#116) L2 L#53 (512KB) + L1d L#53 (32KB) + L1i L#53 (32KB) + Core L#53 PU L#106 (P#53) PU L#107 (P#117) L2 L#54 (512KB) + L1d L#54 (32KB) + L1i L#54 (32KB) + Core L#54 PU L#108 (P#54) PU L#109 (P#118) L2 L#55 (512KB) + L1d L#55 (32KB) + L1i L#55 (32KB) + Core L#55 PU L#110 (P#55) PU L#111 (P#119) L3 L#14 (16MB) L2 L#56 (512KB) + L1d L#56 (32KB) + L1i 
L#56 (32KB) + Core L#56 PU L#112 (P#56) PU L#113 (P#120) L2 L#57 (512KB) + L1d L#57 (32KB) + L1i L#57 (32KB) + Core L#57 PU L#114 (P#57) PU L#115 (P#121) L2 L#58 (512KB) + L1d L#58 (32KB) + L1i L#58 (32KB) + Core L#58 PU L#116 (P#58) PU L#117 (P#122) L2 L#59 (512KB) + L1d L#59 (32KB) + L1i L#59 (32KB) + Core L#59 PU L#118 (P#59) PU L#119 (P#123) L3 L#15 (16MB) L2 L#60 (512KB) + L1d L#60 (32KB) + L1i L#60 (32KB) + Core L#60 PU L#120 (P#60) PU L#121 (P#124) L2 L#61 (512KB) + L1d L#61 (32KB) + L1i L#61 (32KB) + Core L#61 PU L#122 (P#61) PU L#123 (P#125) L2 L#62 (512KB) + L1d L#62 (32KB) + L1i L#62 (32KB) + Core L#62 PU L#124 (P#62) PU L#125 (P#126) L2 L#63 (512KB) + L1d L#63 (32KB) + L1i L#63 (32KB) + Core L#63 PU L#126 (P#63) PU L#127 (P#127) HostBridge L#0 PCIBridge PCI 10de:1e02 GPU L#0 "renderD128" GPU L#1 "card0" HostBridge L#2 PCIBridge PCI 10de:1e02 GPU L#2 "card1" GPU L#3 "renderD129" HostBridge L#4 PCIBridge PCIBridge PCIBridge PCI 1d6a:07b1 Net L#4 "eth0" PCIBridge PCI 8086:2723 Net L#5 "wlan1" PCIBridge PCI 10ec:8125 PCIBridge PCI 1022:7901 Block(Disk) L#6 "sda" Block(Other) L#7 "sr0" PCIBridge PCI 1022:7901 PCIBridge PCI 10de:1e07 GPU L#8 "renderD130" GPU L#9 "card2" PCIBridge PCI 10de:1e07 GPU L#10 "card3" GPU L#11 "renderD131"
[chibi@centos8 ~]$ cat /proc/meminfo
MemTotal: 131596560 kB
MemFree: 121626608 kB
MemAvailable: 128428572 kB
Buffers: 1060 kB
Cached: 7651916 kB
SwapCached: 0 kB
Active: 1174140 kB
Inactive: 6970692 kB
Active(anon): 468424 kB
Inactive(anon): 10144 kB
Active(file): 705716 kB
Inactive(file): 6960548 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 483916 kB
Mapped: 297908 kB
Shmem: 12808 kB
KReclaimable: 384888 kB
Slab: 1285820 kB
SReclaimable: 384888 kB
SUnreclaim: 900932 kB
KernelStack: 26960 kB
PageTables: 25316 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 65798280 kB
Committed_AS: 3583568 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 68608 kB
HardwareCorrupted: 0 kB
AnonHugePages: 208896 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 1804464 kB
DirectMap2M: 28465152 kB
DirectMap1G: 103809024 kB
[chibi@centos8 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:      131596560     1932052   121626628       12808     8037880   128428592
Swap:             0           0           0
[chibi@centos8 ~]$ sensors
iwlwifi-virtual-0
Adapter: Virtual device
temp1: +32.0°C

eth0-pci-4400
Adapter: PCI adapter
PHY Temperature: +45.4°C

k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +38.0°C (high = +70.0°C)
Tctl: +38.0°C

[chibi@centos8 ~]$ sudo hddtemp /dev/sda
/dev/sda: TS128GSSD370S: 21°C
[chibi@centos8 ~]$
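
Note: with the benchmark corpus downloaded and the host checked over (four Turing GPUs, 128 hardware threads, 128G of RAM, a CUDA 11.0 toolchain), the natural next step is to re-enter the container and launch the big_lstm training run. Below is a minimal sketch, assuming the corpus has been saved to ~/data on the host (the earlier container was started with --rm, so a copy left inside it would not have persisted) and using the flags documented in the example's README (--mode, --num_gpus, --logdir, --datadir); the host path, log directory, and GPU count are placeholders to adjust:

  # On the host: relaunch the image with the SHMEM/ulimit settings the
  # container itself recommends at startup, mounting the dataset read-only.
  sudo nvidia-docker run --rm -ti --shm-size=1g --ulimit memlock=-1 \
      --ulimit stack=67108864 -v ~/data:/data:ro \
      nvcr.io/nvidia/tensorflow:19.04-py3

  # Inside the container: train the big_lstm example on all four GPUs.
  cd /workspace/nvidia-examples/big_lstm
  python single_lm_train.py --mode=train --num_gpus=4 \
      --logdir=/tmp/big_lstm_logs \
      --datadir=/data/1-billion-word-language-modeling-benchmark-r13output

Because the TITAN RTX (24 GB) and GeForce RTX 2080 Ti (11 GB) cards differ in memory capacity, a four-GPU run is bounded by the smaller cards; restricting the job to a matched pair with CUDA_VISIBLE_DEVICES=0,1 before invoking the script is a reasonable alternative.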