[chibi@centos8 ~]$ sudo nvidia-docker run --rm -ti nvcr.io/nvidia/tensorflow:19.04-py3
[sudo] password for chibi:

================
== TensorFlow ==
================

NVIDIA Release 19.04 (build 6132408)
TensorFlow Version 1.13.1

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
      insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
      nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

root@321ae05d9e8b:/workspace# ls
README.md  docker-examples  nvidia-examples
root@321ae05d9e8b:/workspace# cd nvidia-examples
root@321ae05d9e8b:/workspace/nvidia-examples# ls
NCF              bert                 cnn           ssdv1.2
OpenSeq2Seq      big_lstm             gnmt_v2       tensorrt
UNet_Industrial  build_imagenet_data  resnet50v1.5
root@321ae05d9e8b:/workspace/nvidia-examples# cd big_lstm
root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# ls
1b_word_vocab.txt  data_utils_test.py         language_model_test.py
README.md          download_1b_words_data.sh  model_utils.py
__init__.py        hparams.py                 run_utils.py
common.py          hparams_test.py            single_lm_train.py
data_utils.py      language_model.py          testdata
root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# ./download_1b_words_data.sh
Please specify root of dataset directory: data
Success: dataset root dir validated
--2020-06-18 23:17:55--  http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response...
200 OK
Length: 1792209805 (1.7G) [application/x-gzip]
Saving to: ‘1-billion-word-language-modeling-benchmark-r13output.tar.gz’

1-billion-word-lang 100%[===================>]   1.67G   492KB/s    in 54m 15s

2020-06-19 00:12:11 (538 KB/s) - ‘1-billion-word-language-modeling-benchmark-r13output.tar.gz’ saved [1792209805/1792209805]

1-billion-word-language-modeling-benchmark-r13output/
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00024-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00057-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00055-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00096-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00081-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00033-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00072-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00082-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00018-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00008-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00059-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00005-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00091-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00062-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00031-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00095-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00076-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00006-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00038-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00015-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00087-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00021-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00049-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00009-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00027-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00056-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00046-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00032-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00029-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00088-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00085-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00011-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00012-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00067-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00003-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00093-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00050-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00053-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00044-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00019-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00066-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00028-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00045-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00039-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00071-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00052-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00078-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00037-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00002-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00014-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00048-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00017-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00004-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00077-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00080-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00020-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00051-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00016-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00079-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00043-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00068-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00099-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00064-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00034-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00054-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00040-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00070-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00063-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00041-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00083-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00061-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00073-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00094-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00030-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00060-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00035-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00023-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00042-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00025-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00090-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00089-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00065-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00075-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00022-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00026-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00098-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00084-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00010-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00069-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00013-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00092-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00036-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00097-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00007-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00074-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00047-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00086-of-00100
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00058-of-00100
1-billion-word-language-modeling-benchmark-r13output/.svn/
1-billion-word-language-modeling-benchmark-r13output/.svn/tmp/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/de/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/de/de102cd0c91cd19e6612f0840e68a2f20ba8134c.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/de/deed1b75d3bd5cc36ae6aeb85d56680b892b7948.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/86/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/86/86c58db52fbf362c5bc329afc33b8805085fcb0d.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/9f/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/9f/9f2882e21f860a83ad6ea8898ebab140974ed301.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/bc/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/bc/bcdbc523ee7488dc438cab869b6d5e236578dbfa.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/d2/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/d2/d2718bc26d0ee0a213d7d4add99a304cb5b39ede.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/c5/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/c5/c5b24f61479da923123d0394a188da922ea0359c.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/11/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/11/116d6ea61730d8199127596b072e981338597779.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/b0/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/b0/b0e26559cfe641245584a9400b35ba28d64f1411.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/d3/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/d3/d3ae508e3bcb0e696dd70aecd052410f1f7afc1d.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/9e/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/9e/9e148bd766e8805e0eb97eeae250433ec7a2e996.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/31/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/31/31b645a482e0b81fda3c567cada307c6fcf7ec80.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/da/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/da/da39a3ee5e6b4b0d3255bfef95601890afd80709.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/c1/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/c1/c1ed42c415ec884e591fb5c70d373da640a383b5.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/e3/
1-billion-word-language-modeling-benchmark-r13output/.svn/pristine/e3/e37ba0f85e94073ccaced1eed7e4f5d737a25f49.svn-base
1-billion-word-language-modeling-benchmark-r13output/.svn/entries
1-billion-word-language-modeling-benchmark-r13output/.svn/format
1-billion-word-language-modeling-benchmark-r13output/.svn/wc.db
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00015-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00031-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00027-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00010-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00033-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00042-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00046-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00037-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00029-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00013-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00002-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00048-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00006-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00030-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00025-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00039-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00008-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00020-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00001-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00034-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00044-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00045-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00016-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00004-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00035-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00038-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00009-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00024-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00022-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00021-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00032-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00011-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00049-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00041-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00019-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00023-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00040-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00014-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00007-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00017-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00012-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00018-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00003-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00028-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00043-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00005-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00036-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00026-of-00050
1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00047-of-00050
1-billion-word-language-modeling-benchmark-r13output/README
Success!
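As a quick sanity check on wget's summary line above (not part of the original session), the average rate follows directly from the reported file size and elapsed time; wget's "KB" is 1024 bytes:

```python
# Average download rate implied by wget's summary line:
# 1,792,209,805 bytes transferred in 54m 15s.
size_bytes = 1_792_209_805
elapsed_s = 54 * 60 + 15            # 3255 seconds
rate_kb_s = size_bytes / elapsed_s / 1024   # wget reports KB as KiB
print(round(rate_kb_s))             # 538, matching the "(538 KB/s)" in the log
```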
One billion words dataset ready at:
  data/1-billion-word-language-modeling-benchmark-r13output/
Please pass this dir to single_lm_train.py via the --datadir option.
root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=4 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

*****HYPER PARAMETERS*****
{'max_time': 180, 'num_shards': 8, 'projected_size': 512, 'num_sampled': 8192, 'do_summaries': False, 'num_layers': 1, 'keep_prob': 0.9, 'learning_rate': 0.2, 'batch_size': 128, 'run_profiler': False, 'num_steps': 20, 'vocab_size': 793470, 'average_params': True, 'optimizer': 0, 'state_size': 2048, 'num_gpus': 4, 'max_grad_norm': 10.0, 'num_delayed_steps': 150, 'emb_size': 512}
**************************
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Current time: 1592525626.2155185
ALL VARIABLES
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
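An aside on the hyperparameter dump printed above: it fixes the global batch, since each of the 4 GPU workers consumes batch_size × num_steps tokens per iteration. A minimal sketch from the printed values:

```python
# Words consumed per training iteration, from the printed hyperparameters.
hparams = {'batch_size': 128, 'num_steps': 20, 'num_gpus': 4}
words_per_iteration = (hparams['batch_size']
                       * hparams['num_steps']
                       * hparams['num_gpus'])
print(words_per_iteration)  # 10240 words per iteration across all 4 GPUs
```

This 10240-word global batch is what the "wps" (words per second) figures later in the log are measured against.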
model/emb_0:0 (99184, 512) /gpu:0
model/emb_1:0 (99184, 512) /gpu:0
model/emb_2:0 (99184, 512) /gpu:0
model/emb_3:0 (99184, 512) /gpu:0
model/emb_4:0 (99184, 512) /gpu:0
model/emb_5:0 (99184, 512) /gpu:0
model/emb_6:0 (99184, 512) /gpu:0
model/emb_7:0 (99184, 512) /gpu:0
model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0
model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0
model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0
model/softmax_w_0:0 (99184, 512) /gpu:0
model/softmax_w_1:0 (99184, 512) /gpu:0
model/softmax_w_2:0 (99184, 512) /gpu:0
model/softmax_w_3:0 (99184, 512) /gpu:0
model/softmax_w_4:0 (99184, 512) /gpu:0
model/softmax_w_5:0 (99184, 512) /gpu:0
model/softmax_w_6:0 (99184, 512) /gpu:0
model/softmax_w_7:0 (99184, 512) /gpu:0
model/softmax_b:0 (793470,) /gpu:0
model/global_step:0 ()
model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0
model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0
model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0
model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0
model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0
model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0
model/model/softmax_b/Adagrad:0 (793470,) /gpu:0
model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0
model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0
model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0
TRAINABLE VARIABLES
model/emb_0:0 (99184, 512) /gpu:0
model/emb_1:0 (99184, 512) /gpu:0
model/emb_2:0 (99184, 512) /gpu:0
model/emb_3:0 (99184, 512) /gpu:0
model/emb_4:0 (99184, 512) /gpu:0
model/emb_5:0 (99184, 512) /gpu:0
model/emb_6:0 (99184, 512) /gpu:0
model/emb_7:0 (99184, 512) /gpu:0
model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0
model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0
model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0
model/softmax_w_0:0 (99184, 512) /gpu:0
model/softmax_w_1:0 (99184, 512) /gpu:0
model/softmax_w_2:0 (99184, 512) /gpu:0
model/softmax_w_3:0 (99184, 512) /gpu:0
model/softmax_w_4:0 (99184, 512) /gpu:0
model/softmax_w_5:0 (99184, 512) /gpu:0
model/softmax_w_6:0 (99184, 512) /gpu:0
model/softmax_w_7:0 (99184, 512) /gpu:0
model/softmax_b:0 (793470,) /gpu:0
LOCAL VARIABLES
model/model/state_0_0:0 (128, 2560) /gpu:0
model/model_1/state_1_0:0 (128, 2560) /gpu:1
model/model_2/state_2_0:0 (128, 2560) /gpu:2
model/model_3/state_3_0:0 (128, 2560) /gpu:3
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2020-06-19 00:13:47.136968: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2994465000 Hz
2020-06-19 00:13:47.144055: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0xafbaba0 executing computations on platform Host. Devices:
2020-06-19 00:13:47.144097: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): ,
2020-06-19 00:13:47.734196: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0xae6af00 executing computations on platform CUDA.
Devices:
2020-06-19 00:13:47.734239: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): TITAN RTX, Compute Capability 7.5
2020-06-19 00:13:47.734252: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (1): TITAN RTX, Compute Capability 7.5
2020-06-19 00:13:47.734263: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-19 00:13:47.734274: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-19 00:13:47.735709: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:03:00.0
totalMemory: 23.65GiB freeMemory: 23.22GiB
2020-06-19 00:13:47.735745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:21:00.0
totalMemory: 23.65GiB freeMemory: 23.49GiB
2020-06-19 00:13:47.735772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:41:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2020-06-19 00:13:47.735798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:61:00.0
totalMemory: 10.76GiB freeMemory: 10.61GiB
2020-06-19 00:13:47.735832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3
2020-06-19 00:13:52.080128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-19 00:13:52.080177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 2 3
2020-06-19 00:13:52.080183: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N N 2020-06-19 00:13:52.080187: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N N N 2020-06-19 00:13:52.080191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N N N N 2020-06-19 00:13:52.080196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: N N N N 2020-06-19 00:13:52.080425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22508 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:03:00.0, compute capability: 7.5) 2020-06-19 00:13:52.080726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5) 2020-06-19 00:13:52.081038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5) 2020-06-19 00:13:52.081400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:61:00.0, compute capability: 7.5) Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00019-of-00100 Finished processing! 
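The (99184, 512) embedding and softmax shards in the variable listing above follow directly from the printed hyperparameters: the 793,470-word vocabulary is split across `num_shards=8`, giving ceil(793470 / 8) = 99184 rows per shard. A minimal sketch of that arithmetic (plain Python, no TensorFlow needed):

```python
import math

# Values taken from the hyperparameter dump and variable listing in the log.
vocab_size = 793470   # rows of model/softmax_b
num_shards = 8        # emb_0..emb_7, softmax_w_0..softmax_w_7

shard_rows = math.ceil(vocab_size / num_shards)
print(shard_rows)  # 99184, matching the (99184, 512) shard shapes above
```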
2020-06-19 00:14:16.923516: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally Iteration 1, time = 13.72s, wps = 746, train loss = 13.0038 Iteration 2, time = 9.83s, wps = 1042, train loss = 12.9694 Iteration 3, time = 0.11s, wps = 96475, train loss = 12.8082 Iteration 4, time = 0.11s, wps = 96160, train loss = 16.7368 Iteration 5, time = 0.11s, wps = 95596, train loss = 12.3217 Iteration 6, time = 0.10s, wps = 98887, train loss = 22.2883 Iteration 7, time = 0.10s, wps = 98291, train loss = 13.3811 Iteration 8, time = 0.11s, wps = 92147, train loss = 11.8788 Iteration 9, time = 0.11s, wps = 95493, train loss = 45.9214 Iteration 20, time = 1.12s, wps = 100675, train loss = 32.4956 Iteration 40, time = 2.03s, wps = 100821, train loss = 15.8944 Iteration 60, time = 2.03s, wps = 100698, train loss = 9.3490 Iteration 80, time = 2.03s, wps = 100940, train loss = 9.9216 Iteration 100, time = 2.04s, wps = 100232, train loss = 8.0002 Iteration 120, time = 2.02s, wps = 101254, train loss = 7.2660 Iteration 140, time = 2.03s, wps = 100863, train loss = 7.0494 Iteration 160, time = 2.02s, wps = 101158, train loss = 7.0931 Iteration 180, time = 2.05s, wps = 100046, train loss = 6.8827 Iteration 200, time = 2.06s, wps = 99275, train loss = 6.6083 Iteration 220, time = 2.03s, wps = 101111, train loss = 6.5956 Iteration 240, time = 2.05s, wps = 99891, train loss = 6.2578 Iteration 260, time = 2.08s, wps = 98369, train loss = 6.3125 Iteration 280, time = 2.04s, wps = 100425, train loss = 6.0726 Iteration 300, time = 2.03s, wps = 100760, train loss = 6.1993 Iteration 320, time = 2.04s, wps = 100251, train loss = 6.1054 Iteration 340, time = 2.09s, wps = 98038, train loss = 6.1392 Iteration 360, time = 2.04s, wps = 100422, train loss = 6.0404 Iteration 380, time = 2.02s, wps = 101582, train loss = 5.9851 Iteration 400, time = 2.03s, wps = 101025, train loss = 5.9272 Iteration 420, time = 2.05s, wps = 99791, train loss = 5.8850 
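The steady-state figure of roughly 100k wps in the 4-GPU run above lines up with counting batch_size × num_steps tokens per GPU per iteration (an assumption about how the example computes wps, not something the log states, but the numbers match). A quick check in plain Python:

```python
# Hypothetical reconstruction of the logged wps figure, assuming the
# benchmark counts batch_size * num_steps tokens per GPU per iteration.
batch_size = 128   # from the printed hyperparameters
num_steps = 20
num_gpus = 4

def words_per_second(elapsed_s, iterations):
    tokens = batch_size * num_steps * num_gpus * iterations
    return tokens / elapsed_s

# 20 iterations in ~2.03 s, as in the steady-state lines above:
print(round(words_per_second(2.03, 20)))  # 100887, matching the ~100k wps logged
```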
Iteration 440, time = 2.02s, wps = 101545, train loss = 5.9050 Iteration 460, time = 2.04s, wps = 100431, train loss = 5.8149 Iteration 480, time = 2.04s, wps = 100444, train loss = 5.8012 Iteration 500, time = 2.04s, wps = 100537, train loss = 5.7251 Iteration 520, time = 2.05s, wps = 99822, train loss = 5.7477 Iteration 540, time = 2.03s, wps = 100663, train loss = 5.7529 Iteration 560, time = 2.07s, wps = 98933, train loss = 5.7439 Iteration 580, time = 2.08s, wps = 98258, train loss = 5.6509 Iteration 600, time = 2.05s, wps = 99690, train loss = 5.6871 Iteration 620, time = 2.04s, wps = 100246, train loss = 5.6099 Iteration 640, time = 2.05s, wps = 99731, train loss = 5.5867 Iteration 660, time = 2.06s, wps = 99360, train loss = 5.5841 Iteration 680, time = 2.04s, wps = 100281, train loss = 5.5501 Iteration 700, time = 2.05s, wps = 99912, train loss = 5.6182 Iteration 720, time = 2.06s, wps = 99328, train loss = 5.5711 Iteration 740, time = 2.05s, wps = 100130, train loss = 5.5200 Iteration 760, time = 2.06s, wps = 99362, train loss = 5.4987 Iteration 780, time = 2.04s, wps = 100266, train loss = 5.5152 Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00099-of-00100 Finished processing! 
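The train loss printed at each step can be read as an average per-token cross-entropy in nats (an interpretation assumed from the model's sampled-softmax setup), so exp(loss) gives a training perplexity: the drop from ~13 at iteration 1 to ~5.5 here corresponds to perplexity falling from the hundreds of thousands to a few hundred.

```python
import math

def perplexity(avg_xent_nats):
    """Perplexity implied by an average per-token cross-entropy in nats."""
    return math.exp(avg_xent_nats)

print(perplexity(13.0038))  # initial loss: perplexity in the hundreds of thousands
print(perplexity(5.5152))   # after ~780 iterations: roughly 248
```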
Iteration 800, time = 4.10s, wps = 49892, train loss = 5.5035 Iteration 820, time = 2.05s, wps = 99791, train loss = 5.4163 Iteration 840, time = 2.07s, wps = 98861, train loss = 5.4949 Iteration 860, time = 2.03s, wps = 100802, train loss = 5.3665 Iteration 880, time = 2.04s, wps = 100470, train loss = 5.3956 Iteration 900, time = 2.09s, wps = 98144, train loss = 5.3705 Iteration 920, time = 2.07s, wps = 99011, train loss = 5.3500 Iteration 940, time = 2.07s, wps = 98723, train loss = 5.3475 Iteration 960, time = 2.07s, wps = 98744, train loss = 5.3691 Iteration 980, time = 2.04s, wps = 100237, train loss = 5.2818 Iteration 1000, time = 2.05s, wps = 99848, train loss = 5.3399 Iteration 1020, time = 2.05s, wps = 100138, train loss = 5.2373 Iteration 1040, time = 2.07s, wps = 99034, train loss = 5.2471 Iteration 1060, time = 2.06s, wps = 99357, train loss = 5.2739 Iteration 1080, time = 2.04s, wps = 100150, train loss = 5.3494 Iteration 1100, time = 2.13s, wps = 96315, train loss = 5.2379 Iteration 1120, time = 2.05s, wps = 99943, train loss = 5.2218 Iteration 1140, time = 2.07s, wps = 99051, train loss = 5.2110 Iteration 1160, time = 2.04s, wps = 100359, train loss = 5.1960 Iteration 1180, time = 2.06s, wps = 99553, train loss = 5.2355 Iteration 1200, time = 2.05s, wps = 99825, train loss = 5.1358 Iteration 1220, time = 2.04s, wps = 100520, train loss = 5.1886 Iteration 1240, time = 2.04s, wps = 100191, train loss = 5.1418 Iteration 1260, time = 2.08s, wps = 98532, train loss = 5.1151 Iteration 1280, time = 2.07s, wps = 98963, train loss = 5.1284 Iteration 1300, time = 2.05s, wps = 100042, train loss = 5.1208 Iteration 1320, time = 2.04s, wps = 100278, train loss = 5.1130 /usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. 
" real 3m25.470s user 17m12.367s sys 1m29.609s root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=3 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue. *****HYPER PARAMETERS***** {'num_steps': 20, 'max_time': 180, 'vocab_size': 793470, 'keep_prob': 0.9, 'run_profiler': False, 'num_gpus': 3, 'learning_rate': 0.2, 'num_sampled': 8192, 'do_summaries': False, 'num_layers': 1, 'optimizer': 0, 'emb_size': 512, 'projected_size': 512, 'max_grad_norm': 10.0, 'state_size': 2048, 'average_params': True, 'num_delayed_steps': 150, 'batch_size': 128, 'num_shards': 8} ************************** WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`. 
WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. Current time: 1592526538.7911236 ALL VARIABLES WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Please use tf.global_variables instead. 
model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 model/global_step:0 () model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0 model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0 model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_b/Adagrad:0 (793470,) /gpu:0 model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0 
model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0 TRAINABLE VARIABLES model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 LOCAL VARIABLES model/model/state_0_0:0 (128, 2560) /gpu:0 model/model_1/state_1_0:0 (128, 2560) /gpu:1 model/model_2/state_2_0:0 (128, 2560) /gpu:2 WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-06-19 00:28:59.447968: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2994465000 Hz 2020-06-19 00:28:59.449882: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x8eeed40 executing computations on platform Host. Devices: 2020-06-19 00:28:59.449924: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2020-06-19 00:29:00.000728: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x8eeca00 executing computations on platform CUDA. 
Devices: 2020-06-19 00:29:00.000762: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5 2020-06-19 00:29:00.000772: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): TITAN RTX, Compute Capability 7.5 2020-06-19 00:29:00.000779: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-06-19 00:29:00.000787: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-06-19 00:29:00.002404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:03:00.0 totalMemory: 23.65GiB freeMemory: 23.22GiB 2020-06-19 00:29:00.002441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 totalMemory: 23.65GiB freeMemory: 23.49GiB 2020-06-19 00:29:00.002469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:41:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-06-19 00:29:00.002499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:61:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-06-19 00:29:00.002532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3 2020-06-19 00:29:00.829828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-19 00:29:00.829881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 2020-06-19 00:29:00.829888: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N N 2020-06-19 00:29:00.829892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N N N 2020-06-19 00:29:00.829896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N N N N 2020-06-19 00:29:00.829900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: N N N N 2020-06-19 00:29:00.830079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22507 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:03:00.0, compute capability: 7.5) 2020-06-19 00:29:00.830468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5) 2020-06-19 00:29:00.830673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5) 2020-06-19 00:29:00.830969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:61:00.0, compute capability: 7.5) WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00086-of-00100 Finished processing! 2020-06-19 00:29:16.769095: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally Iteration 1323, time = 9.93s, wps = 773, train loss = 5.7636 Iteration 1324, time = 7.34s, wps = 1047, train loss = 5.1925 Iteration 1325, time = 0.09s, wps = 86222, train loss = 5.1715 Iteration 1326, time = 0.09s, wps = 83396, train loss = 5.1111 Iteration 1327, time = 0.08s, wps = 91510, train loss = 5.1339 Iteration 1328, time = 0.08s, wps = 91095, train loss = 5.1471 Iteration 1329, time = 0.08s, wps = 92603, train loss = 5.1169 Iteration 1330, time = 0.08s, wps = 90431, train loss = 5.1330 Iteration 1331, time = 0.08s, wps = 94402, train loss = 5.1252 Iteration 1342, time = 0.89s, wps = 94939, train loss = 5.1234 Iteration 1362, time = 1.62s, wps = 94643, train loss = 5.0234 Iteration 1382, time = 1.66s, wps = 92367, train loss = 5.0195 Iteration 1402, time = 1.64s, wps = 93609, train loss = 5.0871 Iteration 1422, time = 1.63s, wps = 94225, train loss = 5.0676 Iteration 1442, time = 1.63s, wps = 93968, train loss = 5.0814 Iteration 1462, time = 1.62s, wps = 94561, train loss = 5.0680 Iteration 1482, time = 1.61s, wps = 95154, train loss = 5.0525 Iteration 1502, time = 1.61s, wps = 95427, train loss = 5.1288 Iteration 1522, time = 1.61s, wps = 95449, train loss = 5.0290 Iteration 1542, time = 1.61s, wps = 95330, train loss = 5.0411 Iteration 1562, time = 1.60s, wps = 96091, train loss = 5.0530 Iteration 1582, time = 1.62s, wps = 94898, train loss = 4.9772 Iteration 
1602, time = 1.59s, wps = 96590, train loss = 4.9448 Iteration 1622, time = 1.59s, wps = 96725, train loss = 5.0067 Iteration 1642, time = 1.61s, wps = 95331, train loss = 5.0449 Iteration 1662, time = 1.63s, wps = 94007, train loss = 4.9020 Iteration 1682, time = 1.62s, wps = 95104, train loss = 5.0125 Iteration 1702, time = 1.64s, wps = 93838, train loss = 4.8975 Iteration 1722, time = 1.63s, wps = 94266, train loss = 5.0013 Iteration 1742, time = 1.62s, wps = 94602, train loss = 4.9747 Iteration 1762, time = 1.62s, wps = 94810, train loss = 4.9266 Iteration 1782, time = 1.63s, wps = 94337, train loss = 4.9902 Iteration 1802, time = 1.64s, wps = 93640, train loss = 4.9599 Iteration 1822, time = 1.62s, wps = 94879, train loss = 4.9811 Iteration 1842, time = 1.63s, wps = 94516, train loss = 4.9562 Iteration 1862, time = 1.59s, wps = 96311, train loss = 4.9528 Iteration 1882, time = 1.60s, wps = 95709, train loss = 4.8788 Iteration 1902, time = 1.60s, wps = 96019, train loss = 4.8242 Iteration 1922, time = 1.60s, wps = 95836, train loss = 4.9220 Iteration 1942, time = 1.61s, wps = 95444, train loss = 4.9393 Iteration 1962, time = 1.62s, wps = 94614, train loss = 4.8862 Iteration 1982, time = 1.64s, wps = 93930, train loss = 4.8707 Iteration 2002, time = 1.61s, wps = 95127, train loss = 4.9438 Iteration 2022, time = 1.60s, wps = 96105, train loss = 4.8943 Iteration 2042, time = 1.62s, wps = 94994, train loss = 4.8641 Iteration 2062, time = 1.64s, wps = 93712, train loss = 4.8001 Iteration 2082, time = 1.61s, wps = 95165, train loss = 4.8427 Iteration 2102, time = 1.61s, wps = 95343, train loss = 4.8351 Iteration 2122, time = 1.63s, wps = 94054, train loss = 4.8179 Iteration 2142, time = 1.60s, wps = 96285, train loss = 4.7926 Iteration 2162, time = 1.59s, wps = 96326, train loss = 4.8550 Iteration 2182, time = 1.61s, wps = 95322, train loss = 4.8313 Iteration 2202, time = 1.64s, wps = 93941, train loss = 4.8912 Iteration 2222, time = 1.61s, wps = 95654, train loss = 
4.7619 Iteration 2242, time = 1.61s, wps = 95450, train loss = 4.8911 Iteration 2262, time = 1.62s, wps = 94663, train loss = 4.7896 Iteration 2282, time = 1.63s, wps = 94051, train loss = 4.8308 Iteration 2302, time = 1.60s, wps = 95769, train loss = 4.7886 Iteration 2322, time = 1.62s, wps = 94653, train loss = 4.8033 Iteration 2342, time = 1.61s, wps = 95517, train loss = 4.8260 Iteration 2362, time = 1.62s, wps = 95001, train loss = 4.8057 Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00019-of-00100 Finished processing! Iteration 2382, time = 3.64s, wps = 42154, train loss = 4.7251 Iteration 2402, time = 1.61s, wps = 95620, train loss = 4.7892 Iteration 2422, time = 1.57s, wps = 97875, train loss = 4.7161 Iteration 2442, time = 1.62s, wps = 95039, train loss = 4.7225 Iteration 2462, time = 1.60s, wps = 96227, train loss = 4.7577 Iteration 2482, time = 1.61s, wps = 95522, train loss = 4.7752 Iteration 2502, time = 1.60s, wps = 95882, train loss = 4.7282 Iteration 2522, time = 1.60s, wps = 96105, train loss = 4.7041 Iteration 2542, time = 1.61s, wps = 95429, train loss = 4.7373 Iteration 2562, time = 1.60s, wps = 95799, train loss = 4.7148 Iteration 2582, time = 1.62s, wps = 94784, train loss = 4.7381 Iteration 2602, time = 1.61s, wps = 95621, train loss = 4.7935 Iteration 2622, time = 1.62s, wps = 95066, train loss = 4.6724 Iteration 2642, time = 1.61s, wps = 95663, train loss = 4.7204 Iteration 2662, time = 1.63s, wps = 94380, train loss = 4.7476 Iteration 2682, time = 1.62s, wps = 94933, train loss = 4.6836 Iteration 2702, time = 1.61s, wps = 95386, train loss = 4.6810 Iteration 2722, time = 1.62s, wps = 94916, train loss = 4.6661 Iteration 2742, time = 1.60s, wps = 96124, train loss = 4.7495 Iteration 2762, time = 1.61s, wps = 95646, train loss = 4.6780 Iteration 2782, time = 1.61s, wps = 95241, train loss = 4.7231 Iteration 2802, time = 1.62s, wps = 94921, train loss = 4.6488 
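Comparing the two runs so far: the 4-GPU run sustained roughly 100k wps, while the 3-GPU run above holds roughly 95k. Since each iteration processes batch_size × num_steps tokens per GPU, aggregate wps need not drop linearly with GPU count; per-GPU throughput actually improves here (plausibly less contention on /gpu:0, which holds all the parameters, though that explanation is an inference, not something the log states). Approximate figures read off the log:

```python
# Approximate steady-state throughput read off the log (words/second).
runs_wps = {4: 100_000, 3: 95_000}

for n_gpus, wps in sorted(runs_wps.items(), reverse=True):
    print(f"{n_gpus} GPUs: {wps} wps total, {wps / n_gpus:.0f} wps per GPU")
```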
Iteration 2822, time = 1.60s, wps = 96053, train loss = 4.6366 Iteration 2842, time = 1.63s, wps = 94382, train loss = 4.6613 Iteration 2862, time = 1.61s, wps = 95696, train loss = 4.7292 Iteration 2882, time = 1.63s, wps = 94509, train loss = 4.6826 Iteration 2902, time = 1.60s, wps = 95955, train loss = 4.7550 Iteration 2922, time = 1.62s, wps = 94903, train loss = 4.7103 Iteration 2942, time = 1.60s, wps = 95710, train loss = 4.6835 Iteration 2962, time = 1.59s, wps = 96636, train loss = 4.5716 Iteration 2982, time = 1.62s, wps = 94531, train loss = 4.6996 Iteration 3002, time = 1.60s, wps = 95834, train loss = 4.5720 Iteration 3022, time = 1.61s, wps = 95587, train loss = 4.6635 Iteration 3042, time = 1.61s, wps = 95485, train loss = 4.6688 Iteration 3062, time = 1.59s, wps = 96531, train loss = 4.6574 Iteration 3082, time = 1.60s, wps = 96263, train loss = 4.6225 Iteration 3102, time = 1.63s, wps = 94244, train loss = 4.6616 Iteration 3122, time = 1.63s, wps = 94057, train loss = 4.5951 Iteration 3142, time = 1.63s, wps = 94260, train loss = 4.6475 Iteration 3162, time = 1.62s, wps = 94979, train loss = 4.7235 Iteration 3182, time = 1.62s, wps = 94698, train loss = 4.6778 Iteration 3202, time = 1.63s, wps = 94206, train loss = 4.5874 /usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. " real 3m14.976s user 14m59.627s sys 1m18.592s root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. 
For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue. *****HYPER PARAMETERS***** {'projected_size': 512, 'state_size': 2048, 'num_gpus': 2, 'do_summaries': False, 'num_delayed_steps': 150, 'max_grad_norm': 10.0, 'keep_prob': 0.9, 'batch_size': 128, 'num_steps': 20, 'emb_size': 512, 'num_sampled': 8192, 'run_profiler': False, 'max_time': 180, 'num_shards': 8, 'average_params': True, 'optimizer': 0, 'vocab_size': 793470, 'num_layers': 1, 'learning_rate': 0.2} ************************** WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. Current time: 1592527453.6144607 ALL VARIABLES WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Please use tf.global_variables instead. model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 model/global_step:0 () model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0 
model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0 model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0 model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_b/Adagrad:0 (793470,) /gpu:0 model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0 TRAINABLE VARIABLES model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 LOCAL VARIABLES model/model/state_0_0:0 (128, 2560) /gpu:0 model/model_1/state_1_0:0 (128, 2560) /gpu:1 WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will 
be removed in a future version. Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-06-19 00:44:14.139028: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2994465000 Hz 2020-06-19 00:44:14.140958: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x7b74a60 executing computations on platform Host. Devices: 2020-06-19 00:44:14.140988: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2020-06-19 00:44:14.678801: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x7b74480 executing computations on platform CUDA. Devices: 2020-06-19 00:44:14.678856: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5 2020-06-19 00:44:14.678868: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): TITAN RTX, Compute Capability 7.5 2020-06-19 00:44:14.678878: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-06-19 00:44:14.678889: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-06-19 00:44:14.680241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:03:00.0 totalMemory: 23.65GiB freeMemory: 23.22GiB 2020-06-19 00:44:14.680276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 totalMemory: 23.65GiB freeMemory: 23.49GiB 2020-06-19 00:44:14.680303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:41:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-06-19 00:44:14.680330: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:61:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-06-19 00:44:14.680359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3 2020-06-19 00:44:15.488600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-19 00:44:15.488647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 2020-06-19 00:44:15.488654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N N 2020-06-19 00:44:15.488658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N N N 2020-06-19 00:44:15.488663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N N N N 2020-06-19 00:44:15.488667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: N N N N 2020-06-19 00:44:15.488853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22507 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:03:00.0, compute capability: 7.5) 2020-06-19 00:44:15.489195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5) 2020-06-19 00:44:15.489520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5) 2020-06-19 00:44:15.489651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical 
GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:61:00.0, compute capability: 7.5) WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00017-of-00100 Finished processing! 2020-06-19 00:44:27.397383: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally Iteration 3207, time = 6.79s, wps = 754, train loss = 4.9671 Iteration 3208, time = 4.38s, wps = 1170, train loss = 4.6902 Iteration 3209, time = 0.08s, wps = 65984, train loss = 4.6328 Iteration 3210, time = 0.07s, wps = 69629, train loss = 4.5938 Iteration 3211, time = 0.07s, wps = 75853, train loss = 4.4936 Iteration 3212, time = 0.06s, wps = 80432, train loss = 4.5689 Iteration 3213, time = 0.07s, wps = 76388, train loss = 4.5002 Iteration 3214, time = 0.06s, wps = 79909, train loss = 4.5909 Iteration 3215, time = 0.07s, wps = 77746, train loss = 4.6077 Iteration 3226, time = 0.72s, wps = 78612, train loss = 4.5529 Iteration 3246, time = 1.34s, wps = 76569, train loss = 4.6618 Iteration 3266, time = 1.32s, wps = 77559, train loss = 4.6730 Iteration 3286, time = 1.32s, wps = 77349, train loss = 4.5694 Iteration 3306, time = 1.30s, wps = 79072, train loss = 4.5892 Iteration 3326, time = 1.31s, wps = 78265, train loss = 4.6499 Iteration 3346, time = 1.32s, wps = 
77562, train loss = 4.6144 Iteration 3366, time = 1.29s, wps = 79553, train loss = 4.6070 Iteration 3386, time = 1.30s, wps = 79025, train loss = 4.6434 Iteration 3406, time = 1.32s, wps = 77676, train loss = 4.6052 Iteration 3426, time = 1.31s, wps = 78345, train loss = 4.5641 Iteration 3446, time = 1.31s, wps = 78172, train loss = 4.5784 Iteration 3466, time = 1.32s, wps = 77602, train loss = 4.5932 Iteration 3486, time = 1.30s, wps = 78493, train loss = 4.5165 Iteration 3506, time = 1.31s, wps = 78447, train loss = 4.6642 Iteration 3526, time = 1.31s, wps = 78085, train loss = 4.6575 Iteration 3546, time = 1.32s, wps = 77761, train loss = 4.5922 Iteration 3566, time = 1.31s, wps = 78177, train loss = 4.6412 Iteration 3586, time = 1.32s, wps = 77426, train loss = 4.5636 Iteration 3606, time = 1.33s, wps = 76988, train loss = 4.6264 Iteration 3626, time = 1.31s, wps = 78070, train loss = 4.5636 Iteration 3646, time = 1.31s, wps = 78132, train loss = 4.5469 Iteration 3666, time = 1.32s, wps = 77871, train loss = 4.5744 Iteration 3686, time = 1.32s, wps = 77834, train loss = 4.6390 Iteration 3706, time = 1.31s, wps = 78436, train loss = 4.6033 Iteration 3726, time = 1.31s, wps = 78067, train loss = 4.4652 Iteration 3746, time = 1.30s, wps = 78634, train loss = 4.5661 Iteration 3766, time = 1.32s, wps = 77473, train loss = 4.5721 Iteration 3786, time = 1.30s, wps = 78508, train loss = 4.5445 Iteration 3806, time = 1.31s, wps = 78049, train loss = 4.5810 Iteration 3826, time = 1.31s, wps = 78173, train loss = 4.5604 Iteration 3846, time = 1.32s, wps = 77398, train loss = 4.5185 Iteration 3866, time = 1.29s, wps = 79300, train loss = 4.5388 Iteration 3886, time = 1.30s, wps = 78648, train loss = 4.5701 Iteration 3906, time = 1.31s, wps = 77975, train loss = 4.5581 Iteration 3926, time = 1.31s, wps = 78369, train loss = 4.6299 Iteration 3946, time = 1.33s, wps = 77080, train loss = 4.5392 Iteration 3966, time = 1.32s, wps = 77718, train loss = 4.5799 Iteration 3986, 
time = 1.30s, wps = 78686, train loss = 4.4792 Iteration 4006, time = 1.31s, wps = 78190, train loss = 4.5170 Iteration 4026, time = 1.31s, wps = 77881, train loss = 4.5674 Iteration 4046, time = 1.31s, wps = 78143, train loss = 4.4985 Iteration 4066, time = 1.30s, wps = 78878, train loss = 4.4853 Iteration 4086, time = 1.31s, wps = 78210, train loss = 4.5398 Iteration 4106, time = 1.31s, wps = 78074, train loss = 4.4516 Iteration 4126, time = 1.30s, wps = 79037, train loss = 4.5131 Iteration 4146, time = 1.32s, wps = 77762, train loss = 4.5020 Iteration 4166, time = 1.29s, wps = 79499, train loss = 4.4626 Iteration 4186, time = 1.30s, wps = 78833, train loss = 4.5054 Iteration 4206, time = 1.30s, wps = 78598, train loss = 4.5108 Iteration 4226, time = 1.31s, wps = 78337, train loss = 4.5926 Iteration 4246, time = 1.31s, wps = 77958, train loss = 4.5264 Iteration 4266, time = 1.33s, wps = 76812, train loss = 4.5411 Iteration 4286, time = 1.29s, wps = 79466, train loss = 4.4682 Iteration 4306, time = 1.35s, wps = 75953, train loss = 4.5436 Iteration 4326, time = 1.31s, wps = 78025, train loss = 4.5273 Iteration 4346, time = 1.32s, wps = 77486, train loss = 4.5264 Iteration 4366, time = 1.31s, wps = 78350, train loss = 4.4718 Iteration 4386, time = 1.32s, wps = 77825, train loss = 4.4000 Iteration 4406, time = 1.33s, wps = 76898, train loss = 4.5149 Iteration 4426, time = 1.32s, wps = 77790, train loss = 4.4575 Iteration 4446, time = 1.33s, wps = 77189, train loss = 4.4788 Iteration 4466, time = 1.32s, wps = 77729, train loss = 4.4414 Iteration 4486, time = 1.32s, wps = 77684, train loss = 4.5001 Iteration 4506, time = 1.32s, wps = 77660, train loss = 4.4957 Iteration 4526, time = 1.32s, wps = 77573, train loss = 4.5535 Iteration 4546, time = 1.33s, wps = 77259, train loss = 4.5492 Iteration 4566, time = 1.33s, wps = 77247, train loss = 4.4447 Iteration 4586, time = 1.33s, wps = 77258, train loss = 4.5158 Iteration 4606, time = 1.33s, wps = 77027, train loss = 4.3700 
Iteration 4626, time = 1.31s, wps = 77878, train loss = 4.4789 Iteration 4646, time = 1.30s, wps = 78871, train loss = 4.4896 Iteration 4666, time = 1.32s, wps = 77564, train loss = 4.4614 Iteration 4686, time = 1.31s, wps = 78243, train loss = 4.5247 Iteration 4706, time = 1.32s, wps = 77605, train loss = 4.4675 Iteration 4726, time = 1.31s, wps = 78073, train loss = 4.4827 Iteration 4746, time = 1.31s, wps = 78453, train loss = 4.5165 Iteration 4766, time = 1.30s, wps = 78588, train loss = 4.4886 Iteration 4786, time = 1.30s, wps = 78609, train loss = 4.5339 Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00007-of-00100 Finished processing! Iteration 4806, time = 3.34s, wps = 30624, train loss = 4.5367 Iteration 4826, time = 1.33s, wps = 77147, train loss = 4.4584 Iteration 4846, time = 1.30s, wps = 78813, train loss = 4.5210 Iteration 4866, time = 1.32s, wps = 77866, train loss = 4.4817 Iteration 4886, time = 1.31s, wps = 78466, train loss = 4.5222 Iteration 4906, time = 1.32s, wps = 77447, train loss = 4.4257 Iteration 4926, time = 1.32s, wps = 77480, train loss = 4.4422 Iteration 4946, time = 1.31s, wps = 78376, train loss = 4.4473 Iteration 4966, time = 1.32s, wps = 77363, train loss = 4.4815 Iteration 4986, time = 1.32s, wps = 77772, train loss = 4.4429 Iteration 5006, time = 1.31s, wps = 77953, train loss = 4.4607 Iteration 5026, time = 1.31s, wps = 78234, train loss = 4.4802 Iteration 5046, time = 1.31s, wps = 78189, train loss = 4.3517 Iteration 5066, time = 1.32s, wps = 77554, train loss = 4.5284 Iteration 5086, time = 1.33s, wps = 76828, train loss = 4.5073 Iteration 5106, time = 1.32s, wps = 77428, train loss = 4.4899 Iteration 5126, time = 1.32s, wps = 77777, train loss = 4.4220 Iteration 5146, time = 1.32s, wps = 77458, train loss = 4.3505 Iteration 5166, time = 1.32s, wps = 77506, train loss = 4.5463 Iteration 5186, time = 1.31s, wps = 78114, train loss = 4.4079 Iteration 
5206, time = 1.31s, wps = 78194, train loss = 4.5861 Iteration 5226, time = 1.32s, wps = 77853, train loss = 4.4926 Iteration 5246, time = 1.34s, wps = 76501, train loss = 4.4635 Iteration 5266, time = 1.33s, wps = 76755, train loss = 4.4937 Iteration 5286, time = 1.32s, wps = 77568, train loss = 4.3780 Iteration 5306, time = 1.32s, wps = 77656, train loss = 4.4497 Iteration 5326, time = 1.33s, wps = 77276, train loss = 4.4229 Iteration 5346, time = 1.32s, wps = 77391, train loss = 4.3993 Iteration 5366, time = 1.34s, wps = 76586, train loss = 4.3766 Iteration 5386, time = 1.32s, wps = 77486, train loss = 4.3824 Iteration 5406, time = 1.31s, wps = 78436, train loss = 4.3761 Iteration 5426, time = 1.32s, wps = 77495, train loss = 4.4359 Iteration 5446, time = 1.32s, wps = 77698, train loss = 4.3256 Iteration 5466, time = 1.34s, wps = 76395, train loss = 4.4253 Iteration 5486, time = 1.31s, wps = 78003, train loss = 4.4403 Iteration 5506, time = 1.31s, wps = 77888, train loss = 4.4593 Iteration 5526, time = 1.34s, wps = 76595, train loss = 4.4272 Iteration 5546, time = 1.31s, wps = 78105, train loss = 4.3994 Iteration 5566, time = 1.31s, wps = 78021, train loss = 4.3878 Iteration 5586, time = 1.32s, wps = 77656, train loss = 4.3195 Iteration 5606, time = 1.32s, wps = 77535, train loss = 4.4699 Iteration 5626, time = 1.32s, wps = 77836, train loss = 4.3903 /usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. " real 3m12.953s user 11m35.279s sys 1m9.390s root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# time python single_lm_train.py --mode=train --logdir=./logs --num_gpus=1 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0. 
For more information, please see: * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md * https://github.com/tensorflow/addons If you depend on functionality not listed there, please file an issue. *****HYPER PARAMETERS***** {'optimizer': 0, 'learning_rate': 0.2, 'do_summaries': False, 'num_steps': 20, 'keep_prob': 0.9, 'num_layers': 1, 'batch_size': 128, 'vocab_size': 793470, 'run_profiler': False, 'num_sampled': 8192, 'num_delayed_steps': 150, 'projected_size': 512, 'max_time': 180, 'max_grad_norm': 10.0, 'average_params': True, 'state_size': 2048, 'num_shards': 8, 'emb_size': 512, 'num_gpus': 1} ************************** WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/model_utils.py:33: UniformUnitScaling.__init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:75: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`. WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/language_model.py:107: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. 
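The `wps` (words per second) figures in the iteration lines of this transcript follow directly from the printed hyperparameters: each training step consumes `batch_size * num_steps` tokens per GPU. A minimal sketch of that arithmetic (the helper name is an assumption, not part of the example code):

```python
def words_per_second(batch_size, num_steps, num_gpus, step_time_s):
    """Tokens consumed per second across all GPUs for one training step."""
    return batch_size * num_steps * num_gpus / step_time_s

# With batch_size=128 and num_steps=20 from the hyperparameter dump:
# the earlier 2-GPU run logs ~1.31 s per 20 iterations (~0.0655 s/step, ~78k wps),
# while this 1-GPU run logs ~1.04 s per 20 iterations (~0.052 s/step, ~49k wps).
two_gpu = words_per_second(128, 20, 2, 1.31 / 20)
one_gpu = words_per_second(128, 20, 1, 1.04 / 20)
print(round(two_gpu), round(one_gpu))
```

This is why halving `num_gpus` roughly halves the logged `wps` even though per-step time stays close to one second.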
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_impl.py:1444: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version. Instructions for updating: Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_grad.py:425: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. Current time: 1592528499.4209394 ALL VARIABLES WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:18: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Please use tf.global_variables instead. model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 model/global_step:0 () model/model/emb_0/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_1/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_2/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_3/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_4/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_5/Adagrad:0 (99184, 512) /gpu:0 model/model/emb_6/Adagrad:0 (99184, 512) /gpu:0 
model/model/emb_7/Adagrad:0 (99184, 512) /gpu:0 model/model/lstm_0/LSTMCell/W_0/Adagrad:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/Adagrad:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/Adagrad:0 (2048, 512) /gpu:0 model/model/softmax_w_0/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_1/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_2/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_3/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_4/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_5/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_6/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_w_7/Adagrad:0 (99184, 512) /gpu:0 model/model/softmax_b/Adagrad:0 (793470,) /gpu:0 model/model/lstm_0/LSTMCell/W_0/ExponentialMovingAverage:0 (1024, 8192) /gpu:0 model/model/lstm_0/LSTMCell/B/ExponentialMovingAverage:0 (8192,) /gpu:0 model/model/lstm_0/LSTMCell/W_P_0/ExponentialMovingAverage:0 (2048, 512) /gpu:0 TRAINABLE VARIABLES model/emb_0:0 (99184, 512) /gpu:0 model/emb_1:0 (99184, 512) /gpu:0 model/emb_2:0 (99184, 512) /gpu:0 model/emb_3:0 (99184, 512) /gpu:0 model/emb_4:0 (99184, 512) /gpu:0 model/emb_5:0 (99184, 512) /gpu:0 model/emb_6:0 (99184, 512) /gpu:0 model/emb_7:0 (99184, 512) /gpu:0 model/lstm_0/LSTMCell/W_0:0 (1024, 8192) /gpu:0 model/lstm_0/LSTMCell/B:0 (8192,) /gpu:0 model/lstm_0/LSTMCell/W_P_0:0 (2048, 512) /gpu:0 model/softmax_w_0:0 (99184, 512) /gpu:0 model/softmax_w_1:0 (99184, 512) /gpu:0 model/softmax_w_2:0 (99184, 512) /gpu:0 model/softmax_w_3:0 (99184, 512) /gpu:0 model/softmax_w_4:0 (99184, 512) /gpu:0 model/softmax_w_5:0 (99184, 512) /gpu:0 model/softmax_w_6:0 (99184, 512) /gpu:0 model/softmax_w_7:0 (99184, 512) /gpu:0 model/softmax_b:0 (793470,) /gpu:0 LOCAL VARIABLES model/model/state_0_0:0 (128, 2560) /gpu:0 WARNING:tensorflow:From /opt/tensorflow/nvidia-examples/big_lstm/run_utils.py:32: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version. 
Instructions for updating: Please switch to tf.train.MonitoredTrainingSession 2020-06-19 01:01:39.669019: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2994465000 Hz 2020-06-19 01:01:39.671117: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x74e1d40 executing computations on platform Host. Devices: 2020-06-19 01:01:39.671145: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2020-06-19 01:01:40.190007: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x74bcd60 executing computations on platform CUDA. Devices: 2020-06-19 01:01:40.190042: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): TITAN RTX, Compute Capability 7.5 2020-06-19 01:01:40.190050: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): TITAN RTX, Compute Capability 7.5 2020-06-19 01:01:40.190058: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-06-19 01:01:40.190067: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5 2020-06-19 01:01:40.191515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:03:00.0 totalMemory: 23.65GiB freeMemory: 23.22GiB 2020-06-19 01:01:40.191555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77 pciBusID: 0000:21:00.0 totalMemory: 23.65GiB freeMemory: 23.49GiB 2020-06-19 01:01:40.191584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:41:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-06-19 01:01:40.191612: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635 pciBusID: 0000:61:00.0 totalMemory: 10.76GiB freeMemory: 10.61GiB 2020-06-19 01:01:40.191639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3 2020-06-19 01:01:41.011511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-19 01:01:41.011564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 2020-06-19 01:01:41.011571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N N N N 2020-06-19 01:01:41.011575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N N N N 2020-06-19 01:01:41.011579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: N N N N 2020-06-19 01:01:41.011584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: N N N N 2020-06-19 01:01:41.011756: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22508 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:03:00.0, compute capability: 7.5) 2020-06-19 01:01:41.012072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22765 MB memory) -> physical GPU (device: 1, name: TITAN RTX, pci bus id: 0000:21:00.0, compute capability: 7.5) 2020-06-19 01:01:41.012376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10231 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5) 2020-06-19 01:01:41.012507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10231 MB memory) -> physical 
GPU (device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:61:00.0, compute capability: 7.5) WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file APIs to check for files with this prefix. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version. Instructions for updating: Use standard file utilities to get mtimes. Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00053-of-00100 Finished processing! 2020-06-19 01:01:49.678599: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally Iteration 5632, time = 4.39s, wps = 583, train loss = 4.5950 Iteration 5633, time = 2.17s, wps = 1179, train loss = 4.3710 Iteration 5634, time = 0.06s, wps = 42077, train loss = 4.4637 Iteration 5635, time = 0.06s, wps = 43216, train loss = 4.4968 Iteration 5636, time = 0.06s, wps = 46384, train loss = 4.4530 Iteration 5637, time = 0.05s, wps = 49309, train loss = 4.4553 Iteration 5638, time = 0.05s, wps = 48821, train loss = 4.4319 Iteration 5639, time = 0.05s, wps = 48903, train loss = 4.3708 Iteration 5640, time = 0.05s, wps = 49282, train loss = 4.3975 Iteration 5651, time = 0.56s, wps = 50287, train loss = 4.4348 Iteration 5671, time = 1.04s, wps = 49013, train loss = 4.5061 Iteration 5691, time = 1.03s, wps = 49634, train loss = 4.4728 Iteration 5711, time = 1.05s, wps = 48699, train loss = 4.4699 Iteration 5731, time = 1.03s, wps = 49537, train loss = 4.4942 Iteration 5751, time = 1.05s, wps = 48958, train loss = 4.4631 Iteration 5771, time = 1.06s, wps = 
48478, train loss = 4.4402 Iteration 5791, time = 1.04s, wps = 49022, train loss = 4.4523 Iteration 5811, time = 1.05s, wps = 48701, train loss = 4.5094 Iteration 5831, time = 1.04s, wps = 49314, train loss = 4.3123 Iteration 5851, time = 1.04s, wps = 49144, train loss = 4.3729 Iteration 5871, time = 1.02s, wps = 50027, train loss = 4.5390 Iteration 5891, time = 1.05s, wps = 48566, train loss = 4.5881 Iteration 5911, time = 1.06s, wps = 48351, train loss = 4.3408 Iteration 5931, time = 1.05s, wps = 48711, train loss = 4.3848 Iteration 5951, time = 1.04s, wps = 49087, train loss = 4.3839 Iteration 5971, time = 1.05s, wps = 48700, train loss = 4.5645 Iteration 5991, time = 1.04s, wps = 49126, train loss = 4.4373 Iteration 6011, time = 1.04s, wps = 49304, train loss = 4.3669 Iteration 6031, time = 1.06s, wps = 48528, train loss = 4.4664 Iteration 6051, time = 1.04s, wps = 49312, train loss = 4.4684 Iteration 6071, time = 1.04s, wps = 49057, train loss = 4.4357 Iteration 6091, time = 1.03s, wps = 49610, train loss = 4.4446 Iteration 6111, time = 1.06s, wps = 48349, train loss = 4.4090 Iteration 6131, time = 1.05s, wps = 48815, train loss = 4.4442 Iteration 6151, time = 1.04s, wps = 49394, train loss = 4.3645 Iteration 6171, time = 1.04s, wps = 49406, train loss = 4.5163 Iteration 6191, time = 1.05s, wps = 48531, train loss = 4.4372 Iteration 6211, time = 1.04s, wps = 49101, train loss = 4.4507 Iteration 6231, time = 1.04s, wps = 49044, train loss = 4.6072 Iteration 6251, time = 1.06s, wps = 48525, train loss = 4.4662 Iteration 6271, time = 1.03s, wps = 49482, train loss = 4.4670 Iteration 6291, time = 1.05s, wps = 48635, train loss = 4.4644 Iteration 6311, time = 1.05s, wps = 48995, train loss = 4.4431 Iteration 6331, time = 1.05s, wps = 48798, train loss = 4.4891 Iteration 6351, time = 1.03s, wps = 49482, train loss = 4.3737 Iteration 6371, time = 1.04s, wps = 49131, train loss = 4.3803 Iteration 6391, time = 1.05s, wps = 48787, train loss = 4.4592 Iteration 6411, 
time = 1.03s, wps = 49556, train loss = 4.4513 Iteration 6431, time = 1.05s, wps = 48912, train loss = 4.3617 Iteration 6451, time = 1.06s, wps = 48381, train loss = 4.3443 Iteration 6471, time = 1.04s, wps = 49028, train loss = 4.3906 Iteration 6491, time = 1.05s, wps = 48947, train loss = 4.3924 Iteration 6511, time = 1.06s, wps = 48462, train loss = 4.4957 Iteration 6531, time = 1.03s, wps = 49766, train loss = 4.3212 Iteration 6551, time = 1.04s, wps = 49438, train loss = 4.4303 Iteration 6571, time = 1.04s, wps = 49071, train loss = 4.3489 Iteration 6591, time = 1.04s, wps = 49355, train loss = 4.4810 Iteration 6611, time = 1.02s, wps = 50041, train loss = 4.3533 Iteration 6631, time = 1.06s, wps = 48119, train loss = 4.4840 Iteration 6651, time = 1.04s, wps = 49103, train loss = 4.3768 Iteration 6671, time = 1.05s, wps = 48722, train loss = 4.3643 Iteration 6691, time = 1.04s, wps = 49184, train loss = 4.3551 Iteration 6711, time = 1.05s, wps = 48815, train loss = 4.4002 Iteration 6731, time = 1.05s, wps = 48851, train loss = 4.3160 Iteration 6751, time = 1.04s, wps = 49294, train loss = 4.3013 Iteration 6771, time = 1.04s, wps = 49407, train loss = 4.4955 Iteration 6791, time = 1.04s, wps = 49144, train loss = 4.3671 Iteration 6811, time = 1.05s, wps = 48989, train loss = 4.3513 Iteration 6831, time = 1.06s, wps = 48353, train loss = 4.3238 Iteration 6851, time = 1.04s, wps = 49293, train loss = 4.4181 Iteration 6871, time = 1.05s, wps = 48836, train loss = 4.4145 Iteration 6891, time = 1.03s, wps = 49553, train loss = 4.4250 Iteration 6911, time = 1.04s, wps = 49326, train loss = 4.4649 Iteration 6931, time = 1.05s, wps = 48979, train loss = 4.4224 Iteration 6951, time = 1.03s, wps = 49489, train loss = 4.2975 Iteration 6971, time = 1.05s, wps = 48843, train loss = 4.3443 Iteration 6991, time = 1.05s, wps = 48577, train loss = 4.4178 Iteration 7011, time = 1.04s, wps = 49106, train loss = 4.4794 Iteration 7031, time = 1.04s, wps = 49242, train loss = 4.4613 
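The iteration lines above have a fixed format, so throughput over a run can be summarized mechanically. A hedged sketch (the regex and helper names are assumptions for illustration, not part of the NVIDIA example scripts):

```python
import re

# Matches lines like: "Iteration 6411, time = 1.03s, wps = 49556, train loss = 4.4513"
LINE_RE = re.compile(
    r"Iteration (\d+), time = ([\d.]+)s, wps = (\d+), train loss = ([\d.]+)"
)

def summarize(log_text):
    """Return (mean wps, final train loss) over all matched iteration lines."""
    rows = [(int(i), float(t), int(w), float(l))
            for i, t, w, l in LINE_RE.findall(log_text)]
    mean_wps = sum(w for _, _, w, _ in rows) / len(rows)
    return mean_wps, rows[-1][3]

sample = ("Iteration 6531, time = 1.03s, wps = 49766, train loss = 4.3212\n"
          "Iteration 6551, time = 1.04s, wps = 49438, train loss = 4.4303\n")
print(summarize(sample))  # -> (49602.0, 4.4303)
```

When averaging a real run, the first two iterations after a restart (e.g. the 583 and 1179 wps outliers above, dominated by graph setup and checkpoint restore) should be excluded.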
Iteration 7051, time = 1.05s, wps = 48615, train loss = 4.4936 Iteration 7071, time = 1.06s, wps = 48384, train loss = 4.4590 Iteration 7091, time = 1.05s, wps = 48854, train loss = 4.4129 Iteration 7111, time = 1.04s, wps = 49287, train loss = 4.4976 Iteration 7131, time = 1.05s, wps = 48604, train loss = 4.3716 Iteration 7151, time = 1.04s, wps = 49177, train loss = 4.3687 Iteration 7171, time = 1.04s, wps = 49296, train loss = 4.3863 Iteration 7191, time = 1.06s, wps = 48194, train loss = 4.3923 Iteration 7211, time = 1.04s, wps = 49037, train loss = 4.4918 Iteration 7231, time = 1.03s, wps = 49497, train loss = 4.3135 Iteration 7251, time = 1.05s, wps = 48922, train loss = 4.3603 Iteration 7271, time = 1.05s, wps = 48705, train loss = 4.3768 Iteration 7291, time = 1.04s, wps = 49256, train loss = 4.3616 Iteration 7311, time = 1.05s, wps = 48940, train loss = 4.3967 Iteration 7331, time = 1.04s, wps = 49115, train loss = 4.3766 Iteration 7351, time = 1.06s, wps = 48128, train loss = 4.2992 Iteration 7371, time = 1.04s, wps = 49442, train loss = 4.4080 Iteration 7391, time = 1.06s, wps = 48347, train loss = 4.3798 Iteration 7411, time = 1.04s, wps = 49020, train loss = 4.3259 Iteration 7431, time = 1.05s, wps = 48572, train loss = 4.3273 Iteration 7451, time = 1.05s, wps = 48984, train loss = 4.4837 Iteration 7471, time = 1.05s, wps = 48787, train loss = 4.4368 Iteration 7491, time = 1.04s, wps = 49235, train loss = 4.3174 Iteration 7511, time = 1.05s, wps = 48689, train loss = 4.3694 Iteration 7531, time = 1.04s, wps = 49331, train loss = 4.4079 Iteration 7551, time = 1.06s, wps = 48310, train loss = 4.3200 Iteration 7571, time = 1.06s, wps = 48280, train loss = 4.4249 Iteration 7591, time = 1.05s, wps = 48587, train loss = 4.3295 Iteration 7611, time = 1.04s, wps = 49287, train loss = 4.3598 Iteration 7631, time = 1.04s, wps = 49078, train loss = 4.3653 Iteration 7651, time = 1.06s, wps = 48481, train loss = 4.3119 Iteration 7671, time = 1.04s, wps = 49140, 
train loss = 4.4893 Iteration 7691, time = 1.04s, wps = 49032, train loss = 4.4302 Iteration 7711, time = 1.05s, wps = 48925, train loss = 4.3152 Iteration 7731, time = 1.05s, wps = 48791, train loss = 4.2852 Iteration 7751, time = 1.06s, wps = 48455, train loss = 4.3278 Iteration 7771, time = 1.05s, wps = 48730, train loss = 4.4293 Iteration 7791, time = 1.04s, wps = 49064, train loss = 4.3671 Iteration 7811, time = 1.05s, wps = 48830, train loss = 4.4924 Iteration 7831, time = 1.06s, wps = 48324, train loss = 4.3186 Iteration 7851, time = 1.05s, wps = 48965, train loss = 4.3269 Iteration 7871, time = 1.05s, wps = 48692, train loss = 4.3145 Iteration 7891, time = 1.05s, wps = 48850, train loss = 4.4000 Iteration 7911, time = 1.05s, wps = 48728, train loss = 4.4292 Iteration 7931, time = 1.05s, wps = 48987, train loss = 4.4746 Iteration 7951, time = 1.06s, wps = 48214, train loss = 4.3613 Iteration 7971, time = 1.07s, wps = 47937, train loss = 4.2739 Iteration 7991, time = 1.05s, wps = 48748, train loss = 4.3963 Iteration 8011, time = 1.04s, wps = 49006, train loss = 4.2758 Iteration 8031, time = 1.05s, wps = 48651, train loss = 4.3292 Iteration 8051, time = 1.07s, wps = 48043, train loss = 4.3833 Iteration 8071, time = 1.06s, wps = 48444, train loss = 4.4287 Iteration 8091, time = 1.06s, wps = 48520, train loss = 4.2462 Iteration 8111, time = 1.06s, wps = 48268, train loss = 4.2364 Iteration 8131, time = 1.05s, wps = 48660, train loss = 4.3336 Iteration 8151, time = 1.05s, wps = 48546, train loss = 4.2138 Iteration 8171, time = 1.05s, wps = 48550, train loss = 4.2971 Iteration 8191, time = 1.05s, wps = 48653, train loss = 4.3624 Iteration 8211, time = 1.05s, wps = 48945, train loss = 4.3140 Iteration 8231, time = 1.05s, wps = 48681, train loss = 4.4572 Iteration 8251, time = 1.05s, wps = 48911, train loss = 4.3352 Iteration 8271, time = 1.06s, wps = 48262, train loss = 4.2418 Iteration 8291, time = 1.08s, wps = 47594, train loss = 4.2729 Iteration 8311, time = 
1.06s, wps = 48309, train loss = 4.2081 Iteration 8331, time = 1.05s, wps = 48859, train loss = 4.3871 Iteration 8351, time = 1.05s, wps = 48858, train loss = 4.4224 Iteration 8371, time = 1.07s, wps = 47729, train loss = 4.3640 Iteration 8391, time = 1.06s, wps = 48137, train loss = 4.2892 Iteration 8411, time = 1.05s, wps = 48577, train loss = 4.3134 Iteration 8431, time = 1.07s, wps = 48055, train loss = 4.3711 Iteration 8451, time = 1.06s, wps = 48305, train loss = 4.2093 Iteration 8471, time = 1.07s, wps = 47809, train loss = 4.3716 Iteration 8491, time = 1.05s, wps = 48912, train loss = 4.4062 Iteration 8511, time = 1.06s, wps = 48441, train loss = 4.2194 Iteration 8531, time = 1.05s, wps = 48711, train loss = 4.3395 Iteration 8551, time = 1.09s, wps = 47141, train loss = 4.3177 Iteration 8571, time = 1.07s, wps = 47700, train loss = 4.3391 Iteration 8591, time = 1.06s, wps = 48124, train loss = 4.4332 Iteration 8611, time = 1.07s, wps = 47700, train loss = 4.3795 Iteration 8631, time = 1.07s, wps = 48050, train loss = 4.3018 Iteration 8651, time = 1.06s, wps = 48206, train loss = 4.2883 Iteration 8671, time = 1.06s, wps = 48122, train loss = 4.2287 Iteration 8691, time = 1.07s, wps = 47869, train loss = 4.2646 Iteration 8711, time = 1.07s, wps = 47809, train loss = 4.3424 Iteration 8731, time = 1.06s, wps = 48224, train loss = 4.3940 Iteration 8751, time = 1.07s, wps = 48039, train loss = 4.3181 Iteration 8771, time = 1.07s, wps = 47695, train loss = 4.2187 Processing file: ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00082-of-00100 Finished processing! /usr/local/lib/python3.5/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened. warnings.warn("Attempting to use a closed FileWriter. 
" real 3m11.082s user 7m22.917s sys 0m45.684s root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# cat /etc/os-release NAME="Ubuntu" VERSION="16.04.6 LTS (Xenial Xerus)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 16.04.6 LTS" VERSION_ID="16.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/" VERSION_CODENAME=xenial UBUNTU_CODENAME=xenial root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# nvcc -V nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2019 NVIDIA Corporation Built on Fri_Feb__8_19:08:17_PST_2019 Cuda compilation tools, release 10.1, V10.1.105 root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm# cd data root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm/data# ls 1-billion-word-language-modeling-benchmark-r13output root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm/data# cd 1-billion-word-language-modeling-benchmark-r13output root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output# ls 1b_word_vocab.txt heldout-monolingual.tokenized.shuffled README training-monolingual.tokenized.shuffled root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output# cd training-monolingual.tokenized.shuffled root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled# ls news.en-00001-of-00100 news.en-00034-of-00100 news.en-00067-of-00100 news.en-00002-of-00100 news.en-00035-of-00100 news.en-00068-of-00100 news.en-00003-of-00100 news.en-00036-of-00100 news.en-00069-of-00100 news.en-00004-of-00100 news.en-00037-of-00100 news.en-00070-of-00100 news.en-00005-of-00100 news.en-00038-of-00100 news.en-00071-of-00100 news.en-00006-of-00100 news.en-00039-of-00100 news.en-00072-of-00100 news.en-00007-of-00100 news.en-00040-of-00100 news.en-00073-of-00100 news.en-00008-of-00100 
news.en-00041-of-00100 news.en-00074-of-00100 news.en-00009-of-00100 news.en-00042-of-00100 news.en-00075-of-00100 news.en-00010-of-00100 news.en-00043-of-00100 news.en-00076-of-00100 news.en-00011-of-00100 news.en-00044-of-00100 news.en-00077-of-00100 news.en-00012-of-00100 news.en-00045-of-00100 news.en-00078-of-00100 news.en-00013-of-00100 news.en-00046-of-00100 news.en-00079-of-00100 news.en-00014-of-00100 news.en-00047-of-00100 news.en-00080-of-00100 news.en-00015-of-00100 news.en-00048-of-00100 news.en-00081-of-00100 news.en-00016-of-00100 news.en-00049-of-00100 news.en-00082-of-00100 news.en-00017-of-00100 news.en-00050-of-00100 news.en-00083-of-00100 news.en-00018-of-00100 news.en-00051-of-00100 news.en-00084-of-00100 news.en-00019-of-00100 news.en-00052-of-00100 news.en-00085-of-00100 news.en-00020-of-00100 news.en-00053-of-00100 news.en-00086-of-00100 news.en-00021-of-00100 news.en-00054-of-00100 news.en-00087-of-00100 news.en-00022-of-00100 news.en-00055-of-00100 news.en-00088-of-00100 news.en-00023-of-00100 news.en-00056-of-00100 news.en-00089-of-00100 news.en-00024-of-00100 news.en-00057-of-00100 news.en-00090-of-00100 news.en-00025-of-00100 news.en-00058-of-00100 news.en-00091-of-00100 news.en-00026-of-00100 news.en-00059-of-00100 news.en-00092-of-00100 news.en-00027-of-00100 news.en-00060-of-00100 news.en-00093-of-00100 news.en-00028-of-00100 news.en-00061-of-00100 news.en-00094-of-00100 news.en-00029-of-00100 news.en-00062-of-00100 news.en-00095-of-00100 news.en-00030-of-00100 news.en-00063-of-00100 news.en-00096-of-00100 news.en-00031-of-00100 news.en-00064-of-00100 news.en-00097-of-00100 news.en-00032-of-00100 news.en-00065-of-00100 news.en-00098-of-00100 news.en-00033-of-00100 news.en-00066-of-00100 news.en-00099-of-00100 root@321ae05d9e8b:/workspace/nvidia-examples/big_lstm/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled# exit exit [chibi@centos8 ~]$ cat /etc/redhat-release CentOS Linux release 
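The listing above shows the 99 training shards of the benchmark, named `news.en-00001-of-00100` through `news.en-00099-of-00100`. Since a partially extracted tarball would silently shrink the training set, it is worth checking that every shard is present. A minimal sketch; `missing_shards` and the `listing` argument are illustrative names, not part of the download script:

```python
# Expected shard names, derived from the `ls` output above
# (99 files: news.en-00001-of-00100 .. news.en-00099-of-00100).
expected = {f"news.en-{i:05d}-of-00100" for i in range(1, 100)}

def missing_shards(listing):
    """Return the sorted shard names absent from an `ls` listing string."""
    present = set(listing.split())
    return sorted(expected - present)

# With a complete listing, nothing is missing:
complete = " ".join(sorted(expected))
print(missing_shards(complete))                      # []
print(missing_shards("news.en-00001-of-00100")[0])   # news.en-00002-of-00100
```

In practice the `listing` string could come from `os.listdir()` on the `training-monolingual.tokenized.shuffled` directory instead of captured `ls` output.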
8.2.2004 (Core)
[chibi@centos8 ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_May__6_19:09:25_PDT_2020
Cuda compilation tools, release 11.0, V11.0.167
Build cuda_11.0_bu.TC445_37.28358933_0
[chibi@centos8 ~]$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +27.5°C  (high = +70.0°C)
Tctl:         +27.5°C
[chibi@centos8 ~]$ sudo hddtemp /dev/sda
[sudo] password for chibi:
/dev/sda: WDC WD10EZEX-00BN5A0: 28°C
[chibi@centos8 ~]$ nvidia-smi nvlink -c
GPU 0: TITAN RTX (UUID: GPU-7fb51c1d-c1e7-35cc-aad7-66971f05ddb7)
GPU 1: TITAN RTX (UUID: GPU-5a71d61e-f130-637a-b33d-4df555b0ed88)
GPU 2: GeForce RTX 2080 Ti (UUID: GPU-13277ce5-e1e9-0cb1-8cee-6c9e6618e774)
GPU 3: GeForce RTX 2080 Ti (UUID: GPU-1ac935c2-557f-282e-14e5-3f749ffd63ac)
[chibi@centos8 ~]$ lsmem
RANGE                                  SIZE  STATE REMOVABLE     BLOCK
0x0000000000000000-0x0000000007ffffff  128M online        no         0
0x0000000008000000-0x000000002fffffff  640M online       yes       1-5
0x0000000030000000-0x0000000037ffffff  128M online        no         6
0x0000000038000000-0x000000006fffffff  896M online       yes      7-13
0x0000000070000000-0x0000000077ffffff  128M online        no        14
0x0000000078000000-0x000000007fffffff  128M online       yes        15
0x0000000080000000-0x000000009fffffff  512M online        no     16-19
0x0000000100000000-0x0000000107ffffff  128M online        no        32
0x0000000108000000-0x0000000347ffffff    9G online       yes    33-104
0x0000000348000000-0x000000034fffffff  128M online        no       105
0x0000000350000000-0x000000078fffffff   17G online       yes   106-241
0x0000000790000000-0x0000000797ffffff  128M online        no       242
0x0000000798000000-0x00000007c7ffffff  768M online       yes   243-248
0x00000007c8000000-0x00000007dfffffff  384M online        no   249-251
0x00000007e0000000-0x000000080fffffff  768M online       yes   252-257
0x0000000810000000-0x0000000817ffffff  128M online        no       258
0x0000000818000000-0x0000000827ffffff  256M online       yes   259-260
0x0000000828000000-0x0000000867ffffff    1G online        no   261-268
0x0000000868000000-0x0000000d7fffffff 20.4G online       yes   269-431
0x0000000d80000000-0x0000000d87ffffff  128M online        no       432
0x0000000d88000000-0x0000000e0fffffff  2.1G online       yes   433-449
0x0000000e10000000-0x0000000e17ffffff  128M online        no       450
0x0000000e18000000-0x0000000e37ffffff  512M online       yes   451-454
0x0000000e38000000-0x0000000e3fffffff  128M online        no       455
0x0000000e40000000-0x0000000e47ffffff  128M online       yes       456
0x0000000e48000000-0x0000000e4fffffff  128M online        no       457
0x0000000e50000000-0x0000000f37ffffff  3.6G online       yes   458-486
0x0000000f38000000-0x0000000f47ffffff  256M online        no   487-488
0x0000000f48000000-0x0000000f7fffffff  896M online       yes   489-495
0x0000000f80000000-0x0000000f87ffffff  128M online        no       496
0x0000000f88000000-0x0000000fe7ffffff  1.5G online       yes   497-508
0x0000000fe8000000-0x0000000fefffffff  128M online        no       509
0x0000000ff0000000-0x0000001007ffffff  384M online       yes   510-512
0x0000001008000000-0x0000001017ffffff  256M online        no   513-514
0x0000001018000000-0x0000001027ffffff  256M online       yes   515-516
0x0000001028000000-0x0000001067ffffff    1G online        no   517-524
0x0000001068000000-0x00000013bfffffff 13.4G online       yes   525-631
0x00000013c0000000-0x00000013c7ffffff  128M online        no       632
0x00000013c8000000-0x00000013f7ffffff  768M online       yes   633-638
0x00000013f8000000-0x0000001407ffffff  256M online        no   639-640
0x0000001408000000-0x0000001427ffffff  512M online       yes   641-644
0x0000001428000000-0x000000142fffffff  128M online        no       645
0x0000001430000000-0x0000001447ffffff  384M online       yes   646-648
0x0000001448000000-0x000000144fffffff  128M online        no       649
0x0000001450000000-0x0000001467ffffff  384M online       yes   650-652
0x0000001468000000-0x000000146fffffff  128M online        no       653
0x0000001470000000-0x0000001477ffffff  128M online       yes       654
0x0000001478000000-0x000000147fffffff  128M online        no       655
0x0000001480000000-0x0000001497ffffff  384M online       yes   656-658
0x0000001498000000-0x000000149fffffff  128M online        no       659
0x00000014a0000000-0x00000014b7ffffff  384M online       yes   660-662
0x00000014b8000000-0x00000014cfffffff  384M online        no   663-665
0x00000014d0000000-0x00000014d7ffffff  128M online       yes       666
0x00000014d8000000-0x00000014dfffffff  128M online        no       667
0x00000014e0000000-0x00000014efffffff  256M online       yes   668-669
0x00000014f0000000-0x0000001507ffffff  384M online        no   670-672
0x0000001508000000-0x000000150fffffff  128M online       yes       673
0x0000001510000000-0x000000151fffffff  256M online        no   674-675
0x0000001520000000-0x0000001587ffffff  1.6G online       yes   676-688
0x0000001588000000-0x000000158fffffff  128M online        no       689
0x0000001590000000-0x0000001617ffffff  2.1G online       yes   690-706
0x0000001618000000-0x000000161fffffff  128M online        no       707
0x0000001620000000-0x000000166fffffff  1.3G online       yes   708-717
0x0000001670000000-0x0000001677ffffff  128M online        no       718
0x0000001678000000-0x000000169fffffff  640M online       yes   719-723
0x00000016a0000000-0x00000016a7ffffff  128M online        no       724
0x00000016a8000000-0x00000016d7ffffff  768M online       yes   725-730
0x00000016d8000000-0x00000016e7ffffff  256M online        no   731-732
0x00000016e8000000-0x000000170fffffff  640M online       yes   733-737
0x0000001710000000-0x000000171fffffff  256M online        no   738-739
0x0000001720000000-0x00000017a7ffffff  2.1G online       yes   740-756
0x00000017a8000000-0x00000017bfffffff  384M online        no   757-759
0x00000017c0000000-0x00000017c7ffffff  128M online       yes       760
0x00000017c8000000-0x00000017e7ffffff  512M online        no   761-764
0x00000017e8000000-0x00000017ffffffff  384M online       yes   765-767
0x0000001800000000-0x0000001807ffffff  128M online        no       768
0x0000001808000000-0x000000180fffffff  128M online       yes       769
0x0000001810000000-0x0000001867ffffff  1.4G online        no   770-780
0x0000001868000000-0x0000001a57ffffff  7.8G online       yes   781-842
0x0000001a58000000-0x0000001a5fffffff  128M online        no       843
0x0000001a60000000-0x0000001ac7ffffff  1.6G online       yes   844-856
0x0000001ac8000000-0x0000001acfffffff  128M online        no       857
0x0000001ad0000000-0x0000001ad7ffffff  128M online       yes       858
0x0000001ad8000000-0x0000001adfffffff  128M online        no       859
0x0000001ae0000000-0x0000001b17ffffff  896M online       yes   860-866
0x0000001b18000000-0x0000001b1fffffff  128M online        no       867
0x0000001b20000000-0x0000001b37ffffff  384M online       yes   868-870
0x0000001b38000000-0x0000001b3fffffff  128M online        no       871
0x0000001b40000000-0x0000001ba7ffffff  1.6G online       yes   872-884
0x0000001ba8000000-0x0000001bafffffff  128M online        no       885
0x0000001bb0000000-0x0000001bf7ffffff  1.1G online       yes   886-894
0x0000001bf8000000-0x0000001bffffffff  128M online        no       895
0x0000001c00000000-0x0000001d0fffffff  4.3G online       yes   896-929
0x0000001d10000000-0x0000001d1fffffff  256M online        no   930-931
0x0000001d20000000-0x0000001d27ffffff  128M online       yes       932
0x0000001d28000000-0x0000001d37ffffff  256M online        no   933-934
0x0000001d38000000-0x0000001d3fffffff  128M online       yes       935
0x0000001d40000000-0x0000001d4fffffff  256M online        no   936-937
0x0000001d50000000-0x0000001d67ffffff  384M online       yes   938-940
0x0000001d68000000-0x0000001d77ffffff  256M online        no   941-942
0x0000001d78000000-0x0000001d8fffffff  384M online       yes   943-945
0x0000001d90000000-0x0000001d97ffffff  128M online        no       946
0x0000001d98000000-0x0000001d9fffffff  128M online       yes       947
0x0000001da0000000-0x0000001dafffffff  256M online        no   948-949
0x0000001db0000000-0x0000001db7ffffff  128M online       yes       950
0x0000001db8000000-0x0000001dbfffffff  128M online        no       951
0x0000001dc0000000-0x0000001dcfffffff  256M online       yes   952-953
0x0000001dd0000000-0x0000001ddfffffff  256M online        no   954-955
0x0000001de0000000-0x0000001de7ffffff  128M online       yes       956
0x0000001de8000000-0x0000001defffffff  128M online        no       957
0x0000001df0000000-0x0000001df7ffffff  128M online       yes       958
0x0000001df8000000-0x0000001e07ffffff  256M online        no   959-960
0x0000001e08000000-0x0000001e27ffffff  512M online       yes   961-964
0x0000001e28000000-0x0000001e2fffffff  128M online        no       965
0x0000001e30000000-0x0000001e67ffffff  896M online       yes   966-972
0x0000001e68000000-0x0000001e6fffffff  128M online        no       973
0x0000001e70000000-0x0000001e8fffffff  512M online       yes   974-977
0x0000001e90000000-0x0000001e97ffffff  128M online        no       978
0x0000001e98000000-0x0000001eafffffff  384M online       yes   979-981
0x0000001eb0000000-0x0000001ecfffffff  512M online        no   982-985
0x0000001ed0000000-0x0000001ee7ffffff  384M online       yes   986-988
0x0000001ee8000000-0x0000001f07ffffff  512M online        no   989-992
0x0000001f08000000-0x0000001ff7ffffff  3.8G online       yes  993-1022
0x0000001ff8000000-0x000000205fffffff  1.6G online        no 1023-1035
Memory block size:       128M
Total online memory:     128G
Total offline memory:      0B
[chibi@centos8 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7302P 16-Core Processor
Stepping:            0
CPU MHz:             1640.757
CPU max MHz:         3000.0000
CPU min MHz:         1500.0000
BogoMIPS:            5988.93
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3,16-19
NUMA node1 CPU(s):   4-7,20-23
NUMA node2 CPU(s):   8-11,24-27
NUMA node3 CPU(s):   12-15,28-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
[chibi@centos8 ~]$ lstopo
Machine (126GB total) + Package L#0
  NUMANode L#0 (P#0 31GB)
    L3 L#0 (16MB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
    L3 L#1 (16MB)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
    HostBridge L#0
      PCIBridge
        PCI 10de:1e07
          GPU L#0 "renderD128"
          GPU L#1 "card0"
  NUMANode L#1 (P#1 31GB)
    L3 L#2 (16MB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
    L3 L#3 (16MB)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#2
      PCIBridge
        PCI 10de:1e07
          GPU L#2 "card1"
          GPU L#3 "renderD129"
      PCIBridge
        PCI 1022:7901
          Block(Disk) L#4 "sda"
      PCIBridge
        PCI 1022:7901
  NUMANode L#2 (P#2 31GB)
    L3 L#4 (16MB)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#24)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#25)
    L3 L#5 (16MB)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#26)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#27)
    HostBridge L#6
      PCIBridge
        PCI 10de:1e02
          GPU L#5 "renderD130"
          GPU L#6 "card2"
      PCIBridge
        PCI 1022:7901
  NUMANode L#3 (P#3 31GB)
    L3 L#6 (16MB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
    L3 L#7 (16MB)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)
    HostBridge L#9
      PCIBridge
        PCI 8086:1533
          Net L#7 "eth0"
      PCIBridge
        PCI 8086:1533
          Net L#8 "eth1"
      PCIBridge
        PCI 10de:1e02
          GPU L#9 "card3"
          GPU L#10 "renderD131"
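The `lscpu` output above reports each NUMA node's CPUs as a range list such as `0-3,16-19`. To pin a worker to one node programmatically, such a list first has to be expanded into individual CPU ids. A minimal sketch; `expand_cpu_list` is an illustrative helper name, not a system utility:

```python
# Expand an lscpu/cpuset-style CPU list like "0-3,16-19" into CPU ids.
def expand_cpu_list(spec):
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

# NUMA node0 from the lscpu output above:
print(expand_cpu_list("0-3,16-19"))  # [0, 1, 2, 3, 16, 17, 18, 19]
```

On Linux the resulting ids could be passed to `os.sched_setaffinity(0, set(cpus))` to keep a data-loading process on the same node as its GPU's host bridge, which the `lstopo` tree above makes easy to identify.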
[chibi@centos8 ~]$ free
              total        used        free      shared  buff/cache   available
Mem:      131618404     1254720   129009696       12220     1353988   129481020
Swap:             0           0           0
[chibi@centos8 ~]$ cat /proc/meminfo
MemTotal:       131618404 kB
MemFree:        129008384 kB
MemAvailable:   129479772 kB
Buffers:            1072 kB
Cached:          1204536 kB
SwapCached:            0 kB
Active:          1094332 kB
Inactive:         571416 kB
Active(anon):     459840 kB
Inactive(anon):     9880 kB
Active(file):     634492 kB
Inactive(file):   561536 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        442516 kB
Mapped:           296088 kB
Shmem:             12216 kB
KReclaimable:     148456 kB
Slab:             454784 kB
SReclaimable:     148456 kB
SUnreclaim:       306328 kB
KernelStack:       11904 kB
PageTables:        25588 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    65809200 kB
Committed_AS:    3087264 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:            20480 kB
HardwareCorrupted:     0 kB
AnonHugePages:    188416 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     1288144 kB
DirectMap2M:    33179648 kB
DirectMap1G:    100663296 kB
[chibi@centos8 ~]$
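The `/proc/meminfo` dump above uses a uniform `Key: value kB` layout, so it is straightforward to turn into a dictionary for scripted checks (e.g. confirming enough free memory before a training run). A minimal sketch; `parse_meminfo` is an illustrative name, and `sample` mirrors two fields from the output above:

```python
# Parse /proc/meminfo-style "Key:  value kB" lines into {key: int}.
def parse_meminfo(text):
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # Values are in kB, except the unitless HugePages_* counters.
            info[key.strip()] = int(fields[0])
    return info

sample = "MemTotal:       131618404 kB\nMemFree:        129008384 kB"
mem = parse_meminfo(sample)
print(mem["MemTotal"] - mem["MemFree"])  # 2610020 (kB not free)
```

On a live system the same function would be fed `open("/proc/meminfo").read()` instead of the inline `sample`.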