I am training convolutional networks with Keras (TensorFlow backend). When training several networks one after another in a loop, after roughly 6-10 networks have been trained the process terminates with the following messages:

2018-08-01 03:12:29.507320: W T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:2440]
2018-08-01 03:12:32.384245: W T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:2440]
2018-08-01 03:12:43.149686: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 87.89MiB. Current allocation summary follows.
2018-08-01 03:12:43.150195: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (256): Total Chunks: 239, Chunks in use: 238. 59.8KiB allocated for chunks. 59.5KiB in use in bin. 1.9KiB client-requested in use in bin.
2018-08-01 03:12:43.150888: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (512): Total Chunks: 64, Chunks in use: 63. 32.0KiB allocated for chunks. 31.5KiB in use in bin. 31.5KiB client-requested in use in bin.
2018-08-01 03:12:43.151555: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (1024): Total Chunks: 5, Chunks in use: 5. 5.3KiB allocated for chunks. 5.3KiB in use in bin. 5.0KiB client-requested in use in bin.
2018-08-01 03:12:43.152226: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (2048): Total Chunks: 8, Chunks in use: 7. 17.3KiB allocated for chunks. 14.0KiB in use in bin. 14.0KiB client-requested in use in bin.
2018-08-01 03:12:43.152878: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (4096): Total Chunks: 1, Chunks in use: 1. 6.8KiB allocated for chunks. 6.8KiB in use in bin. 6.8KiB client-requested in use in bin.
2018-08-01 03:12:43.153526: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.

(and similar messages for the remaining bins)

2018-08-01 03:12:43.165310: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:646] Bin for 87.89MiB was 64.00MiB, Chunk State:
2018-08-01 03:12:43.165660: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:652] Size: 87.89MiB | Requested Size: 512.0KiB | in_use: 0, prev: Size: 512B | Requested Size: 512B | in_use: 1, next: Size: 256B | Requested Size: 4B | in_use: 1
2018-08-01 03:12:43.166916: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000500D60000 of size 1280
2018-08-01 03:12:43.167233: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000500D60500 of size 256
2018-08-01 03:12:43.167544: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000500D60600 of size 256

(many similar chunk messages follow)

2018-08-01 03:12:43.345725: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:680] Stats:
Limit:        591550873
InUse:        460528896
MaxInUse:     591547392
NumAllocs:    153014
MaxAllocSize: 368753664

2018-08-01 03:12:43.346681: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:279] ********************************______________******************_____****************xxxxxxxxxxxxxxx
2018-08-01 03:12:43.485313: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at conv_ops.cc:693 : Resource exhausted: OOM when allocating tensor with shape[16,64,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:/Users/[path]/multiply_learning.py", line 118, in <module>
    callbacks=callbacks)
  File "C:\Users\[path]\venv\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\[path]\venv\lib\site-packages\keras\engine\training.py", line 1415, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\[path]\venv\lib\site-packages\keras\engine\training_generator.py", line 213, in fit_generator
    class_weight=class_weight)
  File "C:\Users\[path]\venv\lib\site-packages\keras\engine\training.py", line 1215, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Users\[path]\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2666, in __call__
    return self._call(inputs)
  File "C:\Users\[path]\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2636, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\Users\[path]\venv\lib\site-packages\tensorflow\python\client\session.py", line 1454, in __call__
    self._session._session, self._handle, args, status, None)
  File "C:\Users\[path]\venv\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,64,150,150] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[Node: vgg16_7/block1_conv2/convolution = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](vgg16_7/block1_conv1/Relu, block1_conv2/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [[Node: dense_16/BiasAdd/_1005 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_236_dense_16/BiasAdd", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Process finished with exit code 1
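For context, the loop that trains the networks (multiply_learning.py in the traceback) is structured roughly like the sketch below. All names, paths and hyperparameter values here are placeholders rather than the actual code; the point is simply that a new VGG16-based model is built and trained on every iteration:

    # Rough sketch of the training loop; placeholder names and values,
    # not the real multiply_learning.py.
    from keras.applications import VGG16
    from keras.models import Model
    from keras.layers import Dense, Flatten
    from keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1. / 255)
    train_gen = datagen.flow_from_directory('data/train', target_size=(150, 150),
                                            batch_size=16, class_mode='binary')
    val_gen = datagen.flow_from_directory('data/val', target_size=(150, 150),
                                          batch_size=16, class_mode='binary')

    for run_idx in range(20):                       # crash appears after ~6-10 iterations
        # Build a fresh VGG16-based classifier on each iteration.
        base = VGG16(weights='imagenet', include_top=False,
                     input_shape=(150, 150, 3))     # 150x150 matches the OOM tensor shape
        x = Flatten()(base.output)
        x = Dense(256, activation='relu')(x)
        out = Dense(1, activation='sigmoid')(x)
        model = Model(base.input, out)
        model.compile(optimizer='rmsprop', loss='binary_crossentropy',
                      metrics=['accuracy'])
        model.fit_generator(train_gen, steps_per_epoch=100, epochs=5,
                            validation_data=val_gen, validation_steps=50,
                            callbacks=[])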

When training a single network, everything works fine with any number of epochs. What causes these crashes, and how can they be eliminated?
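Following the hint in the traceback, I assume a list of allocated tensors at the moment of the OOM could be obtained by passing report_tensor_allocations_upon_oom through a RunOptions object. A minimal sketch of how that might look with Keras 2.2 on the TensorFlow 1.x backend (I have not verified that compile() actually forwards the options on this setup):

    # Sketch: ask TensorFlow to report allocated tensors when OOM happens,
    # as suggested by the hint in the traceback. Assumption: model.compile()
    # forwards extra keyword arguments such as 'options' to the backend
    # train function. 'model' is the Keras model built in the loop above.
    import tensorflow as tf

    run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'],
                  options=run_opts)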
