Strange behavior when running algorithms
Posted: 2018-08-07 12:23
System: Linux titanxp-desktop 4.15.0-30-generic #32~16.04.1-Ubuntu SMP Thu Jul 26 20:25:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Memory: 32 GB
Disks: 250 GB SSD, 3 TB SATA
GPU: NVIDIA TITAN Xp, reference design with blower fan (later converted to water cooling)
GPU driver: NVIDIA-Linux-x86_64-390.77.run
Installed software: CUDA 9.0, cuDNN (the build for CUDA 9.0), the Python version of TensorFlow
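Symptoms:
1. The machine frequently hangs while running model training. When it hangs, the router connected to the TITAN Xp machine hangs as well, so the network goes down. The network comes back after a reboot.
2. nvidia-smi showed a recorded GPU temperature of 89°C (while the card still had the blower fan). After switching to water cooling and reinstalling the OS, the temperature dropped to 59°C, but the machine still hangs frequently while running algorithms.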
Fri Aug 3 18:23:01 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77                 Driver Version: 390.77                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:01:00.0  On |                  N/A |
| 35%   59C    P2   238W / 250W |  11773MiB / 12194MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1861      G   /usr/lib/xorg/Xorg                           103MiB |
|    0      2398      G   /opt/teamviewer/tv_bin/TeamViewer              3MiB |
|    0      2616      G   compiz                                       143MiB |
|    0     11747      C   python3                                    11519MiB |
+-----------------------------------------------------------------------------+
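A single nvidia-smi snapshot says little about what the card was doing right before a hang. One possible way to narrow things down is to log temperature, power draw, utilization and memory use once per second and force each row to disk, so the last readings survive a freeze. The following is only a sketch under assumptions: the function names and the log file name are made up for illustration, and it relies on the standard `--query-gpu` fields of nvidia-smi.

```python
import csv
import os
import subprocess
import time

# Field names passed to `nvidia-smi --query-gpu`.
FIELDS = ["temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]


def parse_sample(line):
    """Turn one CSV line printed by nvidia-smi into a dict of floats."""
    values = [v.strip() for v in line.split(",")]
    sample = {}
    for name, value in zip(FIELDS, values):
        try:
            sample[name] = float(value)
        except ValueError:
            # nvidia-smi prints text such as "[Not Supported]" for unreadable fields
            sample[name] = None
    return sample


def read_gpu(index=0):
    """Query one GPU once and return the parsed sample."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_sample(out.strip().splitlines()[0])


def log_forever(path, interval_s=1.0):
    """Append one row per interval and fsync it so it survives a hard hang."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            sample = read_gpu()
            row = [time.strftime("%Y-%m-%d %H:%M:%S")]
            row += [sample[name] for name in FIELDS]
            writer.writerow(row)
            f.flush()
            os.fsync(f.fileno())
            time.sleep(interval_s)
```

Calling `log_forever("gpu_log.csv")` in a second terminal before starting training would leave a trace to inspect after the next hang. In the snapshot above the card was drawing 238 W against a 250 W cap, so the power column may be worth as much attention as the temperature.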
3. Output of dmesg | grep -i NV:
[ 0.004000] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.618965] rtc_cmos 00:04: alarms up to one month, y3k, 242 bytes nvram, hpet irqs
[ 1.902320] nvidia: loading out-of-tree module taints kernel.
[ 1.902324] nvidia: module license 'NVIDIA' taints kernel.
[ 1.904907] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1.909199] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[ 1.909383] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1.936579] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 390.77 Tue Jul 10 22:10:46 PDT 2018
[ 1.942002] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 1.942003] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
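The lines above only cover driver loading at boot. When the NVIDIA driver detects a GPU fault at run time, it normally reports it in the kernel log as a line containing "NVRM: Xid", so the log from the session that hung may be more telling than the current dmesg. A minimal sketch for scanning such a log follows; the function names are hypothetical, and /var/log/kern.log is assumed to be where Ubuntu keeps the kernel messages. A hard freeze may leave no such line at all.

```python
import re

# The driver reports GPU faults as "NVRM: Xid (<device>): <code>, ..."
XID_RE = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(lines):
    """Return (device, code, raw_line) for every Xid report in the given lines."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group("device"),
                           int(match.group("code")),
                           line.rstrip("\n")))
    return events


def scan_log(path="/var/log/kern.log"):
    """Scan a kernel log file; the default path is an assumption for Ubuntu."""
    with open(path, errors="replace") as f:
        return find_xid_events(f)
```

If `scan_log()` returns anything, the numeric code can be looked up in NVIDIA's Xid documentation. An empty result does not rule out the GPU, since the log may not have been flushed before the freeze.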
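4. Today, after a hang while running an algorithm, I rebooted and found that the driver was gone. At the login screen, entering the password logged me straight back out to the login screen, so I had to reinstall the driver.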
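One possible explanation for item 4, which nothing in this post confirms, is that the machine booted into a different kernel: a driver installed from the .run file is built against the kernel running at install time and is not rebuilt for a new kernel unless it was registered with DKMS. A small sketch for checking, after a reboot, which kernel is running and whether any nvidia module is loaded (function names are hypothetical):

```python
import platform


def loaded_nvidia_modules(proc_modules_text):
    """Return names of loaded modules starting with 'nvidia', from /proc/modules text."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


def driver_status():
    """Report the running kernel release and the loaded nvidia modules."""
    with open("/proc/modules") as f:
        modules = loaded_nvidia_modules(f.read())
    return {"kernel": platform.release(), "nvidia_modules": modules}
```

If `driver_status()` shows a kernel release other than 4.15.0-30-generic together with an empty module list, a kernel update would be the first thing to look at. If the kernel is unchanged, the cause lies elsewhere.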
Could anyone tell me what is going wrong here, and how I should go about tracking down what is causing the frequent hangs?