GPU Check#
Check the NVIDIA driver#
[1]:
!nvidia-smi
Thu May 18 03:47:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:07.0 Off |                    0 |
| N/A   45C    P0    27W /  70W |      2MiB / 15360MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
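The same driver check can be scripted rather than read by eye. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper name `query_gpus` is hypothetical, and the function returns an empty list when the tool is missing or fails instead of raising.

```python
import shutil
import subprocess


def query_gpus():
    """Return one dict per GPU reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed or not on PATH
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # for example, the driver is installed but not loaded
    gpus = []
    for line in result.stdout.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip lines that do not match the requested columns
        name, driver, memory_total = fields
        gpus.append(
            {"name": name, "driver": driver, "memory_total": memory_total}
        )
    return gpus


print(query_gpus())
```

An empty list means no usable driver was found, which is usually worth resolving before checking any framework.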
Check the GPU in PyTorch#
Import PyTorch to check GPU devices#
[2]:
import torch
Check the PyTorch version#
[3]:
torch.__version__
[3]:
'2.0.0+cu117'
Check whether PyTorch can use the GPU#
[4]:
torch.cuda.is_available()
[4]:
True
Check the number of GPUs visible to PyTorch#
[5]:
gpu_num = torch.cuda.device_count()
print(gpu_num)
1
Check the GPU model in PyTorch#
[6]:
for i in range(gpu_num):
    print('GPU {}.: {}'.format(i, torch.cuda.get_device_name(i)))
GPU 0.: Tesla T4
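The individual PyTorch checks above can be gathered into one summary. This is a sketch only: the helper name `torch_gpu_summary` is hypothetical, and it returns `None` when PyTorch is not installed. It also reports `torch.version.cuda`, the CUDA version the PyTorch build was compiled against, which can differ from the "CUDA Version" shown by `nvidia-smi` (the highest version the driver supports).

```python
def torch_gpu_summary():
    """Collect PyTorch's view of the GPUs; return None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None

    available = torch.cuda.is_available()
    count = torch.cuda.device_count() if available else 0
    return {
        "torch_version": torch.__version__,
        # CUDA version of the PyTorch build; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        "cuda_available": available,
        "device_names": [torch.cuda.get_device_name(i) for i in range(count)],
    }


print(torch_gpu_summary())
```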
Check the GPU in TensorFlow#
Import TensorFlow to check GPU devices#
[7]:
import tensorflow as tf
2023-05-18 03:48:02.355928: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-18 03:48:03.469790: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Check the TensorFlow version#
[8]:
tf.__version__
[8]:
'2.12.0'
Check whether TensorFlow can use the GPU#
[9]:
tf.test.is_gpu_available()
WARNING:tensorflow:From /tmp/ipykernel_3029/337460670.py:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-05-18 03:48:05.350918: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-05-18 03:48:08.211886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 13167 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
[9]:
True
Check the physical GPUs in TensorFlow#
[10]:
tf.config.list_physical_devices('GPU')
[10]:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Check the GPU model in TensorFlow#
[11]:
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
for x in local_device_protos:
    if x.device_type == 'GPU':
        print(x.physical_device_desc)
device: 0, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
2023-05-18 03:48:08.244566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 13167 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
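The TensorFlow checks can also be written with public, non-deprecated APIs only. The sketch below assumes a TensorFlow 2.x release that provides `tf.config.experimental.get_device_details`; the helper name `tf_gpu_summary` is hypothetical, and it returns `None` when TensorFlow is not installed. The detail keys are read with `.get` because the set of keys reported can vary by device.

```python
def tf_gpu_summary():
    """List the GPUs TensorFlow can see; return None if TF is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    summary = []
    for gpu in tf.config.list_physical_devices("GPU"):
        # Returns a dict; keys such as "device_name" may be absent.
        details = tf.config.experimental.get_device_details(gpu)
        summary.append(
            {
                "name": gpu.name,
                "device_name": details.get("device_name"),
                "compute_capability": details.get("compute_capability"),
            }
        )
    return summary


print(tf_gpu_summary())
```

An empty list plays the same role as `False` from the deprecated `tf.test.is_gpu_available()`.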
Check the GPU in JAX and jaxlib#
Import JAX and jaxlib to check GPU devices#
[12]:
import jax
import jaxlib
Check the versions of jax and jaxlib#
[13]:
jax.__version__
[13]:
'0.4.1'
[14]:
jaxlib.__version__
[14]:
'0.4.1'
List all devices from JAX's default backend#
[15]:
jax.devices()
[15]:
[StreamExecutorGpuDevice(id=0, process_index=0, slice_index=0)]
Check the total number of JAX devices#
[16]:
jax.device_count()
[16]:
1
Check the number of JAX processes associated with the backend#
[17]:
jax.process_count()
[17]:
1
Check the jaxlib backend#
[18]:
from jax.lib import xla_bridge
print(xla_bridge.get_backend().platform)
gpu
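`jax.lib.xla_bridge` is an internal module and may not be available in newer JAX releases. As a hedged alternative, the sketch below reads the same information through the public `jax.default_backend()` and `jax.devices()` calls. The helper name `jax_backend_summary` is hypothetical, and it returns `None` when JAX is not installed.

```python
def jax_backend_summary():
    """Report JAX's default backend and devices; return None if JAX is missing."""
    try:
        import jax
    except ImportError:
        return None

    return {
        # Platform name of the default backend, e.g. "cpu" or "gpu".
        "backend": jax.default_backend(),
        "devices": [
            (device.id, device.platform, device.device_kind)
            for device in jax.devices()
        ],
    }


print(jax_backend_summary())
```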
Run a logistic regression example to verify that JAX works#
The sample code is adapted from https://www.secretflow.org.cn/docs/secretflow/zh_CN/tutorial/lr_with_spu.html
Load the dataset#
[19]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
def breast_cancer(party_id=None, train: bool = True) -> (np.ndarray, np.ndarray):
    x, y = load_breast_cancer(return_X_y=True)
    # Min-max scale the features into [0, 1] using the global min and max.
    x = (x - np.min(x)) / (np.max(x) - np.min(x))
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )
    if train:
        if party_id:
            if party_id == 1:
                # Party 1 holds the first 15 features and no labels.
                return x_train[:, :15], None
            else:
                # The other party holds the remaining features and the labels.
                return x_train[:, 15:], y_train
        else:
            return x_train, y_train
    else:
        return x_test, y_test
Define the model#
[20]:
import jax.numpy as jnp

def sigmoid(x):
    return 1 / (1 + jnp.exp(-x))

# Outputs probability of a label being true.
def predict(W, b, inputs):
    return sigmoid(jnp.dot(inputs, W) + b)

# Training loss is the negative log-likelihood of the training examples.
def loss(W, b, inputs, targets):
    preds = predict(W, b, inputs)
    label_probs = preds * targets + (1 - preds) * (1 - targets)
    return -jnp.mean(jnp.log(label_probs))
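A quick sanity check on this loss: with zero weights and zero bias, every prediction is sigmoid(0) = 0.5, so the negative log-likelihood is ln 2 ≈ 0.693 whatever the labels are. The sketch below reproduces the same formula with the standard library only, so it runs without JAX. The toy inputs and the name `nll` are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(w, b, inputs, targets):
    """Mean negative log-likelihood, mirroring the loss defined above."""
    total = 0.0
    for x, t in zip(inputs, targets):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        # Probability the model assigns to the true label (t is 0 or 1).
        label_prob = p * t + (1 - p) * (1 - t)
        total += math.log(label_prob)
    return -total / len(inputs)


# Hypothetical toy data: two samples with two features each.
initial_loss = nll([0.0, 0.0], 0.0, [[0.2, 0.7], [0.9, 0.1]], [1, 0])
print(initial_loss, math.log(2))
```

If the first training step reports a loss far from ln 2 when starting from zeros, the data pipeline or the loss is worth a second look.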
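Define the training step#
[21]:
from jax import grad

def train_step(W, b, x1, x2, y, learning_rate):
    # Join the two parties' feature blocks column-wise.
    x = jnp.concatenate([x1, x2], axis=1)
    # Gradient of the loss with respect to arguments 0 (W) and 1 (b).
    Wb_grad = grad(loss, (0, 1))(W, b, x, y)
    W -= learning_rate * Wb_grad[0]
    b -= learning_rate * Wb_grad[1]
    return W, b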
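To make explicit what `grad(loss, (0, 1))` computes here: for 0/1 labels, the gradient of the negative log-likelihood is the mean of `(p - t) * x` with respect to the weights and the mean of `(p - t)` with respect to the bias. The following standard-library sketch applies one gradient-descent step with that analytic gradient. The toy data and the helper names are hypothetical, and this is not the notebook's JAX implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def nll(w, b, inputs, targets):
    total = 0.0
    for x, t in zip(inputs, targets):
        p = predict(w, b, x)
        total += math.log(p * t + (1 - p) * (1 - t))
    return -total / len(inputs)


def nll_gradient(w, b, inputs, targets):
    """Analytic gradient for 0/1 targets: mean (p - t) * x and mean (p - t)."""
    n = len(inputs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, t in zip(inputs, targets):
        err = predict(w, b, x) - t
        for j, xj in enumerate(x):
            grad_w[j] += err * xj / n
        grad_b += err / n
    return grad_w, grad_b


def train_step(w, b, inputs, targets, learning_rate):
    grad_w, grad_b = nll_gradient(w, b, inputs, targets)
    new_w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
    return new_w, b - learning_rate * grad_b


# Hypothetical toy data: three samples with two features each.
inputs = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]]
targets = [1, 0, 1]
w, b = [0.0, 0.0], 0.0
before = nll(w, b, inputs, targets)
w, b = train_step(w, b, inputs, targets, learning_rate=0.1)
after = nll(w, b, inputs, targets)
print(before, after)
```

The loss after the step is lower than before it, which is the behaviour the JAX `train_step` relies on across epochs.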
Build the fit function#
[22]:
def fit(W, b, x1, x2, y, epochs=1, learning_rate=1e-2):
    for _ in range(epochs):
        W, b = train_step(W, b, x1, x2, y, learning_rate=learning_rate)
    return W, b
Validate the model#
[23]:
from sklearn.metrics import roc_auc_score

def validate_model(W, b, X_test, y_test):
    y_pred = predict(W, b, X_test)
    return roc_auc_score(y_test, y_pred)
Train the model#
[24]:
%matplotlib inline
# Load the data
x1, _ = breast_cancer(party_id=1, train=True)
x2, y = breast_cancer(party_id=2, train=True)
# Hyperparameters
W = jnp.zeros((30,))
b = 0.0
epochs = 10
learning_rate = 1e-2
# Train the model
W, b = fit(W, b, x1, x2, y, epochs=epochs, learning_rate=learning_rate)
# Validate the model
X_test, y_test = breast_cancer(train=False)
auc = validate_model(W, b, X_test, y_test)
print(f'auc={auc}')
auc=0.9880445463478545
As you can see, the warning
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
did not appear, which suggests that the JAX code ran on the GPU.
Import SecretFlow to verify that this environment reports no errors#
[25]:
import secretflow