"Light It Up Blue" and Water Cube Charity Fund Launch Ceremony Opens

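Yan Yuejin: Two financial instruments, special loans for urban village redevelopment and relending for affordable housing, are now being actively implemented.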

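So how do you build a distributed computing environment from scratch? This article covers hardware selection, basic server-side configuration, GPU driver installation and collective communication library configuration, enabling lossless Ethernet, and finally importing a large model and running training tests, walking you through the whole process of building a distributed computing environment. Once installation is complete, we can use the WebUI for quick debugging and validation; when that checks out, the command-line tools can be used for multi-node distributed training.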

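Because of resource constraints, this experiment uses three general-purpose servers, slightly modified, for the subsequent parallel training and inference tests.

- NCCL operation options
- -o, --op: specifies which reduction operation to perform. It applies only to reduction collectives such as Allreduce, Reduce, or ReduceScatter.

Figure 1: High-level design topology of the AI computing center solution

With the overall design of the AI computing center clear, we compare the internal hardware connection topology of a general-purpose computing server with that of a GPU server to understand the selection logic for GPU servers in detail:

Figure 2: Internal hardware connection topology of a general-purpose computing server

Figure 3: Internal hardware connection topology of a GPU server

Figure 2 shows the internal hardware connection topology of a general-purpose computing server. The core of this server is two AMD EPYC CPUs, which expose a number of interfaces through the IO chiplet and help the CPUs deliver their full general-purpose computing capability.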

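Model          Service interfaces               Switching capacity
CX864E-N       64x800GE OSFP, 2x10GE SFP+       102.4 Tbps
CX732Q-N       32x400GE QSFP-DD, 2x10GE SFP+    25.6 Tbps
CX664D-N       64x200GE QSFP56, 2x10GE SFP+     25.6 Tbps
CX564P-N       64x100GE QSFP28, 2x10GE SFP+     12.8 Tbps
CX532P-N       32x100GE QSFP28, 2x10GE SFP+     6.4 Tbps
CX308P-48Y-N   48x25GE SFP28, 8x100GE QSFP28    4.0 Tbps

Table 1: Model specifications

Improving large-model training efficiency

The per-device forwarding latency of the CX-N data center switches (400 ns) is as low as 1/4 to 1/5 of the industry average, which minimizes the share of network latency in the end-to-end latency of AI/ML applications. A multi-dimensional high-reliability design keeps the network from being interrupted, helping to cut large-model training time substantially and improve overall efficiency.

The scale-out compute network and the storage network are the special cases. The traffic these two networks carry determines the requirements for switch selection: RDMA support, low latency, and high throughput.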
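The switching capacities in Table 1 can be cross-checked with simple arithmetic. The sketch below assumes that the listed capacity counts every high-speed service port in both directions (full duplex) and that the 2x10GE SFP+ ports are not included; both are assumptions inferred from the numbers, not something the table states. The function name `switching_capacity_tbps` is hypothetical.

```python
# Port counts and speeds (Gbps) transcribed from Table 1.
# Assumption: the 2x10GE SFP+ ports are not counted in the capacity figure.
MODELS = {
    "CX864E-N": ([(64, 800)], 102.4),
    "CX732Q-N": ([(32, 400)], 25.6),
    "CX664D-N": ([(64, 200)], 25.6),
    "CX564P-N": ([(64, 100)], 12.8),
    "CX532P-N": ([(32, 100)], 6.4),
    "CX308P-48Y-N": ([(48, 25), (8, 100)], 4.0),
}


def switching_capacity_tbps(ports):
    """Sum port bandwidth in one direction, then double it for full duplex."""
    one_way_gbps = sum(count * speed for count, speed in ports)
    return 2 * one_way_gbps / 1000


for name, (ports, listed) in MODELS.items():
    computed = switching_capacity_tbps(ports)
    print(f"{name}: computed {computed:.1f} Tbps, listed {listed:.1f} Tbps")
```

Under these assumptions every computed value agrees with the listed capacity, including the mixed 25GE/100GE CX308P-48Y-N.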

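RoCEv2 switch

Figure 8: The CX308P-48Y-N device

The parallel training environment in this experiment has only a few devices, so the network is relatively simple:

1. Connect the 25GE service interface of the CX5 NIC to the CX308P.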

[root@server3 AIGC]# nvidia-smi
Mon Jun  3 11:59:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:02:00.0 Off |                  N/A |
|  0%   34C    P0             27W /  165W |       1MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[root@server3 AIGC]#

Build and install OpenMPI

[root@server3 AIGC]# tar xvf openmpi-4.1.6.tar.gz
[root@server3 AIGC]# cd openmpi-4.1.6/
[root@server3 openmpi-4.1.6]# mkdir -p /home/lichao/lib/openmpi
[root@server3 openmpi-4.1.6]# ./configure --prefix=/home/lichao/lib/openmpi --with-cuda=/usr/local/cuda-12.4 --with-nccl=/usr/lib64

Open MPI configuration:
-----------------------
Version: 4.1.6
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: mpif.h, use mpi
MPI Build Java bindings (experimental): no
Build Open SHMEM support: yes
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: yes
HWLOC support: internal
Libevent support: internal
Open UCC: no
PMIx support: Internal

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: yes
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no

[root@server3 openmpi-4.1.6]#

Build and install NCCL-Tests

MPI_HOME should point at the Open MPI install prefix. The configure step above used /home/lichao/lib/openmpi, while the commands below use /home/lichao/opt/openmpi.

[root@server3 lichao]# cd AIGC/
[root@server3 AIGC]# git clone https://github.com/NVIDIA/nccl-tests.git
[root@server3 AIGC]# cd nccl-tests/
[root@server3 nccl-tests]# make clean
[root@server3 nccl-tests]# make MPI=1 MPI_HOME=/home/lichao/opt/openmpi/ CUDA_HOME=/usr/local/cuda-12.4/ NCCL_HOME=/usr/lib64/

Collective communication performance test (all_reduce)

[root@server1 lichao]# cat run_nccl-test.sh
/home/lichao/opt/openmpi/bin/mpirun --allow-run-as-root \
-np 3 \
-host server1,server2,server3 \
-mca btl ^openib \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_SOCKET_IFNAME=ens11f1 \
-x NCCL_IB_HCA=mlx5_1:1 \
/home/lichao/AIGC/nccl-tests/build/all_reduce_perf -b 128 -e 8G -f 2 -g 1

[root@server1 lichao]# ./run_nccl-test.sh
# nThread 1 nGpus 1 minBytes 128 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  18697 on    server1 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#  Rank  1 Group  0 Pid  20893 on    server2 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#  Rank  2 Group  0 Pid   2458 on    server3 device  0 [0x02] NVIDIA GeForce RTX 4060 Ti
#
# Reducing maxBytes to 5261099008 due to memory limitation
server1:18697:18697 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server1:18697:18697 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.11
server1:18697:18697 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server1:18697:18697 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server1:18697:18697 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server2:20893:20893 [0] NCCL INFO cudaDriverVersion 12040
server2:20893:20893 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server2:20893:20893 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.12
server2:20893:20893 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server2:20893:20893 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server2:20893:20893 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server1:18697:18697 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
server3:2458:2458 [0] NCCL INFO cudaDriverVersion 12040
server3:2458:2458 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server3:2458:2458 [0] NCCL INFO Bootstrap : Using ens11f1:172.16.0.13
server3:2458:2458 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
server3:2458:2458 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
server3:2458:2458 [0] NCCL INFO NET/Plugin: Using internal network plugin.
server2:20893:20907 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
server2:20893:20907 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server2:20893:20907 [0] NCCL INFO NCCL_IB_HCA set to mlx5_1:1
server2:20893:20907 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]; OOB ens11f1:172.16.0.12
server2:20893:20907 [0] NCCL INFO Using non-device net plugin version 0
server2:20893:20907 [0] NCCL INFO Using network IB
server3:2458:2473 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
server3:2458:2473 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server3:2458:2473 [0] NCCL INFO NCCL_IB_HCA set to mlx5_1:1
server1:18697:18712 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
server1:18697:18712 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens11f1
server3:2458:2473 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]; OOB ens11f1:172.16.0.13
server1:18697:18712 [0] NCCL INFO NCCL_IB_HCA set to mlx5_1:1
server3:2458:2473 [0] NCCL INFO Using non-device net plugin version 0
server3:2458:2473 [0] NCCL INFO Using network IB
server1:18697:18712 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [RO]; OOB ens11f1:172.16.0.11
server1:18697:18712 [0] NCCL INFO Using non-device net plugin version 0
server1:18697:18712 [0] NCCL INFO Using network IB
server1:18697:18712 [0] NCCL INFO ncclCommInitRank comm 0x23622c0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init START
server3:2458:2473 [0] NCCL INFO ncclCommInitRank comm 0x346ffc0 rank 2 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init START
server2:20893:20907 [0] NCCL INFO ncclCommInitRank comm 0x2a1af20 rank 1 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init START
server3:2458:2473 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
server2:20893:20907 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
server1:18697:18712 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
server1:18697:18712 [0] NCCL INFO comm 0x23622c0 rank 0 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
server1:18697:18712 [0] NCCL INFO Channel 00/02 :    0   1   2
server1:18697:18712 [0] NCCL INFO Channel 01/02 :    0   1   2
server1:18697:18712 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] 2/-1/-1->0->1
server1:18697:18712 [0] NCCL INFO P2P Chunksize set to 131072
server3:2458:2473 [0] NCCL INFO comm 0x346ffc0 rank 2 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
server2:20893:20907 [0] NCCL INFO comm 0x2a1af20 rank 1 nRanks 3 nNodes 3 localRanks 1 localRank 0 MNNVL 0
server3:2458:2473 [0] NCCL INFO Trees [0] 1/-1/-1->2->0 [1] -1/-1/-1->2->0
server3:2458:2473 [0] NCCL INFO P2P Chunksize set to 131072
server2:20893:20907 [0] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 0/-1/-1->1->-1
server2:20893:20907 [0] NCCL INFO P2P Chunksize set to 131072
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 1[0] -> 2[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [send] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 00/0 : 1[0] -> 2[0] [send] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 01/0 : 1[0] -> 2[0] [send] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 00/0 : 2[0] -> 0[0] [receive] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/0
server3:2458:2475 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
server1:18697:18714 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
server2:20893:20909 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
server1:18697:18712 [0] NCCL INFO Connected all rings
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Connected all rings
server2:20893:20907 [0] NCCL INFO Connected all rings
server1:18697:18712 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [send] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 00/0 : 2[0] -> 1[0] [receive] via NET/IB/0
server1:18697:18712 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[0] [receive] via NET/IB/0
server2:20893:20907 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [receive] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Channel 00/0 : 2[0] -> 1[0] [send] via NET/IB/0
server3:2458:2473 [0] NCCL INFO Connected all trees
server1:18697:18712 [0] NCCL INFO Connected all trees
server1:18697:18712 [0] NCCL INFO NCCL_ALGO set by environment to ring
server3:2458:2473 [0] NCCL INFO NCCL_ALGO set by environment to ring
server3:2458:2473 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
server3:2458:2473 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server2:20893:20907 [0] NCCL INFO Connected all trees
server2:20893:20907 [0] NCCL INFO NCCL_ALGO set by environment to ring
server2:20893:20907 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
server2:20893:20907 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server1:18697:18712 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
server1:18697:18712 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server2:20893:20907 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
server2:20893:20907 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
server2:20893:20907 [0] NCCL INFO ncclCommInitRank comm 0x2a1af20 rank 1 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init COMPLETE
server3:2458:2473 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
server3:2458:2473 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
server3:2458:2473 [0] NCCL INFO ncclCommInitRank comm 0x346ffc0 rank 2 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init COMPLETE
server1:18697:18712 [0] NCCL INFO TUNER/Plugin: Plugin load returned 11 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
server1:18697:18712 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
server1:18697:18712 [0] NCCL INFO ncclCommInitRank comm 0x23622c0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 2000 commId 0x35491327c8228dd0 - Init COMPLETE
#
#                                                          out-of-place                        in-place
#       size        count     type  redop   root      time   algbw   busbw #wrong      time   algbw   busbw #wrong
#        (B)   (elements)                             (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
         128           32    float    sum     -1     28.39    0.00    0.01      0     27.35    0.00    0.01      0
         256           64    float    sum     -1     29.44    0.01    0.01      0     28.54    0.01    0.01      0
         512          128    float    sum     -1     29.99    0.02    0.02      0     29.66    0.02    0.02      0
        1024          256    float    sum     -1     32.89    0.03    0.04      0     30.64    0.03    0.04      0
        2048          512    float    sum     -1     34.81    0.06    0.08      0     31.87    0.06    0.09      0
        4096         1024    float    sum     -1     37.32    0.11    0.15      0     36.09    0.11    0.15      0
        8192         2048    float    sum     -1     45.11    0.18    0.24      0     43.12    0.19    0.25      0
       16384         4096    float    sum     -1     57.92    0.28    0.38      0     56.98    0.29    0.38      0
       32768         8192    float    sum     -1     72.68    0.45    0.60      0     70.79    0.46    0.62      0
       65536        16384    float    sum     -1     95.77    0.68    0.91      0     93.73    0.70    0.93      0
      131072        32768    float    sum     -1     162.7    0.81    1.07      0     161.5    0.81    1.08      0
      262144        65536    float    sum     -1     177.3    1.48    1.97      0     177.4    1.48    1.97      0
      524288       131072    float    sum     -1     301.4    1.74    2.32      0     302.0    1.74    2.31      0
     1048576       262144    float    sum     -1     557.9    1.88    2.51      0     559.2    1.88    2.50      0
     2097152       524288    float    sum     -1    1089.8    1.92    2.57      0    1092.2    1.92    2.56      0
     4194304      1048576    float    sum     -1    2165.7    1.94    2.58      0    2166.6    1.94    2.58      0
     8388608      2097152    float    sum     -1    4315.7    1.94    2.59      0    4316.1    1.94    2.59      0
    16777216      4194304    float    sum     -1    8528.8    1.97    2.62      0    8529.3    1.97    2.62      0
    33554432      8388608    float    sum     -1     16622    2.02    2.69      0     16610    2.02    2.69      0
    67108864     16777216    float    sum     -1     32602    2.06    2.74      0     32542    2.06    2.75      0
   134217728     33554432    float    sum     -1     63946    2.10    2.80      0     63831    2.10    2.80      0
   268435456     67108864    float    sum     -1    126529    2.12    2.83      0    126412    2.12    2.83      0
   536870912    134217728    float    sum     -1    251599    2.13    2.85      0    251327    2.14    2.85      0
  1073741824    268435456    float    sum     -1    500664    2.14    2.86      0    501911    2.14    2.85      0
  2147483648    536870912    float    sum     -1   1001415    2.14    2.86      0   1000178    2.15    2.86      0
  4294967296   1073741824    float    sum     -1   1999361    2.15    2.86      0   1997380    2.15    2.87      0
server1:18697:18697 [0] NCCL INFO comm 0x23622c0 rank 0 nranks 3 cudaDev 0 busId 2000 - Destroy COMPLETE
server2:20893:20893 [0] NCCL INFO comm 0x2a1af20 rank 1 nranks 3 cudaDev 0 busId 2000 - Destroy COMPLETE
server3:2458:2458 [0] NCCL INFO comm 0x346ffc0 rank 2 nranks 3 cudaDev 0 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.66163
#
[root@server1 lichao]#

Results explained

- size (B): the size of the data processed by the operation, in bytes.

Attendees will hold in-depth exchanges and discussions on the frontier technologies, innovative applications, and future outlook of smart healthcare, jointly advancing innovation in the global healthcare industry.
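For the all_reduce_perf results above, the algbw and busbw columns can be reproduced from the size and time columns. The sketch below follows the definitions in the nccl-tests performance notes as I understand them: algbw is bytes divided by elapsed time, and for AllReduce busbw is algbw scaled by 2(n-1)/n, where n is the number of ranks (3 in this run). The function names are hypothetical helpers, not part of nccl-tests.

```python
def algbw_gbs(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes): data size over elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbs(algbw, n_ranks):
    """Bus bandwidth for AllReduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


# (size in bytes, out-of-place time in microseconds), copied from two rows of the log
rows = [(1048576, 557.9), (4294967296, 1999361)]

for size, time_us in rows:
    alg = algbw_gbs(size, time_us)
    bus = allreduce_busbw_gbs(alg, n_ranks=3)
    print(f"size={size:>12} algbw={alg:.2f} GB/s busbw={bus:.2f} GB/s")
```

With three ranks the scaling factor is 4/3, which is why busbw in the log is consistently about a third higher than algbw.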

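In addition, several medical technology companies from Shanghai gave presentations at the seminar, sharing their latest technologies and products for the health industry. This cooperation will open up broad scope and opportunities for China and the United States in medical technology innovation, product development, and market expansion.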

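The event brought together many experts, scholars, company representatives, and industry leaders in medical technology from China and abroad to discuss the future development of smart healthcare. In recent years, as the Healthy China strategy has been implemented in greater depth, innovation in community and family health has become an important part of achieving that goal.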