---
marp: true
theme: gaia
title: 数据中心技术 (Data Center Technology)
math: katex
---
Shi Zhan, Wuhan National Laboratory for Optoelectronics, Division of Optoelectronic Information Storage
https://shizhan.github.io/ https://shi_zhan.gitee.io/
- Background, current state, and driving forces
- Historical origins and definitions
- Classic cases
- Supercomputing and data centers
- Key problems and challenges
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 72px; } </style>
<style scoped> p { font-size: 18px; text-align: center; } </style>
Source: https://codecondo.com/web-application-architecture/
<style scoped> table, p { font-size: 20px; } </style>
Global servers | Data storage scale | Global acceleration nodes | Bandwidth reserve | Cloud products & services |
---|---|---|---|---|
1,000,000+ | EB-scale | 2,800+ | 200 Tbps | 300+ |
Source: https://cloud.tencent.com/about
<style scoped> p, a { padding-top: 620px; font-size: 18px; color: #F0F0F0; } </style>
Source: https://datareportal.com/reports/
<style scoped> p { padding-top: 200px; text-align: center; font-size: 72px; color: #0040FF; } </style>
Online services have permeated every aspect of society
<style scoped> p { padding-top: 200px; text-align: center; font-size: 72px; color: #0040FF; } </style>
The pandemic accelerated this process
<style scoped> li, p { font-size: 27px; } </style>
Data centers are accelerating their evolution into computing-power centers, with AI compute becoming the key workload these centers carry. AI applications, large-model training, new inference demands, and fast-rising new businesses are driving rapid growth of the AI market: from 2022 to 2032 the global AI market is projected to grow at a compound annual rate of about 42%, reaching $1.3 trillion by 2032.
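As a quick plausibility check (an illustrative back-calculation, not from the source), a 42% CAGR reaching $1.3 trillion in 2032 implies a 2022 base of roughly $40 billion:

$$\frac{1.3\times10^{12}}{1.42^{10}} \approx \frac{1.3\times10^{12}}{33.4} \approx 3.9\times10^{10}\ \mathrm{USD}$$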
<style scoped> li, p { font-size: 30px; } </style>
- In recent years, fields such as autonomous driving, life sciences and medicine, and intelligent manufacturing have advanced rapidly; with them, ultra-large AI models and massive data keep pushing up the demand for compute, making this the right moment to build intelligent computing centers.
- According to MIIT figures, by the end of 2021 China had 5.2 million standard racks in service, 19 million servers in service, and total compute exceeding 140 EFLOPS. Over 450 hyperscale and large data centers and over 20 intelligent computing centers were in operation nationwide.
- By an incomplete count, between January 1, 2021 and February 15, 2022, at least 26 cities were advancing or had just completed local intelligent computing centers, of which 8 were already in service, including those in Nanjing and Hefei. At least 18 more cities had signed, started, tendered, or planned such projects, with those in Shenzhen and Changsha already under construction.
<style scoped> li, p { font-size: 20px; } </style>
Source: https://www.datanami.com/2018/11/27/global-datasphere-to-hit-175-zettabytes-by-2025-idc-says/
<style scoped> p { font-size: 18px; } </style>
Source: On Global Electricity Usage of Communication Technology: Trends to 2030, Challenges, 2015
Baidu Index (百度指数)
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 72px; } </style>
<style scoped> li { font-size: 30px; } p { font-size: 18px; } </style>
- Cloud computing
  - Continued cloud adoption
- Internet of Things
  - IoT will further drive data center demand
- Big data
  - Analytics workloads driving computing demands
Source: Understanding the drivers behind data center demand, Data Centre Dynamics, 2018
<style scoped> li { font-size: 30px; } p { font-size: 18px; } table { font-size: 22px; width: 100%; } th { background: #007FFF; } </style>
- The extraordinary growth in the use of Artificial Intelligence (AI) across sectors is posing challenges and requiring changes in the design and operation of datacenters so that they can meet ever-increasing demand.
GPU | TDP (W) | TFLOPS (Training) | Over V100 | TOPS (Inference) | Over V100 |
---|---|---|---|---|---|
V100 SXM2 32GB | 300 | 15.7 | 1X | 62 | 1X |
A100 SXM 80GB | 400 | 156 | 9.9X | 624 | 10.1X |
H100 SXM 80GB | 700 | 500 | 31.8X | 2,000 | 32.3X |
Source: Schneider Electric – Energy Management Research Center White Paper 110 Version 1.1
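A quick ratio over the table above (illustrative arithmetic): training performance per watt improves far faster than TDP grows, which is why AI clusters keep densifying even as racks run hotter:

$$\frac{15.7}{300}\approx0.05,\qquad \frac{156}{400}\approx0.39,\qquad \frac{500}{700}\approx0.71\ \ \mathrm{TFLOPS/W}$$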
<style scoped> li { font-size: 30px; } p { font-size: 14px; } table { font-size: 22px; width: 100%; } th { background: #007FFF; } </style>
- An estimate by Schneider Electric, a company that operates in the field of energy systems management and automation, puts AI at 4.3 GW of current energy demand, a figure expected to grow at a compound annual rate of 26% to 36%, reaching a total of between 13.5 GW and 20 GW by 2028.
Schneider Electric estimate | 2023 | 2028 |
---|---|---|
Total data center workload | 54 GW | 90 GW |
AI workload | 4.3 GW | 13.5-20 GW |
AI workload (% of total) | 8% | 15-20% |
AI workload (Training vs Inference) | 20% Training, 80% Inference | 15% Training, 85% Inference |
AI workload (Central vs Edge) | 95% Central, 5% Edge | 50% Central, 50% Edge |
Source: Challenges for datacenters in the face of advancing AI, The IT Monitoring Magazine, 2023
Source: AI and the Data Center: Challenges and Investment Strategies, Information Week, 2023
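The 2028 range follows from compounding the 2023 figure at the quoted rates (a check of the table's arithmetic):

$$4.3\ \mathrm{GW}\times1.26^{5}\approx13.7\ \mathrm{GW},\qquad 4.3\ \mathrm{GW}\times1.36^{5}\approx20.0\ \mathrm{GW}$$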
<style scoped> li, p { font-size: 27px; } </style>
- Released by the NDRC Department of Innovation and High-Tech Development, April 2020
- New infrastructure is an infrastructure system guided by the new development philosophy, driven by technological innovation, and based on information networks; oriented toward high-quality development, it provides services such as digital transformation, intelligent upgrading, and integrated innovation, and comprises three parts: information infrastructure, converged infrastructure, and innovation infrastructure.
- Information infrastructure mainly refers to infrastructure that has evolved from new-generation information technology:
  - communication network infrastructure, represented by 5G, the Internet of Things, the industrial Internet, and satellite Internet
  - new-technology infrastructure, represented by AI, cloud computing, and blockchain
  - computing-power infrastructure, represented by data centers and intelligent computing centers
- The share of intelligent compute in China's total compute rose from 3% in 2016 to 41% in 2020, and was projected to reach 70% by 2023.
Source: http://www.xinhuanet.com/fortune/2020-04/21/c_1125883443.htm
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 72px; } </style>
- The data center concept dates back to the early days of the Internet era (1960s)
- ARPANET (1970s) and the WWW (1990s)
- Increasingly rich applications
  - Email, SNS, IM, blogs/microblogs, video/short video, maps ...
- Growing network capability
  - Dial-up, ADSL, broadband, fiber to the home, 2G through 5G ...
- Server-side Computing -- Cloud
  - In 2006, Amazon pioneered cloud computing by launching the Amazon Web Services platform
<style scoped> li, p { font-size: 25px; } </style>
- ANSI/TIA-942, the standard for data center equipment and system reliability, published in 2005 and revised in 2010 and 2014
- A facility that centrally stores, processes, and distributes large volumes of data, supporting a wide range of IT services and business operations
- Typically comprises servers, network equipment, storage, power supply, cooling, and other infrastructure, with guarantees of security, stability, and reliability
- Provides enterprises and organizations with efficient data management and processing, supporting scenarios such as cloud computing, big data analytics, and online services
<style scoped> th { background: #007FFF; } </style>
Tier | Feature |
---|---|
Tier 1 –– basic data center | no redundancy |
Tier 2 –– redundant components | Single distribution path with redundant components |
Tier 3 –– concurrently maintainable | Multiple distribution paths with only one active |
Tier 4 –– fault tolerant | Multiple active distribution paths |
Source: ANSI/TIA-942 Standard
- Computing center stage (2001-2006)
  - Hosting and maintenance of basic resources and facilities
- Information center stage (2006-2012)
  - Scaling up, virtualization, consolidation
- Cloud center stage (2012-2019)
  - Cloud technology matures; metric monitoring and measurement
- Computing-power center stage (2019-present)
  - Greener and smarter; agile operations and fine-grained management
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 72px; } </style>
<style scoped> h3 { color: #F0F0F0; } p, a { font-size: 18px; padding-top: 520px; text-align: left; color: #F0F0F0; } </style>
Source: Top 10 Data Centers in the World Today, Preetipadma, September 8, 2020
<style scoped> h4 { color: #F0F0F0; } p, li,a { font-size: 27px; color: #F0F0F0; background: rgba(0, 80, 192, 0.5); } </style>
- Beginning in 2014, DuPont Fabros Technology, whose business is building massive data centers and leasing wholesale space to companies on a long-term basis, brought online its biggest facility yet: ACC7 in Ashburn, Virginia.
- ACC7 is 446,000 square feet in size and has a total power capacity of a whopping 41.6 megawatts. The building includes 28 large computer rooms, each with a standard critical load of 1.486 megawatts and the ability to increase density up to 2.1 megawatts. Each data hall can accommodate approximately 378 standard cabinets.
- The company applies a new approach, a "water-side economization plant with chiller assist": outside air cools water for the cooling system through a plate-and-frame heat exchanger, which is expected to be the primary cooling source for 75 percent of the calendar year.
Source: New Data Center Design Drives Efficiency Gains for Dupont Fabros, 2014
<style scoped> h4 { color: #F0F0F0; } p, li, a { font-size: 27px; color: #F0F0F0; background: rgba(0, 80, 192, 0.5); } </style>
- Built and designed to Tier IV standards, Tahoe Reno 1 consists of 1.3 million square feet (120,000 sq m) of data center space, which Switch claims is the largest data center for colocation in the world. Switch plans to expand this to a total of 7.2 million sq ft (670,000 sq m). It has a power capacity of 130 MW, a fifth of its 650 MW goal.
- Switch highlighted the data center's security, reliability and low latency, backed by the Superloop system, a 500-mile, multi-terabit fiber optic network to San Francisco and Los Angeles, as well as the company's 2.5 million sq ft of data center space located in Las Vegas with 10Gbps circuits at 4-millisecond latency. The facility has a tri-redundant UPS power system, and offers up to 42 kW of power per cabinet.
- It runs on 100 percent renewable energy, which Switch currently purchases externally but plans to produce itself in future using Switch I and Switch II, the company's ongoing solar projects located near the Apex Industrial Park in Southern Nevada.
Source: Switch opens Tahoe Reno 1, "world’s largest" colo data center, 2017
<style scoped> h4 { color: #F0F0F0; } p, li, a { font-size: 27px; color: #F0F0F0; background: rgba(0, 80, 192, 0.5); } </style>
- Located in Langfang, China, Range International Information Group is the world's largest data center, occupying 6.3 million square feet of space, an area equivalent to the Pentagon or 110 football fields combined. Construction was completed in 2016.
- It was designed to help meet the skyrocketing needs of the Chinese economic and technological boom that has been running for about two decades. As with most large-scale projects in China, the data center was built with combined public and private investment and is overseen by IBM. It consumes 150 megawatts of power.
Source: And The Title of The Largest Data Center in the World and Largest Data Center in US Goes To..., 2018
<style scoped> li { font-size: 20px; } </style>
- Lakeside Technology Center
  - Location: Chicago, Illinois
  - A converted printing plant; a large fleet of backup generators (53); vast cooling capacity (8.5 million gallons of cooling fluid per year); customers include IBM, CenturyLink, Facebook, and TelX.
- Kolos Data Centre
  - Location: Ballangen, Norway
  - Natural Nordic cooling; Norway's abundant hydropower; high-speed North Atlantic interconnection.
- Tulip Data City
  - Location: Bangalore, India
  - Once the largest outside the US (Tulip Telecom Ltd.); designed with IBM's help.
- Bahnhof's Pionen
  - Location: Central Stockholm, Sweden
  - A Stockholm civil-defense bunker (built in 1943 to protect essential government functions); submarine engines as backup power (Maybach MTU diesel engines).
- Next-Generation Data
  - Location: Newport, UK
  - Hosts public clouds for BT, IBM, and others; claims the UK's best PUE; dedicated power (has its own sub-station with a direct connection to the 400kV Super Grid).
<style scoped> li { font-size: 27px; } </style>
- Location: Baar, Switzerland
- Billed as the world's most secure data center; it traces back to the EU's 2010 Planets (Preservation and Long-term Access through Networked Services) project
- We can still read Einstein's paper notebooks today, but 70 years from now we may well be unable to read Stephen Hawking's digital notes. The project set out to ensure that "our digitized cultural and scientific treasures remain accessible for the long term."
- Built in 1994 by Christoph Oschwald and his business partner Hanspeter Baumann, who converted the former headquarters of the Swiss Air Force into a top-notch data center by installing emergency diesel engines, a ventilation system, a filter, and an air-pressure system to prevent the entry of any poisonous gases.
- Water from an underground lake keeps the center’s cooling system at 8 degrees Celsius.
<style scoped> p { font-size: 18px; padding-top: 520px; text-align: left; } </style>
<style scoped> li { text-align: center; font-size: 60px; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } p { font-size: 18px; padding-top: 380px; text-align: left; } </style>
- Customer-exclusive key backup
Source: Encrypted, daily monitored and fully automatic
<style scoped> p { text-align: center; font-size: 60px; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } li { font-size: 30px; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
Ideal protection against NSA and PRISM!
- Data storage inside of Switzerland (<www.swissfortknox.com>)
- Encryption of the data with 256-bit AES (wikipedia)
- Personal encryption key which is NOT known to us (no backdoors); see the client-side sketch below
- Redundant data storage and contractual availability of 99.7% (GTC)
- Compliance with the legal requirements for a backup in accordance with Swiss law (Certificate and Report, both in German)
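A minimal sketch of the client-side ("zero-knowledge") model this feature list describes, assuming Python with the `cryptography` package; the payload and variable names are illustrative, not the provider's actual protocol:

```python
# Client-side 256-bit AES-GCM: the provider stores only nonce + ciphertext
# and never learns the customer's personal key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # personal key, kept by the customer
aesgcm = AESGCM(key)

nonce = os.urandom(12)                     # fresh 96-bit nonce per backup
ciphertext = aesgcm.encrypt(nonce, b"backup payload", None)

# Upload nonce + ciphertext; restoring requires the customer-held key.
assert aesgcm.decrypt(nonce, ciphertext, None) == b"backup payload"
```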
<style scoped> h3 { color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } p, li, a { font-size: 30px; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
- Location: Bluffdale, Utah
- The $1.2 billion project included 100,000 square feet of Tier III data center space and 1,350,000 square feet of technical support and administrative space. Support facilities include water treatment facilities, a vehicle inspection facility, an interim visitor control center, perimeter site security measures, fuel storage, water storage, a chiller plant, fire suppression systems, and 100% electrical generator and UPS backup capacity.
- The facility showcases numerous innovative technology and energy efficiency features and was designed and constructed to achieve LEED Silver certification.
Source: FLAGSHIP UTAH DATA CENTER
<style scoped> h3 { color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } li, a { font-size: 30px; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
- Designed storage capacity reportedly at the yottabyte scale ... credible figures are still lacking
  - $1\,\mathrm{YB}=2^{10}\,\mathrm{ZB}=2^{20}\,\mathrm{EB}=2^{30}\,\mathrm{PB}=2^{80}\,\mathrm{B}$
- Said to be able to store 100 years of valuable communications (the entire Internet worldwide in 2011 amounted to only about 52 exabytes, $52\times2^{60}$ bytes). Its purpose is to support the Comprehensive National Cybersecurity Initiative (CNCI); it is an executive agency of the Director of National Intelligence (DNI), and its specific duties are classified.
- The Utah Data Center, in the small town of Bluffdale, is the NSA's third data center, after the Fort Meade headquarters in Maryland and the San Antonio backup center in Texas. Its purpose is to store, process, and analyze cyberspace big data on the basis of fully shared national-security intelligence, underpinning US national security in the new era.
<style scoped> p { font-size: 18px; padding-top: 620px; text-align: left; } </style>
Source: http://www.iiclouds.org/20141114/maps-of-data-center-localization/
<style scoped> h3 { padding-top: 5%; } p { font-size: 18px; padding-top: 40%; text-align: center; } </style>
Source: https://www.google.cn/about/datacenters/locations/
<style scoped> p { font-size: 18px; padding-top: 47%; text-align: left; } </style>
Source: https://www.cloudwards.net/news/amazon-announces-new-aws-paris-region-opening-in-2017-14326/
<style scoped> p { font-size: 18px; padding-top: 47%; text-align: left; } </style>
Source: https://aws.amazon.com/cn/cloudfront/features/
<style scoped> h3 { color: #F0F0F0; } p, a { font-size: 18px; padding-top: 47%; text-align: left; color: #F0F0F0; } </style>
Source: https://www.urtech.ca/2019/01/solved-where-are-microsofts-data-centers-located/
<style scoped> p { font-size: 18px; padding-top: 520px; text-align: left; } </style>
Source: https://www.atomia.com/2016/11/24/comparing-the-geographical-coverage-of-aws-azure-and-google-cloud/
<style scoped> p { font-size: 18px; } </style>
Source: https://www.atomia.com/2016/11/24/comparing-the-geographical-coverage-of-aws-azure-and-google-cloud/
<style scoped> p { font-size: 18px; } </style>
Source: 中国数据中心产业发展白皮书,中国通服数字基建产业研究院,2023
<style scoped> p { font-size: 18px; } </style>
Source: 中国信息通信研究院 开放数据中心委员会
<style scoped> p { font-size: 18px; } </style>
Source: https://www.newdc.org.cn/datacenter.html
<style scoped> h3 { opacity: 0; } p, a { font-size: 18px; padding-top: 550px; text-align: left; color: #F0F0F0; } </style>
Source: 阿里云宣布五大超级数据中心落成 未来还将再添十座, 2020年07月31日
<style scoped> h3 { opacity: 0; } p { font-size: 45px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
These super data centers make extensive use of energy-saving technologies such as liquid cooling, water cooling, and wind power. The newly completed Hangzhou data center hosts the world's largest liquid-cooled server cluster, dissipating heat by "soaking servers" in a special coolant rather than water, which can cut data center energy use by more than 70%. The five super data centers also deploy autonomous operations robots for intelligent maintenance, safeguarding operation around the clock.
<style scoped> h3 { opacity: 0; } p, a { font-size: 18px; padding-top: 530px; text-align: right; } </style>
Source: 腾讯云全球基础设施, 腾讯云印尼数据中心开服 未来将打造双可用区格局
<style scoped> h3 { opacity: 0; } p { font-size: 45px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
In April 2021, Tencent Cloud announced that its first cloud data center in Indonesia had officially gone into service. Located in the capital Jakarta, it will be joined within a year by a second Indonesian data center, creating a dual-availability-zone layout. With this launch, Tencent Cloud operates 61 availability zones across 27 geographic regions worldwide, with overseas data centers in South Korea, Japan, India, Singapore, the United States, Germany, Russia, Canada, Thailand, and other countries.
Tencent's cloud data center in Qingyuan went into service in July 2020; its eight server buildings will house more than one million servers.
T-Block modularizes server rooms, cooling, power, and other components, radically simplifying data center construction; on-site construction time is cut by more than 80%.
Source: 探访腾讯国内最大数据中心,百万台服务器啥概念
- Zhongjin (中金数据) Wuhan Data Center
- Carrier data centers on Guanggu 8th Road, Optics Valley
- China Telecom Central China Data Center at Linkong Port (under construction)
- China Telecom Guanggu 8th Road Data Center
- Hankou Bank Wuhan Optics Valley primary data center
- China Construction Bank Wuhan Nanhu Data Center
- Chutian Cloud Huashan Data Center
The builders of data centers
2021: These are the World’s Largest Data Center Colocation Providers, Yevgeniy Sverdlik, Jan 15, 2021
<style scoped> table { width: 100%; font-size: 22px; } th { background: #007FFF; } tr:nth-of-type(3), tr:nth-of-type(5), tr:nth-of-type(6), tr:nth-of-type(9), tr:nth-of-type(11) { color: #F08000; } </style>
Rank | Company | Market share | Headquarters |
---|---|---|---|
1 | Equinix | 11.1 % | Redwood City, California |
2 | Digital Realty Trust | 7.6 % | Austin, Texas |
3 | China Telecom | 6.1 % | Beijing, China |
4 | NTT GDC | 4.3 % | Tokyo, Japan |
5 | China Unicom | 4.2 % | Beijing, China |
6 | China Mobile | 2.1 % | Beijing, China |
7 | CyrusOne | 1.9 % | Dallas, Texas |
8 | KDDI Telehouse | 1.9 % | Tokyo, Japan |
9 | GDS | 1.6 % | Shanghai, China |
10 | Global Switch | 1.4 % | London, UK |
11 | 21Vianet | 1.4 % | Beijing, China |
12 | CoreSite | 1.3 % | Denver, Colorado |
13 | Cyxtera | 1.2 % | Coral Gables, Florida |
14 | Lumen (CenturyLink) | 1.1 % | Monroe, Louisiana |
15 | Flexential | 1.1 % | Charlotte, North Carolina |
It’s important to note that China Telecom is one of five Chinese companies on the leaderboard (also China Unicom, China Mobile, GDS 万国数据, and 21Vianet 世纪互联), all of whom do business primarily in China. China’s market is so vast that these providers can stay mostly domestic (with some international presence) and still have huge share of the global market.
China’s protectionist regulatory policy makes it extremely difficult for foreign companies to compete in the country’s vast data center market, and international players’ interest in China has waned. As a result, Chinese hyperscalers’ explosive growth in recent years has driven huge growth for Chinese companies that build and operate data centers for the likes of Alibaba and Tencent.
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 72px; } </style>
<style scoped> p { padding-top: 350px; font-size: 27px; text-align: left; color: #F0F0F0; } </style>
The Nanjing Intelligent Computing Center runs on Inspur AI server racks equipped with Cambricon MLU270 (思元270) and MLU290 (思元290) chips and accelerator cards. The system already in operation delivers AI compute of 8×10^17 operations per second, far beyond the general-purpose compute a traditional data center supplies; in one hour it can recognize 10 billion images, translate 3 million hours of speech, or process the AI data for 10,000 km of autonomous driving.
<style scoped> p { font-size: 27px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
During WAIC 2020, SenseTime announced the groundbreaking of the Lingang supercomputing center for Shanghai's "new-generation AI computing and empowerment platform". The center occupies nearly 80 mu (about 5.3 hectares), with total investment above 5 billion RMB; phase one will house 5,000 cabinets rated at an equivalent 8,000 W each. Once built and in service, its total compute will exceed 3700 PFLOPS, enough to ingest 8.5 million video streams concurrently and complete 23,600 years' worth of video processing in a single day.
<style scoped> p { font-size: 27px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } li { font-size: 27px; text-align: left; padding-top: 300px; } </style>
On August 30, 2022, Alibaba Cloud announced the formal launch of its Zhangbei super intelligent computing center. With a planned 12 EFLOPS (1.2×10^19 floating-point operations per second) of AI compute, it would surpass Google's 9 EFLOPS and Tesla's 1.8 EFLOPS to become the world's largest intelligent computing center, supplying intelligent compute for AI applications such as large-model training, autonomous driving, and spatial geography.
- Source: 阿里云启动超级智算中心,总算力达12 EFLOPS
<style scoped> h3 { color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } p { font-size: 27px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
You may have heard that Elon Musk's xAI built the giant AI supercomputer Colossus in Memphis. This multi-billion-dollar AI cluster holds 100,000 NVIDIA H100 GPUs (and is being doubled by adding another 50,000 H100 and 50,000 H200 GPUs). It is remarkable not only for its scale but for its build speed: the team stood the giant cluster up in just 122 days.
<style scoped> h3 { color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } p { font-size: 27px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
Colossus's basic building block is the Supermicro liquid-cooled rack: eight 4U servers, each fitted with eight NVIDIA H100s, for 64 GPUs per rack. Eight such GPU servers plus a Supermicro coolant distribution unit (CDU) and associated hardware make up one GPU compute rack. The racks are arranged in groups of eight, 512 GPUs in all, which together with the networking form a sub-cluster.
<style scoped> h3 { color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } p { font-size: 27px; text-align: left; color: #F0F0F0; background: rgba(0, 80, 192, 0.7); } </style>
In its network, every fiber link runs at 400GbE, 400 times faster than the common 1GbE links seen elsewhere. With nine such links per server, a single GPU compute server gets roughly 3.6 Tbps of bandwidth.
<style scoped> h3 { padding-top: 200px; text-align: center; font-size: 70px; } </style>
<style scoped> p { font-size: 18px; text-align: left; } </style>
- It is: concentrating strength to accomplish great things
- …
- …
Source: https://jgbarbosa.github.io/vis/docs/intro_to_hpc/intro_to_hpc_01.html
<style scoped> p { font-size: 18px; text-align: left; } </style>
- It is: concentrating strength to accomplish great things
- Or: nothing about the people is a small matter
- …
Source: https://www.networkcomputing.com/cloud-infrastructure/guide-cloud-computing-architectures
<style scoped> p { font-size: 14px; text-align: left; } </style>
- It is: concentrating strength to accomplish great things
- Or: nothing about the people is a small matter
- Or even: mobilizing the broad masses to accomplish great things
Source: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/infrastructure/hpc-cfd
Discussions:
2013: A comparative study of high-performance computing on the cloud, HPDC'13
2017: Understanding the Performance and Potential of Cloud Computing for Scientific Applications, ToCC'17
2018: HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges, CSUR'18
2019: Use Cases for HPC in the Cloud
2020: HPC in the Cloud? Yes, No and In Between
2020: High Performance Computing Vs Cloud Computing: Which is Better?
2021: HPC and the Cloud
<style scoped> p { font-size: 18px; text-align: left; } </style>
- From the masses, to the masses
Source: The Power of the Community – Crowd Sourcing, Open Source and Social Networking
<style scoped> p { font-size: 18px; padding-top: 620px; text-align: left; } </style>
Source: https://www.winsystems.com/cloud-fog-and-edge-computing-whats-the-difference/
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 70px; } </style>
<style scoped> p { text-align: center; } </style>
- Multi-cloud
  - 60%-80% of the racks newly added by BAT (Baidu, Alibaba, Tencent) carry cloud workloads
  - Evolution from small single-site centers toward large multi-site centers serving whole industries/regions, and cross-industry/cross-region centers
- East Data, West Computing (东数西算)
  - Deployments are shifting from urban centers to surrounding areas, and will also migrate from eastern to western China
<style scoped> h3 { padding-top: 200px; text-align: center; font-size: 70px; } p { text-align: center; } </style>
Datacenters as a Computer
……
Nation as a Computer?
<style scoped> p { text-align: center; } </style>
Computing capacity keeps expanding, narrowing the gap with the world's most advanced countries.
<style scoped> p { font-size: 14px; } </style>
Source: https://www.36kr.com/p/1964095815756041
<style scoped> h2 { padding-top: 200px; text-align: center; font-size: 70px; } p { text-align: center; } </style>
What lies hidden behind all this scale?
<style scoped> p { font-size: 25px; } </style>
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vips for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Source: Software Engineering Advice from Building Large-Scale Distributed Systems
Source: Designs, Lessons and Advice from Building Large Distributed Systems, LADIS 2009
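A back-of-envelope sketch of why these numbers dominate system design (cluster size and job sizes are assumed for illustration; only the ~1000 machine failures per year comes from the list above):

```python
# Expected machine-failure rate seen by a job spanning part of a cluster.
CLUSTER_SIZE = 10_000              # assumed cluster size
FAILURES_PER_YEAR = 1_000          # "~1000 individual machine failures" above
HOURS_PER_YEAR = 365 * 24

per_machine_hourly = FAILURES_PER_YEAR / CLUSTER_SIZE / HOURS_PER_YEAR

for job_machines in (100, 1_000, 5_000):
    mtbf_hours = 1 / (per_machine_hourly * job_machines)
    print(f"{job_machines:>5}-machine job: one machine failure every {mtbf_hours:8.1f} h")
# A 5,000-machine job loses a machine roughly every 17-18 hours, which is
# why failover and checkpointing are mandatory at this scale.
```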
- Reliability
  - Failover to other replicas/datacenters.
- Scalability
  - Ensure your design works if scale changes by 10X or 20X; the right solution for X is often not optimal for 100X.
- Consistency
  - Multiple data centers imply dealing with consistency issues.
- Availability
  - Worry about variance! Redundancy or timeouts can help rein in the latency tail (see the hedged-request sketch below).
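A minimal sketch of the "hedged request" pattern behind the last point (delays, weights, and replica behavior are illustrative assumptions, not from the source): if the primary replica has not answered within a small budget, fire a backup request and take whichever answer arrives first.

```python
# Hedged requests: bound tail latency by racing a late backup request
# against a slow primary.
import concurrent.futures as cf
import random
import time

def query_replica(replica_id: int) -> str:
    # Simulated replica: usually fast, occasionally a slow straggler.
    time.sleep(random.choices([0.01, 0.5], weights=[95, 5])[0])
    return f"answer from replica {replica_id}"

def hedged_get(hedge_after: float = 0.05) -> str:
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(query_replica, 0)]
        done, _ = cf.wait(futures, timeout=hedge_after)
        if not done:                       # primary is slow: hedge
            futures.append(pool.submit(query_replica, 1))
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

print(hedged_get())   # now slow only when both replicas straggle at once
```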
<style scoped> table { width: 100% } li, th, td, p { font-size: 14px; } </style>
Operation | Time | | | |
---|---|---|---|---|
L1 cache reference | 0.5 ns | |||
Branch mispredict | 5 ns | |||
L2 cache reference | 7 ns | 14x L1 cache | ||
Mutex lock/unlock | 25 ns | |||
Main memory reference | 100 ns | 20x L2 cache, 200x L1 cache | ||
Compress 1K bytes with Zippy | 3,000 ns | 3 us | ||
Send 1K bytes over 1 Gbps network | 10,000 ns | 10 us | ||
Read 4K randomly from SSD* | 150,000 ns | 150 us | ~1GB/sec SSD | |
Read 1 MB sequentially from memory | 250,000 ns | 250 us | ||
Round trip within same datacenter | 500,000 ns | 500 us | ||
Read 1 MB sequentially from SSD* | 1,000,000 ns | 1,000 us | 1 ms | ~1GB/sec SSD, 4X memory |
Disk seek | 10,000,000 ns | 10,000 us | 10 ms | 20x datacenter roundtrip |
Read 1 MB sequentially from disk | 20,000,000 ns | 20,000 us | 20 ms | 80x memory, 20X SSD |
Send packet CA->Netherlands->CA | 150,000,000 ns | 150,000 us | 150 ms |
Credit: Jeff Dean, http://research.google.com/people/jeff/; originally by Peter Norvig, http://norvig.com/21-days.html#answers
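The table is most useful for back-of-envelope estimates. A small sketch using three of its rows (the 1 GB workload is an assumed example):

```python
# Time to read 1 GB sequentially, from the per-MB numbers in the table.
NS_PER_MB = {"memory": 250_000, "SSD": 1_000_000, "disk": 20_000_000}

for medium, ns in NS_PER_MB.items():
    seconds = ns * 1024 / 1e9          # 1 GB = 1024 MB
    print(f"1 GB sequential from {medium:6s}: {seconds:6.2f} s")
# memory ~0.26 s, SSD ~1.0 s, disk ~20 s: the ratios matter more than the
# absolute values, which shift with every hardware generation.
```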
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: https://colin-scott.github.io/personal_website/research/interactive_latency.html
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: http://stereobooster.github.io/latency-numbers-every-programmer-should-know
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: https://www.servethehome.com/compute-express-link-cxl-latency-how-much-is-added-at-hc34/
<style scoped> p { font-size: 27px; } </style>
Globally, data centers were estimated to use between 196 terawatt hours (TWh) (Masanet et al, 2020) and 400 TWh (Hintemann, 2020) in 2020. This would mean data centers consume between 1% and 2% of global electricity demand.
Source: Recalibrating global data center energy-use estimates, Science, 28 Feb 2020
- The PUE (Power Usage Effectiveness) metric: what fraction of the energy reaches the actual workload?
  - Championed and maintained by The Green Grid
- The ideal
  - $PUE=1.0$: essentially no energy consumed outside the IT equipment, cooling included
  - Impossible in practice; what computer gives off no heat?
- Early facilities typically sat around 2.0, i.e., total consumption double the IT load (see the worked ratio below).
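The defining ratio, with illustrative numbers: a facility drawing 1.5 MW in total to power a 1.0 MW IT load has

$$\mathrm{PUE}=\frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}=\frac{1.5\ \mathrm{MW}}{1.0\ \mathrm{MW}}=1.5$$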
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: How Much Energy Do Data Centers Really Use?, March 17, 2020
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: Google: Our PUE is Lower, and It's Scrupulous, Mar 26, 2012
- Tech giants Amazon, Google, and Microsoft can keep PUE within 1.2
  - Best to date: 1.07 (Facebook), 1.12 (Google)
- The rest are a different story
  - According to Uptime Institute research, the average US data center has a PUE of 2.5, and facilities with a PUE of 3.3 or higher remain common...
- MIIT, the National Government Offices Administration, and the National Energy Administration jointly issued the Guiding Opinions on Strengthening the Construction of Green Data Centers, requiring newly built large and hyperscale data centers nationwide to achieve PUE below 1.4 by 2022.
The Action Plan sets phased, quantified targets for 2021 and 2023 to guide traditional data centers in evolving into new-style data centers. To measure the industry's development scientifically and turn its advantage in sheer volume into an advantage in quality, the plan strengthens indicators that reflect high-quality development, such as utilization, compute scale, energy efficiency, and network latency, while de-emphasizing indicators that merely reflect size.
By the end of 2023, the targets are: average national data center utilization above 60%; total compute above 200 EFLOPS, with high-performance compute at 10%; PUE of newly built large and hyperscale data centers below 1.3, striving for below 1.25 in cold and severely cold regions; and, in principle, one-way end-to-end network latency within national hub nodes under 20 ms.
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: https://natick.research.microsoft.com/
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: https://www.demilked.com/facebook-server-farm-arctic-lule-sweden/
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: https://www.sohu.com/a/233201201_398039
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: https://alibabagroup.com/cn/news/article?news=p150908
<style scoped> h3 { text-align: center; } </style>
<style scoped> h3 { text-align: left; } </style>
<style scoped> h3, p { color: #FFFFFF; } </style>
Data centers' renewable-energy share is poised to improve quickly over the next few years. According to research by Greenpeace, in 2018 coal-fired power supplied 73% of Chinese data centers' electricity, while renewables supplied only 23%, below the 26.5% renewable share of China's grid power.
By 2020, the renewable share at Chinese data centers had risen to about 30%, an improvement over 2018.
In the coming years, as the state and the provinces tighten constraints on data centers' fossil-energy use and as new energy storage, distributed photovoltaics, and similar technologies scale up, the renewable share of data center power should rise sharply, with green power possibly exceeding 50%.
<style scoped> h3, p { color: #FFFFFF; } </style>
Building "zero-carbon data centers" has become the end goal of low-carbon development. With the state tightening control over data center energy use, and with PUE optimization and integrated source-grid-load-storage technologies maturing, 100%-renewable "zero-carbon" or "low-carbon" data centers have become a major direction for mainstream providers.
For example, China Telecom has fused the digital economy with Qinghai's clean energy to build its Digital Qinghai green big-data center, the country's first green big-data center running on 100% traceable clean energy and the first demonstration of a source-grid-load-storage smart green-power supply system for data centers, redefining the standard for green data centers and the model for green energy consumption.
- The problem remains complicated
  - PUE gauges a data center's total electricity use
  - It mainly reflects how efficiently resources are operated
  - It considers only the data center's internal operation, revealing neither where the power comes from nor how much is actually consumed
  - Data centers take 1-2% of global electricity, but what is the source of that power?
<style scoped> p { padding-top: 620px; font-size: 18px; } </style>
Source: How much energy do data centers use? October 8, 2021
- Measurement standards
  - Different measurement methods and standards can yield different PUE values
- Environmental factors
  - In hot or cold regions, a data center's cooling or heating energy may rise
- Load
  - At low load, energy consumption differs from that at full load, shifting the PUE
Server upgrade --> more compute, yet more energy-efficient
Cooling and power systems unchanged --> non-IT energy stays the same
Server upgrade --> more compute, yet more energy-efficient
Cooling and power systems unchanged --> non-IT energy stays the same
And so PUE ... actually gets worse?
Yet PUE has become a basis for policy, which may in turn distort equipment choices (see the worked numbers below)
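Illustrative numbers (assumed, not from the source): a server refresh that cuts the IT load from 1000 kW to 600 kW under a fixed 500 kW overhead lowers total energy use yet raises PUE:

$$\text{before: }\frac{1000+500}{1000}=1.5\qquad\text{after: }\frac{600+500}{600}\approx1.83$$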
- A training cluster for hundred-billion-parameter models involves 10,000+ accelerator cards and tens of millions of components
- …
- A single component failure can interrupt training across the whole cluster
- …
- Distributed training involves complex interactions; cross-domain fault localization is difficult both technically and procedurally
- …
- …
- A training cluster for hundred-billion-parameter models involves 10,000+ accelerator cards and tens of millions of components
  - Daily electricity bills exceed 100,000 RMB
- A single component failure can interrupt training across the whole cluster
  - Meaning losses of 100,000+ RMB per day
- Distributed training involves complex interactions; cross-domain fault localization is difficult both technically and procedurally
  - Across the industry, large-model clusters average stable training runs measured in days
  - Handling a distributed-training fault takes 1 to 30 days
- A training cluster for hundred-billion-parameter models involves 10,000+ accelerator cards and tens of millions of components
  - Daily electricity bills exceed 100,000 RMB
- A single component failure can interrupt training across the whole cluster
  - Meaning losses of 100,000+ RMB per day
- Distributed training involves complex interactions; cross-domain fault localization is difficult both technically and procedurally
  - Across the industry, large-model clusters average stable training runs measured in days
  - Handling a distributed-training fault takes 1 to 30 days

A hot research direction: how to reduce the cost of retraining during large-model training? (see the checkpointing sketch below)
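One classic lever is periodic checkpointing with a well-chosen interval; below is a minimal sketch using Young's approximation $\tau\approx\sqrt{2\delta M}$ (checkpoint cost $\delta$, mean time between failures $M$). The numbers are illustrative assumptions, not measurements:

```python
# Young's approximation for the optimal checkpoint interval:
#   tau ~ sqrt(2 * checkpoint_cost * MTBF)
from math import sqrt

checkpoint_cost_s = 10 * 60    # assumed: 10 minutes to write one checkpoint
mtbf_s = 24 * 3600             # assumed: one job-killing failure per day

tau = sqrt(2 * checkpoint_cost_s * mtbf_s)
print(f"checkpoint every {tau / 3600:.1f} hours")          # ~2.8 hours

# Expected recomputation per failure is roughly half an interval,
# far cheaper than replaying days of training from scratch.
print(f"~{tau / 2 / 60:.0f} minutes of lost work per failure")
```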
- Data center fundamentals: origins, definitions, trends
- Performance, availability, and reliability problems of large-scale computer systems
- The core PUE metric and the classic measures around it
- Coming up: data center guest lectures and hands-on practice
- Selected papers: prepare for next month's seminar
  - Before reading: be sure to track down the main cited references and present the research background incisively
- Get familiar with the experiments; use the ongoing competition to drive practice
  - The ongoing competition is for getting started; if you have energy to spare, browse the major competition platforms for PVP
  - Learning competition: build a multimodal model to generate host observability metrics