by Brian Wang
The BaGuaLu AI system used the Chinese Sunway exaflop supercomputer to train the largest AI model to date, with over 174 trillion parameters. The remarkable capabilities of neural network AI systems like ChatGPT (AI that generates novel text and stories), DALL-E (AI that generates novel pictures) and AlphaFold2 (protein structure prediction) come from the growth of the underlying AI models. Reaching 100 trillion parameters means a model can take in all of the text of the internet, or all of its pictures or videos, and learn from those massive datasets.
Metaculus, the public forecasting site, set up a question in February 2020 asking whether a 100-trillion-parameter deep learning system would be trained by 2026. I was at 99% certainty that it would happen. The BaGuaLu paper, published in March 2022, reported training at the 174-trillion-parameter scale. Unpublished systems are probably already at the 200-500 trillion parameter level.
Advances in neural models such as the 2020 Reformer have made it possible to train large models far more memory-efficiently. 100 trillion parameters is considered by some to be the median estimate of the number of synapses in a human neocortex.
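To see why memory efficiency matters at this scale, a back-of-the-envelope sketch helps (the bytes-per-parameter figure is an assumption for illustration, not from the article):

```python
# Rough memory footprint of storing model parameters alone.
# In practice, optimizer state and activations add several times more.
def param_memory_tb(n_params: float, bytes_per_param: int) -> float:
    """Terabytes needed just to hold the raw parameters."""
    return n_params * bytes_per_param / 1e12

# 100 trillion parameters in 16-bit (2-byte) precision:
print(param_memory_tb(100e12, 2))  # 200.0 TB for the weights alone
```

Even at half precision, 100 trillion parameters need around 200 TB just for the weights, far beyond any single node's memory, which is why supercomputer-scale training and memory-efficient model designs go together.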
Large-scale pretrained AI models have shown state-of-the-art accuracy in a series of important applications. As the size of pretrained AI models grows dramatically each year in an effort to achieve higher accuracy, training such models requires massive computing and memory capabilities, which accelerates the convergence of AI and HPC. However, there are still gaps in deploying AI applications on HPC systems, which need application and system co-design based on specific hardware features.
To this end, this paper proposes BaGuaLu, the first work targeting training brain scale models on an entire exascale supercomputer, the New Generation Sunway Supercomputer. By combining hardware-specific intra-node optimization and hybrid parallel strategies, BaGuaLu enables decent performance and scalability on unprecedentedly large models. The evaluation shows that BaGuaLu can train 14.5-trillion-parameter models with a performance of over 1 EFLOPS using mixed-precision and has the capability to train 174-trillion-parameter models, which rivals the number of synapses in a human brain.
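The "hybrid parallel strategies" the abstract mentions combine model parallelism (splitting the parameters across workers) with data parallelism (splitting each batch across replicas). A minimal, hypothetical sketch of how such a split could be sized is below; the function and the example worker counts are illustrative, not BaGuaLu's actual layout:

```python
# Illustrative only: size a hybrid model-parallel x data-parallel split.
# This is NOT BaGuaLu's actual partitioning scheme.
def shard(total_params: int, batch_size: int,
          model_parallel: int, data_parallel: int):
    """Return (params per shard, samples per replica) for a hybrid scheme."""
    params_per_shard = total_params // model_parallel    # split the weights
    samples_per_replica = batch_size // data_parallel    # split the batch
    return params_per_shard, samples_per_replica

# Hypothetical numbers: 174 trillion parameters over 96,000 shards.
params_each, samples_each = shard(174_000_000_000_000, 96_000, 96_000, 96_000)
```

The point of the hybrid approach is that neither axis alone suffices: pure data parallelism cannot fit 174 trillion parameters on one node, and pure model parallelism leaves throughput on the table, so both splits are applied at once.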
Tesla Dojo and Others Racing to 100 Exaflops and 100X to 1000X the BaGuaLu System in 3-5 Years
Tesla is looking to mass produce the Dojo AI training system.
Tesla has made sure that the I/O (input/output), memory, software, power, cooling and all other aspects of the system are fully scalable. This will let Tesla simply build and add compute tiles to scale to the exaflop level and beyond.
About 120 compute tiles would equal 1.1 Exaflops and 120,000 compute tiles would be 1100 Exaflops (or 1.1 Zettaflops).
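The tile arithmetic is linear and easy to check: 120 tiles at about 1.1 exaflops implies roughly 9.17 petaflops per tile, so 1,000 times the tiles gives 1,000 times the throughput:

```python
# Per-tile throughput implied by the article's figures:
# 120 tiles ~= 1.1 EFLOPS  ->  ~0.00917 EFLOPS (~9.17 PFLOPS) per tile.
EFLOPS_PER_TILE = 1.1 / 120

def total_eflops(tiles: int) -> float:
    """Total throughput assuming perfectly linear scaling across tiles."""
    return tiles * EFLOPS_PER_TILE

print(round(total_eflops(120), 2))      # 1.1 EFLOPS
print(round(total_eflops(120_000)))     # 1100 EFLOPS, i.e. 1.1 ZFLOPS
```

This assumes scaling stays perfectly linear, which is exactly the property Tesla's tile design is meant to deliver.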
This AI training capacity will be used to perfect self-driving and to train the Teslabot.
Google, Facebook, Amazon, Microsoft, Nvidia and other big technology companies are racing to make larger and more capable AI and AI training systems.
Brian Wang is a Futurist Thought Leader and a popular science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked the #1 Science News Blog. It covers many disruptive technologies and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting-edge technologies, he is currently a co-founder of a startup and a fundraiser for high-potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and a guest on numerous radio and podcast interviews. He is open to public speaking and advising engagements.