Artificial Intelligence (AI) chips are starting to enable a new chapter of highly optimised, highly efficient machine learning applications. AI applications are already used by millions of people worldwide on devices such as smartphones and tablets. Over the next few years this will evolve and spread to a whole range of new applications and platforms.
AI chips free the CPU to focus on other tasks, reducing the draw on other critical resources within the system. As CPU temperatures rise, server power consumption increases drastically, so real-time temperature monitoring systems are necessary to allow for power optimisation.
Moortec’s Embedded In-Chip Monitoring Subsystem can provide increased performance, reliability and efficiency for the CPUs, GPUs and DSPs in AI chips.
AI chips have scaled down from 90nm to the most advanced node technologies available today. However, the accompanying power scaling largely stopped at around 90nm, so shrinking the design no longer keeps power density in check. The whole idea is to make the chip smaller and faster while adding some kind of thermal control to avoid burning a hole in it.
AI companies are trying to build architectures that let the SoC run as fast as possible with optimised power per square millimetre, that is, extracting the maximum performance from the chip. As power densities increase, the SoC reaches its thermal limits very quickly, so AI SoC manufacturers are investing a lot of time and resources in removing heat from the SoC in the field. Methods being explored include liquid cooling (a liquid cold plate attached to the die), air cooling and vapour chambers. The most common is a heat sink with forced air directed onto the chip.
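To illustrate why cooling matters as power density rises, the link between dissipated power and die temperature can be sketched with the standard steady-state thermal resistance model, T_j = T_ambient + P × θ. This is a minimal sketch rather than a thermal model of any real AI SoC: the function names and every number used (ambient temperature, the 105 °C limit, the thermal resistance of each cooling option) are assumptions chosen only for illustration.

```python
def junction_temp_c(ambient_c: float, power_w: float, theta_c_per_w: float) -> float:
    """Steady-state junction temperature: T_j = T_ambient + P * theta."""
    return ambient_c + power_w * theta_c_per_w


def max_power_w(ambient_c: float, limit_c: float, theta_c_per_w: float) -> float:
    """Largest power the chip can dissipate without exceeding limit_c."""
    if theta_c_per_w <= 0:
        raise ValueError("thermal resistance must be positive")
    return (limit_c - ambient_c) / theta_c_per_w


# Hypothetical junction-to-ambient thermal resistances in degC/W.
# These are illustrative values, not measurements of real cooling hardware.
COOLING_OPTIONS = {
    "forced-air heat sink": 0.50,
    "vapour chamber": 0.25,
    "liquid cold plate": 0.10,
}

if __name__ == "__main__":
    for name, theta in COOLING_OPTIONS.items():
        budget = max_power_w(ambient_c=25.0, limit_c=105.0, theta_c_per_w=theta)
        print(f"{name}: power budget of about {budget:.0f} W at a 105 degC limit")
```

Under these assumed values, halving the thermal resistance doubles the power the chip can dissipate before it reaches the same temperature limit, which is why the choice of cooling method sets the performance ceiling.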
Die sizes are as large as 100mm², which means that AI chips are reaching the reticle limit in terms of size.
It is important for AI chip designers to consider PVT (process, voltage and temperature) variations across the die. One way to do this is to run a thermal characterisation of the design using several simulation tools to find the hot spots. A high volume of computation means that not only the core processors are loaded: the digital logic busy with arithmetic operations is also where much of the power is consumed. AI SoC designs adopt a few techniques and architectures, such as FinT and FinD architectures, to optimise the computation. The number of cores used in AI chips ranges from tens to hundreds or thousands of independent cores, each with many vectors, and these designs differ in architecture, memory, type of interface and so on.
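The hot-spot search described above can be sketched in a few lines, assuming the thermal simulation (or a grid of in-chip sensors) yields one temperature per region of the die. The grid layout, the function names and the 85 °C threshold are hypothetical and are not taken from any specific tool.

```python
from typing import List, Tuple

Grid = List[List[float]]  # temperatures in degC, one value per die region


def find_hot_spots(temps_c: Grid, threshold_c: float) -> List[Tuple[int, int, float]]:
    """Return (row, col, temp) for every region at or above threshold_c, hottest first."""
    hits = [
        (r, c, t)
        for r, row in enumerate(temps_c)
        for c, t in enumerate(row)
        if t >= threshold_c
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


def thermal_gradient_c(temps_c: Grid) -> float:
    """Difference between the hottest and coolest region on the die."""
    flat = [t for row in temps_c for t in row]
    return max(flat) - min(flat)


if __name__ == "__main__":
    # Made-up 3x3 thermal map of a die.
    die = [
        [62.0, 65.0, 64.0],
        [66.0, 91.0, 70.0],
        [63.0, 88.0, 67.0],
    ]
    print(find_hot_spots(die, threshold_c=85.0))
    print(thermal_gradient_c(die))
```

Regions flagged this way are natural candidates for placing in-chip temperature sensors.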
Analysis & Simulation
During the design phase, analysis and simulation are mostly undertaken with SolidWorks, thermal tools and high-speed I/O signal integrity checks. AI designers will often run into hard clock tree synchronisation issues related to signal integrity.
The accuracy of in-chip PVT monitors is key in defining the margin to the thermal limit: the higher the accuracy, the closer the chip can run to that limit. The reliability required of an AI SoC depends on the target application, which is something to take into account during the design phase. It can range from 2-3 years of consumer warranty to 10-15 years for automotive; however, the MTBF (mean time between failures) should be higher.
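The effect of monitor accuracy on thermal margin can be shown with a small worked example. It assumes the simplest possible guard band, where the worst-case sensor error is subtracted from the thermal limit. The 125 °C limit, the ±5 °C and ±1 °C accuracies and the function names are illustrative assumptions, not figures for any particular monitor.

```python
def usable_limit_c(thermal_limit_c: float, sensor_error_c: float) -> float:
    """Highest reading that is still guaranteed to be below the true limit.

    sensor_error_c is the worst-case absolute error of the monitor (e.g. 5 for +/-5 degC).
    """
    if sensor_error_c < 0:
        raise ValueError("sensor error must be non-negative")
    return thermal_limit_c - sensor_error_c


def must_throttle(reading_c: float, thermal_limit_c: float, sensor_error_c: float) -> bool:
    """True when the reading can no longer be proven safe given the sensor error."""
    return reading_c >= usable_limit_c(thermal_limit_c, sensor_error_c)


if __name__ == "__main__":
    limit = 125.0  # assumed thermal limit in degC
    for error in (5.0, 1.0):
        print(f"+/-{error} degC sensor: usable limit {usable_limit_c(limit, error)} degC")
    # The same 122 degC reading forces throttling only with the less accurate sensor.
    print(must_throttle(122.0, limit, 5.0), must_throttle(122.0, limit, 1.0))
```

With these assumed numbers, the more accurate monitor recovers 4 °C of operating headroom that would otherwise be lost to guard band.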
There are several ways that AI SoC developers deal with this issue. Some handle it in software, others through redundancy, and some use distributed backups, so that even if 5% of the servers are offline the remaining ones can take the same amount of load without impact.
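The 5% figure translates into a simple capacity check: the servers that remain online must be able to carry the full load between them. The sketch below assumes the load is spread evenly and that every server has the same capacity. The server count, load and capacity values are invented for illustration.

```python
def load_per_online_server(total_load: float, servers: int, offline: int) -> float:
    """Load each remaining server carries when the load is spread evenly."""
    online = servers - offline
    if online <= 0:
        raise ValueError("at least one server must remain online")
    return total_load / online


def can_absorb_outage(total_load: float, servers: int, offline: int,
                      capacity_per_server: float) -> bool:
    """True if the remaining servers can take the whole load without overload."""
    return load_per_online_server(total_load, servers, offline) <= capacity_per_server


if __name__ == "__main__":
    # Hypothetical fleet: 100 servers, each able to handle 1000 requests/s.
    # 5 offline servers corresponds to the 5% case.
    print(can_absorb_outage(90_000, servers=100, offline=5, capacity_per_server=1000))
    print(can_absorb_outage(96_000, servers=100, offline=5, capacity_per_server=1000))
```

In other words, tolerating a 5% outage without impact requires that the fleet normally runs at no more than 95% of its total capacity.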
Supply Voltage Issues
AI chip designs have to maintain sufficient decoupling to compensate for any supply droops caused by changes in current. For custom layout of the chip, designers need to make sure that critical data and clock signals are properly distributed over the entire chip.
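A first-order estimate of how much decoupling is "sufficient" follows from the capacitor charge relation Q = C × ΔV = I × Δt, which gives C ≥ I × Δt / ΔV. The sketch below assumes the decoupling capacitance alone supplies a current step for a short interval before the regulator responds, and it ignores parasitic inductance and resistance. The function name and the example numbers are assumptions for illustration only.

```python
def min_decoupling_capacitance_f(current_step_a: float, duration_s: float,
                                 allowed_droop_v: float) -> float:
    """Minimum capacitance (farads) so that supplying current_step_a for
    duration_s drops the supply by no more than allowed_droop_v.

    First-order estimate from C = I * dt / dV; parasitics are ignored.
    """
    if allowed_droop_v <= 0:
        raise ValueError("allowed droop must be positive")
    return current_step_a * duration_s / allowed_droop_v


if __name__ == "__main__":
    # Hypothetical case: a 10 A load step lasting 1 ns with 50 mV of allowed droop.
    c = min_decoupling_capacitance_f(10.0, 1e-9, 0.05)
    print(f"{c * 1e9:.0f} nF")
```

The estimate scales linearly with the current step, so a design with larger or faster load transients needs proportionally more decoupling for the same droop budget.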