In this blog, Moortec CEO Stephen Crosher talks about monitoring on-chip temperature differences in AI chips, how to make the most of the power that can be delivered to a device, and why accuracy is so critical. Stephen also recently recorded a short video on the topic for Semiconductor Engineering.
Things are getting far more complicated as we move down to 7nm and 5nm, and the tolerances of some of the physical effects that we have been measuring in the past are much tighter than they were at the older nodes. How do we track all that?
What we see is that as we descend through the advanced nodes, say from 16nm down to 12nm, 7nm and more recently 5nm, gate density starts to have an effect on many of the dynamic conditions on the chip. Quite often there are more hot spots, they are more localised, there is some variation in the supplies, and you also have the challenge of process variability across the device.
In order to measure these variations, you need to decide where on the device you are going to monitor the dynamic conditions and the process variation. And of course there is that rule that says whenever you try to monitor something, you disturb some of the inherent information you were trying to measure in the first place.
A typical AI chip has a multi-core architecture, which can vary tremendously in scale between a data center type AI chip and something that is more on the edge, maybe in automotive for example.
What tends to happen on a typical AI chip is that a lot of the cores are always on and you are moving data through at very high speed, so monitoring temperature differences across the chip is even more important than on a smaller device that is simply on or off. The conditions are much more extreme. You may have a multi-CPU, multi-core architecture that consists of hundreds or maybe even hundreds of thousands of cores, and the workloads are very bursty as they run the algorithms and execute on the compute. What we often see is that you cannot quite deliver enough power to have all cores operating at once, so you never reach 100% utilisation and you have to make the most of the power that you can deliver to the device.
So essentially what we are looking at is load balancing across the chip, and it needs to happen dynamically. If you start to get an uneven balance of workload across the cores, that can stress the areas of the chip that are overworked, mainly because of the heat generated when certain regions of the chip are overactive.
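As a rough illustration of what thermally aware load balancing could look like in software, here is a minimal Python sketch. It is not how any particular device does it: the function name `pick_coolest_core`, the per-core temperature list and the 95 degree limit are all assumptions made for the example. The idea is simply that new work goes to the coolest idle core, and that nothing is scheduled onto a core that is already at its limit.

```python
from typing import Optional, Sequence


def pick_coolest_core(
    temps_c: Sequence[float],
    busy: Sequence[bool],
    limit_c: float = 95.0,  # hypothetical thermal limit, for illustration only
) -> Optional[int]:
    """Return the index of the coolest idle core below limit_c, or None."""
    # Keep only cores that are idle and still below the thermal limit.
    candidates = [
        i for i, (temp, is_busy) in enumerate(zip(temps_c, busy))
        if not is_busy and temp < limit_c
    ]
    if not candidates:
        return None  # nothing can safely take more work right now
    # Among the remaining cores, prefer the one with the lowest reading.
    return min(candidates, key=lambda i: temps_c[i])
```

In practice the readings would come from the in-chip monitors and the decision would also have to account for the power budget, which this sketch ignores.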
So when placing the monitors on the chip, it is important first to consider that you are working with repeated structures: multiple cores, perhaps grouped within clusters. What you tend to see is that the monitors are placed per cluster, and the placement of those monitors is then repeated along with the clusters, so it becomes quite uniform. That also makes it easier for the design teams and those doing the floor plans to handle the repetitive nature of the monitor placement.
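To make the idea of repeated placement concrete, here is a small Python sketch that generates one monitor position per cluster on a regular grid. The grid layout, the tile dimensions and the function name `sensor_positions` are assumptions for illustration; a real floor plan is rarely this regular.

```python
from typing import List, Tuple


def sensor_positions(
    cols: int,
    rows: int,
    tile_w: float,
    tile_h: float,
    offset: Tuple[float, float] = (0.5, 0.5),  # position inside each tile, as a fraction
) -> List[Tuple[float, float]]:
    """Return one (x, y) monitor position per cluster tile, row by row."""
    # Every tile uses the same local offset, so the placement repeats with the clusters.
    return [
        ((c + offset[0]) * tile_w, (r + offset[1]) * tile_h)
        for r in range(rows)
        for c in range(cols)
    ]
```

Because every cluster uses the same local offset, the monitor only has to be placed once within the cluster and then travels with it when the cluster is replicated.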
Sometimes as these devices heat up, and because of the conductivity of the silicon, the heat can drift across the chip. After an amount of time there will be thermal dissipation or thermal flow through the silicon; after all, the silicon itself is very thin, so that drift inevitably does happen. But we also hear anecdotally from our customer base that you actually get hot spots that are maybe 20 to 30 degrees higher than other areas of the chip, which is quite significant.
If we talk about accuracy, it is very desirable to have the entire die monitored thermally, but of course there is an overhead to any sensor that you put into the chip, so you have to distribute them carefully and with some sort of granularity. You need to be aware that there will be some distance between where the actual hot spot lies within the chip, maybe a particular core that is being overworked, and where the sensor is actually placed, so a little bit of correlation can be required between that hot spot and the sensor reading.
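One very simplified way to picture that correlation is sketched below in Python. It finds the sensor nearest to a core and then adds an offset that grows with distance. The linear gradient model, the `gradient_c_per_mm` parameter and both function names are assumptions for the example only; in a real flow the relationship between a sensor reading and a hot spot would come from thermal analysis of the actual design.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def nearest_sensor(core_xy: Point, sensors_xy: Sequence[Point]) -> Tuple[int, float]:
    """Return (index, distance) of the sensor closest to core_xy."""
    idx = min(range(len(sensors_xy)), key=lambda i: math.dist(core_xy, sensors_xy[i]))
    return idx, math.dist(core_xy, sensors_xy[idx])


def estimate_hotspot_c(
    sensor_reading_c: float,
    distance_mm: float,
    gradient_c_per_mm: float,  # assumed to come from thermal simulation
) -> float:
    """Estimate the hot spot temperature, assuming a linear gradient to the sensor."""
    return sensor_reading_c + gradient_c_per_mm * distance_mm
```

The closer the sensor sits to the core, the smaller this correction and the less the result depends on the assumed gradient, which is one reason to place sensors as near to the cores as the floor plan allows.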
In terms of where to place the sensors to make sure you are getting an accurate reading, there are a few tools that can be used as part of the development flow, and basically good practice. There are a lot of thermal analysis tools out there that show you where the hot spots are for the different workloads, software and activity profiles that have been run over the chip, and that can give you some general guidance as to where to place the thermal sensors. It can also often be down to the floor planning and where there is available space, but quite often we recommend that you place the sensors as close as possible to the cores and to the highest-density grouping of cores.
It is important to plan the placement of monitors up front, early in the design, to maximise their capability within the architecture. It's all about forethought, forward planning and maybe using some of the simulation tools to help you in that design flow.
The interesting thing about AI architectures is that you have those that are developed as large-scale designs, essentially for data center environments, so the chips can be quite large, going to maybe reticle size and maybe drawing hundreds of watts of power. On the other side, as we have seen more recently, AI is being put on the edge and applied to automotive, so you are getting server-grade systems placed within the car itself. Obviously they have to downscale that, and they also have to think more about the longevity of the devices, reaching 10, 15, maybe even 20 year lifetimes within an automotive context, whereas within a data center the equipment may only last for maybe 3 or 4 years.
If you have AI devices in a data center context, and hence the device is quite large, the supply level at the input pin may not be the same as the level in the center of the chip, so you do have that IR drop effect, and trying to monitor that is quite an interesting area. Dynamic IR drop is especially relevant to AI applications and AI architectures because of the bursty way in which the cores are utilized. You can get high demand quite quickly, with an in-rush of current into the device and then a pulling down on the supply, so we see a droop on the supply. That can be quite important to monitor and capture so that you can compensate for it with the voltage regulators that are supplying the chip in the first place.
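As a sketch of what capturing a droop could mean once supply readings are available, the Python below scans a list of voltage samples and reports each stretch where the supply falls more than a given fraction below nominal. The function name `find_droops`, the 5% threshold and the idea of a simple sampled trace are all assumptions for illustration, not a description of any specific monitor.

```python
from typing import List, Sequence, Tuple


def find_droops(
    samples_v: Sequence[float],
    nominal_v: float,
    max_drop_frac: float = 0.05,  # hypothetical tolerance: 5% below nominal
) -> List[Tuple[int, int, float]]:
    """Return (start_index, end_index, minimum_voltage) for each droop event."""
    threshold = nominal_v * (1.0 - max_drop_frac)
    events = []
    start = None  # index where the current droop began, if any
    for i, v in enumerate(samples_v):
        if v < threshold:
            if start is None:
                start = i
        elif start is not None:
            # Supply has recovered: close the event that ended at the previous sample.
            events.append((start, i - 1, min(samples_v[start:i])))
            start = None
    if start is not None:
        # The trace ended while the supply was still drooping.
        events.append((start, len(samples_v) - 1, min(samples_v[start:])))
    return events
```

The depth and duration of each event are the kind of information a regulator control loop could use to compensate, although how that feedback is actually applied depends on the system.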
In terms of the longevity of the chip and also how well it functions, we categorise things into two areas. There is the monitoring of the dynamic conditions, the live conditions that depend on what sort of activity profiles are on the device, and then there are the static conditions based on how the chip is made, those built-in process conditions, which are also relevant to monitor.
Moortec have been providing innovative embedded subsystem PVT IP solutions for over a decade, empowering their customers with the most advanced monitoring IP on 40nm, 28nm, 16nm, 12nm, 7nm and 5nm. Moortec’s in-chip sensing products support the semiconductor design community’s demands for increased device reliability and enhanced performance optimisation, helping to bring product success by differentiating the customers’ technology. With a world-class design team, excellent support and a rapidly expanding global customer base, Moortec are the go-to leaders in innovative in-chip technologies for the automotive, consumer, high performance computing, mobile and telecommunications market sectors.
If you would like to find out more about monitoring heat on AI Chips or learn more about managing and controlling the conditions on your devices please get in touch using the contact form below or email firstname.lastname@example.org