By Stephen Crosher, Moortec CEO
At 7nm and 5nm, in-circuit monitoring is becoming essential. Moortec CEO Stephen Crosher recently sat down with Ed Sperling from Semiconductor Engineering to talk about the impact of rising complexity, how different use cases and implementations can affect reliability and uptime, and how measurements of voltage and thermal stress can be used to statistically predict failures and improve reliability throughout a chip’s lifetime.
Resolution & Complexity
With the advanced nodes, from 28nm all the way down to 5nm, the complexity of the designs and of the devices is increasing, as is the variability in the process speed of the silicon. All of this creates uncertainty in the behaviour of the silicon, and when you apply varying workloads with multicore architectures you are faced with an environment where assessing reliability becomes challenging, especially in certain market sectors such as automotive, but also on other platforms such as Data Center and High Performance Computing (HPC).
At the system level, when you are considering reliability, uninterrupted service is a key aspect, as is uptime, which is certainly applicable for data centers. The challenge with the advanced nodes in a high-performance compute or data center environment is that much of the time these devices are working at the extreme: they need to reach high data throughput levels and high algorithmic capacities while making the most of the power available to them. That is where it becomes quite nuanced. Having the ability to see the activity within those chips helps device reliability, by reducing voltage and thermal stress to achieve a longer lifetime and a longer life cycle for the silicon.
What we are starting to see, in some systems, is the opportunity to use analytics to determine the points in the silicon’s lifetime at which to apply maintenance, and even to start statistically predicting failure of the silicon, which is of huge benefit to the systems, the system operators and the products that spin out of this process.
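As a rough illustration of the idea, statistical failure prediction can be as simple as fitting a trend to a monitored parameter and extrapolating it to a failure threshold. The sketch below is a minimal, hypothetical example: it assumes a ring-oscillator frequency that degrades with age as the monitored quantity, which is one plausible telemetry signal but is not described in the article itself.

```python
import statistics

def remaining_life_estimate(days, ro_freq_mhz, fail_threshold_mhz):
    """Estimate remaining useful life (in days) by least-squares linear
    extrapolation of a monitored parameter -- here a hypothetical
    ring-oscillator frequency that drifts down as the silicon ages.

    days:               sample timestamps, in days since deployment
    ro_freq_mhz:        monitored frequency at each sample
    fail_threshold_mhz: frequency below which the part is deemed failing
    """
    mean_x = statistics.fmean(days)
    mean_y = statistics.fmean(ro_freq_mhz)
    # Least-squares slope of frequency vs. time (MHz per day).
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(days, ro_freq_mhz)) \
            / sum((x - mean_x) ** 2 for x in days)
    if slope >= 0:
        return None  # no measurable degradation trend yet
    intercept = mean_y - slope * mean_x
    # Day at which the fitted trend crosses the failure threshold.
    fail_day = (fail_threshold_mhz - intercept) / slope
    return fail_day - days[-1]
```

In practice a production analytics pipeline would use far richer models and many sensor channels; the point is only that in-chip telemetry turns lifetime assessment into a data-fitting problem.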
If you are applying circuits deep within the silicon, you can start to look at other aspects that also have an impact on its life-cycle. If you can make assessments prior to mission mode, you can also apply some screening, so that for silicon deployed into the field you can select the devices you believe will have longer lifetimes, compared with others that may go to a different application. What we are seeing at the moment is that this whole field of in-chip monitoring through the chip’s life-cycle, with telemetry of data coming out of the devices combined with analysis of that data, is creating a whole new environment.
Consider the cycle of a chip as it goes from the design stage through to deployment in the field and into its application’s mission mode. In-circuit monitoring has been around for decades, but what we are seeing now is that if sensors can be integrated within a subsystem, producing meaningful and insightful data as part of the general system within the chip, you can then extract data at every stage. As an IP vendor that has been delivering monitors, sensors and sensing subsystems into advanced node devices, Moortec is in a position to generate this insightful, meaningful data, which helps at the system level to improve product optimization and device reliability.
Insightful within the silicon
Demands are placed on technology products to be more reliable, less likely to fail, and subject to fewer interruptions, and, for battery-powered devices, to deliver longer battery lifetimes. Also, when you look at the seismic change over the last few years in the amount of data put into the cloud and then processed there, it is about getting the most efficiency from the power that goes into data centers for computation. So, if we can be insightful about the silicon and the activity within the chip, we can optimize the power being used for computation for the cloud applications we are all using today.
The importance of accuracy
There is certainly a desire for more accurate sensors, especially thermal sensors, because accuracy allows tighter thermal guard banding: for example, you want to set the levels at which you start to throttle the system’s clock frequencies back, if the temperature is rising too high on the silicon, at the latest point possible. We also see repeated structures in multicore architectures, where being able to monitor the temperature of, say, each processor core becomes more valuable. The software, and the way it has been deployed, can be unpredictable in terms of the workload on each core. So if you can sense temperature for each of those cores, very locally, you can apply better workload balancing schemes across them, ultimately for longer device lifetime and in consideration of the power that goes into the chip.
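The guard-banding and workload-balancing ideas above can be sketched in a few lines. This is a simplified illustration only, not Moortec’s implementation: the guard-band temperatures, hysteresis band, and frequency values are invented for the example, and a real system would act through hardware or OS-level controls rather than a Python function.

```python
# Hypothetical guard-band limits: throttle as late as possible, and use
# hysteresis so a core is only restored once it is comfortably cool.
THROTTLE_C = 95.0
RELEASE_C = 85.0

def balance(core_temps, freqs, f_max, f_throttled):
    """Given per-core temperatures and current clock frequencies, return
    the new per-core frequencies plus the index of the coolest core,
    which a scheduler could prefer when placing the next workload."""
    new_freqs = []
    for temp, f in zip(core_temps, freqs):
        if temp >= THROTTLE_C:
            new_freqs.append(f_throttled)   # clamp the overheating core
        elif temp <= RELEASE_C:
            new_freqs.append(f_max)         # safe to run at full speed
        else:
            new_freqs.append(f)             # inside hysteresis band: hold
    coolest = min(range(len(core_temps)), key=core_temps.__getitem__)
    return new_freqs, coolest
```

The value of accurate, local sensing shows up directly here: the tighter the sensor error bound, the closer `THROTTLE_C` can sit to the true silicon limit before any throttling is triggered.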
You are going from package structures that contain just one die to packages containing multiple die, with some of those die used for different purposes: stacked memory, silicon for the computation itself, and perhaps some interfacing or analog silicon as well. This creates an environment that is more uncertain in the field, in mission mode, as to how the package will behave thermally. You potentially have multiple system-level software stacks in operation, one for each die in use. So it is that unpredictability; modelling at the design phase only gets you so far. What you need is to see what is happening in real time, in actual operation, to judge whether you are over-stressing or overheating the device. There is also the compounding problem of heat dissipation when you have multiple die within one package. These are real challenges for the industry, and it is a combination of things that is helping us make sure products are reliable going forward.
“Known good die”
The ability to make those assessments on known good die to a greater degree is obviously going to be helpful for the eventual deployment of the silicon, and you can also assess which applications certain die should be applied to. It is a common concept in other walks of life and other technology spheres: if you provide people with more information, their decision making can be that much better, to the improvement of products.
Chip lifetimes increasing
When you are designing in particular for applications that require high reliability, those are often the spaces with long lifetimes. For the advanced nodes, good assessments of device lifetime are being made at the simulation and modelling level, but we are obviously at a very early stage with these nodes. Take 5nm, for example: once those devices are deployed, we will have some indication of how they behave as they mature over the next couple of years, but obviously they are going into designs that could be out there for 15 to 20 years. So what is the opportunity? It is to make better assessments of the die through its life-cycle, in terms of manufacture, packaging and deployment, and then while it is actually in mission mode in the field, making assessments for the remaining 15 to 20 years of its lifetime.
Drowning in data!
There are two levels to sampling rates and the data. The first is at the very micro level within the silicon, where you quite often see bursty behaviour, especially in AI applications where the utilization of cores can switch on very quickly; there you want a fast response in terms of thermal sensing, to understand and protect those devices. Over the longer life-cycle, data can be produced to understand things like the thermal profiles that have been applied, depending on the software, and over a chip’s lifetime you can start to see the thermal signatures during that period. You can also get an understanding of how the chip has been designed into the product and whether it has been given a fair chance to reach those 15 to 20 year life-cycles.
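The two sampling regimes described above can be combined over one sensor stream. The sketch below is a hypothetical illustration, assuming an invented `ThermalTelemetry` class: a fast path reacts to bursty over-temperature events within a single sample period, while a slow path accumulates a binned histogram, a simple stand-in for the lifetime thermal signature.

```python
from collections import Counter

class ThermalTelemetry:
    """Two sampling regimes over one temperature stream (illustrative):
    a fast path that flags bursty over-temperature events immediately,
    and a slow path that accumulates a lifetime thermal histogram."""

    def __init__(self, alert_c=100.0, bin_width=5):
        self.alert_c = alert_c           # hypothetical burst-alert limit
        self.bin_width = bin_width       # histogram bin size in deg C
        self.histogram = Counter()       # long-term thermal signature
        self.alerts = 0

    def sample(self, temp_c):
        # Fast path: respond to a thermal burst within one sample period
        # (a real system would throttle or rebalance here, not just count).
        if temp_c >= self.alert_c:
            self.alerts += 1
        # Slow path: bin every reading into the lifetime profile.
        self.histogram[int(temp_c // self.bin_width) * self.bin_width] += 1
```

Comparing a device’s accumulated histogram against its qualification profile is one way to judge whether it has been given a fair chance to reach a 15 to 20 year life-cycle.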
So, yes, we are drowning in data, and this leads to a secondary area: you can have sensors pouring this information out, but you have to make it meaningful. You cannot just send raw data; there has to be some interpretation of that data to make it insightful. That is another area where we see much movement in the marketplace, and something Moortec is particularly interested in.
In case you missed any of Moortec’s previous “Talking Sense” blogs, you can catch up HERE