Stephen Crosher, CEO of Moortec, sat down with Ed Sperling from Semiconductor Engineering to discuss on-chip monitoring and its impact on power, security and reliability, including predictive maintenance. What follows are excerpts of that conversation.
SE: What new problems are you seeing in design?
Crosher: There are challenges emerging for companies working on advanced nodes, including scaling and transistor density. That stirs up a number of problems, including supply and thermal issues and process variability. We started seeing this at 40nm and 28nm, when we encountered thermal runaway and leakage. At that stage we were a design services company.
SE: Leakage was a bigger problem at 40 and 28nm than at 16/14nm with the introduction of finFETs, although it’s creeping back up at 10/7nm, right?
Crosher: Yes. It did reduce dramatically when it went from 28nm to finFETs. But one of the issues with the finFET structures involves their ability to dissipate heat. And because they’re smaller devices, there are still thermal issues. The greater the gate density, the more the power density, and then you get thermal issues and hotspots.
SE: And that’s dynamic power density as well as leakage current, right?
Crosher: Yes, that’s right. Where we started at 40nm and 28nm, it was the leakage, and then also the dynamic activity on the chip that was creating high-temperature issues, and then thermal runaway. It’s at that stage where monitoring ceased to be more of an interest for software developers running different software profiles on a chip and became something that was quite critical. You had to actively manage heat and power.
SE: You can’t get away with assuming you’re going to get a 30% or 40% boost in performance and lower power anymore at each new node. Now you have to get more granular with everything, from design to simulation, right?
Crosher: Yes, and you always have a disconnect between what you can simulate and what you actually see in silicon. You can’t allow for the dynamic conditions on-chip completely through simulation. You have to have some actual monitoring on-chip to account for corner cases. You may have multi-core processor architectures, where you have processes being active for different purposes, and then you create a scenario where you’re actually heating up the chip more than you’ve seen in simulation. Having the monitor there is quite important. Having monitors that are accurate is an underlying principle for optimization schemes. Whether it’s adaptive voltage scaling or dynamic frequency scaling, that optimization is only going to be as good as the monitor beneath there. That’s what gives you the opportunity to save more power.
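As an editorial illustration of why monitor accuracy bounds the optimization, here is a minimal Python sketch of a dynamic frequency scaling decision with a guard band. It is not Moortec's implementation: the function names, the frequency steps, the 100° C limit and the sensor error figures are all hypothetical values chosen for the example.

```python
def throttle_threshold_c(limit_c, sensor_error_c):
    """Highest reading that can be trusted before the die may really be at limit_c.

    A sensor that can read low by sensor_error_c forces the controller to
    act that many degrees early (assumed worst-case error model).
    """
    return limit_c - sensor_error_c


def next_frequency_mhz(temp_c, current_mhz, throttle_at_c, recover_at_c,
                       steps_mhz=(800, 1200, 1600, 2000)):
    """Pick the next clock step from one die-temperature reading (hypothetical steps)."""
    i = steps_mhz.index(current_mhz)
    if temp_c >= throttle_at_c and i > 0:
        return steps_mhz[i - 1]      # too hot: step the clock down
    if temp_c <= recover_at_c and i < len(steps_mhz) - 1:
        return steps_mhz[i + 1]      # thermal headroom: step the clock up
    return current_mhz               # inside the hysteresis band: hold


# Same die limit, two hypothetical monitors: +/-5 C and +/-1 C.
coarse = throttle_threshold_c(100.0, 5.0)   # 95.0
fine = throttle_threshold_c(100.0, 1.0)     # 99.0

# The same 97 C reading forces a throttle with the coarse monitor only.
print(next_frequency_mhz(97.0, 2000, coarse, coarse - 10.0))  # 1600
print(next_frequency_mhz(97.0, 2000, fine, fine - 10.0))      # 2000
```

In this sketch the more accurate monitor lets the chip keep running at full speed four degrees closer to the limit, which is the headroom the optimization scheme can then spend on performance or power.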
SE: What about near-threshold computing? Can this be adapted to an individual use-case type of approach?
Crosher: If you can actively measure the supply levels and bring that down in a controlled way to a point that’s low enough on the supply where the logic is still working, you can choose an optimized solution for that particular chip. With the monitors, you can optimize each chip as it’s being manufactured, rather than the coarse approach we’ve been using in the industry.
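The per-chip approach described here can be sketched roughly as follows. This is an editorial illustration under stated assumptions, not a description of any vendor's flow: `passes_at` is a hypothetical stand-in for whatever functional check confirms the logic still works at a given supply, and the millivolt values, step size and safety margin are invented for the example.

```python
def lowest_passing_supply_mv(passes_at, start_mv=900, floor_mv=500,
                             step_mv=10, margin_mv=20):
    """Step the supply down until the logic check fails, then add a margin.

    passes_at(mv) is a hypothetical callback that returns True when the
    chip's logic still works at the measured supply level mv.
    """
    if not passes_at(start_mv):
        raise ValueError("chip fails at the nominal supply")
    mv = start_mv
    # Keep lowering while the next step down still passes.
    while mv - step_mv >= floor_mv and passes_at(mv - step_mv):
        mv -= step_mv
    # Back off by a safety margin, never above the nominal supply.
    return min(mv + margin_mv, start_mv)


# Two hypothetical chips from the same design with different process corners.
fast_chip = lambda mv: mv >= 640
slow_chip = lambda mv: mv >= 710

print(lowest_passing_supply_mv(fast_chip))  # 660
print(lowest_passing_supply_mv(slow_chip))  # 730
```

A coarse, one-size-fits-all setting would have to run both chips at the slow chip's supply or higher. Measuring each part lets the fast chip run 70 mV lower in this example.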
SE: There are far fewer companies pushing to 7nm and beyond. How does that affect all of this?
Crosher: It certainly narrows the customer base. But those customers have multiple business units within them, so there are still opportunities. And then there are the legacy nodes. The crowds are descending onto those nodes, and that customer base is increasing.
SE: And they need more knobs to turn than they did in the past, right?
Crosher: Yes. Unless it’s an ultra-low-power application, it needs monitoring. Even for IoT applications, we’re seeing companies optimize the supply to minimize the die temperature.
SE: Isn’t this part of the shift to more computing at the edge? The idea that we will have a world filled with dumb sensors sending all of their data to the cloud doesn’t work for a variety of reasons, including cost, bandwidth and latency.
Crosher: Yes, and as long as it’s supported by low-power schemes and more innovation around these edge devices, that can be a viable solution. We’re part of that picture. There are people doing thermal analysis tools. Anyone developing these kinds of devices needs to have these sorts of solutions available to them. If there is extra processing, you need to minimize these effects to make it viable.
SE: What’s changing on the chip development side? What are you hearing back from customers in terms of what they’re looking for?
Crosher: We’re seeing an increasing trend toward optimization solutions, and slightly less for reliability. Reliability is still very important for part of the customer base, particularly automotive customers. When we started out, the big concerns were things like thermal shutdown. We’re also seeing a trend toward finer-grained optimization schemes. Human beings are very good at extracting information and making use of that information once they place monitors, sensors and gauges. They use the information to iteratively make products better and better. We see examples out there in technology. With monitoring and analytics in automotive and aerospace, the more information that comes back the more you’re able to improve your next designs.
SE: What’s happening in automotive? Reliability seems to be the big concern.
Crosher: Some of the leading-edge automotive manufacturers really stirred up the industry, driving the other manufacturers to bring a high level of electronics sophistication into their cars. So we saw a step change in the requirements for our IP. We had to work out what problems they were facing in putting server-grade systems into cars. It became like data centers, but with even higher reliability considerations.
SE: The servers we have today are reliable to a point. But you don’t expect servers to function at 150° C with strong mechanical vibration and shock for 15 years, right?
Crosher: Yes, and managing the devices for that lifetime adds a unique set of challenges. You’ve got cutting-edge finFET-node technology going into these devices. You need control over electromigration and thermal issues. Automotive stress issues need to be looked at carefully to achieve those product lifetimes. With automotive, you also need monitors that can self-check the health status. If you have a failing braking system on your car, you need a monitor or sensor to tell you that. If you look back on what the monitors are being used for, it’s things like thermal shutdown. We’re in that data path of controlling the clock frequencies and the power going into the chip. You can stress devices if there’s a failing sensor on a chip. That’s why we build in self-checking and health safety.
SE: There’s been a shift toward resilience and the ability to recover. How does that enter into the picture?
Crosher: A lot of our designs, the core sensor designs, don’t use minimal feature widths, so they’re fairly robust in design. We’re careful about current loading throughout the circuit. We have fairly robust circuits just by the nature of them being analog design. But for the digital parts of our subsystems, it comes down to the technologies being offered by the foundries and making sure the logic being implemented is the best. We can see switching out redundancies becoming more important for areas such as automotive.
SE: How much overhead is there in circuit monitoring?
Crosher: You need to consider that in two ways—in terms of the burden on the CPUs and in sizing. The monitoring subsystems require little of the CPU. It’s gathering data, which can then be taken by the CPU. In terms of sizing the IP, it’s a fine-tuned balance between a product that is accurate enough, but also small enough in size. That’s always a tradeoff in the analog world.
SE: Is it essential to do full-time monitoring, or can it be sporadic?
Crosher: You can duty-cycle the monitoring, and that’s particularly useful for low-power applications such as mobile communications and IoT. In those sorts of environments, changes in temperature may be less drastic, versus a data center where you’re working at the ragged edge. In telecommunications, if the die temperature is two or three degrees higher than they expect, that’s a big deal for them because that means their 10-year lifetime won’t be reached.
SE: What else can be done with monitoring?
Crosher: If you’re able to monitor all of this, you can start to predict failure.
SE: Is that a function of how much or how good your data is?
Crosher: Yes, and it goes back to reliability and self-checking of the monitors and accuracy. Your failure prediction is only going to be as good as the accuracy and robustness of the monitors. If you’re going to be running a regimen with equipment in a data center based upon the statistical lifetime of products, that’s very expensive. You want to swap out that equipment at the right time, when it is at the right stage in its lifecycle. What you’re working out there is the cost of the failure versus the cost of swapping out equipment. We’re seeing more of those sorts of calculations.
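The calculation mentioned here can be sketched as a simple expected-cost comparison. This is an editorial illustration only: the function names are hypothetical, the dollar figures and failure probabilities are invented, and a real regimen would use a richer reliability model than a single probability.

```python
def should_swap(p_fail, cost_failure, cost_swap):
    """Swap when the expected cost of an in-service failure exceeds the swap cost.

    p_fail is the monitor-derived probability that the unit fails before the
    next maintenance window (a hypothetical input for this sketch).
    """
    return p_fail * cost_failure > cost_swap


def break_even_probability(cost_failure, cost_swap):
    """Failure probability at which swapping and waiting cost the same."""
    return cost_swap / cost_failure


# Invented figures: a failure costs 50,000 in downtime, a planned swap 2,000.
print(should_swap(0.02, 50_000, 2_000))  # False: expected loss 1,000 < 2,000
print(should_swap(0.10, 50_000, 2_000))  # True: expected loss 5,000 > 2,000
```

With these numbers the break-even point is a 4% failure probability, so the usefulness of the decision depends directly on how well the monitors can distinguish a 2% unit from a 10% one.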
SE: So it’s a tradeoff of why spend money if your computers are running, versus what happens if there is downtime?
Crosher: If you’re monitoring, you’re able to look across an entire product range. It could be tens of millions of instances of your product, for example, and you can start to see where products are failing. That feeds back into swap-out regimens for the lifetime of your product.
SE: As semiconductor content increases in areas such as cars, do you find that monitoring technology is moving to the system level rather than just the chip?
Crosher: We’re very much at the chip level today, but the opportunity for analytics is very interesting. You have a sea of data, and you’re trying to work out what that means. If you look at jet engines, there’s a lot of vibration and you can see early on if there’s something wrong. It’s the same with other technology. You can see signatures of behavior, where it indicates issues are cropping up that need to be resolved.
SE: The danger is that sometimes you push this too far for business reasons, right?
Crosher: It’s about whether you’re reactionary to these issues, or whether your business model compels you to be proactive. It depends on the product type. But what’s clear is there’s a value to that data. The assessments made today on the ownership and value of that data will raise questions going forward. Data that was collected 5 or 10 years ago is still there, somewhere, in a data center, and it still has value.
SE: From the standpoint of data, does it matter if it’s a particular type of data versus another type of data? For example, video data will be different than vibration data or voice data.
Crosher: We’re focused on the physical device level of the chip and how they’re being manufactured. But we’re operating at a low level, so it’s not data that can be recognized by people.
SE: Does that make it useful for security reasons?
Crosher: Yes, some of the techniques for heating up devices or adding glitches onto supplies can be picked up by the monitors, and from there you can determine if it’s been hacked. These are signatures of malicious behavior. We’re providing that information to the software and the system above us. How that information is used is up to them. But step changes in temperatures or supplies would be noticed by the monitors. You’re talking tens of millicelsius being detected over a fairly rapid sampling time of tens of microseconds. If you’ve got something that’s working well at room temperature, and suddenly it hits minus 40, there’s either a fault or some malicious behavior.
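One simple way the software above the monitors might flag such step changes is to compare successive samples against a maximum plausible change. The sketch below is an editorial illustration, not the method used by any particular product: the function name, the 0.5° C threshold and the sample values are assumptions made for the example.

```python
def find_step_changes(samples_c, max_step_c=0.5):
    """Return indices where the reading jumps by more than max_step_c.

    samples_c is a sequence of temperature readings taken at a fixed,
    fast sampling interval; max_step_c is a hypothetical limit on how far
    a healthy die can move between two consecutive samples.
    """
    return [
        i for i in range(1, len(samples_c))
        if abs(samples_c[i] - samples_c[i - 1]) > max_step_c
    ]


# Normal drift of tens of millidegrees, then a sudden drop to about minus 40.
readings = [25.00, 25.02, 25.01, -40.00, -39.98]
print(find_step_changes(readings))  # [3]
```

The detector only reports where the jump happened. Deciding whether it was a fault or an attack is left to the system above, as the interview describes.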
You can also read the interview on the Semiconductor Engineering website: https://semiengineering.com/toward-on-chip-monitoring/