A space expert warns NASA's safety culture may be eroding again
Russia's "Nauka" Multipurpose Laboratory Module is pictured shortly after docking to the Zvezda service module's Earth-facing port on the International Space Station, with the Brazilian coast 263 miles below. In the foreground is the Soyuz MS-18 crew ship docked to the Rassvet module on 29 July 2021.
This is a guest post. The views expressed here are solely those of the author and do not represent positions of IEEE Spectrum or the IEEE.
In a major International Space Station milestone more than fifteen years in the making, a long-delayed Russian science laboratory named Nauka automatically docked to the station on 29 July, prompting sighs of relief in the Mission Control Centers in Houston and Moscow. But within a few hours, it became shockingly obvious that the celebrations were premature, and that the ISS was coming closer to disaster than at any time in its nearly 25 years in orbit.
While the proximate cause of the incident is still being unravelled, there are worrisome signs that NASA may be repeating some of the lapses that led to the loss of the Challenger and Columbia space shuttles and their crews. And because political pressures seem to be driving much of the problem, only an independent investigation with serious political heft can reverse any erosion in safety culture.
Let's step back and look at what we know happened: In a cyber-logical process still not entirely clear, while passing northwest to southeast over Indonesia, the Nauka module's autopilot apparently decided it was supposed to fly away from the station. Although actually attached, and with the latches on the station side closed, the module began trying to line itself up in preparation to fire its main engines using an attitude adjustment thruster. As the thruster fired, the entire station was slowly dragged askew as well.
Since the ISS was well beyond the coverage of Russian ground stations, and since the world-wide Soviet-era fleet of tracking ships and world-circling network of "Luch" relay comsats had long since been scrapped, and replacements were slow in coming, nobody even knew Nauka was firing its thruster, until a slight but growing shift in the ISS's orientation was finally detected by NASA.
Nauka approaches the space station, preparing to dock on 29 July 2021. NASA
Within minutes, the Flight Director in Houston declared a "spacecraft emergency," the first in the station's lifetime, and his team tried to figure out what could be done to avoid the ISS spinning up so fast that structural damage could result. The football-field-sized array of pressurized modules, support girders, solar arrays, radiator panels, robotic arms, and other mechanisms was designed to operate in a weightless environment. But it was also built to handle stresses both from directional thrusting (used to boost the altitude periodically) and rotational torques (usually to maintain a horizon-level orientation, or to turn to a specific different orientation to facilitate arrival or departure of visiting vehicles). The juncture latches that held the ISS's modules together had been sized to accommodate these forces with a comfortable safety margin, but a maneuver of this scale had never been expected.
Meanwhile, the station's automated attitude control system had also noted the deviation and began firing other thrusters to countermand it. These too were on the Russian half of the station. The only US orientation-control system is a set of spinning flywheels that gently turn the structure without the need for thruster propellant, but which would have been unable to cope with the unrelenting push of Nauka's thruster. Later mass-media scenarios depicted teams of specialists manually directing on-board systems into action, but the exact actions taken in response still remain unclear—and probably were mostly if not entirely automatic. The drama continued as the station crossed the Pacific, then South America and the mid-Atlantic, finally entering Russian radio contact over central Europe an hour after the crisis had begun. By then the thrusting had stopped, probably when the guilty thruster exhausted its fuel supply. The sane half of the Russian segment then restored the desired station orientation.
Initial private attempts to use telemetry data to visually represent the station's tumble that were posted online looked bizarre, with enormous rapid gyrations in different directions. Mercifully, the truth of the situation is that the ISS went through a simple long-axis spin of one and a half full turns, and then a half turn back to the starting alignment. The jumps and zig-zags were computational artifacts of the representational schemes used by NASA, which relate to the concept of "gimbal lock" in gyroscopes.
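The kind of artifact described above can be reproduced with a toy example. The following is a minimal sketch, not NASA's actual representation scheme or telemetry: it assumes a common yaw-pitch-roll (Z-Y-X) Euler-angle convention, and the function names `rot_y` and `yaw_pitch_roll` are hypothetical. A body turning steadily about a single axis has perfectly smooth motion, yet the reported yaw and roll jump by 180 degrees as the turn passes 90 degrees, the gimbal-lock orientation for this convention.

```python
import math

def rot_y(theta_deg):
    """Rotation matrix for a turn of theta_deg about the y (pitch) axis."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def yaw_pitch_roll(R):
    """Extract Z-Y-X Euler angles (degrees) from a rotation matrix."""
    yaw = math.degrees(math.atan2(R[1][0], R[0][0]))
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    pitch = math.degrees(math.asin(max(-1.0, min(1.0, -R[2][0]))))
    roll = math.degrees(math.atan2(R[2][1], R[2][2]))
    return yaw, pitch, roll

# A smooth, single-axis turn sampled on either side of 90 degrees.
for theta in (80, 89, 91, 100):
    angles = [round(a, 1) for a in yaw_pitch_roll(rot_y(theta))]
    print(f"true turn {theta:>3} deg -> yaw, pitch, roll = {angles}")
```

Between the 89-degree and 91-degree samples, the physical orientation changes by only 2 degrees, but the plotted yaw and roll leap from 0 to 180. A visualization that draws each Euler angle independently would show exactly the sort of wild gyration that never happened.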
How close the station had come to disaster is an open question, and the flight director humorously alluded to it in a later tweet, saying he'd never been so happy as when he saw on external TV cameras that the solar arrays and radiators were still standing straight in place. And any excessive bending stress along docking interfaces between the Russian and American segments would have demanded quick leak checks. But even if the rotation was "simple," the undeniably dramatic event has both short-term and long-term significance for the future of the space station. And it has antecedents dating back to the very birth of the ISS in 1998.
At this point, unfortunately, the human misjudgments began to surface. To calm things down, official NASA spokesmen provided very preliminary underestimates of how big and how fast the station's spin had been. These were presented without any caveat that the numbers were unverified, and the real figures turned out to be much worse. The Russian side, for its part, dismissed the attitude deviation as a routine bump in a normal process of automatic docking and proclaimed there would be no formal incident investigation, especially any that would involve their American partners. Indeed, both sides seemed to agree that the sooner the incident was forgotten, the better. As of now, the US side is deep into analysis of induced stresses on critical ISS structures, with the most important ones, such as the solar arrays, first. Another standard procedure after this kind of event is to assess potential indicators of stress-induced damage, especially in terms of air leaks, and where best to monitor cabin pressure and other parameters to detect any such leaks.
The bureaucratic instinct to minimize the described potential severity of the event needs cold-blooded assessment. Sadly, from past experience, this mindset of complacency and hoping for the best is the result of natural human mental drift that comes when there are long periods of apparent normalcy. Even if there is a slowly emerging problem, as long as everything looks okay in the day to day, the tendency is to ignore warning signals as minor perturbations. The safety of the system is assumed rather than verified, and consequently managers are led into missing clues, or making careless choices, that lead to disaster. So these recent indications of this mental attitude about the station's attitude are worrisome. The NASA team has experienced that same slow cultural rot of assuming safety several times over the past decades, with hideous consequences. Team members in the year leading up to the 1986 Challenger disaster (and I was deep within the Mission Control operations then) had noticed and begun voicing concerns over growing carelessness and even humorous reactions to occasional "stupid mistakes," without effect. Then, after imprudent management decisions, seven people died.
The same drift was noticed in the late 1990s, especially in the joint US/Russian operations on Mir and on early ISS flights. It led to the forced departure of a number of top NASA officials, who had objected to the trend that was being imposed by the White House's post-Cold War diplomatic goals, implemented by NASA Administrator Dan Goldin. Safety took a decidedly secondary priority to international diplomatic value. Legendary Mission Control leader Gene Kranz described the decisions that were made in the mid-1990s over his own objections, objections that led to his sudden departure from NASA. "Russia was subsequently assigned partnership responsibilities for critical in-line tasks with minimal concern for the political and technical difficulties as well as the cost and schedule risks," he wrote in 1999. "This was the first time in the history of US manned space flight that NASA assigned critical path, in-line tasks with little or no backup." By 2001-'02, the results were as Kranz and his colleagues had warned. "Today's problems with the space station are the product of a program driven by an overriding political objective and developed by an ad hoc committee, which bypassed NASA's proven management and engineering teams," he concluded.
By then the warped NASA management culture that soon enabled the Columbia disaster in 2003 was fully in place. Some of the wording in current management proclamations regarding the Nauka docking has an eerie ring of familiarity. "Space cooperation continues to be a hallmark of U.S.-Russian relations and I have no doubt that our joint work reinforces the ties that have bound our collaborative efforts over the many years," wrote NASA Administrator Bill Nelson to Dmitry Rogozin, head of the Russian space agency, on July 31. There was no mention of the ISS's first declared spacecraft emergency, nor any dissatisfaction with the Russian contribution to it.
To reverse the apparent new cultural drift, and thus potentially forestall the same kind of dismal results as before, NASA headquarters or some even higher office is going to have to intervene. The causes of the Nauka-induced "space sumo match" of massive cross-pushing bodies need to be determined and verified. And somebody needs to expose the decision process that allowed NASA to approve the ISS docking of a powerful thruster-equipped module without the on-site real-time capability to quickly disarm that system in an emergency. Because the apparent sloppiness of NASA's safety oversight on visiting vehicles looks to be directly associated with maintaining good relations with Moscow, the driving factor seems to be White House diplomatic goals—and that's the level where a corrective impetus must originate. With a long-time U.S. Senate colleague, Nelson, recently named head of NASA, President Biden is well connected to issue such guidance for a thorough investigation by an independent commission, followed by implementation of needed reforms. The buck stops with him.
As for Nauka's role in this process of safety-culture repair, it turns out that, by bizarre coincidence, a similar pattern was played out by the very first Russian launch that inaugurated the ISS program, the 'Zarya' module [called the 'FGB'], in late 1998. Nauka turns out to be the repeatedly rebuilt and upgraded backup module for that very launch, and the parallels are remarkable. The day the FGB was launched, on 20 November 1998, the mission faced disaster when it refused to accept ground commands to raise its original atmosphere-skimming parking orbit. As it crossed over Russian ground sites, controllers in Moscow sent commands, and the spacecraft didn't answer. Meanwhile, NASA guests at a nearby facility were celebrating with Russian colleagues as nobody told them of the crisis. Finally, on the last available in-range pass, controllers tried a new command format that the onboard computer did recognize and acknowledge. The mission, and with it the entire ISS project, was saved, and the American side never knew. Only years later did the story appear in Russian newspapers.
Still, for all its messy difficulties and frustrating disappointments, the U.S./Russian partnership turned out to be a remarkably robust "mutual co-dependence" arrangement, when managed with "tough love." Neither side really had practical alternatives if it wanted a permanent human presence in space, and they still don't—so both teams were devoted to making it work. And it could still work—if NASA keeps faith with its traditional safety culture and with the lives of those astronauts who died in the past because NASA had failed them.
Postscript: As this story was going to press, a NASA spokesperson responded to queries about the incident saying:
James Oberg is a retired "rocket scientist" in Texas, after a 20+ year career in NASA Mission Control and subsequently an on-air space consultant for ABC News and then NBC News. The author of a dozen books and hundreds of magazine articles on the past, present, and potential future of space exploration, he has reported from space launch and operations centers across the United States and Russia and North Korea.
Oberg, in his usual thorough and analytical manner, has sounded the alarm on another round of potential complacency on the part of those entrusted with the lives of our astronauts- on all sides. We MUST conduct a complete investigation into the incident to avoid any repetition of the Challenger/Columbia disasters. Before we even think about a return to the Moon and eventually Mars, the attitude and work ethic to achieve total success must be all pervasive.
And why do the Russians have the **only** ability to communicate with the module, if they only can connect to it when it's overhead? Why were they unable to be routed through a NASA antenna somewhere else on Earth, or even on the ISS? I realize that it's more complex than downloading a file from MSDN, but if that capability doesn't exist today it damn well better before they try something like this again.
Which brings up the question: Can the automatically docking SpaceX modules be overridden?
As a long-time product safety engineer I know how difficult it can be to convince others (R&D engineers, marketeers, executive management) that safety and hazard mitigation are essential. Sometimes the phrase "A dead customer is not a repeat customer" works. However, a common mindset seems to be "So far, so good" (as the man said while falling off the Empire State Building).
Do we really need a disaster before we take action and implement appropriate safety processes - and even more important ensure all personnel really understand and internalize those processes? Perhaps we can point to all the disasters that did NOT occur to keep that safety culture alive, although proving a negative is problematic. An excellent and timely article, but now (not after the disaster) is the time for action!
When transistors can’t get any smaller, the only direction is up
Perhaps the most far-reaching technological achievement over the last 50 years has been the steady march toward ever smaller transistors, fitting them more tightly together, and reducing their power consumption. And yet, ever since the two of us started our careers at Intel more than 20 years ago, we’ve been hearing the alarms that the descent into the infinitesimal was about to end. Yet year after year, brilliant new innovations continue to propel the semiconductor industry further.
Along this journey, we engineers had to change the transistor’s architecture as we continued to scale down area and power consumption while boosting performance. The “planar” transistor designs that took us through the last half of the 20th century gave way to 3D fin-shaped devices by the first half of the 2010s. Now, these too have an end date in sight, with a new gate-all-around (GAA) structure rolling into production soon. But we have to look even further ahead because our ability to scale down even this new transistor architecture, which we call RibbonFET, has its limits.
So where will we turn for future scaling? We will continue to look to the third dimension. We’ve created experimental devices that stack atop each other, delivering logic that is 30 to 50 percent smaller. Crucially, the top and bottom devices are of the two complementary types, NMOS and PMOS, that are the foundation of all the logic circuits of the last several decades. We believe this 3D-stacked complementary metal-oxide semiconductor (CMOS), or CFET (complementary field-effect transistor), will be the key to extending Moore’s Law into the next decade.
Continuous innovation is an essential underpinning of Moore’s Law, but each improvement comes with trade-offs. To understand these trade-offs and how they’re leading us inevitably toward 3D-stacked CMOS, you need a bit of background on transistor operation.
Every metal-oxide-semiconductor field-effect transistor, or MOSFET, has the same set of basic parts: the gate stack, the channel region, the source, and the drain. The source and drain are chemically doped to make them both either rich in mobile electrons (n-type) or deficient in them (p-type). The channel region has the opposite doping to the source and drain.
In the planar version in use in advanced microprocessors up to 2011, the MOSFET’s gate stack is situated just above the channel region and is designed to project an electric field into the channel region. Applying a large enough voltage to the gate (relative to the source) creates a layer of mobile charge carriers in the channel region that allows current to flow between the source and drain.
As we scaled down the classic planar transistors, what device physicists call short-channel effects took center stage. Basically, the distance between the source and drain became so small that current would leak across the channel when it wasn’t supposed to, because the gate electrode struggled to deplete the channel of charge carriers. To address this, the industry moved to an entirely different transistor architecture called a FinFET. It wrapped the gate around the channel on three sides to provide better electrostatic control.
The shift from a planar transistor architecture [left] to the FinFET [right] provided greater control of the channel [covered by blue box], resulting in a reduction in power consumption of 50 percent and an increase in performance of 37 percent.
Intel introduced its FinFETs in 2011, at the 22-nanometer node, with the third-generation Core processor, and the device architecture has been the workhorse of Moore’s Law ever since. With FinFETs, we could operate at a lower voltage and still have less leakage, reducing power consumption by some 50 percent at the same performance level as the previous-generation planar architecture. FinFETs also switched faster, boosting performance by 37 percent. And because conduction occurs on both vertical sides of the “fin,” the device can drive more current through a given area of silicon than can a planar device, which only conducts along one surface.
However, we did lose something in moving to FinFETs. In planar devices, the width of a transistor was defined by lithography, and therefore it is a highly flexible parameter. But in FinFETs, the transistor width comes in discrete increments, adding one fin at a time, a characteristic often referred to as fin quantization. As flexible as the FinFET may be, fin quantization remains a significant design constraint. The design rules around it and the desire to add more fins to boost performance increase the overall area of logic cells and complicate the stack of interconnects that turn individual transistors into complete logic circuits. Adding fins also increases the transistor’s capacitance, thereby sapping some of its switching speed. So, while the FinFET has served us well as the industry’s workhorse, a new, more refined approach is needed. And it’s that approach that led us to the 3D transistors we’re introducing soon.
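Fin quantization is easy to see with a little arithmetic. The sketch below uses the standard textbook approximation that each fin contributes a conducting width of twice its height plus its top width, because the gate wraps three sides. The fin dimensions and the target width are purely illustrative assumptions, not figures from any real process, and the function names are hypothetical.

```python
import math

def finfet_effective_width(n_fins, fin_height_nm, fin_width_nm):
    """Approximate conducting width: two sidewalls plus the top of each fin."""
    return n_fins * (2 * fin_height_nm + fin_width_nm)

def fins_needed(target_width_nm, fin_height_nm, fin_width_nm):
    """Smallest whole number of fins that meets or exceeds the target width."""
    per_fin = finfet_effective_width(1, fin_height_nm, fin_width_nm)
    return math.ceil(target_width_nm / per_fin)

# Illustrative (assumed) numbers: 50 nm tall, 7 nm wide fins, 250 nm target.
FIN_HEIGHT_NM, FIN_WIDTH_NM, TARGET_NM = 50, 7, 250

n = fins_needed(TARGET_NM, FIN_HEIGHT_NM, FIN_WIDTH_NM)
actual = finfet_effective_width(n, FIN_HEIGHT_NM, FIN_WIDTH_NM)
print(f"{n} fins give {actual} nm of width for a {TARGET_NM} nm target "
      f"({actual - TARGET_NM} nm overshoot)")
```

With these assumed numbers, a designer who wants 250 nm of width must choose between two fins (214 nm, too little) and three fins (321 nm, too much). A planar device could simply have been drawn 250 nm wide.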
In the RibbonFET, the gate wraps around the transistor channel region to enhance control of charge carriers. The new structure also enables better performance and more refined optimization. Emily Cooper
This advance, the RibbonFET, is our first new transistor architecture since the FinFET’s debut 11 years ago. In it, the gate fully surrounds the channel, providing even tighter control of charge carriers within channels that are now formed by nanometer-scale ribbons of silicon. With these nanoribbons (also called nanosheets), we can again vary the width of a transistor as needed using lithography.
With the quantization constraint removed, we can produce the appropriately sized width for the application. That lets us balance power, performance, and cost. What’s more, with the ribbons stacked and operating in parallel, the device can drive more current, boosting performance without increasing the area of the device.
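A rough way to picture both benefits is to treat each ribbon as contributing a conducting width equal to its perimeter, since the gate surrounds it on all sides. This is a simplified sketch under that assumption; the ribbon dimensions are illustrative only, and `ribbon_effective_width` is a hypothetical helper.

```python
def ribbon_effective_width(n_ribbons, ribbon_width_nm, ribbon_thickness_nm):
    """Approximate conducting width: full perimeter of each stacked ribbon."""
    return n_ribbons * 2 * (ribbon_width_nm + ribbon_thickness_nm)

RIBBON_THICKNESS_NM = 5  # assumed value, for illustration only

# Ribbon width is set by lithography, so any value in range is available.
for width_nm in (20, 25, 30):
    w_eff = ribbon_effective_width(3, width_nm, RIBBON_THICKNESS_NM)
    print(f"3 ribbons, {width_nm} nm wide: effective width {w_eff} nm")

# Stacking more ribbons adds drive width while the footprint stays 30 nm wide.
for n_ribbons in (1, 2, 3, 4):
    w_eff = ribbon_effective_width(n_ribbons, 30, RIBBON_THICKNESS_NM)
    print(f"{n_ribbons} ribbon(s), 30 nm wide: effective width {w_eff} nm")
```

The first loop shows the width varying continuously with the drawn ribbon width. The second shows the conducting width growing with the number of stacked ribbons even though the silicon area seen from above does not change.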
We see RibbonFETs as the best option for higher performance at reasonable power, and we will be introducing them in 2024 along with other innovations, such as PowerVia, our version of backside power delivery, with the Intel 20A fabrication process.
One commonality of planar, FinFET, and RibbonFET transistors is that they all use CMOS technology, which, as mentioned, consists of n-type (NMOS) and p-type (PMOS) transistors. CMOS logic became mainstream in the 1980s because it draws significantly less current than do the alternative technologies, notably NMOS-only circuits. Less current also led to greater operating frequencies and higher transistor densities.
To date, all CMOS technologies place the standard NMOS and PMOS transistor pair side by side. But in a keynote at the IEEE International Electron Devices Meeting (IEDM) in 2019, we introduced the concept of a 3D-stacked transistor that places the NMOS transistor on top of the PMOS transistor. The following year, at IEDM 2020, we presented the design for the first logic circuit using this 3D technique, an inverter. Combined with appropriate interconnects, the 3D-stacked CMOS approach effectively cuts the inverter footprint in half, doubling the area density and further pushing the limits of Moore’s Law.
3D-stacked CMOS puts an NMOS device on top of a PMOS device in the same footprint a single RibbonFET would occupy. The NMOS and PMOS gates use different metals. Emily Cooper
Taking advantage of the potential benefits of 3D stacking means solving a number of process integration challenges, some of which will stretch the limits of CMOS fabrication.
We built the 3D-stacked CMOS inverter using what is known as a self-aligned process, in which both transistors are constructed in one manufacturing step. This means constructing both n-type and p-type sources and drains by epitaxy—crystal deposition—and adding different metal gates for the two transistors. By combining the source-drain and dual-metal-gate processes, we are able to create different conductive types of silicon nanoribbons (p-type and n-type) to make up the stacked CMOS transistor pairs. It also allows us to adjust the device’s threshold voltage—the voltage at which a transistor begins to switch—separately for the top and bottom nanoribbons.
In CMOS logic, NMOS and PMOS devices usually sit side by side on chips. An early prototype has NMOS devices stacked on top of PMOS devices, compressing circuit sizes. Intel
How do we do all that? The self-aligned 3D CMOS fabrication begins with a silicon wafer. On this wafer, we deposit repeating layers of silicon and silicon germanium, a structure called a superlattice. We then use lithographic patterning to cut away parts of the superlattice and leave a finlike structure. The superlattice crystal provides a strong support structure for what comes later.
Next, we deposit a block of “dummy” polycrystalline silicon atop the part of the superlattice where the device gates will go, protecting them from the next step in the procedure. That step, called the vertically stacked dual source/drain process, grows phosphorus-doped silicon on both ends of the top nanoribbons (the future NMOS device) while also selectively growing boron-doped silicon germanium on the bottom nanoribbons (the future PMOS device). After this, we deposit dielectric around the sources and drains to electrically isolate them from one another. The latter step requires that we then polish the wafer down to perfect flatness.
An edge-on view of the 3D stacked inverter shows how complicated its connections are. Emily Cooper
By stacking NMOS on top of PMOS transistors, 3D stacking effectively doubles CMOS transistor density per square millimeter, though the real density depends on the complexity of the logic cell involved. The inverter cells are shown from above indicating source and drain interconnects [red], gate interconnects [blue], and vertical connections [green].
Finally, we construct the gate. First, we remove that dummy gate we’d put in place earlier, exposing the silicon nanoribbons. We next etch away only the silicon germanium, releasing a stack of parallel silicon nanoribbons, which will be the channel regions of the transistors. We then coat the nanoribbons on all sides with a vanishingly thin layer of an insulator that has a high dielectric constant. The nanoribbon channels are so small and positioned in such a way that we can’t effectively dope them chemically as we would with a planar transistor. Instead, we use a property of the metal gates called the work function to impart the same effect. We surround the bottom nanoribbons with one metal to make a p-doped channel and the top ones with another to form an n-doped channel. Thus, the gate stacks are finished off and the two transistors are complete.
The process might seem complex, but it’s better than the alternative—a technology called sequential 3D-stacked CMOS. With that method, the NMOS devices and the PMOS devices are built on separate wafers, the two are bonded, and the PMOS layer is transferred to the NMOS wafer. In comparison, the self-aligned 3D process takes fewer manufacturing steps and keeps a tighter rein on manufacturing cost, something we demonstrated in research and reported at IEDM 2019.
Importantly, the self-aligned method also circumvents the problem of misalignment that can occur when bonding two wafers. Still, sequential 3D stacking is being explored to facilitate integration of silicon with nonsilicon channel materials, such as germanium and III-V semiconductor materials. These approaches and materials may become relevant as we look to tightly integrate optoelectronics and other functions on a single chip.
Making all the needed connections to 3D-stacked CMOS is a challenge. Power connections will need to be made from below the device stack. In this design, the NMOS device [top] and PMOS device [bottom] have separate source/drain contacts, but both devices have a gate in common. Emily Cooper
The new self-aligned CMOS process, and the 3D-stacked CMOS it creates, work well and appear to have substantial room for further miniaturization. At this early stage, that’s highly encouraging. Devices having a gate length of 75 nm demonstrated both the low leakage that comes with excellent device scalability and a high on-state current. Another promising sign: We’ve made wafers where the smallest distance between two sets of stacked devices is only 55 nm. While the device performance results we achieved are not records in and of themselves, they do compare well with individual nonstacked control devices built on the same wafer with the same processing.
In parallel with the process integration and experimental work, we have many theoretical, simulation, and design studies underway to provide insight into how best to use 3D CMOS. Through these, we’ve found some of the key considerations in the design of our transistors. Notably, we now know that we need to optimize the vertical spacing between the NMOS and PMOS: if it’s too short it will increase parasitic capacitance, and if it’s too long it will increase the resistance of the interconnects between the two devices. Either extreme results in slower circuits that consume more power.
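The shape of this trade-off can be illustrated with a deliberately crude model. The sketch below is not our simulation methodology. It assumes that the coupling capacitance between the two devices falls off as one over the spacing, that the resistance of the vertical interconnect grows linearly with the spacing, and that delay scales with the product of total resistance and total capacitance. All constants are arbitrary, unitless placeholders, and the function names are hypothetical.

```python
import math

def rc_delay(spacing_nm, r_fixed=1.0, r_per_nm=0.05,
             c_fixed=1.0, c_coupling=20.0):
    """Toy RC delay (arbitrary units) versus NMOS-PMOS vertical spacing."""
    resistance = r_fixed + r_per_nm * spacing_nm     # longer link, more R
    capacitance = c_fixed + c_coupling / spacing_nm  # closer devices, more C
    return resistance * capacitance

def best_spacing(r_fixed=1.0, r_per_nm=0.05, c_fixed=1.0, c_coupling=20.0):
    """Spacing that minimizes rc_delay, found by setting d(delay)/d(spacing)=0."""
    return math.sqrt(c_coupling * r_fixed / (r_per_nm * c_fixed))

for spacing in (5, 10, 20, 40, 80):
    print(f"spacing {spacing:>2} nm -> relative delay {rc_delay(spacing):.2f}")
print(f"toy-model optimum: {best_spacing():.1f} nm")
```

In this toy model the delay rises on both sides of a single optimum, which is the qualitative behavior described above. Where the real optimum lies depends on the actual geometry and materials, which this sketch does not attempt to capture.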
Many design studies, such as one by TEL Research Center America presented at IEDM 2021, focus on providing all the necessary interconnects in the 3D CMOS’s limited space and doing so without significantly increasing the area of the logic cells they make up. The TEL research showed that there are many opportunities for innovation in finding the best interconnect options. That research also highlights that 3D-stacked CMOS will need to have interconnects both above and below the devices. This scheme, called buried power rails, takes the interconnects that provide power to logic cells but don’t carry data and removes them to the silicon below the transistors. Intel’s PowerVia technology, which does just that and is scheduled for introduction in 2024, will therefore play a key role in making 3D-stacked CMOS a commercial reality.
With RibbonFETs and 3D CMOS, we have a clear path to extend Moore’s Law beyond 2024. In a 2005 interview in which he was asked to reflect on what became his law, Gordon Moore admitted to being “periodically amazed at how we’re able to make progress. Several times along the way, I thought we reached the end of the line, things taper off, and our creative engineers come up with ways around them.”
With the move to FinFETs, the ensuing optimizations, and now the development of RibbonFETs and eventually 3D-stacked CMOS, supported by the myriad packaging enhancements around them, we’d like to think Mr. Moore will be amazed yet again.