Mitchell Davis - Staff Engineer at NASA Goddard Space Flight Center
My father was a civil engineer, and my first exposure to electronics came in a high school electronics class, which I really enjoyed. This made my career choice of electrical engineering an easy one. Three months out of college, I started as the lead designer of a flight control system for a shuttle payload, which included an 8086 microprocessor, a power system and a large array of analog circuitry. Later I focused on building spaceflight scientific instruments from initial concepts. This involves moving an initial concept to a final flight design while staying within the cost and schedule constraints. I found trouble-shooting electrical system architecture shortfalls to be the most challenging and rewarding technical area.
For trouble-shooting, a current probe is a must. When a portion of the energy of an intended signal or power feed, typically a very small percentage, leaks out and unintentionally impacts another signal, we call this electromagnetic interference (EMI). EMI trouble-shooting involves finding the current path and quantifying this unintentional energy path. My favorite hardware tool is a high frequency current probe. I find trouble-shooting electronics problems with just voltage probes rather difficult; it is far easier and far more intuitive to follow the current (energy) paths through the system.
Most definitely SPICE circuit modeling. I don’t recall ever finding an electromagnetic coupling problem involving a pure sine wave; rather, virtually all EMC problems I have seen are repetitive transient events with unique signatures. I find it extremely useful to characterize the expected “waveform signature” induced or impressed on a “victim” signal from a hypothesized “source” through the possible coupling paths. The resulting waveform signature in a victim circuit from a magnetic-field dominant source appears vastly different from that of an electric-field dominant source (the two current paths are vastly different). The unique signatures in the wave shapes are key clues in understanding the system interactions. Of course, the absolute magnitudes the model predicts for the victim are typically not very accurate, due to the many assumptions that are required.
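As a rough illustration of why the two signatures differ, the sketch below uses the standard weak-coupling, electrically short line approximation for two parallel circuits. It is not a SPICE model and not the method used in any particular investigation. The function name and every component value (mutual inductance, mutual capacitance, edge rates, terminations) are hypothetical. In this approximation the magnetic (inductive) contribution appears with opposite polarity at the two ends of the victim circuit, while the electric (capacitive) contribution has the same polarity at both ends.

```python
def crosstalk_peaks(dv_dt, di_dt, l_m, c_m, r_near, r_far):
    """Return (near_end, far_end) victim voltages in volts during a source edge.

    Weak-coupling, electrically-short approximation:
      inductive term : L_m * dI/dt, split by the victim terminations,
                       opposite sign at the far end
      capacitive term: C_m * dV/dt into the parallel terminations,
                       same sign at both ends
    """
    r_sum = r_near + r_far
    inductive_near = (r_near / r_sum) * l_m * di_dt
    inductive_far = -(r_far / r_sum) * l_m * di_dt
    capacitive = (r_near * r_far / r_sum) * c_m * dv_dt
    return inductive_near + capacitive, inductive_far + capacitive


# Hypothetical source edge: 5 V and 0.1 A switching in 10 ns.
DV_DT = 5.0 / 10e-9   # V/s
DI_DT = 0.1 / 10e-9   # A/s

# Magnetic-field dominant case: mutual inductance only (hypothetical 10 nH).
magnetic = crosstalk_peaks(DV_DT, DI_DT, l_m=10e-9, c_m=0.0,
                           r_near=50.0, r_far=50.0)

# Electric-field dominant case: mutual capacitance only (hypothetical 1 pF).
electric = crosstalk_peaks(DV_DT, DI_DT, l_m=0.0, c_m=1e-12,
                           r_near=50.0, r_far=50.0)

print("magnetic dominant (near, far):", magnetic)
print("electric dominant (near, far):", electric)
```

Comparing the polarity at the two ends of a victim circuit is one way such a model can suggest which coupling path to chase, even when the predicted magnitudes are off.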
I would have to say the “bugs” that are classified as “cannot duplicate”. These are the “unexpected results” which caught the attention of the operator or user, but which the user cannot reliably re-create. Many times the unexpected results are not immediately characterized as a threat; rather, the concern is that the unexpected result may be a sign of a larger issue. Just weeks before the final Hubble Space Telescope Servicing Mission (SM4), Hubble’s Science Instrument Command and Data Handling subsystem failed (power dropped to zero and data flow stopped). This subsystem is the interface between the science instruments and the ground, hence a critical link in Hubble’s operations. The servicing mission was delayed while a flight spare was taken out of storage and prepared for flight. The first time the subsystem was powered on, its power profile was normal but it wouldn’t communicate. The operators wondered if all the cables were connected correctly, so they powered down to check. When they powered back on, they got the same lack of communication for the first couple of minutes; then data began to flow, and the problem never returned in any ground testing. We tried thermal cycling, multiple data rates and many other variations to re-create the problem over the next couple of months, but we “could not duplicate” it. We ended up installing this Command and Data Handling subsystem in the Hubble Space Telescope on SM4 and had a good couple of months of operation before the problem returned. It locked up a handful of times over a several-month period, then disappeared for a whole year, only to revisit once this fall. We suspected a minor circuit change (not in the original hardware version) in conjunction with a race condition. The computer’s wait state generator was changed to reduce the number of wait states from 8 clock cycles to 2 cycles. This change, in combination with a race condition where a clock cycle is occasionally lost, results in violating the minimum timing.
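The arithmetic behind that suspicion can be sketched as follows. Only the 8 and 2 wait-state figures come from the account above. The clock period, the minimum access time and the function names are hypothetical, and the access window is simplified to the wait states alone. The point is only that a lost cycle which was harmless with 8 wait states can consume the entire margin with 2.

```python
def access_window_ns(wait_states, clock_period_ns, lost_cycles=0):
    """Time available for the access, modeled as wait states minus lost cycles."""
    effective_cycles = max(wait_states - lost_cycles, 0)
    return effective_cycles * clock_period_ns


def meets_minimum(wait_states, clock_period_ns, min_access_ns, lost_cycles=0):
    """True if the access window still satisfies the minimum timing."""
    return access_window_ns(wait_states, clock_period_ns, lost_cycles) >= min_access_ns


CLOCK_PERIOD_NS = 100.0  # hypothetical clock period
MIN_ACCESS_NS = 150.0    # hypothetical minimum timing of the accessed device

for wait_states in (8, 2):
    for lost in (0, 1):
        ok = meets_minimum(wait_states, CLOCK_PERIOD_NS, MIN_ACCESS_NS, lost)
        window = access_window_ns(wait_states, CLOCK_PERIOD_NS, lost)
        print(f"wait states={wait_states} lost cycles={lost} "
              f"window={window:.0f} ns meets minimum={ok}")
```

With these assumed numbers, only the combination of 2 wait states and one lost cycle fails, which is consistent with a fault that appears rarely and resists duplication.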
When you have a “cannot duplicate”, it changes the trouble-shooting strategy from focusing on the observable characteristics to searching through each and every subsystem for failure modes and effects of the “usual suspects” (for example, electrical shorts between circuits or resistive open circuits in the interface cables). Although far from perfect, identifying where a problem is most likely not residing can help identify where it is residing. It is impossible to show that a failure mode does not exist; however, one can show that failures in a given functional area do not result in the un-duplicated fault characteristic. This was our strategy with the Toyota Unintentional Acceleration Study for the National Highway Traffic Safety Administration, since no Toyota vehicle was identified that could naturally and repeatedly reproduce large throttle opening unintentional acceleration effects.
Of all the books on my shelf, the one I regularly pull as a reference is “Introduction to Electromagnetic Compatibility” by Clayton Paul. By the way, this is where I first saw the approach of separating the magnetic field discussion from the electric field discussion. I also have numerous control systems books, and I keep copies of the past anomaly investigation reports in which I have participated. “Anomaly” is a NASA term used to encompass all problems encountered from development through on-orbit operations. I used to be surprised by the number of times I would see a variation of a past anomaly; now I just share the knowledge I have on hand.
When we build a spaceflight scientific instrument for the first time from a concept, we use an iterative approach which refines a point design by eliminating the undesired features. We use the phrase “make it work, make it safe and make it affordable”. The difficulty is trying to stay in the box created by these three variables. First, define what is required of the design, then create a point design for what is needed (not what you can!). Second, with that point design, check to see if you can build it within your allocated cost and schedule. Third, see how the product will fail. If it is not safe, then make an architectural change to improve safety. As you make decisions to further refine the point design, you must iterate back through the three and check: can I make the point design work better, safer or cheaper? This is an iterative process, constantly refining and improving a point design. Let me elaborate on the “make it safe”. “Safe” in this sense is more than human safety. Keep in mind that the products we create operate in space and cannot be serviced easily, so “safe” includes any failure modes which prevent the return of science data. At some NASA centers we have distinct definitions for the terms “robust” and “reliable”. Robust means that the product’s response is ‘predictable’ even under unexpected operational conditions. Reliable is used when referring to wear-out failures, which tend to be random in nature. More importantly, to be at a level of random failures implies that the product no longer contains ‘un-predictable’ failure modes (that is, the design is robust). From my experience of making and operating one-of-a-kind products, we rarely get to the point of knowledge on a given system where failures are considered truly random in nature.
I have worked on hundreds of flight systems over the last 30 years, but I have to say the Hubble Space Telescope, since it is unique in that its servicing missions allow replacement of undesirable features in the design. I was not involved in the original design; rather, I have been involved in numerous anomaly investigations of Hubble’s subsystem failures. It is always a proud feeling when you can determine how an electronic system has failed in space and recommend specific steps to ensure the most science return from this national asset. Oddly enough, two Hubble Space Telescope anomalies appear to be an unexpected result of an attempt to “improve the design”. It is a good reminder that every change has consequences, and the consequences are not always as desired.
I was once working on a large space flight instrument which had a custom ASIC designed to read out the signals for a large group of detectors. It was discovered late in the development that during the reset period, the large capacitance detectors were bleeding charge over to the smaller capacitance detectors and severely impacting performance. Thus the current-time profile was insufficient to reset the detectors to zero charge. The designers concluded the only mitigations available were to redesign the ASIC (a 12-month, $25 million impact) or, the preferred path, to tinker with the clock to increase the reset time (shorten some clock cycles, then lengthen the clock cycles during the reset period). Since neither option was desirable, I was asked to confirm there were no other mitigations. While reviewing the schematics, I noticed a current limiting resistor in the reset signal. I asked the designer why it was there, and he said he didn’t know exactly why, but it was a requirement from the detector folks. Once I tracked down the detector person, he said he didn’t want the designers damaging his detectors during the initial testing, so he asked them to limit the current, but he never intended this to become part of the design. Therefore, we had a potentially $20 million/12 month impact averted by removing a 5 cent resistor. The moral of the story is that expensive engineering problems can be avoided by communicating the right information to the correct team members. Unfortunately, one never knows where or when that one important nugget of information will be present. When assigned to a project, I make it a point to read the reports of other discipline engineering team members and attend their design reviews. Knowledge can save cost!
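To see why a series resistor in the reset path matters, the sketch below treats the reset as a simple RC discharge, in which the time to bleed a capacitance down to a given residual fraction is R × C × ln(1 / fraction). This first-order model is an assumption, not the instrument’s actual circuit, and every value in it (driver resistance, limiting resistor, detector capacitance, residual target) is hypothetical.

```python
import math


def reset_time_s(r_ohms, c_farads, residual_fraction):
    """Time for an RC discharge to fall to residual_fraction of its initial charge."""
    return r_ohms * c_farads * math.log(1.0 / residual_fraction)


DRIVER_R = 100.0       # hypothetical reset driver output resistance, ohms
LIMIT_R = 10_000.0     # hypothetical current limiting resistor, ohms
DETECTOR_C = 100e-12   # hypothetical large-detector capacitance, farads
RESIDUAL = 1e-3        # hypothetical target: 0.1 % of the initial charge remains

with_resistor = reset_time_s(DRIVER_R + LIMIT_R, DETECTOR_C, RESIDUAL)
without_resistor = reset_time_s(DRIVER_R, DETECTOR_C, RESIDUAL)

print(f"reset time with limiting resistor:    {with_resistor * 1e6:.3f} us")
print(f"reset time without limiting resistor: {without_resistor * 1e6:.3f} us")
print(f"speed-up from removing it: {with_resistor / without_resistor:.0f}x")
```

In this simplified picture the required reset time scales directly with the total series resistance, so removing the limiting resistor shortens the reset without touching the ASIC or the clock.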
I am currently leading an Anomaly Review Team investigating an anomalous signature from the Hubble Space Telescope’s gyroscopes. The gyroscopes are detecting the day/night orbit terminator, but that’s another story about variations in plating. This is in addition to providing grounding and avionics architectural guidance on various space instruments. After holding the position of NASA Technical Fellow for Avionics, with its associated responsibility and travel, it is a pleasure to step back to design efforts and to work in the laboratory with younger engineers.
Like all government agencies, NASA’s future is uncertain, especially with the end of the Shuttle era, so this is a difficult question to answer. I do see a shift toward standardized aerospace avionics interfaces and standard products. I believe this is a natural evolution of space application hardware, in that the “building blocks” are maturing to the point where they are no longer custom.
I have observed avionics systems becoming more and more complex over the years, with fewer and fewer people understanding the end-to-end workings of a system. The HST Science Instrument Command and Data Handling system was comprised of numerous electronic boxes on a roughly 3 foot by 3 foot tray and took over 6 months of testing to verify all functions. Today, those same functions can be implemented in a single Field Programmable Gate Array smaller than a match box, and typically several days to a week are allocated to verify that same functionality. The size is not the issue; rather, the reduced time allotted to verify the same level of complexity results in “un-predictable” system responses and a potential reduction in overall safety. We have evolved from robust systems with “obviously no safety concerns” to more complex systems with “no obvious safety concern” until one is discovered through an undesirable outcome.
Feeding back “Lessons Learned” to improve our products is another major challenge. On ICESat (Ice, Cloud and Elevation Satellite), shortly after launch, the laser’s operational temperature set point was lowered slightly. This resulted in immediate “noise” in a telemetry signal, followed by a catastrophic laser failure within 24 hours. After some painstaking work, I was able to show the “noise” was a nominal “design feature” at that new temperature. The actual cause was found to be excessive indium solder reacting with round gold wires and reducing their ability to carry current. In researching gold-indium issues, we found this issue had been discovered before on 2 NASA projects (roughly 7 years apart), on our nation’s nuclear warhead triggers and by Texas Instruments, dating back to the first transistors. Even when we know what information to convey, we have not figured out how to reliably get it into the proper place at the proper time!