Calculating System Reliability

methods · Sep 13, 2017

I am behind on follow-up posts and PMs.
Paying work picked up.

That said...

In a distributed battery system, especially one that is not hermetically (or even reasonably) sealed, system reliability is driven, to first order, by the number of balance tap interconnects running around.

Example:

Let's start with potted cell boxes, each with some number of cells in series, say 28.
For each "box" you have (# cells + 1) energized balance wires exiting, plus however many wires are used for temperature sensing.

The reliability of that cell box is calculated in a vacuum. Even if it came out to 1.0 (which it can't... but for argument's sake), the cell box has to be used at the system level, so it is the system-level reliability that rolls up into the overall vehicle reliability.

Let's take an arbitrary example of 4 cell boxes. Since each box in this example is 28S, say those 4 are wired in parallel to increase capacity.

Looking only at balance interconnects (to keep this first order), here is the count:

29 pin connections at each cell box, protected by an IPXX connector (call it 30 to keep it clean), so 120 pcs of 28 AWG crimped pins
Mating to that, the balance paralleling cable adds 120 more pcs of 28 AWG crimped pins
At that interface we have 120 spring loaded pin connections

So far, 240 crimps and 120 spring loaded connections
All energized with respect to each other... such that even the tiniest drip of water would result in electrolysis
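To put numbers on "energized with respect to each other": a box with N cells in series brings N + 1 balance taps out through its connector, and the voltage between any two taps is the sum of the cells between them. This is a minimal sketch that assumes a nominal 3.7 V per cell (a typical Li-ion figure, not one given in this thread) and ignores the temperature-sensing wires. Adjacent pins sit only one cell voltage apart, but the two end pins of a 28S box are roughly 100 V apart, and a drip bridging any pair sees that difference.

```python
NOMINAL_CELL_V = 3.7  # assumption: typical Li-ion nominal voltage, not from the post


def balance_taps(cells_in_series: int) -> int:
    """Balance tap wires leaving a box: one per cell junction plus both ends."""
    return cells_in_series + 1


def tap_voltage(i: int, j: int, cell_v: float = NOMINAL_CELL_V) -> float:
    """Voltage between tap i and tap j (taps numbered 0..N from box negative)."""
    return abs(j - i) * cell_v


if __name__ == "__main__":
    n = 28
    print("taps:", balance_taps(n))
    print("adjacent pins: %.1f V" % tap_voltage(0, 1))
    print("end to end: %.1f V" % tap_voltage(0, n))
```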

That paralleling cable can be built in many ways, but most likely each of the 30 positions will join its 4 branch wires (one from each box) to a single larger gauge wire with a barrel covered in heat shrink. The joint may be soldered or crimped; crimped is more likely.

Add 30 pcs barrel crimps
Each crimp joins 4 pcs of 28 AWG and 1 pc of... say... 18 AWG
4 coming in from one side, one going out the other
That's 5 x 30 failure points (any one wire pulling out, corroding, fatiguing, etc.). To first order, and erring on the generous side, this adds 150 failure points.

We are now at:
240 pin crimps
120 spring loaded contacts
150 multi-gauge barrel crimps

At the end of that paralleling cable we have the BMS.
The BMS is responsible for balancing and measuring cells.
There are at minimum 30 pin crimps for the cable + 30 soldered pins for the PCB (or more if pigtailed) + 30 spring loaded contacts. Leaving the PCB solder joints out of the count, we are now at:

270 pin crimps
150 spring loaded contacts
150 multi-gauge barrel crimps
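The running count can be reproduced with a few lines of arithmetic. This is a sketch that hard-codes the counting rules used so far (30 pins per box after rounding, 4 boxes, 4 wires in plus 1 out per barrel crimp, one BMS connector, PCB solder joints not counted); the function and key names are made up for illustration.

```python
def balance_wiring_points(pins_per_box: int = 30, boxes: int = 4) -> dict:
    """First-order tally of mechanical failure points in the paralleled balance harness."""
    box_side = pins_per_box * boxes
    counts = {
        # box connector crimps + mating cable crimps + BMS cable crimps
        "pin_crimps": 2 * box_side + pins_per_box,
        # spring contacts at each box connector + at the BMS connector
        "spring_contacts": box_side + pins_per_box,
        # each barrel joins `boxes` branch wires to one trunk wire
        "barrel_crimp_wires": pins_per_box * (boxes + 1),
    }
    counts["total"] = sum(counts.values())
    return counts


print(balance_wiring_points())
# {'pin_crimps': 270, 'spring_contacts': 150, 'barrel_crimp_wires': 150, 'total': 570}
```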

We are calculating ONLY the wiring reliability here and ignoring the solder points inside the cell box (which can be cold, "kissing", etc.). We are ignoring the reliability of the BMS and everything downstream of the BMS... WE ARE IGNORING A LOT... but to first order... it won't really matter... because our reliability is already shot.

We now have 570 points of mechanical failure
Each of these is vulnerable to a multitude of undetectable failure modes during the manufacturing process including soft crimps, kissing contacts, cold solder joints, etc.
I build testers that validate quality on parts like this, so I can assure you... YOU CAN NOT affordably detect these failures in volume... unless a human or machine inspects every one of these connections one by one.

Compounding that, we have 150 spring loaded contact points in close proximity, protected only by the design of the connector and the IP rating of said connector.

Ah... The connectors...

For the 150 spring loaded pairs, add 300 points of failure (one pin and one socket per pair) where a pin or socket may not be pushed completely into the connector housing. Any one of those can push out, pull back, or lodge at an angle.

270 pin crimps
150 spring loaded contacts
150 multi-gauge barrel crimps
300 pin/socket locks at the connectors

Since we are now addressing connectors: if we are not talking about Mil-Spec, and we are in fact talking about low cost connectors (which we are), each of those connectors has a reliability rating (they are plastic... and do not see 100% inspection...), and every pin cavity of every connector factors in. A single tiny bit of plastic missing? We get an unreliable pin/socket lock.

270 pin crimps
150 spring loaded contacts
150 multi-gauge barrel crimps
300 pin/socket locks at the connectors (assembly)
300 pin/socket molding failures (low cost connectors)
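With every item counted, the first-order model is a series system: every one of these points has to hold for the harness to work, so, if the points fail independently, system reliability is the product of the individual reliabilities, R_sys = r^N for N identical points. The per-point values below are placeholders chosen only to show the scaling, not measured numbers.

```python
import math

# Counts from the tally above (first order, balance interconnects only).
FAILURE_POINTS = {
    "pin_crimps": 270,
    "spring_contacts": 150,
    "barrel_crimp_wires": 150,
    "pin_socket_locks": 300,
    "pin_socket_moldings": 300,
}


def series_reliability(counts: dict, r_point: float) -> float:
    """R_sys = product of R_i; with identical points this is r_point ** N.

    Assumes independent failures and that any single failure fails the system.
    """
    n = sum(counts.values())
    # Log form avoids underflow if you try very large counts.
    return math.exp(n * math.log(r_point))


total = sum(FAILURE_POINTS.values())
print(total)  # 1170
for r in (0.99999, 0.9999, 0.999):  # hypothetical per-point reliabilities
    print(r, round(series_reliability(FAILURE_POINTS, r), 4))
```

Even at 99.99% per point, roughly one harness in nine has a failure somewhere; at 99.9% per point the harness is more likely to fail than not.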

Are you seeing where we are going with this yet?
Let's jump ahead... we have drilled down far enough to eliminate this path forward in a design review.

Failure Modes:

If any single spring loaded connection opens... or becomes significantly corroded... or even becomes intermittent... there is a catastrophic failure mode.
Take the simplest case of 1 pin on 1 cell box
This one pin on this one cell box is responsible for balancing and monitoring that single cell

UNDETECTABLE:
It is impossible for the BMS to see this single open pin... unless it were doing some *very complex* operations... (which it is not). The paralleled tap is still held at a normal voltage by the remaining connected cell boxes, so they mask this single cell that has disconnected or become a high impedance path.

Integrating over time, balancing occurs on all cells. Days, weeks, months, years... depending on cell health, temperature, cycling, charge rates, discharge rates, cell quality, and a dozen other things.

This one cell will never see balancing. It will never have current drawn from it independent of the other cells in the system. It is irrefutable that this cell can slowly rise in voltage with respect to all other cells in the system in an undetected manner.
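A toy model makes the masking argument concrete. It tracks the offset (in mV, relative to the rest of its series string) of one cell position in each of the 4 paralleled boxes, assumes every cell at that position drifts up by the same hypothetical amount per cycle, and assumes the BMS bleeds that position back to zero once the voltage it reads on the shared tap exceeds a threshold. The drift and threshold values are illustrative only; the point is that the BMS reading stays in band while the disconnected cell walks away.

```python
def simulate_open_tap(cycles, drift_mv_per_cycle, balance_threshold_mv,
                      boxes=4, open_box=0):
    """Toy model of one cell position across paralleled boxes.

    offsets[b] is that cell's offset (mV) relative to the rest of its series
    string. Box `open_box` has lost its balance tap connection.
    Returns (offset the BMS sees on the shared tap, offset of the orphaned cell).
    """
    offsets = [0.0] * boxes
    connected = [b for b in range(boxes) if b != open_box]
    for _ in range(cycles):
        # Hypothetical: every cell at this position drifts up by the same amount.
        for b in range(boxes):
            offsets[b] += drift_mv_per_cycle
        # The shared tap is held by the boxes still wired to it.
        seen = sum(offsets[b] for b in connected) / len(connected)
        if seen > balance_threshold_mv:
            # Balancing bleeds only the cells still wired to the tap.
            for b in connected:
                offsets[b] = 0.0
    seen = sum(offsets[b] for b in connected) / len(connected)
    return seen, offsets[open_box]


print(simulate_open_tap(cycles=100, drift_mv_per_cycle=0.5, balance_threshold_mv=5.0))
# (0.5, 50.0)
```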

Thermal sensors may or may not be able to detect this... there are certainly failure modes where the single cell cannot be detected thermally.

And... on and on.

Conclusion:
The reliability of paralleling individual cell boxes at the cell level via cabling (especially in a harsh environment) is unacceptably low. I have not even touched on shock, vibe, thermal cycling, vendor selection, assembly inspection, or a dozen other metrics for quality. The analysis is not required because the design would already be rejected.

Proposed path forward:
Eliminate balance wires leaving an individual cell box
Encapsulate a qualified BMS slave unit into the cell box.
Exiting the cell box shall be only Main +, Main -, and 4 small wires for isolated communication: AuxPwr, AuxGnd, RX, TX (or the equivalent)
I suggest using 12V, Gnd, and an isoSPI pair
Others may choose 12V, Gnd, and CAN HI/LO
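A rough way to compare the two architectures is to reuse the counting rules from the connector tally with a different number of wires per box. Every line of that tally scales linearly with pins per box, and the five lines add up to 1170, i.e. 39 failure points per pin. This sketch assumes, generously for a daisy-chained isoSPI link and more plausibly for a CAN bus, that the 4 communication wires are harnessed and paralleled the same way as the balance taps; main power conductors are left out of both counts, as they were above. The helper name and the 0.9999 per-point figure are hypothetical.

```python
def harness_points(pins_per_box: int, boxes: int = 4) -> int:
    """First-order failure points using the thread's counting rules (hypothetical helper)."""
    box_side = pins_per_box * boxes
    pin_crimps = 2 * box_side + pins_per_box       # box + cable + BMS-cable crimps
    spring_contacts = box_side + pins_per_box      # box connectors + BMS connector
    barrel_wires = pins_per_box * (boxes + 1)      # 4 branches in, 1 trunk out
    locks = 2 * spring_contacts                    # one pin + one socket per contact
    moldings = 2 * spring_contacts                 # one housing cavity per pin/socket
    return pin_crimps + spring_contacts + barrel_wires + locks + moldings


for label, pins in (("balance taps out of the box", 30), ("slave BMS in the box", 4)):
    n = harness_points(pins)
    print(f"{label}: {n} points, R = {0.9999 ** n:.4f} at 0.9999 per point")
```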

In order to drive UP reliability on the potted slave unit I am bringing in outside power for the isolated communications.
Conceivably a high voltage DC-DC converter could be internal to power the hungry isolated communications... and this could be qualified at the cell level... but... BUT... the DC-DC would have to be qualified to every permutation of any conceivable way you could stack the cell boxes in series and parallel.
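To see why "every permutation" is a real burden: an isolated converter inside the box has to withstand whatever common-mode voltage the box sits at, and that depends on where the box lands in a stack. This sketch enumerates the series/parallel arrangements possible with a given number of identical boxes and the worst-case voltage of the top box's negative terminal above pack negative. It assumes 28 cells at a 4.2 V maximum each (a common Li-ion charge limit, not a figure from the post) and ignores transients, which a real isolation rating would also have to cover.

```python
CELLS_PER_BOX = 28
CELL_MAX_V = 4.2  # assumption: typical Li-ion charge limit


def arrangements(total_boxes: int):
    """All (series, parallel) layouts that use every box."""
    return [(s, total_boxes // s) for s in range(1, total_boxes + 1)
            if total_boxes % s == 0]


def top_box_offset_v(series: int) -> float:
    """Voltage of the top box's negative terminal above pack negative."""
    return (series - 1) * CELLS_PER_BOX * CELL_MAX_V


for s, p in arrangements(4):
    print(f"{s}S{p}P: top box floats at {top_box_offset_v(s):.1f} V")
```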

The slave unit must be ultra low cost
The slave unit must converge on 1.0 reliability... as few parts as possible...

The slave unit has a thousand ways to fail... but if kept simple... meaning only a qualified stack from a reputable vendor like Linear Technology... with no fancy crap like a DC-DC... then the reliability will be there.

This is rocket science. Please be aware that every major player in the game (and I am not talking about boot-strap startups here...) does these sorts of system reliability calculations.

IF... a rad vendor is providing a product that should be flying out the door like hot cakes on Sunday... and they are not... well - this is why.
Tesla sure as hell does this sort of calculation.
Sandia does.... and we build all sorts of stuff for all sorts of people.

This is not just an Aerospace thing... or Mil-Spec... it is basic reliability of a complex system.

Interconnects can be highly reliable.
Those kinds of interconnects are like... $50 to $500 connectors inside of a hermetically sealed container backfilled with anhydrous inert gases and desiccated. There is no moisture in this environment. Shock and vibe are tested ad nauseam.

So... don't compare yourself to Aerospace, but do take the lessons learned.

thanks,
-methods

P.S. For the prospective 3rd party who is looking to adopt a system as described above... eh... ones and twos are OK. Volume production is a no-go. I would schedule periodic inspection and maintenance... but the epic rub is... (drum roll please...)

The low cost connectors used to connect systems like this are only rated for like... 10, or at most 50... insertion cycles.
The connectors are intended to be connected once and left. Opening them for inspection literally lowers their reliability in a measurable way.
Take that to the frocking bank. Insertion cycles... a major issue for inspected systems over years in the field.
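The insertion-cycle problem is simple arithmetic. Each inspection costs at least one mating cycle (unplug, replug), and the factory has usually spent one or more before the pack ships. This sketch assumes 2 cycles are used in assembly and test (a made-up figure) and one cycle per inspection; the 10 and 50 cycle ratings are the range quoted above.

```python
def inspections_available(rated_cycles: int, used_before_field: int = 2,
                          cycles_per_inspection: int = 1) -> int:
    """Inspections possible before a connector passes its rated mating cycles."""
    remaining = rated_cycles - used_before_field
    return max(remaining, 0) // cycles_per_inspection


for rating in (10, 50):
    print(rating, "cycle rating ->", inspections_available(rating), "inspections")
```

With quarterly inspections, a 10-cycle connector is used up in two years, and every one of those openings has already cost some reliability along the way.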

So... effectively you cannot even manage reliability through inspection... even if you wanted to... and you certainly are not going to be cutting away adhesive-lined heat shrink from barrel connectors to look for corrosion starting at a fatigue point. Yeah... not going to happen.

That is called a slam dunk argument.
It can be contradicted... in tiny ways... nit-picked at... but fundamentally it is irrefutable. The only valid defense is denial.

And that... is why I am trying to work with suppliers to move in a different direction.

YES... potting electronics into a thermally cycling volume is dangerous business... but it is an art we are MUCH better at now than we were 5 or 10 years ago. We pretty much understand, and can prove, that a potted surface mount PCB will be reliable. We do this by potting it and then flexing the crap out of it at temperature (hot and cold) until there are no more failures. The specific surface mount parts selected, surface treatments, mounting angle... a lot goes into it. Usually it's just luck, good or bad... but that luck can be proven, a process can be secured, and reliability can be built.
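One standard way to turn "test until there are no more failures" into a number is the zero-failure (success-run) relation from reliability engineering, n = ln(1 - C) / ln(R): the number of units that must all pass a test to claim reliability R at confidence C for that test. This is a swapped-in textbook technique, not necessarily how the qualification described above is done, and it only speaks to the failure modes the test actually exercises.

```python
import math


def success_run_samples(reliability: float, confidence: float) -> int:
    """Zero-failure sample size: smallest n with 1 - reliability**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


for r, c in ((0.90, 0.90), (0.95, 0.90), (0.99, 0.95)):
    print(f"R={r}, C={c}: {success_run_samples(r, c)} units, zero failures")
```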

done

EDIT: Fixed typo, wrote it off the cuff top to bottom with no edit

agniusm · Sep 13, 2017

Wow, that read made me depressed.

bigmoose · Sep 13, 2017

Ahhhh this is so easy...not!

Just define all mechanical failure modes as noncredible in the FMECA because of pull test on crimped connectors.

Then do the rest according to MIL-HDBK-217 ... easy peasy... yea right.

Methy, about 8 years ago now, I was contemplating getting into battery systems for mobile sensor platforms (interpret that any way you want and for whoever you want). I consulted with my friends who are in law (patent and general corporate liability attorneys) and made a hard, very hard decision to walk away from a lucrative hardware delivery contract from a deep-pockets client because of the implied liability of dealing with and shipping lithium battery systems. This is a very difficult area for the small guy to "safely" work in.

Edit: Nice philosophy on connectors, but no hard reliability data:
http://www.te.com/documentation/whitepapers/pdf/Brief_Overview_of_Reliability_in_General_and_for_Electrical_Connectors_in_Particular.pdf

Edit2: Note MIL-HDBK-217F Notice 2 has a general Connector model in section 15.1 (http://www.sre.org/pubs/Mil-Hdbk-217F(2).pdf page 56) and a Connection model in section 17.1 (http://www.sre.org/pubs/Mil-Hdbk-217F(2).pdf page 63). The Connector model is for mating/pairing failures; the Connection model is for crimp, wire-wrap, solder cup, screw terminal, etc. connections. The two are additive.

So the model of keeping both connectors AND connections minimal leads to increasing reliability.
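MIL-HDBK-217 treats parts as a series system with constant failure rates, so the rates simply add and R(t) = exp(-lambda_total * t). This sketch shows only that bookkeeping. The failure rates are placeholders in failures per 10^6 hours, not handbook values (the handbook derives them from base rates adjusted by factors such as quality and environment), and the connector/connection counts are hypothetical designs.

```python
import math

# Hypothetical failure rates, failures per 10^6 hours (placeholders, NOT MIL-HDBK-217 values).
LAMBDA_CONNECTOR = 0.05    # one mated connector (Connector model: mating failures)
LAMBDA_CONNECTION = 0.001  # one crimp/solder joint (Connection model)


def harness_failure_rate(connectors: int, connections: int) -> float:
    """Series system: connector and connection failure rates are additive."""
    return connectors * LAMBDA_CONNECTOR + connections * LAMBDA_CONNECTION


def reliability(lambda_per_1e6_h: float, hours: float) -> float:
    """Constant-failure-rate reliability R(t) = exp(-lambda * t)."""
    return math.exp(-lambda_per_1e6_h * hours / 1e6)


# Same 10,000 h service life, three hypothetical designs.
for connectors, connections in ((5, 300), (5, 30), (1, 30)):
    lam = harness_failure_rate(connectors, connections)
    print(connectors, connections, round(lam, 3), round(reliability(lam, 10_000), 4))
```

Cutting either term lowers the total rate, which is the point of minimizing both connectors and connections.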

methods · Sep 14, 2017

Hey bigMoose,

The point of posting this fly trap was not to catch you... but I am glad I did.

We cannot afford the likes of you at the moment... but (in my professional capacity) we are trying to partner with a company that can afford you and will likely need your input for some development that will fall under OEM. You and I both know that for Military and hardcore Aerospace there is really no such thing as "Lawyers". There is Reliability. That's a whole ethos...

Then there is the wild west of Electric Vehicles, markets where those vehicles are distributed, the Safety level we ethically require as responsible engineers, and the regulations that are currently being developed to guide design.

I would like to brief you on what we are working on over the phone and invite you to participate. We have only talked on the phone once... I am sure you recall.

I see you playing a split role requiring you to jump the gap. On one hand you are a great setter of boundaries... with hard data to back up your statements. Very conservative... almost to the point of scuttle. This is perfect... you are the predictor of dead ends. On the other hand I know that you know... that sometimes we are going to launch a submarine... and it is going to get launched... and the best we can do is make it as safe as possible.

I look at it as... This is happening whether we want it to or not. Economics (business) is currently driving safety. Our job is to address that crack and try our best to meet safety requirements while respecting the reality of the market... where people WILL buy a cheaper product even if it is unsafe, mostly because marketing guys smooth this over and the subject matter is beyond the interest of your average Joe. Average Joe understands "spontaneous or unpredictable fire".

What I hope to get from working with you is a complete understanding of the risk... as well as high risk paths forward and metrics/tactics for navigating the gauntlet. I have my mind wrapped around it... but you have working knowledge at a deeper level, chapter and verse.

We have a country full of ICE vehicles that are about as dangerous as it gets. 20 gallons of gas... sparks and fast moving parts and gas stations and rubber hoses running everywhere... brake master cylinders and steering boxes and inconceivably mission critical subsystems which have been cost reduced to be pot metal from China for $30 at Autozone.

This can and will happen for EVs, and we currently have an opportunity to work on solving the toughest problem, which is the environmentally exposed system: medium shock, high vibe, high moisture, high rate of discharge. If we succeed, the result will be a superset of what is required for widely distributed stationary storage that is safe, modular, and stacks like Legos.

Anyhow... I am avoiding my slide deck.

First deck?
What is inside of the IP67 box, what does that mean, what does it take to get things out of the box, what is commercially available, what will it take to build what is not, who can we partner with, what are the risks, how much and how long.

Yea... no problem right 8)

I HAVE had a decade to digest the variables. Now... GO TIME

-methods

methods · Sep 14, 2017

agniusm said:
Wow, that read made me depressed

Ignore it and march on brother. Whatever you are doing I am sure it is radical. Keep building, testing, and sharing what you learn.

Guys like BigMoose and I can absolutely assert that a particular path forward will ultimately result in a high probability of failure.
That does not mean it actually will fail in practice... it simply means the risk is too high to nod your head at a guy who is putting his life on the line.

We are looking for the guaranteed path forward... not "a" path forward.
I am looking for the affordable guaranteed path forward... which is a three-dimensional Rubik's Cube of dead ends with endless placeholders.
The ultimate knapsack dilemma

-methods

liveforphysics · Sep 14, 2017

Do you think modern EV harnessing has more or less failure points than status-quo accepted vehicle tech?

To quote Musk:

The Model Y will be on a different platform than the Model 3. Musk said that the car will be quite different, inside, in part because Tesla is learning how to make cars more efficiently. “The wiring harness on Model S is about 3 kilometers in length,” he said. “The wire harness on Model 3 is 1.5 kilometers in length. The wiring harness on Model Y will be 100 meters. And that’s a redundant wiring harness.”

Data over power and in-device power switches (for everything from turn signals to pumps, etc.) are how modern vehicles are being architected today for reliability, weight, and cost.

A well-thought-through system for vehicle safety/reliability, which is standard automaker practice today:

https://en.m.wikipedia.org/wiki/Automotive_Safety_Integrity_Level

Bison_69 · Sep 17, 2017

Many transportation companies in Europe have been using custom-made PoE (Power over Ethernet) technologies for years...
It is really interesting and saves a lot in design and cabling hardware... it is also slowly replacing the old CAN bus system.
