methods
I am behind on follow-up posts and PMs
Paying work picked up
That said...
In a distributed battery system, especially one which is not hermetically (or even reasonably) sealed, the system reliability is, to first order, driven by the number of balance tap interconnects running around.
Example:
Let's take the example where we start with potted cell boxes with some number of cells in series, say 28.
For a "box" you have (# Cells + 1) energized wires exiting + (# wires related to temp sensing)
The reliability of that cell box is calculated in a vacuum. Even if it results in a reliability of 1.0 (which it can't... but for argument's sake) the cell box will have to be used at the system level, so it is the system level reliability which rolls up into the overall vehicle reliability.
Let's take a random example of 4 cell boxes. Since in this example we are using 28S, let's say those 4 are wired in parallel to increase capacity.
Looking only at balance interconnects (to keep this 1st order) the following calculations are made:
29 pin connections at each cell box, protected by an IPXX connector (call it 30 to keep it clean), so 4 x 30 = 120 pcs of 28 AWG crimped pins
Mating to that, the balance parallel cable results in 120 more pcs of 28 AWG crimped pins
At that interface we have 120 spring loaded pin connections
So far, 240 crimps and 120 spring loaded connections
All energized with respect to each other... such that even the tiniest drip of water would result in electrolysis
That parallel cable can be built in many ways, but most likely each of the 30 tap positions will have its 4 branch wires (one per cell box) joined in a barrel, covered in heat shrink, to a larger gauge wire carrying current for the 4 branches. The joint may be soldered or crimped; crimp is more likely.
Add 30pcs barrel crimp
Each crimp has 4pcs 28awg and 1pcs .... say... 18awg
4 coming in from one side, one going out the other
That's 5 x 30 failure points (any one wire pulling out, corroding, fatiguing, etc). First order, and being conservative (on the generous side), this adds 150 failure points, tallied below as multi-gauge barrel crimps.
We are now at:
240 pin crimps
120 spring loaded contacts
150 multi-gauge barrel crimps
At the end of that paralleling cable we have the BMS
The BMS is responsible for balancing and measuring cells
There are a minimum of 30 pin crimps for the cable + 30 soldered pins for the PCB (or more if pigtailed) + 30 spring loaded contacts. The soldered pins are left out of the running tally.
270 pin crimps
150 spring loaded contacts
150 multi-gauge barrel crimps
We are calculating ONLY the wiring reliability here and ignoring the solder points inside the cell box (which can be kissing, cold, etc). We are ignoring the reliability of the BMS and all things downstream of the BMS... WE ARE IGNORING A LOT... but first order... it won't really matter... because our reliability is already shot.
We now have 570 points of mechanical failure
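To put a rough number on that, here is a minimal sketch of a first-order series model: every one of the 570 points has to survive, and failures are assumed independent. The counts come from the tally above; the per-point reliability values (0.9999, 0.999, 0.99) and the function name are illustrative assumptions, not measured data.

```python
# Hypothetical illustration: series-system reliability for the wiring tally.
# Counts are from the post; per-point reliabilities are made-up inputs.

PIN_CRIMPS = 270
SPRING_CONTACTS = 150
BARREL_CRIMP_WIRES = 150
TOTAL_POINTS = PIN_CRIMPS + SPRING_CONTACTS + BARREL_CRIMP_WIRES  # 570


def series_reliability(per_point: float, points: int) -> float:
    """Probability that all `points` independent connections survive."""
    return per_point ** points


if __name__ == "__main__":
    for r in (0.9999, 0.999, 0.99):
        system = series_reliability(r, TOTAL_POINTS)
        print(f"per-point {r}: harness {system:.3f}")
```

With these made-up inputs, even 0.9999 per point lands around 0.94 for the harness, and 0.999 per point drops it to roughly 0.57.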
Each of these is vulnerable to a multitude of undetectable failure modes during the manufacturing process including soft crimps, kissing contacts, cold solder joints, etc.
I build testers that validate quality on parts like this so I can assure you... THAT YOU CAN NOT... affordably detect failures in volume... unless you are having a human or machine inspect every one of these connections one by one.
Compounding that we have 150 spring loaded contact points, in close proximity, protected only by the design of the connector and IP rating of said connector.
Ah... The connectors...
For the 150 spring loaded pairs, add 300 points of failure (one pin and one socket per pair) where a pin or socket may not be pushed completely into the connector housing. Any one of those can push out, pull back, or lodge at an angle.
270 pin crimps
150 spring loaded contacts
150 multi-gauge barrel crimps
300 pin/socket locks at the connectors
Since we are now addressing connectors: if we are not talking about Mil-Spec... and we are in fact talking about low cost connectors (which we are)... each of those connectors has a reliability rating (they are plastic... and do not see 100% inspection...) and in fact every barrel of every connector is factored in. A single tiny bit of plastic missing? We get an unreliable pin/socket lock.
270 pin crimps
150 spring loaded contacts
150 multi-gauge barrel crimps
300 pin/socket locks at the connectors (assembly)
300 pin/socket molding failures (low cost connectors)
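The running tally can also be turned around to ask how good each individual point has to be. A minimal sketch, again assuming independent failures in a pure series model; the 0.99 system target is a hypothetical number picked for illustration, and the category counts are the ones listed above.

```python
# Tally from the post. TARGET is a hypothetical system-level goal,
# not a figure from the post.
FAILURE_POINTS = {
    "pin crimps": 270,
    "spring loaded contacts": 150,
    "multi-gauge barrel crimps": 150,
    "pin/socket locks (assembly)": 300,
    "pin/socket molding failures": 300,
}
TARGET = 0.99


def required_per_point(target: float, points: int) -> float:
    """Per-point reliability needed so that `points` in series meet `target`."""
    return target ** (1.0 / points)


total = sum(FAILURE_POINTS.values())  # 1170

if __name__ == "__main__":
    needed = required_per_point(TARGET, total)
    print(f"{total} points, each must reach {needed:.7f}")
    print(f"allowed failure probability per point: {1 - needed:.2e}")
```

For a 0.99 target across 1170 points, the allowed failure probability per point comes out under ten per million.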
Are you seeing where we are going with this yet?
Let's jump ahead... as we have drilled down far enough to eliminate this path forward in a design review
Failure Modes:
If any single spring loaded connection opens... or becomes significantly corroded... or even becomes intermittent... there is a catastrophic failure mode.
Take the simplest case of 1 pin on 1 cell box
This one pin on this one cell box is responsible for balancing and monitoring that single cell
UNDETECTABLE:
It is impossible for the BMS to see this singular open pin... unless it were doing some *very complex* operations... (which it is not)... so the remaining connected cell boxes mask this single cell which has disconnected or become a high impedance path
Integrating over time, balancing occurs on all cells. Days, weeks, months, years... depending on cell health, temperature, cycling, charge rates, discharge rates, cell quality, and a dozen other things.
This one cell will never see balancing. It will never have current drawn from it independent of the other cells in the system. It is irrefutable that this cell can slowly rise in voltage with respect to all other cells in the system in an undetected manner.
Thermal sensors may or may not be able to detect this... certainly there are failure modes where the single cell can not be detected thermally
And... on and on.
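A toy model of the masking effect described above, for one cell position across the 4 paralleled boxes. The assumptions are all hypothetical and only for illustration: the BMS reading at a tap is the average of the boxes whose tap is still connected, connected cells are held at a 4.10 V balance target, and the orphaned cell creeps up by a fixed made-up amount per step. Real packs are messier than this.

```python
from statistics import mean

BALANCE_TARGET_V = 4.10   # hypothetical balance target
DRIFT_PER_STEP_V = 0.001  # hypothetical upward creep of an unbalanced cell


def bms_reading(cell_voltages, tap_connected):
    """Voltage seen at one paralleled tap: only intact taps contribute."""
    visible = [v for v, ok in zip(cell_voltages, tap_connected) if ok]
    return mean(visible)


def simulate(steps):
    """Return (BMS reading, true voltage of the orphaned cell)."""
    cells = [BALANCE_TARGET_V] * 4
    connected = [True, True, True, False]  # box 4 has an open balance pin
    for _ in range(steps):
        # Connected cells stay at the target because the BMS balances them;
        # the orphaned cell gets no balancing and keeps creeping upward.
        cells[3] += DRIFT_PER_STEP_V
    return bms_reading(cells, connected), cells[3]


if __name__ == "__main__":
    reading, orphan = simulate(200)
    print(f"BMS sees {reading:.3f} V, orphaned cell is at {orphan:.3f} V")
```

After 200 steps the reading is still 4.10 V while the orphaned cell has moved about 0.2 V higher, and nothing in the reading hints at it.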
Conclusion:
The reliability of paralleling individual cell boxes at the cell level via cabling (especially in a harsh environment) is unacceptably low. I have not even touched on shock, vibe, thermal cycling, vendor selection, assembly inspection, or a dozen other metrics for quality. The analysis is not required because the design would already be rejected.
Proposed path forward:
Eliminate balance wires leaving an individual cell box
Encapsulate a qualified BMS slave unit into the cell box.
Exiting the cell box shall be only Main +, Main -, and 4 small wires for isolated communication: AuxPwr, AuxGnd, RX, TX (or the equivalent)
I suggest using 12V, Gnd, and an isoSPI pair
Others may choose 12V, Gnd, and CAN HI/LO
In order to drive UP reliability on the potted slave unit I am bringing in outside power for the isolated communications.
Conceivably a high voltage DC-DC converter could be internal to power the hungry isolated communications... and this could be qualified at the cell level... but... BUT... the DC-DC would have to be qualified to every permutation of any conceivable way you could stack the cell boxes in series and parallel.
The slave unit must be ultra low cost
The slave unit must converge on 1.0 reliability... as few parts as possible...
The slave unit has a thousand ways to fail... but if kept simple... meaning only a qualified stack from a reputable vendor like Linear Technology... with no fancy crap like a DC-DC... then the reliability will be there.
This is rocket science. Please be aware that every major player in the game (and I am not talking about boot-strap startups here...) does these sorts of system reliability calculations.
IF... a rad vendor is providing a product that should be flying out the door like hot cakes on Sunday... and they are not... well - this is why.
Tesla sure as hell does this sort of calculation.
Sandia does.... and we build all sorts of stuff for all sorts of people.
This is not just an Aerospace thing... or Mil-Spec... it is basic reliability of a complex system.
Interconnects can be highly reliable.
Those kinds of interconnects are like... $50 to $500 connectors inside of a hermetically sealed container, back filled with anhydrous inert gases and desiccated. There is no moisture in this environment. Shock and vibe are tested ad nauseam.
So... don't compare yourself to Aerospace, but do take the lessons learned.
thanks,
-methods
P.S. For the prospective 3rd party who is looking to adopt a system as described above... eh... ones and twos are OK. Volume production is a no-go. I would schedule periodic inspection and maintenance... but the epic rub is... (drum roll please...)
The low cost connectors used to connect systems like this are only rated for like... 10, or at most 50... insertion cycles.
The connectors are intended to be connected once and left. Opening them for inspection literally lowers their reliability in a measurable way.
Take that to the frocking bank. Insertion cycles... major issue with inspected systems moving forward over years in field.
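The insertion cycle budget is simple arithmetic, sketched here. The 10 and 50 cycle ratings are the rough figures quoted above. The idea that final assembly uses one mating cycle and that each inspection (open, look, re-mate) uses one more is a simplification for illustration, as is the function name.

```python
def inspections_available(rated_cycles: int,
                          used_at_assembly: int = 1,
                          cycles_per_inspection: int = 1) -> int:
    """How many open/re-mate inspections fit inside the rated cycles."""
    remaining = max(rated_cycles - used_at_assembly, 0)
    return remaining // cycles_per_inspection


if __name__ == "__main__":
    for rating in (10, 50):
        left = inspections_available(rating)
        print(f"rated {rating} cycles: {left} inspections left")
```

At a 10-cycle rating that leaves 9 inspections over the life of the pack, before counting any cycles burned during factory test or rework.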
So... effectively you can not even manage reliability through inspection... even if you wanted to... and you certainly are not going to be cutting away adhesive lined heat shrink from barrel connectors to look for corrosion starting at a fatigue point. Yeah... not going to happen.
That is called a slam dunk argument.
It can be contradicted... in tiny ways... nit-picked at... but fundamentally it is irrefutable. The only valid defense is denial.
And that... is why I am trying to work with suppliers to move in a different direction.
YES... potting electronics into a thermally cycling volume is dangerous business... but it is an art we are MUCH better at now than we were 5 or 10 years ago. We pretty much understand and can prove that a potted-in surface mount PCB will be reliable. We do this by potting it and then flexing the crap out of it at temperature (hot and cold) until there are no more failures. The specific surface mount parts selected, surface treatments, mounting angle... a lot goes into it. Usually it's just luck, good or bad... but that luck can be proven and a process can be secured and reliability can be built.
done
EDIT: Fixed typo, wrote it off the cuff top to bottom with no edit