Why we had big downtime 04/22/18 and intermittent downtime since the upgrade

neptronix   100 GW
Posts: 12685
Joined: Jun 15 2010 5:56pm
Location: California refugee living in Utah, USA

Post by neptronix » Apr 22 2018 2:03pm

Hi all. When we moved from phpBB 3.0 to 3.2, we decided to use the newest version of the database server on AWS.
It turned out that this version has a slow memory leak that would cause the server to stall for an hour and then reboot, on a 3-day cycle.

What we did was double the memory on the database server. That made the system last about a month between failures, but it then stalled for 2-3 hours before rebooting as the memory filled up. Obviously not a solution.

I tried tweaking all kinds of database settings and also upgraded to a newer minor version that claimed to fix the problem, but neither approach worked. I had planned on downgrading the database next week, but I noticed that the system had stalled and rebooted twice in the last week and was hung this morning, so I took the system down and did the downgrade. Total downtime was around 6 hours, mostly because our database is just enormous and takes eons to process.

Sorry for the lack of a heads-up on the downtime. I figured Sunday morning was an ideal time to tackle the problem, since the system was down already.

Here's hoping that the problem is cured once and for all. If not, there are more tricks up my sleeve, but I've already blown ~18 hours in total on this problem, so ... cross your fingers for me, fellas. :)
My first major build: 8T MAC motor on a Trek 4500.
The new all-arounder: Leafmotor 1500w on a Turner O2 full suspension.
The wheelie machine: 20" Rear Magic Pie II on a Trek 4300 MTB

"The best time to plant a tree was 20 years ago. The second best time is now."- Chinese Proverb

LockH   100 GW
Posts: 17467
Joined: Jul 09 2013 11:06pm
Location: Ummm.. Started out in Victoria BC Canada, then started to move around... Oh oh.

Post by LockH » Apr 22 2018 2:07pm

THANKS Nep. :)
ES changed my life (for the waaaaay better).

Eff. June, 2014 Phoenix Ebike Promotions

(Current ride? High speed lawn chair.)
viewtopic.php?f=3&t=57408

Phoenix Ebike Promotions conversion kit (work in progress. More drink holders, etc etc)
viewtopic.php?f=15&t=60564

Joined yer local chapter of EA yet?
(Ebikers Anonymous - Where we're all miserable failures, but the parties are hilarious...)

swbluto   100 GW
Posts: 8987
Joined: May 30 2008 5:23pm

Post by swbluto » Apr 22 2018 2:15pm

Yep, the proper response to a memory leaking program is fixing the memory leak. As you found out, throwing more memory at it just simply prolongs the time till memory exhaustion.

So if it's the database... think a latest stable "working" version would work? Pretty sure the best database systems are designed not to leak. Or are there security concerns with not-the-newest version?
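To put rough numbers on the point about extra memory only postponing exhaustion: with a steady leak, the time until memory runs out is just the headroom divided by the leak rate. The sketch below is illustrative only. The instance sizes, baseline usage, and leak rate are made-up values, not measurements from this forum's database, and the function name is hypothetical.

```python
def days_until_exhaustion(total_gb, baseline_gb, leak_gb_per_day):
    """Days until a steady leak consumes all memory above the normal working set."""
    if leak_gb_per_day <= 0:
        raise ValueError("leak rate must be positive")
    # Headroom is whatever is left after the database's normal footprint.
    headroom_gb = max(total_gb - baseline_gb, 0.0)
    return headroom_gb / leak_gb_per_day


# Hypothetical numbers: same baseline and leak rate, memory doubled.
small = days_until_exhaustion(total_gb=8, baseline_gb=5, leak_gb_per_day=1.0)
large = days_until_exhaustion(total_gb=16, baseline_gb=5, leak_gb_per_day=1.0)
print(f"small instance: {small:.1f} days, doubled instance: {large:.1f} days")
```

Under these assumed numbers, doubling the memory more than triples the time between stalls because most of the original memory was already in use, but the server still runs out eventually.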

Post by neptronix » Apr 22 2018 2:29pm

swbluto wrote (Apr 22 2018 2:15pm):
Yep, the proper response to a memory leaking program is fixing the memory leak. As you found out, throwing more memory at it just simply prolongs the time till memory exhaustion.

So if it's the database... think a latest stable "working" version would work? Pretty sure the best database systems are designed not to leak. Or are there security concerns with not-the-newest version?

Initially we thought it was some kind of DDoS because it was happening so rapidly with the lower memory, which was weird because I have reliably hardened many Linux servers running this exact OS. The extra memory was kind of a test to rule out a possible memory leak, because I had already tried some very restrictive DDoS mitigations without any results.

Amazingly, it's the latest major version of the database that has the memory leak, and I notice in the release notes that they are gradually patching one memory leak after another. Mind you, this is a major vendor, so I had a feeling they'd eventually sort it out. Amazon has a note about this and offers the option of using their own fork of the database to get around the leak, but I decided not to take up that offer, lest we end up with vendor lock-in.

We are already locked in to phpBB by our massive amount of content. That's bad enough :lol:

Post by LockH » Apr 22 2018 2:57pm

"Mind you, this is a major vendor..." Watt. Like manufacturers of the 20th-Century horseless carriage in a 21st-Century ("crowded") "urban" world? Surely you jest Sir. :lol:

TheBeastie   1 MW
Posts: 1745
Joined: Jul 28 2012 12:31am
Location: Melbourne Australia

Post by TheBeastie » Apr 23 2018 5:12pm

Is the database MySQL?
In my experience, if the server is dedicated to database work, MySQL's own memory setting doesn't affect performance much. A dedicated server will load every block of disk storage it can into the kernel's page cache and serve it from memory instead of from disk. That is essentially the same as configuring the database software to use all the available RAM, because on a dedicated server the kernel has nothing else to spend that RAM on.
If the server has, say, 96GB of RAM and the database software is set to use, say, 16GB, then reboot the Linux server and watch it with "vmstat 30". You will see around 80GB of the free memory get eaten up by the kernel caching the disk blocks that MySQL reads while serving requests.
It doesn't make much difference in my experience and benchmarking, but for the very best performance, decide whether you want the kernel to manage the cached database data or to force the database to use as much of the RAM as possible. The worst thing you can do is split the RAM half and half between the database and free memory, because the database and the kernel will most likely each cache an identical copy of the same data, effectively halving the usable RAM.
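A minimal sketch of how to see that split on a Linux box, alongside vmstat: /proc/meminfo reports MemTotal, MemFree, Buffers, and Cached in kB, and the last two together approximate the page cache. The sample text below is invented to resemble the 96GB example and is not output from any real server. The function names are hypothetical.

```python
# Invented sample in /proc/meminfo format (values in kB), not real output.
SAMPLE_MEMINFO = """\
MemTotal:       98304000 kB
MemFree:         1024000 kB
Buffers:          512000 kB
Cached:         81920000 kB
"""


def parse_meminfo(text):
    """Return {field: value} for each 'Name:   value [kB]' line."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Report how much RAM the kernel is using as cache versus leaving free."""
    cache_kb = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return {
        "page_cache_gib": cache_kb / 1024**2,
        "free_gib": fields["MemFree"] / 1024**2,
        "cache_share": cache_kb / fields["MemTotal"],
    }


summary = summarize(parse_meminfo(SAMPLE_MEMINFO))
print(f"page cache: {summary['page_cache_gib']:.1f} GiB "
      f"({summary['cache_share']:.0%} of RAM), free: {summary['free_gib']:.1f} GiB")
```

On a real server the same functions could be fed the contents of /proc/meminfo, for example `parse_meminfo(open("/proc/meminfo").read())`.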
Speed Kills Range, 10mph = 46 miles range, 20mph = 20 miles, 30mph = 8 miles range https://goo.gl/1JNL53
Over Charging Kills ur battery bit.ly/1hzWKl4
Consider PAS as your only throttle https://goo.gl/Kg1F8F
Fuel-Cell is the ultimate battery coupled with 4th-gen Nuclear https://goo.gl/ZhFFot
https://goo.gl/gfa215
10 Square Miles of solar panels = 0.12GW average power! https://goo.gl/Ub1S39

Post by neptronix » Apr 23 2018 9:26pm

Yeah, without revealing anything about our infrastructure, I'm aware of what you're talking about :)

But we had an out-of-control memory leak in the database, which would cause it to go into swap, stall for hours, and then reboot itself. The newer version of the database wasn't responding to all sorts of parameters I set to control memory usage, parameters that normally work on other versions.

I'm seeing what looks like a positive trend in the database memory usage, though. We will know in a few days whether the fix held. If not, more drastic measures will be needed :evil:

marty   10 MW
Posts: 2095
Joined: Apr 19 2007 5:44pm
Location: Buffalo, New York USA

Post by marty » Apr 24 2018 7:53am

Thanks for fixing the forum. Looks good from here.

Lots of problems and solutions here:
https://www.phpbb.com/community/index.php

I wonder what server the phpbb.com people use for their forum. Let's do some research.

endless-sphere.com
Name Server: NS-415.AWSDNS-51.COM
Name Server: NS-1669.AWSDNS-16.CO.UK
Name Server: NS-1320.AWSDNS-37.ORG
Name Server: NS-997.AWSDNS-60.NET

phpbb.com
Name Server: ERIC.NS.CLOUDFLARE.COM
Name Server: ERIN.NS.CLOUDFLARE.COM

marty's server
Name Server: NS5.SECURESERVER.NET
Name Server: NS6.SECURESERVER.NET

My thoughts:
AWS is Amazon Web Services. I looked at https://aws.amazon.com/ and I am overwhelmed. Too much to comprehend. No phone number for questions?

CLOUDFLARE.COM: I know nothing about them. Located in San Francisco, CA. They have a phone number for sales. The Free plan looks interesting. I should try moving one of my web sites over there.

SECURESERVER.NET is GoDaddy. Call them any time with any question. I am happy. I had a delusional idea of becoming a reseller. Having major Dreamweaver confusion :(
MARTY
Volt Electric Vehicles
http://www.voltev.com

Post by neptronix » Apr 24 2018 11:21am

Thanks for trying to be helpful, but you're thinking a couple of levels outside the scope of the problem.
I am the guy you call when your Linux server has gone tits up. There is no 'call a friend' option for these kinds of problems :lol:

Anyway, I'm seeing what looks like a promising trend in the database's memory consumption. We will know within a few days if the fix held, but our DB was crashing on a 1-2 day cycle and we're on day 3 now, so :mrgreen:

Post by neptronix » Apr 25 2018 2:11pm

Yahoo!!!
[Attached screenshot: 2018-04-25 11_58_12-RDS · AWS Console.png]
As you can see, the database starts out with a lot of free RAM and then uses more and more memory to cache our huge database as time goes on. But then it levels off after 3 days. It was not doing this before. The curve just went down to zero megabytes, and into swap.

But for the last 12 hours, memory usage has stayed within +/- 3 megabytes. I'm calling this fixed. :mrgreen:
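The same "has it levelled off?" check can be sketched in code by fitting a straight line to the most recent free-memory samples and looking at the slope. This is only an illustration: the sample data is synthetic, the 12-hour window and 1 MB/hour tolerance are arbitrary assumptions, and the function names are hypothetical. Real samples would have to come from whatever monitoring is in use.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hour, free_mb) points, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den


def classify(samples, window_hours=12, tolerance_mb_per_hour=1.0):
    """Label the trend of free memory over the most recent window."""
    last_t = samples[-1][0]
    recent = [(t, m) for t, m in samples if t >= last_t - window_hours]
    slope = slope_per_hour(recent)
    if abs(slope) <= tolerance_mb_per_hour:
        return "plateau"
    return "leaking" if slope < 0 else "recovering"


# Synthetic hourly samples over three days (hour, free MB).
leaking = [(t, 4000 - 50 * t) for t in range(73)]
levelled = [(t, max(4000 - 50 * t, 1500) + (t % 3 - 1)) for t in range(73)]

print(classify(leaking))   # free memory keeps falling
print(classify(levelled))  # free memory wobbles around a floor
```

A leak shows up as free memory that keeps falling across the whole window, while normal cache warm-up falls at first and then flattens out.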