inconsistent performance between staging and prod

Bug #660291 reported by Robert Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Stuart Bishop

Bug Description

Staging is meant to be no better than production - that is, something that works on staging should always work on production.

However since the pg8.4 upgrade we've had multiple queries (e.g. the distribution:+ppas count(*) query) that are tolerable (1.2s) on staging and intolerable (15s) on production.

Discussing with Stuart, there may be a tuning parameter difference, or it may be the greater ram available on prod triggering query plans that are actually less efficient.

If we can make prod fast for these queries we should, otherwise we need to get staging back to being a good predictor for 'bad queries' asap so that developers can have improvements validated.

Revision history for this message
Stuart Bishop (stub) wrote :

The following settings are different on staging:

shared_buffers = 2GB
random_page_cost = 3.0
effective_cache_size = 25GB
default_statistics_target = 1000

On production we have:

shared_buffers = 5GB
random_page_cost = 1.8
effective_cache_size = 120GB
default_statistics_target = 250

All of these will affect query plans. The first thing that stands out is our statistics target (how many rows we sample to generate statistics) is too low on production. I'm bumping it up to 2500 and will run a full analyze.

If the statistics target doesn't help, lowering our shared_buffers seems the next step. There never seems to be a straight answer on how big this should be, especially when most of the rules of thumb are years out of date. The reports I trust most indicate performance starts degrading at 10GB, but it is rare to find places running with more than 2GB since most sites are running with <32GB RAM.

random_page_cost is the other variable we should try tweaking, maybe before tweaking the shared_buffers as we need to bounce PG to change shared_buffers. random_page_cost is quite low as, at least with PG 8.3, you needed to take into account the probability of data being in RAM and we have lots of that (we probably should have been running with about 1.2 on production with 8.3, but I took the conservative option since it was performing well enough and didn't want to risk experiments). We should confirm this setting is still sane under PG 8.4.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 660291] Re: inconsistent performance between staging and prod

Awesome, I'm so glad there are differences. Perhaps we should increase
the stats size on staging to match?

Gary Poster (gary)
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Stuart Bishop (stub) wrote :

statistics_target and random_page_cost are now the same between production and staging. We can't really make the others match as the hardware doesn't match.

Changed in launchpad:
assignee: nobody → Stuart Bishop (stub)
status: Triaged → Fix Committed
tags: added: needsrt
Revision history for this message
Robert Collins (lifeless) wrote :

Ok, we will need RT for the hardware aspect

Revision history for this message
Stuart Bishop (stub) wrote :

On Tue, Jan 17, 2012 at 5:26 AM, Robert Collins
<email address hidden> wrote:
> Ok, we will need RT for the hardware aspect

If staging is brought up to spec to match production, that causes
other issues - staging will always perform better production as there
is much less concurrency.

Is the goal here purely to get the planner to plan the same way?
Again, I don't think it can ever be identical due to the difference in
concurrency (production will think tables are hot and staging will
not), but we can lie in our tuning parameters - performance on staging
will suffer as the planner is not planning to actual hardware, but the
plans will be closer.

--
Stuart Bishop <email address hidden>

Stuart Bishop (stub)
Changed in launchpad:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.