Comment 43 for bug 215728

Dcamp (dcamp) wrote:

OK, so I split the current Google database into 3 pieces (much bigger chunks than they're currently giving us, and a bit bigger than they're planning to feed us during an initial update) and fed them to the DB one piece at a time. This gives us ~800,000 URLs to deal with per update (some are adds that will be applied, some are subs that will delete an existing add, some are subs that will be saved to apply to a future add, and some are adds that will be deleted by a previously-saved sub).
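
To make those four cases concrete, here's a rough sketch of the reconciliation logic, written against throwaway in-memory sets rather than the real SQLite-backed store; the Key type and the structure/function names are made up for illustration:

#include <cstdint>
#include <set>
#include <utility>

// Hypothetical key: (add chunk number, URL hash prefix).
using Key = std::pair<uint32_t, uint64_t>;

struct UpdateState {
  std::set<Key> appliedAdds;  // adds currently stored in the database
  std::set<Key> pendingSubs;  // subs whose matching add hasn't arrived yet

  void applyAdd(const Key& k) {
    if (pendingSubs.erase(k)) {
      return;                  // a previously-saved sub cancels this add
    }
    appliedAdds.insert(k);     // plain add: applied to the database
  }

  void applySub(const Key& k) {
    if (appliedAdds.erase(k)) {
      return;                  // sub deletes an add that was already applied
    }
    pendingSubs.insert(k);     // add not seen yet: save the sub for later
  }
};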

On OS X and Vista, the entire process of applying one of these chunks takes ~30-35 seconds. On Ubuntu, it takes 10 minutes of system-performance-degrading IO to deal with the same chunk. I'm sure it's worse on network filesystems (though we use the Local data dir on OS X and Windows).

At its largest (which is actually near the middle of the update, because the update tends to send us a bunch of subs-that-don't-have-adds-yet that fill up the database, with the adds that remove them coming later), the database has roughly 35,000 1k pages. During one of these big initial updates, all of the pages will be touched quite a few times.
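
For reference, 35,000 pages at 1k each is roughly 35 MB of data that an initial update can end up touching. If you want to check the numbers on a local profile, something like this (raw SQLite C API; the file name is only an example) reads them back:

#include <cstdio>
#include <sqlite3.h>

// Read a single integer result from a PRAGMA statement.
static int getPragmaInt(sqlite3* db, const char* sql, int* out) {
  sqlite3_stmt* stmt = nullptr;
  int rc = sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);
  if (rc == SQLITE_OK && sqlite3_step(stmt) == SQLITE_ROW)
    *out = sqlite3_column_int(stmt, 0);
  sqlite3_finalize(stmt);
  return rc;
}

int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open("urlclassifier.sqlite", &db) != SQLITE_OK) return 1;
  int pages = 0, pageSize = 0;
  getPragmaInt(db, "PRAGMA page_count;", &pages);
  getPragmaInt(db, "PRAGMA page_size;", &pageSize);
  std::printf("%d pages x %d bytes = %.1f MB\n",
              pages, pageSize, pages * (double)pageSize / (1024.0 * 1024.0));
  sqlite3_close(db);
  return 0;
}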

Once we've caught up to the state of the server, our updates are going to be much smaller - usually in the low hundreds of URLs, occasionally a thousand or two. These updates will touch far fewer pages in the database (I haven't measured this closely).

The only way I see to get this performing decently is to bump up the max page cache size so that it can hold all the pages touched during an update. If we bump the max page cache to a pretty big size (say, 35-40 megs), we'll avoid hitting the disk until commit time even for the largest updates. We can blow that cache away after the update, so it would be a transient memory spike during updates, sized relative to the update.
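
Roughly what I have in mind, sketched here against the raw SQLite C API instead of mozStorage, with made-up function names; cache_size is in pages, so 40000 1k pages is about 40 MB:

#include <sqlite3.h>

static bool applyUpdateChunk(sqlite3* db /*, chunk data */) {
  // Grow the connection's page cache before starting the update transaction
  // so pages stay in memory until commit instead of being spilled to disk.
  sqlite3_exec(db, "PRAGMA cache_size = 40000;", nullptr, nullptr, nullptr);
  sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);

  // ... apply the adds and subs for this chunk ...
  // (error handling and ROLLBACK omitted for brevity)

  bool ok = sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr) == SQLITE_OK;

  // Drop back to a small cache (2000 pages here, roughly SQLite's old
  // default) so the extra memory can be released once the update is done.
  sqlite3_exec(db, "PRAGMA cache_size = 2000;", nullptr, nullptr, nullptr);
  return ok;
}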

This sucks, but it's all I can think of. Any thoughts or suggestions?