Comment 65 for bug 131094

Jamie McCracken (jamiemcc-blueyonder) wrote :

Hi Jamie Lokier,

Just to clear up a few issues:

1) All the DBs (B-tree, hash, Berkeley DB) only have an API for updating one key/value at a time, AFAIK.

2) In our case, the final index merge is not updating anything, as it is creating a new index. The disk space for hits is therefore contiguous: regardless of which word we start from, space is allocated on a first come, first served basis (i.e. it is appended), so basically it does not matter what word order we choose. Access to the buckets in the header is random, of course, but they are fixed in the first 1MB of the index (256,000 buckets at 32 bits each).

3) All major indexers, Lucene (Beagle/Strigi) and Google, use index merges, as updating a big index is slow, and merging also helps remove deleted entries and fragmentation. Without merges no index would be scalable.

4) We don't want to use multiple tables, and SQL DBs are not appropriate as they store the word twice (once in the index and once in the table), hence bloating things up.

5) The high-end Oracle RDBMS has support for clustered tables, which allow storage of rows in key order (normal tables are appended to and only indexes are sorted). These are not practical, as they are even more painful to update due to massive relocation (in fact it is far quicker to append records and then copy them to a new table in sorted order).

6) Performance problems with existing merges disappear on XFS (they merge in seconds as opposed to minutes on EXT3). If EXT4 gets similar delayed allocation then hopefully we will see the same there too.
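
To make point 2 concrete, here is a rough Python sketch of a merge that writes a brand new index: a fixed header of 256,000 32-bit buckets, followed by hit records that are simply appended in whatever order the words arrive. It is only an illustration. The record layout, the CRC32 bucket hash, the collision chaining and the names merge_indexes, lookup and bucket_for are all assumptions made up for the example and are not the real on-disk format.

```python
import io
import struct
import zlib

NUM_BUCKETS = 256_000                     # bucket count quoted in point 2
BUCKET_SIZE = 4                           # 32 bits each
HEADER_SIZE = NUM_BUCKETS * BUCKET_SIZE   # 1,024,000 bytes, roughly 1MB
RECORD_HEAD = "<IHI"                      # assumed layout: next offset, word length, hit count


def bucket_for(word):
    # Assumption: CRC32 stands in for whatever hash the real index uses.
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS


def merge_indexes(partials, out):
    """Merge partial indexes (dicts of word -> doc ids) into a NEW index in `out`."""
    merged = {}
    for partial in partials:
        for word, hits in partial.items():
            merged.setdefault(word, set()).update(hits)

    out.write(bytes(HEADER_SIZE))         # reserve the fixed bucket header
    heads = {}                            # bucket -> offset of newest record in its chain
    for word, hits in merged.items():     # any word order works: records are appended
        bucket = bucket_for(word)
        offset = out.tell()
        raw = word.encode("utf-8")
        ordered = sorted(hits)
        out.write(struct.pack(RECORD_HEAD, heads.get(bucket, 0), len(raw), len(ordered)))
        out.write(raw)
        out.write(struct.pack(f"<{len(ordered)}I", *ordered))
        heads[bucket] = offset

    # The only random writes land inside the first HEADER_SIZE bytes.
    for bucket, offset in heads.items():
        out.seek(bucket * BUCKET_SIZE)
        out.write(struct.pack("<I", offset))


def lookup(index, word):
    """Return the sorted doc ids for `word`, or an empty list if it is absent."""
    index.seek(bucket_for(word) * BUCKET_SIZE)
    (offset,) = struct.unpack("<I", index.read(BUCKET_SIZE))
    raw = word.encode("utf-8")
    while offset:                         # offset 0 means end of chain / empty bucket
        index.seek(offset)
        nxt, wlen, count = struct.unpack(RECORD_HEAD, index.read(10))
        if index.read(wlen) == raw:
            return list(struct.unpack(f"<{count}I", index.read(4 * count)))
        offset = nxt
    return []


if __name__ == "__main__":
    buf = io.BytesIO()
    merge_indexes([{"cat": [1, 2], "dog": [3]}, {"cat": [2, 5]}], buf)
    print(lookup(buf, "cat"), lookup(buf, "dog"))
```

Since the output is a fresh file, nothing is ever relocated: the hit data grows at the end of the file, and the scattered bucket writes stay within the first 1MB.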