Comment 157 for bug 541511

Indan (indan) wrote:

(In reply to comment #127)
> The fact that it's 512 shows that the problem is a failed cacheflush and
> nothing else (this is actually what I've checked). The chipset flush
> checker changes the place it writes the check value every chipset flush.
> And it reuses the same place every 512th chipset flush. So when the
> chipset flush fails, the old value is there, which should be exactly 512
> less than what's expected.

Yeah, I figured it would be that, reading through your old comments.

By the way, I think I got those failed flushes without xcompmgr running.
(I killed it to see if there was any difference.) That might explain why I
didn't see failed flushes before: xcompmgr is more or less always running.

My case might be related to suspend, because both failures happened within
a minute or so of resume.

I wish I knew a way to trigger it easily; right now it takes days to test
anything.

> Well, there is _no_ way to do a reliable flush. And the hw docs explicitly
> say so. But we need to move stuff in/out of the graphics mem (i.e. the
> gtt). The other option would be to copy stuff in/out, which is even worse:
> - Wastes memory (actually simply uses twice as much for everything).
> - Would be even slower than what my hack currently does.
>
> And to add insult to injury, some of the chipsets from the 2nd gen (i8xx)
> suffer from other cache coherency problems in addition to this.

What I don't understand is why your patch slows things down so much for me;
it seems to do only a few thousand flushes anyway.

I guess copying around is what the old drivers did?

Some random ideas:

- Increase I830_MCH_WRITE_BUFFER_SIZE?

- Instead of writing zeroes, actually change the content of the flush page.
  Flushing caches doesn't seem to do much if the new content is the same as
  the old one?

- The text you quoted in one of your commit messages said that the memory
  content isn't coherent, but it didn't say anything about the mapping itself.
  Can't you update the gtt mapping to effectively flush it? I mean, if you
  move pages out of the gtt and back in, shouldn't that flush the old content?
  Maybe move it to a different index, e.g. insert new mapping to the start
  instead of the end, in case the hw caches it by address+index. Similar to
  Chris Wilson's gtt disabling thing, but instead of disabling, altering it
  in a smart, flush causing way.

If the problem is that the flush is needed to keep the hardware from writing
stale data to old gtt mapped physical memory:

- If an entry is added, there should be no need for a flush, because all the
  memory is still valid. If an entry is removed, the gpu can continue to write
  to those pages. What about copying the content to a new physical page and
  keeping the original page for a while until the gpu is done with it?

(I don't know what I'm talking about, just trying to inspire you to come up
with some genius plan to solve all problems. :-)

> Ok, that's bad. Can you change the following define in
> include/drm/intel-gtt.h and see whether you still get failed chipset
> flushes?
>
> -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
>
> The whole stuff makes somewhat more sense this way around, anyway.

I will try this later, first I'm going to try without your latest commit
("fix i85x gtt chipset flush") to see how it behaves without that stuff,
both performance and amount of failed flushes.

> Oh, and add some details about your box, please (brand&model + cpu,
> mostly, the rest is all in the dmesg, anyway).

See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
stepping 6; it has clflush).

But the hangs are gone, so I'm happy. I prefer slight glyph corruption that goes
away when I cause a refresh (e.g. increase text size) with snappy performance to
the sluggishness caused by the current patch.