lost interrupts... help please

Post by ales » Tue Aug 22, 2006 8:26 am

Hello there,
I got a BP6 just recently. I got it for nothing, with no guarantee that it works.
I put in the processors and all the components and switched it on. It booted
nicely, so I upgraded the BIOS to the latest release available. I was aware of
the issues the BP6 has, hence I tested it with memtest86+. The board didn't pass
test #8 and also #10. I knew it was a board problem, because I was sure that the
memory sticks were OK since they had been tested in another machine.
Replacing the EC10 capacitor with a 1500µF, 6.3V one solved the problem.
Linux installation was a breeze and after a short while I had the BP6 up and
running. The machine was quite stable, but occasionally there were nasty
DMA timeouts, especially while copying big files, which led to corruption. I've
also found out that the /proc/interrupts statistics show some errors, especially
during long compiles. From what I've learned from the kernel docs these are not
dangerous, since these are errors detected by the IO-APIC, which repeats
the transaction that failed. But it indicates some issue with interrupt
distribution, and I've got a suspicion that the DMA timeout problem is somehow
related to that. As I said earlier, the timeouts happened rather rarely and
apart from that the machine was very stable. Hence I thought that all the
problems were caused by heat (these were very, very hot summer days), but after
installing lm_sensors I realized that I was mistaken, since
all the crucial temperatures were within limits (outdoor temperatures
have dropped significantly since then as well). The same goes for voltages:
all swings are well within limits. Installing X-Window made
the problem even worse. I had to switch off GLX (and DRI) since the machine
sooner or later froze completely. From what I can tell, a lock-up occurs
when there are outstanding interrupts from the GFX board and the IO subsystem
simultaneously (that is, when the X-Window system is accelerated and a disk
request happens to be serviced). That happens more frequently when the machine
is under heavy load, but very often when the machine is almost idle as well.
My suspicion is on interrupt distribution, that is, lost interrupts from the IO
subsystem cause the DMA timeouts, and similarly lost interrupts from the GFX
board lock up X.
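
For anyone who wants to watch the same counters: a minimal Python sketch of
pulling the per-CPU interrupt counts and the ERR line out of /proc/interrupts.
The sample text and all numbers in it are invented for illustration, not taken
from this machine, and the function names are just placeholders.

```python
# Illustrative sample in the layout of /proc/interrupts on a 2-CPU x86 box.
# All counts below are made up.
SAMPLE = """\
           CPU0       CPU1
  0:     812345     809876    IO-APIC-edge  timer
 14:      10234      10101    IO-APIC-edge  ide0
 18:      55321      54990   IO-APIC-level  radeon@pci:0000:01:00.0
NMI:          0          0
ERR:         37
MIS:          0
"""


def parse_interrupts(text):
    """Return (cpu_names, {label: [count, ...]}) from /proc/interrupts text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    counts = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        per_cpu = []
        # Leading numeric fields are counts; ERR/MIS carry a single total.
        for field in rest.split()[:len(cpus)]:
            if not field.isdigit():
                break
            per_cpu.append(int(field))
        counts[label.strip()] = per_cpu
    return cpus, counts


def error_count(counts):
    """Total of the ERR line, or 0 if the kernel does not report one."""
    return sum(counts.get("ERR", []))


cpus, counts = parse_interrupts(SAMPLE)
print(cpus, counts["14"], error_count(counts))
```

On a live system the text would come from open("/proc/interrupts").read();
comparing error_count() before and after a long compile shows whether the
errors track the load.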

I was aware of the IRQ sharing from the beginning and placed all the add-on
cards so that they don't share an interrupt. So that shouldn't be a problem.
I also set the interrupt for every IRQ line manually in the BIOS. I've tried
with ACPI on as well as off. Tried MP specification v1.4 as well as v1.1.
However, as you might imagine, to no avail. One solution might seem to be to set
the interrupt affinity to one processor, but since the APIC errors happen on
both, it probably wouldn't work. On the other hand, what did help was removing
one processor from its socket. I tried both CPUs in each socket, and with only
one CPU in either socket the machine was rock solid: no DMA timeouts, no APIC
errors, no lock-ups whatsoever. I will repeat myself, but the whole problem
seems to be in the distribution of interrupts among CPUs, which does not happen
when there is only one CPU in the system. As I've mentioned above, I have
replaced the EC10 capacitor already and the rest of the caps seem to be in shape
(no leakage, no voltage swings). Will replacing all the capacitors help?
Anybody with a similar problem? Any bright ideas?
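
For reference, pinning an IRQ to one CPU on Linux is done by writing a
hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity as root. A minimal
Python sketch of building that mask follows; the function names and the IRQ
number in the usage lines are only illustrative, and nothing here says it
would cure the problem described above.

```python
def affinity_mask(cpus):
    """Build the hex bitmask string for a list of CPU numbers (bit n = CPU n)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_path(irq):
    """Path of the per-IRQ affinity file under /proc."""
    return "/proc/irq/%d/smp_affinity" % irq


# Example: route a hypothetical IRQ 14 to CPU0 only.
print(affinity_path(14), affinity_mask([0]))
# Applying it needs root, e.g.:
#   with open(affinity_path(14), "w") as f:
#       f.write(affinity_mask([0]))
```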


Abit BP6, BIOS RV release, HPT 1.30b
|-> Celeron 500MHz, FSB 66MHz (no overclock)
|-> Celeron 500MHz, FSB 66MHz (no overclock)
|-> 128MB, 100MHz SDRAM
|-> 128MB, 100MHz SDRAM
|-> 128MB, 100MHz SDRAM
|-> AGP -> Radeon 7000, 64MB DDR
|-> PCI
|-> PCI
|-> PCI
|-> PCI -> SoundBlaster 128
|-> PCI
|-> ISA
|-> ISA
|-> PIIX4 IDE -> Seagate Medalist ST33210A, 3.2GB
|-> PIIX4 IDE -> Teac CD-W552E 52x CD-RW
|-> HPT366 -> Western Digital Protege WD200AA 20GB
|-> HPT366 -> Western Digital Caviar WD205EE 20.5GB

Linux 2.6.17, GCC 4.1.1, Xorg 7.1

Post by Dave Rave » Wed Aug 23, 2006 3:20 am

I haven't read it all
but if you get errors in memtest, change out a memory module and re-test
until you don't get errors

errors are bad!!!

Post by KliK » Wed Aug 23, 2006 10:11 am

test the board with only one RAM in place...
so first 128MB always in slot1...do the test overnight...if it's OK, get it out & test another one...and after that a third one...

if everything is OK, then test the first two, until you have a combination without errors...also do that with all three RAMs in place...

might be the slots on the board...might be the RAMs (in 99% it's that)...might be the different RAMs...might be the CAS2 that the RAMs can't handle, so go to CAS3 - might be slower, but it will be stable!!

Post by ales » Fri Aug 25, 2006 7:01 am

Well, I did what you proposed already, and the memory problem was solved by replacing the EC10 capacitor, which is what I wrote in the first post. The problem I have right now is lost interrupts and unstable behaviour. At least it seems like lost interrupts, or maybe unresolved collisions. It might be the HPT366. I'm going to try the system with no disks attached to the HPT. Please read the whole article (my first post). Anyway, thank you for your interest.

Post by ales » Thu Sep 07, 2006 10:00 am

Hello again,
The crashes of X-Window seem to be finally solved. I tried the machine
without any devices connected to the HPT366, but X locked up when I put it in
accelerated mode (DRI & GLX) with the same symptoms as before. I have also
swapped the Radeon 7000 gfx board for a Radeon 9000 and to my big surprise
X-Window works nicely with acceleration and all the optimizations turned on
(color tiling, page flipping...). Well, it is great, but I just don't
understand it. The Radeon 7000 works absolutely flawlessly on uniprocessor,
but is in all sorts of trouble in SMP mode. After all, it might have been a
software issue, but the driver is basically the same for both boards, although
the DRI portion is different for different families of chips (R100, R200,
R300...). The Radeon boards are well supported on Linux (at least up to model
9200), and again the board works without a hiccup on uniprocessor. I really
don't understand it. I've got no idea what experiences other BP6
people have with the Radeon 7000, but it doesn't work for me.

As I said, the Radeon 9000 works great; the only catch is that it produces a lot
of heat in 3D acceleration and comes only with a passive heatsink. It really
is red hot, you can barely touch the heatsink. Any ideas about that? It is,
of course, not a major problem to attach a fan to the heatsink, but I would
like to know the experiences of others with the Radeon 9000. Is such a high
temperature within the design limits of the chip, or is improving the airflow
strongly recommended?

That's all for the moment. Thanks for any comments.

Post by InactiveX » Thu Sep 07, 2006 7:08 pm

ales wrote: Please read the whole article (my first post). Anyway thank you for your interest.

I think you would get more people to look at your post if you separated it up into paragraphs.

What you've written looks very monolithic and unreadable, and more folks will look at it if you break it up a bit.

Regards, IX

Post by Derek » Wed Sep 20, 2006 9:50 pm

It took all my power to refrain from asking him to look under his desk for his lost interrupts :D

Post by hyperspace » Tue Oct 10, 2006 11:34 am

