Send Bristol mailing list submissions to
bristol@mailman.lug.org.uk
To subscribe or unsubscribe via the World Wide Web, visit
https://mailman.lug.org.uk/mailman/listinfo/bristol
or, via email, send a message with subject or body 'help' to
bristol-request@mailman.lug.org.uk
You can reach the person managing the list at
bristol-owner@mailman.lug.org.uk
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bristol digest..."
Today's Topics:
1. Re: Diagnosing a machine that freezes intermittently
(James Womack)
2. Re: Diagnosing a machine that freezes intermittently
(Adrian Portway)
----------------------------------------------------------------------
Message: 1
Date: Fri, 04 Apr 2014 11:23:10 +0100
From: James Womack <5inowsy1maiq@gmail.com>
To: Bristol and Bath Linux User Group <bristol@mailman.lug.org.uk>
Subject: Re: [bristol] Diagnosing a machine that freezes
intermittently
Message-ID: <533E880E.5010905@gmail.com>
Content-Type: text/plain; charset=windows-1252
Hi,
Resurrecting this thread to say that I wasn't able to identify the
true cause of this issue.
I eliminated as many individual components as I could (by swapping
out, disconnecting cards, disks etc and running individual tests for
memory, disks etc), tried different kernel versions and different
graphics drivers, but the problem still remained. I can only assume
that the problem lies somewhere on the main board, some subtle
hardware fault perhaps.
I have abandoned trying to find the fault now and switched to using a
spare machine I had, switching the drives over from the buggy machine.
In the process of trying to debug the machine I found an interesting
tool which may be of interest to other LUGers:
Breakin is a live bootable Linux distribution that stress tests your
system. It is really quite effective at pushing the RAM and CPU, and
you can leave it running for hours/days to really make sure your
system is functioning well.
http://www.advancedclustering.com/software/breakin.html
Incidentally, this ran for 24 hours on my buggy machine. It failed
only 1 cycle of the high performance Linpack in this time. Not sure
whether this was indicative of anything unusual, though, since one
might expect very rare failures, especially where ECC memory is not used.
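For anyone wondering what a "failed cycle" means in practice: a stress
test of this kind repeats a calculation whose answer is known and flags
any run that disagrees. The sketch below is only a rough illustration of
that idea in plain Python, not how Breakin or Linpack actually work. The
function names and the workload are my own invention, and it is single
threaded, so it is far gentler on the hardware than a real Linpack run.

```python
import hashlib
import math


def workload_digest(size):
    """Run a deterministic floating-point workload and hash every partial sum."""
    h = hashlib.sha256()
    acc = 0.0
    for i in range(1, size + 1):
        acc += math.sqrt(i) * math.sin(i)
        h.update(repr(acc).encode("ascii"))
    return h.hexdigest()


def count_failed_cycles(cycles, size=100_000):
    """Repeat the workload and count cycles whose digest differs from the first run.

    Assumption: the first run is taken as the reference, so a fault during
    that run would go unnoticed or be misattributed to later cycles.
    """
    reference = workload_digest(size)
    failures = 0
    for _ in range(cycles):
        if workload_digest(size) != reference:
            failures += 1
    return failures


if __name__ == "__main__":
    # On healthy hardware every cycle should match the reference.
    print("failed cycles:", count_failed_cycles(cycles=3, size=10_000))
```

A single mismatch over many hours would show up here as a count of 1,
which is the same kind of result I saw, and just as hard to interpret
on its own.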
Regards,
James
On 18/03/14 14:09, James Womack wrote:
> Okay, getting somewhere now. The problem seems to be graphics card
> or graphics driver related.
>
> I switched back to lightdm from gdm, and this time, instead of
> becoming completely unresponsive, only the GUI froze, and I could
> SSH into the machine. The process /usr/bin/X was using 100% CPU,
> and the Xorg.0.log file ends with the lines
>
> [ 14392.979] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x0000bc38, 0x0000bf6c)
> [ 14399.967] (WW) NVIDIA(0): WAIT (1, 6, 0x8000, 0x0000bc38, 0x0000bf6c)
> [ 14431.419] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x0000b298, 0x0000b5cc)
> [ 14436.394] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x0000b5cc, 0x0000b5cc)
> [ 14453.653] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x00002694, 0x00006ba4)
> [ 14459.162] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00006ba4, 0x00006ba4)
> [ 14462.496] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x00004760, 0x00006638)
>
> I think this is reasonable evidence that something odd is going on
> graphics-wise.
>
> On 17/03/14 22:20, James Womack wrote:
>> Okay, memtest86+ ran without errors overnight. I have also tried
>> to stress the GPU by running many simultaneous instances of
>> glxgears, but the GPU temperature just hovers around 72 C, which I
>> gather is a respectable temperature for an nVidia GPU.
>>
>> Have just set stress to run for 5 hours with 32 CPU threads,
>> though no matter how many threads I threw at my CPU, the system
>> remained responsive. CPU temps have increased by about 10 C,
>> though. The only way to get a noticeable lag in the GUI was to
>> use up so much RAM that swapping was necessary, but then the CPU
>> is not being stressed because it is having to wait for memory
>> reads. I wonder if there is something that is more effective at
>> stressing the machine? On Windows I have used Prime95 very
>> effectively in the past.
>>
>> Thanks for all the help so far!
>>
>> On 17 Mar 2014 13:53, "Nigel Sollars" <nsollars@gmail.com> wrote:
>>
>> Hi,
>>
>> Not for nothing, have you tried just running it run level 3
>> style? I had a laptop kinda do this; it turned out to be a dying
>> graphics adapter.
>>
>> Regards
>>
>>
>> On Mon, Mar 17, 2014 at 7:00 AM, James Womack
>> <5inowsy1maiq@gmail.com> wrote:
>>
>> On 17/03/14 09:50, Shane McEwan wrote:
>>> It's almost certainly a hardware issue. The Linux kernel is usually
>>> very good at oopsing whenever software does something it's not
>>> supposed to. Upgrade to the latest nVidia driver if you think that
>>> could be a problem, but in my experience you'll get errors in the
>>> logs, X crashes or no video at all if there's a driver problem. I
>>> can't remember an nVidia card or driver hanging the machine entirely,
>>> and I've supported studios with 100s of Linux machines with nVidia
>>> cards.
>>>
>>> Definitely run a memory test. Are all your fans running? Use
>>> something like Munin to monitor CPU and graphics card temperatures
>>> and fan speeds and see if there's a pattern to when the machine
>>> hangs.
>>>
>>> Some motherboards have diagnostic LEDs on them that can sometimes
>>> give you a clue what state it's in when it hangs.
>>>
>>> Shane.
>>>
>> Thanks, yes, I am beginning to suspect it is hardware based.
>> Thanks for your suggestions! Looks like I am going to have a fun
>> time getting to the root of this...
>>
>> -- "Science is a differential equation. Religion is a boundary
>> condition."
>>
>> Alan Turing
>>
--
James Womack
james.c.womack@gmail.com
http://jcwomack.co.uk
------------------------------
Message: 2
Date: Fri, 04 Apr 2014 11:46:09 +0100
From: Adrian Portway <adrian.portway@gmail.com>
To: Bristol and Bath Linux User Group <bristol@mailman.lug.org.uk>
Subject: Re: [bristol] Diagnosing a machine that freezes
intermittently
Message-ID: <533E8D71.50900@gmail.com>
Content-Type: text/plain; charset=windows-1252
Thanks James, always good to have new tools to add to the collection :)
Cheers,
Adrian
A penny saved is a Governmental oversight.
------------------------------
_______________________________________________
Bristol mailing list
Bristol@mailman.lug.org.uk
https://mailman.lug.org.uk/mailman/listinfo/bristol
End of Bristol Digest, Vol 544, Issue 3
***************************************