PF*_aivan_* tasks on Rosetta 4.0+ - 20% failure rate uncorrected for 3 months

Message boards : Number crunching : PF*_aivan_* tasks on Rosetta 4.0+ - 20% failure rate uncorrected for 3 months

Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 88190 - Posted: 30 Jan 2018, 15:40:07 UTC

I've been reporting errors in this task in the Rosetta 4.0+ thread since October, first in a task named "BBGCBeNTF2_24_fold_SAVE_ALL_OUT_516172_1197_0" under Rosetta 4.03 but more recently in tasks with the format PF*_aivan_SAVE_ALL_OUT_* under Rosetta 4.06

std::cerr: Exception was thrown:
chi angle must be between -180 and 180: nan

It only seemed occasional so I wasn't that bothered - it happens - but a closer examination reveals it's a bit more significant.

In my current task history I'm showing 111 tasks, of which 48 are Rosetta 4.06 and 63 are mini-Rosetta. 40 of the tasks haven't reported yet as they're in my queue (I complete 24 per day so my buffer is only 1.6 days).
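A quick check of that arithmetic, as a sketch using only the figures quoted above (the variable names are illustrative, not anything from BOINC):

```python
# Figures taken from the post; variable names are illustrative.
total_tasks = 111
rosetta_406_tasks = 48
minirosetta_tasks = 63
queued_tasks = 40
tasks_per_day = 24

# The two application counts should add up to the task history total.
assert rosetta_406_tasks + minirosetta_tasks == total_tasks

buffer_days = queued_tasks / tasks_per_day      # days of work sitting in the queue
reported_tasks = total_tasks - queued_tasks     # tasks that have already reported

print(f"buffer: {buffer_days:.2f} days")
print(f"reported so far: {reported_tasks} tasks")
```

The buffer works out to roughly 1.67 days, which matches the "1.6 days" figure if it was truncated rather than rounded, and 71 tasks have reported.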

Of the completed tasks, 20% of all Rosetta 4.06 tasks are reporting "Error while computing" with this one specific error message.

Apart from that I think the only errors I get are caused by my computer crashing/locking up (I overclock and run 24/7 so the cause is more likely down to me) - maybe once or twice a month, so not significant.

Can someone look into this further? I've been reporting it since October, and a 20% failure rate is very high. I don't know whether you're getting useful data out of these tasks: they run the full 8 hours, but validation fails and no credit is awarded for ~56 hours of work per week. Thanks.

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88192 - Posted: 30 Jan 2018, 17:05:37 UTC - in response to Message 88190.  
Last modified: 30 Jan 2018, 17:22:09 UTC

> I've been reporting errors in this task in the Rosetta 4.0+ thread since October, first in a task named "BBGCBeNTF2_24_fold_SAVE_ALL_OUT_516172_1197_0" under Rosetta 4.03 but more recently in tasks with the format PF*_aivan_SAVE_ALL_OUT_* under Rosetta 4.06

> std::cerr: Exception was thrown:
> chi angle must be between -180 and 180: nan

> It only seemed occasional so I wasn't that bothered - it happens - but a closer examination reveals it's a bit more significant.

> In my current task history I'm showing 111 tasks, of which 48 are Rosetta 4.06 and 63 are mini-Rosetta. 40 of the tasks haven't reported yet as they're in my queue (I complete 24 per day so my buffer is only 1.6 days).

> Of the completed tasks, 20% of all Rosetta 4.06 tasks are reporting "Error while computing" with this one specific error message.

Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874

But I have no problems on my Intel machines (i7-3770 on Ubuntu and i7-4771 on Win7 64-bit):
https://boinc.bakerlab.org/show_host_detail.php?hostid=3285911
https://boinc.bakerlab.org/show_host_detail.php?hostid=3118747

I think they need to fix their AMD stuff.

PS - I see a few errors on your Intel machines too that I would not expect. Are you overclocking?

Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 88193 - Posted: 30 Jan 2018, 21:23:53 UTC - in response to Message 88192.  
Last modified: 30 Jan 2018, 21:25:00 UTC

>> I've been reporting errors in this task in the Rosetta 4.0+ thread since October, first in a task named "BBGCBeNTF2_24_fold_SAVE_ALL_OUT_516172_1197_0" under Rosetta 4.03 but more recently in tasks with the format PF*_aivan_SAVE_ALL_OUT_* under Rosetta 4.06

>> std::cerr: Exception was thrown:
>> chi angle must be between -180 and 180: nan

>> It only seemed occasional so I wasn't that bothered - it happens - but a closer examination reveals it's a bit more significant.

>> In my current task history I'm showing 111 tasks, of which 48 are Rosetta 4.06 and 63 are mini-Rosetta. 40 of the tasks haven't reported yet as they're in my queue (I complete 24 per day so my buffer is only 1.6 days).

>> Of the completed tasks, 20% of all Rosetta 4.06 tasks are reporting "Error while computing" with this one specific error message.

> Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta.
> https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874

I noticed that, but it seems to be connected with a hardware problem with early Ryzens. My issue doesn't seem to crash out - just gives an error message, runs to completion then won't validate. I'm thinking it's more a coding issue with the task rather than the machine. Though you are certainly right - it's only this machine that throws up the errors. Otherwise, though, it's my most reliable machine over time.

> But I have no problems on my Intel machines (i7-3770 on Ubuntu and i7-4771 on Win7 64-bit):
> https://boinc.bakerlab.org/show_host_detail.php?hostid=3285911
> https://boinc.bakerlab.org/show_host_detail.php?hostid=3118747

Yup. I had just had a motherboard blow on an old Core 2 Quad and rebuilt it with an i3-8350 and both seem sweet with everything thrown at them (until I ramp the i3 up and ruin it)

> PS - I see a few errors on your Intel machines too that I would not expect. Are you overclocking?

Neither (yet). The old laptop is on its last legs and suffering some heat-related issues and sound-chip weirdness. The i3 had a problem with its 1st few jobs because I had some corrupted downloads. Everything fine after the first 10 minutes.

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88194 - Posted: 30 Jan 2018, 23:30:28 UTC - in response to Message 88193.  
Last modified: 30 Jan 2018, 23:46:31 UTC

>> Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta.
>> https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874

> I noticed that, but it seems to be connected with a hardware problem with early Ryzens. My issue doesn't seem to crash out - just gives an error message, runs to completion then won't validate. I'm thinking it's more a coding issue with the task rather than the machine. Though you are certainly right - it's only this machine that throws up the errors. Otherwise, though, it's my most reliable machine over time.

My Ryzen 1700 is one of the "fixed" ones built after the segfault problem was solved. It works great on WCG (MCM, MIP thus far), Universe (BHspin V2), LHC/SixTrack (SSE2 and AVX), DrugDiscovery (VINA and Smina) and GPUGrid (Quantum Chemistry). So they exercise enough different parts of the chip that I know it is OK. But the problems with Rosetta aren't just with Ryzens anyway, but with all the other AMD chips that I looked at as wingmen. They all had a higher failure rate than any of the Intel chips I saw. I don't know enough to propose a fix (except maybe to recompile it), but I am sure there are plenty of people here who can suggest something.

EDIT: I tried it on TN-Grid also. The fma version of "gene@home PC-IM v1.10" is faster on the Ryzen than the AVX version on my i7-4770, with no errors.

Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 88196 - Posted: 31 Jan 2018, 0:20:20 UTC - in response to Message 88194.  

>>> Those seem to be mainly on your AMD machine. I have reported the problems I had with my Ryzen 1700 earlier, and I no longer use it on Rosetta.
>>> https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874

>> I noticed that, but it seems to be connected with a hardware problem with early Ryzens. My issue doesn't seem to crash out - just gives an error message, runs to completion then won't validate. I'm thinking it's more a coding issue with the task rather than the machine. Though you are certainly right - it's only this machine that throws up the errors. Otherwise, though, it's my most reliable machine over time.

> My Ryzen 1700 is one of the "fixed" ones built after the segfault problem was solved. It works great on WCG (MCM, MIP thus far), Universe (BHspin V2), LHC/SixTrack (SSE2 and AVX), DrugDiscovery (VINA and Smina) and GPUGrid (Quantum Chemistry). So they exercise enough different parts of the chip that I know it is OK. But the problems with Rosetta aren't just with Ryzens anyway, but with all the other AMD chips that I looked at as wingmen. They all had a higher failure rate than any of the Intel chips I saw. I don't know enough to propose a fix (except maybe to recompile it), but I am sure there are plenty of people here who can suggest something.

> EDIT: I tried it on TN-Grid also. The fma version of "gene@home PC-IM v1.10" is faster on the Ryzen than the AVX version on my i7-4770, with no errors.

Okay, but when it produces an error that says "chi angle must be between -180 and 180: nan" it still sounds more like a coding error than a processor error
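To illustrate how a NaN ends up in a message like that, here is a minimal Python sketch. It is not Rosetta's code: `check_chi` and `chi_from_cosine` are hypothetical names, and where the NaN actually originates in the PF*_aivan_* tasks is not known from this thread. The sketch only assumes that some upstream calculation produced a NaN and that a range check then rejected it.

```python
import math


def chi_from_cosine(cos_value: float) -> float:
    """Hypothetical upstream step: recover an angle in degrees from a cosine."""
    # acos is only defined on [-1, 1]; rounding error can push the input
    # slightly outside that range. C and C++ return NaN in that case, while
    # Python raises ValueError, so the C/C++ behaviour is mimicked here.
    try:
        return math.degrees(math.acos(cos_value))
    except ValueError:
        return math.nan


def check_chi(angle_deg: float) -> float:
    """Hypothetical range check, written so that NaN is rejected."""
    # Every ordered comparison with NaN is False, so the chained comparison
    # fails for NaN just as it does for a genuinely out-of-range angle.
    if not (-180.0 <= angle_deg <= 180.0):
        raise ValueError(f"chi angle must be between -180 and 180: {angle_deg}")
    return angle_deg


try:
    check_chi(chi_from_cosine(1.0000001))
except ValueError as err:
    print(err)
```

The point of the sketch is only that the range check is the place where the NaN gets noticed, not necessarily where it was created. On its own, the message does not show whether the NaN came from the task's input, the application code, or a difference in floating-point behaviour between builds or CPUs.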

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88197 - Posted: 31 Jan 2018, 1:42:33 UTC - in response to Message 88196.  

> Okay, but when it produces an error that says "chi angle must be between -180 and 180: nan" it still sounds more like a coding error than a processor error

I suppose so. I was just responding to the view that there was something wrong with Ryzens (or AMD in general). It seems like a problem with coding to me too. But people who have tried to interact with the Rosetta developers, and who know a lot more about it than I do, have not had much luck. I hope they give some consideration to this problem.

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 88198 - Posted: 31 Jan 2018, 6:40:55 UTC - in response to Message 88197.  

>> Okay, but when it produces an error that says "chi angle must be between -180 and 180: nan" it still sounds more like a coding error than a processor error

> I suppose so. I was just responding to the view that there was something wrong with Ryzens (or AMD in general). It seems like a problem with coding to me too. But people who have tried to interact with the Rosetta developers, and who know a lot more about it than I do, have not had much luck. I hope they give some consideration to this problem.


I've crunched a lot of 4.06 tasks on my AMD FX-6300 without problems.
Is it an OS problem? Linux or Windows?

Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 88201 - Posted: 1 Feb 2018, 4:41:13 UTC - in response to Message 88198.  

>>> Okay, but when it produces an error that says "chi angle must be between -180 and 180: nan" it still sounds more like a coding error than a processor error

>> I suppose so. I was just responding to the view that there was something wrong with Ryzens (or AMD in general). It seems like a problem with coding to me too. But people who have tried to interact with the Rosetta developers, and who know a lot more about it than I do, have not had much luck. I hope they give some consideration to this problem.

> I've crunched a lot of 4.06 tasks on my AMD FX-6300 without problems.
> Is it an OS problem? Linux or Windows?

Doubt it. Windows 7 Home 64bit

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 88204 - Posted: 1 Feb 2018, 8:20:26 UTC - in response to Message 88201.  

> Doubt it. Windows 7 Home 64bit


I have Win10 (version 1709) 64 bit.

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88209 - Posted: 1 Feb 2018, 16:11:01 UTC - in response to Message 88198.  
Last modified: 1 Feb 2018, 16:12:33 UTC

> I've crunched a lot of 4.06 tasks on my AMD FX-6300 without problems.
> Is it an OS problem? Linux or Windows?

From looking at other users with AMD machines, it seems to occur on both Windows and Linux. But surely Rosetta can look at the error rates themselves. I am a bit concerned that they have not noticed it yet, or at least not commented. Or if it is wrong, they can just say so and I will look elsewhere. But I am planning a new Ryzen+ machine later this year, and if there is no new AMD application by then, why bother trying Rosetta? There are plenty of other projects for it.



©2024 University of Washington
https://www.bakerlab.org