March 2022 - WU error rates

jay
Joined: 12 Jan 08
Posts: 20
Credit: 195,801
RAC: 0
Message 105411 - Posted: 11 Mar 2022, 15:46:08 UTC

Greetings,
I am working on the non-vbox WUs and getting errors.
I compared my errored results with my wingman's.
For example, on one WU a Windows volunteer errored out in 20 seconds.
My Linux WU ran for 35939 seconds and had the error:
"Too many errors (may have bug) Too many total results" on validation.
See
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1315068895
What gives?
Is this a time of testing the WU?
If so, can they be tested by Rosetta before releasing?

For me it is a matter of electricity, heat, and non-productive work.


Another WU,
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1315163591
has already errored out for my Windows wingman, while my Linux box is still crunching.
(It has run for about 6 hours with 3.5 hours remaining. I am concerned about a similar failure.)

I looked on the forums for recent errors - but did not see any.
Is anyone else having problems, or seeing no errors?

THANKS,
Jay

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,543,381
RAC: 5,926
Message 105412 - Posted: 11 Mar 2022, 16:07:41 UTC - in response to Message 105411.  

Is this a time of testing the WU?

Yes

If so, can they be tested by Rosetta before releasing?

No, Ralph@Home is largely unused

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1675
Credit: 17,697,137
RAC: 20,072
Message 105414 - Posted: 11 Mar 2022, 22:41:12 UTC - in response to Message 105411.  

What gives?
Is this a time of testing the WU?
If so, can they be tested by Rosetta before releasing?

For me it is a matter of electricity, heat, and non-productive work.
Due to the nature of Rosetta work, Tasks that error out can still give useful results, which is why in most cases you will still get Credit for a Task that produces an error.

Unfortunately there has been little work (if any) on coding the applications so that such Tasks simply end early (as they should) instead of crashing out with an error. Only Tasks that are truly in error (i.e. not producing useful data) should actually error out.
And the applications should be fixed so that one version (e.g. Windows) doesn't produce errors when the other (e.g. Linux) doesn't (the reverse has also occurred in the past).

But since Rosetta 4.20 has pretty much been abandoned apart from the occasional small batch every so often, there is no such effort. Not surprising, as the new type of Rosetta Tasks (Python) have plenty of significant issues of their own, and there has been no updated application to address any of them.
If they aren't going to fix their current application, there's no way on Earth they're going to fix the old deprecated ones.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2119
Credit: 41,174,978
RAC: 11,583
Message 105420 - Posted: 12 Mar 2022, 2:41:39 UTC - in response to Message 105414.  

What gives?
Is this a time of testing the WU?
If so, can they be tested by Rosetta before releasing?

For me it is a matter of electricity, heat, and non-productive work.
Due to the nature of Rosetta work, Tasks that error out can still give useful results, which is why in most cases you will still get Credit for a Task that produces an error.

Unfortunately there has been little work (if any) on coding the applications so that such Tasks simply end early (as they should) instead of crashing out with an error. Only Tasks that are truly in error (i.e. not producing useful data) should actually error out.
And the applications should be fixed so that one version (e.g. Windows) doesn't produce errors when the other (e.g. Linux) doesn't (the reverse has also occurred in the past).

But since Rosetta 4.20 has pretty much been abandoned apart from the occasional small batch every so often, there is no such effort. Not surprising, as the new type of Rosetta Tasks (Python) have plenty of significant issues of their own, and there has been no updated application to address any of them.
If they aren't going to fix their current application, there's no way on Earth they're going to fix the old deprecated ones.

Taking a quick look, they're both "preetham" tasks.
I just had one here and it barely reached 2 seconds of CPU time before crashing.
Interesting to read that they last much longer on Linux than on Windows. That seems to be a pattern in recent times. Ugh...

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,543,381
RAC: 5,926
Message 105423 - Posted: 12 Mar 2022, 9:46:54 UTC - in response to Message 105414.  

But since Rosetta 4.20 has pretty much been abandoned apart from the occasional small batch every so often, there is no such effort. Not surprising, as the new type of Rosetta Tasks (Python) have plenty of significant issues of their own, and there has been no updated application to address any of them.


I continue to write messages (about, for example, multi-attach disks on VirtualBox) on Twitter to "stimulate" admins.
So far, without results.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,543,381
RAC: 5,926
Message 105424 - Posted: 12 Mar 2022, 9:53:38 UTC - in response to Message 105420.  

Interesting to read that they last much longer on Linux than on Windows. That seems to be a pattern in recent times. Ugh...


They write the native code on Linux and afterwards compile it for other platforms.
So they probably don't pay attention to this part of the coding (which is as important as writing the code).

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 105425 - Posted: 12 Mar 2022, 16:15:30 UTC - in response to Message 105423.  

I continue to write messages (about, for example, multi-attach disks on VirtualBox) on Twitter to "stimulate" admins.
So far, without results.

I could live with their other problems, but not the high write rates to the SSD. Even with a huge (32 GB) write cache, and running only six work units on 50% of the cores of a Ryzen 3600 (Ubuntu 20.04.4), I was seeing writes to disk of over 800 GB/day. It is probably because of how they handle the .VDI files; computezrmle tells them how to do it.

I get the impression that this researcher has never developed a program for BOINC before, and isn't interested in learning how to do it now.
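
For scale, the write volume quoted above can be turned into an average rate with some rough arithmetic. The sketch below is illustrative only: the 800 GB/day and six work units come from the post, while the decimal-unit convention and the 600 TBW endurance rating are assumptions, not figures for any particular drive.

```python
# Back-of-the-envelope arithmetic for the write load reported above.
# Decimal units are assumed throughout (1 TB = 1000 GB, 1 GB = 1000 MB).

SECONDS_PER_DAY = 24 * 60 * 60


def average_mb_per_second(gb_per_day: float) -> float:
    """Average sustained write rate implied by a daily write volume."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


def days_until_rating(tbw_rating_tb: float, gb_per_day: float) -> float:
    """Days of writing at this pace needed to reach a drive's TBW rating."""
    return tbw_rating_tb * 1000 / gb_per_day


REPORTED_GB_PER_DAY = 800   # from the post
WORK_UNITS = 6              # from the post
HYPOTHETICAL_TBW_TB = 600   # assumed endurance rating, for illustration only

print(f"Average rate:  {average_mb_per_second(REPORTED_GB_PER_DAY):.1f} MB/s")
print(f"Per work unit: {REPORTED_GB_PER_DAY / WORK_UNITS:.0f} GB/day")
print(f"Days to reach {HYPOTHETICAL_TBW_TB} TBW: "
      f"{days_until_rating(HYPOTHETICAL_TBW_TB, REPORTED_GB_PER_DAY):.0f}")
```

At 800 GB/day this comes to roughly 9.3 MB/s averaged over the whole day, or about 133 GB/day per work unit. Under the assumed 600 TBW rating, that pace would reach the rated endurance in 750 days.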

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 105426 - Posted: 12 Mar 2022, 18:52:37 UTC - in response to Message 105425.  

I continue to write messages (about, for example, multi-attach disks on VirtualBox) on Twitter to "stimulate" admins.
So far, without results.

I could live with their other problems, but not the high write rates to the SSD. Even with a huge (32 GB) write cache, and running only six work units on 50% of the cores of a Ryzen 3600 (Ubuntu 20.04.4), I was seeing writes to disk of over 800 GB/day. It is probably because of how they handle the .VDI files; computezrmle tells them how to do it.

I get the impression that this researcher has never developed a program for BOINC before, and isn't interested in learning how to do it now.



You're just stating the obvious.
As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy.
We have discussed this to the end of the world and beyond.
They don't care about the PC side as long as they get a pretty good result.
They do not monitor Twitter as far as I know, and they are never here anymore.
They are not open to suggestions. Their way works, so why change it or update it?
RALPH, they should shut that off. They never use it.
That's the summary of it all.

You get what you get. If it works well on Linux, great, then they have a result from Linux.
If it works on Windows, even better; if it doesn't, oh well.

The only thing that is important to them is their neural network system. The PCs are a nice addition.
Very much the way BOINC TACC works.

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 105427 - Posted: 12 Mar 2022, 20:52:12 UTC - in response to Message 105426.  

You're just stating the obvious.
As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy.
We have discussed this to the end of the world and beyond.

I have been discussing it for far longer.
And you managed to miss the point about the writes. Maybe you don't monitor them?

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 105435 - Posted: 13 Mar 2022, 10:31:03 UTC - in response to Message 105427.  

You're just stating the obvious.
As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy.
We have discussed this to the end of the world and beyond.

I have been discussing it for far longer.
And you managed to miss the point about the writes. Maybe you don't monitor them?


I probably don't write as much data to my drive as you do.
So far on my oldest drive I have written around 65 TB and it is still in good health.
According to Samsung's information I am just approaching the middle age of the drive.

Wasn't it you who talked about a cache program that would reduce the writes?
But again, it's another topic that has been discussed and ignored by the team, so why holler on about it?
They obviously don't care.

I've said it a lot already and others say the same. The team does NOT care about PC users.
They only care about their neural network.
We get the scraps or the wild ideas in whatever form they come in.
They do not change anything. You get what you get, good or bad, large or small.
If you burn out your drive due to the writes, that's of no concern to them.
There will always be someone to take your place.

We can make suggestions and complain all we want, but they are NOT interested.
That is very clear here on the message boards, via Twitter, and from the one person who can get through to them: they just acknowledge the email and do nothing.

As long as they get the data by whatever means necessary they are happy. If machines fail or people quit, that does not matter.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,543,381
RAC: 5,926
Message 105438 - Posted: 13 Mar 2022, 16:36:59 UTC - in response to Message 105435.  

They obviously don't care.

I've said it a lot already and others say the same. The team does NOT care about PC users.
They only care about their neural network.
We get the scraps or the wild ideas in whatever form they come in.
They do not change anything. You get what you get, good or bad, large or small.
If you burn out your drive due to the writes, that's of no concern to them.
There will always be someone to take your place.

We can make suggestions and complain all we want, but they are NOT interested.
That is very clear here on the message boards, via Twitter, and from the one person who can get through to them: they just acknowledge the email and do nothing.

As long as they get the data by whatever means necessary they are happy. If machines fail or people quit, that does not matter.



I'm starting to think you're right.
I love this project and I've supported it for YEARS, but I'm starting to get a little tired.

xroule
Joined: 9 Feb 15
Posts: 4
Credit: 58,740,306
RAC: 11,763
Message 105465 - Posted: 16 Mar 2022, 15:34:34 UTC - in response to Message 105438.  

With 3371 WUs and 3118 errors in 12 hours, I can't wait for WCG to reopen. For now, this is the only project for me. What a waste of resources!
9 PM, do you know what your PC is doing??
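
As a quick check on how these counts compare with the 90 to 95% clean-result figure mentioned earlier in the thread, the failure rate can be worked out directly. This is a minimal sketch using only the two numbers from the post above; the function name is made up for illustration.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of work units that ended in an error."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


TOTAL_WUS = 3371   # work units reported in the post
ERRORS = 3118      # errors reported in the post

rate = error_rate(ERRORS, TOTAL_WUS)
print(f"Errored: {rate:.1%}")
print(f"Clean:   {1 - rate:.1%}")
```

That is about 92.5% errored and 7.5% clean on this host over the 12 hours, roughly the inverse of a 90 to 95% clean rate.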

keputnam
Joined: 18 Sep 05
Posts: 24
Credit: 2,088,785
RAC: 0
Message 105466 - Posted: 16 Mar 2022, 17:59:54 UTC - in response to Message 105426.  

"As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy."


really?

In the last 2 1/2 days I have had 140 "Error while computing" results. My wingmen have all had the same results.


They are NOT getting any results at all, and are awarding NO credit for these jobs

Jean-David Beyer
Joined: 2 Nov 05
Posts: 187
Credit: 6,391,649
RAC: 4,843
Message 105467 - Posted: 16 Mar 2022, 18:36:42 UTC - in response to Message 105426.  

As long as they get something around a 95% clean result (perhaps as low as 90) then they are happy.
We have discussed this to the end of the world and beyond.
They don't care about the PC side as long as they get a pretty good result.


Well, they sure are not getting 95% clean results from me and my "wingmen." We are getting 100% failure rates.
My wingmen and I run different hardware and different operating systems. Some Linux, some Windows. It does not matter: they all fail.

I disabled getting new work units last evening, and when I noticed more units added to the list today, I got a bunch more.
They all failed immediately. Over 300 failures in this batch just for me. So I disabled getting new work units again.

spiralfeel
Joined: 25 Apr 20
Posts: 1
Credit: 235,796
RAC: 0
Message 105477 - Posted: 16 Mar 2022, 20:25:05 UTC - in response to Message 105465.  

With 3371 WUs and 3118 errors in 12 hours, I can't wait for WCG to reopen. For now, this is the only project for me. What a waste of resources!

You should consider TN-Grid http://gene.disi.unitn.it/test/ and SiDock@home https://www.sidock.si/sidock/


©2024 University of Washington
https://www.bakerlab.org