Problems and Technical Issues with Rosetta@home

Falconet
Joined: 9 Mar 09
Posts: 164
Credit: 588,265
RAC: 2,019
Message 101241 - Posted: 11 Apr 2021, 18:59:36 UTC - in response to Message 101240.  
Last modified: 11 Apr 2021, 19:01:26 UTC

I've noticed that some of the latest Tasks aren't checkpointing properly, so if you interrupt them they will revert to the last successful checkpoint.
Next time, just let it run: the default time is 8 hours, and there is a 10 hour watchdog timer in case it's not done within 8 hours. If it's still going after 20 hours, then you might want to kill it off.


Grant, thank you for your reply. I don't quite understand your "20 hours" comment. I let the task run for 16 hours. If there is a watchdog timer at 10 hours, what is the difference between anything over 10 hours (e.g. 11 hours, 16 hours and 20 hours) not completing? Isn't it just stuck at that point?



The watchdog isn't at 10 hours. It's 10 hours AFTER whatever the CPU runtime setting is at. So, if you are running with the default setting, which is 8 CPU hours, then the watchdog will only kick in at 18 hours.

What Grant meant is that considering the watchdog should kick in at 18 hours, if the task is still running at 20 hours, you might want to abort it.
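
The arithmetic above can be written out as a short Python sketch. It assumes, as described in this thread, that the watchdog fires a fixed 10 hours after the target CPU runtime; the function names and the 2 hour safety margin are illustrative and are not part of Rosetta or BOINC.

```python
# Assumption from this thread: the watchdog fires a fixed 10 hours after the
# target CPU runtime. Names here are illustrative, not Rosetta or BOINC APIs.
WATCHDOG_GRACE_HOURS = 10.0


def watchdog_cutoff_hours(target_cpu_hours: float = 8.0) -> float:
    """CPU time at which the watchdog should end the task."""
    return target_cpu_hours + WATCHDOG_GRACE_HOURS


def worth_aborting(elapsed_cpu_hours: float,
                   target_cpu_hours: float = 8.0,
                   margin_hours: float = 2.0) -> bool:
    """True once a task has run past the watchdog cutoff plus a safety margin."""
    cutoff = watchdog_cutoff_hours(target_cpu_hours)
    return elapsed_cpu_hours > cutoff + margin_hours


if __name__ == "__main__":
    for target in (8.0, 22.0, 36.0):
        cutoff = watchdog_cutoff_hours(target)
        print(f"target {target:g} h -> watchdog at {cutoff:g} h")
    print("abort a default task at 16 h?", worth_aborting(16.0))
    print("abort a default task at 21 h?", worth_aborting(21.0))
```

With the default 8 hour target, a task at 16 hours is still inside the watchdog window, which is why waiting a little longer is the usual advice.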
ID: 101241
mrhastyrib
Joined: 18 Feb 21
Posts: 74
Credit: 523,116
RAC: 8,950
Message 101245 - Posted: 11 Apr 2021, 22:43:24 UTC - in response to Message 101238.  

No idea what you think I've changed

I know. It's that damned Dunning-Kruger thingy.
ID: 101245
jsm
Joined: 4 Apr 20
Posts: 3
Credit: 18,923,016
RAC: 60,260
Message 101247 - Posted: 12 Apr 2021, 6:50:00 UTC - in response to Message 101049.  

Running at 22 hours has substantially reduced the bandwidth hog, but detailed checking has turned up a query. All the computers are asking the scheduler every minute or so for new tasks, only to be told 'no can do you have plenty' (I paraphrase). This is clearly putting an unnecessary load on the scheduler and contributing to my bandwidth loss. Is there a way to instruct the preferences to seek additional work only every so often, e.g. once an hour?
capt
ID: 101247
Grant (SSSF)
Joined: 28 Mar 20
Posts: 927
Credit: 8,729,142
RAC: 22,739
Message 101248 - Posted: 12 Apr 2021, 7:51:08 UTC - in response to Message 101247.  
Last modified: 12 Apr 2021, 7:53:16 UTC

Running at 22 hours has substantially reduced the bandwidth hog, but detailed checking has turned up a query. All the computers are asking the scheduler every minute or so for new tasks, only to be told 'no can do you have plenty' (I paraphrase). This is clearly putting an unnecessary load on the scheduler and contributing to my bandwidth loss. Is there a way to instruct the preferences to seek additional work only every so often, e.g. once an hour?
capt
How often it asks for work depends on the number of cores/threads you have, the amount of time the system is actually able to process work, and, most importantly, on your cache settings.
The fact that many of your Tasks time out before you even return them due to missed deadlines indicates your cache setting is way, way, way, way too large. The estimated completion time for all Tasks is 8 hours, regardless of what your Target CPU time is set to.
So having a multi-day cache, combined with a Target CPU time longer than the default 8 hours, is going to result in endless requests for work and huge numbers of Tasks missing their deadlines.

In your computing preferences, under Other, set:
Store at least 0.01 days of work
Store up to an additional 0.01 days of work
Then they will stop trashing Work Units due to missed deadlines, and stop continually asking for more work.
If you go back to the default 8 hours in the future, you could then bump up the "Store at least 0.01 days of work" to something like 0.2 to maintain a reasonable buffer that won't result in missed deadlines when things change.
Grant
Darwin NT
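
The reasoning above (every Task is estimated at 8 hours, but actually runs for the Target CPU time) can be illustrated with a rough Python sketch. This is a simplified model, not the real BOINC work-fetch algorithm: the function names are made up for illustration, the 3 day deadline is an assumption, and the 2 day cache in the example is hypothetical.

```python
# Simplified model with assumed names and an assumed 3-day deadline.
# It is not the real BOINC work-fetch algorithm.
ESTIMATED_TASK_HOURS = 8.0  # per the thread: the estimate ignores the target runtime


def queued_hours_per_core(cache_days: float, target_cpu_hours: float) -> float:
    """Real hours of work queued per core when the cache is filled by estimate."""
    tasks_per_core = cache_days * 24.0 / ESTIMATED_TASK_HOURS
    return tasks_per_core * target_cpu_hours


def cache_risks_deadlines(cache_days: float,
                          target_cpu_hours: float,
                          deadline_days: float = 3.0) -> bool:
    """True if the queued work per core would take longer than the deadline."""
    return queued_hours_per_core(cache_days, target_cpu_hours) > deadline_days * 24.0


if __name__ == "__main__":
    # cache_days = "store at least" + "store up to an additional"
    for cache_days, target in ((2.0, 22.0), (2.0, 8.0), (0.02, 22.0), (0.2, 8.0)):
        hours = queued_hours_per_core(cache_days, target)
        risky = cache_risks_deadlines(cache_days, target)
        print(f"cache {cache_days} d, target {target:g} h: "
              f"{hours:.1f} h queued per core, deadline risk: {risky}")
```

In this model a 2 day cache is harmless at the default 8 hour target but queues 132 hours of real work per core at a 22 hour target, which is the mismatch being described.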
ID: 101248
Peter Hucker
Joined: 12 Aug 06
Posts: 692
Credit: 5,079,593
RAC: 33,194
Message 101249 - Posted: 12 Apr 2021, 9:51:39 UTC - in response to Message 101245.  

No idea what you think I've changed
I know. It's that damned Dunning-Kruger thingy.
No context, no conversation.
ID: 101249
Peter Hucker
Joined: 12 Aug 06
Posts: 692
Credit: 5,079,593
RAC: 33,194
Message 101250 - Posted: 12 Apr 2021, 9:54:49 UTC

The 6.5GB problem goes away on an 8GB machine if you set it to use 100% memory. It never actually uses 100% since everything overestimates. I just changed my old Boinc-only machines [1] and Rosettas downloaded and ran.

[1] Who has 8GB on a machine they actually interact with? You could maybe load Windows 10 and 1 application. But dare to play a game, or use email and a photo editor at once and it'll grind to a halt. Another example of modern shoddy lazy bloated programming. I can boot Linux off a 1GB flash drive. Yet Windows is 20 times bigger.
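
The memory limit mentioned at the top of this post comes down to a simple comparison, sketched below in Python. The function names are illustrative only, and the model assumes that a task is allowed to start when its declared memory requirement fits inside the share of RAM permitted by the preference; the real client and scheduler checks may differ in detail.

```python
# Illustrative names; assumes a task may start only if its declared memory
# requirement fits within the share of RAM allowed by the preferences.
def ram_allowed_gb(total_ram_gb: float, max_memory_percent: float) -> float:
    """RAM that BOINC may use under a 'use at most N% of memory' preference."""
    return total_ram_gb * max_memory_percent / 100.0


def task_fits(task_memory_gb: float,
              total_ram_gb: float,
              max_memory_percent: float) -> bool:
    """True if the task's declared requirement fits in the allowed RAM."""
    return task_memory_gb <= ram_allowed_gb(total_ram_gb, max_memory_percent)


if __name__ == "__main__":
    for percent in (50.0, 75.0, 100.0):
        allowed = ram_allowed_gb(8.0, percent)
        print(f"{percent:g}% of 8 GB = {allowed:.1f} GB, "
              f"6.5 GB task fits: {task_fits(6.5, 8.0, percent)}")
```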
ID: 101250
mrhastyrib
Joined: 18 Feb 21
Posts: 74
Credit: 523,116
RAC: 8,950
Message 101252 - Posted: 12 Apr 2021, 12:10:13 UTC - in response to Message 101249.  

No idea what you think I've changed
I know. It's that damned Dunning-Kruger thingy.
No context, no conversation.

Unless you are a relative -- which you are not -- it's not my duty to compensate for your inability to keep up with a conversation due to age-related infirmities. I counsel making use of Google.
ID: 101252
mikey
Joined: 5 Jan 06
Posts: 1806
Credit: 5,952,659
RAC: 214
Message 101254 - Posted: 12 Apr 2021, 13:19:53 UTC - in response to Message 101250.  

The 6.5GB problem goes away on an 8GB machine if you set it to use 100% memory. It never actually uses 100% since everything overestimates. I just changed my old Boinc-only machines [1] and Rosettas downloaded and ran.

[1] Who has 8GB on a machine they actually interact with? You could maybe load Windows 10 and 1 application. But dare to play a game, or use email and a photo editor at once and it'll grind to a halt. Another example of modern shoddy lazy bloated programming. I can boot Linux off a 1GB flash drive. Yet Windows is 20 times bigger.


Windows 10 runs just fine with 8GB of RAM, even on a laptop, and can even crunch BOINC projects quite well if you have the right processor and choose your projects wisely. Playing games is a whole other story, though, and you are correct unless you are playing a non-competitive game like Minecraft or the like. The size of the Windows OS is what it is; it's not like it can be changed by any of us, so you just learn to deal with what you have or you change to something else.
ID: 101254
MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 21,056,041
RAC: 30,986
Message 101255 - Posted: 12 Apr 2021, 13:27:43 UTC - in response to Message 101205.  
Last modified: 12 Apr 2021, 13:33:17 UTC

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

It's not just you. I've got 29 that failed across a number of machines. They are all miniprotein_relax8 series that have died after running for an hour.
BOINC blog
ID: 101255
Peter Hucker
Joined: 12 Aug 06
Posts: 692
Credit: 5,079,593
RAC: 33,194
Message 101256 - Posted: 12 Apr 2021, 14:14:59 UTC - in response to Message 101254.  

The 6.5GB problem goes away on an 8GB machine if you set it to use 100% memory. It never actually uses 100% since everything overestimates. I just changed my old Boinc-only machines [1] and Rosettas downloaded and ran.

[1] Who has 8GB on a machine they actually interact with? You could maybe load Windows 10 and 1 application. But dare to play a game, or use email and a photo editor at once and it'll grind to a halt. Another example of modern shoddy lazy bloated programming. I can boot Linux off a 1GB flash drive. Yet Windows is 20 times bigger.


Windows 10 runs just fine with 8GB of RAM, even on a laptop, and can even crunch BOINC projects quite well if you have the right processor and choose your projects wisely. Playing games is a whole other story, though, and you are correct unless you are playing a non-competitive game like Minecraft or the like. The size of the Windows OS is what it is; it's not like it can be changed by any of us, so you just learn to deal with what you have or you change to something else.
My aunt doesn't play games. She finds 4GB (Hewlett Packard actually sold her a laptop with such a stupidly pitiful amount, which could not be upgraded!) unusable, and 8GB OK if she only runs one program at a time; 12GB was needed just to use email and a photo editor. If I make a computer for someone it has 16GB, or 32GB for games or anything else demanding. I put 64GB in my own. Programmers don't code as neatly as they used to!
ID: 101256
Peter Hucker
Joined: 12 Aug 06
Posts: 692
Credit: 5,079,593
RAC: 33,194
Message 101257 - Posted: 12 Apr 2021, 14:17:21 UTC - in response to Message 101252.  

No idea what you think I've changed
I know. It's that damned Dunning-Kruger thingy.
No context, no conversation.

Unless you are a relative -- which you are not -- it's not my duty to compensate for your inability to keep up with a conversation due to age-related infirmities. I counsel making use of Google.
You seem confused. "Context" in this context (titter) means that you failed to quote enough text so I knew what the conversation was about. It has nothing to do with the hypothetical Dunning-Kruger bullshit. Virtually nobody can remember every single conversation they have; I'm probably in 200 of them.
ID: 101257
Peter Hucker
Joined: 12 Aug 06
Posts: 692
Credit: 5,079,593
RAC: 33,194
Message 101258 - Posted: 12 Apr 2021, 14:18:22 UTC - in response to Message 101255.  

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

It's not just you. I've got 29 that failed across a number of machines. They are all miniprotein_relax8 series that have died after running for an hour.
Same here, and on prehelical (although I didn't check the error type).
ID: 101258
CIA
Joined: 3 May 07
Posts: 99
Credit: 20,809,728
RAC: 14,256
Message 101260 - Posted: 12 Apr 2021, 16:38:23 UTC - in response to Message 101258.  
Last modified: 12 Apr 2021, 16:38:46 UTC

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

It's not just you. I've got 29 that failed across a number of machines. They are all miniprotein_relax8 series that have died after running for an hour.
Same here, and on prehelical (although I didn't check the error type).



Pretty much all of my miniprotein_relax8 units are seconds (meaning they failed on another machine before I got them), and almost all of them are completing but taking 18 hours to do so. They are creating very few decoys.

Example: https://boinc.bakerlab.org/rosetta/result.php?resultid=1366333671
ID: 101260
Peter Hucker
Joined: 12 Aug 06
Posts: 692
Credit: 5,079,593
RAC: 33,194
Message 101262 - Posted: 12 Apr 2021, 17:30:41 UTC - in response to Message 101260.  

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

It's not just you. I've got 29 that failed across a number of machines. They are all miniprotein_relax8 series that have died after running for an hour.
Same here, and on prehelical (although I didn't check the error type).



Pretty much all of my miniprotein_relax8 units are seconds (meaning they failed on another machine before I got them), and almost all of them are completing but taking 18 hours to do so. They are creating very few decoys.

Example: https://boinc.bakerlab.org/rosetta/result.php?resultid=1366333671
Have you changed the setting to allow 18 hours? Because all mine are sticking to the 8 hours. I'm getting 50% of the miniprotein_relax8 tasks completing in 8 hours, and the other 50% failing, usually after about 5 hours.
ID: 101262
CIA
Joined: 3 May 07
Posts: 99
Credit: 20,809,728
RAC: 14,256
Message 101263 - Posted: 12 Apr 2021, 18:18:03 UTC - in response to Message 101262.  
Last modified: 12 Apr 2021, 18:28:58 UTC


Pretty much all of my miniprotein_relax8 units are seconds (meaning they failed on another machine before I got them), and almost all of them are completing but taking 18 hours to do so. They are creating very few decoys.

Example: https://boinc.bakerlab.org/rosetta/result.php?resultid=1366333671
Have you changed the setting to allow 18 hours? Because all mine are sticking to the 8 hours. I'm getting 50% of the miniprotein_relax8 tasks completing in 8 hours, and the other 50% failing, usually after about 5 hours.


During the latest drought I had this machine set to 36 hours, but on Friday, when it became clear the drought had ended, I set it back to its normal default 8 hour runtime. So it's running for the standard 8 hours and then, as others have mentioned, 10 additional hours on top before the auto-cutoff happens.

All my other machines are set to 36 hours, and while none of them has completed any of these longer units, some are showing signs that it will happen to them too. For example, on one machine I have a miniprotein WU that is only 57% done 22 hours in. I have a feeling it's going to crunch for 46 hours (the 36 hour time limit plus the 10 hour cutoff).


/edit. Just to add a data point. While it's not conclusive, all the Miniprotein_relax8 units I'm getting that run long do "complete" and show as valid, even after going 10 hours over. Of these units that run over, many are "seconds" sent to me after other machines failed to process the WU. My machine is running OSX and completes them fine (apart from running 10 hours over). All the failed machines are Windows or Linux based. That said, I know Macs make up a small percentage of computers on this project, so I might just not have gotten a resend from a Mac in my small sample.
ID: 101263
mrhastyrib
Joined: 18 Feb 21
Posts: 74
Credit: 523,116
RAC: 8,950
Message 101264 - Posted: 12 Apr 2021, 21:50:03 UTC - in response to Message 101257.  

you failed to quote enough text so I knew what the conversation was about.


There was enough for you to recognize that I was replying to you, but not enough for you to remember what we were talking about, from a conversation within the past 24 hours, even though you knew it was you. Got it.

Just between us girls, isn't the real issue here the same as the one with "dood" and "@": you're immensely irritated at some features of my posting style. Including quoting only the essence of an exchange.

I think Letterman said it best: "An old man in a bathrobe on his front porch, shaking his fist at passing cars."
ID: 101264
Bryn Mawr
Joined: 26 Dec 18
Posts: 209
Credit: 6,406,026
RAC: 13,581
Message 101265 - Posted: 12 Apr 2021, 22:23:43 UTC - in response to Message 101255.  

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

It's not just you. I've got 29 that failed across a number of machines. They are all miniprotein_relax8 series that have died after running for an hour.


Thanks, I was hoping it was the tasks rather than my hardware.
ID: 101265
robertmiles
Joined: 16 Jun 08
Posts: 1061
Credit: 11,646,946
RAC: 7,412
Message 101268 - Posted: 13 Apr 2021, 0:59:28 UTC - in response to Message 101265.  

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

It's not just you. I've got 29 that failed across a number of machines. They are all miniprotein_relax8 series that have died after running for an hour.


Thanks, I was hoping it was the tasks rather than my hardware.

I've had several miniprotein_relax8 tasks fail also, but only one of them failed after one hour. The rest ran for at least two hours before failing. All were reissued to someone else, and either failed for that someone else as well, or aren't yet finished for that someone else.

I've thought of a possible reason why some tasks are set to ask for 6 GB of memory. Quite a bit more is loaded to produce a core dump if they fail, but isn't needed if they don't fail. Not the best idea, but possible.
ID: 101268
DizzyD
Joined: 23 Nov 20
Posts: 6
Credit: 1,281,984
RAC: 9,759
Message 101270 - Posted: 13 Apr 2021, 2:24:59 UTC - in response to Message 101263.  

/edit. Just to add a data point. While it's not conclusive, all the Miniprotein_relax8 units I'm getting that run long do "complete" and show as valid, even after going 10 hours over. Of these units that run over, many are "seconds" sent to me after other machines failed to process the WU. My machine is running OSX and completes them fine (apart from running 10 hours over). All the failed machines are Windows or Linux based. That said, I know Macs make up a small percentage of computers on this project, so I might just not have gotten a resend from a Mac in my small sample.

I am also running on a Mac. The miniprotein_relax8 units also complete after ~18.7 hours and provide credit; however, the credit is in the "two-hundred" range for 67,000+ seconds of work. So I've gone in and aborted all of the "ready to start" miniprotein_relax8 units, and now I have all pre-helical-bundles_round1_attempt1 queued up.
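
To put those numbers in perspective, the credit rate can be worked out with a couple of lines of Python. The figure of 200 credits is only an illustrative value taken from the low end of the "two-hundred" range, and the function name is made up for this sketch.

```python
def credit_per_hour(credit: float, cpu_seconds: float) -> float:
    """Credit granted per hour of CPU time."""
    return credit / (cpu_seconds / 3600.0)


if __name__ == "__main__":
    # 200 credits is an illustrative value; 67,000 s is the runtime quoted above.
    rate = credit_per_hour(200.0, 67_000.0)
    print(f"about {rate:.1f} credits per CPU hour")
```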
ID: 101270
Grant (SSSF)
Joined: 28 Mar 20
Posts: 927
Credit: 8,729,142
RAC: 22,739
Message 101272 - Posted: 13 Apr 2021, 7:38:49 UTC - in response to Message 101256.  

My aunt doesn't play games. She finds 4GB (Hewlett Packard actually sold her a laptop with such a stupidly pitiful amount, which could not be upgraded!) unusable, and 8GB OK if she only runs one program at a time; 12GB was needed just to use email and a photo editor.
The issue is the photo editor.
I know several people running Windows 10 systems with 4GB of RAM with no issues (I was one for quite some time myself). Of course, if you use software that requires huge amounts of RAM to do the work it needs to do, such as photo editing, then you need a system with the appropriate amount of RAM. That has always been the case.
It also helps (a massive amount) if you have an SSD and not an HDD.
Grant
Darwin NT
ID: 101272



©2021 University of Washington
https://www.bakerlab.org