Problems and Technical Issues with Rosetta@home

hangint3n

Joined: 23 Mar 20
Posts: 8
Credit: 1,958,078
RAC: 0
Message 98846 - Posted: 8 Sep 2020, 0:55:20 UTC - in response to Message 98812.  

Just had a similar problem on my box. It froze the whole thing up.

===
hangint3n

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98853 - Posted: 8 Sep 2020, 2:00:03 UTC - in response to Message 98838.  
Last modified: 8 Sep 2020, 2:01:11 UTC

You are correct, no, one shouldn't have to log in to see the image, now that you mention it. I'll just link to the thread, but beware, have your adblocker turned on: https://forums.anandtech.com/threads/recent-changes-in-projects.2500471/post-40275238

My tasks that timed out were not due to an inability to complete them, it was forgetfulness that I had 'temporarily' suspended Rosetta on that machine. ///insert forehead slap emoji here///

I would caution against having zero cache as you suggest....I pay too much for my energy bill to have my machines idle for ANY length of time (internet outage/server outage/server upgrade/home router locked up/etc etc). Rosetta has run dry many times and I do not check my machines but once daily.


I can get to the forum with your link, but clicking the image asks me to log in. I don't have an account.

And I have 11 ad blockers, will that do? Not only do they block ads, but also youtube video ads, EU cookie notices, government coronavirus advice, and links to grass people off in forums that used a naughty word.

Electricity isn't wasted when a PC is idle; it doesn't use much then.

I have all 6 machines displayed permanently on a monitor [1] in here, via Boinctasks. I spot immediately if one is playing up. The other 5 machines are in the garage where I can't hear the many fans, but usually I can sort stuff via Boinctasks or remote desktop.

[1] Correction, two monitors, one above the other. The list got too large with 5 GPUs and 66 cores.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98854 - Posted: 8 Sep 2020, 2:02:09 UTC - in response to Message 98839.  

Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.
Looks like it's been an issue forever.
J Stateson built a BOINC client to work around Milkyway's stuffed up server configuration.

Finally getting new tasks only seconds after running out. May not be worth the hassle.


Yes I've been attacking that problem a lot.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98855 - Posted: 8 Sep 2020, 2:04:18 UTC - in response to Message 98840.  

Peter Hucker wrote:
I run more than Milkyway and I need the buffer. Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.


MilkyWay needs us to run other projects' tasks that take more than 10 minutes, because that's the backoff the project requires: NO communication with MW for 10 minutes before it will send new GPU tasks. Personally I use PrimeGrid, as they have short tasks and respect the zero resource share. I run 1, maybe 2, PG tasks and then MW refills the cache and I am off crunching them again. If the GPU is not the fastest, then Collatz will work as a zero resource share project too.

If you want to go outside the norm, a user at MilkyWay made an alternative BOINC Manager that handles the 10 minute backoff so that it's not a problem. I don't know how, but people who use it say it works.


The 10 minutes isn't enforced by the MW servers. Boinc chooses to wait that long when its first request is denied. If you do a manual update after about 2 minutes, it gets them. So presumably the modified Boinc just changes that setting. Or it could stop Boinc reporting tasks every time it contacts the server; that would work.
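
The "tenth of the time" figure quoted above can be sanity-checked with some rough arithmetic. The sketch below is only an illustration: `duty_cycle` is a hypothetical helper, and it assumes MilkyWay gets no work at all during the back-off, using the numbers mentioned in this thread (a couple of 30 second tasks, a 10 minute back-off, or a manual update after about 2 minutes).

```python
def duty_cycle(tasks_per_fetch, task_seconds, backoff_seconds):
    """Fraction of wall-clock time spent on a project's tasks when every
    fetch is followed by an idle back-off period (a simplifying assumption)."""
    busy = tasks_per_fetch * task_seconds
    return busy / (busy + backoff_seconds)


# Two 30-second tasks, then a 10-minute back-off.
default_backoff = duty_cycle(2, 30, 600)
# Same tasks, but with a manual update after about 2 minutes.
manual_update = duty_cycle(2, 30, 120)

print(f"10 min back-off: {default_backoff:.1%} of the time on MW")
print(f"2 min update:    {manual_update:.1%} of the time on MW")
```

Under these assumptions the 10 minute back-off leaves MW running roughly 9% of the time, which is in line with the "tenth of the time" estimate, while a 2 minute retry would bring it to about a third.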

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,451,410
RAC: 20,088
Message 98861 - Posted: 8 Sep 2020, 2:29:23 UTC - in response to Message 98791.  
Last modified: 8 Sep 2020, 2:52:38 UTC

Problems and Technical Issues, eh? How about 41GB of RAM for ONE task? Name: ygG5REMC******1009391_1307_0
So far all of these reports of out of control Memory Tasks have been on Linux systems. Has anyone with a Windows system got one of the problem Tasks yet?


Edit-
Even if the RAM usage doesn't get out of control, it looks like they crash and burn anyway.

kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4701_0

              Outcome Computation error
         Client state Compute error
          Exit status 1 (0x00000001) Unknown error code
          Computer ID 3930525
             Run time 22 min 26 sec
             CPU time 21 min 53 sec
       Validate state Invalid
               Credit 0.00
    Device peak FLOPS 5.60 GFLOPS
  Application version Rosetta v4.20 x86_64-pc-linux-gnu
Peak working set size 617.72 MB
       Peak swap size 758.16 MB
      Peak disk usage 48.62 MB


Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3868745
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
20:45:43 (4601): called boinc_finish(1)

</stderr_txt>
]]>



And an out of control RAM error Task,

kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_893_0

              Outcome Computation error
         Client state Compute error
          Exit status 1 (0x00000001) Unknown error code
          Computer ID 3930525
             Run time 44 min 11 sec
             CPU time 44 min 11 sec
       Validate state Invalid
               Credit 24.00
    Device peak FLOPS 5.60 GFLOPS
  Application version Rosetta v4.20 x86_64-pc-linux-gnu
Peak working set size 19,307.60 MB
       Peak swap size 20,495.17 MB
      Peak disk usage 49.49 MB


Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3872553
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
19:10:32 (4261): called boinc_finish(1)

</stderr_txt>
]]>
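
Both failed tasks above stop at the same place in the Rosetta source. For anyone comparing many such reports, a small script can pull the relevant fields out of the stderr text. This is a minimal sketch: `summarize_stderr` is a hypothetical helper, the regular expressions are assumptions based only on the two dumps shown here, and the sample text is copied from the second one.

```python
import re

# Excerpt copied from the stderr output posted above.
SAMPLE_STDERR = """\
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
19:10:32 (4261): called boinc_finish(1)
"""


def summarize_stderr(stderr_text):
    """Extract the error message, source location and exit code from a stderr dump."""
    # "ERROR: " (single colon) lines carry the message itself.
    errors = re.findall(r"^ERROR: (.+)$", stderr_text, flags=re.MULTILINE)
    # "ERROR:: Exit from:" carries the source file and line number.
    location = re.search(
        r"^ERROR:: Exit from: (\S+) line: (\d+)", stderr_text, flags=re.MULTILINE
    )
    finish = re.search(r"called boinc_finish\((\d+)\)", stderr_text)
    return {
        "errors": errors,
        "source_file": location.group(1) if location else None,
        "source_line": int(location.group(2)) if location else None,
        "exit_code": int(finish.group(1)) if finish else None,
    }


summary = summarize_stderr(SAMPLE_STDERR)
print(summary["source_file"], summary["source_line"], summary["exit_code"])
```

Running the same function over both dumps would show the identical `FoldTree.cc` line, which supports the observation that these tasks fail whether or not memory use runs away first.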

Grant
Darwin NT

10esseetony
Joined: 24 Dec 11
Posts: 5
Credit: 23,602,985
RAC: 0
Message 98862 - Posted: 8 Sep 2020, 3:02:38 UTC - in response to Message 98841.  



.....Rosetta might have run out, but you are also doing work for over a dozen other projects....



LOL, I am just curious, where are you guys getting the info that I am running 12+ projects at once [presumably on a single computer]? Let me help you out: https://stats.free-dc.org/userbycpid/627a6be35f3dbebd60ed8b5cda8c0b95

I am currently in 'Summer' mode, only running 4 computers out of the 21 at my disposal. Well, running 5 if you want to count that poor old iMac in my daughter's room. My current projects are Universe, WCG, and Rosetta, all other points received today are from quorum 2 projects (wingmen double checking my work finally).

If I do run multiple projects on one machine, I prefer only 3 per computer, but I assure you they each will have their own client/manager running just one project each at a set percentage of CPU usage, and in no way are fighting with other projects for run time. If you would like to know how to do that, see this thread:
https://forums.anandtech.com/threads/multiple-boinc-clients-on-the-same-computer.2573424/



Now, back to the topic, good catch that the problem is (possibly) Linux only, and that they crash and burn anyway. I was curious to see the points on that one, but I'll go nuke it instead.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,451,410
RAC: 20,088
Message 98864 - Posted: 8 Sep 2020, 3:56:06 UTC - in response to Message 98862.  
Last modified: 8 Sep 2020, 3:56:25 UTC

LOL, I am just curious, where are you guys getting the info that I am running 12+ projects at once
Click on a person's name & it shows what projects they are doing.


[presumably on a single computer]
Because that was the whole point of BOINC, one manager to let you run multiple projects. Whether you have 1 or 1,000 systems doing the work, you install BOINC, attach to the projects of your choice & then let it manage things according to your Resource share settings. If people choose to complicate things, it's their choice.




Now, back to the topic, good catch that the problem is (possibly) Linux only, and that they crash and burn anyway. I was curious to see the points on that one, but I'll go nuke it instead.
Hopefully over the next day or so we'll see some results from Windows machines showing whether they crash and burn as well (most likely), and whether some of the Work Units also have runaway memory usage.
Grant
Darwin NT

10esseetony
Joined: 24 Dec 11
Posts: 5
Credit: 23,602,985
RAC: 0
Message 98865 - Posted: 8 Sep 2020, 4:24:08 UTC - in response to Message 98864.  
Last modified: 8 Sep 2020, 4:43:19 UTC

Well, thanks to your findings, I have switched my allocated 8 of 32 threads of Ryzen under Linux to 10 threads of Haswell on Windows. Hopefully the issue is therefore solved (for me).....and then I downloaded 10+10 days of tasks! (J/K!!!!!)

Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines. I am glad the BOINC client works for you 100% as intended. Which I am sure you have tested. Meanwhile I'll simply continue to complicate things.

PS: Click on a person's name and it shows everything they have EVER done. You have some very nice systems, and I appreciate you donating your computers and your time and your money for citizen science research.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,451,410
RAC: 20,088
Message 98866 - Posted: 8 Sep 2020, 5:59:18 UTC - in response to Message 98865.  

Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines.
Because you have effectively joined a new project with that system.
To date all the work has been done on the existing project, so the new (or newly increased) project is now owed a debt before the work done actually matches your resource share settings.
And with Rosetta's short deadlines, its long Task processing times, and the amount of work the system has just downloaded, it needs to process what it has for Rosetta in order to meet those deadlines. Once that is done, it will then process mostly WCG until the debt then owed to it is met, then some more Rosetta, then more WCG, and so on, until it settles down to the work being processed at any given time being in accordance with your Resource share settings.

Resource share is something that balances out over the longer term, not just a matter of hours, and certainly not straight off the bat.
The fewer projects, the smaller the cache, and the more cores & threads you have, the sooner the Resource share settings will be honoured (within a week, even within a few days in many cases). The fewer cores & threads, the larger the cache, and the more projects you have, the longer it takes for your Resource share to be honoured (as in months, and as in many months if people then start trying to micromanage things).
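
The long-term balancing described above can be illustrated with a toy model. This is only a sketch and not BOINC's actual scheduler: `simulate` is a hypothetical function, the 1 : 3 shares are chosen so the effect shows up quickly (rather than the 1 : 9999 from this thread), and the 24 hour Rosetta backlog stands in for tasks that must run first to meet their deadlines.

```python
def simulate(shares, hours, forced=None):
    """Toy balancer: each hour, run the project furthest below its target
    fraction of the work done so far. `forced` holds deadline-driven hours
    that are run first, regardless of share."""
    total_share = sum(shares.values())
    done = {name: 0 for name in shares}
    forced = dict(forced or {})
    for hour in range(hours):
        pending = [name for name, left in forced.items() if left > 0]
        if pending:
            # Deadline pressure wins over resource share.
            pick = pending[0]
            forced[pick] -= 1
        else:
            elapsed = max(hour, 1)
            pick = min(
                shares,
                key=lambda name: done[name] / elapsed - shares[name] / total_share,
            )
        done[pick] += 1
    return {name: done[name] / hours for name in shares}


shares = {"Rosetta": 1, "WCG": 3}   # illustrative shares only
backlog = {"Rosetta": 24}           # hours of Rosetta work that must run first

for hours in (24, 48, 240):
    print(hours, simulate(shares, hours, backlog))
```

In this model Rosetta takes all of the first day, still holds half of the total after two days, and only drops to its quarter share once the other project has been repaid. A more lopsided share, a bigger backlog or fewer cores stretches that repayment period accordingly.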
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,451,410
RAC: 20,088
Message 98916 - Posted: 9 Sep 2020, 1:13:00 UTC

Ah, we're back.
Forums/server info was all MIA for a while there due to the database being down/unavailable.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,451,410
RAC: 20,088
Message 98918 - Posted: 9 Sep 2020, 2:26:39 UTC - in response to Message 98916.  
Last modified: 9 Sep 2020, 2:28:58 UTC

Ah, we're back.
Forums/server info was all MIA for a while there due to the database being down/unavailable.


Now just getting random errors:
Project is down
The project's database server is either down or ran out of connections at the moment. Please check back in a few minutes.
Grant
Darwin NT

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98938 - Posted: 9 Sep 2020, 21:17:03 UTC - in response to Message 98862.  


If I do run multiple projects on one machine, I prefer only 3 per computer, but I assure you they each will have their own client/manager running just one project each at a set percentage of CPU usage, and in no way are fighting with other projects for run time. If you would like to know how to do that, see this thread:
https://forums.anandtech.com/threads/multiple-boinc-clients-on-the-same-computer.2573424/


What's the advantage of a client per project?

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98939 - Posted: 9 Sep 2020, 21:20:12 UTC - in response to Message 98864.  

Because that was the whole point of BOINC, one manager to let you run multiple projects. Whether you have 1 or 1,000 systems doing the work, you install BOINC, attach to the projects of your choice & then let it manage things according to your Resource share settings. If people choose to complicate things, it's their choice.


It's a pity Boinc doesn't manage multiple computers and we have to use third party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98940 - Posted: 9 Sep 2020, 21:25:00 UTC - in response to Message 98865.  
Last modified: 9 Sep 2020, 21:25:26 UTC

Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines. I am glad the BOINC client works for you 100% as intended. Which I am sure you have tested. Meanwhile I'll simply continue to complicate things.


And so will I, as, like you, I find Boinc never does what I ask. You join Rosetta at 1 and it panics, thinking it hasn't done any over the last 10 days (WTF?) when it should be doing 1/10000. So it runs it at 100%. Changing which projects you run and what the weighting is should reset the counter. I changed this on mine to make things slightly more sensible, in cc_config.xml: <rec_half_life_days>1.000000</rec_half_life_days> - this means it looks at the last day instead of the last 10 days to figure out what to run.
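
To get a feel for what changing `rec_half_life_days` from 10 to 1 does, the sketch below assumes a plain exponential decay, as the "half life" in the setting's name suggests. It is not taken from the BOINC source, and `decay_weight` is a hypothetical helper.

```python
def decay_weight(age_days, half_life_days):
    """Weight given to work done `age_days` ago under exponential decay
    with the given half-life (an assumed model, not BOINC's exact code)."""
    return 0.5 ** (age_days / half_life_days)


for half_life in (10.0, 1.0):
    weights = {age: round(decay_weight(age, half_life), 4) for age in (1, 5, 10)}
    print(f"half-life {half_life:>4} days -> weight by age in days: {weights}")
```

Under that assumption, work done 10 days ago still counts at half weight with the 10 day setting, but at about a thousandth with a 1 day half-life, so the client's view of "recent work" is dominated by the last day or two rather than being a hard one-day window.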

PS: Click on a person's name and it shows everything they have EVER done.


It shows you with recent credit on over a dozen projects. I guess it's an average over quite some time - I think it's a month.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,786
RAC: 1,136
Message 98941 - Posted: 9 Sep 2020, 22:48:54 UTC - in response to Message 98939.  

Peter Hucker said
It's a pity Boinc doesn't manage multiple computers and we have to use third party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface.


I hope that's an option if it's ever implemented, as I much prefer to manage each PC as its own computer and then choose whether to run the same project as other computers or to run its own set of projects. Sometimes I have every PC running the same project, while other times I prefer to run something different on each PC.

In some cases I just can't blast thru the tasks at a project because I am a 'Team friendly person', meaning if you are on my team and I am behind you I will not pass you as long as you are crunching. It's the old thing 'just because I can doesn't mean I should' for me!!! I have more resources than any other cruncher on my team and could easily be #1 on every running project I crunch for, but that then ruins the incentive for my teammates to keep on crunching because they are #1 or #2. I am already #1 at enough projects that I can move things around to keep the pressure on but not pass them. For example, at PrimeGrid a teammate is almost 30 million credits ahead of me, but a new challenge could easily see me doing 5 million credits a week if I really want to. He just doesn't have, nor can he afford, the kind of horsepower needed to keep me behind him, but he keeps crunching the way he is and that's a good thing for Boinc in general!!

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 7,326
Message 98978 - Posted: 12 Sep 2020, 20:37:24 UTC - in response to Message 98941.  

Peter Hucker said
It's a pity Boinc doesn't manage multiple computers and we have to use third party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface.


I hope that's an option if it's ever implemented, as I much prefer to manage each PC as its own computer and then choose whether to run the same project as other computers or to run its own set of projects. Sometimes I have every PC running the same project, while other times I prefer to run something different on each PC.

In some cases I just can't blast thru the tasks at a project because I am a 'Team friendly person', meaning if you are on my team and I am behind you I will not pass you as long as you are crunching. It's the old thing 'just because I can doesn't mean I should' for me!!! I have more resources than any other cruncher on my team and could easily be #1 on every running project I crunch for, but that then ruins the incentive for my teammates to keep on crunching because they are #1 or #2. I am already #1 at enough projects that I can move things around to keep the pressure on but not pass them. For example, at PrimeGrid a teammate is almost 30 million credits ahead of me, but a new challenge could easily see me doing 5 million credits a week if I really want to. He just doesn't have, nor can he afford, the kind of horsepower needed to keep me behind him, but he keeps crunching the way he is and that's a good thing for Boinc in general!!


You've never seen Boinctasks, have you? You can do exactly what you want as well as what I want. I can select one computer, a few, or all of them, and give them an instruction. Like you, I set different machines doing different things. Some are better at certain projects, some can't do them at all.

Aravah
Joined: 12 Apr 20
Posts: 6
Credit: 1,101,172
RAC: 0
Message 98987 - Posted: 13 Sep 2020, 9:23:17 UTC

I am seeing a Rosetta task requesting much, much more memory than usual.
Is this expected?
Application Rosetta 4.20
Name kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4381
State Waiting for memory
Received Thu 10 Sep 2020 16:00:19 BST
Report deadline Sun 13 Sep 2020 16:00:18 BST
Estimated computation size 80,000 GFLOPs
CPU time 01:09:16
CPU time since checkpoint ---
Elapsed time 01:17:21
Estimated time remaining 07:32:17
Fraction done 5.772%
Virtual memory size 33.02 GB
Working set size 28.44 GB
Directory slots/6
Progress rate 4.320% per hour
Executable rosetta_4.20_x86_64-pc-linux-gnu

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,238,680
RAC: 0
Message 98988 - Posted: 13 Sep 2020, 9:40:06 UTC - in response to Message 98987.  

I am seeing a Rosetta task requesting much, much more memory than usual.
Is this expected?
Application Rosetta 4.20
Name kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4381
State Waiting for memory

Most of us had these earlier in the week. I aborted all the fold_and_dock tasks. They seem to have a serious problem with the amount of memory they need.
BOINC blog

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98989 - Posted: 13 Sep 2020, 9:47:09 UTC - in response to Message 98987.  

Is this expected?
Several users have reported fold_and_dock tasks trying (and usually eventually failing) to allocate tens of gigabytes of memory. If you have a vast amount of swap space they might be able to complete, but you’re probably as well just aborting them and doing something else instead.
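
For anyone with a lot of these queued, the aborting suggested above could be scripted with `boinccmd`, whose `--task <URL> <task_name> abort` operation aborts a single task. The sketch below only builds the commands and does not run them. `abort_commands` is a hypothetical helper, the project URL is an assumption (use the master URL your own client shows for Rosetta@home), and the last task name in the list is made up for illustration.

```python
def abort_commands(task_names, project_url, pattern="fold_and_dock"):
    """Build (but do not run) one boinccmd abort command per matching task."""
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in task_names
        if pattern in name
    ]


# Assumed project URL; check the URL your client is attached to.
ROSETTA_URL = "https://boinc.bakerlab.org/rosetta/"

tasks = [
    "kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4701_0",  # from this thread
    "kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_893_0",   # from this thread
    "some_other_task_0",                                   # hypothetical name
]

for command in abort_commands(tasks, ROSETTA_URL):
    print(" ".join(command))
```

Printing the commands first makes it easy to check the list before passing each one to `subprocess.run` or pasting it into a shell.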

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,451,410
RAC: 20,088
Message 98990 - Posted: 13 Sep 2020, 10:42:39 UTC - in response to Message 98987.  

I am seeing a Rosetta task requesting much, much more memory than usual.
Is this expected?
Because of the size of your cache, and Rosetta being your secondary project, it's taken you several days to start processing what was a resend of those problem Tasks.
Grant
Darwin NT



©2024 University of Washington
https://www.bakerlab.org