Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
hangint3n Joined: 23 Mar 20 Posts: 8 Credit: 1,958,078 RAC: 0 |
Just had a similar problem on my box. It froze the whole thing up. === hangint3n |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
You are correct, no, one shouldn't have to log in to see the image, now that you mention it. I'll just link to the thread, but beware, have your adblocker turned on: https://forums.anandtech.com/threads/recent-changes-in-projects.2500471/post-40275238 I can get to the forum with your link, but clicking the image asks me to log in. I don't have an account. And I have 11 ad blockers; will that do? Not only do they block ads, but also YouTube video ads, EU cookie notices, government coronavirus advice, and links to grass people off in forums that used a naughty word. Electricity isn't wasted when the PC is idle; they don't use much then. I have all 6 machines displayed permanently on a monitor [1] in here, via Boinctasks. I spot immediately if one is playing up. The other 5 machines are in the garage where I can't hear the many fans, but usually I can sort stuff via Boinctasks or remote desktop. [1] Correction: two monitors, one above the other. The list got too large with 5 GPUs and 66 cores. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
Otherwise Boinc only ever asks MW for a couple of 30-second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time. Looks like it's been an issue forever. Yes, I've been attacking that problem a lot. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
Peter Hucker wrote: I run more than Milkyway and I need the buffer. Otherwise Boinc only ever asks MW for a couple of 30-second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time. The 10 minutes isn't enforced by the MW servers. Boinc chooses to wait that long when it's denied the first time. If you do a manual update after about 2 minutes, it gets them. So presumably the modified Boinc just changes that setting. Or it could stop Boinc reporting tasks every time it contacts the server; that would work too. |
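The "tenth of the time" figure can be checked with a little arithmetic. This is a minimal sketch, assuming a single core, two 30-second MW tasks per successful fetch, and that the next fetch only succeeds after the wait; `mw_fraction` is a hypothetical helper, not anything in BOINC.

```python
def mw_fraction(tasks_per_fetch: int, task_seconds: float, wait_seconds: float) -> float:
    """Share of time spent running MW when each small batch of work is
    followed by a wait before the next successful fetch (simplified model)."""
    busy = tasks_per_fetch * task_seconds      # MW work obtained per fetch
    return busy / (busy + wait_seconds)        # busy time over one whole cycle

# Client's own 10-minute back-off after a denied request: 60 / 660
print(f"{mw_fraction(2, 30, 600):.1%}")   # 9.1%
# Manual update after about 2 minutes instead: 60 / 180
print(f"{mw_fraction(2, 30, 120):.1%}")   # 33.3%
```

Under these assumptions the 10-minute back-off alone caps MW at roughly 9%, which matches the "tenth of the time" observation, while retrying after 2 minutes would roughly triple it.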
Grant (SSSF) Joined: 28 Mar 20 Posts: 1729 Credit: 18,451,410 RAC: 20,088 |
Problems and Technical Issues, eh? How about 41GB of RAM for ONE task? Name: ygG5REMC******1009391_1307_0 So far all of these reports of out-of-control memory Tasks have been on Linux systems. Has anyone with a Windows system got one of the problem Tasks yet? Edit: Even if the RAM usage doesn't get out of control, it looks like they crash and burn anyway. kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4701_0; Outcome: Computation error; Client state: Compute error; Exit status: 1 (0x00000001) Unknown error code; Computer ID: 3930525; Run time: 22 min 26 sec; CPU time: 21 min 53 sec; Validate state: Invalid; Credit: 0.00; Device peak FLOPS: 5.60 GFLOPS; Application version: Rosetta v4.20 x86_64-pc-linux-gnu; Peak working set size: 617.72 MB; Peak swap size: 758.16 MB; Peak disk usage: 48.62 MB; Stderr output: <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3868745 Using database: database_357d5d93529_n_methyl/minirosetta_database ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump! ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436 BOINC:: Error reading and gzipping output datafile: default.out 20:45:43 (4601): called boinc_finish(1) </stderr_txt> ]]>
And an out-of-control RAM error Task: kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_893_0; Outcome: Computation error; Client state: Compute error; Exit status: 1 (0x00000001) Unknown error code; Computer ID: 3930525; Run time: 44 min 11 sec; CPU time: 44 min 11 sec; Validate state: Invalid; Credit: 24.00; Device peak FLOPS: 5.60 GFLOPS; Application version: Rosetta v4.20 x86_64-pc-linux-gnu; Peak working set size: 19,307.60 MB; Peak swap size: 20,495.17 MB; Peak disk usage: 49.49 MB; Stderr output: <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3872553 Using database: database_357d5d93529_n_methyl/minirosetta_database ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump! ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436 BOINC:: Error reading and gzipping output datafile: default.out 19:10:32 (4261): called boinc_finish(1) </stderr_txt> ]]> Grant Darwin NT |
10esseetony Joined: 24 Dec 11 Posts: 5 Credit: 23,602,985 RAC: 0 |
LOL, I am just curious, where are you guys getting the info that I am running 12+ projects at once [presumably on a single computer]? Let me help you out: https://stats.free-dc.org/userbycpid/627a6be35f3dbebd60ed8b5cda8c0b95 I am currently in 'Summer' mode, only running 4 computers out of the 21 at my disposal. Well, 5 if you want to count that poor old iMac in my daughter's room. My current projects are Universe, WCG, and Rosetta; all other points received today are from quorum 2 projects (wingmen finally double-checking my work). If I do run multiple projects on one machine, I prefer only 3 per computer, but I assure you each will have its own client/manager running just one project at a set percentage of CPU usage, and they are in no way fighting with other projects for run time. If you would like to know how to do that, see this thread: https://forums.anandtech.com/threads/multiple-boinc-clients-on-the-same-computer.2573424/ Now, back to the topic: good catch that the problem is (possibly) Linux only, and that they crash and burn anyway. I was curious to see the points on that one, but I'll go nuke it instead. |
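The linked thread is the reference for actually setting up one client per project. As a rough sketch of the idea only, assuming the BOINC client's `--allow_multiple_clients`, `--dir` and `--gui_rpc_port` command-line options and a `max_ncpus_pct` setting in each client's own `global_prefs_override.xml`, the Python below just prints the command line and preferences file for each instance. The directory, instance names and percentages are made-up examples; attaching each client to its project is a separate step not shown here.

```python
from pathlib import Path

# 31416 is BOINC's default GUI RPC port; start above it so the extra
# clients do not clash with a normally installed client.
FIRST_PORT = 31417

def plan_clients(base_dir: str, cpu_pct: dict[str, float]) -> list[tuple[list[str], Path, str]]:
    """Build (command line, prefs file path, prefs XML) for one client per project.

    cpu_pct maps a hypothetical instance name to the percentage of the
    machine's CPUs that instance may use. Nothing is started or written here.
    """
    plans = []
    for offset, (name, pct) in enumerate(cpu_pct.items()):
        data_dir = Path(base_dir) / name                 # each client gets its own data directory
        command = [
            "boinc",
            "--allow_multiple_clients",
            "--dir", str(data_dir),
            "--gui_rpc_port", str(FIRST_PORT + offset),  # unique RPC port per client
        ]
        prefs_xml = (
            "<global_preferences>\n"
            f"   <max_ncpus_pct>{pct:g}</max_ncpus_pct>\n"
            "</global_preferences>\n"
        )
        plans.append((command, data_dir / "global_prefs_override.xml", prefs_xml))
    return plans

if __name__ == "__main__":
    # Hypothetical split: three single-project clients sharing one machine.
    for command, prefs_path, prefs_xml in plan_clients(
        "/var/lib/boinc-multi", {"wcg": 50, "universe": 25, "rosetta": 25}
    ):
        print(" ".join(command))
        print(f"# contents for {prefs_path}:")
        print(prefs_xml)
```

Separate data directories and ports are what let the clients run side by side; the CPU percentage per client is what keeps each project to its slice, instead of relying on resource share.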
Grant (SSSF) Joined: 28 Mar 20 Posts: 1729 Credit: 18,451,410 RAC: 20,088 |
LOL, I am just curious, where are you guys getting the info that I am running 12+ projects at once Click on a person's name & it shows what projects they are doing. [presumably on a single computer] Because that was the whole point of BOINC, one manager to let you run multiple projects. Whether you have 1 or 1,000 systems doing the work, you install BOINC, attach to the projects of your choice & then let it manage things according to your Resource share settings. If people choose to complicate things, it's their choice. Now, back to the topic, good catch that the problem is (possibly) Linux only, and that they crash and burn anyway. I was curious to see the points on that one, but I'll go nuke it instead. Hopefully over the next day or so we'll see some results from Windows machines as to whether they crash and burn as well (most likely), and whether some of the Work Units also have runaway memory usage issues. Grant Darwin NT |
10esseetony Joined: 24 Dec 11 Posts: 5 Credit: 23,602,985 RAC: 0 |
Well, thanks to your findings, I have switched my allocated 8 of 32 threads of Ryzen under Linux to 10 threads of Haswell on Windows. Hopefully the issue is therefore solved (for me).....and then I downloaded 10+10 days of tasks! (J/K!!!!!) Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines. I am glad the BOINC client works for you 100% as intended. Which I am sure you have tested. Meanwhile I'll simply continue to complicate things. PS: Click on a person's name and it shows everything they have EVER done. You have some very nice systems, and I appreciate you donating your computers and your time and your money for citizen science research. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1729 Credit: 18,451,410 RAC: 20,088 |
Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines. Because you have effectively joined a new project with that system. To date all the work has been on the existing project, so the project that has just been added (or given more computing resources) is now owed a debt before the work actually done matches up with your resource share settings. And with the short deadlines for Rosetta, the long Task processing times, and the amount of work the system has just got, it needs to process the Rosetta work it has in order to meet those deadlines. Once that is done, it will then process mostly WCG until the debt then owed to it is met, then some more Rosetta, then more WCG etc, etc until it settles down to the work being processed at any given time being in accordance with your Resource share settings. Resource share is something that balances out over the longer term, not just a matter of hours, and certainly not straight off the bat. The fewer projects, the smaller the cache, and the more cores & threads you have, the sooner the Resource share settings will be honoured (within a week, even within a few days in many cases). The fewer cores & threads, the larger the cache and the more projects you have, the longer it takes for your Resource share to be honoured (as in months, or many months if people then start trying to micromanage things). Grant Darwin NT |
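The "debt" above can be illustrated with a simplified model of how a BOINC client ranks projects: each project's resource share fraction minus its fraction of recent credit. This is a sketch under that assumption, not the client's actual code, and it ignores deadline pressure, which (as noted above) can force Rosetta work to run regardless. The recent-credit numbers are invented for the example.

```python
def priorities(shares: dict[str, float], recent_credit: dict[str, float]) -> dict[str, float]:
    """Simplified priority: share fraction minus recent-credit fraction.

    Positive means the project has done less than its share recently (it is owed);
    negative means it has done more than its share.
    """
    total_share = sum(shares.values())
    total_credit = sum(recent_credit.values()) or 1.0   # avoid dividing by zero on a fresh host
    return {
        name: shares[name] / total_share - recent_credit.get(name, 0.0) / total_credit
        for name in shares
    }

# Rosetta share 1, WCG share 9999; all recent credit on this host went to WCG.
p = priorities({"Rosetta": 1, "WCG": 9999}, {"Rosetta": 0.0, "WCG": 5000.0})
print(max(p, key=p.get))   # Rosetta
```

Even with a share of 1 in 10,000, Rosetta comes out on top in this model because it has done nothing recently (+0.0001 versus -0.0001 for WCG). Once it has earned about 1/10,000 of the recent credit, both values drop to zero and WCG gets the cores back, which is the settling-down described above.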
Grant (SSSF) Joined: 28 Mar 20 Posts: 1729 Credit: 18,451,410 RAC: 20,088 |
Ah, we're back. Forums/server info was all MIA for a while there due to the database being down/unavailable. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1729 Credit: 18,451,410 RAC: 20,088 |
Ah, we're back. Now just getting random "Project is down: The project's database server is either down or ran out of connections at the moment. Please check back in a few minutes." errors. Grant Darwin NT |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
What's the advantage of a client per project? |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
Because that was the whole point of BOINC, one manager to let you run multiple projects. Whether you have 1 or 1,000 systems doing the work, you install BOINC, attach to the projects of your choice & then let it manage things according to your Resource share settings. If people choose to complicate things, it's their choice. It's a pity Boinc doesn't manage multiple computers and we have to use third-party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc's. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines. I am glad the BOINC client works for you 100% as intended. Which I am sure you have tested. Meanwhile I'll simply continue to complicate things. And so will I, since, as with you, Boinc never does what I ask. You join Rosetta at 1 and it panics, thinking it's not done any over the last 10 days (WTF?), when it should only be doing 1/10000 of the work. So it runs it at 100%. Changing what projects you run and what the weighting is should reset the counter. I changed this on mine to make things slightly more sensible, in cc_config.xml: <rec_half_life_days>1.000000</rec_half_life_days> - this means it looks at the last day instead of the last 10 days to figure out what to run. PS: Click on a person's name and it shows everything they have EVER done. It shows you with recent credit on over a dozen. I guess it's an average over quite some time; I think it's a month. |
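`rec_half_life_days` sets the half-life of the client's recent-credit average rather than a hard cut-off window, so older work fades gradually instead of dropping out after a fixed number of days. A small sketch of the decay this implies, assuming plain half-life weighting (the client's exact update rule may differ); both helpers are hypothetical.

```python
import math

def weight_remaining(days_ago: float, half_life_days: float) -> float:
    """Weight still given to credit earned `days_ago` days in the past."""
    return 0.5 ** (days_ago / half_life_days)

def days_to_fade(fraction_left: float, half_life_days: float) -> float:
    """Days until old credit counts for only `fraction_left` of its original weight."""
    return half_life_days * math.log2(1 / fraction_left)

for half_life in (10.0, 1.0):   # the 10-day default vs the 1-day setting above
    print(f"half-life {half_life:g} d: "
          f"yesterday's work still weighs {weight_remaining(1, half_life):.0%}, "
          f"fades to 10% after {days_to_fade(0.1, half_life):.1f} days")
```

With a 10-day half-life, a day-old burst of work still carries about 93% of its weight and takes about 33 days to fade to 10%; with a 1-day half-life those figures are 50% and about 3.3 days, which is why the shorter setting reacts much faster to share changes.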
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,786 RAC: 1,136 |
Peter Hucker said: It's a pity Boinc doesn't manage multiple computers and we have to use third-party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc's. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface. I hope that's an option if it's ever implemented, as I much prefer to manage each pc as its own computer and then choose whether to run the same project as other computers or to run its own set of projects. Sometimes I have every pc running the same project while other times I prefer to run something different on each pc. In some cases I just can't blast thru the tasks at a project because I am a 'Team friendly person', meaning if you are on my team and I am behind you, I will not pass you as long as you are crunching. It's the old thing 'just because I can doesn't mean I should' for me!!! I have more resources than any other cruncher on my team and could easily be #1 on every running project I crunch for, but that then ruins the incentive for my teammates to keep on crunching because they are #1 or #2. I am already #1 at enough projects that I can move things around to keep the pressure on but not pass them. For example, at PrimeGrid a teammate is almost 30 million credits ahead of me, but a new challenge could easily see me doing 5 million credits a week if I really wanted to. He just doesn't have, nor can he afford, the kind of horsepower needed to keep me behind him, but he keeps crunching the way he is and that's a good thing for Boinc in general!! |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 7,326 |
Peter Hucker said: It's a pity Boinc doesn't manage multiple computers and we have to use third-party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc's. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface. You've never seen Boinctasks, have you? You can do exactly what you want as well as what I want. I can select one computer, a few, or all of them, and give them an instruction. Like you, I set different machines doing different things. Some are better at certain projects, some can't do them at all. |
Aravah Joined: 12 Apr 20 Posts: 6 Credit: 1,101,172 RAC: 0 |
I am seeing a Rosetta task requesting much, much more memory than usual. Is this expected? Application: Rosetta 4.20; Name: kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4381; State: Waiting for memory; Received: Thu 10 Sep 2020 16:00:19 BST; Report deadline: Sun 13 Sep 2020 16:00:18 BST; Estimated computation size: 80,000 GFLOPs; CPU time: 01:09:16; CPU time since checkpoint: ---; Elapsed time: 01:17:21; Estimated time remaining: 07:32:17; Fraction done: 5.772%; Virtual memory size: 33.02 GB; Working set size: 28.44 GB; Directory: slots/6; Progress rate: 4.320% per hour; Executable: rosetta_4.20_x86_64-pc-linux-gnu |
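"Waiting for memory" is the BOINC client holding a task back because its working set is larger than the share of RAM the computing preferences allow BOINC to use. A rough check, assuming a hypothetical 32 GB machine and memory limits of 50% while the computer is in use and 90% while idle (substitute your own RAM size and the percentages from your preferences):

```python
def fits_in_limit(working_set_gb: float, ram_gb: float, max_used_pct: float) -> bool:
    """True if one task's working set fits within the share of RAM BOINC may use.

    Simplification: ignores other BOINC tasks that count against the same limit.
    """
    allowed_gb = ram_gb * max_used_pct / 100
    return working_set_gb <= allowed_gb

task_gb = 28.44                          # working set size reported for the task above
print(fits_in_limit(task_gb, 32, 50))    # 16 GB allowed -> False
print(fits_in_limit(task_gb, 32, 90))    # 28.8 GB allowed -> True
```

On such a machine the task could only run while the computer is idle, and with several of these tasks queued the limit would be hit even then.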
MarkJ Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
I am seeing Rosetta task requesting much much more memory than usual? Most of us had these earlier in the week. I aborted all the fold_and_dock tasks. They seem to have a serious problem with the amount of memory they need. BOINC blog |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Is this expected? Several users have reported fold_and_dock tasks trying (and usually eventually failing) to allocate tens of gigabytes of memory. If you have a vast amount of swap space they might be able to complete, but you’re probably better off just aborting them and doing something else instead. |
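For anyone wanting to abort the fold_and_dock tasks in bulk rather than one by one in the Manager, here is a sketch using `boinccmd`. It assumes the `--get_tasks` listing prints each task's name on a `name: ...` line and that a task is aborted with `boinccmd --task <project URL> <task name> abort`. The Rosetta project URL below is an assumption, so check it against the one your client shows; `boinccmd` may also need to be run from the BOINC data directory or given `--passwd`. The script only prints the abort commands so they can be reviewed first.

```python
import subprocess

# Assumed Rosetta@home master URL; check the URL your client shows for the project.
ROSETTA_URL = "https://boinc.bakerlab.org/rosetta/"

def task_names(get_tasks_output: str, pattern: str) -> list[str]:
    """Return task names containing `pattern` from `boinccmd --get_tasks` output."""
    names = []
    for line in get_tasks_output.splitlines():
        line = line.strip()
        if line.startswith("name:"):          # 'WU name:' lines don't match after strip()
            name = line[len("name:"):].strip()
            if pattern in name:
                names.append(name)
    return names

def abort_commands(names: list[str], project_url: str = ROSETTA_URL) -> list[list[str]]:
    """One `boinccmd --task <url> <name> abort` command per task."""
    return [["boinccmd", "--task", project_url, name, "abort"] for name in names]

def main() -> None:
    try:
        listing = subprocess.run(["boinccmd", "--get_tasks"],
                                 capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"could not query the BOINC client: {err}")
        return
    for cmd in abort_commands(task_names(listing, "fold_and_dock")):
        print(" ".join(cmd))   # review first, then run by hand or with subprocess.run(cmd)

if __name__ == "__main__":
    main()
```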
Grant (SSSF) Joined: 28 Mar 20 Posts: 1729 Credit: 18,451,410 RAC: 20,088 |
I am seeing Rosetta task requesting much much more memory than usual? Because of the size of your cache, and Rosetta being your secondary project, it's taken you several days to start processing what was a resend of those problem Tasks. Grant Darwin NT |
©2024 University of Washington
https://www.bakerlab.org