Problems and Technical Issues with Rosetta@home

Sid Celery Joined: 11 Feb 08 Posts: 2590 Credit: 47,220,881 RAC: 5	Message 102414 - Posted: 17 Aug 2021, 18:51:51 UTC Last modified: 17 Aug 2021, 18:53:09 UTC

I've been getting several batches of Rosetta work, but they all crash as soon as they start. The error message I'm getting is this one. Is it only my system, or everyone's?

17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip

I'd like to report it quickly, but I've had several problems with my PC recently and don't want to report something that's only a local problem.

Falconet Joined: 9 Mar 09 Posts: 355 Credit: 1,669,337 RAC: 0	Message 102415 - Posted: 17 Aug 2021, 18:56:31 UTC - in response to Message 102414.

I've got 9 CD98 tasks. 8 have been running fine for several hours now. Maybe a project reset could work?

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 102416 - Posted: 17 Aug 2021, 19:57:14 UTC - in response to Message 102414.

> I've been getting several batches of Rosetta work, but they all crash as soon as they start. The error message I'm getting is this one. Is it only my system, or everyone's?
>
> 17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip
>
> I'd like to report it quickly, but I've had several problems with my PC recently and don't want to report something that's only a local problem.

I looked at the log files. All the errors appear to be due to a problem with the database file shared by all of the failed tasks. I'd recommend resetting the project to replace all of the shared files, since there doesn't appear to be a way of replacing just one of them.

Sid Celery Joined: 11 Feb 08 Posts: 2590 Credit: 47,220,881 RAC: 5	Message 102417 - Posted: 18 Aug 2021, 0:35:18 UTC - in response to Message 102416.

>> I've been getting several batches of Rosetta work, but they all crash as soon as they start. The error message I'm getting is this one. Is it only my system, or everyone's?
>>
>> 17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip
>>
>> I'd like to report it quickly, but I've had several problems with my PC recently and don't want to report something that's only a local problem.
>
> I looked at the log files. All the errors appear to be due to a problem with the database file shared by all of the failed tasks. I'd recommend resetting the project to replace all of the shared files, since there doesn't appear to be a way of replacing just one of them.

Each time I grab more tasks that file gets replaced. It hadn't been helping up to now. I haven't reset the project, but I did reboot, and when BOINC restarted it reported the following without bringing new tasks down:

17/08/2021 23:17:35 | Rosetta@home | Resetting file projects/boinc.bakerlab.org_rosetta/database_357d5d93529_n_methyl.zip: RSA key check failed for file
17/08/2021 23:17:37 | Rosetta@home | Started download of database_357d5d93529_n_methyl.zip
17/08/2021 23:18:59 | Rosetta@home | Finished download of database_357d5d93529_n_methyl.zip

I've now grabbed some more tasks (with over a million appearing in the queue on the front page, I now notice) and the first 3 tasks have been running OK for up to 7 minutes, with no immediate computation errors, so I've got my fingers crossed that it's righted itself.

Thanks for double-checking me.
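For anyone who wants to check their own event log for the same failure, here is a minimal Python sketch. It assumes the log has been saved as text with one event per line in the layout quoted in this thread (date, project, message, separated by " | ", with a day/month/year date, which may differ with your locale). The function names are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the lines quoted in the thread:
#   DD/MM/YYYY HH:MM:SS | project name | message text
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| ([^|]+?) \| (.*)$")

# The two failure messages that appear in the thread.
FAILURE_PATTERNS = (
    re.compile(r"Signature verification failed for (\S+)"),
    re.compile(r"Resetting file (\S+?): RSA key check failed"),
)


def parse_event(line):
    """Split one log line into (timestamp, project, message), or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, project, message = match.groups()
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


def files_failing_verification(lines):
    """Return the set of file names mentioned in signature or RSA key failures."""
    failed = set()
    for line in lines:
        event = parse_event(line)
        if event is None:
            continue
        message = event[2]
        for pattern in FAILURE_PATTERNS:
            found = pattern.search(message)
            if found:
                # Keep only the file name, dropping any projects/... directory prefix.
                failed.add(found.group(1).rsplit("/", 1)[-1])
    return failed


if __name__ == "__main__":
    sample = [
        "17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip",
        "17/08/2021 23:17:37 | Rosetta@home | Started download of database_357d5d93529_n_methyl.zip",
    ]
    print(files_failing_verification(sample))
```

If the same file name keeps turning up across several batches of tasks, that points at the shared file rather than at the individual tasks.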

kennnnnnneth Joined: 20 Jan 20 Posts: 2 Credit: 17,110 RAC: 0	Message 102418 - Posted: 19 Aug 2021, 14:31:52 UTC

Almost every RAH task I have received for the last couple months has a deadline of less than three days. I always have to abort them and refresh or I'm wasting loads of cycles on something that will grant 0 credit. Other times, there appears to be no work available at all.

I have set my compute preferences to store at most 0.5 days of work, yet I still get mega-tasks from RAH that will take 5-10x that long and cannot possibly be completed before the deadline even if I make my PC a dedicated BOINC server. These tasks are seriously 5-10x as long as anything I have ever received from another project. Since I don't appear to receive credit for tasks that exceed the deadline, I end up aborting 90% of RAH tasks.

This is comically absurd. I understand there are periods of no work, which indicates that the RAH community is supplying an overabundance of resources to the RAH team, but is there no way to spread the work out more evenly? Rather than weeks of no work followed by 8GB tasks due in 2 days that I'd need a supercomputer to crunch in time, maybe break those tasks down into tasks 1/10th the size and release them over a longer period?

I run BOINC on my network services server, my backup server, my wife's graphics workstation (juicy dual GPUs that idle most of the day), and in the background on my laptop when it's on AC power. I, like most contributors, do not have a fluid-cooled Xeon server stack dedicated to crunching data.

I am also a member of a half dozen other projects with which I have no issues (WCG, LHC, MLC, etc). I crank out 10-20 tasks per day across my computers on those projects. RAH on the other hand hasn't seen a single drop of work from me in almost 2 weeks because of this problem.

I guess my real question: is this project dead? Is it worth keeping RAH on my clients or should I put my cycles toward a project with cohesive administration?

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 102419 - Posted: 19 Aug 2021, 14:42:39 UTC - in response to Message 102418. Last modified: 19 Aug 2021, 14:44:38 UTC

> I guess my real question: is this project dead? Is it worth keeping RAH on my clients or should I put my cycles toward a project with cohesive administration?

You are relatively new. It is always feast or famine here. A shortage during the summer when the researchers are on vacation is nothing unusual.

The larger issue, which most people have avoided or are unaware of, is whether the new AI work will be done in-house or sent out to us. There may be less work in the future, or maybe even more. You can stay to find out, or leave.

PS - Don't wait for the project to communicate any of that to us. They don't.

kennnnnnneth Joined: 20 Jan 20 Posts: 2 Credit: 17,110 RAC: 0	Message 102420 - Posted: 19 Aug 2021, 14:45:28 UTC - in response to Message 102419.

Thank you, I will direct my cycles to a better-managed project.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 102422 - Posted: 20 Aug 2021, 6:47:38 UTC - in response to Message 102418. Last modified: 20 Aug 2021, 6:55:25 UTC

> Almost every RAH task I have received for the last couple months has a deadline of less than three days.

Yes, all Rosetta Tasks have a deadline of 3 days. The sooner the Project gets the results back, the sooner that information can be used in the real world, for medical research. So unlike many other projects the results are actually time critical. Sooner is better. Hence the 3 day deadlines (and having 3 days to do 8 hours of work is not a big ask IMHO).

Ideally there is no need for a cache at all, but if you feel the need, 0.5 days + 0.01 additional days is plenty. If you run more than one project, 0 cache is best.

> I always have to abort them and refresh or I'm wasting loads of cycles on something that will grant 0 credit.

Or you could have posted here and got help sorting out whatever is wrong with your system. The default processing time for a Rosetta Task is 8 hours. Some may take longer, some may finish sooner, but 95%+ will run for the Target CPU time set.

The only projects showing on your system are Rosetta & LHC. And both projects show issues with your computer being overcommitted.

An LHC Task:
Name: CMS_2587858_1629250354.094329_0
Run time: 13 hours 56 min 33 sec
CPU time: 6 hours 37 min 5 sec

Taking 14 hours to do 6 and a half hours' work is quite ridiculous.

A Rosetta Task:
Name: cd98_again_graft2_bcov_v1_xaj_SAVE_ALL_OUT_IGNORE_THE_REST_9tw1ev3j_1728867_2_0
Run time: 8 hours 34 min 18 sec
CPU time: 6 hours 15 min 30 sec

And taking 8 and a half hours to do just over 6 hours' work isn't good either.

Here is one from one of my systems:
Name: cd98_again_graft2_bcov_v1_xab_SAVE_ALL_OUT_IGNORE_THE_REST_9un7be8a_1728858_24_0
Run time: 8 hours 0 min 3 sec
CPU time: 7 hours 58 min 58 sec

Are you running Folding at Home on the system? If so, you need to limit the number of cores/threads BOINC can use, so they're not trying to do FAH & BOINC work at the same time. If FAH needs 3, then allow BOINC to use only 5.

If you are running some other CPU-intensive software on the system, then limit the number of cores/threads BOINC uses. If that CPU-heavy software needs 2 cores, then limit BOINC to only 6.

Then the Rosetta work will complete in time: no missed deadlines, no ridiculous processing times (and the same will occur for your other project, LHC: no missed deadlines & much improved performance due to no wasted CPU time). And whatever other software you are running will also perform much better, as CPU cores won't be trying to run more than one application at a time.

NB: part of your lack of Credit for Rosetta is that your system hasn't run the BOINC Benchmarks, which are used for determining Credit, so it is using the default values, which tend to be much lower than the actual values.

Grant
Darwin NT
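The comparison in the post above comes down to CPU time divided by run time. A small Python sketch of that arithmetic follows, using the three tasks quoted there. The "efficiency" label and the function names are just illustrative choices; BOINC itself simply reports the two times.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an 'H hours M min S sec' figure to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(run_time_s, cpu_time_s):
    """Fraction of the wall-clock run time during which the task actually had the CPU."""
    return cpu_time_s / run_time_s


# (run time, CPU time) for the three tasks quoted in the post.
tasks = {
    "LHC CMS task": (to_seconds(13, 56, 33), to_seconds(6, 37, 5)),
    "Rosetta cd98 task (overcommitted host)": (to_seconds(8, 34, 18), to_seconds(6, 15, 30)),
    "Rosetta cd98 task (Grant's host)": (to_seconds(8, 0, 3), to_seconds(7, 58, 58)),
}

for name, (run_s, cpu_s) in tasks.items():
    print(f"{name}: {cpu_efficiency(run_s, cpu_s):.1%}")
```

A value close to 100% means the task had a core to itself. Values well below that suggest something else was competing for the same cores, which is the situation the core/thread limit is meant to fix.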

Sid Celery Joined: 11 Feb 08 Posts: 2590 Credit: 47,220,881 RAC: 5	Message 102426 - Posted: 21 Aug 2021, 2:47:11 UTC - in response to Message 102418. Last modified: 21 Aug 2021, 2:55:31 UTC

> Almost every RAH task I have received for the last couple months has a deadline of less than three days.

They're not less than 3 days. They're exactly 3 days, to the second. 72 hours each. They're also 8 hours long each, which is somewhat less than 72 hours.

> I always have to abort them and refresh or I'm wasting loads of cycles on something that will grant 0 credit. Other times, there appears to be no work available at all.
>
> I have set my compute preferences to store at most 0.5 days of work, yet I still get mega-tasks from RAH that will take 5-10x that long and cannot possibly be completed before the deadline even if I make my PC a dedicated BOINC server. These tasks are seriously 5-10x as long as anything I have ever received from another project. Since I don't appear to receive credit for tasks that exceed the deadline, I end up aborting 90% of RAH tasks.

I don't know why you have to abort them, but looking at the ones you have aborted for some unknown reason, they all awarded credits for the time you gave them.

If you allocate 0.5 days of cache, you'll receive 12 8hr tasks at most, even if you have no other tasks from other projects. One will run on each CPU core, so 8 will run, taking up 8hrs, followed by the remaining 4 tasks on your 8 cores. This will take 16 hours out of the 72 hour deadline, leaving 56 hours to do anything else you like (or nothing if you wish).

> This is comically absurd. I understand there are periods of no work, which indicates that the RAH community is supplying an overabundance of resources to the RAH team, but is there no way to spread the work out more evenly? Rather than weeks of no work followed by 8GB tasks due in 2 days that I'd need a supercomputer to crunch in time, maybe break those tasks down into tasks 1/10th the size and release them over a longer period?

It may or may not be absurd, but take up scheduling issues with BOINC rather than any of the projects.

It's true to say that you don't need a supercomputer to run tasks here (as long as you have sufficient RAM & disk space), as tasks report back however much or little you're able to process in 8hrs of CPU time. There isn't a defined amount of work you need to process at Rosetta, which might be shorter on a faster CPU and longer on a slow CPU. Rather, there's a defined amount of time you dedicate your cores to run the tasks at whatever pace your CPU can do so.

As far as deadlines go, that's the project's business. Users (the tail) don't tell projects (the dog) when they need their work returned by.

As far as task runtime goes, there is a user-defined setting to reduce the default runtime, but seeing as I don't understand why you struggle to return any 8hr tasks within a 72hr deadline I'm reluctant to encourage it. It would also mean you get more, shorter tasks to run within your cache size, which seems to be the opposite of what you want, and which also defeats what you're complaining about.

> I run BOINC on my network services server, my backup server, my wife's graphics workstation (juicy dual GPUs that idle most of the day), and in the background on my laptop when it's on AC power. I, like most contributors, do not have a fluid-cooled Xeon server stack dedicated to crunching data.
>
> I am also a member of a half dozen other projects with which I have no issues (WCG, LHC, MLC, etc). I crank out 10-20 tasks per day across my computers on those projects. RAH on the other hand hasn't seen a single drop of work from me in almost 2 weeks because of this problem.

If you're permanently connected to the internet, there's no real reason to have anything more than BOINC's default offline cache size, which I think is 0.25 days, and let BOINC schedule which tasks you get from which project to meet the resource share you've set up. It would also reduce the number of tasks that come down to, I think, 1 default-length task per core for Rosetta.

> I guess my real question: is this project dead? Is it worth keeping RAH on my clients or should I put my cycles toward a project with cohesive administration?

I don't like to be blunt (this is a lie I often tell) but the administration that isn't down to BOINC is entirely down to you, so I don't think you're going to find "a more cohesive project". Just make your settings appropriate to the projects you run and the time you're prepared to allow your computers to run BOINC.

tl;dr reduce your cache size back to the default 0.25 days or less and I'm pretty sure all your "problems" go away.

Edit: the "Store up to an additional..." setting need be no longer than the default 0.1 days. I suspect the other half of your "problem" is that you have this set inappropriately too. Point being, if the combined figures come too close to the shortest deadline, your settings plan to fail, so change them so you plan to succeed instead.
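The cache arithmetic in the post above (0.5 days of cache, 8 cores, 8hr tasks, 72hr deadline) can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the cache is filled with cache days × 24 × cores worth of core-hours, that every task runs for exactly the target time, and that the host runs nothing else. BOINC's real work-fetch logic considers more than this, and the function name is made up for illustration.

```python
import math


def cache_plan(cache_days, cores, task_hours=8.0, deadline_hours=72.0):
    """Return (tasks cached, wall-clock hours to clear them, hours of slack before the deadline)."""
    core_hours = cache_days * 24 * cores           # assumed size of a full cache
    tasks = math.floor(core_hours / task_hours)    # whole tasks that fit in it
    rounds = math.ceil(tasks / cores)              # batches needed with one task per core
    busy_hours = rounds * task_hours
    return tasks, busy_hours, deadline_hours - busy_hours


if __name__ == "__main__":
    tasks, busy, slack = cache_plan(cache_days=0.5, cores=8)
    print(f"{tasks} tasks, {busy:.0f} hours to finish, {slack:.0f} hours to spare")
```

With the figures from the post this gives 12 tasks, 16 hours of crunching and 56 hours of slack. Raising the cache so that the busy time approaches the 72 hours is what leaves no margin for a machine that is switched off or overcommitted part of the time.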

wolfman1360 Joined: 18 Feb 17 Posts: 73 Credit: 19,103,702 RAC: 0	Message 102460 - Posted: 26 Aug 2021, 3:06:37 UTC

Several of these tasks are running for twice my set computation time, and not checkpointing to boot. I hope I get some sort of credit for these.

Application: Rosetta 4.20
Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676
State: Running
Received: 8/24/2021 1:28:51 PM
Report deadline: 8/27/2021 1:28:51 PM
Estimated computation size: 80,000 GFLOPs
CPU time: 16:45:42
CPU time since checkpoint: 16:45:42
Elapsed time: 16:44:31
Estimated time remaining: 01:14:20
Fraction done: 93.109%
Virtual memory size: 955.36 MB
Working set size: 802.86 MB
Directory: slots/3
Process ID: 2596365
Progress rate: 5.400% per hour
Executable: rosetta_4.20_x86_64-pc-linux-gnu

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 102461 - Posted: 26 Aug 2021, 4:25:22 UTC - in response to Message 102460.

> Several of these tasks are running for twice my set computation time, and not checkpointing to boot. I hope I get some sort of credit for these.
>
> Application: Rosetta 4.20
> Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676
> [snip]

Rosetta@Home tasks have sections known as decoys. The decision on whether to end the task normally occurs only at the end of a decoy. Your very long run time looks like you got at least one task with a very long time per decoy.

I have no information on whether checkpoints are also written only at the ends of decoys. However, if so, this is probably why you also had the long time with no checkpoints.

You might want to read the log file from that task to check whether it completed only one decoy. Also check if you can read the log files from any other tasks for that workunit. If all of them were that slow, expect to get some credit as long as you either returned it by the deadline, or returned it before the quorum was met.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 102462 - Posted: 26 Aug 2021, 6:35:13 UTC - in response to Message 102460.

> Several of these tasks are running for twice my set computation time, and not checkpointing to boot.

The default Target CPU time is 8 hours. There is a watchdog timer that kicks in 10 hours after the target time if a Task overruns it.

> I hope I get some sort of credit for these.

Credit is being given for them, although it is very, very, very low paying.

Grant
Darwin NT

MStenholm Joined: 18 Apr 20 Posts: 19 Credit: 37,685,814 RAC: 46	Message 102463 - Posted: 26 Aug 2021, 8:45:57 UTC - in response to Message 102460. Last modified: 26 Aug 2021, 8:46:33 UTC

> Several of these tasks are running for twice my set computation time, and not checkpointing to boot. I hope I get some sort of credit for these.
>
> Application: Rosetta 4.20
> Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676

I noticed 5 in the same series last night that had exceeded, or were about to exceed, the 8 hours. I aborted them and around 30 others as well. Today I noticed that I did get points for the ones that ran 6-10 hours, but only up to the 8 hours.

wolfman1360 Joined: 18 Feb 17 Posts: 73 Credit: 19,103,702 RAC: 0	Message 102469 - Posted: 26 Aug 2021, 16:50:52 UTC - in response to Message 102461.

>> Several of these tasks are running for twice my set computation time, and not checkpointing to boot. I hope I get some sort of credit for these.
>>
>> Application: Rosetta 4.20
>> Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676
>> [snip]
>
> Rosetta@Home tasks have sections known as decoys. The decision on whether to end the task normally occurs only at the end of a decoy. Your very long run time looks like you got at least one task with a very long time per decoy.
>
> I have no information on whether checkpoints are also written only at the ends of decoys. However, if so, this is probably why you also had the long time with no checkpoints.
>
> You might want to read the log file from that task to check whether it completed only one decoy. Also check if you can read the log files from any other tasks for that workunit. If all of them were that slow, expect to get some credit as long as you either returned it by the deadline, or returned it before the quorum was met.

Thank you, this is super helpful and I will do so. I don't think some of these tasks are going to complete in time for the deadline without checkpointing. I'm going to try and keep the client running, but they're also using pretty excessive amounts of RAM.

I thought the quorum for each task (number of machines to complete) needed to be 1? Or do you mean others, apart from myself, also get this task, in case I don't complete it first?

Kissagogo27 Joined: 31 Mar 20 Posts: 95 Credit: 3,781,450 RAC: 13	Message 102470 - Posted: 26 Aug 2021, 19:13:42 UTC

Hi, for me, with a set run time of 12h, some of them just run for 8h!

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 102472 - Posted: 26 Aug 2021, 23:20:29 UTC - in response to Message 102469. Last modified: 26 Aug 2021, 23:21:35 UTC

> [snip]
>
> Thank you, this is super helpful and I will do so. I don't think some of these tasks are going to complete in time for the deadline without checkpointing. I'm going to try and keep the client running, but they're also using pretty excessive amounts of RAM.
>
> I thought the quorum for each task (number of machines to complete) needed to be 1? Or do you mean others, apart from myself, also get this task, in case I don't complete it first?

The usual quorum used to be two, but has often been 1 lately. A quorum of 1 is adequate only for tasks for which some quick method of checking the output of the task is available.

If the quorum is 2, the first two sets of task output files returned must agree enough before they are considered validated. If they don't agree enough, one more task is sent out to determine which of the first two tasks is correct enough to be validated.

The purpose of the quorum is to check whether the task or tasks returned correct outputs, even if the task did not detect an error.

Sometimes, a workunit with an error in its input files will give some credit if other tasks for that same workunit agree on detecting the error.

Usually, the first group of tasks sent out has as many tasks as the quorum, so if the quorum is greater than one, at least one other person will also get a task for that workunit. For each task that goes past its deadline, one more task for that workunit will be sent out. You have a head start on any task sent due to another task reaching its deadline, and therefore some chance of still returning it in time.

If the tasks are using excessive amounts of RAM, you may need to tell BOINC to reduce the number of tasks it is allowed to run at the same time, so that the reduced number will fit in the amount of RAM you have available.

I normally keep my computer running and doing BOINC work day and night, so it can handle tasks that go over 24 hours between checkpoints.
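The quorum rule described above can be written down as a toy model in Python. This is a sketch of the idea only, not BOINC's or Rosetta@home's actual validator: the function name, the return convention and the numeric "agree" test are all assumptions made for illustration.

```python
def validate(results, quorum, agree):
    """Toy quorum check.

    results: outputs returned so far for one workunit, in arrival order.
    quorum:  number of mutually agreeing results needed for validation.
    agree:   function(a, b) -> bool, True when two outputs agree closely enough.

    Returns (canonical_result, extra_tasks_to_send). canonical_result is None
    while the workunit is still unvalidated.
    """
    if len(results) < quorum:
        return None, 0          # still waiting for the tasks already sent out
    for candidate in results:
        supporters = [r for r in results if agree(candidate, r)]
        if len(supporters) >= quorum:
            return candidate, 0  # enough results agree, so validate
    return None, 1               # disagreement: send one more task as a tie-breaker


if __name__ == "__main__":
    def close(a, b):
        return abs(a - b) < 0.01

    print(validate([1.000, 1.002], quorum=2, agree=close))         # two results agree
    print(validate([1.000, 2.000], quorum=2, agree=close))         # mismatch, one more task
    print(validate([1.000, 2.000, 2.001], quorum=2, agree=close))  # third result settles it
```

With a quorum of 1 the first returned result is accepted straight away, which is why that setting only makes sense when the project has some other quick way of checking the output.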

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 102473 - Posted: 26 Aug 2021, 23:25:15 UTC - in response to Message 102470.

> Hi, for me, with a set run time of 12h, some of them just run for 8h!

That's typical if the task finishes its list of possible decoys in 8 hours. It's also expected if, at the end of a decoy, the task calculates the time expected to do one more decoy and finds that it would put the total time too far past the time you set.
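The two cases described above (running out of decoys, or stopping because one more decoy would overshoot the target) can be illustrated with a small Python simulation. This is a guess at the general shape of the logic based only on the description in this thread, not Rosetta's actual code: it uses the average time per decoy so far as the estimate and allows no overshoot at all, whereas the real application apparently tolerates some.

```python
def run_task(decoy_hours, target_hours):
    """Simulate one task.

    decoy_hours:  hours each available decoy would take, in order (hypothetical input).
    target_hours: the target run time set in the preferences.

    Returns (decoys_completed, elapsed_hours).
    """
    elapsed = 0.0
    completed = 0
    for hours in decoy_hours:
        # A decoy always runs to completion; the stop decision is made only at its end.
        elapsed += hours
        completed += 1
        expected_next = elapsed / completed   # average time per decoy so far
        if elapsed + expected_next > target_hours:
            break                             # one more decoy would overshoot the target
    return completed, elapsed


if __name__ == "__main__":
    print(run_task([2.0] * 4, target_hours=12))   # decoy list exhausted before the target
    print(run_task([5.0] * 10, target_hours=12))  # stops early: a third decoy would overshoot
    print(run_task([16.5], target_hours=8))       # one very slow decoy runs far past the target
```

The last call mirrors the long-running task reported earlier in the thread: because the check happens only between decoys, a single slow decoy can carry the run time well past the target.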

Kissagogo27 Joined: 31 Mar 20 Posts: 95 Credit: 3,781,450 RAC: 13	Message 102477 - Posted: 27 Aug 2021, 11:09:11 UTC

OK, thanks ;)

Falconet Joined: 9 Mar 09 Posts: 355 Credit: 1,669,337 RAC: 0	Message 102479 - Posted: 27 Aug 2021, 11:42:37 UTC Last modified: 27 Aug 2021, 11:43:05 UTC

Funny, I'm running "degrader" units at Rosetta@home and also "degrader" units at Ralph@home. 2 of the Rosetta@home units finished very early, after 18 and 56 minutes, respectively.

wolfman1360 Joined: 18 Feb 17 Posts: 73 Credit: 19,103,702 RAC: 0	Message 102485 - Posted: 27 Aug 2021, 15:30:59 UTC - in response to Message 102472.

>> [snip]
>>
>> Thank you, this is super helpful and I will do so. I don't think some of these tasks are going to complete in time for the deadline without checkpointing. I'm going to try and keep the client running, but they're also using pretty excessive amounts of RAM.
>>
>> I thought the quorum for each task (number of machines to complete) needed to be 1? Or do you mean others, apart from myself, also get this task, in case I don't complete it first?
>
> The usual quorum used to be two, but has often been 1 lately. A quorum of 1 is adequate only for tasks for which some quick method of checking the output of the task is available.
>
> If the quorum is 2, the first two sets of task output files returned must agree enough before they are considered validated. If they don't agree enough, one more task is sent out to determine which of the first two tasks is correct enough to be validated.
>
> The purpose of the quorum is to check whether the task or tasks returned correct outputs, even if the task did not detect an error.
>
> Sometimes, a workunit with an error in its input files will give some credit if other tasks for that same workunit agree on detecting the error.
>
> Usually, the first group of tasks sent out has as many tasks as the quorum, so if the quorum is greater than one, at least one other person will also get a task for that workunit. For each task that goes past its deadline, one more task for that workunit will be sent out. You have a head start on any task sent due to another task reaching its deadline, and therefore some chance of still returning it in time.
>
> If the tasks are using excessive amounts of RAM, you may need to tell BOINC to reduce the number of tasks it is allowed to run at the same time, so that the reduced number will fit in the amount of RAM you have available.
>
> I normally keep my computer running and doing BOINC work day and night, so it can handle tasks that go over 24 hours between checkpoints.

Hi, I normally do too, on all but one. Of course that was the one that had these issues. The tasks ended up erroring out, though for some reason they displayed a vast amount of credit, over 400.

Thanks for the explanation. That clears things up.