Computation errors and checkpoint failures

shanen (Joined: 16 Apr 14, Posts: 195, Credit: 12,662,308, RAC: 0)
Message 76936 - Posted: 1 Jul 2014, 0:47:35 UTC

Yes, I already looked at the FAQ and searched it for all of the relevant keywords I could imagine. There was some vaguely confusing stuff about when the checkpoints actually occur, but nothing like a clear explanation (that conformed with my observations), and not even a reference in the FAQ to the computation errors.

Let me deal with the trivial question first, computation errors: The early computation errors don't bother me, but sometimes I've noticed computation errors that terminate work units after several hours of work. That doesn't even bother me that much, but I do feel like I should get some credit for the effort. In other words, I think the computation errors are bugs in your code, and I shouldn't be penalized for them...

The more complicated problem is with work units that are apparently unable to checkpoint. I saw an excellent example a few days ago. Some of my computers are only used for an hour or two at a time, and the work unit in question needed something like 3 hours in a block. Every time the computer booted, the work unit fell back to its early phase (not sure exactly how far), and after about a week it was clear that it was only going to be stopped by its deadline, if that, so I made the special effort to keep the computer running until it had finished it. I saved the image (with the full WU name), but don't know of any mechanism to report the problem. (Again, it's apparently a bug, but a minor one.)

Anyway, overall I like the low-key atmosphere and loose deadlines of this project. I've already dropped (at least) two BOINC projects and the original seti@home project (where I had been top 1%) because they created too many hassles.

I still think the deadlines are stupid and none of my concern, but at least this project doesn't use them in an excessively annoying way, so I guess I'll hang around, even if some of the code seems a bit buggy.

Sid Celery (Joined: 11 Feb 08, Posts: 2196, Credit: 41,873,741, RAC: 13,172)
Message 76939 - Posted: 1 Jul 2014, 4:46:30 UTC

This is a regular area of confusion. All the tasks that error out at validation are given credit after a day or so, but it just doesn't show on the tasks list. If you check within the task, you're credited with the claimed amount for the time spent.

Regarding tasks that don't checkpoint, this is another pain in the posterior. If you need to switch your machine off or reboot and have tasks of this type, it's probably best to abort them. You'll be credited for all completed decoys if there are any.

Hope that sort of helps.

Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)
Message 76940 - Posted: 1 Jul 2014, 5:50:46 UTC - in response to Message 76936.

shanen wrote: "I saved the image (with the full WU name), but don't know of any mechanism to report the problem. (Again, it's apparently a bug, but a minor one.)"

There is an error report thread for the current version of the Rosetta software at the top of the Number Crunching forum (Minirosetta 3.52 at the moment). If you post the task ID and a brief description of the problem, the scientists are more likely to spot the issue and investigate.

A link to the same thread is usually available in the "News" section of the Home page, unless more recent news items have pushed it off the page.

Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
Message 76945 - Posted: 2 Jul 2014, 1:28:54 UTC - in response to Message 76936.

shanen wrote: "Some of my computers are only used for an hour or two at a time, and the work unit in question needed something like 3 hours in a block. Every time the computer booted, the work unit fell back to its early phase (not sure exactly how far), and after about a week it was clear that it was only going to be stopped by its deadline, if that, so I made the special effort to keep the computer running until it had finished it. I saved the image (with the full WU name), but don't know of any mechanism to report the problem. (Again, it's apparently a bug, but a minor one.)"

A machine that is only running BOINC for a couple of hours at a time is not ideal for R@h. While there are many tasks that checkpoint every 5-15 minutes, there are also some that take over an hour.

I wanted to point out that R@h checks each task as it restarts, and if it has restarted from the same point more than... I think it's on the 5th time, it will be marked as completed and sent back rather than keep trying. This is true regardless of how close it might be to the deadline.

I appreciate your willingness to save the WU data, but it's not really necessary. If you post the WU ID(s), preferably in the thread on the Number Crunching board for the R@h application release that ran the WU, it provides all of the detail that should be required to recreate the problem. And oftentimes the specific model that you happened to crunch is not the only one running long, so the developers have probably already noticed this. Over time they are often able to make further improvements and make the runtimes per model more consistent. Sometimes they are trying out a rather new approach to things and know that they will not be submitting a large number of the tasks, and so the results are worth the effort.

shanen wrote: "I still think the deadlines are stupid and none of my concern, but at least this project doesn't use them in an excessively annoying way, so I guess I'll hang around, even if some of the code seems a bit buggy."

I'm glad you are staying around. The deadlines basically just keep the databases cleared of ties to tasks that are not very likely to ever be returned. They also allow the task to be resent to another machine, which can provide some confirmation about whether there are problems with completing some specific tasks, or specific types of machines that are not processing specific tasks very well (perhaps they use more memory than average, for example, and this causes them to run longer, especially on machines with less memory).

Rosetta Moderator: Mod.Sense
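
Editorial sketch: the restart check described in the post above can be modeled roughly as follows. This is only an illustration of the rule as stated, not the actual Rosetta@home code. The threshold of 5 comes from Mod.Sense's recollection ("I think it's on the 5th time"), and the names TaskState, on_restart, and MAX_SAME_POINT_RESTARTS are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: threshold taken from the post's "I think it's on the 5th time".
MAX_SAME_POINT_RESTARTS = 5


@dataclass
class TaskState:
    """Hypothetical bookkeeping for one task; not the real R@h data structure."""
    last_restart_point: Optional[int] = None  # progress marker seen at the previous restart
    same_point_restarts: int = 0              # consecutive restarts from that same marker
    ended_early: bool = False                 # True once the task is marked complete and sent back


def on_restart(state: TaskState, restart_point: int) -> bool:
    """Record one restart; return True if the task should be ended and reported."""
    if restart_point == state.last_restart_point:
        state.same_point_restarts += 1
    else:
        # The task got further than last time, so start counting again.
        state.last_restart_point = restart_point
        state.same_point_restarts = 1
    if state.same_point_restarts >= MAX_SAME_POINT_RESTARTS:
        state.ended_early = True
    return state.ended_early
```

Under this model a task that always resumes from the same point is ended on its fifth restart, while a task that makes any progress between restarts resets the count.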

Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)
Message 76949 - Posted: 2 Jul 2014, 20:35:47 UTC - in response to Message 76945.

Mod.Sense wrote: "I wanted to point out that R@h checks each task as it restarts, and if it has restarted from the same point more than... I think it's on the 5th time, it will be marked as completed and sent back rather than keep trying. This is true regardless of how close it might be to the deadline."

That isn't always the case. Over the years I have encountered a very small number of tasks that failed to checkpoint and restarted each time I rebooted the computer. I have to switch my system off most nights, so 5 restarts would occur in 5 days. However, I found some of those tasks running several weeks later, long past their deadlines.

What you said is probably how they are supposed to behave, but a tiny minority of tasks need to be aborted manually.

Sid Celery (Joined: 11 Feb 08, Posts: 2196, Credit: 41,873,741, RAC: 13,172)
Message 76950 - Posted: 3 Jul 2014, 2:36:34 UTC - in response to Message 76949.

Mod.Sense wrote: "I wanted to point out that R@h checks each task as it restarts, and if it has restarted from the same point more than... I think it's on the 5th time, it will be marked as completed and sent back rather than keep trying. This is true regardless of how close it might be to the deadline."

Murasaki wrote: "That isn't always the case. Over the years I have encountered a very small number of tasks that failed to checkpoint and restarted each time I rebooted the computer. I have to switch my system off most nights, so 5 restarts would occur in 5 days. However I found some of those tasks running several weeks later, long past their deadlines. What you said is probably how they are supposed to behave, but a tiny minority of tasks need to be aborted manually."

I can confirm that. I suspect these are tasks that never even record a first checkpoint and would normally be closed by the watchdog at "runtime +4 hours" if they were allowed to get that far. They're usually ones that only receive the default 20.0 credits, from my observations, so aborting them makes most sense of all.
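
Editorial sketch: the "runtime +4 hours" watchdog rule mentioned above, and why a task that never checkpoints might not reach it on a machine that is switched off regularly. This is a rough model under stated assumptions, not the project's code: "runtime" is read here as the task's target runtime, the run-time counter is assumed to fall back to zero when a task restarts without a checkpoint, and the function names are hypothetical.

```python
from datetime import timedelta
from typing import Iterable

# The "+4 hours" figure is from the post; everything else here is an assumption.
WATCHDOG_GRACE = timedelta(hours=4)


def watchdog_should_end(run_time: timedelta, target_runtime: timedelta) -> bool:
    """True once the task has run past its target runtime plus the grace period."""
    return run_time > target_runtime + WATCHDOG_GRACE


def ended_by_watchdog(sessions: Iterable[timedelta], target_runtime: timedelta) -> bool:
    """Model a task that never checkpoints across several power-on sessions.

    Assumption: without a checkpoint the run-time counter restarts from zero
    in every session, so only the length of a single session counts.
    """
    return any(watchdog_should_end(s, target_runtime) for s in sessions)
```

With a 3 hour target, the limit in this model is 7 hours of uninterrupted running, so a machine used for an hour or two at a time would never get the task that far.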

duftkerze (Joined: 7 Jul 06, Posts: 2, Credit: 692,624, RAC: 0)
Message 76952 - Posted: 3 Jul 2014, 16:27:49 UTC

https://boinc.bakerlab.org/result.php?resultid=671762097

What's wrong? Compute error on Win7 x64. No problem with this WU type under Linux.

shanen (Joined: 16 Apr 14, Posts: 195, Credit: 12,662,308, RAC: 0)
Message 76957 - Posted: 5 Jul 2014, 2:55:05 UTC

Thanks for the informative replies, and I'll try to report that stuck work unit next.

I'll just add that there definitely seems to be a pattern for every work unit that starts with pd1 to have a computation error after about 1 minute of runtime. My largest machine usually has quite a few work units queued up, so I tested it yesterday, and every one of the pd1 work units crashed quickly (and I just crashed another this morning, but it was the only one waiting). That subproject seems to have a serious glitch in the code...

However, I still think that much of this thread could have been better covered in the FAQ. I had also spent a smaller amount of time scanning these forums, but I didn't spot the relevant comments, though some of the comments in this thread make it sound as though the two topics I mentioned have been discussed before.
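
Editorial sketch: one way to check a pattern like the pd1 one described above is to tally failures by work unit name prefix. This is a minimal illustration only. It assumes work unit names are split by underscores and that the prefix is the text before the first one; the function name and the sample names in the usage lines are made up, not real task names.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def error_rate_by_prefix(tasks: Iterable[Tuple[str, bool]], sep: str = "_") -> Dict[str, float]:
    """Return the fraction of failed tasks for each name prefix.

    tasks: (work_unit_name, failed) pairs.
    sep:   assumed separator; the prefix is everything before its first occurrence.
    """
    total: Counter = Counter()
    failed: Counter = Counter()
    for name, did_fail in tasks:
        prefix = name.split(sep, 1)[0]
        total[prefix] += 1
        if did_fail:
            failed[prefix] += 1
    return {prefix: failed[prefix] / total[prefix] for prefix in total}


# Hypothetical sample data, for illustration only.
sample = [
    ("pd1_example_a", True),
    ("pd1_example_b", True),
    ("other_example_c", False),
    ("other_example_d", True),
]
rates = error_rate_by_prefix(sample)
```

A prefix whose rate sits near 1.0 while others stay low points at a bad batch rather than at the local machine, which is the distinction the later replies in this thread turn on.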

Usuario1_S (Joined: 24 Mar 14, Posts: 92, Credit: 3,059,705, RAC: 0)
Message 77048 - Posted: 20 Jul 2014, 16:09:51 UTC
Last modified: 20 Jul 2014, 16:25:31 UTC

A few weeks ago I had BOINC Client 7.0.xx (I don't remember which), upgraded it to the current latest, 7.2.42, and started getting computation errors. I searched the forums and found that for version 7.0.26 or so computation errors were solved and gone, so I downgraded to the latest older version suggested on BOINC Download, 7.0.64. Same errors. I also found that version 6 was best, so I found this:

http://boinc.berkeley.edu/dl/?C=N;O=A

where you can download any version. This should be stickied somewhere. I downloaded the latest v.6, 6.13.12, and have worked a few WUs with no errors, so it looks like I'm in the clear now.

If there is a v.7.0.xx that doesn't get errors, it should be stickied too, and the 'Download BOINC' link shouldn't point to the latest BOINC version but to the latest functional one, which logic suggests is 6.13.12:

boinc_6.13.12_i686-apple-darwin.zip        11-Nov-2011 16:18  609K
boinc_6.13.12_i686-pc-linux-gnu.sh         14-Nov-2011 11:53  2.6M
boinc_6.13.12_macOSX_SymbolTables.zip      11-Nov-2011 16:19  2.7M
boinc_6.13.12_macOSX_i686.zip              11-Nov-2011 16:18  5.3M
boinc_6.13.12_windows_intelx86.exe         11-Nov-2011 10:05  7.6M
boinc_6.13.12_windows_x86_64.exe           11-Nov-2011 10:08  8.5M
boinc_6.13.12_x86_64-pc-linux-gnu.sh       14-Nov-2011 10:27  2.6M

Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)
Message 77049 - Posted: 20 Jul 2014, 16:45:54 UTC - in response to Message 77048.
Last modified: 20 Jul 2014, 16:46:38 UTC

Usuario1_S wrote: "A few weeks ago I had BOINC Client 7.0.xx (don't remember), upgraded it to the current latest 7.2.42 and started getting computation errors, I searched the forums and found that for version 7.0.26 or so computation errors were solved and gone..."

A lot of computation errors are caused by faults in the task design. As Rosetta runs several different tasks each week, it is quite likely that you will encounter errors sooner or later no matter which version of the BOINC software you run.

It is best to check if similar problems have been reported on the Number Crunching forum before deciding whether or not to change your BOINC version.

Usuario1_S (Joined: 24 Mar 14, Posts: 92, Credit: 3,059,705, RAC: 0)
Message 77053 - Posted: 22 Jul 2014, 9:58:09 UTC - in response to Message 77049.

I have now finished 50+ WUs on v.6.13.12 with no errors, and the WUs were about the same (I checked the names quickly). I reset the project on each downgrade. Empirical evidence suggests I was right, as were the other empirical solutions I based mine on, from other people experimenting with different versions. More reference is always good, but if you look at my explanation in my previous post, the problem is clearly in v.7.

Murasaki (Joined: 20 Apr 06, Posts: 303, Credit: 511,418, RAC: 0)
Message 77059 - Posted: 22 Jul 2014, 17:45:12 UTC - in response to Message 77053.

Not necessarily. There has been a recent batch of bad work units that have affected everyone, but they have been mostly cleared out now. I run version 7.2.42 (x64), and other than the recent bad batch my system has been perfectly stable.

If version 6 works for you then great. All I am saying is that other readers of this thread should check for error reports from other users before jumping to the conclusion that it is their own system at fault.

Usuario1_S (Joined: 24 Mar 14, Posts: 92, Credit: 3,059,705, RAC: 0)
Message 77385 - Posted: 22 Aug 2014, 8:06:52 UTC - in response to Message 77059.

Fair enough, I'll give V7 another try. I have recently downloaded the AMD OpenCL driver for my CPU, and V7/Rosetta might make better use of it than V6.