Message boards : Number crunching : Problems with Rosetta version 5.93
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
hedera, you are correct to expect only 2 tasks running at a time to be normal on that machine. You can control the amount of memory you wish to allow BOINC to use for the WUs it is currently running. This is in the General Preferences, or the local preferences for each machine. Rosetta Moderator: Mod.Sense |
hedera Joined: 15 Jul 06 Posts: 76 Credit: 5,263,150 RAC: 59 |
OK, my current memory preferences are: 50% when the computer is in use, 90% when the computer isn't in use. How would you advise me to trim that to keep 2 and only 2 WUs running? As far as I could tell from the BOINC manager console, when one of the WUs got above 90% (maybe above 95%), it began using enough less memory that Rosetta could launch another WU... I didn't see 3 WUs working unless at least one of them was in the high 90% completed range. --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
Ananas Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=133324121 <message>Maximum disk usage exceeded </message> NTFS partition with 4GB (not compressed); 2.9GB free; 800MB used by BOINC (total directory size, this includes 2 paused climate models). BOINC is allowed 8GB or 100% and asked to leave 0.01GB free, which means that the Rosetta WU must have used ~2.8 GB when it crashed? Or do Rosetta WUs come with a built-in disk usage limit different from the BOINC limit? p.s.: The root path name is just d:BOINC, so MAX_FNAME should not play a role, even though it is weird to include all that information in the filename. Afaik, MAX_FNAME (and PATH_MAX) is 256/256 on NTFS and 128/143 on DOS, so the Rosetta filename (~150 characters including the path) would have violated the DOS pathname length but should still work under Win32. |
Viking69 Joined: 3 Oct 05 Posts: 20 Credit: 6,872,023 RAC: 769 |
Windows Vista: I couldn't get my BOINC manager to come up after I was away for 3 days. The PC is on 24/7. I tried restarting the service (always run as a service) and had no luck; I logged off with no change; I downloaded and installed 5.10.35 (I was on 5.10.30) and still no luck. I looked into the slots folder and saw that I had 4 that were Rosetta, but the folder said 'mini'. I deleted the slots folder with the service stopped (it prevented me from doing that with the service running) and I was then able to see the tasks board. The service is currently stopped so I can write down what I had in the queue: (3) 1zpy files and (1) BAKavsc3 file. These are the only WUs that I have for Rosetta on my Vista box. I will be starting the service as soon as I post this to see what happens. **update** After starting the service for BOINC again, 3 of the Rosettas uploaded and a 4th is currently processing. It is a 1zpy file. I seem to have gotten credit for the reported WUs, so they did finish without error. |
M.L. Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
Task ID 133439620 Name 1zpy__BOINC_DEFAULT_SYMM_FOLD_AND_DOCK-1zpy_-native__2519_34438_0 Workunit 121403622 Created 14 Jan 2008 11:42:55 UTC Sent 14 Jan 2008 11:43:40 UTC Received 15 Jan 2008 14:07:22 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 717897 Report deadline 24 Jan 2008 11:43:40 UTC CPU time 6261.875 stderr out <core_client_version>5.10.30</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 # random seed: 3628558 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score -96.4799 for 900 seconds ********************************************************************** GZIP SILENT FILE: .xx1zpy.out </stderr_txt> ]]> Validate state Valid Claimed credit 25.9837414273405 Granted credit 20 application version 5.93 |
KWSN THE Holy Hand Grenade! Joined: 3 May 07 Posts: 5 Credit: 2,542,452 RAC: 0 |
Is anyone else getting compute errors like this? (5.93, Win XP Pro x64 and Win XP Home (different machine)) Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 14400 # random seed: 3348287 # cpu_run_time_pref: 14400 ERROR:: Exit from: .fullatom_energy.cc line: 2128 I've had about 8 WUs fail for this reason... |
NickHan Joined: 2 Jul 07 Posts: 4 Credit: 108,170 RAC: 0 |
Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. Stopped BOINC and restarted; the WU came up at 97% and, when computation restarted, reset to zero with 6 hours 20 remaining! Sigh. Any ideas? |
Knorr Joined: 18 Feb 06 Posts: 21 Credit: 373,953 RAC: 0 |
Had an invalid result https://boinc.bakerlab.org/result.php?resultid=133554789 The watchdog didn't end the run at first. It ran for more than 4 hrs, with a setting of 2 hrs. I suspended the task, and then resumed it a bit later, and the task ended itself. - Knorr |
Luuklag Joined: 13 Sep 07 Posts: 262 Credit: 4,171 RAC: 0 |
I don't have much time to post these days; school is asking too much from me at the moment, too many things to finish. But I'm still having errors, a great many errors: 1 or 2 days ago, 4 or 5 WUs in a row, some of which triggered the watchdog. But thanks for letting me know the sin/cosine thing is fairly common and that you're looking into it; a few more of these small posts will really boost morale. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Quoting NickHan: "Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later 97% and 10 mins to go. Stopped BOINC and restarted the WU came up at 97% and when computation restarted reset to zero and 6 hours 20 remaining! Sigh Any ideas?" Ideas? Yes, don't stop BOINC. Seriously. The fact that your % complete reset to zero implies that no checkpoint was reached during the calculations. Some types of work are able to checkpoint very frequently, some are not. The time to completion is an estimate, and not always a very accurate one. Some of the work they are sending out can take 5 or 6 hours to complete a single model (longer on a slower machine). This is especially true for the 1zpy's. If your preferred runtime is less than this, you will see an estimated time to completion of something under 10 minutes for any time over your preference. So if your preference is the default 3hrs, for example, it will show 10min to complete, with exponentially small reductions in that time for the last 2 or 3 hours of the model. Rosetta Moderator: Mod.Sense |
Ananas Joined: 1 Jan 06 Posts: 232 Credit: 752,471 RAC: 0 |
No watchdog thing yet but a candidate (mgth-3-1sg9_a_w012_MolecularReplacement_2482_77037) : file "farlxcheck" last touched 2.5 hours ago (96.60%), the BOF looks like this : 286 LEU 67.29 165.85 0.00 0.00 chi_offsets 287 THR 58.79 60.00 0.00 0.00 chi_offsets 288 LEU 177.42 66.34 0.00 0.00 chi_offsets the fraction of chi1 correct 133 246 0.54 the fraction of chi12 correct 41 200 0.20 the fraction of chi123 correct 3 74 0.04 Maybe this helps somehow. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
got another stuck one. See details in this post, except this time it restarted at 10 minutes instead of uploading immediately. Looks like I'm in "Babysitter mode" until this one finishes. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
That stuck WU which restarted is resultid=133551161, which ended itself on this go-around. It was valid and credited (but not for the first wasted 2 hours spent on it, plus however long it was stuck for). The stderr says: Graphics are disabled due to configuration... # cpu_run_time_pref: 10800 # random seed: 3171268 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score -113.019 for 900 seconds ********************************************************************** GZIP SILENT FILE: ./xx1zpy.out *** glibc detected *** corrupted double-linked list: 0x092683c0 *** SIGABRT: abort called Stack trace (18 frames): [0x8da3037] [0x8d9de2c] [0xffffe500] [0x8e0e444] [0x8e2330f] [0x8e27d01] [0x8e28176] [0x8e28653] [0x8df90a1] [0x8dfaac9] [0x83c4cc5] [0x8e0e98f] [0x8d9fab7] [0x8d9ff27] [0x8d2023d] [0x8d20f35] [0x8d9a0c5] [0x8e3aa1a] Exiting... No heartbeat from core client for 31 sec - exiting FILE_LOCK::unlock(): close failed.: Bad file descriptor Graphics are disabled due to configuration... # cpu_run_time_pref: 10800 # random seed: 3171268 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! Stuck at score -89.0742 for 900 seconds ********************************************************************** GZIP SILENT FILE: ./xx1zpy.out SIGSEGV: segmentation violation Stack trace (22 frames): [0x8da3037] [0x8d9de2c] [0xffffe500] [0x89a1824] [0x804c828] [0x8a8ae99] [0x8a8babf] [0x8d0c170] [0x8c12abe] [0x8c14e33] [0x804c7c2] [0x8a835ed] [0x8a8586f] [0x89363de] [0x89380e3] [0x893ba27] [0x898ad7a] [0x85e96d6] [0x87289d2] [0x8728af2] [0x8e07384] [0x8048111] Exiting... </stderr_txt> ]]> Hope something in all this ends with a fix at some point. |
hedera Joined: 15 Jul 06 Posts: 76 Credit: 5,263,150 RAC: 59 |
I notice 2 things today, which may simply mean I notice things slowly: 1. My system is running MUCH faster today. Yesterday I waited minutes for the screen to change. 2. BOINC is running Rosetta Beta 5.93. I don't recall noticing that I had Rosetta Beta 5.93 before, am I just slow at noticing? Because it feels like something has changed. Was I simply running some very intensive WUs yesterday?? Today's memory usage is noticeably lower. Yesterday I was running these tasks: https://boinc.bakerlab.org/rosetta/result.php?resultid=133780391 https://boinc.bakerlab.org/rosetta/result.php?resultid=133748830 I'm STILL running this task (it's about done), which has been going since sometime on the 15th: https://boinc.bakerlab.org/rosetta/result.php?resultid=133728745 Are these tasks unusually complex or large?? --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Your result 133748830 is a 1zpy. Yes, they take a long time to complete a single model. V5.93 has been out for some time. But depending on which WUs your machine is assigned, and how large a cache of work you keep, you may not have seen much work under v5.93 until now. But more likely you just hadn't noticed. Rosetta Moderator: Mod.Sense |
Mike.Gibson Joined: 3 Nov 07 Posts: 19 Credit: 311,844 RAC: 0 |
Thanks for this explanation. I had been dumping "stuck" 5.90s and was about to dump a "stuck" 5.93. As a result of your explanation, repeated below with the original question, I set a time of 10 hours in place of the default and lo and behold, after a while, the time to go shot up from 10 minutes to 5 hours, meaning a total time of over 8 hours on a 3800+ dual-core with 1MB RAM! Also the progress dropped from 95% to about 35%. It is now going well. Would it not be better to put out a message about the possible time increase and also to change the default from 3 hours to something more realistic? Presumably, this is only a few minutes' work to do and it would solve all these problems. Apart from anything else, BOINC Manager needs to know how long these units can take in order to assess what units to obtain and also for assessing priorities. If something is going to take 3 times the expected time, it could cause other units/projects to default on time limits. Regards, Mike. Quoting NickHan: "Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later 97% and 10 mins to go. Stopped BOINC and restarted the WU came up at 97% and when computation restarted reset to zero and 6 hours 20 remaining! Sigh Any ideas?" |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Mike, if everyone had the same time preference, and if all tasks had roughly the same time per model, what you say would certainly be done. But neither is the case. Some people want shorter times (and, yes, it would be nice if they never received a task that took longer than that, but it's not a perfect world). The mixture of work varies over time. The ratio of long to short model tasks varies. ...and you are correct, this can (and does) throw off the estimates and confuse BOINC about how much work to get. The best way to get a fairly consistent and predictable completion time is to go to the 24hr maximum runtime preference. But, if your machine is only on 2 hours a day, it would take you more than 10 days to complete a task and it would never get returned before the deadline. ...there's always something. But if BOINC is running 24hrs a day anyway, then this will offer the most predictability for human and BOINC. Rosetta Moderator: Mod.Sense |
Mike.Gibson Joined: 3 Nov 07 Posts: 19 Credit: 311,844 RAC: 0 |
I see where you are coming from, but if you take the 2-hours-a-day machine as an example, it will start the unit thinking it will finish within the deadline, but when the 3 hours is up, a couple of days later, it then sticks on the 3 hours, no progress seems to be happening, and the time will be wasted when the unit is eventually aborted or the deadline passes. It is far better for the true time to appear, so that the unit can be aborted before it starts if the deadline cannot be met. That way another, shorter unit can be run in its place, successfully. Cheers, Mike. Quoting Mod.Sense: "Mike, if everyone had the same time preference, and if all tasks had roughly the same time per model, what you say would certainly be done. But neither is the case. Some people want shorter times (and, yes, it would be nice if they never received a task that took longer than that, but it's not a perfect world). The mixture of work varies over time. The ratio of long to short model tasks varies. ...and you are correct, this can (and does) throw off the estimates and confuse BOINC about how much work to get." |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
This WU (on one of my Linux machines): https://boinc.bakerlab.org/rosetta/result.php?resultid=133853424 was ended by the watchdog for 900 seconds of no progress. Then it bombed out giving a stack trace. Then it bombed again with another stack trace. Then it hung, showing 100% done and about an hour of CPU in the manager. The time in the manager wasn't changing and no CPU was being used. It's clear that Rosetta still has the bug where the watchdog can't terminate a WU on a Linux machine without crashing. So I decided to kill -9 the Rosetta process. Boinc showed a message saying the WU exited with zero status but no "finished" file. Boinc restarted the WU. Then the WU completed normally, with a "successful" and "valid" result. :p |
Mike.Gibson Joined: 3 Nov 07 Posts: 19 Credit: 311,844 RAC: 0 |
As a follow-up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38 (i.e. CPU time / 24). Another 7 hours have gone by and the progress % is still based on CPU time / 24. Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress. Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours. Does anyone know how long these will take, please? Regards, Mike |
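
The runtime arithmetic discussed in the posts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how BOINC or Rosetta actually compute their estimates: it assumes progress is simply CPU time divided by the preferred runtime (as Mike.Gibson observed after switching to 24 hours) and a 10-day report deadline (as in the task details posted by M.L.). All function names are hypothetical.

```python
# Hypothetical helpers for illustration only; neither BOINC nor Rosetta
# exposes functions with these names.

def progress_fraction(cpu_hours: float, runtime_pref_hours: float) -> float:
    """Assumed progress model: CPU time / preferred runtime, capped at 1.0."""
    return min(cpu_hours / runtime_pref_hours, 1.0)


def days_to_finish(runtime_pref_hours: float, hours_on_per_day: float) -> float:
    """Calendar days needed if the machine only crunches a few hours a day."""
    return runtime_pref_hours / hours_on_per_day


def meets_deadline(runtime_pref_hours: float,
                   hours_on_per_day: float,
                   deadline_days: float = 10.0) -> bool:
    """True if the task would be done within the (assumed) 10-day deadline."""
    return days_to_finish(runtime_pref_hours, hours_on_per_day) <= deadline_days


if __name__ == "__main__":
    # 7 hours of CPU time against a 24-hour runtime preference.
    print(f"progress: {progress_fraction(7, 24):.0%}")
    # A machine that is on 2 hours a day with a 24-hour runtime preference.
    print(f"days needed: {days_to_finish(24, 2)}")
    print(f"meets 10-day deadline: {meets_deadline(24, 2)}")
```

Under these assumptions, 7 hours of CPU time against a 24-hour preference gives about 29% progress, which matches the s099 unit described above, and a machine that is on 2 hours a day would need 12 days for a 24-hour task, so it would miss a 10-day deadline.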
©2024 University of Washington
https://www.bakerlab.org