Problems with Rosetta version 5.93

Message boards : Number crunching : Problems with Rosetta version 5.93

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50693 - Posted: 15 Jan 2008, 3:56:55 UTC

hedera, you are correct to expect only 2 tasks running at a time to be normal on that machine.

You can control the amount of memory you wish to allow BOINC to use for the WUs it is currently running. This is in the General Preferences, or the local preferences for each machine.
Rosetta Moderator: Mod.Sense
Profile hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,263,150
RAC: 59
Message 50695 - Posted: 15 Jan 2008, 5:07:51 UTC

OK, my current memory preferences are:

50% when computer is in use
90% when computer isn't in use

How would you advise me to trim that to keep 2 and only 2 WUs running? As far as I could tell from the BOINC manager console, when one of the WUs got above 90% (maybe above 95%), it began using enough less memory that Rosetta could launch another WU... I didn't see 3 WUs working unless at least one of them was in the high 90% completed range.
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.
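
For anyone wanting to reason about how the two memory percentages could translate into the number of running WUs, here is a rough sketch. It assumes BOINC simply stops starting tasks once their combined memory would exceed the allowed share of RAM. The RAM size and per-task memory figures are made-up numbers for illustration, not measurements from this machine or from Rosetta, and `tasks_that_fit` is a hypothetical helper, not part of BOINC.

```python
def tasks_that_fit(total_ram_mb, allowed_fraction, task_memory_mb):
    """Count how many tasks fit in the allowed share of RAM.

    Assumption: tasks are admitted in order until the next one would
    push total usage past the limit. Real BOINC scheduling is more involved.
    """
    budget = total_ram_mb * allowed_fraction
    used = 0.0
    count = 0
    for need in task_memory_mb:
        if used + need > budget:
            break
        used += need
        count += 1
    return count


# Hypothetical machine: 1024 MB RAM, three queued tasks of 300 MB each.
print(tasks_that_fit(1024, 0.50, [300, 300, 300]))  # in use: 512 MB budget
print(tasks_that_fit(1024, 0.90, [300, 300, 300]))  # idle: 921.6 MB budget
```

Under these assumed numbers, lowering the percentage lowers the task count, and a task that shrinks near the end of its run would free up room for another one to start, which matches the behaviour described above.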

Ananas

Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 50700 - Posted: 15 Jan 2008, 7:43:00 UTC
Last modified: 15 Jan 2008, 8:15:33 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=133324121

<message>Maximum disk usage exceeded
</message>

NTFS partition with 4GB (not compressed)
2.9GB free
800MB used by BOINC (total directory size, this includes 2 paused climate models)

BOINC is allowed 8GB or 100% and asked to leave 0.01GB free

which means that the Rosetta WU must have used ~2.8 GB when it crashed?

Or do Rosetta WUs come with a builtin disk usage limit different from the BOINC limit?

p.s.: The root path name is just d:\BOINC, so MAX_FNAME should not play a role, even though it is weird to include all that information in the filename.
Afaik, MAX_FNAME (and PATH_MAX) is 256/256 on NTFS and 128/143 on DOS, so the Rosetta filename (~150 characters including the path) would have violated the DOS pathname length but should still work under Win32.
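
The disk arithmetic above can be sketched as follows. This assumes the three BOINC disk preferences act as independent caps and that the tightest one wins; that is an assumption for illustration, not confirmed client behaviour, and `disk_headroom_gb` is a hypothetical helper. It also says nothing about whether a Rosetta WU carries its own separate limit, which is the open question.

```python
def disk_headroom_gb(partition_gb, free_gb, boinc_used_gb,
                     max_gb, max_fraction, min_free_gb):
    """Estimate how much more disk BOINC could use before hitting a limit.

    Assumption: the three preferences are independent caps and the
    tightest one applies.
    """
    by_absolute_cap = max_gb - boinc_used_gb
    by_fraction_cap = partition_gb * max_fraction - boinc_used_gb
    by_free_space = free_gb - min_free_gb
    return min(by_absolute_cap, by_fraction_cap, by_free_space)


# Figures from the post: 4 GB partition, 2.9 GB free, 800 MB used by BOINC,
# allowed 8 GB or 100%, asked to leave 0.01 GB free.
headroom = disk_headroom_gb(partition_gb=4.0, free_gb=2.9, boinc_used_gb=0.8,
                            max_gb=8.0, max_fraction=1.0, min_free_gb=0.01)
print(f"{headroom:.2f} GB")
```

With these inputs the free-space cap is the binding one, which is in line with the "~2.8 GB" estimate in the post.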
Viking69
Joined: 3 Oct 05
Posts: 20
Credit: 6,872,023
RAC: 769
Message 50704 - Posted: 15 Jan 2008, 9:09:08 UTC
Last modified: 15 Jan 2008, 9:17:05 UTC

Windows Vista: I couldn't get my BOINC manager to come up after I was away for 3 days. The PC is on 24/7.


I tried restarting the service (I always run BOINC as a service) and had no luck. I logged off with no change. I downloaded and installed 5.10.35 (I was on 5.10.30) and still had no luck. I looked into the slots folder and saw 4 slots that were Rosetta, but the folder said 'mini'. I deleted the slots folder with the service stopped (it prevented me from doing that with the service running), and I was then able to see the tasks board. The service is currently stopped so I can write down what I had in the queue: (3) 1zpy files and (1) BAKavsc3 file.

These are the only WUs that I have for Rosetta on my Vista box.
I will be starting the service as soon as I post this to see what happens.

**update**
After starting the service for BOINC again, 3 of the Rosettas uploaded and a 4th is currently processing. It is a 1zpy file. I seem to have gotten credit for the reported WU's, so they did finish without error.
M.L.

Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 50710 - Posted: 15 Jan 2008, 16:10:13 UTC

Task ID 133439620
Name 1zpy__BOINC_DEFAULT_SYMM_FOLD_AND_DOCK-1zpy_-native__2519_34438_0
Workunit 121403622
Created 14 Jan 2008 11:42:55 UTC
Sent 14 Jan 2008 11:43:40 UTC
Received 15 Jan 2008 14:07:22 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 717897
Report deadline 24 Jan 2008 11:43:40 UTC
CPU time 6261.875
stderr out <core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3628558
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -96.4799 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .xx1zpy.out

</stderr_txt>
]]>


Validate state Valid
Claimed credit 25.9837414273405
Granted credit 20
application version 5.93
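
The "Stuck at score ... for 900 seconds" message suggests a timer that resets whenever the score changes. Below is a minimal sketch of that idea. It is not Rosetta's actual watchdog code: the class name, the sampling interval, and the exact comparison are all assumptions made for illustration; only the 900-second limit and the score value come from the stderr output above.

```python
class StuckScoreWatchdog:
    """Flag a run whose score has not changed for `limit_seconds`."""

    def __init__(self, limit_seconds=900):
        self.limit_seconds = limit_seconds
        self.last_score = None
        self.last_change_time = None

    def observe(self, score, now):
        """Record a score sample; return True if the run looks stuck."""
        if self.last_score is None or score != self.last_score:
            # Score moved (or first sample): restart the stuck timer.
            self.last_score = score
            self.last_change_time = now
            return False
        return (now - self.last_change_time) >= self.limit_seconds


watchdog = StuckScoreWatchdog()
for t in range(0, 1000, 100):  # one sample every 100 s (assumed interval)
    if watchdog.observe(-96.4799, t):
        print(f"Stuck at score {watchdog.last_score} for {t} seconds")
        break
```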





Profile KWSN THE Holy Hand Grenade!

Joined: 3 May 07
Posts: 5
Credit: 2,542,452
RAC: 0
Message 50716 - Posted: 15 Jan 2008, 18:46:37 UTC

Is anyone else getting compute errors like this? (5.93, on Win XP Pro x64 and on Win XP Home, a different machine)

Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3348287
# cpu_run_time_pref: 14400
ERROR:: Exit from: .fullatom_energy.cc line: 2128


I've had about 8 WU's fail for this reason...
NickHan

Joined: 2 Jul 07
Posts: 4
Credit: 108,170
RAC: 0
Message 50718 - Posted: 15 Jan 2008, 20:16:32 UTC

Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. I stopped BOINC and restarted; the WU came up at 97%, and when computation restarted it reset to zero with 6 hours 20 remaining! Sigh. Any ideas?
Knorr

Joined: 18 Feb 06
Posts: 21
Credit: 373,953
RAC: 0
Message 50719 - Posted: 15 Jan 2008, 20:19:33 UTC

Had an invalid result

https://boinc.bakerlab.org/result.php?resultid=133554789

The watchdog didn't end the run at first.
It ran for more than 4 hrs, with a setting of 2 hrs.
I suspended the task, and then resumed it a bit later, and the task ended itself.

- Knorr
Luuklag

Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 50721 - Posted: 15 Jan 2008, 20:27:49 UTC

I don't have much time to post these days; school is asking too much from me at the moment, with too many things to finish. But I'm still having errors, a great many of them: 1 or 2 days ago, 4 or 5 WUs in a row, some of which triggered the watchdog. Thanks for letting me know that the sin/cosine thing is fairly common and that you're looking into it. Some more of these small posts will really boost morale.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50727 - Posted: 15 Jan 2008, 22:16:32 UTC - in response to Message 50718.  
Last modified: 15 Jan 2008, 22:20:25 UTC

Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. I stopped BOINC and restarted; the WU came up at 97%, and when computation restarted it reset to zero with 6 hours 20 remaining! Sigh. Any ideas?


Ideas? Yes, don't stop BOINC. Seriously.

The fact that your % complete reset to zero implies that no checkpoint was reached during the calculations. Some types of work are able to checkpoint very frequently, some are not.

The time to completion is an estimate, and not always a very accurate one. Some of the work they are sending out can take 5 or 6 hours to complete a single model (longer on a slower machine). This is especially true for the 1zpy's. If your preferred runtime is less than this, you will see an estimated time to completion of something under 10 minutes for any time over your preference. So if your preference is the default 3hrs, for example, it will show 10min to complete, with exponentially small reductions in that time for the last 2 or 3 hours of the model.
Rosetta Moderator: Mod.Sense
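
To illustrate the checkpoint point above, here is a minimal sketch. It assumes that only state written at a checkpoint survives a restart, so a restarted task falls back to its latest checkpoint, or to zero if it never reached one. The function name and the progress values are hypothetical; this is not how BOINC or Rosetta actually stores checkpoints.

```python
def progress_after_restart(checkpoints, progress_at_stop):
    """Return the progress a task resumes from after BOINC is restarted.

    Assumption: the task falls back to the latest checkpoint written at
    or before the moment it was stopped, or to 0.0 if there is none.
    """
    reached = [c for c in checkpoints if c <= progress_at_stop]
    return max(reached, default=0.0)


# A model that checkpoints often loses little work...
print(progress_after_restart([0.25, 0.50, 0.75, 0.95], 0.97))  # 0.95
# ...while one that never checkpointed starts over.
print(progress_after_restart([], 0.97))  # 0.0
```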
Ananas

Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 50730 - Posted: 15 Jan 2008, 23:34:15 UTC
Last modified: 15 Jan 2008, 23:40:29 UTC

No watchdog thing yet but a candidate (mgth-3-1sg9_a_w012_MolecularReplacement_2482_77037) :

file "farlxcheck" last touched 2.5 hours ago (96.60%), the BOF looks like this :
 286 LEU   67.29  165.85    0.00    0.00  chi_offsets
 287 THR   58.79   60.00    0.00    0.00  chi_offsets
 288 LEU  177.42   66.34    0.00    0.00  chi_offsets

the fraction of chi1 correct   133   246    0.54
the fraction of chi12 correct    41   200    0.20
the fraction of chi123 correct     3    74    0.04

Maybe this helps somehow.
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50774 - Posted: 17 Jan 2008, 17:51:05 UTC
Last modified: 17 Jan 2008, 17:52:41 UTC

Got another stuck one. See details in this post, except this time it restarted at 10 minutes instead of uploading immediately. Looks like I'm in "Babysitter mode" until this one finishes.
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50777 - Posted: 17 Jan 2008, 20:10:39 UTC
Last modified: 17 Jan 2008, 20:11:12 UTC

That stuck WU which restarted is resultid=133551161, which ended itself on this go-around. It was valid and credited (but not for the first wasted 2 hours spent on it, plus however long it was stuck for).

The stderr output says:

Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3171268
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -113.019 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
*** glibc detected *** corrupted double-linked list: 0x092683c0 ***
SIGABRT: abort called
Stack trace (18 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x8e0e444]
[0x8e2330f]
[0x8e27d01]
[0x8e28176]
[0x8e28653]
[0x8df90a1]
[0x8dfaac9]
[0x83c4cc5]
[0x8e0e98f]
[0x8d9fab7]
[0x8d9ff27]
[0x8d2023d]
[0x8d20f35]
[0x8d9a0c5]
[0x8e3aa1a]

Exiting...
No heartbeat from core client for 31 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3171268
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -89.0742 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
SIGSEGV: segmentation violation
Stack trace (22 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x89a1824]
[0x804c828]
[0x8a8ae99]
[0x8a8babf]
[0x8d0c170]
[0x8c12abe]
[0x8c14e33]
[0x804c7c2]
[0x8a835ed]
[0x8a8586f]
[0x89363de]
[0x89380e3]
[0x893ba27]
[0x898ad7a]
[0x85e96d6]
[0x87289d2]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...

</stderr_txt>
]]>

Hope something in all this ends with a fix at some point.
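
For anyone collecting these reports, the watchdog lines can be pulled out of a saved stderr dump with a small script. This is only a sketch: the regular expression is written against the message wording shown above and may not match other Rosetta versions, and `watchdog_events` is a hypothetical helper name.

```python
import re

# Matches lines such as "Stuck at score -113.019 for 900 seconds".
STUCK_RE = re.compile(r"Stuck at score (-?\d+(?:\.\d+)?) for (\d+) seconds")


def watchdog_events(stderr_text):
    """Return (score, seconds) for each watchdog 'Stuck at score' line."""
    return [(float(score), int(seconds))
            for score, seconds in STUCK_RE.findall(stderr_text)]


sample = """\
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -113.019 for 900 seconds
GZIP SILENT FILE: ./xx1zpy.out
Stuck at score -89.0742 for 900 seconds
"""
print(watchdog_events(sample))  # [(-113.019, 900), (-89.0742, 900)]
```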
Profile hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,263,150
RAC: 59
Message 50786 - Posted: 17 Jan 2008, 22:11:16 UTC

I notice 2 things today, which may simply mean I notice things slowly:

1. My system is running MUCH faster today. Yesterday I waited minutes for the screen to change.

2. BOINC is running Rosetta Beta 5.93.

I don't recall noticing that I had Rosetta Beta 5.93 before, am I just slow at noticing? Because it feels like something has changed. Was I simply running some very intensive WUs yesterday?? Today's memory usage is noticeably lower. Yesterday I was running these tasks:

https://boinc.bakerlab.org/rosetta/result.php?resultid=133780391
https://boinc.bakerlab.org/rosetta/result.php?resultid=133748830

I'm STILL running this task (it's about done), which has been going since sometime on the 15th:

https://boinc.bakerlab.org/rosetta/result.php?resultid=133728745

Are these tasks unusually complex or large??

--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50789 - Posted: 17 Jan 2008, 22:58:34 UTC

Your result 133748830 is a 1zpy. Yes, they take a long time to complete a single model. V5.93 has been out for some time. But depending on which WUs your machine is assigned, and how large a cache of work you keep, you may not have seen much work under v5.93 until now. But more likely you just hadn't noticed.

Rosetta Moderator: Mod.Sense
Mike.Gibson

Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 50790 - Posted: 18 Jan 2008, 0:40:51 UTC - in response to Message 50727.  

Thanks for this explanation. I had been dumping "stuck" 5.90s and was about to dump a "stuck" 5.93. As a result of your explanation, repeated below with the original question, I set a time of 10 hours in place of the default and lo & behold, after a while, the time to go shot up from 10 minutes to 5 hours meaning a total time of over 8 hours on a 3800+ dual-core with 1MB RAM! Also the progress dropped from 95% to about 35%. It is now going well.

Would it not be better to put out a message about the possible time increase and also to change the default from 3 hours to something more realistic? Presumably, this is only a few minutes' work to do and it would solve all these problems.

Apart from anything else, BOINC Manager needs to know how long these units can take in order to assess what units to obtain and also for assessing priorities. If something is going to take 3 times the expected time, it could cause other units/projects to default on time limits.

Regards

Mike
Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. I stopped BOINC and restarted; the WU came up at 97%, and when computation restarted it reset to zero with 6 hours 20 remaining! Sigh. Any ideas?


Ideas? Yes, don't stop BOINC. Seriously.

The fact that your % complete reset to zero implies that no checkpoint was reached during the calculations. Some types of work are able to checkpoint very frequently, some are not.

The time to completion is an estimate, and not always a very accurate one. Some of the work they are sending out can take 5 or 6 hours to complete a single model (longer on a slower machine). This is especially true for the 1zpy's. If your preferred runtime is less than this, you will see an estimated time to completion of something under 10 minutes for any time over your preference. So if your preference is the default 3hrs, for example, it will show 10min to complete, with exponentially small reductions in that time for the last 2 or 3 hours of the model.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50792 - Posted: 18 Jan 2008, 3:02:08 UTC

Mike, if everyone had the same time preference, and if all tasks had roughly the same time per model, what you say would certainly be done. But neither is the case. Some people want shorter times (and, yes, it would be nice if they never received a task that took longer than that, but it's not a perfect world). The mixture of work varies over time. The ratio of long to short model tasks varies. ...and you are correct, this can (and does) throw off the estimates and confuse BOINC about how much work to get.

The best way to get a fairly consistent and predictable completion time is to go with the 24hr maximum runtime preference. But, if your machine is only on 2 hours a day, it would take you more than 10 days to complete a task and it would never get returned before the deadline. ...there's always something. But if BOINC is running 24hrs a day anyway, then this will offer the most predictability for human, and BOINC.
Rosetta Moderator: Mod.Sense
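
The deadline arithmetic above works out as follows. The 10-day deadline is an assumption taken from the task posted earlier in this thread (sent 14 Jan, report deadline 24 Jan), and the sketch assumes Rosetta gets the CPU for every hour the machine is on; `days_to_finish` is a hypothetical helper.

```python
import math


def days_to_finish(runtime_hours, hours_on_per_day):
    """Whole calendar days needed to accumulate the target CPU time."""
    return math.ceil(runtime_hours / hours_on_per_day)


deadline_days = 10  # assumed from the 14 Jan -> 24 Jan task posted earlier
needed = days_to_finish(runtime_hours=24, hours_on_per_day=2)
print(needed, needed > deadline_days)  # 12 True
```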
Mike.Gibson

Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 50796 - Posted: 18 Jan 2008, 10:30:09 UTC - in response to Message 50792.  

I see where you are coming from. But if you take the 2-hours-a-day machine as an example, it will start the unit thinking it will finish within the deadline. When the 3 hours is up, a couple of days later, it then sticks on the 3 hours, no progress seems to be happening, and the time will be wasted when the unit is eventually aborted or the deadline passes. It would be far better for the true time to appear, so that the unit can be aborted before it starts if the deadline cannot be met. That way another, shorter unit can be run in its place, successfully.

Cheers

Mike

Mike, if everyone had the same time preference, and if all tasks had roughly the same time per model, what you say would certainly be done. But neither is the case. Some people want shorter times (and, yes, it would be nice if they never received a task that took longer than that, but it's not a perfect world). The mixture of work varies over time. The ratio of long to short model tasks varies. ...and you are correct, this can (and does) throw off the estimates and confuse BOINC about how much work to get.

The best way to get a fairly consistent and predictable completion time is to go with the 24hr maximum runtime preference. But, if your machine is only on 2 hours a day, it would take you more than 10 days to complete a task and it would never get returned before the deadline. ...there's always something. But if BOINC is running 24hrs a day anyway, then this will offer the most predictability for human, and BOINC.

AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 50805 - Posted: 18 Jan 2008, 21:54:42 UTC

This WU (on one of my Linux machines): https://boinc.bakerlab.org/rosetta/result.php?resultid=133853424
was ended by the watchdog for 900 seconds of no progress.

Then it bombed out giving a stack trace.

Then it bombed again with another stack trace.

Then it hung, showing 100% done and about an hour of CPU in the manager. The time in the manager wasn't changing and no CPU was being used.

It's clear that Rosetta still has the bug where the watchdog can't terminate a WU on a Linux machine without crashing.

So I decided to kill -9 the Rosetta process.

Boinc showed a message saying the WU exited with zero status but no "finished" file. Boinc restarted the WU.

Then the WU completed normally, with a "successful" and "valid" result.

:p
Mike.Gibson

Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 50812 - Posted: 19 Jan 2008, 0:39:21 UTC

As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time /24)

Another 7 hours have gone by and the progress % is still based on CPU time/24.

Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress.

Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours.

Does anyone know how long these will take, please?

Regards

Mike
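
The "CPU time / 24" pattern described above can be checked with a couple of lines. This is only the relationship observed in this post, not a confirmed Rosetta formula; the 7-hour and 29% figures are from the s099 unit, while the 9.1-hour figure is a guessed CPU time chosen to land on the reported 38%.

```python
def progress_percent(cpu_hours, preferred_runtime_hours):
    """Progress as CPU time over preferred runtime, capped at 100%."""
    return min(100.0, 100.0 * cpu_hours / preferred_runtime_hours)


print(round(progress_percent(7, 24)))    # 29, matching the s099 unit
print(round(progress_percent(9.1, 24)))  # 38; 9.1 h is a guessed CPU time
```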



©2024 University of Washington
https://www.bakerlab.org