Minirosetta v1.40 bug thread

Sid Celery
Joined: 11 Feb 08
Posts: 1851
Credit: 34,017,040
RAC: 3,882
Message 57504 - Posted: 2 Dec 2008, 18:53:31 UTC - in response to Message 57415.  

As I'm out of work (just with results to upload) I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.

*All* the lock files? Where do they accumulate? Do they accumulate after every job, or only after failed ones? This might be a lead toward solving this silly lockfile problem!

In fact I spoke too soon before checking. The last time this was mentioned there were numerous 0-byte boinc_lockfiles in C:\ProgramData\BOINC\slots (and folders 1, 2, 3, 4 etc) - under Vista64 btw.

This time the slots folder was empty, so no lockfiles, even though I got many WUs with too many errors after repeated "Can't acquire lockfile" messages. I'd been away from home 11/27 to 11/30

See my results

Final note, because I'm now officially depressed:
After uploading all previous results, changing server urls, resetting the project, dl'ing new WUs, my first 4 MiniRosetta WUs all crashed out in the usual way between 10 and 100 minutes. Can't acquire lockfile.

I now have 7 folders inside the slots folder (named 0, 1, 2, 3, 4, 5 & 6) four of which contain a 0-byte boinc_lockfile, while only 2 mini-rosetta WUs are currently running.

I guess I should've let those WUs abort with the usual Computation Error so they could report properly, but I was that p'd off I aborted them to let some infallible Rosetta 5.98 WUs run.
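
For anyone checking the slots folder by hand, the search described above could be scripted. This is only a minimal sketch: the function name `find_empty_lockfiles` is made up for illustration, the path is the assumed Vista default mentioned in this post, and the script only lists files. It deletes nothing, and any cleanup is best done with BOINC shut down.

```python
from pathlib import Path


def find_empty_lockfiles(slots_dir):
    """Return 0-byte boinc_lockfile paths found one level below slots_dir."""
    found = []
    for slot in sorted(Path(slots_dir).iterdir()):
        if not slot.is_dir():
            continue  # only numbered slot folders (0, 1, 2, ...) are of interest
        lock = slot / "boinc_lockfile"
        if lock.is_file() and lock.stat().st_size == 0:
            found.append(lock)
    return found


if __name__ == "__main__":
    # Assumed location of the BOINC data directory; adjust for your install.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        for lock in find_empty_lockfiles(slots):
            print(lock)  # listing only; nothing is removed
```

Comparing the printed list against the number of tasks actually running shows whether any slot folder holds a leftover lockfile.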
ID: 57504 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5528
Credit: 5,532,664
RAC: 1,305
Message 57506 - Posted: 2 Dec 2008, 20:38:54 UTC

just delete those lockfiles and you should be able to get back on your way again.
hopefully the new work you get will not contain these problems.
i saw a while back that they were going to look into that problem and fix it.
ID: 57506 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 57511 - Posted: 2 Dec 2008, 21:04:43 UTC

Instant remedy for getting out of lockfile depression - have a change of scenery - go over to RALPH - about 30,000 work units at last count ready to send!
ID: 57511 · Rating: 0
A Few Good Men
Joined: 25 Mar 07
Posts: 14
Credit: 2,031,382
RAC: 0
Message 57544 - Posted: 3 Dec 2008, 14:09:01 UTC

Result task IDs for the last 12 hours of Rosetta after resetting the client.

All Client Errors

211601578
211514071
211514058
211512399
211512310
211512309
211512308
211512306
211512305

Compute Errors

211522797
211514072

Please Advise.
ID: 57544 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5528
Credit: 5,532,664
RAC: 1,305
Message 57547 - Posted: 3 Dec 2008, 15:08:06 UTC - in response to Message 57544.  

Result task IDs for the last 12 hours of Rosetta after resetting the client.

All Client Errors

211601578
211514071
211514058
211512399
211512310
211512309
211512308
211512306
211512305

Compute Errors

211522797
211514072

Please Advise.



looks like it's a whole load of defective tasks. 2 different systems bombed them.
it is also possible that if your system is being OC'd, your speed is too fast for Rosetta to handle. I was working with my OC percentage last night and crashed a whole bunch. Some of the tasks were successful with other users and some of them crashed again. Keep an eye on your current tasks and see if they crash with the same kind of error code. If you're running OC'd, lower your speed a little bit to see where the threshold is for Rosetta. 5-10 MHz can make the difference between a success and a crash.
ID: 57547 · Rating: 0
A Few Good Men
Joined: 25 Mar 07
Posts: 14
Credit: 2,031,382
RAC: 0
Message 57550 - Posted: 3 Dec 2008, 15:49:06 UTC

I'll do a run at stock CPU, RAM and FSB values. Thanks.
ID: 57550 · Rating: 0
Dave Mickey
Joined: 29 Dec 07
Posts: 33
Credit: 4,136,957
RAC: 0
Message 57577 - Posted: 4 Dec 2008, 2:43:36 UTC

Just another data point - still have 1.40 tasks that
do not respond to BOINC's command to suspend.

This is not fixed yet.

Dave
ID: 57577 · Rating: 0
DJStarfox
Joined: 19 Jul 07
Posts: 142
Credit: 1,188,182
RAC: 1
Message 57578 - Posted: 4 Dec 2008, 4:42:23 UTC

This is just ridiculous. Rosetta Mini 1.40 on Linux does NOT obey the BOINC API to suspend the task. I think it does this whenever it's creating the first decoy in the simulation.

Other than a dedicated server, this makes it really hard to let Rosetta run on a workstation box. No new work for me until you fix this and I hear back.
ID: 57578 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1191
Credit: 13,197,173
RAC: 156
Message 57580 - Posted: 4 Dec 2008, 4:58:15 UTC - in response to Message 57578.  

This is just ridiculous. Rosetta Mini 1.40 on Linux does NOT obey the BOINC API to suspend the task. I think it does this whenever it's creating the first decoy in the simulation.

Other than a dedicated server, this makes it really hard to let Rosetta run on a workstation box. No new work for me until you fix this and I hear back.


Suggestion for at least a partial fix: allow it to suspend even during the first decoy if the "leave in memory" option is selected, as long as paging to the swapfile won't hurt.
ID: 57580 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5528
Credit: 5,532,664
RAC: 1,305
Message 57581 - Posted: 4 Dec 2008, 9:00:45 UTC

what's with all this text that shows up in the stderr output?

recovering checkpoint of tag S_U12X5X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U12X5X_00000001 with id stage_1
recovering checkpoint of tag S_U12X5X_00000001 with id stage_2


this keeps showing up in a lot of tasks. the tasks complete ok.
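
To see how often these lines appear per task, the stderr text can be grouped by tag. The wording of the lines suggests the task is resuming from saved checkpoints, but that reading is an assumption here, and the function name `checkpoints_by_tag` is made up for illustration. A minimal sketch:

```python
import re
from collections import defaultdict

# Matches lines of the form quoted above:
# recovering checkpoint of tag <tag> with id <id>
LINE = re.compile(r"recovering checkpoint of tag (\S+) with id (\S+)")


def checkpoints_by_tag(stderr_text):
    """Map each tag to the list of checkpoint ids reported for it, in order."""
    stages = defaultdict(list)
    for match in LINE.finditer(stderr_text):
        stages[match.group(1)].append(match.group(2))
    return dict(stages)
```

Pasting a task's stderr text into this function gives one entry per tag, which makes it easier to compare tasks that completed with ones that did not.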
ID: 57581 · Rating: 0
Tony
Joined: 12 Dec 05
Posts: 7
Credit: 6,724,341
RAC: 0
Message 57586 - Posted: 4 Dec 2008, 15:58:24 UTC

I think it is not only on Linux that minirosetta doesn't suspend. It seems to be like an unruly child that will not mind. In Windows, start the Task Manager to see all running processes and sort by CPU usage. Seems some of the processes obey but some keep running after a snooze or suspend. Restart seems to make it behave. I think it may be errors that will not let it stop the running task. I seem to be having lots of errors on three different computers I just started crunching with.
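
One way to make the Task Manager check less of a judgment call is to note each process's cumulative CPU time just after suspending and again a minute or so later. A process that really stopped should show almost no change. The sketch below only does the comparison. The function name, the tolerance, and the sample numbers are all made up for illustration, and the readings have to be collected by hand, for example from Task Manager's CPU Time column.

```python
def still_consuming_cpu(before, after, tolerance=0.5):
    """Return process ids whose CPU seconds grew by more than tolerance.

    before and after map a process id to its cumulative CPU seconds,
    read at two moments while BOINC reports computation as suspended.
    """
    busy = []
    for pid, start in before.items():
        end = after.get(pid)
        if end is not None and end - start > tolerance:
            busy.append(pid)
    return sorted(busy)


# Hypothetical readings taken about a minute apart.
before = {101: 3600.0, 102: 1200.0, 103: 50.0}
after = {101: 3600.1, 102: 1230.0}  # process 103 has exited
print(still_consuming_cpu(before, after))
```

With these made-up readings only process 102 is flagged, since its CPU time kept climbing while the others stood still or exited.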
ID: 57586 · Rating: 0
Tony
Joined: 12 Dec 05
Posts: 7
Credit: 6,724,341
RAC: 0
Message 57587 - Posted: 4 Dec 2008, 16:21:04 UTC - in response to Message 57586.  

I think it is not only on Linux that minirosetta doesn't suspend. It seems to be like an unruly child that will not mind. In Windows, start the Task Manager to see all running processes and sort by CPU usage. Seems some of the processes obey but some keep running after a snooze or suspend. Restart seems to make it behave. I think it may be errors that will not let it stop the running task. I seem to be having lots of errors on three different computers I just started crunching with.


Mostly problems with a new computer I just built. It is not overclocked but is running Vista Ultimate 64-bit with 8 GB of memory and an AMD 9950 processor.
ID: 57587 · Rating: 0
Tony
Joined: 12 Dec 05
Posts: 7
Credit: 6,724,341
RAC: 0
Message 57588 - Posted: 4 Dec 2008, 17:18:45 UTC - in response to Message 57587.  

I think it is not only on Linux that minirosetta doesn't suspend. It seems to be like an unruly child that will not mind. In Windows, start the Task Manager to see all running processes and sort by CPU usage. Seems some of the processes obey but some keep running after a snooze or suspend. Restart seems to make it behave. I think it may be errors that will not let it stop the running task. I seem to be having lots of errors on three different computers I just started crunching with.


This on an older computer.

12/4/2008 12:08:18 PM||Suspending computation - user is active
12/4/2008 12:08:18 PM||Suspending network activity - user is active
12/4/2008 12:08:35 PM|rosetta@home|Task cc_0_8_nocst4_homo_bench_foldcst_chunk_general_t303__olange_IGNORE_THE_REST_2GO7A_7_5161_15_0 exited with zero status but no 'finished' file
12/4/2008 12:08:35 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
This is repeated many times.

With this message the task seems to be still running even though boinc says computation is suspended.
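
To put a number on "repeated many times", the BOINC message log can be scanned for this particular message and counted per task. This is a minimal sketch that assumes the log has been saved as text in the pipe-separated form shown above. The function name `count_unfinished_exits` is made up for illustration.

```python
import re
from collections import Counter

# Matches log lines of the form shown above:
# <timestamp>|<project>|Task <name> exited with zero status but no 'finished' file
PATTERN = re.compile(r"\|Task (\S+) exited with zero status but no 'finished' file")


def count_unfinished_exits(lines):
    """Count, per task name, how often the 'no finished file' message appears."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a saved log gives a per-task tally that can be posted alongside the task names when reporting the problem.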

ID: 57588 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1191
Credit: 13,197,173
RAC: 156
Message 57591 - Posted: 4 Dec 2008, 19:02:43 UTC - in response to Message 57586.  

I think it is not only on Linux that minirosetta doesn't suspend. It seems to be like an unruly child that will not mind. In Windows, start the Task Manager to see all running processes and sort by CPU usage. Seems some of the processes obey but some keep running after a snooze or suspend. Restart seems to make it behave. I think it may be errors that will not let it stop the running task. I seem to be having lots of errors on three different computers I just started crunching with.


Under 32-bit Windows Vista SP1, my results indicate that the suspend problem occurs under Vista also, but not in all workunits. I suspect that it's only in workunits that use the new features added under minirosetta 1.39 and 1.40, and not even all of those. I would like to see Rosetta@home add the option to select which types of workunits a particular computer gets, in order to avoid some of the more problematic new types.

ID: 57591 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1191
Credit: 13,197,173
RAC: 156
Message 57592 - Posted: 4 Dec 2008, 19:22:47 UTC - in response to Message 57588.  

This on an older computer.

12/4/2008 12:08:18 PM||Suspending computation - user is active
12/4/2008 12:08:18 PM||Suspending network activity - user is active
12/4/2008 12:08:35 PM|rosetta@home|Task cc_0_8_nocst4_homo_bench_foldcst_chunk_general_t303__olange_IGNORE_THE_REST_2GO7A_7_5161_15_0 exited with zero status but no 'finished' file
12/4/2008 12:08:35 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
This is repeated many times.

With this message the task seems to be still running even though boinc says computation is suspended.


The first part of that seems likely for workunits that go for a long time between checkpoints on machines that don't have enough memory to allow the workunit to stay in memory, and don't allow BOINC to use enough disk space and swap file space to save the current contents of the memory during user interruptions.

For my computer, about US $50 worth of added memory put it up to the maximum amount of memory that model of computer can handle.
ID: 57592 · Rating: 0
mfbabb2
Joined: 10 Oct 08
Posts: 4
Credit: 10,345
RAC: 0
Message 57594 - Posted: 4 Dec 2008, 19:28:14 UTC

Running on Vista w/SP1:

Computation Error and no apparent progress.

Project has been reset several times. Rosetta used to work.

12/4/2008 11:57:10 AM|rosetta@home|Restarting task cc_1_0_nocst4_homo_bench_foldcst_chunk_general_t364__olange_IGNORE_THE_REST_1S5UA_5_5206_5_0 using minirosetta version 140
12/4/2008 11:57:51 AM|rosetta@home|Task cc_1_0_nocst4_homo_bench_foldcst_chunk_general_t364__olange_IGNORE_THE_REST_1S5UA_5_5206_5_0 exited with zero status but no 'finished' file
12/4/2008 11:57:51 AM|rosetta@home|If this happens repeatedly you may need to reset the project.

ID: 57594 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57598 - Posted: 4 Dec 2008, 19:53:10 UTC

The version being tested now on Ralph is 1.45. I'm pretty sure the issue with tasks not suspending when BOINC tells them to has been resolved. Hopefully coming very soon to Rosetta.
Rosetta Moderator: Mod.Sense
ID: 57598 · Rating: 0
Ma3threeX
Joined: 22 Aug 08
Posts: 3
Credit: 347,217
RAC: 0
Message 57604 - Posted: 4 Dec 2008, 21:52:06 UTC

I don't know if this is the best thread for it, but... whatever.

I now have at least 8 WUs that are 100% crunched and uploaded, but they don't disappear from the list; it seems like they're waiting for something. I also get a message from the Rosetta server: "Can't attach shared memory"

Does anybody know the problem?

greetings
Ma3threeX
ID: 57604 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5528
Credit: 5,532,664
RAC: 1,305
Message 57607 - Posted: 4 Dec 2008, 22:13:33 UTC - in response to Message 57604.  

the team moved the server to a new address. just let BOINC Manager sort it out. it needs to get the new info from the new master file. some guys are hitting update 10 times to get to the new master file, but the team says just let the program take its course, it will self correct.


I don't know if this is the best thread for it, but... whatever.

I now have at least 8 WUs that are 100% crunched and uploaded, but they don't disappear from the list; it seems like they're waiting for something. I also get a message from the Rosetta server: "Can't attach shared memory"

Does anybody know the problem?

greetings
Ma3threeX

ID: 57607 · Rating: 0
Nicolai
Joined: 21 Jun 08
Posts: 1
Credit: 142,530
RAC: 0
Message 57632 - Posted: 5 Dec 2008, 20:00:45 UTC - in response to Message 57594.  

Running on Vista w/SP1:

Computation Error and no apparent progress.

Project has been reset several times. Rosetta used to work.

12/4/2008 11:57:10 AM|rosetta@home|Restarting task cc_1_0_nocst4_homo_bench_foldcst_chunk_general_t364__olange_IGNORE_THE_REST_1S5UA_5_5206_5_0 using minirosetta version 140
12/4/2008 11:57:51 AM|rosetta@home|Task cc_1_0_nocst4_homo_bench_foldcst_chunk_general_t364__olange_IGNORE_THE_REST_1S5UA_5_5206_5_0 exited with zero status but no 'finished' file
12/4/2008 11:57:51 AM|rosetta@home|If this happens repeatedly you may need to reset the project.



I have been having the same problem for more than a while now...
ID: 57632 · Rating: 0



©2022 University of Washington
https://www.bakerlab.org