Problems with Minirosetta v1.54

robertmiles
Joined: 16 Jun 08
Posts: 1055
Credit: 11,485,390
RAC: 4,682
Message 59419 - Posted: 7 Feb 2009, 15:04:48 UTC - in response to Message 59172.  

mikey, have you tried a different version of BOINC?


Yes, I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot...

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No, I am fine on both. BOINC still has 10 GB available to it and there is over 20 GB total available.


How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.
ID: 59419 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1055
Credit: 11,485,390
RAC: 4,682
Message 59420 - Posted: 7 Feb 2009, 15:25:32 UTC

I recently had a 1.54 workunit with a validate error for no reason I could spot in the Task ID details file. A wingman got a Success, but apparently with a much shorter preferred workunit length than the 14 hours I request.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=204095976

Could you check for problems in parts of the workunit the wingman probably never reached?
ID: 59420 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1755
Credit: 5,947,552
RAC: 15
Message 59421 - Posted: 7 Feb 2009, 16:24:33 UTC - in response to Message 59419.  
Last modified: 7 Feb 2009, 16:27:55 UTC

I only have one project per PC, but I will add a second if the first is having workunit issues. All machines have at least a 20 GB hard drive, but most have a 100 GB or bigger hard drive. The one above is a laptop with a 50 GB hard drive with almost 30 GB free. I have BOINC set up to use no more than 50% of the free hard drive space and don't have any issues with space.
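
To put numbers on the two disk limits discussed here, a minimal Python sketch follows. It assumes, as robertmiles suspects but nobody in the thread confirms, that BOINC splits its allowance equally among attached projects, and that the allowance is the smaller of an absolute cap and a percentage of free space. The function and parameter names are made up for illustration and are not actual BOINC settings.

```python
def boinc_disk_budget_gb(free_gb, max_pct_of_free, max_total_gb=None, n_projects=1):
    """Return (total allowance, per-project share) in GB under the assumed rules."""
    # Cap by percentage of free space (assumed rule).
    allowance = free_gb * max_pct_of_free / 100.0
    # Optionally cap by an absolute limit as well.
    if max_total_gb is not None:
        allowance = min(allowance, max_total_gb)
    # Assumed equal split among projects, per robertmiles's observation.
    return allowance, allowance / n_projects


# mikey's laptop: almost 30 GB free, 50% of free space allowed, one project.
print(boinc_disk_budget_gb(30, 50))
# robertmiles: 30 GB cap shared by 8 projects (free space here is a made-up value).
print(boinc_disk_budget_gb(1000, 100, max_total_gb=30, n_projects=8))
```

Under these assumptions mikey's laptop has about 15 GB to work with, while an equal split of 30 GB among 8 projects leaves 3.75 GB per project.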
ID: 59421 · Rating: 0
epcorian
Joined: 1 Jan 09
Posts: 16
Credit: 253,062
RAC: 0
Message 59428 - Posted: 7 Feb 2009, 20:06:18 UTC - in response to Message 59412.  

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.


ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.


I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8: better, but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake, I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some minis, and they did just fine.
ID: 59428 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59437 - Posted: 8 Feb 2009, 2:28:40 UTC - in response to Message 59418.  
Last modified: 8 Feb 2009, 2:29:01 UTC

Hello.
The following task (https://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended because it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Runtime is set to 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks


I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".
Rosetta Moderator: Mod.Sense
ID: 59437 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59439 - Posted: 8 Feb 2009, 2:43:27 UTC - in response to Message 59416.  

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop processing for this project. Even so, I try again from time to time, but nothing has changed, even with the new versions of Rosetta Mini, including this latest one.

The thing is that Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of them. It is a pity that this project does not offer the option of selecting subprojects, as many others do.

I would like to keep processing for this project, but there is no way, and it makes no sense to throw away hours of computation. I hope this problem is resolved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan


Hola Juan,

I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of mini.

Looking at his 2 failed tasks, they both have Exit status -226 and the Can't acquire lockfile errors.

He is running Win Vista x86.

I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best we can using a tool like this: http://dictionary.reference.com/translate/text.html
Rosetta Moderator: Mod.Sense
ID: 59439 · Rating: 0
Fishead
Joined: 3 Sep 08
Posts: 7
Credit: 89,566
RAC: 0
Message 59443 - Posted: 8 Feb 2009, 6:45:05 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=206610287
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=206617445
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=206618707
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=204395981

According to the graphics screen of these four WUs, every "accepted" step becomes the new low energy state. No matter if the energy value is smaller or higher...
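
For comparison, this is the behaviour one would normally expect from a lowest-energy display: an accepted step replaces the low only when its energy is actually smaller. The Python sketch below is an illustration only and is not minirosetta code; `track_low_energy` is a hypothetical name.

```python
def track_low_energy(accepted_energies):
    """Return the running lowest energy after each accepted step."""
    lows = []
    low = float("inf")
    for energy in accepted_energies:
        # Only a strictly lower energy should become the new low.
        if energy < low:
            low = energy
        lows.append(low)
    return lows


# An accepted step with higher energy (4.0 after 3.0) leaves the low unchanged.
print(track_low_energy([5.0, 3.0, 4.0, 2.0]))
```

If the graphics show the low following every accepted value, higher ones included, that differs from this expected pattern and is worth reporting as seen.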
ID: 59443 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59447 - Posted: 8 Feb 2009, 10:12:14 UTC

*I* cured the lock file problem by running with 100% time ... if he has opted to run at some lower percentage of CPU time this may be the issue. Something else to try ... and if it works we can report another success ... this is one of the issues that we have been trying to pin down in RALPH...
ID: 59447 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 59462 - Posted: 8 Feb 2009, 16:37:47 UTC

I have aborted the following loopbuilds:

226468615
226473496

They both were going on a slow boat to nowhere with an accepted energy of 1.#INF
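
Values such as 1.#INF and QNAN match the way older Microsoft C runtimes print infinity and quiet NaN. A small Python sketch for spotting such values in copied output is shown below; the helper names are made up for this example, and the string matching is a loose assumption about how the values appear on screen.

```python
import math


def parse_energy(text):
    """Convert an energy string copied from the graphics or stderr to a float."""
    upper = text.strip().upper()
    # "1.#INF" / "-1.#INF" style infinities.
    if "INF" in upper:
        return -math.inf if upper.startswith("-") else math.inf
    # "1.#QNAN", "QNAN", "-1.#IND" style not-a-number values.
    if "NAN" in upper or "IND" in upper:
        return math.nan
    return float(upper)


def is_runaway(text):
    """True when the energy is infinite or NaN, i.e. the model cannot recover."""
    return not math.isfinite(parse_energy(text))


print(is_runaway("1.#INF"), is_runaway("QNAN"), is_runaway("-123.45"))
```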


ID: 59462 · Rating: 0
Klimax
Joined: 27 Apr 07
Posts: 35
Credit: 1,545,810
RAC: 3,688
Message 59465 - Posted: 8 Feb 2009, 18:47:55 UTC - in response to Message 59437.  

OK, I set the runtime to 8 hours so the watchdog would cut it off at 24 hours. It has now been uploaded and reported. I have dump files as well, if somebody on the team is interested. (Captured at the reported time and step.)
And I see I was not alone... :-(
ID: 59465 · Rating: 0
Arkadiusz Dykiel
Joined: 13 Aug 06
Posts: 3
Credit: 12,823,537
RAC: 0
Message 59469 - Posted: 8 Feb 2009, 20:24:44 UTC

Hi,

The work units exit with status code 193 (0xc1).
Rosetta 5.98 and other projects work OK.

Am I missing something? Some library, perhaps?

Full error report below:

Server state Over
Outcome Client error
Client state Compute error
Exit status 193 (0xc1)
CPU time 0

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 2- 8 1:29: 8:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
*** glibc detected *** corrupted double-linked list: 0x093544cc ***
SIGABRT: abort called
Stack trace (15 frames):
[0x8f88f07]
[0x8fb3778]
[0xb7fff420]
[0x9016944]
[0x902c693]
[0x90310d2]
[0x9031c84]
[0x903353d]
[0x9000ec7]
[0x81bed6d]
[0x81bee1d]
[0x8195f15]
[0x8048e93]
[0x900f84c]
[0x8048111]

Exiting...

</stderr_txt>
]]>
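
The three numbers in the "process exited with code 193 (0xc1, -63)" line are the same byte written three ways: unsigned decimal, hexadecimal, and signed 8-bit. A quick Python check (the helper name is made up for this example):

```python
def exit_code_views(code):
    """Return the low byte of an exit code as (unsigned, hex string, signed 8-bit)."""
    byte = code & 0xFF
    # Reinterpret values 128..255 as negative, as a signed 8-bit integer would.
    signed = byte - 256 if byte >= 128 else byte
    return byte, hex(byte), signed


print(exit_code_views(193))  # (193, '0xc1', -63)
```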
ID: 59469 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59473 - Posted: 8 Feb 2009, 23:12:20 UTC

As of v1.54, the watchdog kicks in at runtime pref. plus 4 hours. So, no longer 3 times runtime preference.
Rosetta Moderator: Mod.Sense
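
The difference between the two watchdog rules is easy to see with numbers. The sketch below only encodes the rule as described in this thread (runtime preference plus 4 hours as of v1.54, 3 times the preference before); the function name and flag are made up for illustration.

```python
def watchdog_cutoff_hours(preferred_hours, since_v154=True):
    """Return the run time, in hours, at which the watchdog ends the task."""
    if since_v154:
        # v1.54 rule: runtime preference plus 4 hours.
        return preferred_hours + 4
    # Older rule: 3 times the runtime preference.
    return 3 * preferred_hours


# An 8-hour preference: 24 hours under the old rule, 12 hours as of v1.54.
print(watchdog_cutoff_hours(8, since_v154=False), watchdog_cutoff_hours(8))
```

So lowering the preference to 8 hours on a task already past 20 hours would put it over the v1.54 limit of 12 hours right away.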
ID: 59473 · Rating: 0
Andreas
Joined: 22 Sep 08
Posts: 1
Credit: 39,402
RAC: 0
Message 59493 - Posted: 9 Feb 2009, 22:07:37 UTC - in response to Message 59086.  

If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...

I read about this in Einstein@Home and it seems to work for me ... YMMV ...


I, too, was plagued by frequent R@H lock file problems. Setting CPU to 100% seems to have cured that.
And, as I have a quad-core CPU, I can limit BOINC usage by setting "On Multiprocessor Systems, use at most 51% of all processors". (If I run BOINC at 100% on all cores, my system gets too hot - more precisely, my fan gets too loud)
-- Andreas
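
As a rough illustration of the multiprocessor setting, the sketch below assumes the client multiplies the core count by the percentage, truncates to a whole number, and always uses at least one core. That rounding rule is an assumption made for this example, not something stated in the thread, and `cores_used` is a hypothetical name.

```python
def cores_used(total_cores, max_pct):
    """Return how many cores would be used under the assumed truncation rule."""
    # Integer arithmetic avoids floating-point surprises near whole numbers.
    return max(1, (total_cores * max_pct) // 100)


# Quad-core at 51%: 2.04 truncates to 2 cores under this assumption.
print(cores_used(4, 51))
```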
ID: 59493 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 59494 - Posted: 9 Feb 2009, 23:02:10 UTC

problems with this one:
227327540

heartbeat error messages

</stderr_txt>
<message>
<file_xfer_error>
<file_name>abinitio_norelax_homfrag_natfrag_129_B_1o7uA_SAVE_ALL_OUT_6252_5178_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>




ID: 59494 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1055
Credit: 11,485,390
RAC: 4,682
Message 59502 - Posted: 10 Feb 2009, 14:44:11 UTC - in response to Message 59439.  
Last modified: 10 Feb 2009, 14:52:25 UTC

I never learned enough Spanish to do such a translation myself, so I tried asking that web site to translate all of your reply at once to Spanish, in preparation for writing an answer in English and doing the same to it. It appeared that the translation succeeded, but enough of it was hidden by advertisements that it was unusable.

Anyone know another automatic translation site that doesn't have this problem?

I've been trying to trigger that problem over on RALPH@home by setting my CPU time to less than 100%, but I have been unable to actually get it below 100%, so you might want to consider this: for anyone having this problem repeatedly, give them 1.54 workunits with extra debugging output enabled. Then have someone on the RALPH@home staff analyze the results and give them credits according to the RALPH@home standards instead of the Rosetta@home standards.
ID: 59502 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4875
Credit: 4,510,160
RAC: 1,082
Message 59505 - Posted: 10 Feb 2009, 18:23:04 UTC
Last modified: 10 Feb 2009, 18:32:00 UTC

http://www.babelfish.yahoo.com translates it as:

Hello, First of all, excuses to write in Castilian, but my English is insufficient. From August of 2008 me 99% of the tasks of Mini Rosetta with computational error are finalizing. After a time I decided not to continue processing in this project. Even so, sometimes I return to try it, but everything follows equal: even with the new versions of Mini Rosetta, including this last one. The case is that the tasks of Rosetta Beta do not fail to me, but of that one sends very few proporcinalmente to me. The pain is that in this project the possibility of selecting sub-projects, does not exist there is as if it in other many. I would like to continue processing for this project, but there is no way, and it is not question to throw low-achieving hours of computation. I hope that this problem is solved soon. As for me I will continue trying from time to time. A coridal greeting for all, Juan

He has 4 tasks, and 2 of them failed:

abinitio_norelax_homfrag_natfrag_129_B_1tit__SAVE_ALL_OUT_6252_2628_0
He got a lockfile failure on this one, and it ran for only CPU time 683.9708.

and

loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t363__IGNORE_THE_REST_1WWTA_12_6651_14_0
This one got a lockfile failure as well; it ran for CPU time 2155.325.

The other 2 are split between one completed and one in progress.
ID: 59505 · Rating: 0
rembertw
Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59518 - Posted: 11 Feb 2009, 16:45:08 UTC - in response to Message 59395.  

Mod.Sense

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?


- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects, though.

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform microsoft of the problem". The only "special" about this computer is that it doesn't have 24/7 internet access.

No solution as yet?
ID: 59518 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59520 - Posted: 11 Feb 2009, 19:18:25 UTC - in response to Message 59518.  

I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?

Odd, the failed task with some time on it shows that your core client version is 6.2.14, but your BOINC Windows Runtime Debugger version is 6.5.0. Not sure how that would happen.

Rosetta Moderator: Mod.Sense
ID: 59520 · Rating: 0
Verrie Pearce
Joined: 2 Dec 05
Posts: 3
Credit: 90,299
RAC: 0
Message 59524 - Posted: 12 Feb 2009, 3:13:06 UTC - in response to Message 59045.  

Hello All!

We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here: http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.

This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.


Features/Fixes:
1.54 Release CHANGELOG


  • Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

  • Bug fix concerning intermittent crashes in relax benchmark jobs (_rlbd_) jobs - caused by buggy input file reader.

  • Bug fix for a potential instability in handling text files (affects all types of WUs).

  • Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

  • Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (basically, at most as long as the time between checkpoints).

  • Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

  • Added checkpointing to Looprelax.

  • The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!

  • Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.

  • Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)

  • Fixed a strange problem in the options system leading to early crashes on some systems.

  • Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)

  • Generally implemented much better error reporting - many potential problems will now show up as meaningful error messages and not random segmentation faults.
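
The remark above about "backjumping" being bounded by the time between checkpoints can be checked with simple arithmetic. The sketch below assumes checkpoints are written at a fixed interval counted from the start of the task, which is a simplification; the function name and the sample intervals are made up for illustration.

```python
def time_lost_on_restart(elapsed_s, checkpoint_interval_s):
    """Return the seconds of work lost if the task restarts at elapsed_s."""
    # The task resumes from the most recent checkpoint before the restart.
    last_checkpoint = (elapsed_s // checkpoint_interval_s) * checkpoint_interval_s
    return elapsed_s - last_checkpoint


# Denser checkpoints shrink the backward jump: 100 s lost versus 40 s lost.
print(time_lost_on_restart(3700, 600), time_lost_on_restart(3700, 60))
```

In both cases the loss stays below the checkpoint interval, which is the bound the changelog describes.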



NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.


Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.

Please let us know how things work out there. Particularly, I'd like to know about:


  • Stuck workunits
  • Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
  • Problems with checkpointing.
  • Any other strange behaviour.




Happy crunching - I'm very excited to see how this new version will pan out.

Mike



ID: 59524 · Rating: 0
Verrie Pearce
Joined: 2 Dec 05
Posts: 3
Credit: 90,299
RAC: 0
Message 59525 - Posted: 12 Feb 2009, 3:14:52 UTC

I have reached the end. Since your new patch, nothing from your project works. I keep resetting and still get no improvement. Until you patch your patch, I am done. Sorry, I wanted to help.
ID: 59525 · Rating: 0



©2021 University of Washington
https://www.bakerlab.org