Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Shawn
Volunteer moderator
Project developer
Project scientist

Joined: 22 Jan 10
Posts: 17
Credit: 53,741
RAC: 0
Message 70158 - Posted: 28 Apr 2011, 20:18:43 UTC

We are aware that we have had some issues with bad jobs on Rosetta@home recently. We try to ensure that these bad jobs don't slip through, but they occasionally do. When that happens, your efforts to alert us to these problems are extremely important and very much appreciated.

In order to ensure that we address technical issues promptly, graduate students in the Baker lab (such as myself) will be regularly monitoring this message board for such problems. This will be in addition to the help of Mod.Sense, our vigilant forum moderator who has done a lot to ensure that these projects run as smoothly as possible. I ask that you alert us to new issues in this thread so that we can find them more easily.

Thank you all once again for your commitment to Rosetta@home!

Hank Barta
Joined: 6 Feb 11
Posts: 14
Credit: 3,943,460
RAC: 0
Message 70164 - Posted: 29 Apr 2011, 13:32:40 UTC - in response to Message 70158.  

Thank you for helping to deal with these. This morning I have seen a number of errors with a different signature. These run for 3-5 minutes before producing an error and exiting. The characteristic error seems to be:

ERROR: ct == final_atoms

An example is https://boinc.bakerlab.org/rosetta/workunit.php?wuid=382081360

thanks,
hank

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,988,827
RAC: 8,538
Message 70165 - Posted: 29 Apr 2011, 15:43:51 UTC - in response to Message 70158.  


I hope these changes involve ralph@home!!!

Shawn
Volunteer moderator
Project developer
Project scientist

Joined: 22 Jan 10
Posts: 17
Credit: 53,741
RAC: 0
Message 70170 - Posted: 29 Apr 2011, 19:03:42 UTC - in response to Message 70164.  

Hey Hank, thanks for letting us know. This job has been deleted and is no longer in the queue. Apparently, this was a small test job that reported failure early, and the author marked it for deletion right away, but sometimes those jobs propagate for a while anyway.

In any case, you shouldn't see this particular job anymore, but if for some reason it persists, please give us an update!

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70185 - Posted: 30 Apr 2011, 20:51:29 UTC - in response to Message 70164.  

I thought these were tested on RALPH before being brought over to Rosetta?
If that is the case, then this job should not have slipped through.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,988,827
RAC: 8,538
Message 70215 - Posted: 2 May 2011, 10:20:31 UTC - in response to Message 70185.  


Ralph has had big problems since December....
Few WUs, no communication from the team, etc.
I hope this situation changes.
If you need our help to "control" the code, please give us some information, news, details, etc.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 70224 - Posted: 2 May 2011, 17:42:25 UTC - in response to Message 70215.  


We're in the process of upgrading RALPH. The current server is very unstable. We do need to be far better at providing information about new projects/jobs that we test on RALPH and I'll stress that point to the lab members. The RALPH WU flow will depend on whether or not we have new jobs to test. Many jobs have already been tested.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,988,827
RAC: 8,538
Message 70225 - Posted: 2 May 2011, 18:34:17 UTC - in response to Message 70224.  


Thanks for the information :-)

Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 800,690
RAC: 173
Message 70268 - Posted: 6 May 2011, 22:24:30 UTC
Last modified: 6 May 2011, 22:26:50 UTC

Task 420656625 (FOLD_N_DOCK_dagk_D2symm) got validate state Invalid after 2010.416 seconds of CPU time; the run time was meant to be 3 hours. The corresponding work unit, number 420591203, got validate state Invalid after 3843.709 seconds of CPU time (it has a debug message).

I posted the above message in the minirosetta 2.17 thread on 06/05/11.
Edit: added clickable links.
Have a crunching good day!!

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70279 - Posted: 8 May 2011, 7:04:57 UTC

I am getting out-of-memory error codes on these tasks, which should not be possible as I have 3.24 GB of RAM.

FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_9746_0
FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_1528_0
FOLD_N_DOCK_dagk_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26520_9259_1

Error message:
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB

Zydor
Joined: 4 May 11
Posts: 7
Credit: 12,648
RAC: 0
Message 70294 - Posted: 9 May 2011, 17:13:00 UTC

Couple of possible problem WUs for you - they are 1 hour WUs, ran for around 25-30 mins, and failed to progress beyond 2-3% completion. Other 1hr ones had a consistent completion percentage roughly in line with time done so far, so I aborted both.

https://boinc.bakerlab.org/rosetta/result.php?resultid=421246729

https://boinc.bakerlab.org/rosetta/result.php?resultid=421246619

Regards
Zy

Ray Wang
Joined: 9 Mar 09
Posts: 8
Credit: 230,454
RAC: 0
Message 70295 - Posted: 9 May 2011, 18:47:34 UTC - in response to Message 70279.  

Hi Speedy and Greg_BE,

I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by "FOLD_N_DOCK" related jobs. As Greg_BE said, it is really not likely that these jobs would use up all 3.24 GB of RAM.

Thank you all for letting us know about the problems, as well as for your contributions to Rosetta@home!!!

Zydor
Joined: 4 May 11
Posts: 7
Credit: 12,648
RAC: 0
Message 70297 - Posted: 9 May 2011, 19:59:34 UTC
Last modified: 9 May 2011, 20:00:53 UTC

Could someone take a peek at the list for my laptop? I'm not new to BOINC, but am a total newb at Rosetta, so for now I will miss the obvious until my feet are under the table. (I'm running a few WUs to get used to Rosetta, ready for the pentathlon in a day or so.)

https://boinc.bakerlab.org/rosetta/results.php?hostid=1441160&offset=20

I made a post two up about slow ones, but I'm wondering if it's a bad batch. I'm running two from the same date/time batch, and they are slow as well (18-19% done after circa 2 hrs 45 min, for 1-hour WUs). The two running at present are Task IDs 421246725 and 421246743.

I am starting to wonder whether they are 1-hour WUs at all; maybe there are longer ones in that batch. I did 1-hour ones from the same batch previously, so it's a bit strange. Ignore the laptop preference as set at present; it was set to 1 hour when that batch was downloaded.

Regards
Zy

Elizabeth
Joined: 24 Nov 06
Posts: 1
Credit: 6,905
RAC: 0
Message 70298 - Posted: 9 May 2011, 20:18:29 UTC - in response to Message 70294.  

Hi Zy,

This job is currently returning models at a reasonable rate, but we're looking into the problem. Thanks for the heads up!

Adam Gajdacs (Mr. Fusion)
Joined: 26 Nov 05
Posts: 13
Credit: 2,316,967
RAC: 3,032
Message 70299 - Posted: 9 May 2011, 22:14:11 UTC - in response to Message 70295.  
Last modified: 9 May 2011, 22:17:56 UTC

I am Ray, a graduate student in the Baker lab. I will be taking care of the issues caused by "FOLD_N_DOCK" related jobs. As Greg_BE said, this is really not likely that these jobs could run out of all those 3.24GB of RAM.

They definitely can. I just noticed that one of my two rigs started thrashing like hell. It turned out that a single one of these FOLD_N_DOCK WUs (https://boinc.bakerlab.org/rosetta/result.php?resultid=421379634) was using 1.45 GB of virtual memory on a system with only 1 GB of physical memory; it was effectively running from the disk. The other core was idle because there was no memory left to run another WU on it, but if there had been, it would've been about 3 GB total.
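
The arithmetic in this post can be checked with a rough sketch. The helper name `memory_pressure` is hypothetical, 1 GB is rounded to 1000 MB, and the rule "all of a task's virtual memory wants to be resident" is a pessimistic simplification, not how BOINC or the operating system actually schedules memory.

```python
def memory_pressure(task_mb, n_tasks, physical_mb):
    """Return (total demand in MB, shortfall in MB) for n identical tasks.

    Hypothetical helper: assumes every task's virtual memory use would
    ideally be resident in physical RAM at the same time.
    """
    total = task_mb * n_tasks
    shortfall = max(0, total - physical_mb)
    return total, shortfall


# Figures from the post: 1.45 GB per task, 1 GB of physical memory.
one_task = memory_pressure(1450, 1, 1000)
two_tasks = memory_pressure(1450, 2, 1000)
print(one_task)   # demand and shortfall with one task running
print(two_tasks)  # demand and shortfall if the second core ran one too
```

Under these assumptions a single task already exceeds physical memory by several hundred MB, which would be consistent with the paging described, and two tasks would need close to the 3 GB mentioned.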

Zydor
Joined: 4 May 11
Posts: 7
Credit: 12,648
RAC: 0
Message 70300 - Posted: 9 May 2011, 22:29:51 UTC

Quick note to close the loop on my posts above. I've ended up having to do a detach/attach (after aborting held WUs) on my machines. Sorry about the aborts, but I felt I had no choice. On restart, the problem has disappeared, and at present at least, all appears to be progressing normally. I have yet to complete one since the detach, but all three machines appear to be behaving now.

No idea of the reason; strange that it hit all three machines. There is no hangover from worries elsewhere as far as I know, as things have been stable in my recent travels around BOINC. Anyway .... for what it's worth, detaching resolved my problems, though I have absolutely no idea why :)

Regards
Zy

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70302 - Posted: 9 May 2011, 23:17:01 UTC
Last modified: 10 May 2011, 0:15:40 UTC

Zydor, your expectations of what a 1hr work unit is are not realistic for R@h. And you will only confuse yourself further by modifying the runtime preference frequently. Runtime preference is actually a runtime characteristic, so your preference at time of download is actually not relevant. Here are a couple of threads that discuss how the runtime works:
Discussion on increasing the default run time
Newbie Q&A: discussion on runtime under Q: I am on a dial-up connection, how can I use less modem time?
Newbie Q&A: Q: Progress Percent not advancing?
Newbie Q&A: Q: I'm familiar with SETI and BOINC already, but what should I know about Rosetta?

I'm sure you were trying to complete and return work as quickly as possible for immediate credit recognition during the pentathlon... unfortunately, when you do that, some of the other nice things such as accurate progress %, and consistently completing within such a limited timeframe go out the door. Each task must complete at least one model. For some tasks you will see a model every 5 minutes or so, for others, it can take several hours. So, not all tasks are going to complete within your one hour target, and that is normal and to be expected. You might want to start a thread to discuss suggested settings for pentathlon participants.
Rosetta Moderator: Mod.Sense
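
A toy model of the behaviour described in this post, as a sketch only. The function `simulate_task` and its stopping rule (always finish one model, then start another only if one more model of the same length still fits in the target) are assumptions made for illustration; the real application's rule may differ in detail.

```python
def simulate_task(model_minutes, target_minutes):
    """Return (models_completed, total_minutes) for one task.

    Assumed rule: the first model always runs to completion; further
    models are started only while one more still fits in the target.
    """
    models, elapsed = 1, model_minutes
    while elapsed + model_minutes <= target_minutes:
        models += 1
        elapsed += model_minutes
    return models, elapsed


fast = simulate_task(model_minutes=5, target_minutes=60)
slow = simulate_task(model_minutes=150, target_minutes=60)
print(fast)  # many short models fit inside the one-hour target
print(slow)  # a single long model overruns the one-hour target
```

With a 5-minute model the task fills the hour with a dozen models; with a 150-minute model it still has to finish that one model, so it runs well past the target.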

Zydor
Joined: 4 May 11
Posts: 7
Credit: 12,648
RAC: 0
Message 70303 - Posted: 9 May 2011, 23:27:16 UTC
Last modified: 9 May 2011, 23:35:47 UTC

Spoke too soon :) Another for you, from the laptop - it has had a total detach/attach and clean out, so this one started on a pristine, clean default setup, no tweaks or o/c - but with some more detail this time, as I was trying to watch out for it.

Task ID 421414504 finished in normal time.

Task ID 421414503 had started at exactly the same time as the one that finished, except it had only completed 20% by the time the one above finished. It was also using (and still is) 270 MB of memory. That figure has slowly risen all the time it has run: not fast, but steadily (it still rises at a rate of about 0.5 MB per minute), with no wild fluctuation (barring the odd 100 KB or so), just a steady, inexorable rise. Memory leak? A blasé phrase, but not impossible.

The one that went through OK (421414504) was using 63 MB of memory when it finished. The replacement task that has started began using 43 MB of memory; it's too early to say if that's a bad one as well.
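
For anyone wanting to put a number on a rise like the one described above, here is a minimal sketch of a least-squares slope over (minute, MB) readings. The function name `growth_rate` and the readings are made up for illustration; they are not measurements from the task in question.

```python
def growth_rate(samples):
    """Least-squares slope, in MB per minute, of (minute, mb) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


# Made-up readings, chosen to resemble a rise of roughly 0.5 MB per minute.
readings = [(0, 255.0), (10, 260.1), (20, 264.9), (30, 270.0)]
rate = growth_rate(readings)
print(f"{rate:.2f} MB/min")
```

A steady positive slope on its own does not prove a leak; it only quantifies the rise so that it can be compared between tasks.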

Good luck on the hunt .... fingers crossed you nail it tomorrow with the Pentathlon coming up.

EDIT:
Just seen your post above ... it's not pentathlon related as such; when that starts, the longer the WU the better for me - less messing around. I selected the short ones only because the option was there and I wanted to do some quick ones to check all was well before the event starts tomorrow night, not being used to Rosetta. Point noted, I will change it to the default 3 hours for now.

I can start a thread re pentathlon if it helps you, but I'm not knowledgeable enough yet on Rosetta to comment or set it up properly. I'll give it a whirl if you want me to ... ??

Regards
Zy

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70304 - Posted: 10 May 2011, 0:36:40 UTC
Last modified: 10 May 2011, 0:38:19 UTC

Your observations are consistent with what I was saying... different tasks run rather differently. They have different memory requirements, they need different amounts of CPU time to complete a model, and they are attempting different approaches to solving the problem so that the "better" approach can be revealed.

As for memory, yes as a given model progresses, it often will gradually use more and more memory. Once the model is completed, the memory is released and if runtime preference permits, another model is begun... and then from that new local low in memory usage it will gradually use more and more as the model progresses.

As for creating a new thread, what I was suggesting was to create a thread asking the questions about what traits you'd like to optimize or minimize for the pentathlon and see what suggestions others may have for you.
Rosetta Moderator: Mod.Sense
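
The memory pattern described in this post (a gradual climb during each model, then a release when the model completes) can be pictured with a small simulation. Everything here is an illustrative assumption: the function names, the one-sample-per-minute trace, and the linear growth are not taken from the Rosetta code.

```python
def sawtooth_trace(n_models, minutes_per_model, base_mb, growth_mb_per_min):
    """One memory sample per minute: climbs during a model, resets after it."""
    trace = []
    for _ in range(n_models):
        for minute in range(minutes_per_model):
            trace.append(base_mb + growth_mb_per_min * minute)
    return trace


def local_lows(trace):
    """The first sample plus every sample lower than the one before it."""
    return [trace[0]] + [b for a, b in zip(trace, trace[1:]) if b < a]


trace = sawtooth_trace(n_models=3, minutes_per_model=4,
                       base_mb=40, growth_mb_per_min=0.5)
lows = local_lows(trace)
print(trace)
print(lows)
```

In this toy trace the local lows stay flat from model to model, which matches the behaviour described. Lows that kept creeping upward across completed models would be more suggestive of a genuine leak, though that is only a heuristic.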

Zydor
Joined: 4 May 11
Posts: 7
Credit: 12,648
RAC: 0
Message 70305 - Posted: 10 May 2011, 0:45:33 UTC

Re Thread - Okie Doke, will do

Regards
Zy



©2024 University of Washington
https://www.bakerlab.org