Problems with Minirosetta 1.80

AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 62022 - Posted: 30 Jun 2009, 17:53:19 UTC

I've now had quite a lot of WUs run for 4 hours over my run time of 12 hours, and then get ended by the watchdog. They always report one decoy being made, although, in fact, no decoys seem to have been produced. They then have a file xfer error (-161), presumably because there was no output file.

Here's yet another example: https://boinc.bakerlab.org/rosetta/result.php?resultid=262096625

Note that this ran over 16 hours on a Phenom II, yet produced no output.
ID: 62022
RC

Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 62024 - Posted: 30 Jun 2009, 20:14:55 UTC - in response to Message 61925.  

Another one that died after almost 13 hours (my runtime preference is 8 hours):

https://boinc.bakerlab.org/rosetta/result.php?resultid=262397691
ID: 62024
Wissi

Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 62025 - Posted: 30 Jun 2009, 21:36:19 UTC
Last modified: 30 Jun 2009, 21:42:05 UTC

Since getting 1.80, almost every WU I get is planned for about 4 Hours of work, but they will run at least 8 hours. So is there some miscalculation of how strong (or weak) my computer is?

It's quite annoying to see "calculation error" on almost every WU because the runtime exceeds 8 hours. The last 3 used more than 10 hours of work.

What's going on here?

Currently, I've got the following WU:
real_core_1.5_low200_beta_low200_start_hb_t332_IGNORE_THE_REST_13273_142
Task ID: 261849792, Work unit 238985112

The original time estimation was about 4hrs 20min, but the task now ran for 5 hours, and still there are 4hrs 10min left.

What I can see is that the time left INCREASES. The same applies to the newly started job:

lb_dk_ksync_withtrim2_hb_t302_IGNORE_THE_REST_13365_670
Task ID: 262152215, Work unit 239248916

The time left goes up and up, but never down...
ID: 62025
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62026 - Posted: 30 Jun 2009, 21:51:37 UTC

Here's another sad story.

real_core_3.5_low50_beta_low200_hb_t303__IGNORE_THE_REST_13576_83_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=239454464

This ran for 4hrs 34min and made no progress.

At 1hr 49min.
MODEL:0
STEP:46800

At 4hrs 34min.
MODEL:0
STEP:46800

ABORTED.

ID: 62026
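
For anyone who wants to check for this kind of stall without watching the graphics window, here is a minimal Python sketch. It assumes you have noted the MODEL/STEP readout by hand at two or more elapsed times, as in the post above. The function names and the one-hour threshold are hypothetical and are not part of BOINC or Rosetta.

```python
import re
from datetime import timedelta

# Matches a readout such as "MODEL:0 STEP:46800" (whitespace may be a newline).
SAMPLE_RE = re.compile(r"MODEL:(\d+)\s+STEP:(\d+)")


def parse_sample(text):
    """Return (model, step) from a hand-copied progress readout."""
    match = SAMPLE_RE.search(text)
    if match is None:
        raise ValueError(f"no MODEL/STEP readout found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def is_stalled(samples, min_gap=timedelta(hours=1)):
    """samples: list of (elapsed, readout) pairs sorted by elapsed time.

    A task counts as stalled when the first and last readouts are identical
    and at least min_gap (an arbitrary threshold) separates them.
    """
    (first_time, first_text), (last_time, last_text) = samples[0], samples[-1]
    same_position = parse_sample(first_text) == parse_sample(last_text)
    return same_position and (last_time - first_time) >= min_gap


# The two readings reported above for the real_core task.
samples = [
    (timedelta(hours=1, minutes=49), "MODEL:0\nSTEP:46800"),
    (timedelta(hours=4, minutes=34), "MODEL:0\nSTEP:46800"),
]
print(is_stalled(samples))
```

With the two readings from the post, the step counter is unchanged across 2hrs 45min, so the check reports a stall.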
Rob Heilman [Echo Labs]

Joined: 26 Apr 07
Posts: 20
Credit: 2,815,410
RAC: 0
Message 62027 - Posted: 1 Jul 2009, 1:10:15 UTC

I am getting a ton of compute errors. I also see some ridiculous disparities at times between claimed and awarded credit, e.g.:

262177679 239266150 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Success Done 101,333.10 224.36 17.95
262177658 239266121 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Client error Compute error 101,330.80 224.36 ---

Any ideas?
ID: 62027
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62028 - Posted: 1 Jul 2009, 2:04:14 UTC

Here's another real_core that was stuck.

real_core_5.0_low50_beta_low200_hb_t332__IGNORE_THE_REST_13705_64_0.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=239491013

Hadn't moved in 2hrs 12min. Got to that step then didn't move.

MODEL:0
STEP:48000

ABORTED

I think I have only had 1 of these that has run O.K.


ID: 62028
mikey

Joined: 5 Jan 06
Posts: 1894
Credit: 8,767,498
RAC: 6,467
Message 62031 - Posted: 1 Jul 2009, 9:31:23 UTC - in response to Message 62027.  

I am getting a ton of compute errors. I also see some ridiculous disparities at times between claimed and awarded credit, e.g.:

262177679 239266150 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Success Done 101,333.10 224.36 17.95
262177658 239266121 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Client error Compute error 101,330.80 224.36 ---

Any ideas?


You seem to be having two different kinds of errors: one is error code 161 and the other is something that doesn't list a code. I only looked at a few machines, but it is happening on all that I checked. Hmmm. Here is the wiki link to the error codes for BOINC: http://www.boinc-wiki.info/Error_Code

Do you ever reboot your machines? Have you updated them lately? I see you run Linux and I know they put out updates all the time, I usually wait until there are just under a hundred to do the updates.
ID: 62031
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62033 - Posted: 1 Jul 2009, 13:22:56 UTC

I moved Rob and mikey's posts to this thread.

Rob, several users are reporting tasks that stop progressing. This often means that some models complete in normal time and others take considerably longer. Since credit is issued on completed models, I believe that is the reason for the large disparities between some of your claimed and granted credit.
Rosetta Moderator: Mod.Sense
ID: 62033
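
To put a number on the disparity discussed above, here is a small Python sketch that computes what fraction of the claimed credit was actually granted, using the two figures from Rob's successful task (224.36 claimed, 17.95 granted). It only does the arithmetic. It does not reproduce how the project calculates credit, and the function name is hypothetical.

```python
def granted_fraction(claimed, granted):
    """Return granted credit as a fraction of claimed credit."""
    if claimed <= 0:
        raise ValueError("claimed credit must be positive")
    return granted / claimed


# Figures from the successful task quoted in this thread.
fraction = granted_fraction(224.36, 17.95)
print(f"granted / claimed = {fraction:.1%}")
```

For that task the granted credit works out to roughly 8% of the claim, which is what you would expect if only a small share of the run time produced a completed model.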
Rob Heilman [Echo Labs]

Joined: 26 Apr 07
Posts: 20
Credit: 2,815,410
RAC: 0
Message 62034 - Posted: 1 Jul 2009, 13:26:22 UTC
Last modified: 1 Jul 2009, 13:33:42 UTC

Is there anything I can do on my end to help with the issue? It seems to have started right about when 1.80 came out.
I have tried both decreasing my run time to 3 hrs and increasing to 24 hours. Right now I am at 12 on my way back to 8 hours.

Whatever is going on, it is costing the project some serious computing power.
If you look at my daily credit numbers you can see that without any changes to my machines, software versions, etc. I am only completing 50-55% of what I was able to do on a daily basis over the last several weeks.

My BOINCstats
ID: 62034
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62035 - Posted: 1 Jul 2009, 15:17:13 UTC

Rob, I believe the Project Team should already have the data they need to identify specific types of tasks that are causing problems. So, really can't think of anything on your end to help.

I for one have not been getting any of the tasks with names starting with "real_core", so I tend to believe there probably are not very many of them in the mix. So, your machines should return to tasks that are running well soon.
Rosetta Moderator: Mod.Sense
ID: 62035
Chilean

Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 62036 - Posted: 1 Jul 2009, 15:58:36 UTC

No errors beyond the occasional compute error, which happens on about 1 out of 60 WUs (2 hours per WU).
The PCs vary: single-core AMD, single Celeron, dual Athlon AMD, Core 2 Duo, all running Windows XP to 7.
Why is it that so many people have so many problems?
ID: 62036
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62039 - Posted: 1 Jul 2009, 18:20:13 UTC

Why is it that so many people have so many problems?


You always have to keep in mind that this is the "problems with" thread. So, by design, most of the posts here will be about problems.

Some of the 50 posts in this thread are not about specific problems in 1.80, more about BOINC general issues. I should probably be moving them elsewhere, but who has the time? So of 85,000 active hosts, you will never get every event reported, but overall the big picture is still good.

And so when you compare to about 2 million tasks completed since the creation of this thread, the number of problems is quite modest, and it seems most highly correlated with some of the new task types that are being worked on. As I said, it seems these are fairly few in number, so this is the current rough ground being covered.

Not everyone monitors their machines closely, and this is why it was key to make the changes Mike made earlier this year to collect and report more data both for when things go unexpectedly and to gather better information about things that are running well (which helps you readily identify any future variations as compared to that historical data).
Rosetta Moderator: Mod.Sense
ID: 62039
alpha

Joined: 4 Nov 06
Posts: 27
Credit: 1,550,107
RAC: 0
Message 62048 - Posted: 2 Jul 2009, 7:48:46 UTC

Two compute errors after 101,000 seconds (28 hrs) with a preference of 24 hours run time. Only one decoy in both cases:

https://boinc.bakerlab.org/rosetta/result.php?resultid=261928706
https://boinc.bakerlab.org/rosetta/result.php?resultid=262283940

Also, two more with 101,000 seconds run time. These completed successfully but were granted ridiculously low credit. Again, only one decoy:

https://boinc.bakerlab.org/rosetta/result.php?resultid=262122318
https://boinc.bakerlab.org/rosetta/result.php?resultid=262236422
ID: 62048
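
The run times reported in this thread follow a pattern: tasks end roughly 4 hours past the run time preference (about 16 hours on a 12-hour preference, 10 hours on a 6-hour preference, 28 hours on a 24-hour preference). The Python sketch below does that arithmetic for the 101,000-second case. The 4-hour watchdog grace period is an assumption inferred from these reports, not a documented project setting, and the function names and tolerance are hypothetical.

```python
# Assumption inferred from the reports in this thread, not a documented value.
WATCHDOG_GRACE_HOURS = 4.0


def overrun_hours(cpu_seconds, preference_hours):
    """Hours of CPU time used beyond the run time preference."""
    return cpu_seconds / 3600.0 - preference_hours


def looks_like_watchdog_end(cpu_seconds, preference_hours, tolerance=0.5):
    """True if the task ended within `tolerance` hours of preference + grace."""
    overrun = overrun_hours(cpu_seconds, preference_hours)
    return abs(overrun - WATCHDOG_GRACE_HOURS) <= tolerance


# 101,000 seconds of CPU time against a 24-hour preference.
print(round(overrun_hours(101_000, 24), 2))
print(looks_like_watchdog_end(101_000, 24))
```

A task that overruns by close to the grace period and returns a single decoy is consistent with the watchdog ending a model that never finished on its own.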
ByRad

Joined: 12 Apr 08
Posts: 8
Credit: 15,686,816
RAC: 535
Message 62050 - Posted: 2 Jul 2009, 8:26:03 UTC
Last modified: 2 Jul 2009, 8:27:19 UTC

I have a very odd error in the Rosetta Mini 1.80 app. You can see everything in the screenshots:

[Screenshots not preserved.]

I have 4GB of RAM (3GB usable on my WinXP x86) at 667MHz, CPU: C2D T5800 and GPU: GF9300M GS.

And after aborting this WU everything is back to normal...
ID: 62050
Seversen

Joined: 21 Dec 07
Posts: 3
Credit: 57,599
RAC: 0
Message 62057 - Posted: 2 Jul 2009, 13:30:25 UTC

Why did this workunit get such low credit?
real_core_1.5_low200_beta_low200_start_hb_t331__IGNORE_THE_REST_13032_83

Thanks.
ID: 62057
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62058 - Posted: 2 Jul 2009, 14:54:03 UTC

Lord ByRad, my translation skills are minimal, but the status shown for the Rosetta task you highlighted has the acronym RAM in it, which I take to mean that the rest of the words translate to something like "waiting for memory". So the settings for BOINC Manager are not allowing it to use enough of the large memory your system has. There are several memory settings you can adjust to allow BOINC to use more memory.

Also, since there is no Rosetta application in the task list, I take it you have it set to remove from memory when not active. Your machine will do work more efficiently if you leave tasks in memory when suspended.
Rosetta Moderator: Mod.Sense
ID: 62058
Oliver

Joined: 11 Oct 07
Posts: 4
Credit: 525
RAC: 0
Message 62065 - Posted: 2 Jul 2009, 20:23:07 UTC

Hi folks,

I checked the output of the real_core_xxx WUs and found that all of them produce good, valid results. So if you see RMSD=1 or similar oddities, that seems to be an error in the graphics rather than in the actual WU. In summary, the issues seem to be around the BOINC management, not the internal quality of the results.

We are now starting to address the problems mentioned in this thread with graphics, completion time and checkpointing/resuming.

-Oliver

ID: 62065
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62071 - Posted: 3 Jul 2009, 13:58:50 UTC

Oliver, the RMSD of 1 we are seeing is in the graphs of results described in this thread.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4967
Not the graphics on the client machines. So, somewhere, you have data that reports those values in your databases used to make these graphs.
Rosetta Moderator: Mod.Sense
ID: 62071
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 62075 - Posted: 3 Jul 2009, 15:38:21 UTC

Task 262972813 failed on Mac:

Watchdog active.
Hbond tripped: [2009- 7- 2 8:46:56:]

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 334
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>


ID: 62075
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62082 - Posted: 5 Jul 2009, 0:51:15 UTC

This one ran for 10 hrs on a 6hr pref.

It had done 1 model when the watchdog kicked in; I guess it was incomplete.

https://boinc.bakerlab.org/rosetta/result.php?resultid=263029599

Sun 05 Jul 2009 10:20:27 EST|rosetta@home|Output file lb_cutback_all_multi_hb_t328__IGNORE_THE_REST_2CEXA_8_12958_5_1_0 for task lb_cutback_all_multi_hb_t328__IGNORE_THE_REST_2CEXA_8_12958_5_1 absent



ID: 62082



©2024 University of Washington
https://www.bakerlab.org