Report Problems with Rosetta Version 5.07

Message boards : Number crunching : Report Problems with Rosetta Version 5.07

Bin Qian
Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 15234 - Posted: 2 May 2006, 2:45:45 UTC - in response to Message 15229.  

Thanks for reporting. I think Moderator9 is right - it's likely a file transfer error and probably just an isolated case.

I just noticed in my previous message that the Rosetta version is shown as 5.01
even though the Manager says 5.07.


These errors look very similar to a file error that showed up on Ralph for a particular WU type. I am sure Rhiju or Bin will be along shortly to provide more detail. The fact that your system has moved on so to speak would indicate that it is a Work Unit issue. The version number error is probably just an error message in the code that did not get changed with the upgraded version release.


ID: 15234 · Rating: 0
Bin Qian
Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 15236 - Posted: 2 May 2006, 2:56:17 UTC - in response to Message 15231.  

Hi Jose,

".fragments.cc line:722" says that Rosetta thinks the "fragment file" it's reading has the wrong format. Since all the work units named HBLR_1.0_1dtj_ROT_TRIALS_TRIE_462_xxxxx_x read the same "fragment file" during the Rosetta initialization stage, this probably indicates that the file in your reported WU was corrupted or truncated during file transfer.

We have received successful results for this batch so this is very likely an isolated case. But we will keep an eye on it.

Thanks.
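
For anyone who wants to check a suspect input file by hand, here is a minimal sketch of the idea. It assumes you already know the expected size and MD5 checksum of the file; the function name and the return strings are made up for illustration and are not part of BOINC or Rosetta.

```python
import hashlib
import os


def check_download(path, expected_size, expected_md5):
    """Return 'ok', 'truncated', or 'corrupt' for a downloaded input file.

    expected_size and expected_md5 are assumed to come from a trusted
    source (hypothetical here); they are not read from BOINC itself.
    """
    actual_size = os.path.getsize(path)
    if actual_size < expected_size:
        # Fewer bytes than expected: the transfer stopped early.
        return "truncated"

    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)

    if actual_size != expected_size or digest.hexdigest() != expected_md5:
        return "corrupt"
    return "ok"
```

A file that comes back "truncated" or "corrupt" would fit the isolated transfer problem described above, whereas "ok" would point back at the work unit or the application.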


A new type of error has shown up (meaning a "non-107" type).

BTW 107 types are still showing up: When are we going to get help or more information regarding them?


https://boinc.bakerlab.org/rosetta/result.php?resultid=18820267

Result ID 18820267
HBLR_1.0_1dtj_ROT_TRIALS_TRIE_462_13051_0
Workunit 15562229
Created 1 May 2006 11:58:34 UTC
Sent 1 May 2006 16:07:20 UTC
Received 2 May 2006 1:43:43 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 1 (0x1)
CPU time 19.34375
stderr out <core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR:: Exit at: .fragments.cc line:722

</stderr_txt>
Validate state Invalid
Claimed credit 0.0674343140710041
Granted credit 0
application version 5.07



ID: 15236 · Rating: 0
Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 15251 - Posted: 2 May 2006, 8:30:43 UTC

I am running two WUs with HT on my P4: HBLR_xx and AB_CASP6.xx. Memory usage is 300MB, but my Task Manager displays 911MB RAM total. Is there a memory leak of about 300MB somewhere (allowing +300MB for XP)?
ID: 15251 · Rating: 0
Nightbird
Joined: 17 Sep 05
Posts: 70
Credit: 32,418
RAC: 0
Message 15254 - Posted: 2 May 2006, 9:50:39 UTC - in response to Message 15104.  
Last modified: 2 May 2006, 9:59:21 UTC

And a problem: the WU stopped at 33.22% done.


It might just be a slow spot in the WU. Give it some time.

If it's really stuck, then the watchdog should get it. The watchdog then sends back a lot of information about the WU that is useful to the project.


I took a screenshot showing this WU "not working" (1di2) and another WU working (2tif).





ID: 15254 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15255 - Posted: 2 May 2006, 9:58:03 UTC
Last modified: 2 May 2006, 9:58:33 UTC

Now I got a "client error". Does this mean that the data I produced was not received by you?

https://boinc.bakerlab.org/rosetta/result.php?resultid=18820316

Result ID 18820316
Name JUMP_ALLBARCODE03_1tul__468_770_0
Workunit 15562277
Created 1 May 2006 11:58:34 UTC
Sent 1 May 2006 16:07:20 UTC
Received 2 May 2006 9:46:12 UTC
Server state Over
Outcome Client error
Client state Done
Exit status -1073741819 (0xc0000005)
Report deadline 15 May 2006 16:07:20 UTC
CPU time 8928.046875
stderr out <core_client_version>5.2.13</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 1732251
# random seed: 1732251
# cpu_run_time_pref: 14400

</stderr_txt>
Validate state Invalid
Claimed credit 31.1240952250415
Granted credit 0
application version 5.07


“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 15255 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15262 - Posted: 2 May 2006, 11:41:32 UTC - in response to Message 15254.  





The CPU efficiency is a "guess" from Boincview and not necessarily true. If the WU is really stuck (which happens rarely), Rosetta will auto-terminate it after an hour and return the result.
ID: 15262 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15264 - Posted: 2 May 2006, 11:49:28 UTC

This just happened: 7 units lost to computation errors in less than 8 minutes. This is a verbatim copy of the message log recorded in the BOINC Manager. I am confused and searching for reasons why this keeps happening.

To the rhythm of "As the beat goes on"...


5/2/2006 7:28:48 AM|rosetta@home|Unrecoverable error for result JUMP_ALLBARCODE04_1tul__468_770_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:28:48 AM||request_reschedule_cpus: process exited
5/2/2006 7:28:48 AM|rosetta@home|Computation for result JUMP_ALLBARCODE04_1tul__468_770_0 finished
5/2/2006 7:28:48 AM|rosetta@home|Starting result HBLR_1.0_1dtj_ROT_TRIALS_TRIE_462_13053_0 using rosetta version 507
5/2/2006 7:29:37 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_1dtj_ROT_TRIALS_TRIE_462_13053_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:29:37 AM||request_reschedule_cpus: process exited
5/2/2006 7:29:37 AM|rosetta@home|Computation for result HBLR_1.0_1dtj_ROT_TRIALS_TRIE_462_13053_0 finished
5/2/2006 7:29:37 AM|rosetta@home|Starting result HBLR_1.0_1dtj_ROT_TRIALS_TRIE_461_13052_0 using rosetta version 507
5/2/2006 7:30:10 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_1dtj_ROT_TRIALS_TRIE_461_13052_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:30:10 AM||request_reschedule_cpus: process exited
5/2/2006 7:30:10 AM|rosetta@home|Computation for result HBLR_1.0_1dtj_ROT_TRIALS_TRIE_461_13052_0 finished
5/2/2006 7:30:10 AM|rosetta@home|Starting result JUMP_ALLBARCODE07_1tul__468_2204_0 using rosetta version 507
5/2/2006 7:31:08 AM|rosetta@home|Unrecoverable error for result JUMP_ALLBARCODE07_1tul__468_2204_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:31:08 AM||request_reschedule_cpus: process exited
5/2/2006 7:31:08 AM|rosetta@home|Computation for result JUMP_ALLBARCODE07_1tul__468_2204_0 finished
5/2/2006 7:31:09 AM|rosetta@home|Starting result HBLR_1.0_1n0u_ROT_TRIALS_TRIE_462_14487_0 using rosetta version 507
5/2/2006 7:31:14 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_1n0u_ROT_TRIALS_TRIE_462_14487_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:31:14 AM||request_reschedule_cpus: process exited
5/2/2006 7:31:14 AM|rosetta@home|Computation for result HBLR_1.0_1n0u_ROT_TRIALS_TRIE_462_14487_0 finished
5/2/2006 7:31:14 AM|rosetta@home|Starting result HBLR_1.0_1mky_ROT_TRIALS_TRIE_462_14706_0 using rosetta version 507
5/2/2006 7:31:44 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_1mky_ROT_TRIALS_TRIE_462_14706_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:31:44 AM||request_reschedule_cpus: process exited
5/2/2006 7:31:44 AM|rosetta@home|Computation for result HBLR_1.0_1mky_ROT_TRIALS_TRIE_462_14706_0 finished
5/2/2006 7:31:44 AM|rosetta@home|Starting result HBLR_1.0_1di2_ROT_TRIALS_TRIE_461_15256_0 using rosetta version 507
5/2/2006 7:31:47 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_ROT_TRIALS_TRIE_461_15256_0 ( - exit code -1073741819 (0xc0000005))
5/2/2006 7:31:47 AM||request_reschedule_cpus: process exited
5/2/2006 7:31:47 AM|rosetta@home|Computation for result HBLR_1.0_1di2_ROT_TRIALS_TRIE_461_15256_0 finished
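
A log like the one above can be tallied with a short script. The sketch below is only an illustration: the regular expression is written against the message format shown in this post, and the function names are hypothetical. It also converts the signed exit code to its unsigned 32-bit form; -1073741819 is 0xc0000005, which is the Windows status code for an access violation.

```python
import re
from collections import Counter

# Pattern written against the BOINC Manager lines quoted in this post;
# other client versions may word the message differently.
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<name>\S+) "
    r"\( - exit code (?P<code>-?\d+) \(0x[0-9a-fA-F]+\)\)"
)


def to_unsigned_hex(code):
    """Show a signed 32-bit exit code as unsigned hex, e.g. 0xc0000005."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def summarize(log_text):
    """Return (Counter of hex exit codes, list of failed result names)."""
    counts = Counter()
    names = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            counts[to_unsigned_hex(int(match.group("code")))] += 1
            names.append(match.group("name"))
    return counts, names
```

Calling `summarize()` on the pasted log text returns how many results failed with each exit code and which work units they were, which makes it easier to see whether one code dominates.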

ID: 15264 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15265 - Posted: 2 May 2006, 11:53:39 UTC
Last modified: 2 May 2006, 11:55:34 UTC

Jose, download and run memtest86+ for several loops (a few hours). See if it finds a faulty memory module. Open your case and look for dust bunnies which could cause overheating. You might also run Speedfan and see what temps your system is at.

tony
ID: 15265 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15268 - Posted: 2 May 2006, 12:05:09 UTC - in response to Message 15265.  

Jose, download and run memtest86+ for several loops (a few hours). See if it finds a faulty memory module. Open your case and look for dust bunnies which could cause overheating. You might also run Speedfan and see what temps your system is at.

tony


Tony and the rest: it is clear now that everything is futile. Another WU JUST FAILED. I am going to download one more unit. Should that unit fail, I will detach. I am at the end of my frustration levels.
ID: 15268 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15269 - Posted: 2 May 2006, 12:07:39 UTC - in response to Message 15268.  

Jose, download and run memtest86+ for several loops (a few hours). See if it finds a faulty memory module. Open your case and look for dust bunnies which could cause overheating. You might also run Speedfan and see what temps your system is at.

tony


Tony and the rest: it is clear now that everything is futile. Another WU JUST FAILED. I am going to download one more unit. Should that unit fail, I will detach. I am at the end of my frustration levels.


ID: 15269 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15272 - Posted: 2 May 2006, 12:18:42 UTC - in response to Message 15270.  
Last modified: 2 May 2006, 12:19:50 UTC


Jose,

You have a number of machines, and for the first time I have found the most recently connecting system. I notice that it is a quad-CPU system but you have it set to use only 1 CPU. While it may be counterintuitive, have you tried setting it to use all four processors? This is a setting in your general preferences.



I have only one machine. And dear Lord, my machine has only one processor. If more than one machine appears, it is because of the quirks of the BOINC system when one has had to reattach to solve problems, and the absence of a merge function that would give the real picture.

As to the 4 processors... I really don't know what to say, but I doubt that something as obvious as a processor could stay hidden when I inspected my motherboard.

ID: 15272 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15273 - Posted: 2 May 2006, 12:20:55 UTC
Last modified: 2 May 2006, 12:21:20 UTC

Jose, running a boinc project (any) can be a "fortune teller" for your system. Since it runs the cpu at high levels for long periods, it will "test" your system. When errors appear in the boinc projects it can be a signal that it's time to maintain/service your machine. Let's face it, if there is an issue, you'll have to face/find it eventually anyway. Stopping a project will only delay the inevitable.

Now, I don't know if your puter is having a problem or not. What I do see is that you're reporting an error that others are not. Given that it seems to be just you, then it is reasonable to think that it might be your system that needs attention. Running those tests and maybe GIMPS-Prime95, you'll be able to either find the issue, or rule it out as a cause.

Calm down my friend, no need to get an ulcer from this stuff. LOL

tony
ID: 15273 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15274 - Posted: 2 May 2006, 12:28:05 UTC - in response to Message 15273.  

Jose, running a boinc project (any) can be a "fortune teller" for your system. Since it runs the cpu at high levels for long periods, it will "test" your system. When errors appear in the boinc projects it can be a signal that it's time to maintain/service your machine. Let's face it, if there is an issue, you'll have to face/find it eventually anyway. Stopping a project will only delay the inevitable.

Now, I don't know if your puter is having a problem or not. What I do see is that you're reporting an error that others are not. Given that it seems to be just you, then it is reasonable to think that it might be your system that needs attention. Running those tests and maybe GIMPS-Prime95, you'll be able to either find the issue, or rule it out as a cause.

Calm down my friend, no need to get an ulcer from this stuff. LOL

tony


I am calm. Right now detaching and removing BOINC is becoming the more rational of the possibilities. I will have my machine checked. But I need the frustration this is causing like I need a callus on my butt. I am sad. I thought I could do something useful but, alas, all I have been able to do is waste my time and yours.

ID: 15274 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15276 - Posted: 2 May 2006, 12:36:18 UTC - in response to Message 15274.  

I am calm. Right now detaching and removing BOINC is becoming the more rational of the possibilities. I will have my machine checked. But I need the frustration this is causing like I need a callus on my butt. I am sad. I thought I could do something useful but, alas, all I have been able to do is waste my time and yours.

Well, Jose, you must do what you must do. Remember, Boinc takes advantage of otherwise "unused" cycles. So in effect, you're choosing to waste those cycles, rather than allowing them to come to some benefit. You need to do what's best for you. Good luck in whatever you choose.

tony-
ID: 15276 · Rating: 0
Whl.
Joined: 29 Dec 05
Posts: 203
Credit: 275,802
RAC: 0
Message 15278 - Posted: 2 May 2006, 12:47:16 UTC - in response to Message 15275.  


This is the system I am looking at. In the CPU section it shows as a 4-CPU system, but under number of CPUs to use it says 1.

Pardon the intrusion guys, but doesn't the 4 just mean it is a Pentium 4?

ID: 15278 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 15279 - Posted: 2 May 2006, 12:49:16 UTC - in response to Message 15276.  
Last modified: 2 May 2006, 12:50:11 UTC

I am calm. Right now detaching and removing BOINC is becoming the more rational of the possibilities. I will have my machine checked. But I need the frustration this is causing like I need a callus on my butt. I am sad. I thought I could do something useful but, alas, all I have been able to do is waste my time and yours.

Well, Jose, you must do what you must do. Remember, Boinc takes advantage of otherwise "unused" cycles. So in effect, you're choosing to waste those cycles, rather than allowing them to come to some benefit. You need to do what's best for you. Good luck in whatever you choose.

tony-

Tony, the cycles are being wasted: most of the errors are producing waste.

And yes, I will do what I must do.

Take care

Jose

ID: 15279 · Rating: 0
Nightbird
Joined: 17 Sep 05
Posts: 70
Credit: 32,418
RAC: 0
Message 15280 - Posted: 2 May 2006, 12:51:13 UTC - in response to Message 15262.  
Last modified: 2 May 2006, 12:56:46 UTC





The CPU efficiency is a "guess" from Boincview and not necessarily true. If the WU is really stuck (which happens rarely), Rosetta will auto-terminate it after an hour and return the result.

The problem is that the WU 1di2 has been in this state for 2 days now.
Perhaps I should abort the WU?


ID: 15280 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15281 - Posted: 2 May 2006, 12:59:42 UTC
Last modified: 2 May 2006, 13:00:14 UTC

Jose, How many puters do you really have? I see six IDENTICAL puters in your account and the benchmarks are all over the map.

1) Measured floating point speed 2009.88 million ops/sec
Measured integer speed 4014.11 million ops/sec

2) Measured floating point speed 2012.98 million ops/sec
Measured integer speed 4045.58 million ops/sec

3) Measured floating point speed 545.31 million ops/sec
Measured integer speed 3966.71 million ops/sec

4) Measured floating point speed 1276.07 million ops/sec
Measured integer speed 5114.47 million ops/sec

5) Measured floating point speed 1986.21 million ops/sec
Measured integer speed 3371.27 million ops/sec

6) Measured floating point speed 1154.1 million ops/sec
Measured integer speed 235.34 million ops/sec

If you just have one machine continuously being attached/detached then you have an issue here.
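
To put a number on "all over the map", here is a rough sketch that takes the six benchmark pairs listed above and computes the ratio of the fastest to the slowest result for each measure. The variable and function names are illustrative only.

```python
# (floating point, integer) benchmark results in million ops/sec,
# copied from the six host entries listed above.
benchmarks = [
    (2009.88, 4014.11),
    (2012.98, 4045.58),
    (545.31, 3966.71),
    (1276.07, 5114.47),
    (1986.21, 3371.27),
    (1154.1, 235.34),
]


def spread(values):
    """Ratio of the largest value to the smallest value."""
    return max(values) / min(values)


fp_spread = spread([fp for fp, _ in benchmarks])
int_spread = spread([integer for _, integer in benchmarks])

print(f"floating point: fastest is {fp_spread:.1f}x the slowest")
print(f"integer: fastest is {int_spread:.1f}x the slowest")
```

If these entries really are one machine benchmarked several times under the same conditions, both ratios would be expected to sit close to 1, so large ratios suggest something was interfering with some of the benchmark runs.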

Note: none of this conversation belongs in this thread, maybe a mod could move them.
ID: 15281 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 15282 - Posted: 2 May 2006, 13:01:08 UTC - in response to Message 15254.  

I took a screenshot showing this WU "not working" (1di2) and another WU working (2tif).



Are you saying that the CPU time is not increasing, even though it's "running"? Is the idle process getting all the CPU time when this WU is "running"?

I've seen something like that months ago (but not recently). It happened when BOINC stopped the WU and ran the benchmark. For some reason the rosetta client didn't restart even though BOINC said it was "running". I was able to see this by looking through the "messages". Restarting BOINC got the WU going again.

This would be serious, because if the rosetta client isn't actually running then the watchdog won't be running either. Is there anything in the messages around the time that this WU stopped?
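
The check being asked about (is CPU time still increasing?) can be sketched as follows. This is not how the Rosetta watchdog is implemented; it is only an illustration of the idea, assuming you note the wall-clock time and the task's CPU time every so often. The one-hour limit is taken from the figure mentioned earlier in this thread, and the function names are hypothetical.

```python
def stalled_for(samples):
    """Seconds of wall-clock time over which CPU time has not increased.

    samples is a non-empty list of (wall_clock_seconds, cpu_seconds)
    tuples, oldest first.
    """
    last_wall, last_cpu = samples[-1]
    start = last_wall
    # Walk backwards through the trailing run of samples that show
    # no more CPU time than the newest one.
    for wall, cpu in reversed(samples):
        if cpu < last_cpu:
            break
        start = wall
    return last_wall - start


def is_stalled(samples, limit_seconds=3600):
    """True if CPU time has been flat for at least limit_seconds."""
    return bool(samples) and stalled_for(samples) >= limit_seconds
```

For example, `is_stalled([(0, 0), (600, 590), (1200, 1180), (4800, 1180)])` is true because the CPU time stayed at 1180 seconds for a full hour of wall-clock time.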
ID: 15282 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15283 - Posted: 2 May 2006, 13:08:38 UTC - in response to Message 15280.  





The CPU efficiency is a "guess" from Boincview and not necessarily true. If the WU is really stuck (which happens rarely), Rosetta will auto-terminate it after an hour and return the result.

The problem is that the WU 1di2 has been in this state for 2 days now.
Perhaps I should abort the WU?


First of all I would exit BOINC and restart and see if the WU "revives". If that isn't the case I'd abort it.
ID: 15283 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org