Report Problems with Rosetta Version 5.13

Message boards : Number crunching : Report Problems with Rosetta Version 5.13

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 16361 - Posted: 16 May 2006, 4:06:05 UTC
Last modified: 16 May 2006, 4:08:34 UTC

I have a few errors:

    BOINC 5.4.9, Rosetta 5.13
    GenuineIntel Intel(R) Pentium(R) M processor 1.86GHz
    Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)
    Memory 2039.37 MB
    cache 76.56 KB
    swap space 932.3 MB
    65.54 GB



resultid=20299415 -<error_code>-161</error_code> </file_xfer_error>
resultid=20402395 - exit code -1073741819 (0xc0000005)
resultid=20419239 - Maximum disk usage exceeded
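The exit codes in these reports are 32-bit Windows status values printed as signed decimals; the hex in parentheses is the same bit pattern read as unsigned. A small Python sketch of that conversion, with the standard Windows SDK names for the three exit codes that come up in this thread (the helper names `exit_code_to_hex` and `describe` are hypothetical):

```python
# Windows status codes reported in this thread, keyed by their unsigned 32-bit value.
# The names are the standard constants from the Windows SDK headers.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0x40010004: "DBG_TERMINATE_PROCESS",
}


def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def describe(code: int) -> str:
    """Return 'hex (NAME)' for a known code, or just the hex form otherwise."""
    name = KNOWN_CODES.get(code & 0xFFFFFFFF)
    hex_form = exit_code_to_hex(code)
    return f"{hex_form} ({name})" if name else hex_form


for code in (-1073741819, -1073741811, 1073807364):
    print(code, "->", describe(code))
```

The masking with `0xFFFFFFFF` is what turns the negative decimal BOINC prints into the familiar `0xc0000005` form.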

The machine is also running RALPH.

Regards
Phil



We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 16361 · Rating: 0
akma
Joined: 11 May 06
Posts: 8
Credit: 159,246
RAC: 0
Message 16366 - Posted: 16 May 2006, 5:48:55 UTC
Last modified: 16 May 2006, 5:52:46 UTC

In case this helps, this is the error message I get every time it dies: HOMOLOG_ABRELAX_hom005_t283__505_7824_0 ( - exit code -1073741811 (0xc000000d))
Also, every time I've seen it exit, it was during the switch between the initial and the full-atom relax stages.

ID: 16366 · Rating: 0
Seth Aaronson
Joined: 5 Mar 06
Posts: 18
Credit: 3,976
RAC: 0
Message 16370 - Posted: 16 May 2006, 7:36:13 UTC

Here are my latest errors from the Messages tab in BOINC Manager:

5/15/2006 8:41:31 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_1ogw_ROT_TRIALS_TRIE_462_10628_1 ( - exit code 1073807364 (0x40010004))


5/15/2006 4:12:24 PM|rosetta@home|Unrecoverable error for result HOMOLOG_ABRELAX_hom006_t283__505_27635_0 ( - exit code 1073807364 (0x40010004))


5/16/2006 12:27:44 AM|rosetta@home|Unrecoverable error for result TEST_HOMOLOG_ABRELAX_hom001_1opd__504_42274_0 ( - exit code 1073807364 (0x40010004))

The last error seems to have frozen my machine.
I press Ctrl-Alt-Delete (the old three-finger salute), go into Task Manager, end the process 'rosetta_5.13_windows_intelx86.exe' because it's not responding, and close the dialog offering to debug the app. BOINC then seems to continue crunching my other attached projects (Einstein and SETI).

What's the prognosis, if any?

-Seth
ID: 16370 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16376 - Posted: 16 May 2006, 11:41:46 UTC
Last modified: 16 May 2006, 11:43:01 UTC

More errors. This one is frustrating, as it occurred on a WU that was near completion (94%+ with 25 models done).

https://boinc.bakerlab.org/rosetta/result.php?resultid=20185281

My results page is basically now a computing errors report. I am getting frustrated again.

I have done everything within my power. I have maintained my computer. I have reinstalled my operating systems, with all the hassle of having to reinstall many of the applications I NEED.

I cannot buy a new computer. And forget about a Mac. Simply stated, most of the applications I use and NEED do not have a Mac equivalent.

This is very inefficient: a work unit that ends in a computing error gets resent. That means a computer that could be doing a new work unit has to redo a WU; in the case I just reported, a WU that was more than 94% complete. ARGH!!!!
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 16376 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16377 - Posted: 16 May 2006, 11:55:21 UTC
Last modified: 16 May 2006, 11:56:17 UTC

Another error!!!!

This is how my results page looks now:

https://boinc.bakerlab.org/rosetta/results.php?userid=69098

The last error happened while I was using my word processor application.

The way this is going, it seems that BOINC and/or Rosetta is becoming a "Windows XP need not apply" application combo.

This is getting to the point where I am basically being forced to choose between being able to use my computer for the purposes I need or running Rosetta. I am sad to say that if I am forced to make that choice, I will have to stop running Rosetta.

I am getting so annoyed that I run the risk of being unfair, but it seems that the developers do not consider the "conflict" issue important enough to spend time finding a solution. This sounds harsh, but this is how I am feeling.

ID: 16377 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16379 - Posted: 16 May 2006, 12:20:38 UTC
Last modified: 16 May 2006, 12:21:12 UTC

Four more new errors. Should the last WU in my queue fail, I will not accept more Rosetta work units until the issue of the 107 errors is solved. Simply stated, I am being forced to choose between using my computer for my daily tasks while watching the Rosetta WUs keep failing (some within seconds of starting), or turning it into a Rosetta-only computer (that is not a choice).
Rosetta has become for me the computational equivalent of Russian Roulette.
ID: 16379 · Rating: 0
Cureseekers~Kristof
Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 16382 - Posted: 16 May 2006, 13:10:57 UTC

Since they implemented the watchdog function, it seems that 99% of the previous errors are gone. You are the only one who posts errors here on a regular basis.
Looking at your error rate, I suspect something is wrong with your PC.
* Is your PC overclocked? If so, try resetting it to the default values.
* Do you use an optimized BOINC client? If so, try the standard one.
* Also try running a memory test (http://www.memtest.org/#downiso).

Member of Dutch Power Cows
ID: 16382 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 16384 - Posted: 16 May 2006, 13:20:07 UTC

Jose, you could find out whether it's the Rosetta science app or your puter by attaching to a different project and running a few WUs.
ID: 16384 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16385 - Posted: 16 May 2006, 13:39:12 UTC - in response to Message 16382.  
Last modified: 16 May 2006, 13:46:49 UTC

My PC is NOT overclocked.
I do not use an optimized client.
I have updated and maintained my computer; it started producing complete WUs, and now ALL I am getting is a deluge of errors...

I have done everything I can to my computer. All I will do now is this: should I get another 107 error, I will leave Rosetta.

Don't delude yourself into thinking that because I am the only one reporting, I am the only one having problems. Maybe I am one of the few who CARE enough to report.

Wasting my time is something I don't like, and this is turning out to be a waste of time.
ID: 16385 · Rating: 0
pieface
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 16386 - Posted: 16 May 2006, 13:40:35 UTC
Last modified: 16 May 2006, 13:41:01 UTC

Lost a Rosetta 5.13 unit overnight: 20297842
Running BM 5.4.9 on Win XP, I hit one of those 0xc0000005 errors. When I looked at the machine this morning, there were several dialog boxes from Norton Internet Security saying Rosetta was trying to connect to the Internet / DNS server. I don't know whether they were related to this unit or to one of the others that finished overnight, though.
ID: 16386 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 16388 - Posted: 16 May 2006, 14:31:10 UTC - in response to Message 16385.  
Last modified: 16 May 2006, 18:57:51 UTC

Jose,

I have spent a number of hours over the last few days looking specifically at the issues you are having. Personally, I do not believe the problem lies in any one element of your system (hardware, Windows, BOINC or Rosetta) by itself. There is some other process running on the system (probably in the background) that is attempting to terminate Rosetta for some reason. Here is a quote from an e-mail I received from a Windows/BOINC developer on the "-107" error type:

"This error code should only be triggered by an external process that is
attempting to terminate Rosetta after some other crash event has been
detected and handled."


This implies that something else is going on with your system. For the most part your errors are very consistent and occur in the same location. It is my belief that the problem is somehow related to the graphics functions of your system, but this is only a guess. The programmers are looking at these errors right now. While I understand your frustration, please understand that these things cannot be fixed instantly, and in many cases it can take some time to track them down even when sitting right at the computer.

In this case there are people from all over the world (literally) looking at and trying to solve this particular problem for you. But it is not easy to do it remotely.

The next release of the software is closer than you think and it has been running well on RALPH. Give us at least another 48 hours to solve this.

What I would try, if I were there to do it, is to turn off ALL screen saver functions in Windows completely (including Windows screen savers). I would then cold start the system (power off and restart). Then I would NOT run any graphics for any of the BOINC projects. I would wait to see whether I get any errors under those conditions, but I would also keep track of what I am doing other than BOINC while running this test. If I then had a problem, I might be able to determine whether some other process (other than BOINC/Rosetta) was involved.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 16388 · Rating: 0
Simon Walker
Joined: 17 Oct 05
Posts: 3
Credit: 459,592
RAC: 0
Message 16392 - Posted: 16 May 2006, 15:50:51 UTC

I'm seeing problems with 2 computers, both running XP; neither is overclocked or running an optimised client. One is an FX-53 with 2 GB of memory and the other is an AMD 64 X2 4400 with 4 GB of memory.

The latter (my work PC) was doing nothing except BOINC and getting mail (Outlook open), yet managed to come up with:

16/05/2006 16:41:22|rosetta@home|Unrecoverable error for result HOMOLOG_ABRELAX_hom004_t283__505_31515_0 ( - exit code -1073741811 (0xc000000d))
16/05/2006 16:41:22|rosetta@home|Deferring scheduler requests for 1 minutes and 0 seconds
16/05/2006 16:41:22||Rescheduling CPU: application exited
16/05/2006 16:41:22|rosetta@home|Computation for task HOMOLOG_ABRELAX_hom004_t283__505_31515_0 finished

The other machine, the FX-53, has been having Rosetta problems for quite a while now, and its user (the wife) is close to having it deleted.
Active PCs

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=145422

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=193752

Results: https://boinc.bakerlab.org/rosetta/results.php?userid=5150
ID: 16392 · Rating: 0
KWSN Sir Clark
Joined: 18 Sep 05
Posts: 46
Credit: 387,432
RAC: 0
Message 16395 - Posted: 16 May 2006, 16:31:11 UTC
Last modified: 16 May 2006, 16:34:44 UTC

Using BOINC 5.4.9 and Rosetta 5.13

WU Name: CASP_HOMOLOG_ABRELAX_hom001_t287__507_11587 (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=16862883)

Stuck at 1.04% after 1 hour, despite a run-time preference of 1 hr.

It's still crunching... If it hasn't changed by 90 minutes of crunch time, I'm going to abort it.

In the screensaver it shows Accepted RMSD: ?

Not sure whether this is a bug

Edit: It just crashed

<core_client_version>5.4.9</core_client_version>
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 2215394
No heartbeat from core client for 31 sec - exiting
# random seed: 2215394
No heartbeat from core client for 31 sec - exiting
# random seed: 2215394
# cpu_run_time_pref: 3600

</stderr_txt>
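The repeated "No heartbeat from core client for 31 sec - exiting" lines show the application giving up after the BOINC core client stopped signalling it. A minimal Python sketch of that kind of heartbeat check follows; the class, its methods and the 30-second timeout are assumptions chosen to match the log, not BOINC's actual API or source.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical sketch: remembers when the last heartbeat arrived and
    reports when the silence has lasted longer than the timeout."""

    def __init__(self, timeout_secs: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_secs = timeout_secs  # assumed threshold, see lead-in
        self.clock = clock                # injectable so the example is deterministic
        self.last_beat = clock()

    def beat(self) -> None:
        """Call whenever a heartbeat message arrives from the core client."""
        self.last_beat = self.clock()

    def silence(self) -> float:
        """Seconds since the last heartbeat."""
        return self.clock() - self.last_beat

    def should_exit(self) -> bool:
        """True once the core client has been silent for longer than the timeout."""
        return self.silence() > self.timeout_secs


# Simulated clock: the client goes quiet and 31 seconds pass.
now = [0.0]
monitor = HeartbeatMonitor(clock=lambda: now[0])
now[0] = 31.0
if monitor.should_exit():
    print(f"No heartbeat from core client for {int(monitor.silence())} sec - exiting")
```

In this model the application is not crashing on its own; it is reacting to the client going quiet, so the interesting question is why the client stopped responding.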


ID: 16395 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 16396 - Posted: 16 May 2006, 16:48:01 UTC - in response to Message 16395.  

In the screensaver it shows Accepted RMSD: ?

Not sure whether this is a bug

Not a bug. It means you are exploring unknown territory with one of the proteins from the CASP contest. I described my understanding of RMSD way down here in this thread. Hope this helps.
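RMSD (root-mean-square deviation) measures how far a predicted structure's atoms lie from the matching atoms of a known reference (native) structure: RMSD = sqrt((1/N) * sum over i of |x_i - y_i|^2). For a CASP target the native structure is not yet public, so there is nothing to compare against, which fits the "?" shown in the screensaver. A minimal Python sketch, assuming the two coordinate sets are already optimally superimposed (real structure tools superimpose first; that step is omitted here), with a hypothetical `rmsd` helper:

```python
import math
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(model: Sequence[Coord], native: Optional[Sequence[Coord]]) -> Optional[float]:
    """Root-mean-square deviation between matched atoms.

    Returns None when there is no native structure to compare against
    (the "RMSD: ?" case). Assumes the coordinates are already superimposed.
    """
    if native is None:
        return None
    if len(model) != len(native) or not model:
        raise ValueError("need the same, non-zero number of atoms in both structures")
    total = sum(
        (mx - nx) ** 2 + (my - ny) ** 2 + (mz - nz) ** 2
        for (mx, my, mz), (nx, ny, nz) in zip(model, native)
    )
    return math.sqrt(total / len(model))


print(rmsd([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]))
print(rmsd([(0.0, 0.0, 0.0)], None))  # unknown native structure
```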

PS: A WU that takes more than an hour is very common, and giving it another 30 minutes isn't always going to help. In this case it errored out, but in the future, look at the graphics and at the steps and models crunched. You have to complete one model before the WU can send anything back, and with a 1-hr time preference you will often run over and get only the one model completed. Pay no attention to the progress %: it will read 1-point-something percent throughout model 1 and, with your short time preference, will often zip to 100% when it reaches the end of model 1.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 16396 · Rating: 0
KWSN Sir Clark
Joined: 18 Sep 05
Posts: 46
Credit: 387,432
RAC: 0
Message 16399 - Posted: 16 May 2006, 18:39:46 UTC

I've upped it to 2 hrs and will see what happens.
ID: 16399 · Rating: 0
senatoralex85
Joined: 27 Sep 05
Posts: 66
Credit: 169,644
RAC: 0
Message 16400 - Posted: 16 May 2006, 18:43:35 UTC

I am not sure whether this is a problem with version 5.13 or with the new CASP workunits. Lately, I am finding that workunits get stuck at 1.04 percent for an unknown amount of time (similar to Clark's). I returned about 1.5 hours later and it was at 77 percent. The workunit reported successfully and I did not have to abort it. I think there may be a problem with the progress indicator under the workunit tab. I did not have this problem with Rosetta until my machine was sent a CASP workunit. I am using BOINC version 4.45.
ID: 16400 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 16401 - Posted: 16 May 2006, 18:48:47 UTC

Jose:
Download a program called HiJackThis! from here: MajorGeeks download site
When you run it, select the option that creates the log. Post the HiJackThis! log here, and we can see what's running in the background - and hopefully help the programmers identify what can be causing the 107 errors.
ID: 16401 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16402 - Posted: 16 May 2006, 18:51:54 UTC - in response to Message 16388.  

I am running another BOINC application. Let's see what happens there.

I am not running Rosetta. I cannot take more frustration for the moment, nor can I take more aggravation.

I tried everything you suggested, all but the Mac. So I will stop being the odd person out and leave the project to those who can actually run it without the humongous quantity of errors I got.

I seriously doubt the conflicts will be solved.

Since I am a CASP7 observer, I will keep track of Team Rosetta's progress. I wish you all success.


ID: 16402 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 16403 - Posted: 16 May 2006, 18:53:22 UTC - in response to Message 16400.  

This is normal Rosetta behavior. The work unit reaches 1.04% complete during the initialization process. It then stays there until it completes the first model. Depending on your time setting and the size of the protein being examined, the percent complete can behave in a variety of ways.

If the first model takes longer than your time setting, it will jump from 1.04% to 100% complete in one leap.
If the first model takes only half of your time setting, the percent will jump to 50%.

It progresses in that way. There is a lot more detail in the FAQs linked in my signature.

But what you are seeing is completely normal. You may not have noticed that the time to completion rises as the CPU time rises. When the percent complete jumps forward, the time remaining falls back to a more accurate number as well. This is also normal.
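The behavior described above can be summarized as a simple rule. The Python sketch below is a simplified model under stated assumptions, not Rosetta's code: before the first model finishes, the display shows a fixed initialization marker (1.04 here, taken from the reports in this thread); afterwards it shows CPU time used as a fraction of the time setting, capped at 100%. The name `percent_complete` is hypothetical.

```python
# Value users in this thread report seeing while model 1 is running (illustrative).
INIT_MARKER_PERCENT = 1.04


def percent_complete(models_done: int, cpu_secs: float, target_secs: float) -> float:
    """Simplified model of the progress display, not Rosetta's actual code."""
    if models_done == 0:
        # No model finished yet: the display stays at the initialization marker.
        return INIT_MARKER_PERCENT
    # After a model finishes: CPU time used relative to the time setting, capped.
    return min(100.0, 100.0 * cpu_secs / target_secs)


setting = 3600.0  # 1-hour time setting
print(percent_complete(0, 2400.0, setting))  # still on model 1
print(percent_complete(1, 1800.0, setting))  # first model took half the setting
print(percent_complete(1, 5000.0, setting))  # first model overran the setting
```

The three calls reproduce the cases in the post: stuck at the marker, a jump to 50%, and a single leap to 100%.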

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 16403 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 16404 - Posted: 16 May 2006, 18:59:28 UTC

When you see reports of 1.x percent done, it's not an actual percentage. Those are specific points in the program, and it's telling us where it is, such as point 1.040, 1.042, etc. Once it finishes a model, it figures out how much time that took, looks at your time setting, and then determines the percentage done and whether there's time to create another model.

Since some of the CASP7 models take much longer than we're used to, we're much more likely to see 4-, 6-, or 8-hour WUs on our machines. They'll click slowly through the 1.x percent done messages, finish the model, notice that it's over the time limit, pop up a message about being 100% complete, and then send the WU back to the lab.
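The end-of-model decision described above can be sketched as follows. The rule used here (always finish one model; after that, start another only if one more model of average length should fit within the time setting) is an assumption for illustration, since the thread does not say exactly how Rosetta decides. The function names are hypothetical.

```python
from typing import Tuple


def should_start_another_model(models_done: int, cpu_secs: float,
                               target_secs: float) -> bool:
    """Hypothetical rule: at least one model is always run; after that, only
    start another if an average-length model should finish within the target."""
    if models_done == 0:
        return True  # nothing can be returned until one model is complete
    avg_model_secs = cpu_secs / models_done
    return cpu_secs + avg_model_secs <= target_secs


def simulate(model_secs: float, target_secs: float) -> Tuple[int, float]:
    """Run fixed-length models until the rule says stop; return (models, cpu_secs)."""
    models, cpu = 0, 0.0
    while should_start_another_model(models, cpu, target_secs):
        cpu += model_secs
        models += 1
    return models, cpu


one_hour = 3600.0
print(simulate(4 * one_hour, one_hour))  # -> (1, 14400.0): long CASP-style model
print(simulate(600.0, one_hour))         # -> (6, 3600.0): short 10-minute models
```

Under this rule a 4-hour model on a 1-hour setting runs once, overshoots the setting, and is reported as 100% complete, matching the behavior described for the long CASP7 work units.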


ID: 16404 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org