Problems with version 5.96

ConflictingEmotions
Joined: 5 Jun 08
Posts: 10
Credit: 3,081,990
RAC: 0
Message 53831 - Posted: 19 Jun 2008, 13:35:07 UTC - in response to Message 53823.  

We haven't been able to reproduce this behavior yet. Tomorrow I'll update rosetta with the latest boinc api and double check the source code to see if there were any changes between versions that could be causing this. We are seeing an odd error at the end of a local run on our linux machines that suggests an api issue but it may or may not be related.


Why are you doing a local run? It should always behave the same as it does for us.

If you add some useful error messages to the output, some of us would probably be willing to run it for you. The problem takes about 2 hours to appear, but restarting seems to go back to the same place. The error I reported indicates that something is wrong with a memory call: Google suggests it is either freeing non-existent memory or providing an insufficient size.

I cannot offer more because these systems are behind a firewall.
ID: 53831 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53838 - Posted: 19 Jun 2008, 17:32:46 UTC

Are people seeing this problem with other work units or is it a t405 specific problem for now?
ID: 53838 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,151
Message 53841 - Posted: 19 Jun 2008, 18:09:22 UTC
Last modified: 19 Jun 2008, 18:11:52 UTC

All the WUs I had that "got stuck" were t405s. That, of course, is circumstantial evidence only.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 53841 · Rating: 0
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 53843 - Posted: 19 Jun 2008, 18:15:02 UTC - in response to Message 53838.  

Are people seeing this problem with other work units or is it a t405 specific problem for now?


It seems to be only or mostly t405 work units.



ID: 53843 · Rating: 0
ConflictingEmotions
Joined: 5 Jun 08
Posts: 10
Credit: 3,081,990
RAC: 0
Message 53848 - Posted: 19 Jun 2008, 18:48:41 UTC - in response to Message 53843.  

Are people seeing this problem with other work units or is it a t405 specific problem for now?


It seems to be only or mostly t405 work units.


Yes, t405 work units have caused problems on two different systems. No other work units have shown the problem.
ID: 53848 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53850 - Posted: 19 Jun 2008, 19:01:03 UTC

There was a little miscommunication and the person who ran the local test confirms that we can reproduce this with the t405 task so we have something to work with now. It appears to be just a linux issue. When the boinc api calls exit, it stalls.
ID: 53850 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,005
RAC: 2,103
Message 53851 - Posted: 19 Jun 2008, 19:01:38 UTC

Here's a t401 that blew up:
https://boinc.bakerlab.org/rosetta/result.php?resultid=170539103
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3431315
ERROR:: Exit from: .loop_relax.cc line: 1814

</stderr_txt>
]]>
ID: 53851 · Rating: 0
Helix Von Smelix
Joined: 16 Oct 05
Posts: 12
Credit: 4,026,102
RAC: 53
Message 53852 - Posted: 19 Jun 2008, 19:18:14 UTC - in response to Message 53850.  
Last modified: 19 Jun 2008, 19:20:09 UTC

There was a little miscommunication and the person who ran the local test confirms that we can reproduce this with the t405 task so we have something to work with now. It appears to be just a linux issue. When the boinc api calls exit, it stalls.


All of my ten-plus Win XP (SP2 and SP3) machines had this problem, and it happened with t407 too. I also think there was the same issue with other t4xx WUs.
ID: 53852 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53854 - Posted: 19 Jun 2008, 21:39:08 UTC

We can't reproduce this on our Windows machines, and it appears to be a Linux-specific issue. Can others confirm seeing this on Windows platforms? If so, can you point me to your tasks so I can see the stderr?
ID: 53854 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 53859 - Posted: 19 Jun 2008, 22:44:34 UTC

Unfortunately, mine will all say "aborted by user", because I didn't take the time to end and restart BOINC 5 times. Suspending and resuming didn't seem to resolve the problem: BOINC thought the task was active, but it wasn't getting any CPU.

https://boinc.bakerlab.org/rosetta/result.php?resultid=171538656
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 53859 · Rating: 0
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 53860 - Posted: 19 Jun 2008, 22:50:18 UTC - in response to Message 53854.  

We can't reproduce this on our Windows machines, and it appears to be a Linux-specific issue. Can others confirm seeing this on Windows platforms? If so, can you point me to your tasks so I can see the stderr?


Here you are, David. It's a Windows-only problem on my end:

Task 171979908

I aborted all of the other 5.96 tasks because they were behaving the same way, so the rest on my list have a status of "aborted by user" with nothing else to go on.



ID: 53860 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 53863 - Posted: 19 Jun 2008, 23:51:19 UTC - in response to Message 53854.  

We can't reproduce this on our Windows machines, and it appears to be a Linux-specific issue. Can others confirm seeing this on Windows platforms? If so, can you point me to your tasks so I can see the stderr?

Yes, I can. My t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_53598_0 is stuck at 44.231% on a Windows host.

I haven't tried restarting the client yet; I just suspended the task (with "keep in memory" enabled) after seeing it idle and resumed it immediately afterwards. In the past, such tasks occasionally continued flawlessly to a successful end once their executable was actually restarted.

[...]

Letting the application restart due to a missed heartbeat did not help. I've thus restarted the client and the task is now happily crunching. I've elevated Rosetta's STD, so the task should finish in about 2 hours, but it probably won't be reported until around 8 am UTC. You can see the result afterwards.

Unfortunately, it didn't occur to me to take a snapshot of its slot. But if it helps, stderr.txt contains a lot of "res 13 and var 1 at position 1 is not a proper Nterm variant" lines, and stderr.txt has the following inside:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C911669 read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...
but without any debugger output afterwards.

No other suspicious files around.

Peter
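
For anyone collecting these reports, the exception record quoted above can be pulled out of a stderr.txt dump automatically. The following is only a minimal sketch: the function name `parse_exception_record` is hypothetical, and the pattern assumes the line always has the shape seen in this post (the `write` alternative is a guess, since only a `read` example appears in the thread). Code 0xc0000005 is the Windows access-violation status, and a read from address 0x00000000 usually points to a null pointer dereference.

```python
import re

# Pattern modelled on the single "Reason:" line quoted in this thread.
# The "write" alternative is an assumption; only "read" was observed here.
EXCEPTION_LINE = re.compile(
    r"Reason: (?P<reason>.+?) \((?P<code>0x[0-9a-fA-F]+)\) "
    r"at address (?P<at>0x[0-9a-fA-F]+) "
    r"(?P<access>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+)"
)


def parse_exception_record(text):
    """Return the fields of the first exception record in text, or None."""
    match = EXCEPTION_LINE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    # An access to address 0 is the classic sign of a null pointer.
    info["null_dereference"] = int(info["target"], 16) == 0
    return info


sample = (
    "Unhandled Exception Detected...\n"
    "- Unhandled Exception Record -\n"
    "Reason: Access Violation (0xc0000005) at address 0x7C911669 "
    "read attempt to address 0x00000000\n"
)
record = parse_exception_record(sample)
```

Running this over many stderr dumps would make it easy to see whether every stalled task reports the same faulting address.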
ID: 53863 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 53864 - Posted: 20 Jun 2008, 1:12:30 UTC - in response to Message 53863.  

I've restarted the client and the task is now happily crunching. I've elevated Rosetta's STD, so the task should finish in about 2 hours, but it probably won't be reported until around 8 am UTC. You can see the result afterwards.

Unfortunately, it didn't occur to me to take a snapshot of its slot. But if it helps, stderr.txt contains a lot of "res 13 and var 1 at position 1 is not a proper Nterm variant" lines, and stderr.txt has the following inside:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C911669 read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...
but without any debugger output afterwards.

No other suspicious files around.

In 58 minutes the task got to 54.843% and the same Access Violation happened again. It is paused, awaiting requests. As Feet1st noted, if the result data is needed, the task might finish in 5 BOINC restarts :-D

Peter
ID: 53864 · Rating: 0
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 53865 - Posted: 20 Jun 2008, 1:44:11 UTC
Last modified: 20 Jun 2008, 1:46:28 UTC

From The_Brain_QC, who posted in Science section:

Problem of multithreading with rosetta 5.96 beta


I am currently running Rosetta 5.96 with BOINC 5.10.45 on a Quad 6600 with 2 GB of 1066 RAM under Win XP 32.

When I run 2 or more Rosetta threads simultaneously, only one is active with this version of the software. I never had this bug with other Rosetta versions.

For developer information.


PS: Sorry for my poor English.

ID: 53865 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,151
Message 53866 - Posted: 20 Jun 2008, 6:03:23 UTC

I can confirm that all of the t405 WUs that "stuck" were running on Win XP SP3 systems. The symptoms were not the same as the "stuck at 100%" case: mine typically stuck between 40 and 50%. Otherwise the story is the same: they showed as "Running" in BM, but the wall time, percentage, and completion estimate were static, and the Windows idle process was using the CPU quota. It was the fall in CPU temperature that alerted me to the problem.

As above, the stderr will just show aborted by user.
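
The symptom reported throughout this thread (a task listed as running while its CPU time stops advancing) can be checked mechanically instead of by watching CPU temperature. What follows is a rough sketch, not a BOINC feature: the function `is_stalled`, the sample format, and the 10-minute threshold are all assumptions made for illustration, and the samples would have to be collected by hand or by some external script.

```python
def is_stalled(samples, min_wall_seconds=600):
    """Decide whether a task looks stalled.

    samples is a list of (wall_clock_seconds, cpu_seconds) pairs for one
    task, oldest first. The task is treated as stalled when at least
    min_wall_seconds of wall time passed without any CPU time accruing.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    first_wall, first_cpu = samples[0]
    last_wall, last_cpu = samples[-1]
    enough_time_passed = (last_wall - first_wall) >= min_wall_seconds
    no_cpu_progress = last_cpu <= first_cpu
    return enough_time_passed and no_cpu_progress


# Hypothetical observations: 15 minutes of wall time, CPU time frozen.
stuck_task = [(0, 5200.0), (300, 5200.0), (900, 5200.0)]
# Hypothetical healthy task: CPU time keeps up with wall time.
healthy_task = [(0, 5200.0), (900, 6050.0)]
```

A short observation window is deliberately reported as not stalled, since a task may simply be preempted for a few minutes.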
ID: 53866 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 728
Message 53872 - Posted: 20 Jun 2008, 12:49:09 UTC - in response to Message 53798.  

Are the actual rosetta processes not running and the boinc client stays idle as if it doesn't know that the error occurred? We need some more feedback to assess the situation. There is definitely a problem right now with these jobs that were submitted yesterday. If the client doesn't report back, we can't tell that the errors are occurring. I sent an email to Rom and David Anderson to see if there may have been an issue with the BOINC api.


I have seen this occur on both Windows and Linux. On Linux, a task gets to 100% and keeps saying that it is 'running', but nothing is happening. I have aborted 2 of these so far, with another half dozen to go.
I have now had one that only got to 16% on Windows before it stopped doing anything, though the status still says 'running'.
I aborted that one too, and will now abort all "t405" type work units, as I am losing many hours with nothing to show for it.
ID: 53872 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 53877 - Posted: 20 Jun 2008, 18:43:13 UTC - in response to Message 53864.  

I've restarted the client and the task is now happily crunching. I've elevated Rosetta's STD, so the task should finish in about 2 hours, but it probably won't be reported until around 8 am UTC. You can see the result afterwards.

In 58 minutes the task got to 54.843% and the same Access Violation happened again. It is paused, awaiting requests. As Feet1st noted, if the result data is needed, the task might finish in 5 BOINC restarts :-D

The t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_53598_0 finally finished after another client restart.

Peter
ID: 53877 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 53879 - Posted: 20 Jun 2008, 18:51:45 UTC
Last modified: 20 Jun 2008, 18:55:09 UTC

I've got 2 T405s I had suspended on a Win XP machine. What should I capture when they hang up? And when to capture? Before or after suspending the task?

[edit]
I shortened my runtime, and the first completed normally at 1:51. Do they have to run longer to see the problem?
ID: 53879 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,005
RAC: 2,103
Message 53885 - Posted: 20 Jun 2008, 23:13:52 UTC

ID: 53885 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53887 - Posted: 21 Jun 2008, 3:50:32 UTC

I posted minirosetta version 1.29 and rosetta_beta 5.97 on Ralph, both of which include a fix for this bad bug that stalls clients. The problem was a possible infinite loop in the BOINC API when an access violation caused by our t405 job was caught after the job completed. Hopefully the tests running on Ralph will confirm the fix.
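
To illustrate the general failure mode described here, the sketch below simulates a shutdown path whose cleanup always faults and whose fault handler retries the shutdown. It is not the BOINC API code or the actual fix, neither of which appears in this thread; the names `FaultAtExit` and `run_shutdown` and the re-entrancy guard are assumptions chosen to show one common way such a loop can be broken.

```python
class FaultAtExit(Exception):
    """Stands in for the access violation raised during shutdown."""


def run_shutdown(guarded, max_steps=1000):
    """Simulate a shutdown whose cleanup step always faults.

    Returns how many times the cleanup ran. Without the guard, the
    handler retries the shutdown every time it catches the fault, which
    would never end; max_steps only exists to stop the simulation.
    """
    in_handler = False
    attempts = 0
    while attempts < max_steps:
        attempts += 1
        try:
            # Cleanup after the job completed hits the same fault again.
            raise FaultAtExit("cleanup touched invalid memory")
        except FaultAtExit:
            if guarded and in_handler:
                # Second fault while already handling one: stop retrying.
                break
            in_handler = True
            # Unguarded path: fall through and retry the shutdown.
    return attempts


unguarded_runs = run_shutdown(guarded=False, max_steps=50)
guarded_runs = run_shutdown(guarded=True, max_steps=50)
```

In the unguarded case the loop only ends because the simulation caps it, which matches what volunteers saw: a task that still looks alive but never makes progress or exits.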
ID: 53887 · Rating: 0