Posts by ConflictingEmotions

1) Message boards : Number crunching : Major problems with granted credit (Message 58658)
Posted 7 Jan 2009 by ConflictingEmotions
Post:
the credit granting system was broken due to a corrupt database table. I fixed it and it appears to be running okay now.


So when will we see the updates to the affected WUs that were already completed?
2) Message boards : Number crunching : Problems with web site (Message 57477)
Posted 2 Dec 2008 by ConflictingEmotions
Post:
For those whose BOINC client is still not getting the new scheduler URL from our master URL, and who would rather not detach and reattach the project (which should fix the issue if updating doesn't), you can do the following:

1. Stop the BOINC client.
2. Manually edit the client_state.xml, client_state_prev.xml, and master_boinc.bakerlab.org_rosetta.xml files in the BOINC client data directory by changing all occurrences of "http://boinc.bakerlab.org/rosetta_cgi/cgi" to "http://srv4.bakerlab.org/rosetta_cgi/cgi". These should appear between <scheduler> and </scheduler> tags. You might not need to update master_boinc.bakerlab.org_rosetta.xml, but you might as well.
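The manual edit in the two steps above could be scripted. The following is a minimal sketch, not an official tool: the two URLs and the three file names come from the post, while the function names, the `.bak` backup suffix, and the idea of passing in the data directory are assumptions made for illustration. It does a plain text replacement and does not parse the XML, so it relies on the old URL appearing only where it should be changed.

```python
from pathlib import Path
import shutil

# URLs and file names as given in the post above.
OLD_URL = "http://boinc.bakerlab.org/rosetta_cgi/cgi"
NEW_URL = "http://srv4.bakerlab.org/rosetta_cgi/cgi"
STATE_FILES = (
    "client_state.xml",
    "client_state_prev.xml",
    "master_boinc.bakerlab.org_rosetta.xml",
)


def replace_scheduler_url(text: str) -> tuple[str, int]:
    """Return the updated text and the number of old URLs that were replaced."""
    count = text.count(OLD_URL)
    return text.replace(OLD_URL, NEW_URL), count


def patch_data_dir(data_dir: Path) -> dict[str, int]:
    """Patch each state file found in data_dir (hypothetical helper).

    A copy of every modified file is kept next to it with a .bak suffix.
    Returns a mapping of file name to number of replacements made.
    """
    results = {}
    for name in STATE_FILES:
        path = data_dir / name
        if not path.exists():
            continue  # the master file in particular may be absent
        original = path.read_text(encoding="utf-8", errors="surrogateescape")
        updated, count = replace_scheduler_url(original)
        if count:
            shutil.copy2(path, path.with_name(path.name + ".bak"))
            path.write_text(updated, encoding="utf-8", errors="surrogateescape")
        results[name] = count
    return results
```

With the BOINC client stopped, calling `patch_data_dir(Path(...))` on your own data directory (its location depends on the installation) would report how many occurrences were changed in each file. Running it a second time changes nothing, because the new URL does not contain the old one.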

If anyone has any suggestions or ideas for an easier fix, I'm all ears. The redirection does not appear to work so I removed it.


Do these suggestions lose all existing work? If so, I really don't appreciate losing 4+ CPU hours per WU! Note this would be worse when you change the default to 6 hours.
3) Message boards : Number crunching : Problems with Rosetta version 5.98 (Message 54996)
Posted 8 Aug 2008 by ConflictingEmotions
Post:
I aborted wuid 167183863 because it hung at 100% but used CPU time far beyond what was expected. The watchdog caught the other attempt, so I am pointing it out as it may expose some error in the Rosetta beta.
4) Message boards : Number crunching : Problem of multithreading with rosetta 5.96 beta (Message 53875)
Posted 20 Jun 2008 by ConflictingEmotions
Post:
I currently run Rosetta 5.96 with BOINC 5.10.45 on a Quad 6600 with 2 GB of 1066 RAM under Win XP 32.

When I run 2 or more Rosetta threads simultaneously, only one is active with this version of the software. I never had this bug with other Rosetta versions.

For developer information.


PS: Sorry for my poor English.


I saw something similar on a Vista system, but I rebooted due to a software update and found all threads running. Some combination of the following might work: try restarting BOINC, starting the manager again, starting another BOINC process, or, in the worst-case scenario, restarting the system.
5) Message boards : Number crunching : Problems with version 5.96 (Message 53848)
Posted 19 Jun 2008 by ConflictingEmotions
Post:
Are people seeing this problem with other work units or is it a t405 specific problem for now?


It seems to be only or mostly t405 work units.


Yes, t405 work units have caused problems on two different systems. I have not had any other work units have the problem.
6) Message boards : Number crunching : Problems with version 5.96 (Message 53831)
Posted 19 Jun 2008 by ConflictingEmotions
Post:
We haven't been able to reproduce this behavior yet. Tomorrow I'll update rosetta with the latest boinc api and double check the source code to see if there were any changes between versions that could be causing this. We are seeing an odd error at the end of a local run on our linux machines that suggests an api issue but it may or may not be related.


Why are you doing a local run? It should always be the same as what we run.

If you add some useful error messages to the output, then probably some of us would be willing to run it for you. It takes about 2 hours to appear, but restarting seems to go back to the same place. The error I reported indicates that something is wrong with a memory call; Google indicates that it is freeing non-existent memory or providing an insufficient size.

I cannot offer more because these systems are behind a firewall.
7) Message boards : Number crunching : Problems with version 5.96 (Message 53772)
Posted 18 Jun 2008 by ConflictingEmotions
Post:
I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort.


Don't these failed units ever disappear?

I also just had to terminate these off my systems for the second time and I see we are not the only ones!

These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task.


They cycle twice, to see if the same error happens on a different system.
That's why so many people are complaining: double errors.
Some get lucky and don't get the error if they are number 2 in line.


Unfortunately these crash but the failure is not reported until the user aborts them or the deadlines pass. Consequently the bugs are not fixed and potentially many users are wasting resources.
8) Message boards : Number crunching : Problems with version 5.96 (Message 53764)
Posted 18 Jun 2008 by ConflictingEmotions
Post:
I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort.


Don't these failed units ever disappear?

I also just had to terminate these off my systems for the second time and I see we are not the only ones!

These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task.
9) Message boards : Number crunching : Problems with version 5.96 (Message 53723)
Posted 16 Jun 2008 by ConflictingEmotions
Post:
After multiple restarts of the boinc client, I have terminated these t405_CASP8_JUMPAB tasks. There clearly are some major bugs in rosetta and boinc involved here!

I have been getting a number of errors on 64-bit SMP Linux (Fedora 8):
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***


These appear to freeze boinc because these continue after restarting boinc.
The tasks have the prefix t405_CASP8_JUMPAB eg.
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_356_0

The stdout.txt files contain the following message many times:
res 13 and var 1 at position 1 is not a proper Nterm variant
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 7.83973
1 1 8.21224 0.746619
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 4.71364
2 1 4.14663 0.817297
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 9.48273
3 1 -10.3568 -1.28817
10) Message boards : Number crunching : Problems with version 5.96 (Message 53714)
Posted 16 Jun 2008 by ConflictingEmotions
Post:
I have been getting a number of errors on 64-bit SMP Linux (Fedora 8):
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
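Both messages above are glibc's heap-consistency diagnostics, which typically point to a write past the end of an allocated block or to freeing the same block twice. When a stderr file contains many of them, a tally by message and address can show whether a single block is involved. The sketch below is illustrative only: the regular expression is modelled on the exact line format quoted above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines of the form quoted above:
# *** glibc detected *** <message>: 0x<address> ***
GLIBC_LINE = re.compile(
    r"\*\*\* glibc detected \*\*\* (?P<kind>.+?): (?P<addr>0x[0-9a-fA-F]+) \*\*\*"
)


def tally_glibc_errors(log_text: str) -> Counter:
    """Count glibc heap diagnostics, keyed by (message, address)."""
    counts = Counter()
    for match in GLIBC_LINE.finditer(log_text):
        counts[(match["kind"], match["addr"])] += 1
    return counts
```

Applied to the eight lines quoted above, this counts six "free(): invalid next size (normal)" and two "double free or corruption (!prev)" entries, all at address 0x0959d100.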


These appear to freeze boinc because these continue after restarting boinc.
The tasks have the prefix t405_CASP8_JUMPAB eg.
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_356_0

The stdout.txt files contain the following message many times:
res 13 and var 1 at position 1 is not a proper Nterm variant
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 7.83973
1 1 8.21224 0.746619
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 4.71364
2 1 4.14663 0.817297
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 9.48273
3 1 -10.3568 -1.28817

©2024 University of Washington
https://www.bakerlab.org