Posts by Thomas Leibold

1) Message boards : Number crunching : Problems with version 5.96 (Message 52874)
Posted 5 May 2008 by Thomas Leibold
Post:
Workunit 1wit__BOINC_CONTROLABRELAX_VF_IGNORE_THE_REST-S25-9-S3-3--1wit_-vf__2589_100944_0
crashed with Segmentation Violation and has been stuck in this state for the last week. I'm going to abort it now.
2) Message boards : Number crunching : minirosetta v1.15 bug thread (Message 52753)
Posted 27 Apr 2008 by Thomas Leibold
Post:
What is wrong with the validator?
Workunits 144724221, 144734937 and 144747594 all apparently completed normally (around the specified runtime and without any errors), but got marked invalid and received no credit.
Two of those workunits were completed successfully by other users (though with shorter runtimes).
3) Message boards : Number crunching : Problems with version 5.96 (Message 52752)
Posted 27 Apr 2008 by Thomas Leibold
Post:
Workunit 144799448 crashed:

<core_client_version>5.10.21</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 3471563
SIGSEGV: segmentation violation
Stack trace (25 frames):
[0x8e1b49b]
[0x8e15d8c]
[0xffffe500]
[0x85493e7]
[0x8c836d4]
[0x804c8c0]
[0x86579a0]
[0x87ac2be]
[0x87ac646]
[0x87ae444]
[0x87bcf50]
[0x87c5a86]
[0x865c88d]
[0x87c6c19]
[0x804e502]
[0x8d6dfef]
[0x89efd7e]
[0x866a2c7]
[0x88ff31d]
[0x89d141d]
[0x8628b0e]
[0x8768a2a]
[0x8768b4a]
[0x8e80034]
[0x8048111]

Exiting...

</stderr_txt>
]]>
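
A note on reading the "process exited with code" line: in the logs quoted in these posts the second number is the exit code in hex, and the third matches the exit code minus 256 (193 - 256 = -63, 1 - 256 = -255). Below is a minimal Python sketch of that arithmetic. The helper name describe_exit_code is hypothetical, and the minus-256 relationship is inferred only from the logs shown here, not taken from the Boinc source.

```python
def describe_exit_code(code: int) -> str:
    """Format an exit code the way the quoted logs show it.

    Assumption: the third number is simply code - 256, which matches
    the examples in these posts (193 -> -63, 1 -> -255).
    """
    return f"process exited with code {code} ({hex(code)}, {code - 256})"


# The two exit codes that appear in the quoted logs.
print(describe_exit_code(193))
print(describe_exit_code(1))
```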
4) Message boards : Number crunching : Problems with version 5.96 (Message 52751)
Posted 27 Apr 2008 by Thomas Leibold
Post:
Workunit 144639617 failed with:
<core_client_version>5.10.21</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 1637471
ERROR:: Exit from: minimize.cc line: 2088

</stderr_txt>
]]>
5) Message boards : Number crunching : Problems with version 5.96 (Message 52750)
Posted 27 Apr 2008 by Thomas Leibold
Post:
Workunit 144383830 and
Workunit 144815278 and
Workunit 144380616 failed with
<core_client_version>5.10.21</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 1633461
ERROR:: Exit from: pack.cc line: 5278

</stderr_txt>
]]>
6) Message boards : Number crunching : Problems with Rosetta version 5.82 (Message 52749)
Posted 27 Apr 2008 by Thomas Leibold
Post:
Workunit 144340058 and Workunit 144341601
failed with:
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: -60370
bad seed
ERROR:: Exit from: dock_structure.cc line: 429

</stderr_txt>
]]>

In both cases the workunit also failed for the other user it was assigned to.
7) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 51700)
Posted 27 Feb 2008 by Thomas Leibold
Post:
Just checked on one of the servers whose performance was below par and found that it was still "running" on a 1zpy workunit. The workunit deadline expired over 1 month ago, confirming that short of manually aborting misbehaving workunits they will never stop on their own.

OS: SuSE Linux 10.1
Boinc: 5.10.21
Rosetta: 5.93
Workunit: ? no idea which number, long gone from the server!

stderr.txt:
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -95.2845 for 900 seconds

This is (as usual!!!) followed by a SIGSEGV with the watchdog crashing and the client failing to terminate properly (and since the client process remains alive Boinc never finds out that there is anything wrong).
I'm well aware that this is not specific to the 5.93 client since that issue has been around for a long time, just reporting that it is still an issue.
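
For anyone who wants to check their own stderr.txt files for this pattern (watchdog message followed by a SIGSEGV), here is a rough Python sketch. The function name watchdog_then_segv and the regular expression are my own assumptions based on the log lines quoted above; the real log format may vary between client versions.

```python
import re

# Pattern assumed from the "Stuck at score ... for ... seconds" line above.
WATCHDOG_RE = re.compile(r"Stuck at score (-?\d+(?:\.\d+)?) for (\d+) seconds")


def watchdog_then_segv(stderr_text: str):
    """Return (score, seconds) if a watchdog stop is followed by SIGSEGV, else None."""
    match = WATCHDOG_RE.search(stderr_text)
    if match is None:
        return None
    # Only report the case where the crash comes after the watchdog message.
    if "SIGSEGV" not in stderr_text[match.end():]:
        return None
    return float(match.group(1)), int(match.group(2))


sample = """Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -95.2845 for 900 seconds
SIGSEGV: segmentation violation
"""
print(watchdog_then_segv(sample))
```

Note that this only inspects the log text. It cannot tell whether the client process is still alive, which is the part Boinc itself never finds out about.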
8) Message boards : Number crunching : Problems with minirosetta version 1.+ (Message 51356)
Posted 12 Feb 2008 by Thomas Leibold
Post:
I hope this helps. I'm trying to respond to everyone's comments and provide useful info.


It does help and it is very much appreciated.

Thank you!
9) Message boards : Number crunching : Problems with minirosetta version 1.+ (Message 51296)
Posted 10 Feb 2008 by Thomas Leibold
Post:
Well, if it's like the last few Rosetta "releases" that have been inflicted on the users, the testing was measured in hours, and only a handful of WUs. Not exactly exhaustive testing.


While earlier versions, especially 1.03 - 1.06, were indeed tested on Ralph, Mini Rosetta 1.07 set a new record for "test period": 33 minutes and 14 seconds between its release on Ralph and its release on Rosetta. That was not enough time to get even a single successful result!

I never received a single 1.07 workunit on Ralph, but have gotten several on Rosetta and they all seem to be successful.
10) Message boards : Number crunching : Problems with Rosetta version 5.82 (Message 51295)
Posted 10 Feb 2008 by Thomas Leibold
Post:
Yesterday I had one of my systems (quad-core opteron) run all three types of applications at the same time: Rosetta 5.82, Rosetta Beta 5.93 and Mini Rosetta 1.07. Looks like they get along alright, because they all completed successfully.
11) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 51046)
Posted 28 Jan 2008 by Thomas Leibold
Post:


The WUs starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



I'm seeing the same problems as Conan on a number of my servers. The trouble workunits are 2h4o and 1zpy, and all of them have to be aborted manually. Restarting Boinc will just reset the amount of time already spent on them and start them again.

The 2h4o units in particular tend to stay at 100% Completed but state "Running" with no increase in the amount of cpu time spent. Looking at the stdout.txt/stderr.txt files shows that there was an attempt by the watchdog to shut down the client (and as far as I know that has never worked properly for Rosetta on Linux).
12) Message boards : Number crunching : Problems with Rosetta version 5.85 (or 5.86 for linux) (Message 50412)
Posted 6 Jan 2008 by Thomas Leibold
Post:
I did encounter one anomaly when a task ( the only one running ) stalled and reported waiting for memory. I'm not sure where it thought it would get more.


It is not about the amount of memory installed in your computer, it is about the amount of memory available to run the client. Depending on what other applications are currently running the amount of memory available to Boinc and project clients can indeed change.

Unfortunately the Rosetta client (at least on Linux, since that is all I'm using) still has problems with Boinc task switching, which makes it necessary to keep the client in memory even when it is not active (this is an option in your Boinc preferences). Keeping the inactive (waiting to run) tasks in memory, however, reduces the amount of memory available to other Boinc tasks. This usually only affects users who participate in multiple Boinc projects, since the Boinc client will not normally switch between tasks for the same project (the exception is when a non-active workunit nears the project deadline).
13) Message boards : Number crunching : woke up to a stalled Boince on a rosetta wu (Message 50386)
Posted 5 Jan 2008 by Thomas Leibold
Post:
Anyone else with stalled computers recently?


Lots of them, but this particular workunit doesn't match any of the cases I'm familiar with. It is the infamous 5.90 Rosetta client, but you seem to run Windows XP and the hangs were specific to the Linux version.
14) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 50385)
Posted 5 Jan 2008 by Thomas Leibold
Post:

if you have an existing code which has proven to be running good,

The Rosetta Linux client has known issues in the interaction between the main computation thread, the watchdog thread and the Boinc client. These have been there for a very long time and they have still not been resolved (I have no way of knowing if anybody is even attempting to resolve them). However this clearly means that the premise of starting with known good code is false.

and you know what your changes will cause

I'm an experienced software developer and I can assure you that no matter how well you think you understand the code you are changing and all the consequences of making that change, there is always the possibility of overlooking something. It is especially challenging when you change code that needs to run not only in one particular well controlled environment of your own, but at many different customer sites over which you have no control whatsoever. Following good practices while developing software is important, but no substitute for testing.

I don't think there is much more left to test than whether it runs, so that could be done within, let's say, 8 hours or so. As long as you get like 10 to 50 results you know if it works or not; if the majority of those WUs error out you know you're wrong.


It seems that you don't understand the nature of the problem with the 5.90 Linux client or some of the 1zpy workunits: they never finish.

In the 5.90 Linux case they will run forever while remaining at near 0 cpu time accumulated. If you restart Boinc, some but not all of the workunits processed with the 5.90 client will show the correct amount of cpu time and finish, while others will restart and again continue forever.

In the 1zpy workunit case (even with the 5.91 client) they will get to the point of showing 100% completed, but remain in state "running". When restarting Boinc the amount of cpu time accumulated resets to a low number and the workunit starts again.

In both of these recent cases:
- preferred runtime is ignored
- the 4 times runtime safeguard is not working
- the workunits even continue beyond the project deadline for returning the result
- even restarting Boinc does not resolve the problem and the workunit continues to be stuck
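
To make the three limits in the list above concrete, here is a small Python sketch of the checks one could run by hand against a suspect task. Everything in it is an assumption for illustration: the Task record, its field names and the violated_safeguards helper are hypothetical and not part of Boinc or Rosetta. It uses wall-clock time because, as described above, the accumulated cpu time of a stuck task cannot be trusted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Task:
    """Hypothetical record of a running workunit (not a Boinc structure)."""
    name: str
    elapsed_seconds: float     # wall-clock time since the task started
    preferred_runtime: float   # cpu_run_time_pref, in seconds
    deadline: datetime         # project deadline for returning the result


def violated_safeguards(task: Task, now: datetime) -> List[str]:
    """List which of the limits named above the task has already passed."""
    problems = []
    if task.elapsed_seconds > task.preferred_runtime:
        problems.append("preferred runtime exceeded")
    if task.elapsed_seconds > 4 * task.preferred_runtime:
        problems.append("4x runtime safeguard exceeded")
    if now > task.deadline:
        problems.append("past project deadline")
    return problems


# Example: a task with an 8 hour preference that has been "running" for 40 hours.
stuck = Task("1zpy", 40 * 3600, 28800, datetime(2008, 1, 1))
print(violated_safeguards(stuck, datetime(2008, 2, 1)))
```

A task that trips all three checks and is still in state "running" is a candidate for a manual abort.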

Any unattended Linux server running R@H may very well continue to run these stuck 5.90/1zpy workunits for another year or longer. It certainly doesn't help that nobody from the Rosetta Team of Developers has made any attempt to communicate the nature of the problem to the user community (especially the requirement that those workunits have to be manually cleaned up).

An 8 hour test cannot detect that some workunits get stuck and never complete! The reason I believe a 2 week test period is most sensible is that it leaves enough time to get problem reports from folks who perhaps check their servers only once a week.
15) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 50345)
Posted 5 Jan 2008 by Thomas Leibold
Post:
Please post problems and/or bugs with rosetta 5.93.


My problem with 5.93 is that once again an insufficiently tested client has been released for the Rosetta project. Was the trouble that the 5.90 client caused for Linux users (and the 1zpy workunits for everyone) not severe enough to make project developers think about what they are doing (i.e. learn from their mistakes)?
Do you really have such an excess of contributors that you can afford to irritate a significant portion of them away to other projects?

There were fewer than 20 hours between the 5.93 announcement on Ralph and the same one on Rosetta. During that time my test machine received 0 workunits from Ralph. In fact it didn't get any 5.92 work either, and as of this post it still has not received any work from Ralph (it did get 5.93 workunits from Rosetta already).

If Rhiju hadn't said "We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days." I would have left Rosetta right then (I was already testing the Folding@Home SMP Linux client). Not that I think even 2 full days are really sufficient; it should probably be two weeks. To say that I'm disappointed about how quickly this turned into an empty promise is an understatement.
16) Message boards : Number crunching : Problems with version 5.90/5.91 (Message 50231)
Posted 1 Jan 2008 by Thomas Leibold
Post:
I am running rosetta_beta_5.91_i686-pc-linux-gnu. The problem I am having is showing completion of 100% and not moving on to the next task.


Your problem report would be more useful if you mentioned what kind of workunits you have problems with. Looking through your computers and their unfinished workunits, the problem workunits appear to be of the 1zpy__BOINC_TWIST_RINGS...2477... variety that a lot of us had problems with.

I would abort them if they don't end on their own.
17) Message boards : Number crunching : Problems with version 5.90/5.91 (Message 50024)
Posted 25 Dec 2007 by Thomas Leibold
Post:
My most recent 5.90 errored out WU: 128154725

This workunit was completed once already without error by someone else: 116521023


Don't be misled by the fact that someone else was successful into thinking that the problem is yours!

You were using a faster cpu and a 24 hour preferred runtime. The workunit errored after over 20 hours of computations.

The other user had a slower cpu and a 3 hour preferred runtime. This workunit never progressed far enough to reach the point of failure!
18) Message boards : Number crunching : Problems with version 5.90/5.91 (Message 49984)
Posted 23 Dec 2007 by Thomas Leibold
Post:
There seems to be a problem with the 1zpy__BOINC_TWIST_RINGS_SYMM_FOLD_AND_DOCK-1zpy_-native__2470 jobs. So far, the watchdog has killed 6 out of 7 jobs.


I'm getting watchdog errors for 1zpy_... workunits using 5.91 on Linux as well. As can be seen from the stderr.txt below, the old issues with the Rosetta watchdog segfaulting on Linux are still present in 5.91. This workunit was processed on a dual AMD Quad-Core Opteron 2346HE system running OpenSuSE 10.3 in 64-bit mode. The Boinc client is 5.10.21 (also the 64-bit version).

<core_client_version>5.10.21</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 3497252
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -10.2275 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x8d9f877]
[0x8d9a66c]
[0xffffe500]
[0x8a8a0eb]
[0x8d089ac]
[0x8c0f2fa]
[0x8c1166f]
[0x804c7c2]
[0x8a824f1]
[0x8a83ebb]
[0x8935b66]
[0x89378a1]
[0x893b1af]
[0x898a502]
[0x85e96ae]
[0x87289aa]
[0x8728aca]
[0x8e03bc4]
[0x8048111]

Exiting...

</stderr_txt>
]]>
19) Message boards : Number crunching : Problems with version 5.90/5.91 (Message 49967)
Posted 23 Dec 2007 by Thomas Leibold
Post:
I have posted the steps I'm taking to recover from the 5.90 problem on my Linux systems in the Ralph forum. Perhaps this is useful to other Linux users.

20) Message boards : Number crunching : Problems with version 5.90/5.91 (Message 49927)
Posted 22 Dec 2007 by Thomas Leibold
Post:
Update: we've tracked down the problem -- it's an issue with the BOINC-provided API (I guess we happened to be unlucky in being the first to update our Linux app after the bug got introduced). Later today, we'll update the Ralph and Rosetta@home Linux apps and they should work.


Since you tracked down the problem, can you please tell us how it will affect all of us running Rosetta on Linux?

We already know that those 5.90 tasks will not finish after the specified runtime. Without manual intervention, will these tasks ever end on their own, or do I have to go to each and every server and manually abort all the 5.90 tasks?

I have over 100 cpus running Rosetta on Linux and having to clean up this mess is not something I'm looking forward to. It especially upsets me that the lack of testing on Ralph caused the problem to appear in Rosetta. This was clearly avoidable!





©2024 University of Washington
https://www.bakerlab.org