Posts by glaesum

1) Message boards : Number crunching : Problems with Rosetta version 5.98 (Message 54109)
Posted 1 Jul 2008 by glaesum
Post:
starting to get the occasional error on 5.98:

here is a t443 {wuid=158705243} that plugged away for nearly 15hrs until it packed in with a validate error. credit was claimed and granted but never actually got issued...

there's no diagnostic on my task report, but the wingman's task stopped with a client error after 20 mins and does have lots of diagnostics (too many restarts with no progress).
2) Message boards : Number crunching : Computer array crunching? (Message 53361)
Posted 27 May 2008 by glaesum
Post:
interesting reading from BitSpit et al., though I'm not really techie enough to understand it all.

there has been an open source community developing a firmware mash-up to turn the hundred-dollar Linksys NSLU2 ["Slug"] into an unrestricted Linux server. it has an ARM processor inside but not much memory. I wonder if it has any use {for the very hardcore!} in the crunching arrays being discussed above, perhaps only as a low-cost/low-power server even if not as a main cruncher...

http://www.nslu2-linux.org/wiki/Main/HomePage
http://en.wikipedia.org/wiki/NSLU2

(people have managed to use it as an email/web/music server, a home PABX phone switch using Asterisk, etc.)
3) Message boards : Number crunching : minirosetta v1.25 bug thread (Message 53357)
Posted 26 May 2008 by glaesum
Post:
here's one {using an Intel P4 with XP}. also, wouldn't it help if this topic were pinned to the top of the message board??

resultid=166522181

Sent 26 May 2008 2:19:45 UTC
Received 26 May 2008 13:13:16 UTC
Outcome Client error
Client state Compute error
CPU time 12114.94
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
<message>
<file_xfer_error>
<file_name>d110a_BOINC_CASP8_ABRELAX_t405_IGNORE_THE_REST-S25-7-S3-6--d110a-_3558_71_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
Invalid

=================

other tasks are getting a non-fatal warning, an example:

Task ID 166593193
Name t0391_BOINC_LOOP_IGNORE_THE_REST-S25-10-S3-5--1a19A-_3571_5870_0
Workunit 152083437
Sent 26 May 2008 9:03:36 UTC
Received 26 May 2008 17:26:40 UTC
Server state Over
Outcome Success
Client state Done
CPU time 14333.23
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
WARNING: Override of option -out:nstruct sets a different value
9 10000
3 9
1 3
# cpu_run_time_pref: 14400
======================================================
called boinc_finish

</stderr_txt>
]]>
Validate state Valid
4) Message boards : Number crunching : Rosetta@home hits 4 BILLION credits!!! (Message 53182)
Posted 19 May 2008 by glaesum
Post:
...another nice round number crossed. Congratulations to the team.

interestingly, it's also passing 75 TeraFLOPS of running crunching power, a moderately round number, out of a total BOINC output that topped 1,000 TeraFLOPS a couple of months ago.
5) Message boards : Number crunching : minirosetta v1.19 bug thread (Message 53038)
Posted 13 May 2008 by glaesum
Post:
error -161 (whatever that is)

finally a WU failed; that's on top of the usual non-fatal Error 120:
resultid=162869266

<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
# cpu_run_time_pref: 14400
:
BOINC :: Watchdog shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>rb_05_12_11631_20348_T0397_IGNORE_THE_REST_10_16_3247_49_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
]]>
6) Message boards : Number crunching : minirosetta v1.19 bug thread (Message 53007)
Posted 12 May 2008 by glaesum
Post:
I got a similar error message to AMD_is_logical's, except that the task succeeded and validated, even though the error is reported with every work unit done on the Win98 OS. note that the 'psipred' line occurred only three times, the same as the number of decoys, hmmmm??
http://boinc.bakerlab.org/rosetta/result.php?resultid=162095619

Received 11 May 2008 2:20:03 UTC
<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
WARNING: Override of option -out:nstruct sets a different value
can not open psipred_ss2 file tt
# cpu_run_time_pref: 14400
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
======================================================
DONE :: 1 starting structures 10977 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

(this work unit had been through BOINC v6.2 with another user, where it failed to validate)
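The psipred observation above (three warnings, three decoys) can be checked mechanically on any pasted report. A minimal sketch, assuming the stderr text is available as a plain string: `summarize_stderr` is a hypothetical helper, and its regexes are written only against the log lines quoted in this thread, not against any official BOINC or Rosetta format.

```python
import re

def summarize_stderr(text):
    """Summarize a pasted BOINC stderr report (hypothetical helper).

    The patterns only match the line formats quoted in this thread."""
    psipred = len(re.findall(r"^can not open psipred_ss2 file", text, re.MULTILINE))
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    cpu = re.search(r"DONE :: \d+ starting structures ([\d.]+) cpu seconds", text)
    # <error_code> values such as -161 from <file_xfer_error> blocks
    errors = [int(c) for c in re.findall(r"<error_code>(-?\d+)</error_code>", text)]
    return {
        "psipred_warnings": psipred,
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
        "cpu_seconds": float(cpu.group(1)) if cpu else None,
        "error_codes": errors,
    }

# Excerpt of the resultid=162095619 report quoted above
sample = """\
can not open psipred_ss2 file tt
# cpu_run_time_pref: 14400
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
======================================================
DONE :: 1 starting structures 10977 cpu seconds
This process generated 3 decoys from 3 attempts
"""
print(summarize_stderr(sample))
# {'psipred_warnings': 3, 'decoys': 3, 'attempts': 3, 'cpu_seconds': 10977.0, 'error_codes': []}
```

A matching count is consistent with one warning per decoy, but on its own it doesn't prove that is how the app emits it.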

I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out.

http://boinc.bakerlab.org/rosetta/result.php?resultid=161862548
http://boinc.bakerlab.org/rosetta/result.php?resultid=161862513

In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:

can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...

lines to stderr. The WUs ended up being marked "invalid".

These WUs were on separate machines, both running Linux.
7) Message boards : Number crunching : minirosetta v1.19 bug thread (Message 52910)
Posted 8 May 2008 by glaesum
Post:
things must be going pretty well as the thread is so quiet...

good news too with the Win98 OS: the 1.19 app is running, completing and validating, although an error message is still being thrown up. no idea if this matters or not.

on all three WUs completed so far, this is the message:

Task ID 161439715
Name score13_hb_envtest62_A_1ctf__3171_14411_0
Workunit 147493846
Received 8 May 2008 11:10:33 UTC
Outcome Success

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13875.8 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

the work unit IDs are:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147390671
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147405464
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147493846
8) Message boards : Number crunching : minirosetta v1.15 bug thread (Message 52863)
Posted 4 May 2008 by glaesum
Post:
My Celeron with WinME seems happy with the mini's.

Dave


that's interesting as it must mean that win98 is only one tweak away from working...

meanwhile, as long as I'm trashing under 50% of the tasks sent, I'll keep going on that PC (no problems on the XP machine at all).
9) Message boards : Number crunching : minirosetta v1.15 bug thread (Message 52823)
Posted 1 May 2008 by glaesum
Post:
so... it doesn't look terribly hopeful that minirosetta will be fully ready for the launch of CASP8 next Monday, does it!!!
10) Message boards : Number crunching : Problems with minirosetta version 1.+ (Message 52822)
Posted 1 May 2008 by glaesum
Post:
hi Lockleys,

have a look at the adjacent minirosetta v1.15 thread, where a couple of us have reported failures of the mini app on the old Win98 platform, although with somewhat different symptoms. /pg

Over the last 3 weeks or so, I have noticed that minirosetta WUs are taking 12 to 13 hours to complete. Previously, it had been running at around 3 hours. But I've been keeping them going.

Then, this week, they have all started to fail. To try to solve the problem, I have uninstalled BOINC, deleted BOINC folders from the system and completely reinstalled/reattached. Sadly, I am still getting the same pattern of errors.

Each download from Rosetta gives a similar set of error messages, e.g.:
01/05/08 08:40:45||Starting BOINC client version 5.10.45 for windows_intelx86
01/05/08 08:40:45||log flags: task, file_xfer, sched_ops
:
: >>snip<<
:
01/05/08 08:45:21|rosetta@home|Restarting task 1lis__BOINC_ABRELAX_IGNORE_THE_REST-S25-13-S3-6--1lis_-_3114_13_0 using minirosetta version 115

This keeps going forever, downloading a new WU and cycling through this error sequence until it fails (or I abort), then starting all over again.

I'm running this application on Windows 98SE (dedicated to Rosetta).

11) Message boards : Number crunching : minirosetta v1.15 bug thread (Message 52774)
Posted 28 Apr 2008 by glaesum
Post:
I'm getting the same type of failure as Peter Leman (PM sent) on Win98; the tasks don't even start:

the "stderr out" result report reads like this -

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
too many normally harmless exit(s)
</message>
]]>

the two tasks failed so far are:
wu145397531
wu145454437

(mini v1.07 worked OK whilst mini v1.09 did not; see my v1.09 message for a slightly different stderr out report)

if anyone succeeds with win98 on these tasks please report happiness!!
12) Message boards : Number crunching : Problems with Minirosetta version 1.09 (Message 52055)
Posted 21 Mar 2008 by glaesum
Post:
I also have errors:
CreateProcess() failed - One of the library files needed to run this application cannot be found. (0x485)

This is on Win98SE box, also. It is running current BOINC.

looks like we are both getting the same response with a Win98 box. there are still 24,000 Win98 PCs registered on BOINCstats, though I don't know how many are still active.

since minirosetta 1.07 worked they must have taken something out in 1.09 that zapped it.
13) Message boards : Number crunching : Problems with Minirosetta version 1.09 (Message 51967)
Posted 15 Mar 2008 by glaesum
Post:
hi, I run Rosetta on a couple of hosts, including a very old Win98 PC; although it's not guaranteed to work, it has been quite happy, if slow, with most applications, with 5.82 and the recent 5.93 and 5.95 running fine. Minirosetta v1.07 didn't throw up any problems on the few WUs that came my way. Sadly, I've had 2 out of 2 "compute error" failures with the new v1.09, reporting a missing library file:

on resultid=148185328

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
CreateProcess() failed - One of the library files needed to run this application cannot be found. (0x485)
</message>
]]>

the other wu was exactly the same.

I don't expect Win98 to be specifically supported, so if this doesn't sort itself out as minirosetta develops I guess I'll have to retire the machine from this project and find a similar folding application, perhaps on WCG. I can only sensibly do small work units on it.
as long as there are plenty of 5.95s around I'll be fine, but if I start trashing a significant percentage of WUs then I'd better suspend things.

meanwhile all looks ok on the main XP pc - graphics button apart of course.
14) Message boards : Number crunching : Target CPU time setting (Message 49430)
Posted 5 Dec 2007 by glaesum
Post:
very evocative metaphor indeed! :-)

with winter approaching, I wish my knees were up to skiing in those mountains again... {wistful sigh}
15) Message boards : Number crunching : Target CPU time setting (Message 49421)
Posted 5 Dec 2007 by glaesum
Post:
thanks to both of you, especially for all that work in setting up that spreadsheet. actually I'd copied the first ten wus into a table too but I only calculated the averages for an overview. gosh, did you hand cast each wu name into the sheet - as my global c&p from the web page only picked up the wu numbers? beyond the call of duty I'm sure. :-)

there were two things I was pondering:

1] watching the graphics, the folding seems to go in phases; so does the first phase get repeated with every decoy attempt? if not, then a longer time setting will be better

2] imagine a WU that takes 5800s to find one decoy and then ends; this releases the remaining 5000s of the 3hrs (10800s) to the next WU, which is random but, on average, more difficult.
shorter settings mean that the proportion of total crunching time released and carried forward increases (i.e. greater lumpiness, which is otherwise smoothed out by either more powerful machines or longer task times)
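Point 2 can be illustrated with a toy model. This is a minimal sketch under an assumed stopping rule: the task starts another decoy only if, at the average pace so far, it would still finish within the target. Both the rule and the function `run_task` are hypothetical, not Rosetta's actual logic. With a 3-hour (10800s) target, one 5800s decoy leaves 5000s unused, matching the example above.

```python
def run_task(decoy_times, target_s):
    """Simulate one task under a hypothetical stopping rule.

    decoy_times: CPU seconds each successive decoy would take.
    Returns (decoys_done, cpu_used, cpu_released)."""
    used = 0.0
    done = 0
    for t in decoy_times:
        # Assumed rule: skip the next decoy if, at the average pace so far,
        # it would overrun the target. The first decoy always runs.
        if done and used + used / done > target_s:
            break
        used += t
        done += 1
    return done, used, max(target_s - used, 0.0)

print(run_task([5800] * 5, 10800))  # (1, 5800.0, 5000.0)

# Fraction of the target released, for the same 2500s decoys under two targets
for target in (3600, 10800):
    done, used, released = run_task([2500] * 10, target)
    print(target, done, round(released / target, 3))
# 3600 1 0.306
# 10800 4 0.074
```

Under this assumed rule the 1-hour target releases about 31% of its time versus about 7% for the 3-hour target, which is the extra lumpiness described in point 2.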

/Pete
16) Message boards : Number crunching : Target CPU time setting (Message 49406)
Posted 5 Dec 2007 by glaesum
Post:
I'm just getting my head around 'target CPU time' and how it works.

a few days ago I set up a new host (id #678950) using an ancient Win98 600MHz Athlon, and it's chugging along fine, even if it's more like 90% winter heater and only 10% number cruncher. I wondered why the work units were behaving differently from those on my 3.4GHz machine until I realised that the run time was similar and hence less work was done in that time.

it's set to the default 3-hour target CPU time, and over the first ten WUs it has averaged 2.5 hrs per WU, earning 5.35 credits and finding 4.3 decoys per WU on average.
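Turning those per-WU averages into rates makes a comparison with a faster host easier. A minimal sketch using only the three averages quoted above; `per_hour` is a hypothetical helper.

```python
def per_hour(avg_hours, avg_credits, avg_decoys):
    """Convert per-WU averages into rates (hypothetical helper)."""
    return {
        "credits_per_hour": avg_credits / avg_hours,
        "decoys_per_hour": avg_decoys / avg_hours,
        "credits_per_decoy": avg_credits / avg_decoys,
    }

rates = per_hour(avg_hours=2.5, avg_credits=5.35, avg_decoys=4.3)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
# credits_per_hour: 2.14
# decoys_per_hour: 1.72
# credits_per_decoy: 1.24
```

Since the run time per WU is set by the target rather than by the machine, credits per hour is a fairer yardstick against the 3.4GHz machine than credits per WU.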

in a significant proportion of tasks, because of the limited power, it's only going to find one or two decoys in the three hours, so tuning the target time might optimise the work done. I'm trying to work out whether it is better to shorten the time, ending sooner the WUs that aren't going to be very efficient at finding decoys, or else to lengthen it and let the easier tasks have a good run and find more of them. there's probably not much in it either way; thanks in advance for any thoughts on the matter. /pg
17) Message boards : Number crunching : No work sent? (Message 49312)
Posted 2 Dec 2007 by glaesum
Post:
what's a pagefile? :-)

I've run the benchmarks again in case the new host isn't properly calibrated.

otherwise, it has fetched a couple of smaller WUs unaided without the memory warning, and one more with the warning. / pete
18) Message boards : Number crunching : No work sent? (Message 49284)
Posted 1 Dec 2007 by glaesum
Post:
Regarding the credits... as you would expect, a slower machine gets less credit per hour of CPU time than a faster machine. Your processing probably completed just fine, with valid results, but you didn't complete as many models in the period of time that a faster machine would, so you got less credit. Same credit per model as a faster box, but you completed fewer models in that period of time.

thanks: yes, of course; I typically get 15-20cr/wu (apart from the occasional bigger one) but these first two on the old Athlon only got ~4.8cr each. but it's only a statistical sample of two so we'll have to be patient to see how things pan out.
what would help is some way of persuading BOINC Manager not to defer communications for 24 hours, but for a more reasonable hour or half-hour.

I clicked update accidentally on the project and to my surprise a third wu downloaded ok; it's now having to share 50:50 with malariacontrol so it'll be a while before it reports. (the latter project seems to have got going ok on win98 too, though one of the apps. doesn't show progress and cpu time.) /pg
19) Message boards : Number crunching : No work sent? (Message 49276)
Posted 1 Dec 2007 by glaesum
Post:
here is a news update on progress; thanks for the light-hearted exchange from everyone.
first, I've unhidden my computers on Rosetta; the host ID of the old PC is #678950 for those who like to poke around.

the first WU finished overnight, but the comms keep backing off 24 hours, so I had to nudge it this morning to get another task. this has finished now, so I've successfully completed two 5.85 WUs; the time was promising but they didn't earn many credits. is that a sign that they didn't really get very far?

now the server won't give me any more work, citing the same 800MB memory demand, and each time it backs off 24 hours unless I manually retry communications. it's failed 5 times, so I'll wait a while and see how other peeps are getting on.
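The 800MB figure can be checked against the server message quoted in the 1 Dec post (Message 49255): the bytes the host has, plus the extra bytes the workunit says it requires. A minimal sketch of that arithmetic; the 512 MiB comparison assumes "512MB" means 512 × 1024² bytes.

```python
have = 401_674_240          # "Your computer has only 401674240 bytes of memory"
more_needed = 398_325_760   # "workunit requires 398325760 more bytes"

required = have + more_needed
print(required)                  # 800000000, i.e. 800 MB (decimal)
print(round(have / 2**20, 1))    # 383.1 (host memory in MiB)
print(512 * 2**20 >= required)   # False: a 512 MiB box still falls short
```

So by this arithmetic a 512MB XP machine would also fall short of this particular workunit's stated requirement, which fits the complaint below.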

it's a bit rich if even 512MB machines with xp won't run - they are hardly toy boxes!

I think I'll try malaria next (sans the optimizer application) and then WCG (minus the africaclimate project), which has clear guidance on the minimum system requirements.
20) Message boards : Number crunching : No work sent? (Message 49255)
Posted 1 Dec 2007 by glaesum
Post:
well, after forcing 'retry comms' a couple of times, it also finally downloaded a WU, as Ed found, even if it was a 'second hand' one:

cut&paste between pcs is a fiddle - lol!
|01-Dec-2007 00:55:50 [rosetta@home] Fetching scheduler list
|01-Dec-2007 00:55:56 [rosetta@home] Master file download succeeded
|01-Dec-2007 00:56:01 [rosetta@home] Sending scheduler request: To fetch work. Requesting 8640 seconds of work, reporting 0 completed tasks
|01-Dec-2007 00:56:06 [rosetta@home] Scheduler request succeeded: got 1 new tasks
|01-Dec-2007 00:56:06 [rosetta@home] Message from server: Your computer has only 401674240 bytes of memory; workunit requires 398325760 more bytes
|01-Dec-2007 00:56:09 [rosetta@home] Started download of rosetta_beta_5.85_windows_intelx86.exe
|01-Dec-2007 00:56:09 [rosetta@home] Started download of 1ffk.vall_torsions.gz
| etc.etc.etc. and 'ton of other files...'

and it has happily started crunching despite the warning; that's the machine's very first WU, wey hey! (even if the box is more like 75% winter heater and 25% number cruncher)

one slight mistake, after a minute I pushed the 'Show Graphics' button - big mistake :(
total freeze up needing the reset button so it restarted from scratch.
won't try that one again!

the time to completion prediction won't mean much yet and we'll see how the cpu handles things as it gets near the end of the WU before fully celebrating.


good luck to everyone else //pg





©2024 University of Washington
https://www.bakerlab.org