Posts by Jean-David Beyer

1) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 91004)
Posted 7 Aug 2019 by Jean-David Beyer
Post:
The Rosetta developers have been repeatedly skeptical about my performance improvement estimates. That is not a surprise. Developers are sensitive about their work and frequently think they know more than they do. I had to explain to many compiler developers why their "really neat improvement" was not going to make the impact they forecast. The application developers are farther away from performance problems than the compiler developers.


When I was working on optimizers, another part of my department was working on hardware design for a new 32-bit processor. The hardware designers were even farther away from performance problems than the compiler developers. The hardware guys found out that in a benchmark program that the marketing department thought was important, there was often a multiplication by two, so they were going to design in a special floating point multiply by two instruction. I pointed out that in a normal workload, multiplying floating point numbers by two was seldom done and furthermore, due to the construction of the benchmark program, I could guarantee that the compiler-optimizer would never generate the floating point multiply by two instruction. (The value 2 was in an external variable that could not be seen by the compiler-optimizer). I suggested that a much better use of the chip area would be to put in a larger instruction cache instead, which would be much more useful. But they would not do that; they designed their fancy new instruction, and we never generated it.
2) Message boards : Number crunching : Rosetta v4.08 x86_64-pc-linux-gnu or Rosetta v4.07 i686-pc-linux-gnu (Message 90969)
Posted 4 Aug 2019 by Jean-David Beyer
Post:
Their efforts to "hyper optimize" the binary by pulling functions "inline" are based on running 1 copy on a large, idle machine. The result is "sub optimized" performance when running 2 or more WUs that strain a critical resource on the machine, like the instruction cache. I am running 36 copies on a machine and the negative impact of inlining functions is pretty obvious.


I guess it really depends on what compiler and optimizer are used. Long ago, a friend and I worked for Bell Labs, doing a post-compiler assembler-level optimizer for their C compiler. One of the optimizations we did was to expand functions in-line, unless they were "too big" or obviously recursive. By itself, this could save the call and return overhead, which really mattered only in short, fast functions. But sometimes, this also gave the optimizer a better view of what was going on. In one benchmark, a function was called 10,000 times, but the loop was outside that function. Once the function was expanded inline, the optimizer noticed that everything inside the loop had the same value each time around, so all those instructions were moved outside the loop, greatly speeding up the execution time. Then the live-dead analysis eliminated the single computation because the values were never used. Even the loop overhead and the function call and return overhead were removed.
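A rough illustration of that sequence, written out by hand. Python does not perform these optimizations itself, so each stage below is simply what the code would look like after the corresponding step. The function names and the counters are made up for illustration and are not taken from the original benchmark.

```python
N = 10_000  # the benchmark called the function 10,000 times


def new_stats():
    return {"calls": 0, "multiplies": 0, "iterations": 0}


def scale(x, factor, stats):
    """A short function whose call/return overhead rivals its useful work."""
    stats["calls"] += 1
    stats["multiplies"] += 1
    return x * factor


def original(x, factor):
    """As written: the loop is outside the function and the result is never used."""
    stats = new_stats()
    for _ in range(N):
        stats["iterations"] += 1
        scale(x, factor, stats)
    return stats


def after_inlining(x, factor):
    """Step 1: the body of scale() replaces the call, so the calls disappear."""
    stats = new_stats()
    for _ in range(N):
        stats["iterations"] += 1
        stats["multiplies"] += 1
        _unused = x * factor
    return stats


def after_hoisting(x, factor):
    """Step 2: the multiply has the same value every time, so it moves out of the loop."""
    stats = new_stats()
    stats["multiplies"] += 1
    _unused = x * factor
    for _ in range(N):
        stats["iterations"] += 1
    return stats


def after_dead_code_elimination(x, factor):
    """Step 3: the value is never used, so the multiply and the empty loop go too."""
    return new_stats()


for stage in (original, after_inlining, after_hoisting, after_dead_code_elimination):
    print(stage.__name__, stage(3.0, 2.0))
```

The counters only make the effect visible: the calls vanish at step 1, the multiplies drop to one at step 2, and nothing is left at step 3.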

As for running more than one instance of a program at the same time, actual RAM use could be reduced because only one instance of the code need be in RAM, independent of the number of processes using that code. And if the working sets were comparable, these days with large caches (my 4-core Xeon processor has 10 MB of SmartCache, which is shared by all the cores), the working set of the instances could well be pretty much the same, so execution time for both might not degrade at all, compared to running the programs sequentially. For short programs, this might not matter, but for programs like those of climateprediction.net, which can take weeks or months to run, this could be quite significant.
3) Message boards : Number crunching : Problems with version 5.96 (Message 53986)
Posted 25 Jun 2008 by Jean-David Beyer
Post:
I am close to being out of here! I started crunching Rosetta because it would run for weeks without any attention; it sure wasn't because of the way low BOINC credit.

Now I have many stuck jobs, have had to abort plenty of jobs and am running out of patience.

Jim


I have gotten a few "stuck jobs", if by that you mean some that get to 100% complete, time remaining: --, but still running for quite a while. I just assumed this was similar to those that run 2x or 3x longer for the last 4% than they took for the first 96%, so I let them continue to run for a while. They ultimately finished. I have not checked if they finished correctly or with an error.
4) Message boards : Number crunching : Problems with version 5.96 (Message 53973)
Posted 24 Jun 2008 by Jean-David Beyer
Post:
On about half of the jobs, when I reach around 95% completed, progress simply crawls. The "to completion" time stops but the percentage increments extremely slowly. I assume the job is progressing but I don't know.


That would be normal. Especially if you are still on the first model, and/or have a short preferred runtime specified in your Rosetta preferences.


I guess this is normal, but sometimes, like today, it bugs me.

I have two hyperthreaded Xeons (32-bit) and 8 GBytes RAM running Linux kernel 2.6.18-92.1.1.el5PAE on one machine and two Pentium III processors and 512 MBytes RAM running Linux kernel 2.6.9-67.0.15.ELsmp on the other machine. In each case, Rosetta runs up to about 96% complete in a relatively short period of time, and time remaining is usually on the order of 10 minutes. Right now, it has used up about 8 hours since getting to 96% complete (and it took only about three hours to get to 96%). This is time actually consumed by the process, not wall-clock time.
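For a sense of scale, here is what a naive linear extrapolation would predict from those numbers. This is only a sketch: it assumes the remaining work proceeds at the same rate as the work so far, which is not necessarily how the BOINC manager computes its figure, and the function name is made up.

```python
def linear_remaining_hours(elapsed_hours, fraction_done):
    """Naive estimate: assume the rest runs at the same rate as the work so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# About three hours to reach 96% complete, as described above.
estimate = linear_remaining_hours(3.0, 0.96)
print(f"estimated remaining: {estimate * 60:.1f} minutes")
```

That works out to about 7.5 minutes, in line with the displayed figure of roughly 10 minutes. The 8 hours actually consumed since then is about 64 times the naive estimate, so the last few percent clearly do not run at the same rate as the first 96%.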

I just wish the time remaining would more accurately reflect the time needed to complete.

Rosetta is not the worst offender in this regard. Some projects have the time remaining actually increasing as the time consumed increases.
5) Message boards : Number crunching : Preemption Failures on Linux (Message 47607)
Posted 10 Oct 2007 by Jean-David Beyer
Post:
I, too, have problems with rosetta, and Mod.Sense suggested I post here.

First of all, I have two 3.06 GHz Hyperthreaded Xeon processors, 8 GBytes RAM, and a dedicated disk partition of 16 GBytes for BOINC stuff. I run Red Hat Enterprise Linux 5 with (at the moment) kernel 2.6.18-8.1.14.el5PAE. Swap space is set up as two partitions of 2 GBytes each. My network connection is Verizon FiOS with 20 Megabit/second download speed and 5 Megabit/second upload speed. I usually get these speeds.

As far as BOINC is concerned, I say to leave applications in memory when they are suspended, use all 4 processors, switch applications every 60 minutes, and use at most 100% of the processor time. Use at most 15.75 GBytes of disk space, leave at least .1 GByte free, and use at most 98% available disk space. Use at most 75% of swap space, 75% of memory when computer is in use and 90% of memory when computer is not in use. (Computer is turned on about 100% of the time.)

For Rosetta, I say give the application 11.11% resource share.

The original problem I thought I had was that a rosetta application ran up about 4 hours of time, which is about what I expect, then indicated that there was -- left to complete the work unit, progress 100%, and so on. But it continued running a long time (about 30 hours), really running up CPU time. I.e., it did not freeze. As Mod.Sense suggested, I stopped the BOINC client by running /etc/rc.d/init.d/boinc stop. This shut down everything _except_ the rosetta applications. I nominally had one running, but pstree revealed (in part) something like this (before shutting down):

─su───boinc─┬─hadam3_4.07_i68─┬─hadam3_um_4.07_───{hadam3_um_4.07_}
│ │ └─2*[{hadam3_4.07_i68}]
│ ├─2*[hadcm3trans_5.4─┬─hadcm3transum_5───{hadcm3transum_5}]
│ │ └─2*[{hadcm3trans_5.4}]]
│ ├─malariacontrol_───{malariacontrol_}
│ ├─rosetta_beta_5.───rosetta_beta_5.───2*[rosetta_beta_5.]
│ ├─setiathome-5.27───setiathome-5.27───2*[setiathome-5.27]
│ └─wcg_faah_autodo───3*[{wcg_faah_autodo}]

(This is one that, as far as I know, is actually running correctly.)

Now this time, when everything seems to be running correctly, stopping the boinc client causes all the boinc applications to stop too.
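One way to check what survives a client shutdown is to list the processes still owned by the boinc user afterwards. The following is a minimal sketch, assuming the output format of `ps -eo pid,ppid,user,comm` and that the client runs as user `boinc`, as in the listings in these posts. The sample text, including the user `alice`, is invented for illustration.

```python
import subprocess


def read_ps():
    """Run ps and return its text output (not called below; needs a Linux-style ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def leftover_processes(ps_text, user="boinc"):
    """Return (pid, ppid, command, reparented_to_init) for each process owned by user."""
    leftovers = []
    for line in ps_text.strip().splitlines()[1:]:  # first line is the header
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, owner, command = fields
        if owner == user:
            # A parent PID of 1 means the original parent has exited.
            leftovers.append((int(pid), int(ppid), command, ppid == "1"))
    return leftovers


# Invented sample of what ps might print after the client has been stopped.
SAMPLE = """\
  PID  PPID USER     COMMAND
    1     0 root     init
 2420     1 boinc    rosetta_beta_5.
 2421  2420 boinc    rosetta_beta_5.
 3100  3000 alice    bash
"""

for entry in leftover_processes(SAMPLE):
    print(entry)
```

On a live system, `leftover_processes(read_ps())` would be run after the stop command. An empty list means everything shut down with the client.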
6) Message boards : Number crunching : Silly Newbie Tricks - Suspending a work unit (Message 47597)
Posted 10 Oct 2007 by Jean-David Beyer
Post:
Since then it has run up more than 37 hours. I propose to let it run another day or so and see what happens.


Looks like your preferred runtime is 3hrs. The watchdog should have killed that task some time ago. You've already exited and restarted BOINC and it did not complete the task, so I suggest you abort it. Sorry.

Also, please join the Linux problems discussion


Note that when I exited BOINC it did not manage to kill the rosetta processes. I seem to remember that this is always the case. Could there be a problem in either the BOINC client, or the rosetta application that makes this happen?

I do not care what my preferred run time is. Would it make sense for me to increase it?
7) Message boards : Number crunching : Silly Newbie Tricks - Suspending a work unit (Message 47582)
Posted 10 Oct 2007 by Jean-David Beyer
Post:
I guess I would suggest ending BOINC and restarting.

The "excess" processes could be due to BOINC going to a "waiting for memory" state. It then starts up another process and crunches on that until memory again crosses above your preference.

I see you have 4 cores and 8GB of memory. Do your BOINC General Preferences allow it to use at least 25% of that? For both idle and while active?


I do not see why my machine would have any trouble getting memory for a BOINC application. I have 8 GBytes RAM and allow 75% of it to BOINC when the machine is busy (whatever that means) and 95% when the machine is not busy. Typically, 75% of the RAM is devoted to the input cache, although that can go down somewhat when I run a PostgreSQL database application.

I tried stopping BOINC and everything stopped except for the rosetta programs that kept running. The one with all the time on it was the parent of the other three.

I killed them and restarted BOINC and all seems to be running normally. I assume I lost 30 hours credit for that mess.


Progress report, sort-of. I probably did not lose any credit, at least as yet. After the boinc client scheduler got around to it, it resumed that 100% progress work unit again and it ran quite a few hours more. Then it started another part of the same work unit (same line in boincmgr), reset the time run to 0, but still indicating 100% progress with no time remaining. Since then it has run up more than 37 hours. I propose to let it run another day or so and see what happens.
8) Message boards : Number crunching : Silly Newbie Tricks - Suspending a work unit (Message 47458)
Posted 6 Oct 2007 by Jean-David Beyer
Post:
I guess I would suggest ending BOINC and restarting.

The "excess" processes could be due to BOINC going to a "waiting for memory" state. It then starts up another process and crunches on that until memory again crosses above your preference.

I see you have 4 cores and 8GB of memory. Do your BOINC General Preferences allow it to use at least 25% of that? For both idle and while active?


I do not see why my machine would have any trouble getting memory for a BOINC application. I have 8 GBytes RAM and allow 75% of it to BOINC when the machine is busy (whatever that means) and 95% when the machine is not busy. Typically, 75% of the RAM is devoted to the input cache, although that can go down somewhat when I run a PostgreSQL database application.

I tried stopping BOINC and everything stopped except for the rosetta programs that kept running. The one with all the time on it was the parent of the other three.

I killed them and restarted BOINC and all seems to be running normally. I assume I lost 30 hours credit for that mess.
9) Message boards : Number crunching : Silly Newbie Tricks - Suspending a work unit (Message 47444)
Posted 6 Oct 2007 by Jean-David Beyer
Post:
[snip]
The 5.5 CPU scheduler waits for the next checkpoint later than 10 seconds before the check (there is some asynchronous code, and several seconds can disappear if the host is slow and busy) unless there is a task that needs extra CPU time to complete on time. This may suspend a task just a few seconds before it is complete if there is a checkpoint there, but normally a checkpoint will only happen once every few minutes. Problems that had to be dealt with: tasks that run for days without checkpointing (there are projects that do this), projects that lie about how much work is left (one project I remember had some tasks that used 100 hours or so of CPU time after 100% complete was reached).

[snip]


Is rosetta@home one of these? This morning, after about 5 hours, the boincmgr indicated that rosetta@home reached 100% complete, yet it has been running about 10 hours since then. And really running, not stalled. I am running 5.8.16 of the BOINC client and boincmgr. rosetta_5.69_i686-pc-linux-gnu is the program itself.
This is a Red Hat Enterprise Linux 5 system with two 3.06 GHz hyperthreaded Xeon processors and 8 GBytes RAM.

$ ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 2420 4627 86 03:52 ? 15:04:04 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2421 2420 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2422 2421 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2423 2421 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 4627 4625 0 Sep29 ? 00:11:16 /home/boinc/BOINC/boinc




I assume you are asking about the comment I've bolded?

...not to my knowledge. I believe the odd symptoms people are seeing on Linux all relate to tasks which show they are not yet completed, but BOINC has requested that they stop crunching and it has scheduled another task, but the Rosetta thread continues working... working in what would otherwise be a normal way. As in, it will finish at a normal time... just that it shouldn't still be running.


You assume correctly. Most rosetta work units seem to complete in 5 to 8 hours for me. This one announced it was 100% complete and had no time remaining at about 5 hours, but it has now run up 22 hours 17 minutes. According to the "top" command, it has consumed 1338:07 (minutes:seconds) time.
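The two figures agree, as a quick conversion shows. This is a small sketch that handles only the minutes:seconds form of the TIME+ column seen in these listings (with or without hundredths of a second); the function name is made up.

```python
def top_time_to_hms(time_plus):
    """Convert a top TIME+ value such as '1338:07' or '1:07.95' to (hours, minutes, seconds)."""
    minutes_text, seconds_text = time_plus.split(":")
    hours, minutes = divmod(int(minutes_text), 60)
    return hours, minutes, float(seconds_text)


hours, minutes, seconds = top_time_to_hms("1338:07")
print(f"{hours} h {minutes} min {seconds:.0f} s")
```

So 1338:07 is 22 hours 18 minutes and 7 seconds, consistent with the 22 hours 17 minutes shown in boincmgr a moment earlier.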

If I knew it was running something important, I would just let it run, but most of this time has run up after boincmgr announced the process was complete.

Also I do not understand the excess rosetta processes.

PID PPID USER PR NI S VIRT RES SHR SWAP %MEM %CPU TIME+ P COMMAND
2420 4627 boinc 39 19 R 56500 45m 20 9632 0.6 74 1342:07 0 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
4629 4627 boinc 34 19 S 35760 5900 3148 29m 0.1 0 1:07.95 0 hadcm3trans_5.41_i686-pc-linux-gnu hadcm3inct_cmus_1920_160_65869824 1085_ocean.year yafbg
2421 2420 boinc 34 19 S 56500 45m 20 9632 0.6 0 0:00.13 2 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
2422 2421 boinc 34 19 S 56500 45m 20 9632 0.6 0 0:00.51 1 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
2423 2421 boinc 35 19 S 56500 45m 20 9632 0.6 0 0:00.04 2 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 1
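The PID and PPID columns in the listings above are enough to reconstruct how the extra rosetta processes hang together. A minimal sketch using the pairs from the `ps -fu boinc` output follows; the helper names are made up for illustration.

```python
from collections import defaultdict

# (PID, PPID) pairs copied from the `ps -fu boinc` listing in the post.
PROCESSES = [(2420, 4627), (2421, 2420), (2422, 2421), (2423, 2421), (4627, 4625)]


def children_by_parent(pairs):
    """Map each parent PID to the list of its direct children."""
    tree = defaultdict(list)
    for pid, ppid in pairs:
        tree[ppid].append(pid)
    return tree


def descendants(tree, root):
    """Return every process below root, parents before their children."""
    found = []
    stack = list(reversed(tree.get(root, [])))
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(reversed(tree.get(pid, [])))
    return found


tree = children_by_parent(PROCESSES)
print("below the BOINC client (4627):", descendants(tree, 4627))
print("below the busy rosetta process (2420):", descendants(tree, 2420))
```

All three idle rosetta processes sit below the busy one (2420) rather than directly below the BOINC client (4627): 2421 is its child, and 2422 and 2423 are children of 2421.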
10) Message boards : Number crunching : Silly Newbie Tricks - Suspending a work unit (Message 47429)
Posted 6 Oct 2007 by Jean-David Beyer
Post:
[snip]
The 5.5 CPU scheduler waits for the next checkpoint later than 10 seconds before the check (there is some asynchronous code, and several seconds can disappear if the host is slow and busy) unless there is a task that needs extra CPU time to complete on time. This may suspend a task just a few seconds before it is complete if there is a checkpoint there, but normally a checkpoint will only happen once every few minutes. Problems that had to be dealt with: tasks that run for days without checkpointing (there are projects that do this), projects that lie about how much work is left (one project I remember had some tasks that used 100 hours or so of CPU time after 100% complete was reached).

[snip]


Is rosetta@home one of these? This morning, after about 5 hours, the boincmgr indicated that rosetta@home reached 100% complete, yet it has been running about 10 hours since then. And really running, not stalled. I am running 5.8.16 of the BOINC client and boincmgr. rosetta_5.69_i686-pc-linux-gnu is the program itself.
This is a Red Hat Enterprise Linux 5 system with two 3.06 GHz hyperthreaded Xeon processors and 8 GBytes RAM.

$ ps -fu boinc
UID PID PPID C STIME TTY TIME CMD
boinc 2420 4627 86 03:52 ? 15:04:04 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2421 2420 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2422 2421 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 2423 2421 0 03:52 ? 00:00:00 rosetta_beta_5.80_i686-pc-linux-gnu xx mcr1 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -abrelax -output_c
boinc 4627 4625 0 Sep29 ? 00:11:16 /home/boinc/BOINC/boinc


11) Questions and Answers : Unix/Linux : Difficulties downloading new work (Message 39439)
Posted 15 Apr 2007 by Jean-David Beyer
Post:
Trying to download 5.59 work units. It has been trying to download three work units, lots of files. Each file has run up over an hour of time trying to download, but has gotten no bytes on any of them. Examining system status indicates that all servers are up and work units available. Messages indicate that servers may be down, but the Rosetta site indicates they are all up.

Dual Hyperthreaded Xeon (x86) system with 8 GBytes of RAM running RHEL 3 at the moment.
12) Questions and Answers : Unix/Linux : some WU's stop executing on linux (Message 30286)
Posted 30 Oct 2006 by Jean-David Beyer
Post:
G'day NilsB

Welcome to Rosetta@Home

Rosetta does occasionally have Linux errors (3.52% last time I saw).

You can of course abort them if you see them, but the programme will eventually stop itself. The programme will also send debugging information about the work unit that failed, so the Rosetta@Home team can reduce these errors even further.

Hope that helps

Hugo.


I also have this problem. I noticed it yesterday and it is still stuck today.

Work unit 1n0u_HIGHFREQ_ABRELAX_7_1_NATIVe_ONLY_BARCODE__1312_9043_0.
It has accumulated 00:58:44. The BOINC client gives it an hour of CPU from time to time and it seems to use none of it.

I am running Red Hat Enterprise Linux 3 ES (up to date) on a dual 3.06 GHz Xeon hyperthreaded processor with 8 GBytes RAM, and this leaves one hyperthreaded processor idle all the time it is scheduled. Other Rosetta applications run just fine and one completed sometime yesterday.

You say "the programme will eventually stop itself." How long is eventually? Because eventually I will wish to abort it.
13) Questions and Answers : Unix/Linux : Work unit way too slow, I think. (Message 13627)
Posted 13 Apr 2006 by Jean-David Beyer
Post:
P.S.:

If the Work Unit is really Hung -


1. Suspend the Work Unit: BOINC Manager -> Work (tab) -> click on the Work Unit, click the Suspend button (on the left-hand side), then the Resume button. Wait for the computer to restart the Work Unit (it will need to finish the new Work Unit it started, if it had another available) and see if it's still stuck; give it about 20 min.

It took more than 20 minutes because BOINC client downloaded about
7 Predictor work units and had to do them first.

But after that the rosetta process got up to about 24 hours and still
did not progress.

2. Shut down BOINC, restart BOINC, and see if the Work Unit is still stuck; give it about 20 min.

After shutting down BOINC, the rosetta process kept on running,
with init as the parent. I killed it and then restarted BOINC.
In less than 60 seconds, the rosetta process got up to 1.01% but
I do not have hope for it.

3. Reboot your computer. See if the Work Unit is still stuck, give it about 20min.

I am not prepared to reboot the computer. What good would that do
that shutting down BOINC, killing any leftover BOINC-owned processes,
and restarting BOINC would not already do?

4. Abort the Work Unit: BOINC Manager -> Work (tab) -> click on the Work Unit that's stuck, then click the Abort button (on the left-hand side).

I will consider this if it is still stuck tomorrow.
14) Questions and Answers : Unix/Linux : Work unit way too slow, I think. (Message 13609)
Posted 13 Apr 2006 by Jean-David Beyer
Post:
Work unit TRUNCATE_TERMINI_FULLRELAX_1ptq_433_996_0 is taking far too long. It has used 17:07:13 as I type this and it seems to be at 1.04% complete with 20:52:22 remaining. Normally, a work unit is complete long before this. Should I kill it, or what? And if so, how?
15) Questions and Answers : Getting started : How long for Rosetta@home to get started? (Message 2364)
Posted 5 Nov 2005 by Jean-David Beyer
Post:
Check your preferences, the disk space items in particular.
You should allow Rosetta at least 200 MB of disk space. The client will probably not even use that amount, but it won't download WUs when allowed disk space is below 200 MB.

Disk space is not the problem (I allocated 8 GBytes to the boinc partition). There were two problems: one was the server being intermittently down (which turns out not to have been the major problem), and the other was that my machine, even with two 3.06 GHz hyperthreaded Xeon processors and 4 GBytes RAM, was overcommitted (with four climate prediction work units). Suspending the climate prediction stuff allowed me to download from Rosetta@home. Once that was done, I allowed climate prediction to run again (which it didn't, of course, since the deadline for the Rosetta stuff was December 1 or 2, and the deadline for the climate prediction was January 24). I let the Rosetta work units run and they took only a brief time each (less than an hour, IIRC). So things are back to normal.

So being patient would not have helped unless I were extremely patient. I assume I would not have gotten any work units from Rosetta until sometime in late January 2006 when the climate-prediction stuff cleared out.
16) Questions and Answers : Getting started : How long for Rosetta@home to get started? (Message 2160)
Posted 3 Nov 2005 by Jean-David Beyer
Post:
Yesterday I registered for this project and attached to it. But I get no application program(s) and no work units. Is this normal, have I configured something incorrectly, or what?

I have entries in my log saying (I wish I could copy from the boincmgr window and paste in here, but I cannot, so I hope there are no typos):

Sending scheduler request to http://boinc.bakerlab.org...
Reason: Requested by user
Note: not requesting new work or reporting results

Well why not?

Sometimes it says:

... to fetch work
Requesting 692100 seconds of new work
No work from project






©2021 University of Washington
https://www.bakerlab.org