Posts by Alan Roberts

1) Message boards : Number crunching : exited with zero status but no 'finished' file (Message 73885)
Posted 24 Sep 2012 by Alan Roberts
Post:

...

Yes, but as Sid notes, most of the time it isn't worth fretting over. As a rare occurrence, the conflict would be difficult to track down and may be impossible to avoid. If it continues to happen frequently, click through to the BOINC FAQ Service and check out Jord's list of suggestions. The link in my previous post takes you straight to the relevant page.

Best,
Snags


It was definitely happening frequently, but I did try the suggestion of not using BOINC's CPU throttling (switched from using four cores at 75% to just two at 100%), and that seems to have cleaned the problem up for me, at the "cost" of a wider spread in core temperatures and somewhat less crunching than I was hoping to get done.
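To put a rough number on that "cost", here is a minimal sketch of the arithmetic. It assumes the nominal compute budget is simply cores in use times the allowed CPU-time fraction; the function name is made up for illustration, and real throughput will also depend on how the throttling behaves in practice.

```python
def core_equivalents(cores_in_use: int, cpu_time_fraction: float) -> float:
    """Nominal compute budget, in units of one fully busy core (simplifying assumption)."""
    return cores_in_use * cpu_time_fraction


throttled = core_equivalents(4, 0.75)    # four cores at 75% CPU time
unthrottled = core_equivalents(2, 1.00)  # two cores at 100% CPU time
reduction = 1 - unthrottled / throttled  # fraction of nominal budget given up

print(f"{throttled:.2f} -> {unthrottled:.2f} core-equivalents "
      f"({reduction:.0%} less nominal crunching)")
```

On paper that is a drop from three core-equivalents to two, which matches the "somewhat less crunching" observation.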

Thanks,
Alan

2) Message boards : Number crunching : exited with zero status but no 'finished' file (Message 73861)
Posted 19 Sep 2012 by Alan Roberts
Post:
Seeing the same issue on a new Win7 machine that I cranked up with BOINC 7.0.28. Per your note, I've just switched from 75% CPU on four jobs (one per core) to 100% CPU, only two concurrent jobs. Waiting to see if that reduces the problem, and how the temperature settles out.

Question: Any correlation between this error and the "mismatch" between Rosetta and BOINC 7.x? Should I be planning my retreat to BOINC 6 because I'm seeing this failure?

...
Please know that this only becomes a fatal error when it occurs 100 times to a particular task; at that point BOINC assumes the task will never be able to finish and gives up on it, ending it as a client error. If you see this message only occasionally it is safe to ignore it.


Best,
Snags


Understood, but does the restart imply loss-of-work back to the previous checkpoint for the job?
3) Message boards : Number crunching : Any pointers for this failure? (Message 72816)
Posted 18 Apr 2012 by Alan Roberts
Post:
The machine was happily working on BOINC 6.something. An upgrade to 7.0.25, another project reset, and no joy. Just watched as it took another try on the one work unit per day the project is allowing after so many failures. Same behavior, download of everything happens, then five process creation failures, and the task is finished with status of computation error. Same exit status (-185). No firewall requests or antivirus action. Nothing in the Windows event logs.

This machine isn't going to be upgraded beyond what it is licensed for (Win2K), and it still functions for its primary purposes, home file server and Squeezebox server.

Apologies for not being up to speed, but is there anything straightforward I can do to report this failure to developers, along the lines of emailing or posting some additional status file, or enabling some diagnostic mode? If not, I guess this machine retires from Rosetta.

Thanks!
4) Message boards : Number crunching : Any pointers for this failure? (Message 72796)
Posted 16 Apr 2012 by Alan Roberts
Post:
Machine is Win2K, so no per-executable permissions. I checked and didn't see anything recent in antivirus quarantine, nor did it flag anything after the project reset (which seemed to have dumped all the files, since I saw it downloading executables again). I'll check the firewall log, but all the software firewall on that old box does is apply rules controlling network access. So unless the new executable needs its own internet access rules (versus BOINC's standing rules), I'm not inclined to think the firewall is the issue.

Thanks!
5) Message boards : Number crunching : Any pointers for this failure? (Message 72787)
Posted 16 Apr 2012 by Alan Roberts
Post:
My home music server has been crunching for years with no attention on my part. Just noticed something went wrong on 13-APR. Everything since has been failing like this:

http://boinc.bakerlab.org/rosetta/result.php?resultid=498704389

Behavior (at the "Messages" level) seems to be:
Starting ...
[error] Process creation failed:
[error] Process creation failed:
[error] Process creation failed:
[error] Process creation failed:
[error] Process creation failed:
Computation for ... finished


I tried Reset Project and a computer reboot yesterday, no help. Unfortunately I don't have time this week for a lot of forum reading. Has anyone seen/solved this one already? If so a pointer to the thread would be appreciated. If not, any suggestion for next steps ... Detach/reattach, reinstall/upgrade BOINC, etc?

Thanks,
Alan
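For anyone wanting to tally this pattern across a longer Messages log, a minimal sketch follows. It assumes the log has been saved as plain text with the same line prefixes shown above; the task names in the sample are hypothetical placeholders, since the real names are elided in the post.

```python
from collections import Counter


def count_creation_failures(lines):
    """Count '[error] Process creation failed' lines between each
    'Starting <task>' line and the matching 'Computation for ... finished' line."""
    counts = Counter()
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Starting "):
            current = line[len("Starting "):]
            counts[current] += 0  # record the task even if it never fails
        elif line.startswith("[error] Process creation failed") and current is not None:
            counts[current] += 1
        elif line.startswith("Computation for ") and line.endswith(" finished"):
            current = None
    return dict(counts)


# Hypothetical sample; the task names are placeholders, not real work units.
SAMPLE_LOG = """\
Starting example_task_a
[error] Process creation failed:
[error] Process creation failed:
[error] Process creation failed:
[error] Process creation failed:
[error] Process creation failed:
Computation for example_task_a finished
Starting example_task_b
Computation for example_task_b finished
"""

for task, failures in count_creation_failures(SAMPLE_LOG.splitlines()).items():
    print(task, failures)
```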
6) Message boards : Cafe Rosetta : A couple of "old school" IT suggestions (Message 64735)
Posted 2 Jan 2010 by Alan Roberts
Post:
Yes, there are two figures for number of work units available. ...


Mod.Sense thanks for the explanation of the difference. I guess this means it would take a slightly more careful glance at the home page (noticing that Credits last 24h has plummeted perhaps) for anyone doing a status check to realize that work isn't happening.

Regards,
Alan

7) Message boards : Cafe Rosetta : A couple of "old school" IT suggestions (Message 64728)
Posted 2 Jan 2010 by Alan Roberts
Post:
Back in the prehistoric times when I supported computers used by research groups, there were these archaic devices called "pagers." All sorts of software conditions (missing processes, low/failed storage issues, systems down) and environmental conditions (HVAC faults, power failures) would page the on-call member of the team, and if he/she failed to respond the page would eventually shift to their backup.

Remote access was a primitive thing done with dial-up modems, so sometimes things could be fixed from home, other times it was a drive to the servers, and you might or might not have been meeting the vendor's field engineer on-site.

Strangely enough, the on-call schedule provided coverage across holidays once the team had puzzled out who was likely to be available on what dates.

In these days when every cell phone on the planet can receive text messages, it seems to require no additional hardware and not much development cost to have any Server Status "Not Running" fault also send an alert message to Rosetta's IT people.
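A minimal sketch of that idea, with pager-style escalation to a backup, might look like the following. Everything here is hypothetical: the `Contact` type, the `send_text` and `acknowledged` callables, and the way the status string is obtained are all stand-ins, since I have no knowledge of how Rosetta's servers are actually monitored.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Contact:
    name: str
    number: str


def escalate(
    status: str,
    on_call: Sequence[Contact],
    send_text: Callable[[str, str], None],
    acknowledged: Callable[[Contact], bool],
) -> Optional[Contact]:
    """Text each on-call contact in order until one acknowledges.

    Returns the contact who acknowledged, or None if the status is healthy
    or nobody responded.
    """
    if status != "Not Running":
        return None
    for contact in on_call:
        send_text(contact.number, "Server Status: Not Running")
        if acknowledged(contact):
            return contact
    return None
```

In use, `send_text` would wrap whatever SMS gateway is available and `acknowledged` would wait some agreed time for a reply before the loop moves on to the backup.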


Since all the really cool people have smart phones that can surf the web, another thought comes to mind ... It would seem to be in the self-interest of the researchers with active projects on Rosetta to visit the web site every couple of days and check to see if "work is happening."

I'll admit I am somewhat confused because the front page claims 567,925 queued jobs, while Server Status reports only 1,281 jobs pending, but someone must understand which number is accurate.


Not really griping, just tossing out some ideas to avoid future failures,
Alan
8) Message boards : Number crunching : Can't get work ... Trying to make sure this isn't my problem (Message 61904)
Posted 23 Jun 2009 by Alan Roberts
Post:
Spoke too soon ... The manual edits seemed to result in work stuck in the "downloading" state with nothing good happening.

Tried the detach and reattach approach. I'm back to downloading stuff, but now I'm seeing the file size errors (I'm assuming this is what you referred to) on the minirosetta and minirosetta_graphics executables.

Oh well, perhaps it will sort itself out with a bit more time.
9) Message boards : Number crunching : Can't get work ... Trying to make sure this isn't my problem (Message 61903)
Posted 23 Jun 2009 by Alan Roberts
Post:
Thanks Greg, manual edits have gotten me back to downloading work.
10) Message boards : Number crunching : Can't get work ... Trying to make sure this isn't my problem (Message 61897)
Posted 23 Jun 2009 by Alan Roberts
Post:
I recently completed a long-overdue cooling repair on my home file server. Rosetta has been disabled on the server to avoid the load.

Once everything was back together, I upgraded to the latest Rosetta and tried to get some work. The message response is:
6/23/2009 7:45:25 AM	rosetta@home	Sending scheduler request: To fetch work.
6/23/2009 7:45:25 AM	rosetta@home	Requesting new tasks
6/23/2009 7:45:30 AM	rosetta@home	Scheduler request completed: got 0 new tasks
6/23/2009 7:45:30 AM	rosetta@home	Message from server: Server error: can't attach shared memory

Anyone else catching this error this morning?
11) Message boards : Number crunching : Are there still Rosetta Beta work units? (Message 61649)
Posted 9 Jun 2009 by Alan Roberts
Post:
Mod.Sense:

Sorry, missed your question about version where BOINC would suspend but Mini just kept running. 5.10.45 and 6.2.18 both gave me problems. That wasn't the only failure mode. As noted in my old post, I also saw cases where the Mini job just kept running forever, and BOINC never showed any progress for the job.

I don't have hours in the day to babysit never-ending jobs and definitely did not want to rock the boat at my customer site with failure-to-suspend, so I just avoided the problem.

Based on Hammeh's comment, I'm going to give Mini with BOINC 6.4.7 a try on the Optiplex box least likely to cause complaints if it fails to suspend, and (if it will run on Win2K) on some older non-production servers. If I can get clean operation, I'll let Mini back into the world.
12) Message boards : Number crunching : Are there still Rosetta Beta work units? (Message 61644)
Posted 9 Jun 2009 by Alan Roberts
Post:
Chilean:
The issues with pausing crunching on the business machines during work hours are:

  • The machines are Dell Optiplex GX620 small form-factor desktops, located in conference rooms. As soon as the machine pulls more than a 20% sustained load, the fan speed steps up and people start asking why there is an aircraft taxiway in the room. Unacceptable.

  • While the machines are 3.2GHz Pentium Ds with plenty of RAM, when one core is crunching for Rosetta there is a noticeable increase in lag on intensive actions, IMO (e.g., starting bloatware applications like PowerPoint, clean video playback, etc). Since the typical user of these machines is an administrative assistant working as the driver and scribe during a meeting, it isn't acceptable to have anything going on which even might be blamed for complicating their life.



Mod.Sense:
Thanks for the explanation, I didn't realize it wouldn't query the entire pending queue.

Hammeh:
Interesting. Since I have machines with no work in progress, I should be able to switch back to 6.4.7 without any impact. Do you know offhand if I can just run the install to roll back, or do I need to detach, uninstall, do a clean install, and then attach and merge?

13) Message boards : Number crunching : Are there still Rosetta Beta work units? (Message 61625)
Posted 8 Jun 2009 by Alan Roberts
Post:
I'm allowed to crunch on a number of internal machines at a customer site, provided I avoid business hours. In the past Mini was not pausing even when BOINC thought it was supposed to, so I gave up and used app_info.xml to lock those machines down to just Rosetta Beta jobs.

I just noticed most/all of those machines have completed all tasks, and the Messages tab is reporting:
Sending scheduler request: To fetch work. Requesting 345600 seconds of work, ...
Scheduler request succeeded: got 0 new tasks
Message from server: No work sent

repeatedly.

Is Rosetta Beta gone?
14) Message boards : Number crunching : Times for work units on new machine (Message 61276)
Posted 19 May 2009 by Alan Roberts
Post:
Jchojnacki:

No, I either never knew about the model limit or had forgotten about it. Certainly explains the wide time variation.

I guess what worries me a bit is the machine's current set of tasks (downloaded after I reset). When I total them up there are enough to be 3.5 days of work for 2 x 100% of the cores if the To completion times average 3:44:15 (what BOINC Manager lists for all of them). Since many of the runs will be longer than this, it seems like the machine could still be overbooked (have work it can't complete prior to deadline).

If it had only fetched enough work for 3.5 days of 1-day work units, then it might be underbooked (since some/many would finish "early" by hitting the 100 model limit), but that seems more efficient than missing deadlines (since the Rosetta servers won't have to dispatch the overdue WU to another machine).
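The overbooking worry can be made concrete with a back-of-the-envelope sketch. It assumes the client simply fills the buffer by dividing buffered core-seconds by the per-task estimate, which is a simplification of whatever BOINC's work fetch really does; the variable names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400

buffer_days = 3.5      # "Maintain enough work for an additional" setting
cores_in_use = 2       # 50% of four cores at 100% CPU time
estimate_s = 3 * 3600 + 44 * 60 + 15  # To completion of 03:44:15, in seconds
target_run_s = 1 * SECONDS_PER_DAY    # Target CPU run time of 1 day

# Tasks needed to fill the buffer if every task really took the estimate.
buffer_core_seconds = buffer_days * SECONDS_PER_DAY * cores_in_use
tasks_fetched = int(buffer_core_seconds // estimate_s)

# Wall-clock days to clear that queue if each task instead runs the full target time.
days_if_full_length = tasks_fetched * target_run_s / cores_in_use / SECONDS_PER_DAY

print(f"estimate per task: {estimate_s} s")
print(f"tasks fetched to fill buffer: {tasks_fetched}")
print(f"days to finish if all run a full day: {days_if_full_length:.1f}")
```

Tasks that stop early at the model limit pull the real figure down, so this is the pessimistic end of the range.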

Hammeh's response says BOINC will figure it all out. Would it settle faster if I turned Target CPU run time down so that most WUs end on CPU time, instead of number of models?

Obviously I haven't been paying enough attention to my Rosetta machines. The last time I was watching carefully, the target run time was consistently the limit for a job's duration.

Thanks!
15) Message boards : Number crunching : Times for work units on new machine (Message 61275)
Posted 19 May 2009 by Alan Roberts
Post:
Hammeh:

Sorry, I was attempting to keep my post short, and left out information. Primary reason for the project reset was some early failures I saw when transitioning to this machine. I initially tried to run using 100% of the cores (all 4), with each core throttled to no more than 65% of the CPU time (i.e., BOINC was throttling the running work units).

While my foreground workstation performance was fine and the core temps were very stable (at least as far as SpeedFan was concerned), I kept finding "dead" Rosetta processes (work units sitting in memory but consuming no CPU, BOINC Manager showing them running but with no CPU time or progress). These failures (e.g., 249863482, 249863504, 249863499) reported as Compute Error although browsing through the log it seems more like some sort of resource contention bug. The corresponding Messages in BOINC Manager had wording along the lines of, "... If this keeps happening, a project reset may be necessary..."

I didn't really have time to research this board and find out if this is a known bug, so I selected No new tasks, waited until everything left on the queue wasn't going to finish by deadline, and Reset project. Before picking up more work I changed preferences to just use 50% of the cores (2) at 100% available CPU time. Without BOINC trying to throttle the work units I seem to be getting stable system behavior (i.e., no dead Rosetta processes).

I do understand that I need to let BOINC settle on Result duration correction factor (and whatever else goes into predicting and scheduling work). Just wanted to make sure I haven't created some unstable configuration that wouldn't tune, since the machine pulled a lot more work units than it could complete before deadline.

Thanks.
16) Message boards : Number crunching : Times for work units on new machine (Message 61269)
Posted 19 May 2009 by Alan Roberts
Post:
Added a new computer to my collection.

I have the computer set to my "Work" location. This location is set for Target CPU run time of 1 day; and a Maintain enough work for an additional setting of 3.5 days.

In BOINC Manager, the To completion time on all the pending work units is 03:44:15. A lot of work units were downloaded (more than it could finish by the initial deadline). Even after a project reset it seems to be repeating this behavior. Actual work done seems all over the map, but very few run just the small time shown for pending work units, and very few run a full day.

Can anybody spot something obvious that I'm doing wrong? All I want is a stable pipeline of full-day work units, not an overbooking.

I did initially try and migrate Rosetta from the older machine that this one replaced, until I discovered I couldn't merge machines across a name change (which was required). Could that be contributing to the mess?

Thanks!
17) Message boards : Number crunching : HELP !!! minirosetta hang on after 20 second of execution on freebsd (Message 54457)
Posted 12 Jul 2008 by Alan Roberts
Post:
Maxim,

I finally threw in the towel, and with the help of this forum resorted to avoiding Mini tasks on machines where I was getting frequent hangs. I don't know if the FreeBSD client supports app_info. If it does, I suspect you'll need some different details for application names, but the thread should give you a start at it.

I just found a hung Beta 5.98 task on a previously trouble-free machine, so I can no longer claim that dodging Mini tasks is a complete fix.
18) Message boards : Number crunching : Problems with Rosetta version 5.98 (Message 54456)
Posted 12 Jul 2008 by Alan Roberts
Post:
I've had so many problems with Mini (see this post) that I've had to resort to filtering it off of quite a few of my dual-core/dual-CPU machines.

This morning I walked into my home's listening room to find my recycled laptop, low-power music server (that to-date has happily consumed anything Rosetta sent its way) making excessive noise. Checking, I found this 5.98 WU stuck at 100% CPU, even though the machine's preferences were set for max of 70% of CPU (BOINC 5.10.45, and BOINC CPU setting has been honored in the past). Within BOINC, CPU time used and progress were not advancing; the job was sitting at 20-something percent progress.

Suspending the project did not suspend the job. Shutting down the BOINC service did. Ran a round of Windows updates and rebooted. The work unit restarted and ran with CPU throttling for about 10 minutes, then locked up at 100% again. This time I aborted the task ... I believe the first time across any of the machines on my team that I've had to abandon a 5.98 work unit.

The worst news for me is that the long (possibly better part of two days) non-cycling fan run seems to have put the fan into a permanent high-noise mode. I've got a spare fan assembly, but won't really enjoy the time to tear down and reassemble the unit this weekend.

I guess I'll reinstall and setup Threadmaster, since the BOINC/Rosetta combination seems to be trending towards less operational reliability.

I know everyone is busy with CASP, but I have to emphasize that this is important to me, and I assume to others who are trying to contribute with machines that are not dedicated crunchers. Most of the machines on my team are there because I committed to not loading the machine during business hours (time-of-day and when needed manual suspends) and not overheating the machine (CPU limits). If I can't reliably do this with minimal ongoing effort I'll end up having to pull machines off the project.
19) Message boards : Number crunching : Need help fixing problems or avoiding Rosetta Mini (Message 54232)
Posted 7 Jul 2008 by Alan Roberts
Post:
Thanks for the pointer. I've got the servers locked down to Beta jobs, and they are obeying BOINC's time-of-day suspends (helps with the geo-politics). I can deploy to the Pentium D desktops that are throwing all the errors on Mini jobs this evening.

Not sure if the previous poster was asking me about BOINC version, but I think everything is at 5.10.30 or higher. I've only been running BOINC updates when I happened to be visiting a machine and it was near a job boundary.
20) Message boards : Number crunching : App_info for windows only. (Message 54218)
Posted 7 Jul 2008 by Alan Roberts
Post:
Hello Peter,

I've just tried the following app_info.xml on a machine in my herd that had just completed a Reset Project:
<app_info>
  <app>
    <name>rosetta_beta</name>
  </app>
  <file_info>
    <name>rosetta_beta_5.98_windows_intelx86.exe</name>
  </file_info>
  <app_version>
    <app_name>rosetta_beta</app_name>
    <version_num>598</version_num>
    <file_ref>
      <file_name>rosetta_beta_5.98_windows_intelx86.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>

The initial result was download of the data files for several Beta 5.98 WUs, which stayed in the downloading state. Messages showed the following error:
7/7/2008 12:47:39 AM||file projects/boinc.bakerlab.org_rosetta/rosetta_beta_5.98_windows_intelx86.exe not found
7/7/2008 12:47:39 AM||[error] No URL for file transfer of rosetta_beta_5.98_windows_intelx86.exe
7/7/2008 12:47:39 AM|rosetta@home|Sending scheduler request: To fetch work.  Requesting 194400 seconds of work, reporting 0 completed tasks
7/7/2008 12:47:40 AM|rosetta@home|Backing off 1 min 0 sec on download of rosetta_beta_5.98_windows_intelx86.exe

So I copied rosetta_beta_5.98_windows_intelx86.exe from another machine that I had not reset yet, and the first workunit launched, with all the others going ready to start. Major progress! Looking at messages, I see:
7/7/2008 12:50:41 AM|rosetta@home|[error] Application file rosetta_beta_5.98_windows_intelx86.exe missing signature
7/7/2008 12:50:41 AM|rosetta@home|[error] BOINC cannot accept this file

So I don't think I have the app_info.xml file quite correct yet.
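A quick pre-flight check along these lines would at least have caught the first "not found" error before BOINC did. It is only a sketch: the function name is made up, it assumes the project directory path shown in the log above, and it checks nothing about signatures, so it does not explain the second error.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

APP_INFO = """\
<app_info>
  <app><name>rosetta_beta</name></app>
  <file_info><name>rosetta_beta_5.98_windows_intelx86.exe</name></file_info>
  <app_version>
    <app_name>rosetta_beta</app_name>
    <version_num>598</version_num>
    <file_ref>
      <file_name>rosetta_beta_5.98_windows_intelx86.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
"""


def app_info_problems(app_info_xml, project_dir):
    """List files that app_info.xml references without declaring,
    and declared files that are missing from project_dir."""
    root = ET.fromstring(app_info_xml)
    declared = {e.text.strip() for e in root.findall("./file_info/name") if e.text}
    referenced = {
        e.text.strip()
        for e in root.findall("./app_version/file_ref/file_name")
        if e.text
    }
    problems = []
    for name in sorted(referenced - declared):
        problems.append(f"{name}: referenced in <file_ref> but has no <file_info>")
    for name in sorted(declared):
        if not (Path(project_dir) / name).is_file():
            problems.append(f"{name}: not found in {project_dir}")
    return problems


# Path taken from the log message above, relative to the BOINC data directory.
for problem in app_info_problems(APP_INFO, "projects/boinc.bakerlab.org_rosetta"):
    print(problem)
```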

I'll look again in the morning after I've had some sleep, but if you can spot what I have wrong and let me know, I'd appreciate it!

Regards,
Alan





©2024 University of Washington
https://www.bakerlab.org