Posts by gbayler

1) Message boards : Number crunching : Summary of issues with VirtualBox tasks (Message 105348)
Posted 6 Mar 2022 by gbayler
Post:
6. Task gets stuck with the BOINC Manager status:
Postponed: VM Hypervisor failed to enter an online state in a timely fashion.

In contrast to the well-known issue #4 on my list with the status
Postponed: VM job unmanagable, restarting later.

restarting BOINC does not help here. I just had such a task (1475946906). Now, after ~2 days, it has decided to continue processing again. Let's see whether it finishes before its deadline!
2) Message boards : Number crunching : ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time. (Message 105347)
Posted 6 Mar 2022 by gbayler
Post:
It feels a little bit like riding the proverbial dead horse!

It's zombie dead horse, along with the zombie rosetta tasks,


Hahaha, made my day! 😂 🧟 🐎

Sorry for my late answer, saw only now that you replied.
3) Message boards : Number crunching : ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time. (Message 105180)
Posted 23 Feb 2022 by gbayler
Post:
xii5ku wrote:
PPS:
Kudos to @gbayler for the issues & workarounds tracker.

Thank you, it is great to hear that! :) Actually, @dcdc had the idea to collect issues with VirtualBox tasks in his thread "Summary of issues with VirtualBox tasks"; my sheet is just based on that.

xii5ku wrote:
PS:
The need for such absurd workarounds shows in which sorry state the whole Rosetta@home project has carried itself over the years. It's sad. The 'rosetta python projects' application is just disgusting and should go away. I only am running it myself because I know how to, and because Rosetta v4 work is only intermittently available.

I was quite surprised to learn that R@h does not work properly out of the box, and that workarounds such as aborting tasks and restarting BOINC regularly are necessary. I'm torn between being happy to be able to contribute something to the project on the one hand, and doubting whether it is a good investment of my time on the other hand. It feels a little bit like riding the proverbial dead horse!
4) Message boards : Number crunching : ERROR: Vboxwrapper lost communication with VirtualBox, rescheduling task for a later time. (Message 105150)
Posted 23 Feb 2022 by gbayler
Post:
Hi @buscher,
The root cause of this problem seems to be the use of an outdated or mismatched vboxwrapper in the VirtualBox tasks for Rosetta@home. From my understanding, as a user you cannot do anything to fix this problem, but there are some workarounds available (such as restarting BOINC, which you already mentioned).
This issue is #4 on my list of issues with VirtualBox tasks: https://docs.google.com/spreadsheets/d/1lBP27MYx2RH9PYuweMoSwOLvmIaoqI77Q0_gC34e-Z0/edit?usp=sharing
5) Message boards : Number crunching : Summary of issues with VirtualBox tasks (Message 105073)
Posted 20 Feb 2022 by gbayler
Post:
I have compiled the issues mentioned in this thread into a Google Sheet: https://docs.google.com/spreadsheets/d/1lBP27MYx2RH9PYuweMoSwOLvmIaoqI77Q0_gC34e-Z0/edit?usp=sharing
The idea is to make it simpler for everybody to get an overview of the open issues.
Since the forum is spammed from time to time, the sheet's access rights are "comment only", that is, you cannot directly edit the sheet. If you have some additions, just let me know! I'll keep an eye on this thread anyway and update the sheet from time to time.
6) Message boards : Number crunching : Summary of issues with VirtualBox tasks (Message 104819)
Posted 14 Feb 2022 by gbayler
Post:
xii5ku wrote:
the script requires a fairly recent version of 'boinccmd'. I don't know precisely how recent, but 7.16.17 works, 7.16.6 does not work. (The version of the boinc clients which are controlled by the script does not matter.)
PS, you can tell whether your boinccmd is recent enough for the script by looking at the output of the --get_tasks call (towards any client which has one or more tasks in progress). If there is a line with "elapsed task time:" for each task, it will work.


I saw that too! According to the issue "boinccmd --get_tasks is missing elapsed time" (#3463), the problem was fixed in version 7.16.11.
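The check described in the quote can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the layout of the `--get_tasks` output (one indented "name:" line per task) is an assumption based on what recent boinccmd versions print. The version helper also shows why 7.16.6 counts as older than 7.16.11, even though a plain string comparison would suggest otherwise.

```python
import re


def has_elapsed_time(get_tasks_output: str) -> bool:
    """Return True if every task in the output has an 'elapsed task time:' line.

    Assumption: each task contributes exactly one line starting with 'name:'
    (the 'WU name:' line does not match because of the 'WU ' prefix).
    """
    names = re.findall(r"^\s*name:", get_tasks_output, flags=re.MULTILINE)
    elapsed = re.findall(r"^\s*elapsed task time:", get_tasks_output,
                         flags=re.MULTILINE)
    return len(names) > 0 and len(elapsed) == len(names)


def version_at_least(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, component by component."""
    def as_tuple(v: str):
        return tuple(int(part) for part in v.split("."))
    return as_tuple(version) >= as_tuple(minimum)
```

To try it, the text returned by `boinccmd --get_tasks` can be passed to `has_elapsed_time`, and `version_at_least("7.16.6", "7.16.11")` evaluates to False, matching the observation that 7.16.6 does not work.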

Something else: is there a way to make a sticky post here? Otherwise, I have little hope that this thread will do what the thread starter/original poster @dcdc intended:
I thought it would be a good idea to create a post listing the issues with the VirtualBox tasks, which can then be updated as/if they get fixed. This isn't a thread to list the details of the issues - just to link to the wider discussions elsewhere:
7) Message boards : Number crunching : Summary of issues with VirtualBox tasks (Message 104735)
Posted 8 Feb 2022 by gbayler
Post:

1. Sometimes tasks don't start. They sit there at 0% and with no time used, but say "Running" in BOINC Manager. This tends to happen in batches for me and happens on multiple machines. I don't think there is a thread discussing this. Restarting BOINC Manager fixes this.

2. Some tasks never end. The % keeps climbing but they have to be aborted. My record that I've noticed is 4-days of CPU time. The error on the Vbox screen is always the same, but might be misleading (Spectre error). Thread here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14897

I observed sort of a combination of both problems, but maybe it is something different altogether. In my case, VirtualBox-tasks seem to start in the BOINC Manager (Status: "Running"), when actually the VM that belongs to that task hangs while booting. The progress indicated in the BOINC manager asymptotically approaches 100%, but never reaches it. These tasks run until they are manually aborted. They don't consume much CPU time. I observed the problem using Linux and using Windows. When using Windows, I checked the VBox screen: in one case it was completely empty, in the other case it showed the error message
Couldn't copy file: fwrite() failed

I have not seen the Spectre error yet. In Linux I haven't yet figured out how to check the VBox-screen.
@Jim1348 described such tasks in the forum as "0 CPU-tasks".
I have written a watchdog-script to abort such tasks as soon as possible. This is a good workaround for me, although of course it would be better if the problem were fixed.


Have I missed any major issues?

4. Some tasks get the status "Postponed: VM job unmanagable, restarting later." and block a slot where another work unit could be processed. This was discussed here and here. In my experience, after restarting the BOINC client, the results of such tasks are reported and new work units are downloaded and processed.
8) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 104462)
Posted 23 Jan 2022 by gbayler
Post:
How did this come about? Faulty RAM?
No, perfectly stable machine with RAM tested. Nothing else does this, must be a python fault. VB tasks from LHC work smoothly.

Strange!

On my 2 PCs roughly 20% of the rosetta python tasks have to be aborted; if I didn't abort them, they would just exceed the deadline, so I don't have much of an alternative. Most of the aborted tasks can be completed and validated on other systems, regardless of the OS. It would be really interesting to know why they don't run on my systems.
It's very strange and nobody knows the answer, except perhaps the silent staff that aren't here! Some machines can't run them, some can half the time, some can most of the time, there must be a huge list of bugs in it.

I see! Very strange indeed!
9) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 104460)
Posted 23 Jan 2022 by gbayler
Post:
Errors that result in systems being black listed from getting work

When do systems get blacklisted?
I'm using a self-written script to abort tasks that don't process, wondering whether this can get my systems blacklisted too.
I have 7 Windows systems. 1 works fine 99% of the time. 5 do no CPU time and I've cancelled loads. 1 looked like it was working, but produced errors when validating. Only that last one got blacklisted (after 100 failed tasks), and only blacklisted from python. I assume if it was my own account I could go and switch it back on, but through gridcoin I cannot, and the admin can't be bothered doing it for me.

Interesting, thank you for the answer!
1 looked like it was working, but produced errors when validating.

How did this come about? Faulty RAM?

On my 2 PCs roughly 20% of the rosetta python tasks have to be aborted; if I didn't abort them, they would just exceed the deadline, so I don't have much of an alternative. Most of the aborted tasks can be completed and validated on other systems, regardless of the OS. It would be really interesting to know why they don't run on my systems.
10) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 104441)
Posted 23 Jan 2022 by gbayler
Post:
Do you see the drive in Disk Management? If yes, can you assign it a drive letter there?
11) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 104438)
Posted 23 Jan 2022 by gbayler
Post:
Errors that result in systems being black listed from getting work

When do systems get blacklisted?
I'm using a self-written script to abort tasks that don't process, wondering whether this can get my systems blacklisted too.
12) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 104400)
Posted 22 Jan 2022 by gbayler
Post:
For the Linux-users out there: I have written a Perl-script boinc_watchdog.pl that checks for "0 CPU"-tasks (tasks with a very low CPU utilization, that likely won't terminate) and whether there is at least one task executing. If it finds "0 CPU"-tasks, it aborts them, and if there is not a single task executing, it restarts the boinc-client. I run it every 30 minutes as a cron job; for me, it works quite well. I am perfectly aware that this doesn't solve the root cause of the current problems, this is merely a workaround. Still, I think it is an improvement in comparison to having to manually abort tasks or restart the PC every other day.

Here you can find it: https://github.com/gbayler/boinc_watchdog

Hope that it is useful for someone else too! :)

Günther
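The logic described in this post can be sketched in Python as follows. This is not the linked boinc_watchdog.pl, just a minimal illustration under stated assumptions: the thresholds `min_elapsed` and `max_ratio` are hypothetical, the field names ("active_task_state", "current CPU time", "elapsed task time", "project URL") are assumed to appear as "key: value" lines in the `boinccmd --get_tasks` output, and times are assumed to be printed in seconds. The client restart is only indicated by a message, because the right command depends on the system.

```python
import re
import subprocess


def parse_tasks(output):
    """Split `boinccmd --get_tasks` text into one dict of fields per task."""
    tasks = []
    for line in output.splitlines():
        stripped = line.strip()
        if re.match(r"^\d+\) -+", stripped):
            tasks.append({})  # a line like "1) -----------" starts a new task
        elif tasks and ":" in stripped:
            key, _, value = stripped.partition(":")
            tasks[-1][key.strip()] = value.strip()
    return tasks


def zero_cpu_tasks(tasks, min_elapsed=3600.0, max_ratio=0.05):
    """Return executing tasks whose CPU time is tiny compared to elapsed time.

    min_elapsed and max_ratio are hypothetical thresholds for this sketch.
    """
    stuck = []
    for task in tasks:
        if task.get("active_task_state") != "EXECUTING":
            continue
        try:
            elapsed = float(task["elapsed task time"])
            cpu = float(task["current CPU time"])
        except (KeyError, ValueError):
            continue  # field missing or not numeric: leave the task alone
        if elapsed >= min_elapsed and cpu / elapsed < max_ratio:
            stuck.append(task)
    return stuck


def any_executing(tasks):
    """Return True if at least one task is currently executing."""
    return any(t.get("active_task_state") == "EXECUTING" for t in tasks)


def main():
    """One watchdog pass: abort "0 CPU" tasks, report if nothing is running."""
    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout
    tasks = parse_tasks(out)
    for task in zero_cpu_tasks(tasks):
        subprocess.run(["boinccmd", "--task", task["project URL"],
                        task["name"], "abort"], check=True)
    if not any_executing(tasks):
        print("No task is executing; the boinc-client would be restarted here.")
```

Calling `main()` from a cron job every 30 minutes would mirror the setup described above; the parsing and detection functions can be tried on saved output without touching a running client.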
13) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 103949)
Posted 30 Dec 2021 by gbayler
Post:
@dcdc: Thank you for your answer!

In my case, there are ~14 GB free on the disk. That's too little to get additional tasks; I can see entries like this in the syslog:
Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Sending scheduler request: To fetch work.
Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Requesting new tasks for CPU
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] Scheduler request completed: got 0 new tasks
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] No tasks sent
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] rosetta python projects needs 5292.79MB more disk space.  You currently have 13780.69 MB available and it needs 19073.49 MB.
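The numbers in the last log line can be cross-checked with a short Python sketch. The function name and the regular expression are made up for this example and only fit the wording of the message shown above; other BOINC messages may be phrased differently.

```python
import re

# The scheduler message from the syslog excerpt above.
MESSAGE = ("rosetta python projects needs 5292.79MB more disk space.  "
           "You currently have 13780.69 MB available and it needs 19073.49 MB.")


def disk_shortfall(message):
    """Extract reported, available and required MB and recompute the shortfall."""
    match = re.search(
        r"needs ([\d.]+)\s*MB more disk space\.\s+"
        r"You currently have ([\d.]+) MB available and it needs ([\d.]+) MB",
        message,
    )
    if match is None:
        return None
    reported, available, required = (float(g) for g in match.groups())
    return {
        "reported": reported,
        "available": available,
        "required": required,
        "computed": required - available,  # should agree with "reported"
    }


info = disk_shortfall(MESSAGE)
print(f"required - available = {info['computed']:.2f} MB "
      f"(reported: {info['reported']:.2f} MB)")
```

The recomputed shortfall agrees with the reported one to within rounding of the displayed values, so roughly 5.3 GB would have to be freed before the scheduler sends new VirtualBox tasks.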

Not sure whether this interferes with the running tasks. In addition to the 3 problematic tasks there are 2 other tasks (also VBox tasks) on this machine that seem to run normally.

I'm using Ubuntu 21.10 on an i5-8400, if that makes a difference.

The system has now created another task for the workunit that wasn't finished in time. I'm curious whether the next computer processing this WU will experience the same problems!
14) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 103946)
Posted 30 Dec 2021 by gbayler
Post:
I have 3 WUs/tasks running longer than any other tasks I have seen before; they don't seem to terminate. Their progress asymptotically approaches 100%, but apparently never reaches it.

These are the WUs in question:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1462247667 progress: 99.986% elapsed: 2d 23:19:00 CPU time: 00:19:44
https://boinc.bakerlab.org/rosetta/result.php?resultid=1462512698 progress: 99.929% elapsed: 2d 10:03:00 CPU time: 00:15:56
https://boinc.bakerlab.org/rosetta/result.php?resultid=1462518266 progress: 99.822% elapsed: 2d 02:42:00 CPU time: 00:13:54
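To put the figures above into perspective, the following Python sketch converts the elapsed and CPU times into seconds and computes the share of elapsed time that was actually spent on the CPU. The helper name `to_seconds` is made up for this example, and the duration format ("2d 23:19:00" or "00:19:44") is simply the one used in the list above.

```python
import re


def to_seconds(text):
    """Convert a duration like '2d 23:19:00' or '00:19:44' to seconds."""
    match = re.fullmatch(r"(?:(\d+)d )?(\d+):(\d+):(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# (result id, elapsed time, CPU time) as listed above
TASKS = [
    ("1462247667", "2d 23:19:00", "00:19:44"),
    ("1462512698", "2d 10:03:00", "00:15:56"),
    ("1462518266", "2d 02:42:00", "00:13:54"),
]

for result_id, elapsed, cpu in TASKS:
    ratio = to_seconds(cpu) / to_seconds(elapsed)
    print(f"{result_id}: CPU time is {ratio:.2%} of elapsed time")
```

For all three tasks the CPU time is well below 1% of the elapsed time, which is what makes them stand out from tasks that run normally.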

Do I have to manually abort such WUs?

Best regards,

Günther