rosetta python projects (vbox64)

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102698 - Posted: 18 Sep 2021, 21:01:32 UTC - in response to Message 102685.  

[quote]Would you please include a link to one of the tasks that is causing this?[/quote]

For sure!!
1426492275 - predecessor errored out same issue
1426492111 - sent to another system ok.
1426492670 - predecessor errored out same issue
etc
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1680
Credit: 17,842,955
RAC: 23,019
Message 102700 - Posted: 18 Sep 2021, 21:53:07 UTC - in response to Message 102684.  

[quote]Something different. The task listing page is nuts.
It says I have 121 python tasks, but shows only 8 tasks.[/quote]
Ghosts.
The server thinks you got them, but you didn't.
Grant
Darwin NT
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102703 - Posted: 18 Sep 2021, 23:20:34 UTC - in response to Message 102700.  

[quote][quote]Something different. The task listing page is nuts.
It says I have 121 python tasks, but shows only 8 tasks.[/quote]
Ghosts.
The server thinks you got them, but you didn't.[/quote]


Weird...all this shows up after I split my system up into sections for all the different projects.
I've got issues with the scheduler as well. I've got something like 8 screens' worth of Rosetta work.
I just went to no new tasks. According to eFMer's BoincTasks I have 858 4.2 tasks and 101 python.
BOINC Manager seems to agree.

I think Rosie has gone nuts! That or BOINC has gone nuts.
4.2 tasks equal about 292 days of work. I just don't get it.
Python work is about a month long.
Crazy
I guess I had better add something to app_config to limit work.
If I could send some of this to our Italian friend I would.
I'm just going to grind through it and let the project decide what can't meet the limit.
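
As a rough sanity check on those numbers, here is a minimal Python sketch. It assumes the 292-day figure is a plain sum of per-task runtime estimates, which is an assumption and not something BOINC reports. The function names and the 16-task concurrency figure are hypothetical.

```python
# Rough check of the queue estimate quoted in the post above.
# Assumption: the 292 days is a simple sum of per-task estimates.

def implied_hours_per_task(task_count: int, total_days: float) -> float:
    """Average estimated runtime per task, in hours."""
    return total_days * 24 / task_count


def days_to_clear(task_count: int, hours_per_task: float, concurrent_tasks: int) -> float:
    """Wall-clock days to finish the queue when tasks run in parallel."""
    return task_count * hours_per_task / concurrent_tasks / 24


hours = implied_hours_per_task(858, 292)   # figures from the post
print(f"{hours:.2f} hours per task")        # prints "8.17 hours per task"

# 16 concurrent tasks is a hypothetical figure, not taken from the post.
print(f"{days_to_clear(858, hours, 16):.2f} days")
```

Under that assumption each 4.2 task is estimated at roughly 8 hours, so the queue only clears in a reasonable time if many tasks run at once.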
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 102704 - Posted: 18 Sep 2021, 23:44:50 UTC - in response to Message 102703.  

[quote]I think Rosie has gone nuts! That or BOINC has gone nuts.
4.2 tasks equal about 292 days of work. I just don't get it.
Python work is about a month long.
Crazy
I guess I had better add something to app_config to limit work.
If I could send some of this to our Italian friend I would.
I'm just going to grind through it and let the project decide what can't meet the limit.[/quote]

Don't add "max_concurrent" or "project_max_concurrent". That causes the problem.
We just had a discussion on it at LHC.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5720&postid=45320#45320
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5720&postid=45323#45323
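
For anyone unsure whether their app_config.xml contains either setting, a minimal Python sketch like the following would list them. The sample file contents, the app name `rosetta`, the limit values, and the function name are placeholders for illustration; only the `max_concurrent` and `project_max_concurrent` element names come from the post above.

```python
import xml.etree.ElementTree as ET

# Placeholder app_config.xml contents; the app name and values are made up.
SAMPLE_APP_CONFIG = """
<app_config>
  <app>
    <name>rosetta</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <project_max_concurrent>8</project_max_concurrent>
</app_config>
"""


def find_concurrency_limits(xml_text: str) -> list[tuple[str, int]]:
    """Return (description, value) pairs for every concurrency limit found."""
    root = ET.fromstring(xml_text.strip())
    limits = []
    # Per-application limits live inside each <app> element.
    for app in root.findall("app"):
        value = app.findtext("max_concurrent")
        if value is not None:
            name = app.findtext("name", default="(unnamed)")
            limits.append((f"app {name}: max_concurrent", int(value.strip())))
    # The project-wide limit is a direct child of <app_config>.
    value = root.findtext("project_max_concurrent")
    if value is not None:
        limits.append(("project_max_concurrent", int(value.strip())))
    return limits


for description, value in find_concurrency_limits(SAMPLE_APP_CONFIG):
    print(f"{description} = {value}")
```

Feed it the text of the app_config.xml from your own project directory; an empty list means neither limit is set.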
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1680
Credit: 17,842,955
RAC: 23,019
Message 102705 - Posted: 18 Sep 2021, 23:59:48 UTC - in response to Message 102704.  
Last modified: 19 Sep 2021, 0:07:56 UTC

[quote][quote]I think Rosie has gone nuts! That or BOINC has gone nuts.
4.2 tasks equal about 292 days of work. I just don't get it.
Python work is about a month long.
Crazy
I guess I had better add something to app_config to limit work.
If I could send some of this to our Italian friend I would.
I'm just going to grind through it and let the project decide what can't meet the limit.[/quote]
Don't add "max_concurrent" or "project_max_concurrent". That causes the problem.[/quote]
From memory, that is what is there, and I mentioned that it is probably what is causing the issue. But Greg_BE doesn't like the way BOINC schedules things, and he runs other non-BOINC software that heavily uses the CPU, so he uses max_concurrent/project_max_concurrent to limit project CPU usage. I have suggested reserving CPU cores per task per GPU project to help; at one stage run time was 3 or more times greater than CPU time on several of his projects' tasks.

See the "Project scheduler has gone nuts (4.2 tasks)" thread,
and "One task using 14 cores and only 36% +/- of that total cpu power" for why max_concurrent is being used.



I thought this issue was meant to have been resolved one or 2 BOINC versions back?
Grant
Darwin NT
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 102706 - Posted: 19 Sep 2021, 1:10:18 UTC - in response to Message 102705.  

[quote]I thought this issue was meant to have been resolved one or 2 BOINC versions back?[/quote]

I had the problem on BOINC 7.16.11, the latest official one from Ubuntu.
There is 7.16.17 from Locutus-of-Borg, though I don't know whether it fixes it.
Sid Celery
Joined: 11 Feb 08
Posts: 2124
Credit: 41,219,446
RAC: 10,842
Message 102707 - Posted: 19 Sep 2021, 3:54:19 UTC

Sorry for not being around much recently - either on the forums here or paying attention to my own hosts.
I've been pointed to this thread and can see how bad this has got.

I don't have the first idea about VirtualBox, so rather than collate all the issues I've just sent a link to this whole forum thread to the Admin and hope they can deal with it quickly.

You'll probably see the results before me tbh
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,613,739
RAC: 9,057
Message 102708 - Posted: 19 Sep 2021, 6:22:10 UTC - in response to Message 102697.  

[quote]So it is NOT your system all the time. It is the tasks that are buggy.[/quote]


I'm starting to think there is a problem with AMD CPUs....
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,613,739
RAC: 9,057
Message 102709 - Posted: 19 Sep 2021, 7:09:08 UTC

Still today: 1426749997

5 days with the same error.
I also wrote to the R@H Twitter account, without an answer.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102712 - Posted: 19 Sep 2021, 8:30:33 UTC - in response to Message 102709.  

[quote]Still today: 1426749997

5 days with the same error.
I also wrote to the R@H Twitter account, without an answer.[/quote]



They don't monitor social media.
See how far back their last post is?
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102713 - Posted: 19 Sep 2021, 8:31:19 UTC - in response to Message 102708.  

[quote][quote]So it is NOT your system all the time. It is the tasks that are buggy.[/quote]
I'm starting to think there is a problem with AMD CPUs....[/quote]

Have you done a chipset update or recently downloaded any new software drivers for your CPU?
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102714 - Posted: 19 Sep 2021, 8:33:02 UTC - in response to Message 102707.  

[quote]Sorry for not being around much recently - either on the forums here or paying attention to my own hosts.
I've been pointed to this thread and can see how bad this has got.

I don't have the first idea about VirtualBox, so rather than collate all the issues I've just sent a link to this whole forum thread to the Admin and hope they can deal with it quickly.

You'll probably see the results before me tbh[/quote]




The only references for VBox projects that have really good answers are over at LHC, on the ATLAS board, or found by searching the web for info on BOINC itself.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102715 - Posted: 19 Sep 2021, 9:23:45 UTC

From Mod.Sense in 2008

The transfer of the file to your system was interrupted (incomplete read) before it completed, and therefore the resulting file on your system did not match the required signature. When several WUs require the same file, BOINC only downloads it once. Since that download failed, all of the WUs that needed the file failed as well.

So, basically, it sounds like your internet connection at the time was interrupted or unreliable. ...or perhaps the project's servers were so busy, they only got a portion of the file sent to you prior to the connection timing out, but that would be very rare.

BOINC will recover itself from such a situation, and request more work and assuming the network is stable later, it will then proceed normally. The project servers basically expect a certain amount of such failures, and so the work will eventually be reassigned to other machines.
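
To illustrate the failure mode Mod.Sense describes, here is a minimal Python sketch of an integrity check on a downloaded file. It uses SHA-256 purely as a stand-in; BOINC's actual checksum and signature scheme is not shown here, and the file contents, task names, and helper functions are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hex digest used here as a stand-in for the project's file check."""
    return hashlib.sha256(data).hexdigest()


def is_intact(received: bytes, expected_digest: str) -> bool:
    """True only if the received bytes match the digest the server published."""
    return digest(received) == expected_digest


# Hypothetical shared input file and the digest the server would publish.
full_file = b"shared input data " * 1000
expected = digest(full_file)

# An interrupted transfer delivers only part of the file.
partial_file = full_file[: len(full_file) // 2]

# Several tasks depend on the same single download (names are made up).
tasks = ["task_a", "task_b", "task_c"]
download_ok = is_intact(partial_file, expected)
failed = [] if download_ok else list(tasks)

print(is_intact(full_file, expected))  # True
print(download_ok)                     # False
print(failed)                          # ['task_a', 'task_b', 'task_c']
```

Because the file is fetched once and shared, a single truncated download is enough to fail every task that depends on it.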

So maybe reboot your modem box as well as the other things mentioned.
Otherwise it is beyond anything any of us can do here.
I've been digging around and it is very hard to find anything that relates to this problem.
I sent you a few PMs with some other ideas, such as running CHKDSK from Windows.
Already suggested resetting the project.
Only other thing I would do myself is run a HDD cleaner and a Registry cleaner just to make sure my system is clean. You have made a few changes that may or may not have disrupted the system.

If you have a clean system, a clean reset of the project, "repaired" BOINC via the installer, the latest chipset drivers for your CPU, the latest VBox and extension packs, and you have reset your modem (I would also check with your tech support about the latest firmware for your modem) and checked all your network connections, then there really is nothing more I can contribute. I have searched every possibility and every variation I could think of for this error. It just seems to come back to server-side issues or stuff that does not apply to this project. This really is something for the tech guru at the UW to figure out.

Your latest error task is back in queue and not assigned to anyone yet.
Last night I pointed out that 2 out of the 3 tasks you got were buggy already and the server stopped sending them since it was 2 consecutive errors of the same type.
If there are more from other people who do not monitor their systems closely, then eventually the results will start showing more and more errors, and the person in charge of IT or VBox will have to deal with it because the researcher will be upset. What percentage has to come back before they do anything? I don't know.

I see that there are now 7,479 NEW 4.20 tasks (as of the time of this post), so maybe you can get a piece of that while Python errors out, and at least get some credit.

I will leave it to Sid and his contact to push the lab into doing something. It's beyond my ability to comprehend now.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102717 - Posted: 19 Sep 2021, 9:46:03 UTC

Sid Celery - make sure the admin or whoever also reads this thread: [url]http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480[/url]

There it is pointed out that most of the people reporting problems are Windows machine users.
Sid Celery
Joined: 11 Feb 08
Posts: 2124
Credit: 41,219,446
RAC: 10,842
Message 102721 - Posted: 19 Sep 2021, 11:55:24 UTC - in response to Message 102717.  

[quote]Sid Celery - make sure the admin or whoever also reads this thread: [url]http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480[/url]

There it is pointed out that most of the people reporting problems are Windows machine users.[/quote]

Done.
I think the first msg I sent was around 8pm Seattle time and this one about 3am - and now it's Sunday, so I don't know how quick the response will be.
Might reasonably be another day, but let's see
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,613,739
RAC: 9,057
Message 102722 - Posted: 19 Sep 2021, 12:11:08 UTC - in response to Message 102714.  

[quote]The only references for VBox projects that have really good answers are over at LHC, on the ATLAS board, or found by searching the web for info on BOINC itself.[/quote]

The strange thing is that I've crunched CMS@Home and Test4Theory (both use VirtualBox) without problems.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102723 - Posted: 19 Sep 2021, 12:50:38 UTC - in response to Message 102721.  

[quote][quote]Sid Celery - make sure the admin or whoever also reads this thread: [url]http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480[/url]

There it is pointed out that most of the people reporting problems are Windows machine users.[/quote]
Done.
I think the first msg I sent was around 8pm Seattle time and this one about 3am - and now it's Sunday, so I don't know how quick the response will be.
Might reasonably be another day, but let's see[/quote]

Well, you know Baker Lab is only M-F, with no weekends or holidays. If something is down, we just have to wait for the next weekday.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 102725 - Posted: 19 Sep 2021, 12:58:05 UTC - in response to Message 102722.  

[quote][quote]The only references for VBox projects that have really good answers are over at LHC, on the ATLAS board, or found by searching the web for info on BOINC itself.[/quote]
The strange thing is that I've crunched CMS@Home and Test4Theory (both use VirtualBox) without problems.[/quote]



I think part of it is that this is the first time in the history of RAH that they have used VBox, and as such they are not familiar with all its little quirks. Plus, I read somewhere that these tasks have not been vetted on RALPH, which makes them prone to bugs.

LHC projects are much more advanced and perhaps staffed by people who understand this kind of thing better. How long has CMS been up? Since 2015 or something? And how long have they been doing VM-type work?
ATLAS has been up since 2014, and they have been doing VM for how long?
You can't compare RAH and these guys. The difference is too great.
RAH is an infant when it comes to VM.
Sid Celery
Joined: 11 Feb 08
Posts: 2124
Credit: 41,219,446
RAC: 10,842
Message 102746 - Posted: 20 Sep 2021, 1:03:56 UTC - in response to Message 102723.  

[quote][quote][quote]Sid Celery - make sure the admin or whoever also reads this thread: [url]http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480[/url]

There it is pointed out that most of the people reporting problems are Windows machine users.[/quote]
Done.
I think the first msg I sent was around 8pm Seattle time and this one about 3am - and now it's Sunday, so I don't know how quick the response will be.
Might reasonably be another day, but let's see[/quote]
Well, you know Baker Lab is only M-F, with no weekends or holidays. If something is down, we just have to wait for the next weekday.[/quote]

That's not strictly true, to be fair. A few months ago tasks stopped coming through because they were working on the transitioner one Sunday and it started crashing and wasn't noticed or fixed until they found it on Monday.
More true to say, when they go on holiday (eg over Xmas) the key people can go completely offline until they return.

That said, I've had no acknowledgement yet, so this is likely to be one of those occasions.
When they're around I've had replies/solutions within the hour.
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,613,739
RAC: 9,057
Message 102753 - Posted: 20 Sep 2021, 10:01:03 UTC - in response to Message 102725.  
Last modified: 20 Sep 2021, 10:01:23 UTC

[quote]ATLAS has been up since 2014, and they have been doing VM for how long?
You can't compare RAH and these guys. The difference is too great.
RAH is an infant when it comes to VM.[/quote]

OK, it's no problem if they are "an infant" in the VM field.
But after 6 days, they could give us a sign of life.
Today, still a download error.



©2024 University of Washington
https://www.bakerlab.org