Message boards : Number crunching : rosetta python projects (vbox64)
Author | Message |
---|---|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
[quote]Would you please include a link to one of the tasks that is causing this?[/quote] For sure!! 1426492275 - predecessor errored out, same issue; 1426492111 - sent to another system, ok; 1426492670 - predecessor errored out, same issue; etc. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,754,335 RAC: 22,824 |
Something different. The task listing page is nuts. Ghosts. The server thinks you got them, but you didn't. Grant Darwin NT |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Something different. The task listing page is nuts. Ghosts. Weird... all this shows up after I split my system up into sections for all the different projects. I've got issues with the scheduler as well. I've got something like 8 screens' worth of Rosetta work. I just went to no new tasks. According to eFMer BoincTasks I have 858 4.2 tasks and 101 python. BOINC Manager seems to agree. I think Rosie has gone nuts! That or BOINC has gone nuts. 4.2 tasks equal about 292 days of work, I just don't get it. Python work is about a month long. Crazy. I guess I had better add something to app_config to limit work. If I could send some of this to our Italian friend I would. I'm just going to grind through it and let the project decide what can't meet the limit. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I think Rosie has gone nuts! That or BOINC has gone nuts. Don't add "max_concurrent" or "project_max_concurrent". That causes the problem. We just had a discussion on it at LHC. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5720&postid=45320#45320 https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5720&postid=45323#45323 |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,754,335 RAC: 22,824 |
I think Rosie has gone nuts! That or BOINC has gone nuts. From memory that is what is there, and I mentioned that is what is probably causing the issue, but Greg_BE doesn't like the way BOINC schedules things, and he runs other non-BOINC software that heavily uses the CPU, so he uses max_concurrent/project_max_concurrent to limit project CPU usage. I have suggested reserving CPU cores per task per GPU project to help; at one stage run time was 3 or more times greater than CPU time on several of his projects' tasks. See the "Project scheduler has gone nuts (4.2 tasks)" thread, and "One task using 14 cores and only 36% +/- of that total cpu power" for why max_concurrent is being used. I thought this issue was meant to have been resolved one or 2 BOINC versions back? Grant Darwin NT |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I thought this issue was meant to have been resolved one or 2 BOINC versions back? I had the problem on BOINC 7.16.11, the latest official one from Ubuntu. There is 7.16.17 from Locutus-of-Borg, though I don't know whether it fixes it. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
Sorry for not being around much recently - either on the forums here or paying attention to my own hosts. I've been pointed to this thread and can see how bad this has got. I don't have the first idea about VirtualBox, so rather than collate all the issues I've just sent a link to this whole forum thread to the Admin and hope they can deal with it quickly. You'll probably see the results before me tbh |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,555,377 RAC: 6,312 |
So it is NOT your system all the time. It is the tasks that are buggy. I'm starting to think there is a problem with AMD CPUs.... |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,555,377 RAC: 6,312 |
Still today: 1426749997. 5 days with the same error. I also wrote to the R@H Twitter account, without an answer. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Still today: 1426749997 They don't monitor social media. See how far back their last post is? |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
So it is NOT your system all the time. It is the tasks that are buggy. Have you done a chipset update or recently downloaded any new software drivers for your cpu? |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Sorry for not being around much recently - either on the forums here or paying attention to my own hosts. The only references for VBox projects that have really good answers are over on LHC on ATLAS or searching the web for info on BOINC itself. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
From Mod.Sense in 2008: The transfer of the file to your system was interrupted (incomplete read) before it completed, and therefore the resulting file on your system did not match the required signature. When several WUs require the same file, BOINC only downloads it once. Since that download failed, all of the WUs that needed the file failed as well. So, basically, it sounds like your internet connection at the time was interrupted or unreliable. ...or perhaps the project's servers were so busy, they only got a portion of the file sent to you prior to the connection timing out, but that would be very rare. BOINC will recover itself from such a situation, and request more work and assuming the network is stable later, it will then proceed normally. The project servers basically expect a certain amount of such failures, and so the work will eventually be reassigned to other machines. So maybe reboot your modem box as well as the other things mentioned. Otherwise it is beyond anything any of us can do here. I've been digging around and it is very hard to find anything that relates to this problem. I sent you a few PMs with some other ideas, such as CHKDSK from Windows. Already suggested resetting the project. The only other thing I would do myself is run a HDD cleaner and a Registry cleaner just to make sure my system is clean. You have made a few changes that may or may not have disrupted the system. If you have a clean system, a clean reset of the project, "repaired" BOINC via the installer, latest chipset drivers for your CPU, latest VBOX and extension packs and you have reset your modem (I would also check with your tech support about the latest BIOS for your modem) and you have checked all your network connections, then there really is nothing more I can contribute. I have searched every possibility and every variation I could think of for this error. It just seems to come back to server side issues or stuff that does not apply to this project. 
This really is something for the tech guru at the UW to figure out. Your latest error task is back in the queue and not assigned to anyone yet. Last night I pointed out that 2 out of the 3 tasks you got were buggy already and the server stopped sending them since it was 2 consecutive errors of the same type. If there are more from other people that do not monitor their systems closely, then eventually the results will start showing more and more errors and the person in charge of IT or Vbox will have to deal with it because the researcher will be upset. What percent has to come back before they do anything? I don't know. I see that there are now 7,479 NEW 4.20 tasks (as of the time of this post), so maybe you can get a piece of that while Python will error out and at least you could get some credit. I will leave it to Sid and his contact to push the lab into doing something. It's beyond my ability to comprehend now. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Sid Celery - make sure the admin or whoever also reads this thread: http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480 There it is pointed out that most of the people reporting problems are Windows machine users. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
Sid Celery - make sure the admin or whoever also reads this thread: http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480 Done. I think the first msg I sent was around 8pm Seattle time and this one about 3am - and now it's Sunday, so I don't know how quick the response will be. Might reasonably be another day, but let's see |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,555,377 RAC: 6,312 |
The only references for VBox projects that have really good answers are over on LHC on ATLAS or searching the web for info on BOINC itself. The strange thing is that I've crunched CMS@Home and Test4Theory (both use VirtualBox) without problems |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Sid Celery - make sure the admin or whoever also reads this thread: http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480 Well you know Baker Lab is only M-F and no weekends or holidays. If something is down, we just have to wait for the next weekday. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
The only references for VBox projects that have really good answers are over on LHC on ATLAS or searching the web for info on BOINC itself. I think part of it is that this is the first time in the history of RAH that they have used Vbox and as such are not familiar with all its little quirks. Plus I read somewhere these tasks have not been vetted on RALPH, which makes them prone to bugs. LHC projects are much more advanced and perhaps staffed by people that understand this kind of thing better. How long has CMS been up? 2015 or something? And how long have they been doing VM type work? ATLAS has been up since 2014 and they have been doing VM for how long? You can't compare RAH and these guys. The difference is too great. RAH is an infant when it comes to VM. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
Sid Celery - make sure the admin or whoever also reads this thread: http://srv1.bakerlab.org/rosetta/forum_thread.php?id=6893&sort_style=5&start=2480 That's not strictly true, to be fair. A few months ago tasks stopped coming through because they were working on the transitioner one Sunday and it started crashing and wasn't noticed or fixed until they found it on Monday. More true to say, when they go on holiday (e.g. over Xmas) the key people can go completely offline until they return. That said, I've had no acknowledgement yet, so this is likely to be one of those occasions. When they're around I've had replies/solutions within the hour. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,555,377 RAC: 6,312 |
ATLAS has been up since 2014 and they have been doing VM for how long? Ok, it's no problem if they are "an infant" in the VM field. But, after 6 days, they could give us a sign of life. Today, still a download error |
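
For anyone checking whether their own setup uses the limits Jim1348 warns about earlier in the thread, here is a minimal sketch. It assumes an app_config.xml laid out with per-app `<max_concurrent>` entries and a top-level `<project_max_concurrent>`. The app names and numbers in the sample are illustrative assumptions, not taken from any host in this thread, and the helper `find_concurrency_limits` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_config.xml contents; app names and limits are assumptions
# made up for illustration, not copied from any machine in this thread.
SAMPLE_APP_CONFIG = """\
<app_config>
  <app>
    <name>rosetta</name>
    <max_concurrent>4</max_concurrent>
  </app>
  <app>
    <name>rosetta_python_projects</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <project_max_concurrent>6</project_max_concurrent>
</app_config>
"""


def find_concurrency_limits(xml_text):
    """Return (scope, limit) pairs for every concurrency cap in the config."""
    root = ET.fromstring(xml_text)
    limits = []
    # Per-application caps live inside each <app> element.
    for app in root.findall("app"):
        name = app.findtext("name", default="(unnamed)").strip()
        cap = app.findtext("max_concurrent")
        if cap is not None:
            limits.append((f"app:{name}", int(cap)))
    # The project-wide cap is a direct child of <app_config>.
    project_cap = root.findtext("project_max_concurrent")
    if project_cap is not None:
        limits.append(("project", int(project_cap)))
    return limits


if __name__ == "__main__":
    for scope, limit in find_concurrency_limits(SAMPLE_APP_CONFIG):
        print(f"{scope}: limited to {limit} concurrent tasks")
```

Feeding it the text of a real app_config.xml from the project's folder in the BOINC data directory would list every cap that is set; an empty list means neither option is in use.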
©2024 University of Washington
https://www.bakerlab.org