David E K Forum moderator Project administrator Project developer Project scientist Joined: Jul 1 05 Posts: 660 ID: 14 Credit: 838,217 RAC: 28
This app update includes a fix for checkpointing.
Please report issues and bugs here!
thanks,
DK
ID: 64951 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Hi,
I'll be resubmitting the *gbnnotyr* protein design trajectories to boinc over the next few hours. The tests I ran on ralph showed that the checkpointing issue is resolved. To make sure that there are no other issues, I will submit these trajectories 'slowly' starting with a modest sized batch, and according to the responses I get on the thread I will increase the number of work units over the next few days. Please keep me posted about these problems. Your reports have been invaluable in tracking this problem down!
Bad news, guys: I just woke up today and my homopt_cstmc WU is stuck at 40%, using no CPU time, although 3-4 other differently named WUs have gone through and been totally fine. Just thought I'd let you know.
ID: 64967 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Admin, please double check the application version those are running under. (it is shown in the tasks tab of the advanced view under the application column)
____________ Rosetta Moderator: Mod.Sense
I can 100% confirm I was running the new version, minirosetta 2.05, when I got the stuck homopt WU. Here's the WU link: http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282419440. A wingman seems to have also had a compute error, but I can confirm I was running the updated 2.05 client.
New app working well. And it seems that the WUs now need less RAM (about 100 MB per WU). Is that true? If it is, then maybe this is a step towards a Rosetta GPU client? :-)
Although I didn't grab a screenshot, the task details of the work unit show "application version 2.05". You can check it out at http://boinc.bakerlab.org/rosetta/result.php?resultid=310562856. I wish I could give you guys more information; is there anything else I can do to help you solve this issue? All other work so far has gone through fine, but upon further investigation the common factor is Windows 7. I have a boinc_filtered loopbuild_threading task running now at 33%, a type which gave me problems on 2.03, so I will see how it goes on 2.05 and give an update.
According to the time to completion, it's going to be a long old process too.
ID: 64977 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Thanks! If these were the *gbn* runs, then they have a memory-efficient first step, but they /might/ then go on to a memory-intensive step requiring 300-500 MB...
I wouldn't worry about it. A number of these have failed. I have just sent in two that failed on their second run.
While the boinc_filtered WU went through fine, I have another that has stalled: opttest2.2d4f..... Just thought I'd give an update: it froze at 18.046%. Other than that, 2.05 seems stable, although the graphics sometimes crash when I try to look at them.
I just had to shut down BOINC, which I did properly, to run a few programs quickly. It seems both WUs the computer was working on started from model 0 when the client restarted. Both units had 10-15 models done and were around 20% complete, which they still are (20% complete and now working on model 1). Did the units really just start over from 0 and erase all the previous work? Is this another issue we are tracking? Just trying to be helpful!
In another thread, I've seen something about workunits using one of the new features not having working checkpointing while that feature is running. Checkpointing still works for workunits that don't use that feature.
I was reading the 2.03 thread and saw something about the checkpoint issue, which I just saw myself; that's why I thought I would point it out. You're saying everything is fine even though the model counter says it's starting from 1 again, correct? Thanks for the help!
I too notice that version 2.05 uses less RAM, and not only on *gbn* tasks: somewhere around 200-250 MB instead of 300-350 MB in version 2.03.
Is this one of the "and other minor updates" mentioned in the "Version Release Log"?
If so, it doesn't seem all that "minor" to me :)
I noticed the following in the new version (though it may be a feature of this specific WU; I never came across this type of WU in version 2.03). During model calculation, the first steps go very fast: for example, 36000 steps were calculated in just 6 minutes. After that the calculation became very slow, and the following 10 steps took more than 10 minutes.
Is this intended?
Task example: job_boinc_1bm8__broker_random_pairings_from_psipred_16 906_1305_1
ID: 64994 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Please don't presume that the information from the Project Team is an inaccurate description and that your memory observations are a new and permanent condition for all to enjoy going forward. As Sarel points out, they introduced a new type of work unit which has a new low-memory phase to execution. And so you are only going to see the lower memory usage when that specific type of task is being worked on. And this new type of work unit was introduced in prior versions, so the actual delta to v2.05 is small. Since this new type of work is a current area of review, you may see a high concentration of this type of work for a period of time. But it doesn't mean we can presume more than was stated.
____________ Rosetta Moderator: Mod.Sense
For the past two days my Windows 7 machine has been bombing with occasional blue screen of death crashes. I ran the Microsoft debugger and it points to an issue with minirosetta 2.05.
MULTIPLE_IRP_COMPLETE_REQUESTS (44)
A driver has requested that an IRP be completed (IoCompleteRequest()), but
the packet has already been completed. This is a tough bug to find because
the easiest case, a driver actually attempted to complete its own packet
twice, is generally not what happened. Rather, two separate drivers each
believe that they own the packet, and each attempts to complete it. The
first actually works, and the second fails. Tracking down which drivers
in the system actually did this is difficult, generally because the trails
of the first driver have been covered by the second. However, the driver
stack for the current request can be found by examining the DeviceObject
fields in each of the stack locations.
Arguments:
Arg1: fffffa800afb3320, Address of the IRP
Arg2: 0000000000000eae
Arg3: 0000000000000000
Arg4: 0000000000000000
Debugging Details:
------------------
IRP_ADDRESS: fffffa800afb3320
CUSTOMER_CRASH_COUNT: 1
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
BUGCHECK_STR: 0x44
PROCESS_NAME: minirosetta_2.
CURRENT_IRQL: 2
LAST_CONTROL_TRANSFER: from fffff8000285fb95 to fffff80002875f00
At last I have received enough WUs of this type to check. My conclusion: there are still problems with checkpointing. Unlike in version 2.03, the "CPU time at last checkpoint" information is now displayed correctly, which lets the BOINC client switch between projects, but after a restart the calculation still starts from the beginning.
Here is a task example that I watched: 8gbnnotyr_3gbn_2iug_9Jan2010_16915_7_0
Before the restart it had used 0:33 hours of CPU time with 27 models done; after the restart it ran another 1:27 hours and calculated 72 more models.
But apparently only the 72 models computed after the restart are reflected in the report; the first 27 models are missing, and the task also completed with a Validate error.
Here is another example: 8gbnnotyr_3gbn_1ijt_9Jan2010_16915_1_0
The same result: the report contains only the models computed after the restart, and there is a Validate error too.
For comparison, here is a task of this type that was computed without breaks: 8gbnnotyr_3gbn_1woj_9Jan2010_16909_12_0
Without interruption, 2 hours of CPU time produced 94 models (compare with 72 and 67 in the previous cases for the same 2 hours of CPU time), and Validate state = Valid.
The difference corresponds to roughly 0.5 hours of CPU time, which is about how much time had passed before the restarts.
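The half-hour estimate can be checked with a quick back-of-the-envelope calculation. This is only a sketch using the figures quoted in the post, and it assumes models are produced at a roughly constant rate, which may not hold for every task of this type:

```python
# Figures quoted in the post (CPU time in hours, models in the report).
uninterrupted_models = 94   # task that ran 2 hours without a restart
restarted_models = 72       # models in the report of the first restarted task
cpu_hours = 2.0

# Assumption: models are produced at a roughly constant rate.
models_per_hour = uninterrupted_models / cpu_hours
missing_models = uninterrupted_models - restarted_models
lost_hours = missing_models / models_per_hour

print(f"{models_per_hour:.0f} models/hour, {missing_models} models short, "
      f"about {lost_hours:.2f} h of CPU time lost")
```

Under that assumption the shortfall comes out just under half an hour, which is in line with the 0:33 of CPU time spent before the restart.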
Yes, I was mistaken here. It's simply that for some time after the new version 2.05 arrived, I received ONLY the new types of WU that use little RAM, from which I came to a (wrong) conclusion.
But now some WUs of the old types are coming in, and their memory usage is about the same as in version 2.03.
Soooo... this new version hangs too often. 2.03 was much more stable.
It hangs on my 2xAthlonMP 2800 as well as on the Intel E8400, so the CPU is not the issue.
I think about 15% of tasks get stuck in the middle, consuming >200 MB of RAM but no CPU.
I'm thinking of leaving Rosetta for a while until a new version is ready, as I'm tired of kicking off broken tasks every morning :(
____________
ID: 65013 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Looks like Mike Solo has 3 machines:
One WinXP using BOINC version 6.10.18
One WinXP using BOINC version 6.10.18
One WinServer 2003 using BOINC version 6.10.18
____________ Rosetta Moderator: Mod.Sense
P.S.
The last of these 3 (id 311163725) was stopped at the very beginning of operation, before the first checkpoint had been written. However, after the restart it ran to completion and still finished with a validate error.
So it is possible that the validate errors in this type of WU are not directly linked to checkpoints, and that these are 2 different bugs.
ID: 65020 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Thanks! We'll have a look at this as soon as possible and let you know what we find. Best, Sarel.
In the last week I've had to abort 11 tasks on W7 because the tasks are hung consuming 0% CPU time. I was hoping that the combination of upgrading to the latest BOINC and the new 2.05 version of R@h would fix the problem but no: it continues as before. Tasks on Mac OS X seem to be unaffected by this problem. Until there's some indication this problem is fixed I'm not getting any more tasks for W7.
Validate Error on Win7, successfully completed by a wingman on win xp
http://boinc.bakerlab.org/rosetta/result.php?resultid=311128874
name: 8gbnnotyr_3gbn_1iuk_9Jan2010_16915_131_0
# cpu_run_time_pref: 28800
======================================================
DONE :: 345 starting structures 28787.1 cpu seconds
This process generated 345 decoys from 345 attempts
======================================================
Note: On several occasions the following line appears:
No heartbeat from core client for 30 sec - exiting
Edit: Wingman running XP also received a validate error on apparently successful completion.
____________
ID: 65026 | Rating: 0
Miguel Veiga Joined: Oct 15 07 Posts: 1 ID: 212621 Credit: 84,354 RAC: 166
Hi guys, let me just tell you.
If you're using Windows 7, the beta version 6.10.24 or even the new beta 6.10.29 is much more stable.
I've used the beta 6.10.24 for a long time and had no problems at all with Rosetta.
For me it's much more stable than 6.10.18, on Windows 7 of course. Anyway, that's just my case.
I too have such an example: http://boinc.bakerlab.org/rosetta/result.php?resultid=311202691
Claimed credit = 54.35 vs. Granted credit = 1.83 (about 30 times lower)
And I can even tell exactly what happened with it:
Usually in this type of WU a model is computed very fast, about one to several minutes per model. This task started the same way: in approximately 15 minutes, 13 models were calculated (about 500 steps each). But at the 14th, something happened: the calculation did not stop at the 500th step and continued much longer. I saw the counter pass 40000 steps and did not watch any further (I think it was about 60000-70000 steps in total).
I was already thinking of aborting this task, since I thought the calculation had gone into a loop, but after 5 hours (instead of several minutes) the calculation of the 14th model did complete. I.e., 13 models took about 15 minutes, and the 14th about 5 hours.
Hence such a small Granted credit, since it is calculated in proportion to the number of models. (If not for this 14th model, about 300 models would have been calculated in 5 hours instead of 14, and Granted credit would have been close to Claimed credit.)
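For anyone who wants to check the numbers, here is a small sketch of the arithmetic. It only uses the figures quoted in this post and assumes the fast pace of the first 13 models would have continued for the whole 5 hours:

```python
# Credit figures quoted in the post.
claimed, granted = 54.35, 1.83
ratio = claimed / granted

# 13 fast models took about 15 minutes; extrapolate that pace over 5 hours.
# Assumption: the pace stays constant, which is only a rough guess.
fast_models, fast_minutes = 13, 15
models_in_5h = fast_models * (5 * 60) // fast_minutes

print(f"claimed/granted ratio: {ratio:.1f}")
print(f"models expected in 5 hours at the fast pace: {models_in_5h}")
```

The ratio matches the "about 30 times" figure, and the extrapolated model count is in the same range as the rough estimate of 300 given above.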
I think the same thing most likely happened in your task...
P.S.
Quite probably this is NOT an error but a feature of the algorithm: if it finds something interesting, a more detailed calculation of that model probably starts. It would be good if the scientists responsible for this type of WU could clarify this.
ID: 65032 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Hello,
based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results.
Thanks for the information Sarel - and David for the fix.
No further errors today, but a cursory check has revealed I haven't re-booted my desktop since Dec 15th! I'm sure I've had various updates since then, but that's a ridiculous amount of uptime for me... Back in 5... ;)
____________
Hi,
I have a strange graphic I wanted to show you... I *think* there *might* be a problem...
Please take a look at this screenshot:
http://www.flickr.com/photos/37828392@N08/4273113531/
(Capitain Flam is my account on Flickr)
Possible bug for the application BOINC / ROSETTA, because the protein is *completely* folded, in a tiny meat ball ;-)
I hope this is NOT a bug, or, if it is, I hope this will help you to solve it ;)
I do not think that it is an error in the software, but probably a weak spot in the scientific algorithm itself, so it should be addressed not to the programmers but to the scientists.
minirosetta 2.05 hangs on my computer frequently. It's a Windows Vista machine. The CPU meter shows no activity, the time to completion is incrementing instead of decrementing, and the screen saver for R@h is blank. I've shut my machine off and on again 3 times, and R@h runs normally after that: the CPU meter shows activity, the time to completion is decrementing (it has decreased from more than 10 hours to around 2 hours), and the screen saver works. This started happening in the second week of January. My machine was off from December 18 to January 8. After a few hours of running, R@h hangs again.
ID: 65040 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Hello,
based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results.
Let us know if you see more such problems.
Thanks, Sarel.
If I have my facts straight, Sarel means to say that credit is issued as normal. This means based on the average credit claims PER MODEL of the tasks reported before yours. This is a bit odd for Sarel's tasks because, as he's been explaining, there is a new technique where a quick cursory review of a given model is performed, and then some small percentage of those are deemed worth a more detailed review. And so model runtimes can vary from around 60 seconds, to several hours. So you will see credit all over the map. But it seems that on average most tasks spend the majority of their time crunching on one low level model, and so over time credit is still comparable with other types of Rosetta work.
If you somehow run through 60 models, and none require low level analysis, and you only allow a 1hr runtime preference, then you would probably see considerably more credit granted than your claim. As I say, this would be rather rare. If you run for a 24hr runtime preference, then you'll probably see several low level models. But then that is over a longer period of crunching too. But once you've run through several such tasks the credit will average out, as it always does.
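To make the per-model idea concrete, here is a minimal sketch of how such averaging could work. It is not the project's actual credit code: the function name `granted_credit`, the pooled-average formula, and the sample numbers are all assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple


def granted_credit(prior_reports: List[Tuple[float, int]],
                   models_done: int) -> Optional[float]:
    """Hypothetical per-model credit: average claim per model times models done.

    prior_reports holds (claimed_credit, models_completed) pairs for tasks
    reported earlier. Pooling all claims and all models is an assumption;
    the real server may average differently.
    """
    total_claimed = sum(claim for claim, _ in prior_reports)
    total_models = sum(models for _, models in prior_reports)
    if total_models == 0:
        return None  # no earlier reports to average over
    return total_claimed / total_models * models_done


# Made-up earlier reports: 20 credits for 80 models, 10 credits for 40 models.
earlier = [(20.0, 80), (10.0, 40)]
print(granted_credit(earlier, 14))   # a task stuck on one long model
print(granted_credit(earlier, 300))  # a task that only saw quick models
```

With these made-up figures a task that completes few models in a given time is granted far less than one that completes many, which is the "credit all over the map" effect described above.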
____________ Rosetta Moderator: Mod.Sense
ID: 65043 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Thanks RosettaMod for the clarification!
On another note, I've isolated why on restart the *gbn* runs report starting over from model 1. The fix for this will be part of the next update of the minirosetta application. Despite the confusion, the models that we get are unharmed and credit is allocated correctly.
Many thanks to the users who reported this for another bug catch!
____________
Could you elaborate what it is that you're seeing? These types of job are treated as others in these respects.
Hi
Won't these WUs, "8gbnnotyr" and the older "dock" types, be listed on the results pages?
With regards
Hi
There were some WUs not shown on the results page; here is a small (incomplete) collection from the last months:
aTt13
histone
1 famA
foldit WUs
denovo_design_rossmann2x3_flxbb (really RAM-eating and long-running ones)
NeR103A
CGR26A
and finally these two from 2010: CtR69A_2KRU_BOINC_ABRELAX, 3gbn bla-bla&gz_dock
And now the 8gbnnotyr WUs seem to have a similar fate: crunching only for credit.
____________
ID: 65049 | Rating: 0
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
The last 20 tasks on my computer completed without any validation errors.
(Among them were *gbnnotyr* tasks and tasks restarted during execution.)
So it seems this problem is solved.
Besides WUs of type *gbn*, the bug where only 1 model is shown after a restart occurs in many other types of tasks. But there it does not seem to affect the results sent to the server, only what is displayed in the graphics. So it is not a significant error. Does it make sense to report such cases?
ID: 65078 | Rating: 0
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0054FC53 read attempt to address 0xFFFFFFC0
____________
ID: 65091 | Rating: 0
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
Well, now I see two WUs being processed; one is consuming about 510 MB of RAM and the other 480 MB. I like such heavy WUs, give me more please! :-)
Just thought I'd add to the post above mine. I can also confirm that energy levels for t311 (same WU) have been sky high, with some values like 76053423 and RMSD running around 700. I thought it was strange, so I'd let you guys know.
ID: 65104 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
To help clarify, ofry has an anti-meatball. I don't see any problem in their screenshot though.
Admin, ofry's screenshot is of protein t374, so if you are doing t311, then it is a different protein... although perhaps using the same methods to study it.
Admin, how long would you see such high numbers? I'd think they'd settle down pretty quickly.
I don't believe these are Sarel's new ones, so you can see why he's working on the approach that makes that initial 60-100 second survey of a given model and then moves on to something more promising much of the time.
These proteins are very large, so when they are out of position and perhaps are nowhere near the natural conformation, the numbers can get pretty high... but 76m!?
____________ Rosetta Moderator: Mod.Sense
I have an "anti-meatball" with energy of e.g. 2K, 16K, etc.
Sample screenshot:
This bug may occur in tasks such as "boinc_filtered_loopbuild_threading_".
And these tasks usually take about 5 hours, although "Target CPU run time" is not selected in my preferences (default 3 hours).
P.S. Sorry for my English, I speak Russian.
I haven't seen ones like that yet (though the protein crumpling into a ball happens fairly often).
In Russian I would call this "an explosion at a pasta factory" :)
In general, it's not certain that this is a problem; it may simply be one of the early stages of modeling, since modeling initially starts with the protein stretched out into one long "rope". And unlike folding@home, the intermediate stages of modeling are not exact (matching how it happens in nature) but approximate and quite chaotic. So the intermediate shapes can be very odd and far from the original.
This is explained by the projects' different goals: in Folding, the scientists want to know HOW a protein folds from a chain into its natural shape/structure. In Rosetta, the goal is to determine only the final spatial structure of the protein (or the interaction of 2 proteins) from its known "amino acid formula", but to do so orders of magnitude (tens to hundreds of times) faster than Folding, with its "brute-force" modeling (at the level of individual atoms with a step of about 1 picosecond).
The "ball" (meatball), however, is a problem, since there seems to be some error there: the modeling overshoots the natural shape and simply starts crumpling the protein into a ball, moving farther and farther from the original (rather than approaching it).
The same problem exists on t303 too (but only on boinc_filtered_loopbuild_threading_ task types; it doesn't occur on others). I just didn't post similar screenshots.
"translate"
[quote]
Admin, ofry's screenshot is of protein t374, so if you are doing t311, then it is a different protein... although perhaps using the same methods to study it.
[/quote]
This problem occurs in many proteins, e.g. t303 too. But with other methods (not boinc_filtered_loopbuild_threading_) I don't see this problem. Maybe this method is bugged too.
I checked it last night, and it went through fine this morning, but the values were definitely very high, either 7.6 million or 760k. The protein wasn't even in the window, so I know it was a high value. It doesn't seem to be occurring with any other such WU right now, but I'll keep an eye out.
ID: 65111 | Rating: 0
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
If it's the high energy of 4K you're worried about - that's not unusual when runs are submitted with constraints - looks all good to me ..
Compute error occurred - Exit status -1073741819 (0xc0000005). Debug info is far too advanced for me to get any info from, so a team member will need to look at it. Occurred with cl1.2cmx.2cmx.IGNORE_THE_REST.c.0.25.pdb.pdb.JOB_17322_1_1.
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000
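As a side note for anyone comparing the two numbers in this report: the exit status -1073741819 and the code 0xc0000005 are the same 32-bit value, shown once as a signed integer and once as unsigned hex. A small sketch (the helper name `to_unsigned_hex` is just made up for this example):

```python
def to_unsigned_hex(exit_status: int) -> str:
    """Show a signed 32-bit exit status as unsigned hexadecimal."""
    # Masking with 0xFFFFFFFF reinterprets the value as an unsigned 32-bit integer.
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


print(to_unsigned_hex(-1073741819))  # 0xC0000005
```

On Windows, 0xC0000005 is the status code for an access violation, which matches the "Reason" line in the exception record above.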
Yes, sorry, I missed the OS info.
I introduced some Linux machines (Debian) instead of MS Win.
All looks stable under Linux.
I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The Rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (Climate Prediction, Malaria Control or World Community Grid).
Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.
Even if this works, it shouldn't be necessary to babysit BOINC/Rosetta in this way. This hanging certainly seems to be a widespread issue but one that only affects Windows in its various incarnations. The fact that it's irreproducible means a fix may be some time in coming but I hope the project team find it soon.
Some tasks (for example, of type abinitio_relax_homfrag_natfrag ....) take a lot of steps, up to 200000 - 400000 for 1 model. Is this normal?
And in another type of job (critStubs_profiled_1dnA_...), one task got abnormally small credit (1.97 cr and 9 decoys for 7k CPU seconds): http://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
Other WUs of the same type were credited normally (about ~20 cr and ~80 decoys for the same CPU time).
Is this a bug with partial loss of calculation results? Or is crediting simply uneven in this type of task? (As in tasks of type *gbnnotyr*, where "small" and "huge" models are combined in the same type of task.)
The fact that someone notices this occurring suggests babysitting to begin with.
____________
ID: 65158 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Hello, yes! The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details.
____________
ID: 65159 | Rating: 0
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Rosetta @ Home has produced many very high-quality designs for our Protein-interface design team! So we're likely to submit many more jobs to Rosetta @ Home. To help you recognize these jobs, we'll add a _Protein_Interface_Design_ note to every job name that is related to these jobs from now on. This way you'll be able to follow these jobs. I also hope that this will help you see where the variable-credit issue is coming from more easily.
The Rosetta application is spinning its wheels. It is continually running a task even though the task is 100% complete. There is another task to run, but Rosetta won't switch to it.
And what about this?:
> Some tasks (for example, the abinitio_relax_homfrag_natfrag ... type) take a very large number of steps, up to 200000 - 400000 for 1 model. Is this normal?
One more note: jobs of this type, resa_sel_core_1.5_low200_beta_low200_nostart_texcst_05_hb_t328__IGNORE_THE_REST_17378_267_0, seem to ignore the target CPU time. For example, this WU spent about 2.5 hours on the first model (already longer than the target time), but after the first model it started calculating a second model instead of sending the result. Total: 18850 seconds vs. cpu_run_time_pref = 7200 seconds.
In this example everything ended well, but in other circumstances this could exceed cpu_run_time_pref by more than 3 times, trigger the watchdog, and lose the results. In addition, some members may think that the task is stuck and abort it...
ID: 65162
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.
However, I would expect that once the target runtime is exceeded, and a model is completed, work on the WU should be ended and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.
Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.
____________ Rosetta Moderator: Mod.Sense
A couple of t287__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901 WUs on two different Linux machines failed after a few seconds claiming "process got signal 11".
To Mod.Sense:
Thanks for the clarification on the watchdog. Previously I had seen it kick in after 6 hours of calculation and thought that it fired after exceeding CPU TT x 3 (2h * 3 = 6h in my case). So in fact the correct formula is CPU TT + 4h, right? (In my case it just happens to give the same result: 2h + 4h = 6h.)
> In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.
Yes, it usually does. Here's an example of such a task: http://boinc.bakerlab.org/rosetta/result.php?resultid=313861637
The calculation of the first model took 5145 sec and the program ended processing, because a second model would exceed the CPU TT (5145 * 2 = 10290 > 7200).
Or another example: http://boinc.bakerlab.org/rosetta/result.php?resultid=314455813
The calculation of the two models took 4995 sec and the program ended processing, because a third model would exceed the CPU TT ((4995 / 2) * 3 = 7492 > 7200).
In these cases (and most others) the program's logic works correctly.
But in the example above, this algorithm seems to have failed.
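The arithmetic in the two examples above can be written out as a small check. This is only a sketch of the rule as it is described in this thread (stop when the average time per completed model predicts that one more model would pass the runtime preference). The function name and the averaging rule are assumptions made for illustration; this is not code taken from minirosetta.

```python
def should_start_next_model(cpu_seconds_used, models_done, runtime_pref):
    """Return True if one more model is predicted to fit in the preference.

    Hypothetical helper: the next model is predicted from the average time
    per completed model, as in the examples discussed in the thread.
    """
    if models_done == 0:
        return True  # nothing to average yet, so the first model always runs
    average = cpu_seconds_used / models_done
    predicted_total = average * (models_done + 1)
    return predicted_total <= runtime_pref


# The two results quoted above, with cpu_run_time_pref = 7200 seconds
print(should_start_next_model(5145, 1, 7200))  # 5145 * 2 = 10290 > 7200
print(should_start_next_model(4995, 2, 7200))  # (4995 / 2) * 3 = 7492.5 > 7200
```

Under this rule a first model that alone takes about 2.5 hours (9000 seconds) would also predict a total well over 7200 seconds, which is why a WU that goes on to a second model in that situation looks like a failure of the logic.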
I have a problem with two tasks at the moment.
Yesterday I wondered why the remaining time was set to 30.5h per WU, but I didn't worry about it ... perhaps a test with more work per WU ... who knows. ;-)
But now one task is 'stuck' at 58.285% (+0.002% in more than 12h now) and the other one at 82.419% work done.
Runtime for these WUs is at around 28h and 11.75h and keeps counting up (both elapsed and remaining -_- ).
So I asked the task manager for help and it says the following:
these two WUs are using 218 MB and 300 MB of memory ... and no CPU resources any more ... 0% for both (CPU time is still counting up at 1 sec/sec).
Did something go wrong on my PC while crunching? Or what is the matter here?
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Max:
perhaps just a 1st model was designed quickly, and the last took much longer than expected
Right and that is exactly what Sarel's new tasks do. Run 5 models in 5 minutes, then hit one that looks interesting and run for (for example) 80 minutes. Now 6 models have been completed in 85 minutes and with a 2hr runtime preference, we guess we can complete more models in the 2 hours. If that next one happens to be interesting as well, you run long.
Some of the improvements Sarel is making and working on will help the longer models run faster. So this should avoid some of those that were taking several hours for a single model, and make completion times closer to your preference.
Yes, Max. The watchdog USED to be based on 4 times the runtime preference. This was fine for short runtime preferences, but those with the preference set to over 12 hours wanted to kill the task sooner and get on with others. Now it is runtime pref. plus 4 hrs, with the thought that all properly running models will complete in less than 4 hours.
The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.
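To make the difference between the two watchdog rules concrete, here is a minimal sketch that compares them for a few runtime preferences. The function names are made up for this illustration; only the two formulas (4 times the preference, and preference plus 4 hours) come from the explanation above.

```python
HOUR = 3600  # seconds


def watchdog_limit_old(runtime_pref):
    # Old rule as described above: 4 times the runtime preference
    return 4 * runtime_pref


def watchdog_limit_new(runtime_pref):
    # Current rule as described above: runtime preference plus 4 hours
    return runtime_pref + 4 * HOUR


# Print preference, old limit and new limit, all in hours
for pref_hours in (2, 12, 24):
    pref = pref_hours * HOUR
    print(pref_hours, watchdog_limit_old(pref) // HOUR, watchdog_limit_new(pref) // HOUR)
```

With a 2 hour preference the new rule gives 6 hours, which matches the 6 hour cutoff Max observed. With a 12 hour preference the old rule would have waited 48 hours, while the new one stops at 16.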
KnopperHarley
This is one of the few remaining problems that some people are seeing in version 2.05. It seems to be rather rare, and perhaps only to occur on Windows. I see you are running Win XP (I highlight that just to make it easy for the Project Team to see it, not because it should be a problem). I believe suspending and resuming the tasks seems to get them going again.
Could I ask you how your machine is configured? Specifically, do you leave tasks in memory while preempted? Do you run other BOINC projects? Do you allow BOINC to run 100% of CPU? Do you power your machine off each day?
____________ Rosetta Moderator: Mod.Sense
I tried around a bit (restarted BOINC) and (you might guess): it works. ^^'
Cpu-time jumped back to 3h and 6h or something and it's using the cores again.
Seems like something really screwed up the Rosetta-apps while working.
So nevermind ... ignore my posting above. ;-)
I lost a bit of time, but the WUs are obviously (hopefully?!) undamaged and one has been completed in the meantime, so happy crunching again. \o/
greetings
PS: Would it make sense to send the WUs a second time to another participant to confirm the results ... just to be sure?!
Especially the second WU mentioned in my post above (probably more than 7,5h in the end) plus another WU with almost 6,75h
(t293__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_4919)
that has been finished last night are, let's say ... (maybe not impossible but) 'unusual' (to me :-) ).
PPS: for the protocol *g*
- Leave applications in memory while suspended? no
- Rosetta + SETI (50:50)
- Use at most 100 percent of CPU time
- it's almost every day off for a period of time (except weekend once in a while)
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Keep going or abort?
As Evan points out, often such conditions get reset if you suspend and resume the task, or end and restart BOINC...
But first, I'd like to ask you to go to the advanced view, tasks tab, select the task that's been running so long, and then click the properties button that appears over on the left. There are three time figures there that I would like you to report:
CPU time at last checkpoint:
CPU time:
and Elapsed time:
It will take you a minute or so to jot that down, then close the window, and click again on the properties button for the task and see if the CPU time has changed at all.
____________ Rosetta Moderator: Mod.Sense
I just took a look at my graphics and saw this. Is it normal? I've been watching it for a while now and it seems to be stuck on model 2, step 0. Any ideas on what I should do?
I do not think that should be ignored. This type of task is behaving very strangely on my computer too. Here's an example where the protein is coiled into a ring:
The model has been in this state for about 30 minutes already. Sometimes the ring begins to unfold, but then it rolls back into a ring.
It seems that if I give it some time it finds the protein structure again; it was quite strange. Also, I wanted to give a heads-up that I'm having a huge issue with the boinc_filtered_loopbuild_threading WUs. Most of the new ones I have received have stalled at about 5 percent and I've had to abort them. Are we any closer to fixing this issue? It seems to be getting worse. Here is some info on my current one: protein: t385, CPU time at last checkpoint: 33:20, CPU time: 34:24, elapsed time: 14:21:01.
I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…
____________
ID: 65221
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…
The name of the WU is:
igfhum_looprefine_placestub2_2dsrI_1B6E_ProteinInterfaceDesign_2Feb2010_17660_331_0
After 45 minutes I restarted BOINC and the WU restarted from zero. Now, after 2 hours, the properties show me that the CPU time after checkpoint is still without any number (“---“), like the WU has worked for a few minutes.
Looking at the task manager, it seems that the WU continuously asks for more memory until it reaches the limit set in the preferences. Then it drops rapidly to 280 MB and increases again up to around 1,2 GB.
Vista 32 bit, Core Duo T7250, 2 GB DDR2, BOINC 6.10.29
UPDATE: The WU finished without errors.
Looking at the graphics, I noticed that whenever the WU froze in the “request memory loop”, it was always in the “kic_refine_r2” stage and the accepted energy didn’t vary.
I hope this info is useful. :)
For some reason I am not getting new work. When I update the project it simply says "Not reporting or requesting tasks". I am using BOINC version 6.10.18 .
ID: 65226
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
For some reason I am not getting new work. When I update the project it simply says "Not reporting or requesting tasks". I am using BOINC version 6.10.18 .
John, it sounds like BOINC has decided to schedule work from other projects for the near term on your machine. It is trying to honor the resource shares between projects that you have established. It's normal, and once some work for the other projects has been done, it will come back and ask for work from Rosetta automatically.
____________ Rosetta Moderator: Mod.Sense
Hi!
I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February.
I'm running on an amd64 Linux system, a pretty powerful one. Looking at my task log, I have had about 120 WUs assigned so far, but only 3-4 of them completed successfully. The others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information, because it doesn't contain any error line except "output file .... absent", which the FAQ says is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph and cosmology, and with the exception of einstein tasks, which also seem to end in computation errors, every other project is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc).
This is the /proc/cpuinfo file (I've omitted the other 3 cores):
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 920 Processor
stepping : 2
cpu MHz : 2800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 5619.47
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
The machine is equipped with 8 GB of RAM. Everything is running at stock speed; I'm not overclocking. If any other information is needed I can provide it, and I'm not scared to do some debugging. :)
Thanks
Neo2
ID: 65229
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Neo2, thanks for joining Rosetta. I see you have two machines. The 4 core that you described is here. And at present, it doesn't show any successfully completed work units. If you look at the task details for that host, such as this one, they each report an error opening a file. The file name seems to vary with each task.
This implies a security setup problem on your machine. The executable and the user that is running the BOINC core client, need authority to the files that are downloaded. Is it possible your BOINC installation is conflicting with some anti-virus software? Or other security measures?
____________ Rosetta Moderator: Mod.Sense
I don't think so, currently I have clamd up and running, but is only a daemon to fulfill requests from userspace programs, not a real-time antivirus software.
I'm running 2.6.33 git kernel, without any extra security measures: no grsecurity, no firewall, no external security hooks of any sort, no SElinux.
The directory in which BOINC runs is owned by user and group boinc, both existing, no file in the directory is owned by other users. Every file (except executables which have 0755) has got permission 0644 while the directories have 0755. The BOINC executable runs with boinc:boinc also.
Before starting BOINC for the first time I tuned the directory parameters, so every file in the BOINC directory has been created by BOINC itself.
Gentoo by default installs a stock /etc/conf.d file through which the BOINC service is started. I only modified the paths for data storage and logging, nothing else.
The file is the following:
# Config file for /etc/init.d/boinc
# Owner of BOINC process (must be existing)
USER="boinc"
GROUP="boinc"
# Directory with runtime data: Work units, project binaries, user info etc.
RUNTIMEDIR="/mnt/storage/boinc"
# Location of the boinc command line binary
BOINCBIN="/usr/bin/boinc_client"
# Logfile (/dev/null for nowhere)
LOGFILE="/mnt/storage/boinc/boinc.log"
# Allow remote gui RPC yes or no
ALLOW_REMOTE_RPC="yes"
# nice level
NICELEVEL="17"
# scheduling parameters, arguments to chrt(1)
SCHED_PARAM="--batch 0"
# Relative CPU allocation for boinc user, default is 1024,
# requires CONFIG_FAIR_GROUP_SCHED and CONFIG_USER_SCHED,
# see /usr/src/linux/Documentation/scheduler/sched-design-CFS.txt
CPU_SHARE="768"
Now I'm a bit disappointed.
Would the manual removal of the rosetta files and the re-sync with the project be of any use?
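One way to double-check the ownership and permissions described in this post is to walk the BOINC data directory and list anything that is not owned by the expected user or that the owner cannot read. The following is a minimal sketch for Unix systems; the function name is invented for this example, and it looks only at the owner and the owner-read bit, so it will not catch every possible cause of a "cannot open file" error.

```python
import os
import stat


def find_permission_problems(root, expected_uid):
    """List paths under root not owned by expected_uid or not owner-readable."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat: do not follow symbolic links
            wrong_owner = info.st_uid != expected_uid
            unreadable = not (info.st_mode & stat.S_IRUSR)
            if wrong_owner or unreadable:
                problems.append(path)
    return problems
```

For the setup quoted above, the call would be something like `find_permission_problems("/mnt/storage/boinc", pwd.getpwnam("boinc").pw_uid)` after `import pwd`. An empty list means that ownership and the owner-read bit look fine everywhere the walk could reach.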
ID: 65232
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
Mon 08 Feb 2010 22:14:59 EST|rosetta@home|Output file igfhum_looprefine_placestub2_2dsrI_1P6F_ProteinInterfaceDesign_2Feb2010_17660_271_0_0 for task absent
____________
ID: 65236
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Would the manual removal of the rosetta files and the re-sync with the project be of any use?
...doubtful. I would have suggested that if I felt it stood a good chance of helping your situation. But it can't hurt anything (costs you some bandwidth to reload everything).
Now that I think about it, if security setup were the problem, you should have same issue with other projects.
Anyone else have any ideas why Linux would be unable to open an application file?
____________ Rosetta Moderator: Mod.Sense
ID: 65237
jcorn Forum moderator Project administrator Project developer Project scientist Joined: Jan 27 06 Posts: 6 ID: 54746 Credit: 175,758 RAC: 747
Hi Manuel and P.P.L.
The large memory requirements are a once-in-a-while occurrence, but not something entirely unexpected. These jobs occasionally find a very interesting possible solution and spend a lot of resources testing it. I had submitted these jobs with the requirement for 512 MB RAM allocated for boinc. But based on your observations, I'll increase that requirement to 1 GB in the future. Thanks very much for the reports!
____________
Anyone else seeing the following consistent error:-
File - minirosetta_graphics_1.92_windows_x86_64.exe stops downloading at 4.57/5.10 MB
Message section is showing this as a HTTP error followed by Internet access OK - project servers may temporarily be down.
I have reset the project (more than once) and also detached and waited until the next PC boot to re-attach. All this had no impact and it's been doing this for several days now. So I am unable to process any work units, as the application hasn't finished downloading.
Running on Boinc 6.10.18 for Windows 64Bit on Windows 7, AMD 64Bit Dual Core, 4GB RAM
I am also running Seti@Home and this is running error free in both the standard and astropulse projects.
ID: 65239
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
Hi jcorn.
I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…
===============================================================================
Going by this, if i can make a suggestion you might want to up the memory limit to 1.5GB for those tasks.
My rig that had the error has 1GB total, less with O.S. taken out that's not going to be enough.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Anyone else seeing the following consistent error:-
File - minirosetta_graphics_1.92_windows_x86_64.exe stops downloading at 4.57/5.10 MB
Message section is showing this as a HTTP error followed by Internet access OK - project servers may temporarily be down.
I have reset the project (more than once) also detached and waited until next PC boot to re-attach. All this had no impact and its been doing this for several days now. So I am unable to process any work units as the applications hasn't finished downloading.
Running on Boinc 6.10.18 for Windows 64Bit on Windows 7, AMD 64Bit Dual Core, 4GB RAM
I am also running Seti@Home and this is running error free in both the standard and astropulse projects.
It should recover the transfer from where it left off and get the rest of the file. But it seems it must have a hiccup along the way. Are you using a caching proxy server or something?
Sounds like you've enabled the http tracing. Which Rosetta server does it say it is trying to get the file from? It should actually cycle through all of them as it does the retries. This should confuse a proxy enough that it would start fresh.
The large memory requirements are a once-in-a-while occurrence, but not something entirely unexpected. These jobs occasionally find a very interesting possible solution and spend a lot of resources testing it. I had submitted these jobs with the requirement for 512 MB RAM allocated for boinc. But based on your observations, I'll increase that requirement to 1 GB in the future. Thanks very much for the reports!
This is a good idea, but I think the specific WU I mentioned had another problem. It continued to take memory until the maximum available was reached. So it might have taken even more RAM if I had more in my PC.
So far I'm the only one who has noticed this problem, so maybe it is an isolated case.
It does not show which server it is connecting to, I tried the url you provided and srv3 as well. Both of these download at 136KB/sec consistently until the download hits 4.57MB then the transfer rate drops to 0KB/sec and after 5 minutes or so it times out the connection.
Exit status -1073741819 (0xc0000005)
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000
____________
ID: 65252
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
It does not show which server it is connecting to, I tried the url you provided and srv3 as well. Both of these download at 136KB/sec consistently until the download hits 4.57MB then the transfer rate drops to 0KB/sec and after 5 minutes or so it times out the connection.
...even so, the reaction on the client should be to see that it did not receive the entire length of the file, and then try a second request for just the remainder of the file (this is done with an HTTP range header).
I just tried the link from home and it worked fine for me 5.1MB. So that would explain why others are not seeing the same.
Is it possible your ISP is limiting the time of each connection or something? Even so, I'm still puzzled why it doesn't sound like it is doing a retry from where it left off. What is the reaction on the client after the connection gets the 4.57M and then times out?? It should schedule a retry on the file until it gets it. And the retry should pick up where the first attempt left off at the 4.57M. You should see this in the advanced view, in the transfers tab.
Perhaps there is something up with Win7? I'd be curious to have a look at a Wireshark trace if you would take the time to gather one.
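For readers unfamiliar with the range header mentioned above, here is a minimal sketch of what a "resume from where it left off" request looks like. It only builds the request and does not send anything. The URL and the byte offset are placeholders, and this is an illustration of the HTTP mechanism, not the BOINC client's actual code.

```python
import urllib.request


def build_resume_request(url, bytes_already_received):
    """Build (but do not send) a GET request for the rest of a file."""
    request = urllib.request.Request(url)
    # "bytes=N-" asks the server for everything from offset N to the end
    request.add_header("Range", f"bytes={bytes_already_received}-")
    return request


# Hypothetical URL and offset, for illustration only
req = build_resume_request("http://example.org/download/app.exe", 4_570_000)
print(req.get_header("Range"))
```

A server that supports range requests answers with status 206 (Partial Content) and sends only the missing tail of the file, which the client then appends to what it already has.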
____________ Rosetta Moderator: Mod.Sense
ID: 65254
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
Hi jcorn.
Either this is an old task or the memory limit hasn't been changed yet, this one
had the same problem on the same rig, would you believe!
I have the Wireshark trace from the re-attaching of the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again, all other Rosetta files and the first work unit have successfully downloaded.
Where do you want me to send the Wireshark trace report?
By the way - it looks like a typical memory leak...
A fairly common error in computer programs
Hi jcorn.
Either this is an old task or the memory limit hasn't been changed yet, this one
had the same problem on the same rig, would you believe!
Only ran for 10min this time.
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=288907163 igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0
Yes, it's old.
Hint: name of the task contains date when it was scheduled. 2 Feb 2010 in this case.
Credit-wise, this task: http://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credits and got only 3!
It ran the full length of time, 15000+ seconds, ran 44 models and generated 2 decoys.
Something is wrong with those numbers, especially the granted credit.
But the times you were awarded more than claimed credit weren't a problem? Funny how that works.
It's an average and you're ahead of average generally. I am too but I thought best not to mention it ;)
____________
ID: 65276
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Let's not get testy Sid. It looks like he ran 46 models and got credit for only the last 2. I've asked the Project Team to look in to these "double headers" as I call them. Thanks for reporting it Greg. If you have any hints about any rare events that may have occurred on your PC about the time those last two models would have been run, that would be great. Did you happen to power off or shutdown BOINC about that time?
____________ Rosetta Moderator: Mod.Sense
The previous one I reported has also failed on its second attempt
____________
ID: 65278
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
I have the Wireshark trace from the re-attaching of the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again, all other Rosetta files and the first work unit have successfully downloaded.
Where do you want me to send the Wireshark trace report?
Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th tries are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times, but never acknowledged. Eventually the connection is closed.
The only times I've seen such fundamental things not work properly is when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter need an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent.
The only other thing I can think of to try is that the downloads started out looking good, so after that first one gets most of the file and fails, shutdown BOINC, and shutdown the LAN adapter ("disable"), reenable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a work around. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter.
Is anyone aware of any specific TCP fixes for Win7?
Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display.
____________ Rosetta Moderator: Mod.Sense
I didn't mean it that way - sorry if that's how it came across. I just recalled Sarel's comment way up the thread that "The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details" so I'm pretty much ignoring all the vagaries of credit awards against claims. It averages out so we win some, we lose some. Is that not right?
If it's not then I can report quite a few too, for what it's worth.
Probably of more benefit I should report some compute errors, much the same as reported by others:
BOINC client version 6.10.18 for windows_x86_64
Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T6600@2.20GHz [Intel64 Family 6 Model 23 Stepping 10]
OS: Microsoft Windows 7: Home Premium x64 Edition, (06.01.7600.00)
Memory: 4.00 GB physical
# cpu_run_time_pref: 28800
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000
BOINC client version 6.10.18 for windows_x86_64
Processor: AMD Phenom(tm) 9850 Quad-Core Processor [AMD64 Family 16 Model 2 Stepping 3]
OS: Microsoft Windows Vista Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
Memory: 8.00 GB physical
# cpu_run_time_pref: 28800
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000
I'm a new user running the latest version. The first work unit you sent ran fine to about 70%, then stopped and dropped off the task list as submitted. My account shows no work units submitted. May be a problem.
Now I've also got quite low credit: WU 288293546.
I usually need something like 450-650 CPU-seconds for 1Cr, on this WU I got 1Cr/1972sec.
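To put those numbers side by side, here is a quick back-of-the-envelope sketch. It only uses the figures quoted above (450-650 CPU-seconds per credit normally, 1972 on this WU); the function name is made up for illustration.

```python
def slowdown(observed_s_per_credit, typical_s_per_credit):
    """How many times more CPU-seconds per credit than usual."""
    return observed_s_per_credit / typical_s_per_credit


observed = 1972  # CPU-seconds per credit on the low-credit WU
for typical in (450, 650):
    ratio = slowdown(observed, typical)
    print(f"typical {typical} s/Cr -> {ratio:.1f}x the usual cost per credit")
```

So this WU cost roughly 3 to 4.4 times the usual CPU time per credit.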
____________
Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.
However, I would expect that once the target runtime is exceeded, and a model is completed, work on the WU should end and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.
Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.
The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.
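The expected behaviour described above can be sketched roughly as follows. This is only an illustration of the rule as stated in this post, not the actual minirosetta code: the function names are invented, and using the average time per completed model as the prediction for the next one is an assumption.

```python
WATCHDOG_GRACE_S = 4 * 3600  # watchdog kicks in more than 4 hours past the target


def should_stop_after_model(elapsed_s, models_done, target_s):
    """True if the task is expected to end now that a model has just finished."""
    if models_done == 0:
        return False  # nothing to report yet
    if elapsed_s >= target_s:
        return True  # target runtime already exceeded
    # Assumption: predict the next model from the average time per model so far.
    predicted_next_s = elapsed_s / models_done
    return elapsed_s + predicted_next_s > target_s


def watchdog_should_end(elapsed_s, target_s):
    """True if the watchdog is expected to end the task mid-model."""
    return elapsed_s > target_s + WATCHDOG_GRACE_S


target = 7200  # cpu_run_time_pref of 2 hours
print(should_stop_after_model(7500, 1, target))  # first model ran past the target
print(should_stop_after_model(4000, 1, target))  # first model took over an hour
print(should_stop_after_model(3000, 1, target))  # room for a second model
print(watchdog_should_end(22000, target))        # more than 4 h past the target
```

Under this reading, a task with a 7200 sec preference that produced two or more decoys after a first model of about 2 hours is not behaving as expected.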
In all the examples cpu_run_time_pref was set to 7200 sec, and every one of them generated 2 or more decoys (in 2 of them I saw that the 1st model took about 2 hr or more), so the program should have been able to stop correctly after the 1st decoy. But for some reason it did not.
I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.
____________
ID: 65329 | Rating: 0 | rate:
/
Sarel Forum moderator Project administrator Project developer Project scientist Joined: May 11 06 Posts: 47 ID: 81994 Credit: 52,066 RAC: 125
Hello, if these are the Protein-interface Design jobs then this is expected, since they work with very large complexes of proteins. If you turn on the graphics you'll see that the protein systems are much larger than the typical ones on Rosetta@home. These jobs are sent out with a requirement for 512 MB of memory to ensure that large-memory jobs are not sent out to low-resource machines.
Best, Sarel.
Task 317544195, lr15clus_opt_.1a32.1a32.IGNORE_THE_REST.c.2.8.pdb.pdb.JOB_17418_5_1, behaved strangely on Mac OS X. It got hung at Model 2: step 0 and had to be aborted. In the Searching... pane in the graphics window the protein was compressed into a furball; the other protein displays seemed pretty normal.
I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again all other Rosetta files and the first work unit have successfully downloaded.
I am using a router, so I reloaded the firmware as suggested and this has fixed the problem. I didn't think about the router being the cause, as I would have expected that to have caused problems with other BOINC project updates or other software auto-updaters.
I am seeing WUs that seem to be "stuck". If you look at the properties of the WU, you typically see something like:
CPU time at last checkpoint 00:35:26
CPU time 06:02:29
If you look at the graphics, you see that the protein is not changing shape at all and the energy and RMSD are perfectly constant. These jobs run on for around 25,000 seconds and (I assume) are then terminated by the watchdog. You get very low credit for these jobs. I assume this is a bug in the code. If so, please fix it quickly since it is wasting lots of CPU time.
When I see such a WU, should I abort it, or is it better to leave it running?
This is with Rosetta Mini 2.05 on a 64-bit Linux system. I have seen this on both of my OpenSUSE 11.2 systems with the 2.6.31.8-0.1-desktop kernel (so hardware problems are ruled out). I have so far not seen this on my dual-core OpenSUSE 11.0 system with a custom 2.6.28.2-vanilla kernel. The latter is by far the least performant machine, so there is a (small) chance that this is just coincidence. I do not see an obvious pattern in which WUs suffer from this.
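As a rough way to spot such a WU from its properties, the gap between CPU time and the last checkpoint can be computed as in the sketch below. The one-hour threshold is an arbitrary assumption for illustration, not a BOINC or Rosetta setting, and the function names are made up.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def checkpoint_gap(cpu_time, last_checkpoint):
    """CPU seconds accumulated since the last checkpoint was written."""
    return hms_to_seconds(cpu_time) - hms_to_seconds(last_checkpoint)


# Hypothetical threshold: flag a task with over an hour of uncheckpointed CPU time.
SUSPICIOUS_GAP_S = 3600

gap = checkpoint_gap("06:02:29", "00:35:26")
print(gap, gap > SUSPICIOUS_GAP_S)
```

For the values quoted above the gap is 19623 seconds, about 5 hours 27 minutes of CPU time without a checkpoint.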
For almost a week I have been getting an error while downloading the app:
[error] Can't create HTTP response output file projects/boinc.bakerlab.org_rosetta/minirosetta_2.05_windows_x86_64.exe
What can I do? I already tried to reset the project.
The rosetta_beta version_598 app works well.
All, or at least most, of the ProteinInterface jobs cause validate errors. Would it be possible to fix this and post in this thread when the fix is made? It occurs on three different computers (PC, Mac), so it appears to be platform-independent. In the meantime, I am aborting all ProteinInterface jobs and leaving the others, which run ok.
markj
Haven't been on this project long. No noticeable problems outside of punching 'ok'. Briefly searched the forums for c++ runtime error and didn't find anything, so cheers, here's a pic.
ID: 65420 | Rating: 0 | rate:
/
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
Looks like "J" has had a few compute errors reported on Win XP running BOINC 6.5.0:
The first was completed ok by a wingman
The second is out being worked on right now
The third failed on a wingman as well after 2 min. with an error: The system cannot find the path specified. (0x3) - exit code 3 (0x3)
I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (climate prediction, malaria control or world community grid).
Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.
Thanks - one of my 2.05 workunits had the same problem, but now seems to be running well after a reboot.
64-bit Vista SP2, BOINC 6.10.18, quad-core Intel, not using keep in memory when suspended (something tends to tie up lots of memory and make the computer unresponsive to the mouse and keyboard; haven't found what, though)
t311__boinc_filtered_loopbuild_threading type workunit
Before the reboot, showed CPU time 03:39:05, last checkpoint 03:39:03, elapsed time so far 20:29:26, not using any CPU time
Rebooted, that workunit restarted at about 4 hours elapsed time, but is now using a CPU core again.
Hi!
I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February.
I'm running on an amd64 linux system, a pretty powerful one. Looking at my tasks log, I have had about 120 WUs assigned until today, but only 3-4 of them completed successfully. The others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information because it doesn't contain any error line except "output file .... absent", which the FAQ tells me is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph and cosmology, and with the exception of einstein tasks, which also seem to end up in computation errors, every other program is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc).
Thanks
Neo2
One thing to look for: I've found that when the output file absent error occurs, it's a good idea to search the logfile for any reference to boinc_lockfile. Errors that refer to that file tend to cascade from one workunit to the next, at least with the older versions of BOINC, but not with some of the newer versions like the 6.10.18 I'm now using. They can also cascade to other BOINC projects that use a file with the same name, again for the older BOINC versions.
I have had several tasks stall out and stop using CPU over the past few days. I am finishing up my rosetta tasks, then taking this machine off the project. I was running an XP machine and had no problems. It died, and I replaced it with a W7 64-bit machine and some tasks started stalling out on me. In reviewing this thread, it appears there is a problem with mini Rosetta 2.05 running on W7 machines.
# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
</stderr_txt>
]]>
Validate state Workunit error - check skipped
One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.
There is nothing wrong on your end. This is a very old (and rare) bug in the boinc server software. Take a look here.
Wait a second, the trac item claims that the bug is fixed. Maybe it is time for Rosetta to update the server-code.
http://boinc.bakerlab.org/rosetta/result.php?resultid=322413556
tyrsim_3gbn_q.gz_Protein_interface_design_25Feb2010_18415_276_1
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
CPU time 4.4375
stderr out
Over the last few days I have had several WUs sitting idle after some time of computation. Windows XP task manager shows no CPU activity. If one does not notice this, many hours of WU processing get lost, which is very unproductive for the project.
____________
Today I got strange validation errors: "Task was reported too late to validate"
But there are 4 days until deadline (19 Mar)!
I see more things happened during the weekend. I'm seeing you detached or something which might cause all incomplete units to report. Did you perhaps restore a backup which tried to continue from an earlier point?
____________
ID: 65560 | Rating: 0 | rate:
/
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
What is odd is the way the tasks were reissued before he reported the completed ones back. That wouldn't normally happen. That isn't dependent upon Mad Max's machine, so I doubt they did a restore or anything. I'll have to see what we can find out.
____________ Rosetta Moderator: Mod.Sense
The "detached" error is BOINC related.
Actually I did not detach from the project, but rather connected a new computer. After that the BOINC client initially went mad: first it started downloading to the new computer (Athlon II X2 250) tasks that had already been downloaded to the old computer (Athlon XP 2600+); then at some point it thought better of it, registered the new computer on the server under a new ID, and then deleted the mistakenly downloaded tasks. (I think this is the point that was recorded on the server as "detached".)
Note: there was no transfer of any BOINC-related files from the old computer to the new one. The new client was a clean install from the distribution. So I do not know what caused this behavior. Maybe the fact that both computers connect to the internet from the same IP?
Hmm, now I think that in principle such a validation error could happen because of it. If one computer "cancels" the (mistakenly downloaded) tasks while the second is still working on them, could the server issue the same WU to another volunteer's computer and shift the deadline?
You still would've gotten credits if you had managed to report before the other computer. :) Anyway, from what you're saying about the other computer I do think the "too late to validate" error was more likely related to the new PC than to a bug in the science application. Maybe a problem with the BOINC manager itself?
____________
ID: 65565 | Rating: 0 | rate:
/
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
True, not a problem specific to v2.05 Rosetta. Perhaps BOINC server, or client. Either way, we should start another thread if further problem tasks are found.
Certainly many users that have multiple machines are connecting from the same IP address (I'm talking about the router's public IP address that the project servers see). And many other users come in via dynamic IPs, and so it is always different. My understanding is that BOINC uses many factors to determine if a given machine is the same as an existing registered one, to keep it all straight and separated correctly. Factors such as the user ID, host name, any existing BOINC host ID, machine type, installed OS, last RPC sequence number... so a fresh install should not have caused the client to "go mad" on either machine. Indeed many users have identically configured machines at the same site coming in via the same IP.
____________ Rosetta Moderator: Mod.Sense
ID: 65567 | Rating: 0 | rate:
/
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
This took 8 hrs, 2 min on my 3 GHz Intel, with a four-hour run time.