Message boards : Number crunching : Current issues with 7+ boinc client
Author | Message |
---|---|
ETQuestor Send message Joined: 13 Nov 12 Posts: 8 Credit: 957,206 RAC: 0 |
I am a new member of the Rosetta project... I have two Linux x86_64 (Fedora 17, 3.6.6 kernel) machines running this project. Both have NVIDIA GPU cards running GPUGrid, BOINC version 7.0.36, NVIDIA driver 304.64. One is completing and reporting Rosetta units just fine; the other is completing computation normally but getting the Client Error / Validation Error on every unit (so far). I am happy to provide any additional info or run diagnostic/troubleshooting steps, so please let me know. Host working correctly: https://boinc.bakerlab.org/rosetta/results.php?hostid=1578095 https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1578095 Host having errors: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1578078 https://boinc.bakerlab.org/rosetta/results.php?hostid=1578078 Error result: https://boinc.bakerlab.org/rosetta/result.php?resultid=543993914 <core_client_version>7.0.36</core_client_version> <![CDATA[ <stderr_txt> [2012-11-14 11:47: 4:] :: BOINC:: Initializing ... ok. [2012-11-14 11:47: 4:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev52076.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/input_rb_11_13_34809_65508__round2_t000__0_C1_robetta.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... 
BOINC:: Worker startup. Starting watchdog... Watchdog active. ====================================================== DONE :: 7 starting structures 9532.52 cpu seconds This process generated 7 decoys from 7 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 86.4172032926909 Granted credit 0 application version --- |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,990,626 RAC: 15,015 |
One more example of a host with a 100% error rate at the validation stage (the 6th in our team): https://boinc.bakerlab.org/rosetta/results.php?hostid=1577424 All the symptoms are exactly the same: - the calculation went just fine (no errors or strange behavior) - 100% of WUs fail at the validation stage - the app version is missing from the WU logs (application version ---) P.S. The machine had been switched over from Folding@Home, where it worked with no errors. Now the owner will be forced to switch it back, and you (the R@H project) have lost another 6 cores/12 threads running 24/7 |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,623,494 RAC: 11,263 |
Quote: "One more example of host with 100% error rate at validation stage (6th in our team):" In case it helps solve the issue, according to E@H that machine is running: NVIDIA GeForce GTX 680 (2048MB) driver: 31033 http://einstein.phys.uwm.edu/hosts_user.php?userid=608684 |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,990,626 RAC: 15,015 |
And some example WUs where both hosts (wingmen) fail due to this bug. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=495612750 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=495643594 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=495643592 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=495690280 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=495689889 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=495689875 Hint to the devs: try a database query filtered by host with (Maximum daily WU quota per CPU = 1/day). You will probably get an almost complete list of the hosts affected by this bug (because other types of errors usually do not lower the limit that far; it can drop to 1 only at a 100% or near-100% error rate). It may also help identify patterns through statistics, i.e. what all the affected computers have in common. Or at least it gives an estimate of how many computers are affected. |
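Mad_Max's hint to filter hosts by a daily quota of 1 and then look for common traits can be sketched as a simple grouped query. This is only an illustration under stated assumptions: the `host` table below is a simplified stand-in created in an in-memory SQLite database, the column names (`os_name`, `client_version`, `max_results_day`) are assumed rather than taken from the project's real schema, and the sample rows are made up.

```python
import sqlite3


def hosts_at_minimum_quota(conn, quota=1):
    """Group hosts whose daily quota has dropped to `quota` by OS and client version."""
    return conn.execute(
        """
        SELECT os_name, client_version, COUNT(*) AS n_hosts
        FROM host
        WHERE max_results_day = ?
        GROUP BY os_name, client_version
        ORDER BY n_hosts DESC, os_name, client_version
        """,
        (quota,),
    ).fetchall()


# Hypothetical, simplified table; the real project database may look different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host ("
    "id INTEGER PRIMARY KEY, os_name TEXT, client_version TEXT, max_results_day INTEGER)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO host VALUES (?, ?, ?, ?)",
    [
        (1, "Linux", "7.0.36", 1),
        (2, "Linux", "7.0.27", 1),
        (3, "Linux", "7.0.36", 1),
        (4, "Linux", "7.0.36", 100),
        (5, "Windows", "6.12.34", 100),
    ],
)

groups = hosts_at_minimum_quota(conn)
for os_name, client_version, n_hosts in groups:
    print(os_name, client_version, n_hosts)
```

Grouping by further columns (GPU driver, kernel version) would follow the same pattern, if such columns are available.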
Alez Send message Joined: 3 Apr 12 Posts: 13 Credit: 3,534,368 RAC: 0 |
I've just had 100% failure on all units too. https://boinc.bakerlab.org/rosetta/results.php?hostid=1577285 Is there a cure for this? Will I be granted credit? Does this mean this machine can't be used on Rosetta? As I've just wasted 3 days crunching, I'm not very happy about it. Have set no more work until I know whether or not I'm wasting my time. |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,990,626 RAC: 15,015 |
No, there is still no cure for this bug. Credit will be granted with a ~1 day delay; see for example: https://boinc.bakerlab.org/rosetta/result.php?resultid=545268784 But soon your daily WU quota will drop to 1 per CPU/day and you won't be able to get new WUs (only a few per day), so in practice this means the machine can't be used on Rosetta until the cause of this bug is found and fixed. |
Alez Send message Joined: 3 Apr 12 Posts: 13 Credit: 3,534,368 RAC: 0 |
The first time this showed up for me was on the 19th, 5 days ago, and I still have been granted no credit, so not so sure. 67 units crunched now and not a single credit granted, and still my allowance is 40 units/core, so I don't know. Would be nice to hear from the project, or do I just go elsewhere? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Quote: "The first time this showed up for me was on the 19th, 5 days ago, and I still have been granted no credit, so not so sure. 67 units crunched now and not a single credit granted, and still my allowance is 40 units/core, so I don't know. Would be nice to hear from the project, or do I just go elsewhere?" For having been granted no credit, the RAC on your machine that has 67 WUs is higher than any of your other machines. When you view the task list, it shows no credit granted. But when you look at the details of any given task, you see the credits granted by the nightly script for tasks that didn't end normally. Rosetta Moderator: Mod.Sense |
arborman Send message Joined: 6 May 07 Posts: 5 Credit: 5,067,141 RAC: 0 |
Hello, yesterday I added a new laptop to the Rosetta project, but all tasks end with an error... What is the problem? Thanks. My computer ID: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1580489 |
Alez Send message Joined: 3 Apr 12 Posts: 13 Credit: 3,534,368 RAC: 0 |
Quote: "The first time this showed up for me was on the 19th, 5 days ago, and I still have been granted no credit, so not so sure. 67 units crunched now and not a single credit granted, and still my allowance is 40 units/core, so I don't know. Would be nice to hear from the project, or do I just go elsewhere?" Yes, I see that now. Was wondering why my RAC was going up but every unit said no credit. Thanks for pointing that out, and to Mad Max as well, and after re-reading my second post, sorry if I sounded a little 'stroppy'. However, my WU allowance is now starting to drop, so I'm wondering how long before it's down to 1/core, and once it is and a cure for this bug is found, I presume my WU allowance will start to increase again? |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Quote: "However, my WU allowance is now starting to drop, so I'm wondering how long before it's down to 1/core, and once it is and a cure for this bug is found, I presume my WU allowance will start to increase again?" Set Rosetta to "no new tasks" and crunch for other projects until this is fixed. That way it will not drop that far and will come back up to 100 a lot faster (once you are able to return valid results here again). |
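For anyone who prefers the command line to the BOINC Manager button, Link's "no new tasks" suggestion can be sketched with `boinccmd`, whose `--project <URL> nomorework` and `allowmorework` operations stop and resume work fetch for one project. The project URL below is an assumption; it has to match what the local client reports (for example via `boinccmd --get_project_status`). The helper names are hypothetical, and the command is only printed here, not executed.

```python
import shutil
import subprocess

# Assumption: this must match the project URL known to the local BOINC client.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def work_fetch_command(project_url, allow_new_tasks):
    """Build the boinccmd argv that stops or resumes work fetch for one project."""
    operation = "allowmorework" if allow_new_tasks else "nomorework"
    return ["boinccmd", "--project", project_url, operation]


def run_if_installed(argv):
    """Run the command only when boinccmd is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)


stop_cmd = work_fetch_command(PROJECT_URL, allow_new_tasks=False)
print(" ".join(stop_cmd))
```

Calling `run_if_installed(stop_cmd)` would actually apply the setting; building the matching command with `allow_new_tasks=True` resumes work fetch once the bug is fixed.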
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,169,305 RAC: 3,857 |
It seems as though David's initial estimate of "about 2 weeks" was a bit off, I HOPE he has time at some point to fix this!!! |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Quote: "It seems as though David's initial estimate of "about 2 weeks" was a bit off, I HOPE he has time at some point to fix this!!!" I'm getting quite used to WCG's relatively low RAM usage (150MB MAX) and GPUGRID's research... I'm switching all of my machines to WCG as a matter of fact; they let the PCs run smoother (at least on PCs with no more than 1GB per core). |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,990,626 RAC: 15,015 |
In our team (around 700-1000 computers / a few thousand CPU cores running R@H) we are discussing transferring part of our computing power from R@H to other biomedical DC projects, including WCG (such as Human Proteome Folding and Help Conquer Cancer), because the technical side of WCG works much better: - significantly lower RAM usage (~100 MB per CPU thread compared to 300-700 MB in R@H) - much lower internet traffic (important to users with slow/limited internet connections, while R@H has lately been generating huge amounts of traffic) - incomparably lower disk usage (because there is no absurd unpacking and removing of the ~200 MB, ~1500-file database in each WU) - lower error rate (and so far no fatal bugs that prevent a computer from working on the project at all, like the one discussed in this topic) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...while the topic of this thread morphs to other BOINC projects, it sounds like a good time to point out that the WUs I have received from WCG recently all have a minimum quorum of 2. This means that every piece of work issued by the project is sent out twice. 50% of the time, your machine is doing work that has already been completed by someone else. I am also still awaiting credit to be granted (often for several days), because the second cruncher has not yet returned their confirmation that my work was good. And if they happen to complete the work and report in with a lower credit claim, I will only get the lower credit granted, even if the other cruncher spoofs their results to report an artificially low number. If the other cruncher spoofs more than just the credit claim, my work might be thrown away because the results were inconsistent. I like WCG; many of you have seen me recommend it to people as a good project with low memory requirements to mix in with Rosetta's higher memory requirements on multi-core machines. But designing your whole system such that 50% of the effort is just confirming credit claims does have some drawbacks as compared to how Rosetta@home is set up. Rosetta Moderator: Mod.Sense |
JAMES DORISIO Send message Joined: 25 Dec 05 Posts: 15 Credit: 201,201,447 RAC: 48,100 |
This new computer also seems to be affected by this problem. Intel i7-3770, Ubuntu Linux 12.04 amd64, NVIDIA driver 310.14, BOINC 7.0.27, all downloaded from the Ubuntu repository. I have the run time preference set at 12 hours. https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1579123 The 1st weekend I tested it with Rosetta@home and WCG, it successfully completed 5 work units with no errors. The 2nd weekend, with the same setup, it successfully completed 10 of 12 work units with 2 validate errors; then I installed NVIDIA drivers to test it on GPU projects, in this case Einstein@home. Since then it has returned all client errors, even after setting no new work from Einstein, finishing all the GPU work units, and rebooting it. Then, running only Rosetta & WCG, it received 5 new work units, which all ended with client errors. I can see no difference in the log files between the successful work units and the client errors. Interestingly, it has successfully completed all WCG Human Proteome Folding Phase 2 work units, which use Rosetta software as per the web site. Quote: "Human Proteome Folding Phase 2 (HPF2) continues where the first phase left off. The two main objectives of the project are to: 1) obtain higher resolution structures for specific human proteins and pathogen proteins and 2) further explore the limits of protein structure prediction by further developing Rosetta software structure prediction." I post this information hoping it will help the Rosetta staff fix this problem, but I am willing to use this computer at WCG until this problem is fixed. I will continue to test a few work units per weekend alongside GPU projects to see what happens, if this is of any help. Also, one question for the Rosetta staff: are the work units that seem to complete normally but are marked client error of any scientific value to you? 
I have noticed that credit is eventually granted on the task ID page, although the server gives the work unit out again to someone else, so it is kind of a waste of time. Please ask if you have any questions. Thanks Jim |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,623,494 RAC: 11,263 |
Hi MM, TSC is obviously a huge team and that compute power would be a significant loss for the project. The RAM usage is a result of the complexity of the models/research being carried out, so there's probably little that can be done about that - cutting edge research can be demanding! If the packing/unpacking issue could be solved, do you think that would convince the majority to stay? Both the disk activity and bandwidth utilisation can be reduced by increasing the run-time preference in the meantime. Danny |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,990,626 RAC: 15,015 |
To Mod.Sense: Yes, WCG uses a minimum quorum of 2. But almost all DC (volunteer computing) projects compute each piece of work twice (on different hosts belonging to different users). There are only a few exceptions (and R@H is one of them). And of course the main goal is not to verify credit claims but, first and foremost, to validate the submitted scientific results. If they differ by even 1 byte, the project sends another copy of the WU (to a 3rd user) to find out which of the results is correct. Some algorithms and search strategies can tolerate a small fraction of incorrect results: the final result after postprocessing is not affected as long as the share of erroneous results remains relatively small. R@H is of this kind, so it can afford to use a quorum of 1 and save computational resources. This is mainly because for each target (protein model / result of the interaction) R@H issues several thousand WUs, and basically only the "best" results among them are of scientific interest; everything else (including results from incorrect WUs) is trimmed. In the majority of projects even a small percentage of incorrect results is unacceptable because it would distort the aggregate final result (and can sometimes even set off a chain reaction/cascade of errors, if new jobs are generated on the basis of previous ones). > "If the other cruncher spoofs more than just the credit claim, my work might be thrown away because the results were inconsistent." No, it will be thrown away only if you report incorrect results. If you report the right result but your "wingman" reports a "fake result", then this particular WU will be automatically issued to a 3rd user. 
If the 3rd user reports a correct result, your result and the 3rd user's will match and be marked as valid (and your first wingman's "fake" result is marked as invalid and discarded). In the rare situation where the 3rd user also reports an erroneous (or "fake") result, all 3 results fail to match each other and the WU will be sent to a 4th user, and so on (if other parameters such as "max # of error/total results" allow this; if not, all results for this WU are discarded). P.S. Points are usually taken not from the minimum but from the average of the results reported for a WU (that is the default in the BOINC scripts). While some projects have independently replaced the average with the minimum value, WCG AFAIK uses the default average (at least in the 2 subprojects I tried). Look for example: http://www.worldcommunitygrid.org/ms/device/viewWorkunitStatus.do?workunitId=573591090 The first user claimed 64.6 Cr, the second (it was me) claimed 61.6, and 63 (the average) was granted to both. |
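The quorum behaviour Mad_Max describes (compare results, issue another copy on a mismatch, grant the average claim once enough results agree) can be sketched roughly as follows. This is a toy model under stated assumptions, not any project's actual validator: the function and its tuple format are hypothetical, the parameter names `min_quorum` and `max_total_results` are borrowed only as labels, and outputs are compared by simple equality.

```python
from collections import Counter
from statistics import mean


def validate_workunit(results, min_quorum=2, max_total_results=5):
    """Replay results in arrival order and look for a quorum of matching outputs.

    `results` is a list of (host, output, claimed_credit) tuples.
    Returns (valid_hosts, invalid_hosts, granted_credit) once `min_quorum`
    results agree, or None if the result limit is reached first.
    """
    received = []
    for result in results[:max_total_results]:
        received.append(result)
        if len(received) < min_quorum:
            continue  # not enough copies yet to compare anything
        counts = Counter(output for _, output, _ in received)
        output, votes = counts.most_common(1)[0]
        if votes >= min_quorum:
            valid = [r for r in received if r[1] == output]
            invalid_hosts = [r[0] for r in received if r[1] != output]
            # Assumption: credit is the plain average of the valid claims.
            granted = mean(claim for _, _, claim in valid)
            return [r[0] for r in valid], invalid_hosts, granted
        # No agreement yet: a real server would now issue one more copy.
    return None


# Claims from the WCG example above: 64.6 and 61.6 credits.
agreeing = validate_workunit([("first", "x", 64.6), ("second", "x", 61.6)])

# A wingman returning a different ("fake") output forces a third copy.
with_fake = validate_workunit(
    [("you", "good", 60.0), ("wingman", "fake", 10.0), ("third", "good", 62.0)]
)
print(agreeing)
print(with_fake)
```

In the second call the honest result is kept and only the mismatching wingman is marked invalid, which is the point made in reply to Mod.Sense. The averaged grant for the first call comes out near 63.1, in line with the 63 quoted above.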
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,990,626 RAC: 15,015 |
To dcdc: In fact, we are not such a huge team. Our good results and large capacity in R@H come from being a very "specialized" team, which focuses on only a few projects at a time (and only in the biomedical field of research). In the last few years there were only 2 such projects, Folding@Home and Rosetta@Home, on which about 95% of our total computing power is concentrated. But now the many technical problems/disadvantages of R@H have forced us to search for and choose a 3rd project and move a significant part of our resources to it. Of course, we are not going to stop supporting R@H completely, because it has very high scientific value. > "The RAM usage is a result of the complexity of the models/research being carried out, so there's probably little that can be done about that - cutting edge research can be demanding!" Interesting point! While I was studying various biomed DC projects (to gather information and provide a summary on the basis of which the team can choose a new project) I found that some projects use the same Rosetta algorithms for simulating proteins (or at least that is what the project staff declare; I can't verify it myself) but use much less RAM. For example, the Human Proteome Folding app uses only 70-150 MB of RAM (compared to 300-700 MB in R@H). And I'm curious: did they achieve this by optimizing the code (the same algorithms in terms of the scientific implementation, but different program code), or do they just run ONLY very small/simple targets, or do they use totally different algorithms (only based on Rosetta, but much different)? Not sure... > "If the packing/unpacking issue could be solved, do you think that would convince the majority to stay? Both the disk activity and bandwidth utilisation can be reduced by increasing the run-time preference in the meantime." No, high disk usage is the least of all the problems I have listed. It is more an irritating factor (that slightly slows the work) than a critical one. 
The critical ones are others: - some computers (about 10 in our team now) whose owners want to run R@H cannot run it at all due to R@H bugs (like the bug discussed in this topic, with a 100% error rate) - some computers (~20-40) cannot run R@H normally due to high RAM usage (a good CPU with a relatively low amount of RAM, or high RAM usage severely interfering with the user's work) - some computers (many, but I hesitate to estimate numbers) cannot run R@H as their main project due to high internet traffic, even with a 24hr target CPU time. In the last few months R@H has had very many WUs with 15-40 MB of input files per WU. So just one 4-thread computer (starting from a low-end Core i3/Athlon X4) uses 100-150 MB of internet traffic per day (+ some tasks often run less than 24hr in normal situations + some WUs end quickly with errors). For a Core i7/AMD FX8xxx, double the values. This matters mainly for 3G internet (which in Russia is often used as the main access channel outside the major cities) and for use at work (where an internet usage limit is set by the network administrator / company rules). Typical limits in such cases are 1-5 GB per month (for everything, not just for R@H exclusively, while R@H alone can easily burn through such a limit in 1-2 weeks). + R@H does not have (and has no plans to develop) a GPU computing app. Nvidia cards usually work for F@H, but for ATI (AMD) cards all recent F@H clients are ineffective (and R@H does not use them at all), so ATI cards very often sit idle while the machine does only CPU WUs. This is a significant waste of potential resources. So computers equipped with ATI cards (and probably AMD hybrid CPUs) are also candidates for transfer to other project(s). |
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Quote: "- some computers (about 10 in our team now) whose owners want to run R@H cannot run it at all due to R@H bugs (like the bug discussed in this topic, with a 100% error rate)" That's something that Rosetta needs to fix ASAP. Quote: "- some computers (~20-40) cannot run R@H normally due to high RAM usage (a good CPU with a relatively low amount of RAM, or high RAM usage severely interfering with the user's work)" Well, it's nothing unusual that some programs need state-of-the-art computers, and Rosetta is one of them, at least when it comes to RAM. Quote: "- some computers (many, but I hesitate to estimate numbers) cannot run R@H as their main project due to high internet traffic, even with a 24hr target CPU time. In the last few months R@H has had very many WUs with 15-40 MB of input files per WU. So just one 4-thread computer (starting from a low-end Core i3/Athlon X4) uses 100-150 MB of internet traffic per day (+ some tasks often run less than 24hr in normal situations + some WUs end quickly with errors). For a Core i7/AMD FX8xxx, double the values. This matters mainly for 3G internet (which in Russia is often used as the main access channel outside the major cities) and for use at work (where an internet usage limit is set by the network administrator / company rules). Typical limits in such cases are 1-5 GB per month (for everything, not just for R@H exclusively, while R@H alone can easily burn through such a limit in 1-2 weeks)." It's 2012, soon 2013. I understand that people with such connections also want to crunch, but OTOH I also understand that the world can't wait until everyone has a "normal" internet connection and not something that belongs in the 90's. If the data is needed for the WUs, then that's how it is. I don't know if they use the best possible compression; possibly that could decrease the size of the files a bit, but it would mean a lot more load for the server, so possibly it isn't feasible. BTW, I'm currently on a 3G connection myself, usually down to 64kb/s because of reaching the limit, so I know how it is. 
But I don't expect websites and everybody else to consider that as something they have to "support". Quote: "+ R@H does not have (and has no plans to develop) a GPU computing app. Nvidia cards usually work for F@H, but for ATI (AMD) cards all recent F@H clients are ineffective (and R@H does not use them at all), so ATI cards very often sit idle while the machine does only CPU WUs. This is a significant waste of potential resources. So computers equipped with ATI cards (and probably AMD hybrid CPUs) are also candidates for transfer to other project(s)." There are many BOINC projects that support those cards, no need to let them idle. |
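The traffic figures debated above can be checked with some quick arithmetic. The sketch below rests on simplifying assumptions that are not from the thread: it counts only input-file downloads, assumes exactly one WU per thread per target run time, and takes 1 GB as 1000 MB. The function names are made up for the example.

```python
def daily_input_traffic_mb(threads, input_mb_per_wu, target_cpu_hours=24.0):
    """Download volume per day, assuming one WU per thread per target run time."""
    wus_per_day = threads * 24.0 / target_cpu_hours
    return wus_per_day * input_mb_per_wu


def days_until_cap(monthly_cap_gb, daily_mb):
    """Days of crunching before a monthly cap is used up (1 GB taken as 1000 MB)."""
    return monthly_cap_gb * 1000.0 / daily_mb


# 4 threads, 24hr target run time, 15-40 MB of input files per WU.
low = daily_input_traffic_mb(threads=4, input_mb_per_wu=15)
high = daily_input_traffic_mb(threads=4, input_mb_per_wu=40)
print(f"4 threads: {low:.0f}-{high:.0f} MB/day")
print(f"1 GB cap at {high:.0f} MB/day lasts {days_until_cap(1, high):.2f} days")
```

Under these assumptions a 4-thread machine lands at 60-160 MB per day, which brackets the 100-150 MB quoted, and a cap at the low end of the 1-5 GB range is used up within roughly one to two weeks. Doubling `threads` doubles the daily volume; a longer `target_cpu_hours` would reduce it proportionally.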
©2024 University of Washington
https://www.bakerlab.org