Work done - no "pay" for it

Message boards : Number crunching : Work done - no "pay" for it

Raistmer

Send message
Joined: 7 Apr 20
Posts: 49
Credit: 794,064
RAC: 0
Message 95178 - Posted: 23 Apr 2020, 6:54:06 UTC
Last modified: 23 Apr 2020, 6:55:19 UTC

Name Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d_924136_4_0
Workunit 1041653356
Created 22 Apr 2020, 21:53:15 UTC
Sent 22 Apr 2020, 22:47:25 UTC
Report deadline 25 Apr 2020, 22:47:25 UTC
Received 23 Apr 2020, 3:17:44 UTC
Server state Over
Outcome Computation error
Client state Cancelled by server
Exit status 202 (0x000000CA) EXIT_ABORTED_BY_PROJECT
Computer ID 4186879
Run time 2 hours 0 min 19 sec
CPU time 1 hour 58 min 27 sec

Validate state Invalid
Credit 0.00
Device peak FLOPS 2.49 GFLOPS
Application version Rosetta v4.15
windows_intelx86
Peak working set size 613.73 MB
Peak swap size 596.34 MB
Peak disk usage 978.98 MB
Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--il1r_design_boinc_v1.xml @flags_il6r -in:file:silent Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d.zip @Mini_Protein_binds_IL6R_COVID-19_1p9m_2_SAVE_ALL_OUT_IGNORE_THE_REST_2fu9wb9d.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 5000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1048673
Starting watchdog...
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x778E86FE

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.9.0


2 issues in one result:

1) The task was aborted by the project, so it's not the client's fault. The client did part of the work (elapsed and CPU times are not zero) but got zero credit. Incorrect use of the credit system in this case, IMO.

2) The task exited via an exception - not the best way to exit.
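
For a rough sense of the scale involved, here is a back-of-the-envelope estimate of the credit at stake, assuming the classic benchmark-times-CPU-time crediting (one cobblestone = 1/200 of a GFLOPS-day) and taking the task's reported 2.49 GFLOPS device peak as the benchmark figure:

# Rough claimed-credit estimate for the aborted task above.
# Assumptions: classic benchmark*time crediting; device peak FLOPS used as the benchmark.
cpu_time_s = 1 * 3600 + 58 * 60 + 27        # 1 h 58 min 27 s of CPU time
peak_gflops = 2.49                          # "Device peak FLOPS" from the task page
credit = peak_gflops * cpu_time_s / 86400 * 200   # 1 GFLOPS sustained for a day = 200 cobblestones
print(round(credit, 1))                     # ~41 credits' worth of work; granted credit was 0.00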
ID: 95178
magiceye04

Send message
Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 2
Message 95183 - Posted: 23 Apr 2020, 7:25:21 UTC
Last modified: 23 Apr 2020, 7:25:33 UTC

I also had about 70 project-aborted WUs last night.
Many of them were partly computed, some even fully computed.

I would really recommend testing these beta WUs on the Ralph project.
ID: 95183
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1508
Credit: 15,158,900
RAC: 23,033
Message 95184 - Posted: 23 Apr 2020, 7:35:20 UTC - in response to Message 95183.  

I also had about 70 project-aborted WUs last night.
Many of them were partly computed, some even fully computed.

I would really recommend testing these beta WUs on the Ralph project.
Or at least pay Credit for the work that was done before they were cancelled.
Grant
Darwin NT
ID: 95184
Tomcat雄猫

Send message
Joined: 20 Dec 14
Posts: 180
Credit: 5,386,173
RAC: 271
Message 95189 - Posted: 23 Apr 2020, 8:01:20 UTC - in response to Message 95184.  
Last modified: 23 Apr 2020, 8:03:51 UTC

I also had about 70 project-aborted WUs last night.
Many of them were partly computed, some even fully computed.

I would really recommend testing these beta WUs on the Ralph project.
Or at least pay Credit for the work that was done before they were cancelled.


Valid point.
I still think we need more testing on Ralph before releasing WUs here. Users here expect stable work, since this is not the test project (well, to be fair, Ralph has a pretty paltry user base compared to the main project, so bug testing will take a lot longer and problems may still slip through).
Furthermore, if a task was completed before it got cancelled, it seems fair to award credit. (if BOINC even allows that type of thing, that is)
ID: 95189
Raistmer

Send message
Joined: 7 Apr 20
Posts: 49
Credit: 794,064
RAC: 0
Message 95202 - Posted: 23 Apr 2020, 10:57:14 UTC - in response to Message 95189.  

(if BOINC even allows that type of thing, that is)

And if not, that's a hint of what to implement in the next release.
ID: 95202
Terrible T

Send message
Joined: 29 Dec 16
Posts: 4
Credit: 1,333,030
RAC: 0
Message 95205 - Posted: 23 Apr 2020, 11:40:03 UTC

After losing a good 200,000 seconds of computer power to cancelled and errored tasks (also cancelled), it would indeed be better to have some more testing prior to releasing work units.
ID: 95205
Profile [VENETO] boboviz

Send message
Joined: 1 Dec 05
Posts: 1884
Credit: 8,408,928
RAC: 10,757
Message 95208 - Posted: 23 Apr 2020, 12:15:44 UTC - in response to Message 95183.  

I would really recommend testing these beta WUs on the Ralph project.

+1
ID: 95208
Profile [VENETO] boboviz

Send message
Joined: 1 Dec 05
Posts: 1884
Credit: 8,408,928
RAC: 10,757
Message 95213 - Posted: 23 Apr 2020, 15:39:41 UTC

Not only tonight.
Now another 4 WUs aborted by the server.
ID: 95213
magiceye04

Send message
Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 2
Message 95218 - Posted: 23 Apr 2020, 17:30:33 UTC

Today I got defective WUs without checkpointing.
But the computer needed to restart for an external reason.
AGAIN many hours of wasted computing - everything starts over from zero.
Maybe I'll try a non-beta project in the next few days...
ID: 95218
bcov
Volunteer moderator
Project developer
Project scientist

Send message
Joined: 8 Nov 16
Posts: 12
Credit: 11,348
RAC: 0
Message 95229 - Posted: 23 Apr 2020, 19:47:19 UTC

Hey everyone,

Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.

I'll give you guys the full story so you can put what happened here in perspective.

1. We finally figured out how to do protein design on R@H
2. We started doing monomer design on R@H (these are future protein binders)
3. We worked hard and got an update out to allow Protein Interface Design on R@H
4. We started submitting interface designs to do massive sampling using R@H
5. These runs were too successful and it blew up the servers on our ends
-- We decided to remedy this by using filtering on the R@H jobs in this way. Only some of the outputs get stored on our servers and the rest are discarded as they are received
6. This new freedom allowed for even larger jobs to be submitted. Absolutely incredible designs are coming out the other side. This increase in compute power is equivalent to about 5 years of methods development.
7. These jobs have an early filter and a late filter. The early filter still takes time, but ensures that the protein is worth spending compute on. The job runs and then the late filter decides whether or not we'll keep the job
8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.
9. This data is still sent back to the server where it will be discarded. But some users were noticing data transfers of the maximum size of 500MB being sent back.
10. This was the point where the decision to cancel the jobs was made; we decided that people's internet bandwidth took priority over lost cycles.

What are we doing to remedy this:

We are working out a system whereby these early filtered jobs will either not be transmitted or be greatly reduced in size. We first and foremost want to make sure that users are credited for their work, but we also need to balance this with internet bandwidth.
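
Purely as an illustration (this is not the project's actual code; the function names and thresholds below are invented), the flow described above, together with the proposed trimming of early-filtered uploads, looks roughly like this:

import random

# Illustrative stand-ins for the real Rosetta filters (invented names and thresholds).
def early_filter(candidate):      # cheap pre-check: is this design worth full compute?
    return candidate["quick_score"] > 0.2

def late_filter(design):          # post-run check: is the finished design worth storing?
    return design["final_score"] > 0.8

def run_design_job(candidate):
    if not early_filter(candidate):
        # Proposed remedy: return only a tiny stub instead of the full output,
        # so the volunteer still gets credit without a huge upload.
        return {"kept": False, "upload_mb": 0.01}
    design = {"final_score": random.random()}    # stands in for the expensive sampling
    return {"kept": late_filter(design), "upload_mb": 5.0}

results = [run_design_job({"quick_score": random.random()}) for _ in range(100)]
print(round(sum(r["upload_mb"] for r in results), 2), "MB to upload;",
      sum(r["kept"] for r in results), "designs kept server-side")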


On the topic of the beta server:

We did run these on the beta server, and the jobs were successful. But this was a case of rare events.

The beta server gave us mixed results here, but they were not severe red flags. The jobs in question got unlucky with the set of proteins they were set to design and were also run on fast computers with long job lengths. Only about 1% of the jobs caused this data explosion and then it was only seen for a few specific configurations.

We now know what this looks like, though, and will make sure ...
ID: 95229
Mod.Sense
Volunteer moderator

Send message
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95230 - Posted: 23 Apr 2020, 19:56:42 UTC

I just want to point out that when bcov said "We finally figured out how to do protein design on R@H", he is distinguishing "design" from the "structure predictions" that have been run on R@h for years. Designing a protein from scratch is very different from predicting the shape that a given sequence of amino acids will take.
Rosetta Moderator: Mod.Sense
ID: 95230
Admin
Project administrator

Send message
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 95231 - Posted: 23 Apr 2020, 20:11:27 UTC

We have made designs using R@h in the past but not using the new protocols and sampling strategies that bcov et al are running for COVID-19 and large scale scaffold design and interface design in general. These new strategies that bcov helped develop with massive sampling are producing the largest number of good designs as judged by various metrics we use in the lab.
ID: 95231
Raistmer

Send message
Joined: 7 Apr 20
Posts: 49
Credit: 794,064
RAC: 0
Message 95241 - Posted: 23 Apr 2020, 22:14:42 UTC - in response to Message 95229.  
Last modified: 23 Apr 2020, 22:15:19 UTC



5. These runs were too successful and it blew up the servers on our ends
-- We decided to remedy this by using filtering on the R@H jobs in this way. Only some of the outputs get stored on our servers and the rest are discarded as they are received

Could you explain this in more detail? Why is the server required for this? Why can't it be done on the clients? Does the server compare received results from many clients at this stage?


8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.

And what prevents accepting the result, paying the credit, comparing it with another (as in 5) and then discarding it? Why is credit payment tied to keeping the result?


9. This data is still sent back to the server where it will be discarded. But some users were noticing data transfers of the maximum size of 500MB being sent back.

Perhaps it means some additional "filtering" should be done on the client side. That is, if many models are generated, report back only a fixed number of the best ones (this implies one can compare those models on the client, of course).
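
For what it's worth, that kind of client-side trimming is straightforward if the models can be scored locally. A minimal sketch (hypothetical field names, not Rosetta's actual output format; Rosetta scores are lower-is-better):

import heapq

def best_models(models, n=10):
    # Keep only the n best-scoring models for upload; the rest are dropped on the client.
    return heapq.nsmallest(n, models, key=lambda m: m["score"])

# Example: 10,000 generated models reduced to the 10 that would be reported back.
models = [{"id": i, "score": (i * 37) % 1000 / 10.0} for i in range(10000)]
upload = best_models(models)
print(len(upload), "models uploaded out of", len(models))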
ID: 95241
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1508
Credit: 15,158,900
RAC: 23,033
Message 95251 - Posted: 23 Apr 2020, 22:58:28 UTC

I understand that in this project even Tasks that have a Computation error can provide useful data for the project, but if a Task isn't a computation error, it should be considered Valid and get Credit for whatever work it has done.
So any Tasks that had been started and were then aborted by the server should count as Valid (if they have produced Valid work, of course); that way they will get Credit for any work done prior to being aborted. Unstarted tasks won't get Credit, but they will still count as a Valid result, not as an Error.
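
Written out as a decision rule, the proposal is roughly this (a sketch of the idea only, not how the BOINC validator is actually implemented):

def classify_server_aborted_task(cpu_time_s, output_is_valid):
    # Proposal sketched above: server-aborted tasks never count as errors.
    if cpu_time_s > 0 and output_is_valid:
        return "Valid", "grant credit for the work done before the abort"
    return "Valid", "no credit (never started, or nothing usable produced)"

print(classify_server_aborted_task(cpu_time_s=7107, output_is_valid=True))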
Grant
Darwin NT
ID: 95251
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1508
Credit: 15,158,900
RAC: 23,033
Message 95254 - Posted: 23 Apr 2020, 23:05:09 UTC - in response to Message 95229.  

Hey everyone,

Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.

I'll give you guys the full story so you can put what happened here in perspective.

...
Thanks for filling us in.
Along with consistent Credit (in amounts comparable with other projects), it will go a long way towards helping the project retain much of its recently acquired computing resources even after Covid-19 is no longer in the news.
I would suggest that similar posts, as new things are rolled out and existing ones are tweaked as results come back etc., will have the greatest benefit for retaining crunchers.
Grant
Darwin NT
ID: 95254
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2003
Credit: 39,046,832
RAC: 24,095
Message 95258 - Posted: 23 Apr 2020, 23:43:08 UTC - in response to Message 95229.  

Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.

I'll give you guys the full story so you can put what happened here in perspective.

5. These runs were too successful and it blew up the servers on our ends
6. This new freedom allowed for even larger jobs to be submitted. Absolutely incredible designs are coming out the other side. This increase in compute power is equivalent to about 5 years of methods development.
7. These jobs have an early filter and a late filter. The early filter still takes time, but ensures that the protein is worth spending compute on. The job runs and then the late filter decides whether or not we'll keep the job
8. Some computers are really fast and burn through tons and tons of early filters. Since we keep the output in order for users to get credit for their computation, this resulted in tons and tons of data.

What are we doing to remedy this:

We are working out a system whereby these early filtered jobs will either not be transmitted or be greatly reduced in size. We first and foremost want to make sure that users are credited for their work, but we also need to balance this with internet bandwidth.

On the topic of the beta server:

We did run these on the beta server, and the jobs were successful. But this was a case of rare events.

The beta server gave us mixed results here, but they were not severe red flags. The jobs in question got unlucky with the set of proteins they were set to design and were also run on fast computers with long job lengths. Only about 1% of the jobs caused this data explosion and then it was only seen for a few specific configurations.

Just quoting the bits I think are important. This is the good news.
The only comment I'd make is about ensuring checkpoints are made at key stages.
It sounds like the running jobs being cut short is annoying but relatively trivial, especially if the lessons have been learned so it won't recur.
If that's the case, I wouldn't waste much effort finding a way to compensate users. Annoying, yes, but bygones.

Great post.
ID: 95258
Profile [VENETO] boboviz

Send message
Joined: 1 Dec 05
Posts: 1884
Credit: 8,408,928
RAC: 10,757
Message 95278 - Posted: 24 Apr 2020, 7:33:58 UTC - in response to Message 95229.  

Hey everyone,
Sorry about the cancelled jobs. You're seeing the growing pains as we transition over to more design focused projects on R@H.
I'll give you guys the full story so you can put what happened here in perspective.

I like your work and I'll continue to crunch!!
ID: 95278
magiceye04

Send message
Joined: 11 May 11
Posts: 11
Credit: 1,702,178
RAC: 2
Message 95326 - Posted: 24 Apr 2020, 21:50:41 UTC

Thank you for the explanation!

Today I also had to abort some WUs. They consumed about 1.8 GB per WU and froze the PC. I only allowed about 12 WUs, but it was still too much for 16 GB of RAM.
Maybe these WUs could be sent only to PCs with a minimum of 4 GB per CPU core...
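
The arithmetic behind the freeze, and a crude way of picking a safe limit (numbers taken from this post; per-WU memory varies by batch):

ram_gb, per_wu_gb, wus_running = 16, 1.8, 12
need_gb = wus_running * per_wu_gb
print(round(need_gb, 1), "GB wanted vs", ram_gb, "GB installed")      # 21.6 vs 16 -> swapping, freeze
print("at most", int(ram_gb // per_wu_gb), "such WUs fit, before leaving headroom for the OS")  # 8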
ID: 95326
Jonathan

Send message
Joined: 4 Oct 17
Posts: 43
Credit: 1,337,472
RAC: 1
Message 95332 - Posted: 25 Apr 2020, 4:19:19 UTC - in response to Message 95326.  

What do you have set in your computing preferences? How much RAM are you allowing to be used? The tasks should have just gone into 'waiting for memory'.
ID: 95332
