From fe2386b484fd4b2d732288e3b57341976ebccd2d Mon Sep 17 00:00:00 2001 From: Katy Scott Date: Tue, 26 Nov 2024 11:51:54 -0500 Subject: [PATCH] feat: migrate Clinical Trial Curation page from BHKLab Confluence (#105) * feat: added tables support * feat: add Clinical Trial Curation page with data from BHKLab confluence * feat: add data curation page with short description * feat: add data process overview image for clinical trial curation page * fix: addressed review comments - links have been fixed - data processing overview numbering has been corrected --- .../img/Clinical_trial_curation_overview.png | Bin 0 -> 12359 bytes .../Clinical_Trial_Curation/index.md | 130 ++++++++++++++++++ .../Data_Science/Data_Curation/index.md | 7 + mkdocs.yml | 1 + 4 files changed, 138 insertions(+) create mode 100644 docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/img/Clinical_trial_curation_overview.png create mode 100644 docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/index.md create mode 100644 docs/disciplines/Data_Science/Data_Curation/index.md diff --git a/docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/img/Clinical_trial_curation_overview.png b/docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/img/Clinical_trial_curation_overview.png new file mode 100644 index 0000000000000000000000000000000000000000..7aedbc7f9715629fc3bf3343f6b49fe707e84cfe GIT binary patch literal 12359 zcmd6uc{tSn-|uI}3}YYr&LB(3l6{G>Wy!wFHb@fL*KA`D*-D6H-$Jsb$U0*!yGrCM zOR^-{3uQTPzx#Kt`}}#%b)EY@=P$X&HJ_Q!@_s*_kLRoVCPq5c6zmib2!vWsSIZ3i z#6uv^2_z}_SB1AT3jBZunCWOhYDYQ0Lm+5~o|d{rsN>IkvP=uh-7`-VR2~0~lbQTc z5B&vUb&Q1x>OLRaJKaZ!3;bfbqsia$dfo1S)i-%Wz<(1{6h+&9-HP~e7X9;SMfgfb zSY7y^x!UkFvS)w$tH)kchyNIPn&>`OaE1S(vqTIM%Dcgag(8vNF*r2Uu>iCIsS#@f zcgBd}B^Lyse$@E57E)(Ih3Dbl`7`m$lm7Z@l{&n)gq$kNKkudTYl=dzrkNqys02(%6SE=grwS zA4_#;|14)at-Z?-+W*I=Yobgq^VU7FDo!^7r8T$-(?HzomXta9WOQDfc#} z>Udt*HK(~XhV9*Cm$`*{;n4PYHE4T5{d$87VtK5v(5lYSu=Hwb2{=6gUjz+v$JI`E 
zXz%{nznIFY_^3aHO_P>KwTzUS9koKI78UUzFAGu}^y5?D$}pSbt)t!LiBNv?{otckx(=je3K?@--TnLfC);Dgr*aLRG4#dTm+W3|AE5N_(+gA@I zNhi^xvG0c-Un*4$STUJv5BuI7PnoG2dDn`xvOqObNn1;6{9~1Q*Z}4tIN8ggrB*Ml zXG*&p&BgTGqh4}2U3#Fl`^QrVQ|{G?rQ5gqc5#Nt}z_tE`3P` zQA_=)VXHif{&7i6qWyyO%FS0(+1p8=^!U-Gfd|qi!X#ug-wK^AsZgY}+`~KM2C9SL zj-D^LqHP+l4;^?eH9x;}cUNB?%6?F7 z^+I3XdluW3^yPl^*{{0j2B&^JZ_0f1S^kKcOJc};WVSP*7KRiu%G*YcvLLsAPP-`w zZ(xbM1-P@m+IY5soM~Ih9bZu#=^3A#)CwZNu^vlY#X!(VT`A)YmkkRDiTtXMgKw;? z=cfxzvcb|TcUnk?DVeDd4@=OS7-+rlP`1RXIhBhrBYY7D*^|_N8v}_x0*}&5oP=Zh z+hh#_eAoGO(XOG?#OL(*2<{0Lg@oWRG|KF#!1a&!-uFKMlV!PXde-55DgO%RPt^E=#SaoK}Vt>Z7 zu7OFN3*LG9>DL00M8u*JH6V|pwW5p91~(z}mAdB=)}7zcrYq`-odAvuyZy2*9Nx>6XlZi# zySZzJyv%p8J8CL^>bh%WGK<8D+(X(Tf)Yv?=HHVFWp5$<7{7rQ$`eCfBl5W{6=9=r z5I-y3>+rFfk|Q1@8H@~BPah+kNR3T1FVm%~3QwW_ki@m@u>IxNkEue>B?nqG8~SNi z@@)rspSdzPIsJC@1XGK-4HRCwf?wJKcj{-u(Cmke$calPowRvycpTxKRaF53UjC7( zpB0i4bNX6sM1+C=8ECwH*=sdMmjN&M9>MJ5Q>1mSL|OU2Rf& zG~tAFbiHMR=s;2f+f;q%vu#PsHfACQ!_&T4X#XkkH1WADlA1-muJez`ugxz{_t0u5 z`wVUI#hMAbOtPc9o$Gi0d3}jd5&7b2pGi8(Tkyn=~qSpTGY{uf2 zRc7%o3*TNt%V7pt)s}eKqJPa=6xl_RBcm>S2k%x!o!Jq7;61wi>h@$(=bv>^BkCnv z#0`2se?kClD8)PLD=L$Wx3w{e8vLei4Cyhwno^+`M(xK*KWp>vt=1Lbw$B7mNSLN! 
zi^zx&>W_))RTI4!?-5Ik@3lus2mEN^T1<0=1iD2)ukSA%iqX2r?gI;I_EFi_6C^T8 zbo|A4N~E`**-jK3-p91|X&(V^zEBma2E%`)2=x2|!`nTwXy<{d`ypCoBca4gdLnZV za78V!FS(t#g~u0|wsbHi_A%auPXF778&E=eU()*Hrvn2Ioc}X^zQ$kS)U~N1TyOYd zieeWAYg~&njMYvu&jd;4&QiMgoy8A1=a&K^csBbUX`5K;!tg;OCG28FtC z^>8RCQ5%(9)TvG2K@vrMkD2nQYVz3IjTu)?w-eO2o7n>;2JPSkoLWXg^hs7_+Jn;> z%+X`GucwJ%_S}@(+bPvIz>M+_^^_^$vb-#gbjz?kOqC zCrFX1$KjCaNdN8w3*_qS;f zs>p(b^f)?jz>o<|;B=*i%ahSOuJ3Tb(dkr1Xr=rT%CvwJm2on5WR?=ia5&k`-g>IF z?oqDTSYDD(78qaQ>r3p{Ew5q4!1+bu#YYfu2bu4_Trj*Qbp5?OW?VaiW{MZBz*zL( z`%)I!;{Nv!nSdX3|GdqS;(z5imDU=vRhwJ@>!i7ZneGUecN@)n!hPj#Z{6##t&W2= zKvMD@XQzLmEG0A~w@A9dMsv};!hk#f##llAOp}Kuz?2H-o_iCi9P-GZs6XF@lmgdK zhItC<^H~Y4K0{adZkXR4ZD7gNxGEcRr6Y-Aie8%sv}9&ES5(Wl2vf#1#6C!e#N{T<+m zHmNczzXd+WIYJIT5}*v)m`)dWXzdL+CyP4jJF8=b#+1xrzP+WvXv|_g957TlIOGSg zZXCXnKYU|Stg*lH)W?j7M^r_homzB8gsp#gM*9U&>}#KSBS0-ncfnfe{sX@y?=_9F zuC*_24f@eo&md%E@IMKMf<@fmO~7i!R_EU|vkC(}p2&T=9x~5wKlZ>U(K55@iaH79 zKD<;_WBao4QSrXOwa1dakcOy}Kj(Cvs~jSJXRiGTFHr$-LzzV9AwE~$$H+H+ZAfo| z6z;>@1XyN1;kbuZO%iDm4A|FWyhvTfwVz5UZ1;lv5kQ}+8}A+kZcIxtiw|ymetG27 zWSFlc`)h3?mR>-=S6}0aVu0Y!`L04zDpmqT0-xtp?Z`>j@$OLI>2X+LVJU4u%q}*o z2%8BGU;hh`D=pWRyP2+eO&%W_EeF&2$-Z<(K3#2$I=Op~1oHK-PuI!g7YP|9)W~V= z@9)c915&?GA=n3dYrAW&Lea(rD$W_n_#8gR_E7K!g6lJ{L}hP(Oa$!6p;r;Mx8fG! 
zBEH5G*?oynjnUIiVszHUuAD?}Qp2L$Mj=}O13z{~hF^bmo5`*KP?8y7)v#LMAP7eQ z39jBGF&Vp9>(pfgFfphb>Oq1JNx|wO+GeOUK+Fqc*_>|Zc}ZE^BJDmpZtWD=S4bQy z#qgf<$g>N2kH)tbdyZD;qyDDyVhmu*P{xhh8&<+TJ74=$)ttbR7QhT$>eeVGloR*} zyM*@u|M10q2Ow{(D3BFF9c|zk$oA^+lO-ffhDI4hWer_^e97)HxEcDPQ-Hiaf7CcU z$F*gNB?(7`W0C7&5>Ncz%;XB4a8jh z_Ycq9)zZw#DWdq7M`3w}M{k_55YNm7?gl*vFS@OfmIn@EhNM{3hjvku34PA?oL|Wks&lr62;5 z+)TjV?>@CHT0J9tX+)!I(>R>yP5ZO2Fyz4>?<~xIaWYmG^t|%QfYY&2M|E;?}+b$v-B~@nR&;k0IfuVy%zgXFL^en%hkj+AHrD zlyno}>lYUeWprZGot67)lkl?3lx)D*{@T;OeJXw(DYM!!FvKBuiNTZ0$90My5{E7Z z*tYocI%yRf+h15b$$&ta2duBsA0Ty=7T+d$T@-&BG1)_1mQ_N9>=zi%yl8&-iMBcX zZR?`DeJiFRlx7hmTEQDf57m3ac#0NWDnqvBxZ~9&Lvs-N+8Pv*OYXvSEE=7cAGV}x z1^cdltP-=fw_F-zX~vNvyUtFPohC~jfcrXYpgs5}J7A^Mm$54buK7wqDu2{rl;aSz zy*_X3{bMo=!M*eHqe>FOHYCk&Xzh7Hx^6NJX*mYBC387Qw_C<*`XM3$g!Rs^eIpAX z#gBYe-oP*lQz*i1$TXejBlf80NnNHuEb|S|u@+`-4icU_KR;K=V3`Sa7<(rE1c>eg zZEb#sbn4AMOiXAegqZ|^{rZO$eK@2G(|zfgSnU{yS_lV6eCW582ID2j@X6EerRBq= zlSG$@v1HO0ghNrwsq%A0Pvw1fN%$*40YC&+H2 zk@QBo!n`*nF&GVUV@BQp>(Z;#QT1>^PRWHgaBy>7NWeCll0DH^5xh6u<3%@4b~(k@ zNx=|}Bf%jd&UD8y^+7+ZHfcW_6Al*>c>FJupC3dk1vjwd+Z}rFCFJC$_$wGxofaN@ z>$5NQJ#EQ?8*+~{re7qIn4AZ`#J7Hp5Z3-HuwYKUC`S10P;R<0*qWOQj7R7y`$>sJFMB8!9t2j{z8!LV3T_c8clM4f*899KMa>&CYCs$$~v!*Bay$qmEXAoRkxd*vCF|>jU8t`F{7F+Uv$Ki z|2My+T^RMJYL&(GSoYHOCSkSCS~EJ7*X?VVdhz;1_fUdrE# z=xV&T)yH?NRy7SHQZM&jJsv`uy+u4rLrqV?{JDGPRZI~7UY&VjY>WkU2s{=ZRI^jBz4@}{;2s)&<- z>&nFg*3HRdFN$c-p z+q6E_9~0^_Y?gvbjBzD_JM>e6-KU>@t)0yL@tHA?H;J#>-wo+=R2K!zdrd_S-K_$sfB5mcsTOb@_dbeR)vyZa zXC^xIoaY%#4D4f9bTl`K>%VivXuxk-o?%fSS*gta7sG0jE$;A4wWIX3a~S$ z=T?`_efzn>N2KLvqSay7-}nn?#;8JJ!2(5?Bj8HE1ONH)g`GV2d=5`03$V|*ARujB z?U5xM{N-S2NHL{7W0&d%CwrRDS+nT)-*y$I0GXbO*q+nDK@vf5FeGOXEX#I^da}4{ zsqu-=zT@uXk}!Iu!RYW?hVfw=$dB9JQ9zVc0fAj3lL;=lDe!uct)w|S z0LT2%= z6spZj?|_TALqkZ@uEF)igaXXb@S3jY%@TLWCDB0!1LCmwm*B+MLjpQNcU-E0mem9a zSx(9CD|{;6LpF1r8n%uZzlmC(sv8Tq+#d;a{sY=6%W{3D2YBd{EWqW*cblc$jD1!sQ}1|5v`rz2xUQ0gqeyAFmVy6PW-RmBk8 z7K`~)Md~E%a$c-F3?Pf>gJ`)>MAbQ?Y!l~C7wFI)njR?}A%w5eOrVj11go>k-ex?A 
z?NWzQ{L*YS^?^qdhv}`DxmTta#)MrjxztDvHU2jvbkoW`@(Y;$^(||YTt=H|u zV5sn^FvX7FGoD@oyd*7(ZuYV2$BR7)pU0k1GSbYefud;cgpx)4Tiv2GT_SJe)f;Ia zV6&88Nsc!nAKS>fIBKcImq@gicajD1YACvIU;wVq*w8OjX%3FSACT2Z(;%vx>N zEI!rq-30&_ryb$rUv`!~sdeBq6_>b)NV5r+Syuebjo+0kAAql)V-;3t5OZkFbStAm zs=BYQY99K*1Gk?5OIhwG{-9c|NylX6lP%X041+__(b0L_()x3?o_VrmnxPe{%xDZ()d63Xt`4y4V%CdY3h9x9-?bLPA zPwq@OaR~ElrO}Wfo5-0*Br$%|eJgmbtsmUbPu=!dhjHr`^&)DdO`<$iaPZ5F*R3s7 z*J{=KL~lNCn@RE{w{#;rNwU>O6}7KnP1s;H5}z=>0SlPT%c2!?`uUjA3BCD-f_qko zxHT8jEMCG49RBp5x7pnj@qV)@qe0W0;CPf{;iWMCG*rs5*)NMJO5NOXLdFH-gcuT!(LBZDwlp^Bn`HpOC%{sw2bR<( z0fXo)9`HowS3pGJt76hV4S+Jf(h*fy4V`dn>`I#DXu zO~;EnbsBiQFY`l)H!G8SqQvc62+Y{6xelrrQyNYM3`kcM6^2jWPdP;!0xeJm@{kFr zT+X*g8UJs(?9w->u0$L@1lu77Kuq!B&e9&G++BbnAV8M^7i0c!5ZE78-Wt!9^*lTf zoT_s&2AlnPKR6CP=|!-1$$;=iDj|2xCKj>`3ebFD-t@t9s`?A!_Iqo-kjHu~S5|bY z3M>hiZ0d3J{~^w>^1DPbp88$opmpt`L0t%h_Sg9=fL74@$Hgr&YId#j)&lGvrohj5 zfL+|8#zsoSzD1Hr%;sTh;CGE==1V2Ox_kh>$~7Z)`!uV$%KVvGony!O9vPJdLDhe7 z<~5lM-}+a#eNF|@1^@P7=NQ8w=T!v$@B?7Aqcb|sJf)x{(0G6Kqjn$7D+bzZ{PmrG zrrDw05l6c_V?bM_%CdIP@Ogb{&C<^nHw3vom63I>E4pxr6jWltz0{zsx+$P;n?S>r zR3>QyBgjM@hq5aL3g@du7x=FXzk3W(zGB;WgBw+}%`SUe<=Hm)+LUtx#gEu_-O z=TnOXwOlldF}Q*I*CNfOcy5^%j`bwhZ7-O~GGG%VtVA z75u$tS#4HMK~@&F`|bXvSk#;ptc->y)yd-w=v0*hBxu>Ce*OF>oBo0>0h4zn{4ux@ z`}Moiya90W$eV&jju3(Z6?{3kulyGh(DXVc@KT_Mzm5%MUevUE?X_q_QcLzM75FEA z+-Cq`xBq#4r$!wY(=*4HQ%)=Glk;P(31*E;=4&AjcoX!4x#RF_^8Wj45?7`WE_61e zjK=~bEo3$f(L#^cRu)zcVECu+Em|a?>dwn=6oa5dLl2SJY#a(U*kdXUWM!?u8<)ll zKQWTKWH+7Bmk0XjHpu|#u`npUc~2^zh7|W(Jh!!-4ZQ-7b{bRpw3ATY%6^DcSK;+T zk+*KFNdayyN(Job8}_3M$7vU6(-IPBc`ATN@Ws1>%G>+hXt3J91|gRIxymW+$8mp2 zB(1!qPiqq@xpa#h6}&OMGkdKae1Cm#!e3p9vY>8yi6lj~%*b|)k}Af7(0V`Wct@ub z(K~i6TfAKD^e|Q;%-0ylC{_E=Pce=|`vyz8Z7+hJf?lr$r=i3)*yDz(O@Th`O8HiB zCA>0Vb@cdlbXof|gC`mgXi{8Icb~;%wdI@W;bkCz%0c=4aE}QBmCpcvKg9BTr_u-J zIB*NTEb{x55VSVeq3pET1Zu9U54cD4N`b(xZ+t4_p#hlw*<43>fD)U0O3NL%Ov_2n z7sw->x2z?Xc9B14UM+<~A`%=nMEUOz3wraUC*J;8N#|s?xI&LcFUI+$H;LH27IRmJ 
z#-)c|w(ntaiXdhiT_yHhZW;G)yG02b7N-bihLgv1o&5eMA}n^5x<1*jJbZu6^d4M3 z7WnrYsE3NZK@I82gj%@F;aDj38K@a-8X!cctMpUDEk;+xVF8$~misao!UxlOOG5VW8X7U!`%nD>jGtmDg}+J9iqGJ&Eu`=py0AqkLrw(ahPR5`y>=)rW(+jq zeGnu_Z;SaUWV)yk{^Z0OK5r^hGQ=ks1Okr+zNGzqgR72lE=Q0QGltORi>Rj(W~H>A zoc97*ZL zasDT67m9D?Sz&EaoP0yb0~l(%jD96f<=#;e8w5?GZl;g%WrO!ZLx-u_XqjOvY&x=K zkDz4uJ`jHVsz37}>#1~kmO{xqV_bO{DNx657Zz_94TIT<-3)`!6l!L~&#$2Ip63n5 zNRpdiciVuz{_bPL6G5Au4g!ZRJyGeQYRJo5tL!P^@Vt>76`W~32?#fDg%EIikj7TT ziWZq4GHGo<;=z8E=NS2GQuiGF^y7k{Heg0UsgQNy419$T&y2^Ed7dFS$)M;cxlOtS z7z8GUqF_uhN8tM_H|d7GM=!B9kr1n~4^K2-(-E_?VC9SZ&YS{^?7o+!&ZvdJnN4w1 zdAH#y*3=ViC0%H>;@NC8jmKJ>+~afBd+1UCvRE5^-cgV+3{QQ@u{H4i78DFZ^!RCe zse0XZoq1eMfqdLs7A*i(kLI$wUW~))8T_%PN9|8Flr#N5d&#*b7VGiY?BXr@=~tfF zp03&|J$1je%x9y1T?<@#ckt-~0J$1N(f!+t*-ks2-iIUKOSF@(=Ar?_fig8Ayx|6& z1=AxO)*%3fw}Dq^gB}$QzfX7S&*ehb%U9_`w_U;1?l|B?d`B3pBL3BK-rRc8Ja>0f zAor7|WYBF9A36w$V}N@*`&9N?sgmj+b0$Yf5{ugtCwO(*^`B!FVW|2W(o`m#!3}S) z1Akj}B)jydY)bShg2yagLo}IvjF(VIZyaa2?7##mr*9NA}Z9r}q z&9hHt4!|@<0F{^^@Sm<}JmJ#RLdC@-YE&Sb`Tw)(^#6K;n+0BA8Z62B5WZqyh>{Pz^S$_H%*A82S6N`{UJ8MkIL0Qapn4 zJy^g?8UKJ@wGxE;K(Mq9;S)WyhXsID3Ow3%)6Wo0UG}Czof8vy_z+sd07Wn0 zj#SXAzMBW~=|PBsm~obfbzhAP&{6t!3JsJ>`*RLApa&_ES zG&&Pp{~hn&q|j4_n2ZqOXsqqo`tv%KQ>HdfXcZJzq%L{yGQjQEmqFM6419t{0fNy0 zjVuP&?fJ2CXV#|PWcsKRqu{#$v6HePA;5k>vYl}_8!|}_C?7@(6!$de=I)ZlbQ@`o zhp?&$s$LpPx1evrts0?=QVqThE?}q-HQVL5uGsR>oj#T;=RFe`gld7Tgn)EnQDr6& zOztaj$Ss5*i32gq1PsJ}9zm~QpuXEjb3zn7eCXF;mnUmuw_$7%J?28Z+Tbu~!>T0)5ScX9`5c1vDxXc?R5>>X4nUyi2J_L<}eNF60&wvF4PT3m-7Eu_8Tk~_e-E#vkUh_UD6qlExENTnb+ z_3`@(1PIjC^10Q>%fG2KSO*gxW+Yk$^z**$GAuSCJ|T2jPfZ48bGGQ}d#J4!?$8xU z!p#*F(MdcYD)I=K`mz(oOg0LmPu~?T4X1q|ix3Hx2o`Wbh=%{YF*uZg+ia%97CC`EH`D5Vm8hQ*xxN26l2y5V~HB1NjS}u(Pi|x1n;6wlUxJM^+2Shh`e0JI!OTL0dL%IpZSi;)+y8~htW?UV)xqG8&`XNpr8O6bnAr&hM zh4QxS(tMtP$33((^c50B!>#9dm2+{Z#%l7-UU=NXpz;kJ29$=CEHOU^hq4b#*tiAL zcv~&l!oYySi*7ySGQ=QlC*dJ3Fb&^juEtVwB=7Uw1tS9tvbJ3_SO&^_!;)e4J~@(@ p74kMs5FJ-@8?P+&?}MJwHgN6zqOTx72LCevqNi=7RjYx;{|{eriTMBk literal 0 HcmV?d00001 diff --git 
a/docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/index.md b/docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/index.md new file mode 100644 index 00000000..97720258 --- /dev/null +++ b/docs/disciplines/Data_Science/Data_Curation/Clinical_Trial_Curation/index.md @@ -0,0 +1,130 @@ +# Clinical Trial Curation + +## Immunotherapy datasets +### Introduction +This documentation goes over the clinical trial data curation process in detail, using immunotherapy data. + +### Objective +The objective is to curate a clinical dataset into R's [MultiAssayExperiment](https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) object. An example of a clinical data MultiAssayExperiment (MAE) object can be found in [ORCESTRA](https://zenodo.org/records/7332074). + +Currently, a clinical data object contains the following data parts: + +1. Clinical metadata: Contains patient/sample metadata. +2. Molecular profiles: Molecular assay data (currently RNA-seq, SNV or CNA), formatted as either a [RangedSummarizedExperiment or regular SummarizedExperiment object](https://bioconductor.org/packages/devel/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html). + +### Data Access +#### Public data + +If the source is PubMed, the raw omics files and clinical response metadata are available from the supplementary materials or external repository links in the Data Availability section of the paper. + +#### Private data + +Private data, such as PHI or clinical response, might be available only upon request. Please contact the author(s) or whoever is responsible for requesting such data. + +### Data Processing Overview +![](img/Clinical_trial_curation_overview.png){: align=left height=25% width=25% } + +An example of a clinical data processing pipeline is available as [a Snakemake pipeline](https://github.com/BHKLAB-DataProcessing/ICB_Braun-snakemake/blob/main/Snakefile).
+ +Generally, the curation process follows the steps outlined below: + +1. **Download source data**: Download data from publications or data repositories. The source data can be in various formats, such as Excel, CSV or TXT files. +2. **Process raw molecular data, if available**: The RNA-seq processing from raw FASTQ is outlined on the [RNAseq raw processing page](https://collaborate.uhnresearch.ca/confluence/display/BHKLabPRC/RNA+seq+raw+processing). +3. **Add annotations**: Ensure that genes, tissues and treatments are annotated with metadata available from external sources and with lab-standardized columns. +4. **Create RangedSummarizedExperiment or SummarizedExperiment (SE) object**: For the molecular data, we prefer RangedSummarizedExperiment as it is compatible with the [GenomicRanges R package](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html). +5. **Create MAE object**: Format the downloaded data into a layout and structure suitable for creating an MAE object. In this step, the data is extracted from its source format and written to a CSV or TSV file. Then integrate the molecular data into the MAE. + +### Processing Clinical Metadata +The clinical data should be formatted with patient/sample IDs as rows and attributes as columns. This table is added as the `colData` of the SE or MAE object.
+ +The following columns are mandatory. If the data is not available, fill them with NA to maintain consistency across ICB and non-ICB datasets: + +| **Column name** | **Description** | +|------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **patientid** | This column contains unique patient identifiers | +| **treatmentid** | This column contains the treatment regimen of each patient. Individual drug names are separated by ":" and standardized based on the lab's nomenclature. For example, the drug combo "FAC" is represented as "5-fluorouracil:Doxorubicin:Cyclophosphamide" | +| **response** | This column contains the response status of the patients to the given treatment - Responders (R) and Non-responders (NR) | +| **tissueid** | Cancer type standardized based on the lab's nomenclature from Oncotree. Example: “Breast” | +| **survival_time_pfs/survival_time_os** | The time from the start of treatment to the occurrence of the event of interest. The event name, such as "pfs" or "os", must be appended to survival_time to differentiate the survival measure. Example for data in this column: “2.6” | +| **survival_unit** | The unit in which the survival time is measured. If the survival time is reported in other units, such as “day” or “year”, it must be converted to "month" for consistency | +| **event_occurred_pfs/event_occurred_os** | Binary measurement showing whether the event of interest occurred (1) or not (0). The event name, such as "pfs" or "os", must be appended to event_occurred to differentiate the survival measure | + +!!!note + Common columns must appear first in the metadata, followed by the rest of the columns.
You can add other columns using their names from the source data, but the standard columns with the above-mentioned names must be present. + + If you are adding new columns based on restructured data from existing columns, please assign lucid, self-explanatory column names. + +The table below shows other columns that are common across the 19 curated ICB datasets. + +| Column name | Description | Type | +|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------| +| age | Age | source | +| AMP | Sum of total AMP/coverage; calculated from CNA values | in-lab curation | +| cancer_type | Type of cancer tissue | source | +| CIN | Calculated from CNA values | in-lab curation | +| CNA_tot | Sum of total CNA/coverage; calculated from CNA values | in-lab curation | +| DEL | Sum of total DEL/coverage; calculated from CNA values | in-lab curation | +| dna | DNA sequencing type, e.g. whole exome sequencing | source | +| histo | Histological info such as subtype | source | +| indel_nsTMB_perMb | - | in-lab curation | +| indel_nsTMB_raw | - | in-lab curation | +| indel_TMB_perMb | - | in-lab curation | +| indel_TMB_raw | - | in-lab curation | +| nsTMB_perMb | - | in-lab curation | +| nsTMB_raw | - | in-lab curation | +| recist | Response annotated using RECIST. The most commonly used responses are CR, PR, SD and PD. | source | +| response.other.info | Same data as Responders (R) and Non-responders (NR) | source | +| rna | Type of processed RNA data, e.g. TPM | source | +| sex | Sex of the patient - Male or Female | source | +| stage | Cancer stage | source | +| survival_type | PFS or OS or both (denoted by '/'). If both, added by in-lab curation | in-lab curation | +| TMB_perMb | TMB per megabase (Mb), calculated as TMB = mutns/target.
With mutns = number of non-synonymous mutations and target = target size of the sequencing. See Supplementary Table S2 of https://pubmed.ncbi.nlm.nih.gov/36055464/ | in-lab curation | +| TMB_raw | Tumor Mutation Burden raw values | in-lab curation | +| treatment | Drug target or drug name | source | + + +### Processing Molecular Data + +The raw omics data files are obtained and processed in the lab. If the raw files are not available, processed data is used. The exception is mutation data, for which only processed data is used to avoid ambiguity around matched normals. + +In general, all molecular data should be formatted with genes (e.g. transcript IDs for RNA profiling) as rows and patient/sample IDs as columns. + +#### RNA-seq data +First and foremost, **the RNA-seq data should be at gene-level and in TPM**. The TPM values should be log-transformed as log2(TPM + 0.001). + +If TPM values are not available but count values are, you can use the following function to convert counts to TPM: + ``` + GetTPM <- function(counts, gene_size) { + x <- counts/gene_size # counts per unit of gene length + return(t(t(x)*1e6/colSums(x))) # scale each sample (column) to sum to 1e6 + } + ``` + +If available, counts and transcript-level data (isoforms) should also be included. + +#### SummarizedExperiment Object +Each molecular data type needs to be formatted into a SummarizedExperiment (or RangedSummarizedExperiment) object. + +At minimum, SummarizedExperiment requires: + +1. **colData** (the patient metadata) formatted with patient/sample IDs as rows and attribute data as columns. +2. **assay** (expression values) formatted with gene/transcript IDs as rows and patient/sample IDs as columns. +3. **rowData** (gene metadata) for the genes present in the assay, formatted with gene/transcript IDs as rows and attributes as columns. More details on the gene metadata are given below. + +### Annotation +Lab-standardized annotation data are stored in BHKLab-Pachyderm's Annotation repository.
+ +#### Gene Annotations +Gene metadata is obtained from Gencode annotations. We have a few versions of Gencode annotation data available in .RData files. An .RData file includes data frames that contain gene and transcript information, such as features_gene, features_transcript and tx2gene. Some of the available gene annotations include: + +- [Gencode v19](https://github.com/BHKLAB-Pachyderm/Annotations/blob/master/Gencode.v19.annotation.RData) +- [Gencode v40](https://github.com/BHKLAB-Pachyderm/Annotations/blob/master/Gencode.v40.annotation.RData) + +!!!note + Please use the most recent version available in this repository for your gene annotations. The Gencode version must be chosen after checking the reference genome. Follow the Gene curation SOP for detailed steps. + +#### Drug Annotations +For clinical data, drug annotations are performed on a case-by-case basis. For immunotherapy treatments, both anti-"target" names (e.g. anti-CTLA4) and monoclonal antibody brand names can be present. Please follow the Drug curation SOP to correctly annotate such cases using the standard lab files in the [Annotation](https://github.com/BHKLAB-Pachyderm/Annotations) repository. + +#### Tissue Annotations +Tissue annotations that cannot be mapped to the standard lab files in the [Annotation repository](https://github.com/BHKLAB-Pachyderm/Annotations) using the Tissue curation SOP must be reviewed manually on a case-by-case basis. \ No newline at end of file diff --git a/docs/disciplines/Data_Science/Data_Curation/index.md b/docs/disciplines/Data_Science/Data_Curation/index.md new file mode 100644 index 00000000..e4b2f630 --- /dev/null +++ b/docs/disciplines/Data_Science/Data_Curation/index.md @@ -0,0 +1,7 @@ +# Data Curation + +## Overview + +Data curation is the process of preparing data for analysis. It involves identifying, cleaning, and transforming data to ensure its quality and usability.
Data curation is an essential step in the data analysis process, as it helps to ensure that the data is accurate, complete, and relevant for the analysis. + +DataRaven has established standard operating procedures (SOPs) for different data types. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 89a1cc55..7729778e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -24,6 +24,7 @@ markdown_extensions: - attr_list - md_in_html - footnotes + - tables plugins: - redirects: # handles URL redirects for moved pages