Evaluating Utility and Resemblance in Synthetic Data Generators: A Technical Deep Dive and Comparative Analysis

Published:
February 27, 2024

Introduction

In today's digital era, awareness of data privacy has grown significantly. Users increasingly recognize their data as a unique digital fingerprint, which puts their privacy at risk in the event of a data breach. This concern is amplified by regulations such as GDPR, which empower users to request the deletion of their data. While much needed, this legislation can be very costly for companies because access to data is minimized, and working around those restrictions often consumes considerable time and resources.


What are synthetic data generators?

Enter synthetic data, a solution to this conundrum. Synthetic data generators create datasets that mimic real user data while preserving anonymity and confidentiality. This approach is gaining traction across industries, from healthcare to finance, where privacy is paramount.

This post is written for data professionals and enthusiasts and focuses on the evaluation of synthetic data generators. We will look at key metrics and carry out a comparative analysis between Syntho's Engine and its open-source alternatives, offering insight into how to assess the quality of a synthetic data generation solution effectively. We will also evaluate the time cost of each of these models to give further insight into how they behave in practice.

How to choose the right synthetic data generation method?

In the diverse landscape of synthetic data generation, there is an abundance of methods available, each vying for attention with its own capabilities. Choosing the most suitable method for a particular application requires a thorough understanding of the performance characteristics of each option. That in turn calls for a comprehensive evaluation of the different synthetic data generators against a set of well-defined metrics, so that an informed decision can be made.

What follows is a rigorous comparative analysis of the Syntho Engine alongside a well-known open-source framework, the Synthetic Data Vault (SDV). In this analysis, we used several commonly used metrics covering statistical fidelity, predictive accuracy and inter-variable relationships.

Synthetic Data Evaluation Metrics

Before introducing any specific metric, we must acknowledge that there are many schools of thought on evaluating synthetic data, each of which gives insight into a particular aspect of the data. With this in mind, the following three categories stand out as important and comprehensive, and together they cover different aspects of data quality. These categories are:

      1. Statistical Fidelity Metrics: Examining basic statistical features of the data, such as means and variances, to ensure that the synthetic data aligns with the statistical profile of the original dataset.

      2. Predictive Accuracy: Examining the performance of a machine learning model trained on original data and evaluated on synthetic data (Train Real - Test Synthetic, TRTS), and vice versa (Train Synthetic - Test Real, TSTR).

      3. Inter-Variable Relationships: This combined category includes:

            • Feature Correlation: We assess how well the synthetic data maintains the relationships between variables using correlation coefficients. A well-known metric such as the Propensity Mean Squared Error (PMSE) would be of this type.

            • Mutual Information: We measure the mutual dependence between variables to understand the depth of these relationships beyond correlation alone.
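To make the difference between the last two bullets concrete, here is a minimal sketch using only the Python standard library. The toy columns and the function names are our own illustration, not part of any evaluation library. In practice one would compute both quantities for the same column pair in the real and in the synthetic data and compare them.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Toy column pair (illustrative only): y depends on x, but not linearly.
x = [-2, -1, 0, 1, 2] * 20
y = [v * v for v in x]

print(pearson(x, y))             # 0.0 here: no linear relationship
print(mutual_information(x, y))  # about 1.05 nats: y is fully determined by x
```

The correlation coefficient misses the quadratic dependence entirely, while mutual information picks it up, which is why both views are worth checking.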

Comparative Analysis: Syntho Engine vs. Open-Source Alternatives

The comparative analysis was conducted using a standardized evaluation framework and identical testing techniques across all models, including the Syntho Engine and the SDV models. By synthesizing datasets from identical sources and subjecting them to the same statistical tests and machine learning model assessments, we ensure a fair and unbiased comparison. The section that follows details the performance of each synthetic data generator across the range of metrics presented above.

           

As for the dataset used for the evaluation, we used the UCI Adult Census Dataset, which is well known in the machine learning community. We cleaned the data before any training and then split the dataset into two sets (a training set and a holdout set for testing). We used the training set to generate 1 million new data points with each of the models and evaluated various metrics on these generated datasets. For the machine learning evaluations, we used the holdout set to compute metrics such as those related to TSTR and TRTS.
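The split described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name, the 20% holdout fraction and the seed are assumptions, since the post does not state the split ratio.

```python
import random

def train_holdout_split(rows, holdout_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, holdout).

    holdout_fraction and seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

rows = list(range(1000))  # stand-in for the cleaned dataset rows
train, holdout = train_holdout_split(rows)
print(len(train), len(holdout))  # 800 200
```

The important property is that the holdout rows are never shown to the generator, so they can later serve as an untouched "real" test set.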

           

Each generator was run with its default parameters. Since some of the models, like Syntho, can work out of the box on any tabular data, no fine-tuning was done. Searching for the right hyperparameters for each model would take a significant amount of time, and Table 5 already shows a large time difference between Syntho's model and those it was tested against.

           

It is worth noting that, unlike the other models in SDV, the Gaussian Copula Synthesizer is based on statistical methods. The rest are based on neural networks, such as Generative Adversarial Network (GAN) models and variational autoencoders. This is why the Gaussian Copula can be seen as a baseline for all the models discussed.

Results

Data Quality

Figure 1. Visualization of basic quality results for all models

How closely each model adheres to the trends and representations in the data, as discussed earlier, can be seen in Figure 1 and Table 1. Here, each of the metrics in use can be interpreted as follows:

• Overall Quality Score: An overall assessment of the quality of the synthetic data, combining various aspects such as statistical similarity and data characteristics.
• Column Shapes: Assesses whether the synthetic data maintains the same distribution shape as the real data for each column.
• Column Pair Trends: Evaluates the relationship or correlation between pairs of columns in the synthetic data compared to the real data.
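As an illustration of what a "Column Shapes" style score measures, the sketch below implements two common ingredients from scratch: the complement of the total variation distance for a categorical column and the complement of the two-sample Kolmogorov-Smirnov statistic for a numerical column. As we understand the SDMetrics documentation, these are the kinds of per-column scores that feed into Column Shapes, but the code here is our own simplified sketch with made-up toy data, not the library's implementation.

```python
from bisect import bisect_right
from collections import Counter

def tv_complement(real, synth):
    """1 - total variation distance between the category frequencies."""
    pr, ps = Counter(real), Counter(synth)
    categories = set(pr) | set(ps)
    tvd = 0.5 * sum(abs(pr[c] / len(real) - ps[c] / len(synth)) for c in categories)
    return 1.0 - tvd

def ks_complement(real, synth):
    """1 - two-sample KS statistic (largest gap between the empirical CDFs)."""
    r, s = sorted(real), sorted(synth)
    gap = max(
        abs(bisect_right(r, v) / len(r) - bisect_right(s, v) / len(s))
        for v in r + s
    )
    return 1.0 - gap

# Toy columns (illustrative only).
print(tv_complement(["a", "a", "b", "c"], ["a", "b", "b", "b"]))  # 0.5
print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))                  # 0.5
print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))                  # 1.0
```

Both scores lie between 0 and 1, with 1 meaning the synthetic column's distribution is indistinguishable from the real one under that measure.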

Overall, Syntho achieves very high scores across the board. To begin with, looking at overall data quality (evaluated with the SDV metrics library), Syntho achieves a result upwards of 99% (with a column shape adherence of 99.92% and a column pair trend adherence of 99.31%). By comparison, SDV reaches at most 90.84% (with the Gaussian Copula, which has a column shape adherence of 93.82% and a column pair trend adherence of 87.86%).
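The figures quoted above are consistent with the overall score being a plain average of the two property scores. That aggregation rule is our assumption here, though it does reproduce the 90.84% figure from the two Gaussian Copula scores. A quick check:

```python
def overall_quality(column_shapes, column_pair_trends):
    """Average the two property scores (all values in percent).

    Assumes the overall score is the unweighted mean of the two properties.
    """
    return (column_shapes + column_pair_trends) / 2

syntho = overall_quality(99.92, 99.31)
sdv_gaussian_copula = overall_quality(93.82, 87.86)
print(f"Syntho: {syntho:.2f}%  SDV Gaussian Copula: {sdv_gaussian_copula:.2f}%")
```

The Syntho value lands a little above 99.6%, in line with the "upwards of 99%" statement.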


Table 1. A tabular representation of the quality scores of each generated dataset per model

Data Coverage

The Diagnostic Report module of SDV brings to our attention that the SDV-generated data (in all cases) is missing more than 10% of the numeric ranges. In the case of the Tabular Variational Autoencoder (TVAE), the same amount of categorical data is also missing when compared to the original dataset. No such warnings were generated for the results achieved with Syntho.

           
           

Figure 2. Visualization of average column-wise performance metrics for all models

In the comparative analysis, the plot in Figure 2 shows that SDV achieves marginally better results in category coverage with some of its models (namely GaussianCopula, CopulaGAN, and Conditional Tabular GAN, CTGAN). Nevertheless, it is important to highlight that the reliability of Syntho's data surpasses that of the SDV models, as the discrepancy in coverage between categories and ranges is minimal, at just 1.1%. In contrast, the SDV models show a considerable discrepancy, ranging from 14.6% to 29.2%.

           

The metrics represented here can be interpreted as follows:

• Category Coverage: Measures the presence of all categories in the synthetic data compared to the real data.
• Range Coverage: Evaluates how well the range of values in the synthetic data matches the range in the real data.
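The two coverage metrics above can be sketched in a few lines. The formulas below follow our reading of how such coverage scores are usually defined (the share of real categories that reappear, and the share of the real min-max range that is spanned). They are a simplified illustration with made-up toy values, not the SDV implementation, and the range version assumes the real column is not constant.

```python
def category_coverage(real, synth):
    """Fraction of the real column's categories that also appear in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synth)) / len(real_categories)

def range_coverage(real, synth):
    """Fraction of the real [min, max] range spanned by the synthetic column.

    Assumes max(real) > min(real); a constant real column is not handled.
    """
    r_min, r_max = min(real), max(real)
    width = r_max - r_min
    low_gap = max((min(synth) - r_min) / width, 0.0)   # part missing at the bottom
    high_gap = max((r_max - max(synth)) / width, 0.0)  # part missing at the top
    return max(1.0 - (low_gap + high_gap), 0.0)

# Toy columns (illustrative only).
print(category_coverage(["Private", "Gov", "Self-emp", "Never-worked"],
                        ["Private", "Gov", "Private"]))  # 0.5
print(range_coverage([0, 10], [2, 7]))                   # 0.5
```

A warning such as "more than 10% of the numeric ranges is missing" corresponds to a range coverage below 0.9 under this kind of definition.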

Table 2. A tabular representation of the average coverage of a given attribute type per model

Utility

Moving on to the utility of synthetic data, the question of training models on the data becomes relevant. To keep the comparison between all frameworks balanced and fair, we chose the default Gradient Boosting Classifier from the scikit-learn library, as it is widely accepted as a well-performing model with out-of-the-box settings.

           

Two different models are trained, one on the synthetic data (for TSTR) and one on the original data (for TRTS). The model trained on the synthetic data is evaluated on the holdout test set (which was not used during synthetic data generation), and the model trained on the original data is tested on the synthetic dataset.
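The protocol above can be sketched end to end with the standard library alone. To keep the example dependency-free, a tiny nearest-centroid scorer stands in for the Gradient Boosting Classifier, and randomly drawn Gaussian blobs stand in for the real, holdout and synthetic datasets. All function names and the toy data are hypothetical, and only the train-on-one, test-on-the-other structure mirrors the experiment.

```python
import random

def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid_scorer(X, y):
    """Stand-in model: higher score means closer to the class-1 centroid."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]
    c0, c1 = centroid(0), centroid(1)
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return lambda x: dist2(x, c0) - dist2(x, c1)

def train_and_test(train, test, fit=fit_centroid_scorer):
    """Fit on one (X, y) dataset and return the AUC on another."""
    model = fit(*train)
    X_test, y_test = test
    return auc(y_test, [model(x) for x in X_test])

def toy_data(n, rng):
    """Two Gaussian blobs, used here for 'real', 'holdout' and 'synthetic' alike."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
real_train = toy_data(500, rng)
real_holdout = toy_data(200, rng)
synthetic = toy_data(500, rng)

tstr = train_and_test(synthetic, real_holdout)  # Train Synthetic, Test Real
trts = train_and_test(real_train, synthetic)    # Train Real, Test Synthetic
print(f"TSTR AUC = {tstr:.3f}, TRTS AUC = {trts:.3f}")
```

When the synthetic data resembles the real data closely, the two AUC values should be close to each other, which is the property the comparison looks for.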


Figure 3. Visualization of Area Under the Curve (AUC) scores per method per model

The results visualized above demonstrate the superiority of synthetic data generation by the Syntho Engine compared to the other methods, as there is no difference between the results obtained by the two evaluation methods (pointing towards a high similarity between the synthetic and real data). The red dotted line in the plot is the result of a Train Real, Test Real (TRTR) run, which provides a baseline for the observed metrics. This line represents the value 0.92, which is the Area Under the Curve (AUC) score achieved by the model trained on real data and tested on real data.


Table 3. A tabular representation of the AUC scores achieved by TRTS and TSTR respectively per model.

Time-wise Comparison

Naturally, it is important to consider the time invested in generating these results. The visualization below illustrates just this.


Figure 5. Visualization of the time taken to train and generate one million synthetic data points per model, with and without a GPU.

Figure 5 illustrates the time taken to generate synthetic data in two different settings. The first (referred to here as "Without GPU") consisted of test runs on a system with an Intel Xeon CPU with 16 cores running at 2.20 GHz. The tests marked as "ran with a GPU" were run on a system with an AMD Ryzen 9 7945HX CPU with 16 cores running at 2.5 GHz and an NVIDIA GeForce RTX 4070 Laptop GPU. As can be seen in Figure 5 and Table 5, Syntho is significantly faster at generating synthetic data (in both scenarios), which is critical in a dynamic workflow.
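A timing comparison of this kind can be reproduced with a small harness like the one below. It is only a sketch: `fit` and `sample` are hypothetical callables wrapping whichever generator is being measured, and the stand-in generator here simply resamples training rows, which is not a real synthetic data method.

```python
import random
import time

def time_fit_and_sample(fit, sample, train_rows, n_samples):
    """Return (elapsed_seconds, synthetic_rows) for one fit-then-sample run."""
    start = time.perf_counter()
    model = fit(train_rows)
    synthetic_rows = sample(model, n_samples)
    return time.perf_counter() - start, synthetic_rows

# Hypothetical stand-in generator: resamples training rows with replacement.
def fit(rows):
    return list(rows)

def sample(model, n):
    rng = random.Random(0)
    return [rng.choice(model) for _ in range(n)]

elapsed, synthetic_rows = time_fit_and_sample(fit, sample, range(1000), 10_000)
print(f"{len(synthetic_rows)} rows in {elapsed:.4f} s")
```

Timing training and sampling together, on the same hardware for every model, keeps the comparison like for like.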


Table 5. A tabular representation of the time taken to generate one million synthetic data points per model, with and without a GPU

Concluding Remarks and Future Directions

The findings underscore the importance of thorough quality evaluation in choosing the right synthetic data generation method. Syntho's Engine, with its AI-driven approach, demonstrates noteworthy strengths in certain metrics, while open-source tools like SDV shine in their versatility and community-driven improvements.

As the field of synthetic data continues to evolve, we encourage you to apply these metrics in your projects, explore their intricacies, and share your experiences. Stay tuned for future posts where we will dive deeper into other metrics and highlight real-world examples of their application.

At the end of the day, for those looking to test the waters with synthetic data, the open-source alternative presented here can be a justifiable choice given its accessibility. However, for professionals incorporating this modern technology into their development process, any chance at improvement must be taken and all hindrances avoided. It is therefore important to choose the best option available. With the analyses provided above, it becomes apparent that Syntho, and with it the Syntho Engine, is a very capable tool for practitioners.

About Syntho

Syntho provides a smart synthetic data generation platform that leverages multiple synthetic data forms and generation methods, empowering organizations to intelligently transform data into a competitive edge. Our AI-generated synthetic data mimics the statistical patterns of original data, ensuring accuracy, privacy, and speed, as assessed by external experts such as SAS. With smart de-identification features and consistent mapping, sensitive information is protected while referential integrity is preserved. Our platform enables the creation, management, and control of test data for non-production environments, using rule-based synthetic data generation methods for targeted scenarios. Additionally, users can generate synthetic data programmatically and obtain realistic test data to develop comprehensive testing and development scenarios with ease.

Do you want to learn about more practical applications of synthetic data? Feel free to schedule a demo!

About the authors

Software Engineering Intern

Roham is a bachelor student at the Delft University of Technology and is a Software Engineering Intern at Syntho.

Machine Learning Engineer

Mihai obtained his PhD from the University of Bristol on the topic of Hierarchical Reinforcement Learning applied to Robotics and is a Machine Learning Engineer at Syntho.
