Evaluating Utility and Resemblance in Synthetic Data Generators: A Technical Deep Dive and Comparative Analysis
Introduction
In today's digital age, awareness of data privacy has grown significantly. Users increasingly see their data as a unique digital fingerprint, which puts their privacy at risk in the event of a data breach. This concern is reinforced by regulations such as GDPR, which empower users to request the deletion of their data. While much needed, this legislation can be costly for companies because it restricts access to data, and working around such restrictions often takes considerable time and resources.
What are synthetic data generators?
Enter synthetic data, a solution to this conundrum. Synthetic data generators create datasets that mimic real user data while preserving anonymity and confidentiality. This approach is gaining traction across industries, from healthcare to finance, where privacy is paramount.
How to choose the right synthetic data generation method?
In the diverse landscape of synthetic data generation, there is an abundance of methods available, each vying for attention with its own capabilities. Choosing the most suitable method for a particular application requires a thorough understanding of the performance characteristics of each option. This in turn calls for a comprehensive evaluation of the various synthetic data generators against a set of well-defined metrics, so that an informed decision can be made.
What follows is a rigorous comparative analysis of the Syntho Engine alongside a well-known open-source framework, the Synthetic Data Vault (SDV). In this analysis, we used several commonly used metrics covering statistical fidelity, predictive accuracy, and inter-variable relationships.
Synthetic Data Evaluation Metrics
Before introducing any specific metric, we must acknowledge that there are many schools of thought on evaluating synthetic data, each of which gives insight into a particular aspect of the data. With this in mind, the following three categories stand out as important and comprehensive, and together they cover different aspects of data quality:
- Statistical Fidelity Metrics: Examining basic statistical features of the data, such as means and variances, to ensure the synthetic data aligns with the statistical profile of the original dataset.
- Predictive Accuracy: Examining the performance of a machine learning model that is trained on original data and evaluated on synthetic data (Train Real - Test Synthetic, TRTS), and vice versa (Train Synthetic - Test Real, TSTR).
- Inter-Variable Relationships: This combined category includes:
  - Feature Correlation: We assess how well the synthetic data maintains the relationships between variables using correlation coefficients. A well-known metric such as the Propensity Mean Squared Error (PMSE) would be of this type.
  - Mutual Information: We measure the mutual dependence between variables to understand the depth of these relationships beyond correlation alone.
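To make the mutual information idea above concrete, here is a minimal sketch in plain Python. It is not the implementation used in this analysis: the function name `mutual_information` and the toy columns are made up for illustration, and it only handles discrete (categorical) columns.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (px[x] * py[y])
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi

# Toy data (hypothetical): in the "real" columns income is fully determined
# by education, while the "synthetic" columns have lost that dependence.
real_edu = ["HS", "HS", "BSc", "BSc", "MSc", "MSc"]
real_inc = ["low", "low", "mid", "mid", "high", "high"]
synth_edu = ["HS", "BSc", "MSc", "HS", "BSc", "MSc"]
synth_inc = ["low", "low", "low", "high", "high", "high"]

mi_real = mutual_information(real_edu, real_inc)
mi_synth = mutual_information(synth_edu, synth_inc)
print(f"real MI: {mi_real:.4f}, synthetic MI: {mi_synth:.4f}")
```

A large gap between the two values for the same column pair suggests that the generator did not preserve that dependency, even if each column looks fine on its own.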
Comparative Analysis: Syntho Engine vs. Open-Source Alternatives
The comparative analysis was conducted using a standardized evaluation framework and identical testing techniques across all models, including the Syntho Engine and the SDV models. By synthesizing datasets from identical sources and subjecting them to the same statistical tests and machine learning model assessments, we ensure a fair and unbiased comparison. The section that follows details the performance of each synthetic data generator across the range of metrics presented above.
For the evaluation we used the UCI Adult Census Dataset, a well-known dataset in the machine learning community. We cleaned the data before any training and then split the dataset into two sets (a training set and a holdout set for testing). We used the training set to generate one million new datapoints with each of the models and evaluated the various metrics on these generated datasets. For the machine learning evaluations, we used the holdout set to compute metrics such as those related to TSTR and TRTS.
Each generator was run with its default parameters. Because some of the models, like Syntho, can work out of the box on any tabular data, no fine-tuning was done. Searching for the right hyperparameters for each model would take a significant amount of time, and Table 5 already shows a large time difference between Syntho's model and the models it was tested against.
It is worth noting that, unlike the rest of the models in SDV, the Gaussian Copula Synthesizer is based on statistical methods. The others are based on neural networks, such as Generative Adversarial Network (GAN) models and variational autoencoders. This is why the Gaussian Copula can be seen as a baseline for all the models discussed.
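The train/holdout split described above can be sketched as follows. This is only an illustration: the article does not state the split ratio or the tooling used, so the 20% holdout fraction, the seed, and the helper name `train_holdout_split` are all assumptions.

```python
import random

def train_holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows reproducibly and return (training set, holdout set).

    holdout_fraction=0.2 is an assumed value, not taken from the article.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Stand-in rows; in practice these would be the cleaned Adult Census records.
rows = list(range(10))
train_rows, holdout_rows = train_holdout_split(rows)
print(len(train_rows), len(holdout_rows))
```

The important property is that the holdout rows are never shown to the generator, so they can serve as unseen real data in later evaluations.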
Results
Data Quality
Figure 1. Visualization of basic quality results for all models
The previously discussed adherence to trends and representations in the data can be found in Figure 1 and Table 1. Here, each of the metrics in use can be interpreted as follows:
- Overall Quality Score: An overall assessment of the quality of the synthetic data, combining aspects such as statistical similarity and data characteristics.
- Column Shapes: Assesses whether the synthetic data keeps the same distribution shape as the real data for each column.
- Column Pair Trends: Evaluates the relationship or correlation between pairs of columns in the synthetic data compared with the real data.
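For intuition on what a column shape score measures, the sketch below computes two simple distribution-similarity scores from scratch: one minus the total variation distance for categorical columns, and one minus the two-sample Kolmogorov-Smirnov statistic for numeric columns. The SDV metrics library documents its Column Shapes score in terms of metrics of this kind, but the code here is our own approximation with made-up toy data, not the library's implementation.

```python
import bisect
from collections import Counter

def tv_complement(real, synthetic):
    """1 - total variation distance between the category frequencies."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd

def ks_complement(real, synthetic):
    """1 - the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum gap.
    for v in set(real) | set(synthetic):
        cdf_real = bisect.bisect_right(real_sorted, v) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, v) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap

# Toy columns (hypothetical values)
cat_score = tv_complement(["a", "a", "b", "c"], ["a", "b", "b", "b"])
same_score = ks_complement([20, 30, 40, 50], [20, 30, 40, 50])
shifted_score = ks_complement([20, 30, 40, 50], [60, 70, 80, 90])
print(cat_score, same_score, shifted_score)
```

Both scores lie between 0 and 1, where 1 means the synthetic column follows the same distribution as the real one.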
Overall, Syntho achieves very high scores across the board. To begin with, when looking at overall data quality (evaluated with the SDV metrics library), Syntho achieves a result upwards of 99% (with a column shape adherence of 99.92% and a column pair trend adherence of 99.31%). SDV, by comparison, reaches a maximum of 90.84% (with the Gaussian Copula, which has a column shape adherence of 93.82% and a column pair trend adherence of 87.86%).
Table 1. A tabular representation of the quality scores of each generated dataset per model
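The overall figures quoted above are consistent with a plain average of the two component scores. The snippet below checks that arithmetic; treating the overall score as an unweighted mean is our reading of the reported numbers, and the helper name `overall_quality` is hypothetical.

```python
def overall_quality(column_shapes, column_pair_trends):
    """Unweighted mean of the two component scores (assumed aggregation)."""
    return (column_shapes + column_pair_trends) / 2

# Component scores as reported in the text, in percent
syntho_overall = overall_quality(99.92, 99.31)
sdv_gaussian_copula_overall = overall_quality(93.82, 87.86)
print(f"Syntho: {syntho_overall:.2f}%")
print(f"SDV Gaussian Copula: {sdv_gaussian_copula_overall:.2f}%")
```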
Data Coverage
The Diagnostic Report module of SDV brings to our attention that SDV-generated data (in all cases) is missing more than 10% of the numeric ranges. In the case of the Tabular Variational Autoencoder (TVAE), the same amount of categorical data is also missing when compared with the original dataset. No such warnings were generated for the results achieved with Syntho.
Figure 2. Visualization of average column-wise coverage metrics for all models
In the comparative analysis, the plot in Figure 2 shows that SDV achieves marginally better results in category coverage with some of its models (namely GaussianCopula, CopulaGAN, and Conditional Tabular GAN, or CTGAN). Nevertheless, it is important to highlight that the reliability of Syntho's data surpasses that of the SDV models: its discrepancy in coverage between categories and ranges is minimal, at just 1.1%. In contrast, the SDV models show considerable variation, ranging from 14.6% to 29.2%.
The metrics represented here can be interpreted as follows:
- Category Coverage: Measures the presence of all categories in the synthetic data compared with the real data.
- Range Coverage: Evaluates how well the range of values in the synthetic data matches that in the real data.
Table 2. A tabular representation of the average coverage of a given attribute type per model
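The two coverage metrics can be illustrated with a short from-scratch sketch. The formulas below are one common formulation (the share of real categories that reappear, and the share of the real value range that the synthetic values span); they are our assumption of how such scores are typically computed rather than a copy of the SDV code, and the sample values are invented.

```python
def category_coverage(real, synthetic):
    """Fraction of the real column's categories that appear in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)

def range_coverage(real, synthetic):
    """Fraction of the real [min, max] range spanned by the synthetic values."""
    r_min, r_max = min(real), max(real)
    span = r_max - r_min
    if span == 0:
        raise ValueError("real column is constant; range coverage is undefined")
    # Portions of the real range that the synthetic data fails to reach
    missed_low = max((min(synthetic) - r_min) / span, 0.0)
    missed_high = max((r_max - max(synthetic)) / span, 0.0)
    return max(1.0 - (missed_low + missed_high), 0.0)

# Toy columns (hypothetical values)
cat_cov = category_coverage(
    ["Private", "Self-emp", "Gov", "Private"],
    ["Private", "Gov", "Gov"],
)
range_cov = range_coverage([17, 30, 45, 90], [20, 35, 50, 80])
print(f"category coverage: {cat_cov:.3f}, range coverage: {range_cov:.3f}")
```

In this toy case one of three categories is missing and the synthetic ages stop short of both ends of the real range, so both scores fall below 1.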
Utility
Moving on to the utility of synthetic data, the question of training models on the data becomes relevant. To have a balanced and fair comparison between all frameworks, we chose the default Gradient Boosting Classifier from the scikit-learn library, since it is widely accepted as a well-performing model with out-of-the-box settings.
Two different models are trained, one on the synthetic data (for TSTR) and one on the original data (for TRTS). The model trained on synthetic data is evaluated on the holdout test set (which was not used during synthetic data generation), and the model trained on original data is evaluated on the synthetic dataset.
Figure 3. Visualization of Area Under the Curve (AUC) scores per method per model
The results visualized above demonstrate the superiority of synthetic data generation by the Syntho Engine compared with the other methods, since there is no difference between the scores obtained by TRTS and TSTR (pointing to a high similarity between the synthetic and the real data). The red dotted line in the plot is the result of a Train Real, Test Real (TRTR) run, included as a baseline for the observed metrics. This line represents the value 0.92, which is the Area Under the Curve (AUC) score achieved by the model trained on real data and tested on real data.
Table 3. A tabular representation of the AUC scores achieved by TRTS and TSTR respectively, per model.
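The evaluation protocol described above can be sketched in a few lines. To keep the example dependency-free, it uses a deliberately trivial one-feature stand-in model instead of the Gradient Boosting Classifier, and a hand-written rank-based AUC; the helper names (`auc`, `evaluate`, `fit_mean_threshold`) and all data values are hypothetical.

```python
from statistics import mean

def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_threshold(X, y):
    """Stand-in model (NOT gradient boosting): score a single feature by its
    signed distance from the midpoint between the two class means."""
    pos_mean = mean(x for x, label in zip(X, y) if label == 1)
    neg_mean = mean(x for x, label in zip(X, y) if label == 0)
    midpoint = (pos_mean + neg_mean) / 2
    direction = 1 if pos_mean >= neg_mean else -1
    return lambda x: direction * (x - midpoint)

def evaluate(fit, train, test):
    """Train on one dataset and report the AUC on another."""
    score = fit(*train)
    X_test, y_test = test
    return auc(y_test, [score(x) for x in X_test])

# Toy (features, labels) pairs; real use would plug in the actual datasets.
real_train = ([1.0, 2.0, 3.0, 6.0, 7.0, 8.0], [0, 0, 0, 1, 1, 1])
holdout = ([2.5, 1.5, 6.5, 7.5], [0, 0, 1, 1])
synthetic = ([1.2, 2.8, 6.1, 5.9, 7.2, 3.0], [0, 0, 0, 1, 1, 1])

results = {
    "TRTR": evaluate(fit_mean_threshold, real_train, holdout),  # baseline
    "TSTR": evaluate(fit_mean_threshold, synthetic, holdout),
    "TRTS": evaluate(fit_mean_threshold, real_train, synthetic),
}
print(results)
```

When the synthetic data closely resembles the real data, the TSTR and TRTS scores should both sit near the TRTR baseline; a gap in either direction points to a mismatch between the two datasets.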
Time-wise Comparison
Naturally, it is crucial to consider the time invested in generating these results. The visualization below illustrates just this.
Figure 5. Visualization of the time taken to train a model and generate one million synthetic datapoints, with and without a GPU.
Figure 5 illustrates the time taken to generate synthetic data in two different settings. The first (referred to here as Without GPU) consisted of test runs on a system with an Intel Xeon CPU with 16 cores running at 2.20 GHz. The tests marked as "ran with a GPU" were run on a system with an AMD Ryzen 9 7945HX CPU with 16 cores running at 2.5 GHz and an NVIDIA GeForce RTX 4070 Laptop GPU. As Figure 5 and Table 5 show, Syntho is significantly faster at generating synthetic data (in both scenarios), which is critical in a dynamic workflow.
Table 5. A tabular representation of the time taken to generate one million synthetic datapoints with each model, with and without a GPU
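For readers who want to reproduce this kind of timing comparison on their own hardware, a minimal wall-clock timing helper might look like the sketch below. The helper name `time_call` is hypothetical, and the timed call is a placeholder rather than a real generator.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Placeholder workload; a real run would time the fit-and-sample step
# of each generator on the same machine.
result, elapsed = time_call(sum, range(1000))
print(f"result={result}, elapsed={elapsed:.6f}s")
```

Timings are only comparable when every model is run on the same machine under the same load, which is why the two hardware setups are reported separately.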
The findings underscore the importance of thorough quality evaluation when choosing the right synthetic data generation method. Syntho's Engine, with its AI-driven approach, shows noteworthy strengths in certain metrics, while open-source tools like SDV shine in their versatility and community-driven improvements.
As the field of synthetic data continues to evolve, we encourage you to apply these metrics in your own projects, explore their intricacies, and share your experiences. Stay tuned for future posts in which we will dive deeper into other metrics and highlight real-world examples of their application.
At the end of the day, for those looking to test the waters of synthetic data, the open-source alternative presented here can be a justifiable choice given its accessibility. However, for professionals incorporating this modern technology into their development process, every chance at improvement should be taken and every hindrance avoided, so it is important to choose the best option available. The analysis above makes it apparent that Syntho, and with it the Syntho Engine, is a very capable tool for practitioners.
About Syntho
Syntho provides a smart synthetic data generation platform that leverages multiple synthetic data forms and generation methods, empowering organizations to intelligently transform data into a competitive edge. Our AI-generated synthetic data mimics the statistical patterns of original data, ensuring accuracy, privacy, and speed, as assessed by external experts such as SAS. With smart de-identification features and consistent mapping, sensitive information is protected while referential integrity is preserved. Our platform enables the creation, management, and control of test data for non-production environments, using rule-based synthetic data generation methods for targeted scenarios. In addition, users can generate synthetic data programmatically and obtain realistic test data to develop comprehensive testing and development scenarios with ease.
About the authors
Software Engineering Intern
Roham is a bachelor student at the Delft University of Technology and is a Software Engineering Intern at Syntho.
Machine Learning Engineer
Mihai obtained his PhD from the University of Bristol on the topic of Hierarchical Reinforcement Learning applied to Robotics and is a Machine Learning Engineer at Syntho.