Evaluating Utility and Resemblance in Synthetic Data Generators: A Technical Deep Dive and Comparative Analysis

Published:
February 27, 2024

Introduction

In today's digital era, awareness of data privacy has risen sharply. Users increasingly recognize their data as a unique digital fingerprint, which puts their privacy at risk in the event of a data breach. This concern is amplified by regulations such as GDPR, which empower users to request the deletion of their data. Although much needed, this legislation can be very costly for companies because it reduces access to data, and the resulting restrictions are often time- and resource-consuming to overcome.

What are synthetic data generators?

Enter synthetic data, a solution to this conundrum. Synthetic data generators create datasets that mimic real user data while preserving anonymity and confidentiality. This approach is gaining traction across industries where privacy is paramount, from healthcare to finance.

This post is written for data professionals and enthusiasts and focuses on the evaluation of synthetic data generators. We examine key metrics and carry out a comparative analysis between Syntho's Engine and its open-source alternatives, offering insight into how to assess the quality of a synthetic data generation solution effectively. We also evaluate the time cost of each model to give further insight into how the models behave in practice.

How to choose the right synthetic data generation method?

In the diverse landscape of synthetic data generation there is an abundance of methods, each vying for attention with its own capabilities. Choosing the most suitable method for a particular application requires a thorough understanding of the performance characteristics of each option. Making an informed decision therefore calls for a comprehensive evaluation of the various synthetic data generators against a set of well-defined metrics.

What follows is a rigorous comparative analysis of the Syntho Engine alongside a well-known open-source framework, the Synthetic Data Vault (SDV). In this analysis we used several commonly used metrics, covering statistical fidelity, predictive accuracy, and inter-variable relationships.

Synthetic Data Evaluation Metrics

Before introducing any specific metric, we must acknowledge that there are many schools of thought on evaluating synthetic data, each of which gives insight into a particular aspect of the data. With this in mind, the following three categories stand out as important and comprehensive, and together they cover different aspects of data quality:

1. Statistical Fidelity Metrics: Examining basic statistical features of the data, such as means and variances, to check that the synthetic data matches the statistical profile of the original dataset.
2. Predictive Accuracy: Examining the performance of a model trained on the original data and evaluated on the synthetic data (Train Real, Test Synthetic, TRTS) and vice versa (Train Synthetic, Test Real, TSTR).
3. Inter-Variable Relationships: This combined category includes:
   • Feature Correlation: We assess how well the synthetic data preserves the relationships between variables using correlation coefficients. A well-known metric such as the Propensity Mean Squared Error (PMSE) would be of this type.
   • Mutual Information: We measure the mutual dependence between variables to understand the depth of these relationships beyond correlation alone.
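To make these categories concrete, the following is a minimal sketch of one simple check for each of statistical fidelity, feature correlation, and mutual information, using only the Python standard library. The function names and the toy columns are hypothetical and are not taken from the evaluation itself; the sketch also assumes non-constant numeric columns and discrete values for the mutual information estimate.

```python
import math
import statistics
from collections import Counter


def moment_gaps(real_col, synth_col):
    """Absolute gaps in mean and population variance between two numeric columns."""
    mean_gap = abs(statistics.fmean(real_col) - statistics.fmean(synth_col))
    var_gap = abs(statistics.pvariance(real_col) - statistics.pvariance(synth_col))
    return mean_gap, var_gap


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def correlation_gap(real, synthetic, col_a, col_b):
    """Absolute difference between the real and synthetic correlation of two columns."""
    return abs(
        pearson(real[col_a], real[col_b])
        - pearson(synthetic[col_a], synthetic[col_b])
    )


# Hypothetical toy columns for illustration only (not the dataset used in this post).
real = {"age": [25, 32, 47, 51, 62], "hours": [20, 35, 40, 45, 50]}
synthetic = {"age": [24, 33, 45, 53, 60], "hours": [22, 33, 41, 44, 52]}

mean_gap, var_gap = moment_gaps(real["age"], synthetic["age"])
gap = correlation_gap(real, synthetic, "age", "hours")
```

Smaller gaps indicate closer resemblance. Mutual information can be compared between the real and synthetic tables in the same way as the correlation, by computing it for the same column pair in both.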

Comparative Analysis: Syntho Engine vs. Open-Source Alternatives

The comparative analysis was conducted using a standardized evaluation framework and identical testing techniques across all models, including the Syntho Engine and the SDV models. By synthesizing datasets from identical sources and subjecting them to the same statistical tests and machine learning model assessments, we ensure a fair and unbiased comparison. The section that follows details the performance of each synthetic data generator across the range of metrics presented above.

           

For the evaluation we used the UCI Adult Census dataset, a well-known dataset in the machine learning community. We cleaned the data before any training and then split it into two sets: a training set and a holdout set for testing. We used the training set to generate one million new datapoints with each model and evaluated the various metrics on these generated datasets. For the machine learning evaluations, we used the holdout set to compute metrics such as those related to TSTR and TRTS.

           

Each generator was run with its default parameters. Because some of the models, such as Syntho, work out of the box on any tabular data, no fine-tuning was done. Searching for the right hyperparameters for each model would take a significant amount of time, and Table 5 already shows a large time difference between Syntho's model and the models it was tested against.

           

It is worth noting that, unlike the other models in SDV, the Gaussian Copula Synthesizer is based on statistical methods. The rest are based on neural networks, such as Generative Adversarial Network (GAN) models and variational autoencoders. This is why the Gaussian Copula can be seen as a baseline for all the models discussed.

Results

Data Quality

Figure 1. Visualization of basic quality results for all models

The adherence to trends and representations in the data discussed earlier is shown in Figure 1 and Table 1. Each of the metrics used here can be interpreted as follows:

• Overall Quality Score: An overall assessment of the quality of the synthetic data, combining aspects such as statistical similarity and data characteristics.
• Column Shapes: Assesses whether the synthetic data keeps the same distribution shape as the real data for each column.
• Column Pair Trends: Evaluates the relationships or correlations between pairs of columns in the synthetic data compared with the real data.
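As an illustration of what a column-shape score can look like, here is a minimal sketch that assumes numeric columns are scored with the complement of the two-sample Kolmogorov-Smirnov statistic and categorical columns with the complement of the total variation distance. The SDMetrics library documents metrics of this kind under the names KSComplement and TVComplement; the plain-Python functions below are our own simplified stand-ins, not the library's implementation.

```python
from collections import Counter


def ks_complement(real, synthetic):
    """1 minus the two-sample Kolmogorov-Smirnov statistic for a numeric column.

    Quadratic-time sketch: evaluates both empirical CDFs at every observed value.
    """
    points = sorted(set(real) | set(synthetic))
    n_r, n_s = len(real), len(synthetic)
    max_gap = 0.0
    for p in points:
        cdf_r = sum(v <= p for v in real) / n_r
        cdf_s = sum(v <= p for v in synthetic) / n_s
        max_gap = max(max_gap, abs(cdf_r - cdf_s))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """1 minus the total variation distance for a categorical column."""
    freq_r, freq_s = Counter(real), Counter(synthetic)
    n_r, n_s = len(real), len(synthetic)
    categories = set(freq_r) | set(freq_s)
    tvd = 0.5 * sum(abs(freq_r[c] / n_r - freq_s[c] / n_s) for c in categories)
    return 1.0 - tvd


# Hypothetical toy columns for illustration only.
numeric_score = ks_complement([1, 2, 3, 4], [1, 2, 3, 5])
categorical_score = tv_complement(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Both scores lie between 0 and 1, where 1 means the synthetic column has exactly the same empirical distribution as the real one.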

Overall, Syntho achieves very high scores across the board. Looking first at overall data quality (evaluated with the SDV metrics library), Syntho achieves a result upwards of 99%, with a column shape adherence of 99.92% and a column pair trend adherence of 99.31%. SDV, by comparison, reaches a maximum of 90.84% (with the Gaussian Copula, which has a column shape adherence of 93.82% and a column pair trend adherence of 87.86%).
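The figures quoted above are consistent with the overall score being the plain average of the two property scores. Assuming that is how the score is aggregated, the arithmetic can be checked with a short sketch (the function name is hypothetical):

```python
def overall_quality(column_shapes, column_pair_trends):
    """Average of the two property scores, in percent (assumed aggregation)."""
    return (column_shapes + column_pair_trends) / 2


# Property scores as quoted in the text.
sdv_gaussian_copula = overall_quality(93.82, 87.86)
syntho = overall_quality(99.92, 99.31)

print(f"SDV Gaussian Copula: {sdv_gaussian_copula:.2f}%")
print(f"Syntho: {syntho:.1f}%")
```

Under this assumption the Gaussian Copula average reproduces the reported 90.84%, and the Syntho average lands above 99%, matching the statement in the text.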

Table 1. A tabular representation of the quality scores of each generated dataset per model

          Data Coverage

The Diagnostic Report module of SDV brings to our attention that the SDV-generated data is, in all cases, missing more than 10% of the numeric ranges. In the case of the Tabular Variational Autoencoder (TVAE), the same proportion of categorical data is also missing when compared with the original dataset. No such warnings were raised for the results obtained with Syntho.

Figure 2. Visualization of average column-wise performance metrics for all models

In the comparative analysis, the plot in Figure 2 shows that SDV achieves marginally better results in category coverage with some of its models (namely GaussianCopula, CopulaGAN, and the Conditional Tabular GAN, CTGAN). Nevertheless, it is important to highlight that the reliability of Syntho's data surpasses that of the SDV models, because the gap between its category coverage and its range coverage is minimal, at only 1.1%. The SDV models, in contrast, show a considerable gap, ranging from 14.6% to 29.2%.

           

The metrics shown here can be interpreted as follows:

• Category Coverage: Measures the presence of all categories in the synthetic data compared with the real data.
• Range Coverage: Evaluates how well the range of values in the synthetic data matches the range in the real data.
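The two coverage metrics can be sketched in a few lines. The formulas below are an assumption on our part, written in the spirit of the coverage metrics documented in SDMetrics: category coverage is the share of real categories that reappear in the synthetic column, and range coverage is the share of the real interval from minimum to maximum that the synthetic column spans. The function names and toy values are hypothetical.

```python
def category_coverage(real, synthetic):
    """Fraction of the real categories that also appear in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


def range_coverage(real, synthetic):
    """Fraction of the real [min, max] interval spanned by the synthetic column."""
    r_min, r_max = min(real), max(real)
    s_min, s_max = min(synthetic), max(synthetic)
    width = r_max - r_min
    if width == 0:
        # Assumption: a constant real column counts as covered if its value is reachable.
        return 1.0 if s_min <= r_min <= s_max else 0.0
    missed_low = max((s_min - r_min) / width, 0.0)
    missed_high = max((r_max - s_max) / width, 0.0)
    return max(1.0 - (missed_low + missed_high), 0.0)


# Hypothetical toy columns for illustration only.
cat_score = category_coverage(["a", "b", "c", "d"], ["a", "b", "b"])
range_score = range_coverage([0, 10], [2, 8])
```

In this toy example the synthetic column reproduces two of the four real categories and spans the middle 60% of the real numeric range.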

Table 2. A tabular representation of the average coverage of a given attribute type per model

          Utility

Moving on to the utility of synthetic data, the question of training models on the data becomes relevant. To keep the comparison between all frameworks balanced and fair, we chose the default Gradient Boosting Classifier from the scikit-learn library, since it is widely accepted as a model that performs well with out-of-the-box settings.

           

Two different models are trained: one on the synthetic data (for TSTR) and one on the original data (for TRTS). The model trained on the synthetic data is evaluated on the holdout test set, which was not used during synthetic data generation, and the model trained on the original data is tested on the synthetic dataset.
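The protocol described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used for this post: the data is a hypothetical toy table generated with NumPy, and the "synthetic" table is simply a fresh draw from the same process standing in for generator output. The classifier is scikit-learn's `GradientBoostingClassifier` with default settings; `random_state` is fixed only to make the sketch reproducible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def make_toy_table(n_rows, seed):
    """Hypothetical stand-in for a tabular dataset with a binary target."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 4))
    noise = rng.normal(scale=0.5, size=n_rows)
    y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
    return X, y


def auc_of(train, test):
    """Fit a default Gradient Boosting Classifier on `train` and score AUC on `test`."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(*train)
    X_test, y_test = test
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


# "Real" data, split into a training set and a holdout set.
X_real, y_real = make_toy_table(900, seed=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_real, y_real, test_size=300, random_state=0
)

# Stand-in for generator output; a real run would sample from a fitted generator.
X_synth, y_synth = make_toy_table(600, seed=1)

scores = {
    "TRTR": auc_of((X_train, y_train), (X_holdout, y_holdout)),  # baseline
    "TSTR": auc_of((X_synth, y_synth), (X_holdout, y_holdout)),
    "TRTS": auc_of((X_train, y_train), (X_synth, y_synth)),
}
```

The closer the TSTR and TRTS scores are to each other and to the TRTR baseline, the more the synthetic data behaves like the real data for this prediction task.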

Figure 3. Visualization of Area Under the Curve (AUC) scores per method per model

The results visualized above demonstrate the strength of synthetic data generation by the Syntho Engine compared with the other methods: for Syntho there is no difference between the scores obtained with the two evaluation methods, which points to a high similarity between the synthetic and the real data. The red dotted line in the plot is the result of a Train Real, Test Real (TRTR) run, included as a baseline for the observed metrics. This line sits at 0.92, the Area Under the Curve (AUC) score achieved by the model trained on real data and tested on real data.

Table 3. A tabular representation of the AUC scores achieved by TRTS and TSTR respectively per model

Temporal Comparison

Naturally, it is important to consider the time invested in generating these results. The visualization below illustrates this.

Figure 5. Visualization of the time taken to train a model and generate one million synthetic datapoints with and without a GPU

Figure 5 shows the time taken to generate synthetic data in two different settings. The first (referred to here as "Without GPU") consisted of test runs on a system with an Intel Xeon CPU with 16 cores running at 2.20 GHz. The tests marked as "ran with a GPU" were run on a system with an AMD Ryzen 9 7945HX CPU with 16 cores running at 2.5 GHz and an NVIDIA GeForce RTX 4070 Laptop GPU. As Figure 5 and Table 5 below show, Syntho is significantly faster at generating synthetic data in both scenarios, which is critical in a dynamic workflow.
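For readers who want to reproduce this kind of timing comparison on their own hardware, a minimal sketch of the measurement is shown below. The `fit` and `sample` callables are hypothetical placeholders for whichever generator is being timed; the toy generator here exists only so that the sketch runs on its own and says nothing about the speed of any real model.

```python
import time


def time_fit_and_sample(fit, sample, n_rows):
    """Return (train_seconds, sample_seconds) for one generator run."""
    start = time.perf_counter()
    model = fit()
    fitted = time.perf_counter()
    sample(model, n_rows)
    done = time.perf_counter()
    return fitted - start, done - fitted


# Hypothetical stand-in generator: "training" stores a mean, sampling repeats it.
def toy_fit():
    data = [1.0, 2.0, 3.0]
    return sum(data) / len(data)


def toy_sample(model, n_rows):
    return [model] * n_rows


train_s, sample_s = time_fit_and_sample(toy_fit, toy_sample, n_rows=1_000_000)
print(f"training: {train_s:.4f}s, sampling: {sample_s:.4f}s")
```

Measuring training and sampling separately makes it easier to see which phase dominates, and repeating each run several times helps smooth out noise from other processes on the machine.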

Table 5. A tabular representation of the time taken to generate one million synthetic datapoints with each model, with and without a GPU

Concluding Remarks and Future Directions

The findings underscore the importance of thorough quality evaluation when choosing a synthetic data generation method. Syntho's Engine, with its AI-driven approach, shows noteworthy strengths in certain metrics, while open-source tools such as SDV shine in their versatility and community-driven improvements.

As the field of synthetic data continues to evolve, we encourage you to apply these metrics in your own projects, explore their intricacies, and share your experiences. Stay tuned for future posts in which we will dive deeper into other metrics and highlight real-world examples of their application.

At the end of the day, for those looking to test the waters with synthetic data, the open-source alternative presented here can be a justifiable choice given its accessibility. For professionals incorporating this modern technology into their development process, however, every chance at improvement should be taken and every hindrance avoided, so it is important to choose the best option available. The analyses above make it clear that the Syntho Engine is a very capable tool for practitioners.

About Syntho

Syntho provides a smart synthetic data generation platform that leverages multiple forms of synthetic data and generation methods, empowering organizations to intelligently turn data into a competitive edge. Our AI-generated synthetic data mimics the statistical patterns of the original data, ensuring accuracy, privacy, and speed, as assessed by external experts such as SAS. With smart de-identification features and consistent mapping, sensitive information is protected while referential integrity is preserved. Our platform enables the creation, management, and control of test data for non-production environments, using rule-based synthetic data generation methods for targeted scenarios. In addition, users can generate synthetic data programmatically and obtain realistic test data to build comprehensive testing and development scenarios with ease.


About the authors

Software Engineering Intern

Roham is a bachelor student at Delft University of Technology and a Software Engineering Intern at Syntho.

Machine Learning Engineer

Mihai obtained his PhD from the University of Bristol on the topic of Hierarchical Reinforcement Learning applied to Robotics and is a Machine Learning Engineer at Syntho.
