Evaluating Utility and Resemblance in Synthetic Data Generators: A Technical Deep Dive and Comparative Analysis

Published: February 27, 2024

Introduction

In today's digital era, awareness of data privacy has grown significantly. Users increasingly see their data as a unique digital fingerprint, one that puts their privacy at risk in the event of a data breach. This concern is amplified by regulations such as GDPR, which empower users to request the deletion of their data. While much needed, this legislation can be very costly for companies because it limits their access to data, and overcoming those restrictions often takes considerable time and resources.

What are synthetic data generators?

Enter synthetic data, a solution to this conundrum. Synthetic data generators create datasets that mimic real user data while preserving confidentiality and anonymity. This approach is gaining traction across industries, from healthcare to finance, where privacy is paramount.

This post is written for data professionals and enthusiasts, and it focuses on the evaluation of synthetic data generators. We examine key metrics and carry out a comparative analysis between Syntho's Engine and its open-source alternatives, offering insight into how to assess the quality of a synthetic data generation solution. We also look at the time cost of each model to give a fuller picture of how the models behave in practice.

How to choose the right synthetic data generation method?

In the diverse landscape of synthetic data generation there is an abundance of methods, each vying for attention with its own capabilities. Choosing the most suitable method for a particular application requires a thorough understanding of how each option performs. That in turn calls for a comprehensive evaluation of the available synthetic data generators against a set of well-defined metrics, so that the decision is an informed one.

What follows is a comparative analysis of the Syntho Engine alongside a well-known open-source framework, the Synthetic Data Vault (SDV). In this analysis we used commonly applied metrics such as statistical fidelity, predictive accuracy, and inter-variable relationships.

Synthetic Data Evaluation Metrics

Before introducing any specific metric, we must acknowledge that there are many schools of thought on evaluating synthetic data, each of which sheds light on a particular aspect of the data. With this in mind, the following three categories stand out as important and comprehensive, and together they cover different aspects of data quality:

1. Statistical Fidelity Metrics: Examining basic statistical features of the data, such as means and variances, to ensure the synthetic data matches the statistical profile of the original dataset.

2. Predictive Accuracy: Assessing a synthetic data generation model through the performance of a downstream model that is trained on original data and evaluated on synthetic data (Train Real - Test Synthetic, TRTS), and vice versa (Train Synthetic - Test Real, TSTR).

3. Inter-Variable Relationships: This combined category includes:

   • Feature Correlation: We assess how well the synthetic data preserves the relationships between variables using correlation coefficients. A well-known metric such as the Propensity Mean Squared Error (PMSE) would be of this type.

   • Mutual Information: We measure the mutual dependence between variables to understand the depth of these relationships beyond correlation alone.
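To make the three categories more concrete, here is a minimal sketch of a mean and variance comparison, a Pearson correlation coefficient, and mutual information for discrete columns. It uses only the Python standard library. The function names and the toy columns are illustrative assumptions; they are not taken from the Syntho Engine or from SDV, whose metric implementations may differ.

```python
import math
import statistics
from collections import Counter


def mean_variance_gap(real, synthetic):
    """Absolute gaps in mean and sample variance between two numeric columns."""
    mean_gap = abs(statistics.mean(real) - statistics.mean(synthetic))
    var_gap = abs(statistics.variance(real) - statistics.variance(synthetic))
    return mean_gap, var_gap


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Toy columns standing in for a real and a synthetic dataset (made-up values).
real_age, real_hours = [25, 32, 47, 51, 38, 29], [40, 45, 50, 60, 40, 35]
synth_age, synth_hours = [26, 31, 45, 52, 37, 30], [38, 44, 52, 58, 41, 36]

print("mean/variance gap (age):", mean_variance_gap(real_age, synth_age))
print("correlation gap (age vs hours):",
      abs(pearson(real_age, real_hours) - pearson(synth_age, synth_hours)))
```

Small gaps suggest the synthetic columns track the real ones. Mutual information is useful where a dependency is not linear, which a correlation coefficient alone would miss.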

Comparative Analysis: Syntho Engine vs. Open-Source Alternatives

The comparative analysis was conducted using a standardized evaluation framework and identical testing techniques across all models, including the Syntho Engine and the SDV models. By synthesizing datasets from the same source and subjecting them to the same statistical tests and machine learning model assessments, we ensure a fair and unbiased comparison. The sections that follow detail how each synthetic data generator performs on the metrics presented above.

           

For the evaluation we used the UCI Adult Census Dataset, a well-known dataset in the machine learning community. We cleaned the data before any training and then split it into two sets: a training set and a holdout set for testing. We used the training set to generate 1 million new datapoints with each model and evaluated the various metrics on these generated datasets. For the machine learning evaluations, we used the holdout set to compute metrics such as those related to TSTR and TRTS.

           

Each generator was run with default parameters. Because some of the models, such as Syntho, work out of the box on any tabular data, no fine-tuning was done. Searching for the right hyperparameters for each model would take a significant amount of time, and Table 5 already shows a large time difference between Syntho's model and the models it was tested against.

           

It is worth noting that, unlike the other SDV models, the Gaussian Copula Synthesizer is based on statistical methods. The rest are based on neural networks, such as Generative Adversarial Networks (GANs) and variational autoencoders. This is why Gaussian Copula can be seen as a baseline for all the models discussed.

Results

Data Quality

Figure 1. Visualization of basic quality scores for all models

How well each model adheres to the trends and representations in the data, as discussed earlier, can be seen in Figure 1 and Table 1. The metrics used here can be interpreted as follows:

• Overall Quality Score: An overall assessment of the quality of the synthetic data, combining aspects such as statistical similarity and data characteristics.
• Column Shapes: Assesses whether the synthetic data keeps the same distribution shape as the real data for each column.
• Column Pair Trends: Evaluates the relationships or correlations between pairs of columns in the synthetic data compared with the real data.

Overall, Syntho achieves very high scores across the board. Looking first at overall data quality (evaluated with the SDV metrics library), Syntho reaches a score above 99%, with a Column Shapes score of 99.92% and a Column Pair Trends score of 99.31%. The best SDV result, by contrast, is 90.84%, obtained with Gaussian Copula, with a Column Shapes score of 93.82% and a Column Pair Trends score of 87.86%.
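The figures quoted above are consistent with the overall score being the plain average of the two component scores. The short check below makes that arithmetic explicit. The averaging rule is an assumption inferred from the reported numbers, and `overall_quality` is a hypothetical helper name, not an SDV function.

```python
def overall_quality(column_shapes, column_pair_trends):
    """Average of the two component scores (assumed aggregation rule)."""
    return (column_shapes + column_pair_trends) / 2


# Component scores as reported in this post, in percent.
syntho = overall_quality(99.92, 99.31)
sdv_gaussian_copula = overall_quality(93.82, 87.86)

print(f"Syntho overall: {syntho:.3f}%")
print(f"SDV Gaussian Copula overall: {sdv_gaussian_copula:.2f}%")
```

Under this assumption the Gaussian Copula components reproduce the reported 90.84%, and the Syntho components give a value above 99%, in line with the text.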

Table 1. A tabular representation of the quality scores of each generated dataset per model

Data Coverage

The Diagnostic Report module of SDV warns that the data generated by SDV is, in all cases, missing more than 10% of the numeric ranges. In the case of the Tabular Variational Autoencoder (TVAE), the same share of categorical data is also missing compared with the original dataset. No such warnings were raised for the results obtained with Syntho.

Figure 2. Visualization of average column-wise performance metrics for all models

The plot in Figure 2 shows that SDV achieves marginally better category coverage with some of its models (namely GaussianCopula, CopulaGAN, and Conditional Tabular GAN, or CTGAN). Nevertheless, the reliability of Syntho's data surpasses that of the SDV models: the discrepancy between its category coverage and range coverage is minimal, at just 1.1%. The SDV models, in contrast, show a considerable discrepancy, ranging from 14.6% to 29.2%.

           

The metrics shown here can be interpreted as follows:

• Category Coverage: Measures how many of the categories in the real data are also present in the synthetic data.
• Range Coverage: Evaluates how well the range of values in the synthetic data matches the range in the real data.
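The two coverage metrics can be approximated in a few lines. The sketch below is a simplified, standard-library version written for illustration. The function names and toy values are assumptions, and the exact formulas used by the SDV metrics library may differ in their details.

```python
def category_coverage(real, synthetic):
    """Share of the real column's categories that also occur in the synthetic column."""
    real_categories = set(real)
    return len(real_categories & set(synthetic)) / len(real_categories)


def range_coverage(real, synthetic):
    """Share of the real column's [min, max] interval spanned by the synthetic column."""
    r_min, r_max = min(real), max(real)
    width = r_max - r_min
    if width == 0:
        raise ValueError("range coverage is undefined for a constant real column")
    # Portion of the real range missed at the lower and upper ends.
    low_gap = max((min(synthetic) - r_min) / width, 0)
    high_gap = max((r_max - max(synthetic)) / width, 0)
    return max(1 - (low_gap + high_gap), 0)


# Made-up columns: the synthetic data misses two categories and both range ends.
real_jobs, synth_jobs = ["clerk", "sales", "tech", "farming"], ["clerk", "sales", "sales"]
real_age, synth_age = [17, 25, 40, 90], [20, 30, 80]

print("category coverage:", category_coverage(real_jobs, synth_jobs))
print("range coverage:", round(range_coverage(real_age, synth_age), 3))
```

A generator with high category coverage but low range coverage, or the reverse, shows the kind of discrepancy discussed above.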

Table 2. A tabular representation of the average coverage of a given attribute type per model

Utility

Turning to the utility of synthetic data, the question of training models on the data becomes relevant. To keep the comparison between all frameworks balanced and fair, we chose the default Gradient Boosting Classifier from the scikit-learn library, since it is widely accepted as a model that performs well with out-of-the-box settings.

           

Two different models are trained: one on the synthetic data (for TSTR) and one on the original data (for TRTS). The model trained on the synthetic data is evaluated on the holdout test set, which was not used during synthetic data generation, and the model trained on the original data is tested on the synthetic dataset.
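The TSTR and TRTS procedure can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the exact pipeline used for the results in this post: the "real" data comes from `make_classification` rather than the UCI Adult Census Dataset, and the "synthetic" data is simply the training rows with Gaussian noise added, as a stand-in for the output of an actual generator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the cleaned real dataset, split into training and holdout sets.
X_real, y_real = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_real, y_real, test_size=0.25, random_state=0
)

# Hypothetical "synthetic" data: noisy copies of the training rows.
# A real experiment would sample these rows from a synthetic data generator.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_synth = y_train.copy()


def auc(train_X, train_y, test_X, test_y):
    """Fit a Gradient Boosting Classifier with default settings and return its test AUC."""
    # random_state is fixed only to make the sketch reproducible.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])


scores = {
    "TRTR": auc(X_train, y_train, X_holdout, y_holdout),  # baseline on real data
    "TSTR": auc(X_synth, y_synth, X_holdout, y_holdout),  # train synthetic, test real
    "TRTS": auc(X_train, y_train, X_synth, y_synth),      # train real, test synthetic
}
for name, value in scores.items():
    print(f"{name}: AUC = {value:.3f}")
```

TSTR and TRTS scores that sit close to each other and close to the TRTR baseline indicate that the synthetic data behaves much like the real data for this classification task.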

Figure 3. Visualization of Area Under the Curve (AUC) scores per method per model

The results visualized above demonstrate the strength of synthetic data generation with the Syntho Engine compared with the other methods: there is no difference between the scores Syntho obtains with the two evaluation methods, which points to a high similarity between the synthetic and the real data. The red dotted line in the plot is the result of a Train Real, Test Real (TRTR) run and serves as a baseline for the observed metrics. It marks the value 0.92, which is the Area Under the Curve (AUC) score achieved by the model trained on real data and tested on real data.

Table 3. A tabular representation of the AUC scores achieved by TRTS and TSTR respectively per model

Time-wise comparison

Naturally, it is important to consider the time invested in producing these results. The visualization below illustrates this.

Figure 5. Visualization of the time taken to train each model and generate one million synthetic datapoints, with and without a GPU

Figure 5 shows the time taken to generate synthetic data in two different settings. The first, referred to here as "Without GPU", covers test runs on a system with an Intel Xeon CPU with 16 cores running at 2.20 GHz. The tests marked as "ran with a GPU" were run on a system with an AMD Ryzen 9 7945HX CPU with 16 cores running at 2.5 GHz and an NVIDIA GeForce RTX 4070 Laptop GPU. As Figure 5 and Table 5 below show, Syntho is significantly faster at generating synthetic data in both scenarios, which is critical in a dynamic workflow.
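For readers who want to reproduce this kind of measurement on their own hardware, here is a minimal timing sketch. `ResamplingGenerator` is a hypothetical toy stand-in with `fit` and `sample` methods. It only resamples stored rows, so it is neither a privacy-preserving generator nor the interface of the Syntho Engine or SDV. The sketch generates ten thousand rows instead of one million to keep it quick.

```python
import random
import time


class ResamplingGenerator:
    """Toy stand-in for a generator: it stores rows and resamples them."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._rows = []

    def fit(self, rows):
        self._rows = list(rows)

    def sample(self, n_rows):
        return self._rng.choices(self._rows, k=n_rows)


def time_generator(generator, training_rows, n_rows):
    """Return wall-clock seconds spent on training and on generation."""
    start = time.perf_counter()
    generator.fit(training_rows)
    trained = time.perf_counter()
    rows = generator.sample(n_rows)
    finished = time.perf_counter()
    return {
        "train_s": trained - start,
        "generate_s": finished - trained,
        "rows": len(rows),
    }


# Made-up training rows; a real run would pass the cleaned training set.
training_rows = [(age, age % 2) for age in range(18, 90)]
timing = time_generator(ResamplingGenerator(), training_rows, n_rows=10_000)
print(timing)
```

Measuring training and generation separately helps show where a model spends its time, which matters when data has to be regenerated often.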

Table 5. A tabular representation of the time taken to generate one million synthetic datapoints with each model, with and without a GPU

Concluding Remarks and Future Directions

The findings underscore the importance of thorough quality evaluation when choosing a synthetic data generation method. Syntho's Engine, with its AI-driven approach, shows noteworthy strengths on certain metrics, while open-source tools such as SDV shine in their versatility and community-driven improvements.

As the field of synthetic data continues to evolve, we encourage you to apply these metrics in your own projects, explore their intricacies, and share your experiences. Stay tuned for future posts in which we will look more closely at other metrics and highlight real-world examples of their application.

At the end of the day, for those looking to test the waters with synthetic data, the open-source alternative presented here can be a justifiable choice given its accessibility. For professionals who are building this technology into their development process, however, every chance to improve should be taken and every hindrance avoided, so it is important to choose the best option available. The analyses above make it apparent that Syntho, and with it the Syntho Engine, is a very capable tool for practitioners.

About Syntho

Syntho provides a smart synthetic data generation platform that leverages multiple forms of synthetic data and generation methods, empowering organizations to intelligently transform data into a competitive edge. Our AI-generated synthetic data mimics the statistical patterns of the original data, ensuring accuracy, privacy, and speed, as assessed by external experts such as SAS. With smart de-identification features and consistent mapping, sensitive information is protected while referential integrity is preserved. Our platform enables the creation, management, and control of test data for non-production environments, using rule-based synthetic data generation methods for targeted scenarios. In addition, users can generate synthetic data programmatically and obtain realistic test data to build comprehensive testing and development scenarios with ease.

Do you want to learn more about practical applications of synthetic data? Feel free to schedule a demo!

About the authors

Software Engineering Intern

Roham is a bachelor's student at Delft University of Technology and a Software Engineering Intern at Syntho.

Machine Learning Engineer

Mihai received his PhD from the University of Bristol on the topic of Hierarchical Reinforcement Learning applied to Robotics and is a Machine Learning Engineer at Syntho.
