
of both models, a model built on the original data and a model built on the generated data.

As a regression model we used random forests. The scikit-learn library provides the RandomForestRegressor class. The regressor is trained on dtrain and on dgen. The performance of the models is evaluated on data that was not seen during training (dtest). The performance is measured as the R2 score, a statistical measure of how close the data are to the fitted regression line. A perfect model that always predicts the expected value would get an R2 score of 1.0. As before, the reported values are for m1d1 and m2d1.
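As an illustration, a minimal sketch of this evaluation is shown below. The helper function and the synthetic data are ours and only stand in for the actual preprocessing of dtrain, dgen and dtest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def evaluate_r2(X_fit, y_fit, X_test, y_test):
    """Train a random forest on one training set and score it on the held-out test set."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_fit, y_fit)
    return r2_score(y_test, model.predict(X_test))

# Toy stand-in for one data set; in the experiments X_gen, y_gen would come
# from the generator instead of from the original training data.
rng = np.random.RandomState(0)
X = rng.rand(500, 5)
y = X @ rng.rand(5) + 0.1 * rng.rand(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("m1d1:", evaluate_r2(X_train, y_train, X_test, y_test))  # model built on original data
```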

4.6 Testing Environment

The testing was done on the Google Cloud Platform. We used a virtual machine running the Ubuntu 16.04 operating system. The script ran on a single core of an Intel Xeon CPU at 2.50 GHz, and the machine had 16 GB of RAM. The solution was developed and executed using Python 3.6.

Chapter 5 Results

The results are presented in tabular form. The tables show the metrics explained in the previous section. We also report the time in seconds needed to generate data for one training set (t[s]) and the percentage of generated cases exactly equal to cases from the training set (=).
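The "=" metric can be computed with a simple row-matching check. The sketch below is our own illustration (the function name is hypothetical) and assumes the generated and training data are pandas DataFrames with identical columns.

```python
import pandas as pd

def pct_exact_matches(d_gen: pd.DataFrame, d_train: pd.DataFrame) -> float:
    """Percentage of generated rows that are exact copies of some training row."""
    train_rows = set(map(tuple, d_train.itertuples(index=False, name=None)))
    hits = sum(tuple(row) in train_rows
               for row in d_gen.itertuples(index=False, name=None))
    return 100.0 * hits / len(d_gen)

d_train = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
d_gen = pd.DataFrame({"a": [1, 2, 4], "b": ["x", "q", "z"]})
print(pct_exact_matches(d_gen, d_train))  # 33.3..., only (1, "x") is an exact copy
```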

The results are split into three parts. In the first part, the generators' performance was tested on data with mixed attributes; the results are shown in Table 5.1 and Table 5.2. We wanted to test our solution on data sets with different numbers of attributes and instances to get a better overview of generator performance. On average, a generator based on autoencoders needs 13.1 seconds to generate data, while one based on variational autoencoders needs 17.2 seconds.

Overall, the performance is better for autoencoders, with an average ARI value of 0.73 and an average difference between m1d1 and m2d1 of −10 percentage points. With variational autoencoders, the average ARI value is 0.43 and ∆(m1, m2) is −23 percentage points. VAEs give significantly worse results for 22 out of 23 data sets. The Wilcoxon signed-rank test at α = 0.05 shows that the median difference in ∆(m1, m2) between generators is not zero and supports the alternative hypothesis that autoencoders perform better than variational autoencoders. For autoencoders, both models, trained on the original and generated data, perform almost equally well on three data sets; additionally, the difference in accuracy is less than 5 percentage points in 6 cases. These data sets, on which autoencoders perform well, do not seem to have any distinctive characteristic that differentiates them from the other data sets, on which the performance is worse.
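The test itself is a one-liner with SciPy. The sketch below uses the ∆(m1, m2) columns of Tables 5.1 and 5.2 and assumes a recent SciPy version that supports the one-sided alternative.

```python
from scipy.stats import wilcoxon

# ∆(m1, m2) per data set, copied from Tables 5.1 (autoencoders) and 5.2 (VAEs).
delta_ae = [-3.95, -14.93, -6.71, -15.59, -25.94, -12.80, -13.04, -28.36, 7.50,
            0.07, -10.09, -12.85, -8.41, 1.10, -13.20, -0.20, -5.13, -28.14,
            -11.80, -13.97, 0.67, 4.30, -8.30]
delta_vae = [-7.79, -31.24, -16.57, -30.65, -42.57, -14.44, -29.27, -47.02, -25.50,
             -8.57, -2.91, -24.76, -18.79, -3.33, -20.00, -21.85, -17.27, -55.47,
             -17.46, -19.50, -2.40, -61.60, -13.77]

# One-sided test: do autoencoders lose less accuracy than VAEs (alpha = 0.05)?
stat, p_value = wilcoxon(delta_ae, delta_vae, alternative="greater")
print(p_value < 0.05)
```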

In the second part, we tested generators on 17 data sets used by Robnik-Šikonja [31]. The aim was to compare our results to generators based on RBF networks and random forests. We decided to compare only autoencoders, as they produce better results, as seen in the previous test. Table 5.3 shows the results of autoencoder-based generators and Table 5.4 the results of RBF and random forest based generators. The Wilcoxon signed-rank test at α = 0.05 shows that the median difference in ∆(m1, m2) between generators is not zero and supports the alternative hypothesis that generators based on RBF networks and random forests perform better. Our generator provides slightly better results for 2 data sets and similar results for 7 data sets. We got slightly worse results for 3 data sets and much worse results for 5.

To get an idea of when the generators produce acceptable data, we analyzed how data set characteristics are correlated with the difference between m1d1 and m2d1 (∆(m1, m2)). The data set characteristics used are the type of attributes (numerical, categorical, or mixed), the number of attributes, the number of cases, the number of class values, and the normalized Shannon entropy of the class variable. The Shannon entropy tells us how much information is encoded in the distribution of class values (a value of 1 means that all classes are equally represented).

It is calculated as

E = \frac{\sum_{i=1}^{N} p_i \log_2\left(\frac{1}{p_i}\right)}{\log_2(N)},

where p_i is the probability of class i and N is the number of class values.
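The formula maps directly to a few lines of NumPy; the helper below is our own illustrative implementation, not part of the thesis code.

```python
import numpy as np

def normalized_entropy(class_labels) -> float:
    """Normalized Shannon entropy of a class distribution (1.0 = perfectly balanced)."""
    _, counts = np.unique(class_labels, return_counts=True)
    p = counts / counts.sum()
    n = len(p)
    if n == 1:
        return 0.0  # a single class carries no information
    return float(-(p * np.log2(p)).sum() / np.log2(n))

print(normalized_entropy(["a", "a", "b", "b"]))  # 1.0, two balanced classes
print(normalized_entropy(["a", "a", "a", "b"]))  # ~0.81, imbalanced classes
```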

Figure 5.1.a presents the correlation matrix for the generators based on autoencoders. It can be seen that there is no strong correlation between the characteristics and the difference in performance. The matrix suggests that increasing the number of classes increases the margin between the models' performance, while increasing the number of cases decreases it. The type of the attributes also has an impact on the performance. From the matrix it seems that the generators perform better for categorical data than for numerical data. Nonetheless, they work best on mixed data; note that attribute types are encoded as integers, where −1 stands for numerical data, 1 for categorical data and 0 for mixed data. The number of attributes and the entropy have only a small impact on the difference in performance.


data set t[s] = m(∆mean) m(∆std) m(∆γ1) m(∆γ2) ARI m1d1[%] m2d1[%] ∆(m1, m2)
Benefits 27.5 0 0.023 0.055 0.20 1.01 0.56±0.04 50.52±2.40 46.56±2.60 −3.95
biomass 9.7 0 0.018 0.021 0.98 8.31 0.94±0.03 78.02±10.21 63.09±11.00 −14.93
Caschool 16.0 0 0.022 0.035 0.23 1.63 0.52±0.06 21.24±6.60 14.52±4.55 −6.71
Caterpillars 6.3 0 0.012 0.017 0.17 0.28 0.85±0.08 84.65±7.84 69.06±12.52 −15.59
Crime 10.1 0 0.022 0.062 0.76 4.72 0.89±0.03 89.11±4.46 63.17±7.68 −25.94
DoctorAUS 40.2 0 0.031 0.060 0.42 3.87 0.87±0.03 55.36±1.92 42.57±5.99 −12.80
Fatality 9.0 0 0.024 0.056 0.19 1.26 0.62±0.05 94.15±3.75 81.12±6.81 −13.04
Fishing 13.2 0 0.046 0.058 0.55 2.54 0.96±0.03 91.79±2.66 63.43±5.48 −28.36
highway 5.1 0 0.023 0.036 0.40 1.62 0.56±0.09 61.83±22.37 69.33±23.83 7.50
hla 8.8 0 0.017 0.022 0.10 0.16 0.65±0.06 99.85±1.04 99.93±0.52 0.07
Hoops 6.8 0 0.016 0.050 0.11 0.93 0.42±0.05 22.46±10.33 12.37±7.33 −10.09
infant mortality 10.1 0 0.020 0.032 1.01 6.86 0.94±0.07 56.75±15.57 43.89±14.03 −12.85
InstInnovation 31.1 0 0.017 0.058 1.37 18.28 0.75±0.04 95.28±0.83 86.87±2.29 −8.41
InsuranceVote 9.3 0 0.025 0.057 0.14 1.38 0.69±0.10 87.95±5.27 89.05±7.21 1.10
iris 7.0 0 0.021 0.017 0.19 0.33 0.56±0.07 94.93±4.73 81.73±9.69 −13.20
Kakadu 19.5 0 0.026 0.032 0.19 0.56 0.53±0.04 99.69±0.43 99.50±0.67 −0.20
MedGPA 5.3 0 0.020 0.031 0.25 1.38 0.63±0.07 75.73±15.89 70.60±18.82 −5.13
midwest 9.8 0 0.013 0.025 1.37 12.31 0.93±0.04 90.62±4.25 62.48±7.29 −28.14
Mroz 13.0 0 0.025 0.078 0.28 1.10 0.72±0.05 69.77±4.70 57.98±6.83 −11.80
msleep 6.6 0 0.013 0.034 1.17 7.06 0.80±0.08 54.81±17.40 40.83±17.72 −13.97
skulls 7.1 0 0.019 0.035 0.13 0.86 0.74±0.04 24.27±11.29 24.93±10.65 0.67
soils 6.3 0 0.016 0.010 0.12 0.34 0.68±0.07 94.50±11.63 98.80±4.75 4.30
Tobacco 23.1 0 0.029 0.065 0.23 1.85 0.99±0.01 66.81±2.82 58.50±4.27 −8.30
avg 13.1 0 0.022 0.041 0.46 3.42 0.73±0.17 72.18±26.24 62.62±26.67 −9.55

Table 5.1: Autoencoder results on data sets with mixed types of attributes.

data set t[s] = m(∆mean) m(∆std) m(∆γ1) m(∆γ2) ARI m1d1[%] m2d1[%] ∆(m1, m2)
Benefits 53.3 0 0.009 0.140 0.34 0.73 0.36±0.04 50.09±2.10 42.30±2.77 −7.79
biomass 9.0 0 0.290 0.053 4.01 23.29 0.52±0.15 78.89±10.61 47.65±16.63 −31.24
Caschool 10.7 0 0.135 0.109 1.02 1.80 0.00±0.00 22.29±6.09 5.71±3.87 −16.57
Caterpillars 7.1 0 0.100 0.154 0.50 0.90 0.56±0.12 84.78±8.30 54.13±10.74 −30.65
Crime 11.2 0 0.225 0.076 2.23 7.25 0.56±0.06 88.29±4.70 45.71±8.31 −42.57
DoctorAUS 60.3 0 0.032 0.107 1.16 4.25 0.72±0.06 55.36±2.17 40.93±4.39 −14.44
Fatality 8.2 0 0.111 0.107 0.66 0.93 0.38±0.09 92.96±3.63 63.69±10.90 −29.27
Fishing 16.9 0 0.113 0.119 1.30 4.72 0.59±0.05 92.08±3.31 45.06±6.04 −47.02
highway 6.1 0 0.210 0.114 1.34 2.59 0.27±0.26 58.00±25.16 32.50±24.17 −25.50
hla 7.4 0 0.066 0.198 0.39 0.70 0.44±0.11 100.00±0.00 91.43±7.80 −8.57
Hoops 5.8 0 0.067 0.101 0.29 0.49 0.10±0.08 17.68±8.75 14.76±10.26 −2.91
infant mortality 7.7 0 0.276 0.068 2.14 8.84 0.38±0.20 54.36±15.97 29.60±16.76 −24.76
InstInnovation 62.7 0 0.014 0.063 2.69 26.37 0.65±0.04 95.22±0.83 76.43±6.04 −18.79
InsuranceVote 9.1 0 0.081 0.111 0.63 1.18 0.59±0.06 87.55±5.09 84.22±9.53 −3.33
iris 8.1 0 0.020 0.152 0.26 0.64 0.47±0.07 94.93±5.09 74.93±14.27 −20.00
Kakadu 29.0 0 0.067 0.141 0.42 0.59 0.30±0.04 99.79±0.34 77.94±5.17 −21.85
MedGPA 5.4 0 0.070 0.086 0.50 1.26 0.46±0.25 75.47±16.12 58.20±22.81 −17.27
midwest 9.6 0 0.380 0.081 3.90 21.78 0.05±0.09 89.89±4.03 34.42±20.95 −55.47
Mroz 12.7 0 0.130 0.099 0.49 0.89 0.58±0.03 69.38±6.15 51.92±5.77 −17.46
msleep 6.5 0 0.238 0.064 2.46 9.55 0.12±0.14 57.14±15.58 37.64±16.48 −19.50
skulls 7.9 0 0.019 0.110 0.10 0.35 0.60±0.16 24.13±10.99 21.73±11.30 −2.40
soils 6.2 0 0.138 0.128 0.36 0.77 0.13±0.18 92.70±13.83 31.10±21.89 −61.60
Tobacco 34.1 0 0.059 0.074 0.66 2.95 0.98±0.02 66.55±3.20 52.78±7.62 −13.77
avg 17.2 0 0.124 0.107 1.21 5.34 0.43±0.23 71.63±24.82 48.47±21.75 −23.16

Table 5.2: Variational autoencoder results on data sets with mixed types of attributes.

data set t[s] = m(∆mean) m(∆std) m(∆γ1) m(∆γ2) ARI m1d1[%] m2d1[%] ∆(m1, m2)
annealing 18.1 0 0.013 0.021 0.27 1.19 0.67±0.04 99.31±1.06 89.12±15.62 −10.19
balance-scale 11.4 0 0.016 0.023 0.06 0.32 0.66±0.08 81.09±4.71 77.70±4.69 −3.39
breast-cancer 8.9 0 0.022 0.024 0.09 0.51 0.68±0.04 70.26±7.34 63.34±10.47 −6.91
breast-cancer-wdbc 9.2 0 0.024 0.070 0.38 2.48 0.44±0.04 94.98±2.07 91.35±3.84 −3.62
breast-cancer-wisconsin 15.0 0 0.031 0.023 0.23 0.80 0.78±0.07 96.05±2.30 93.70±2.91 −2.35
bridges-version1 5.4 0 0.017 0.014 0.11 0.66 0.53±0.04 64.75±11.76 61.22±10.73 −3.53
bridges-version2 6.0 36 0.000 0.000 0.00 0.00 0.57±0.04 63.91±13.34 59.11±15.11 −4.80
dermatology 10.8 0 0.015 0.016 0.22 0.67 0.82±0.03 96.23±2.97 89.13±6.46 −7.10
ecoli 7.9 0 0.014 0.044 0.29 1.00 0.71±0.04 84.23±5.81 53.95±14.26 −30.28
flags 7.2 0 0.011 0.025 0.15 0.52 0.60±0.04 60.81±11.64 36.29±12.07 −24.52
glass 7.5 0 0.025 0.062 0.70 3.68 0.85±0.06 73.82±8.69 40.69±13.02 −33.13
haberman 9.0 0 0.021 0.052 0.18 0.57 0.53±0.06 68.18±6.82 68.69±9.20 0.51
iris 6.8 0 0.021 0.013 0.19 0.31 0.52±0.06 95.33±6.00 83.07±11.25 −12.27
post-operative 5.7 98 0.000 0.000 0.00 0.00 0.33±0.03 61.56±14.44 46.89±17.40 −14.67
primary-tumor 10.0 0 0.011 0.025 0.09 0.24 0.66±0.04 42.61±8.19 23.56±9.28 −19.05
soybean-large 11.3 20 0.011 0.025 0.16 0.85 0.70±0.04 88.23±13.45 20.36±31.17 −67.87
tic-tac-toe 17.8 95 0.000 0.000 0.00 0.00 0.54±0.04 95.95±2.18 53.88±5.11 −42.07
avg 9.9 15.3 0.015 0.026 0.18 0.81 0.62±0.13 78.66±16.33 61.89±22.71 −16.78

Table 5.3: Results of autoencoders (data sets from [31]).

data set t[s] = m(∆mean) m(∆std) m(∆γ1) m(∆γ2) ARI m1d1[%] m2d1[%] ∆(m1, m2)
annealing / 0 0.126 0.014 0.65 1.29 0.63±0.06 99.33±1.00 95.55±2.67 −3.79
balance-scale / 63 0.018 0.043 0.07 0.26 0.53±0.07 81.41±3.54 71.27±5.52 −10.14
breast-cancer / 52 0.009 0.015 0.03 0.15 0.68±0.08 70.87±6.90 72.10±7.78 1.23
breast-cancer-wdbc / 0 0.017 0.033 0.24 1.10 0.41±0.07 95.99±2.13 93.28±3.05 −2.70
breast-cancer-wisconsin / 14 0.025 0.031 0.18 0.32 0.93±0.02 95.83±2.20 95.20±2.25 −0.63
bridges-version1 / 0 0.008 0.025 0.15 0.48 0.52±0.19 65.41±13.83 63.57±13.06 −1.84
bridges-version2 / 36 0.000 0.000 0.00 0.00 0.48±0.13 65.74±12.13 62.30±11.05 −3.44
dermatology / 0 0.053 0.031 0.30 0.68 0.68±0.07 96.62±3.15 93.93±3.84 −2.69
ecoli / 0 0.047 0.021 0.27 0.58 0.91±0.06 84.82±5.48 73.62±9.30 −11.20
flags / 0 0.015 0.016 0.17 0.57 0.67±0.10 62.44±9.68 47.76±11.77 −14.68
glass / 0 0.028 0.023 0.47 2.29 0.50±0.17 76.17±9.02 46.34±10.18 −29.83
haberman / 3 0.042 0.025 0.22 0.28 0.53±0.12 67.54±7.48 66.70±8.41 −0.84
iris / 0 0.034 0.020 0.17 0.21 0.54±0.12 94.93±5.26 91.20±7.23 −3.73
post-operative / 91 0.000 0.000 0.00 0.00 0.19±0.11 61.84±12.80 59.54±12.54 −2.30
primary-tumor / 43 0.014 0.008 0.10 0.22 0.48±0.06 32.23±5.74 31.61±7.02 −0.62
soybean-large / 14 0.019 0.022 0.30 1.45 0.67±0.06 76.86±18.10 76.63±16.74 −0.23
tic-tac-toe / 77 0.000 0.000 0.00 0.00 0.55±0.04 95.26±2.47 93.38±2.83 −1.88
avg / 2 0.027 0.019 0.20 0.58 0.58±0.17 77.84±17.42 72.59±18.94 −5.25

Table 5.4: Results of generators based on RBF and random forests (data sets from [31]).



Figure 5.1.b presents the correlation matrix for the generators based on variational autoencoders. As before, it can be seen that there is no strong correlation between the characteristics and the difference between performance on original and generated data. The matrix suggests that increasing the number of cases and a higher entropy decrease the margin between the models' performance, while increasing the number of classes increases it. Like before, the generators work best with mixed data. We can conclude that a generator based on variational autoencoders would perform best on a data set with a small number of mixed attributes and only two balanced classes.

Figure 5.1: Correlation matrix of data set characteristics and ∆(m1, m2), reported as diff, for a) autoencoders and b) variational autoencoders.
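The correlation analysis behind Figure 5.1 can be reproduced with a single pandas call. The sketch below is our own example; every value in it is a placeholder, not a thesis measurement, and the attribute type is encoded as in the text (−1 numerical, 0 mixed, 1 categorical).

```python
import pandas as pd

# One row per data set: its characteristics plus the observed ∆(m1, m2) ("diff").
characteristics = pd.DataFrame({
    "attr_type": [0, 0, 1, -1],
    "n_attrs":   [12, 7, 9, 4],
    "n_cases":   [1500, 300, 958, 150],
    "n_classes": [2, 4, 2, 3],
    "entropy":   [0.98, 0.71, 0.93, 1.00],
    "diff":      [-4.0, -15.0, -42.0, -12.0],
})
print(characteristics.corr())  # Pearson correlation matrix, one cell per pair of columns
```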

Testing Regression Data Sets

In the last part, we tested the performance on data sets with only numerical attributes and a numerical target variable. The results are in Tables 5.5 and 5.6.

The goal was to assess how well the generators handle regression problems.

The results show that our generators could be a good approach for some regression problems. Even though the average ∆(m1, m2) values suggest that autoencoders perform better, a comparison of individual data sets suggests VAEs are better. The average is misleading, as VAEs give significantly worse results on data sets where both generators perform badly. The results show there is a high variance in the generated data, therefore it is difficult to say which approach is better. Conducting the Wilcoxon signed-rank test, we fail to reject the null hypothesis at α = 0.05 that the median difference in ∆(m1, m2) between the generators is zero. The generated data was acceptable for 3 data sets out of 12. In 6 cases, models built on the generated data produce a negative R2, which means that the chosen model does not follow the trend of the data and fits it worse than a horizontal line representing the mean.
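To illustrate with made-up numbers: predicting the mean of the target gives an R2 of exactly 0, and anything that fits worse than that horizontal line goes negative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predicting the mean of y_true gives R2 = 0 ...
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))  # 0.0

# ... and a model that fits worse than that horizontal line produces a negative R2.
print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))       # -3.0
```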

data set t[s] = m(∆mean) m(∆std) m(∆γ1) m(∆γ2) ARI m1d1[R2] m2d1[R2] ∆(m1, m2)
aids 9.4 0 0.017 0.036 0.08 0.33 0.74±0.08 0.89±0.12 −1.06±1.81 −1.95
Bordeaux 6.3 0 0.094 0.070 0.44 0.81 0.37±0.11 −1.99±7.33 −84.59±263.94 −82.60
cars04 5.1 0 0.031 0.061 0.23 1.41 0.45±0.07 0.82±0.11 0.32±0.46 −0.50
diamonds 14.4 0 0.032 0.139 1.20 3.19 0.32±0.05 0.98±0.01 0.60±0.26 −0.38
honeyproduction 10.7 0 0.061 0.080 1.19 6.60 0.43±0.08 0.75±0.14 0.62±0.22 −0.14
house data 77.5 0 0.037 0.116 0.40 2.23 0.41±0.07 0.87±0.02 −0.24±0.73 −1.11
insurance 15.1 0 0.047 0.107 0.14 1.07 0.41±0.04 −0.15±0.11 −0.99±0.60 −0.84
longley 4.1 0 0.071 0.180 0.28 0.74 0.30±0.10 −0.27±3.53 −19.15±83.85 −18.88
magazines 9.4 0 0.071 0.068 2.73 19.95 0.58±0.12 0.66±0.24 0.04±1.51 −0.62
pgatour2006 6.6 0 0.024 0.090 0.16 1.32 0.27±0.04 0.07±0.51 −1.39±2.86 −1.46
weatherHistory 12.2 0 0.025 0.086 0.17 1.16 0.39±0.05 1.00±0.00 0.82±0.17 −0.18
winequality-white 28.2 0 0.050 0.158 0.38 3.73 0.26±0.03 0.90±0.01 0.12±0.27 −0.78
avg 16.6 0 0.047 0.099 0.62 3.54 0.41±0.13 0.38±0.83 −8.74±23.47 −9.12

Table 5.5: Results of autoencoders for regression problems.

We can conclude that for regression problems our approach strongly depends on the data set. We suspect that the results would improve if we used the mean squared error as the loss function in the autoencoder and the variational autoencoder, as mean squared error is usually used for regression problems.


data set t[s] = m(∆mean) m(∆std) m(∆γ1) m(∆γ2) ARI m1d1[R2] m2d1[R2] ∆(m1, m2)
aids 10.4 0 0.078 0.173 0.22 0.74 0.44±0.07 0.89±0.13 −3.37±3.51 −4.26
Bordeaux 5.5 0 0.040 0.110 0.36 0.58 0.29±0.23 −2.63±8.56 −151.86±355.80 −149.24
cars04 5.7 0 0.100 0.085 0.50 0.78 0.40±0.12 0.82±0.11 −0.82±1.38 −1.65
diamonds 15.0 0 0.096 0.093 1.41 2.34 0.17±0.02 0.98±0.01 0.91±0.03 −0.07
honeyproduction 9.0 0 0.234 0.039 2.38 8.65 0.14±0.04 0.74±0.15 0.63±0.12 −0.10
house data 160.5 0 0.006 0.081 0.92 1.74 0.53±0.05 0.87±0.02 0.34±0.04 −0.53
insurance 16.6 0 0.025 0.110 0.20 0.45 0.55±0.07 −0.18±0.10 −0.02±0.11 0.16
longley 5.2 0 0.039 0.197 0.21 0.42 0.95±0.15 −0.79±7.33 −18.53±64.49 −17.74
magazines 9.1 0 0.212 0.066 3.41 21.87 0.18±0.11 0.61±0.32 −0.07±0.92 −0.68
pgatour2006 5.0 0 0.025 0.089 0.13 0.45 0.14±0.08 0.05±0.71 −40.25±30.79 −40.29
weatherHistory 13.8 0 0.058 0.112 0.34 0.46 0.32±0.04 1.00±0.00 0.90±0.02 −0.10
winequality-white 29.1 0 0.003 0.070 0.77 3.99 0.35±0.03 0.90±0.01 0.59±0.03 −0.31
avg 23.7 0 0.076 0.102 0.90 3.54 0.37±0.22 0.27±1.02 −17.63±42.14 −17.90

Table 5.6: Results of variational autoencoders for regression problems.

Since our focus is more on classification tasks, improving the generators for regression problems is left for future work.
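If this is revisited, switching the reconstruction loss is typically a one-line change. The sketch below shows it for a small Keras-style autoencoder; it is our own illustration, not the implementation used in the thesis.

```python
from tensorflow import keras

n_features = 10  # illustrative input width

inputs = keras.Input(shape=(n_features,))
encoded = keras.layers.Dense(5, activation="relu")(inputs)
decoded = keras.layers.Dense(n_features, activation="linear")(encoded)

autoencoder = keras.Model(inputs, decoded)
# Mean squared error instead of a cross-entropy reconstruction loss.
autoencoder.compile(optimizer="adam", loss="mse")
```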

5.1 Dependence on the Number of Cases

We tested the hypothesis that increasing the number of training instances significantly improves the results. We selected three data sets with at least 1500 instances (Benefits, Kakadu and Tobacco). The data sets were tested with 50, 100, 250, 500, 1000 and 1500 training instances, which were randomly selected from the data set. Figure 5.2.a shows m1d1 values (dashed line) and m2d1 values (solid line) of autoencoders and their standard deviations for each data set. It can be seen that the average values stop improving at approximately 300 cases and the standard deviations stabilize after 1000 cases. On the other hand, generators based on VAEs (Figure 5.2.b) seem to be more volatile, especially when there are fewer than 500 cases.
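The subsampling itself is straightforward; below is a small sketch with a toy data frame standing in for one of the selected data sets (our own illustration).

```python
import numpy as np
import pandas as pd

sizes = [50, 100, 250, 500, 1000, 1500]

# Toy stand-in for a data set with at least 1500 instances (e.g. Benefits).
rng = np.random.RandomState(0)
full = pd.DataFrame({
    "x1": rng.rand(2000),
    "x2": rng.rand(2000),
    "cls": rng.randint(0, 2, 2000),
})

# Each subset would then serve as the training set for one run of the experiment.
subsets = {n: full.sample(n=n, random_state=0) for n in sizes}
for n, d_sub in subsets.items():
    print(n, len(d_sub))
```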

We can conclude that the generators' performance is unstable, with high variance, for a small number of instances. The performance improves and stabilizes after reaching a certain threshold. We assume that the threshold is around 1000 instances, although we cannot be sure, as we only tested on three different data sets.

Figure 5.2: Graphs of m2d1 classification accuracy with its standard deviation depending on the number of cases for a) autoencoders and b) variational autoencoders.