Introduction to Machine Learning
Course material for the practical part of the course at Akademija FRI
Petar Vračar, March 2020
Contents
INTRODUCTION TO R
  Vectors (the basic data objects in R)
  Factors
  Lists
  Data frames
SUPERVISED LEARNING
  Classification
  Regression
UNSUPERVISED LEARNING
  Clustering
  Association rules
INTRODUCTION TO R
R can be used as a calculator
(50 + 1.45)/12.5
## [1] 4.116
Assignment operators
x = 945
y <- sin(0.47)^2 * sqrt(5)
y^2 -> z
To get the current value of an object (variable), enter its name
x
## [1] 945
y
## [1] 0.4586309
z
## [1] 0.2103423
Listing objects and removing them from memory
ls()
## [1] "x" "y" "z"
rm(y)
rm(x,z)
To remove all objects from memory
rm(list=ls())
Vectors (the basic data objects in R)
Building a vector by listing the values of its elements
v <- c(14,7,23.5,76.2)
v
## [1] 14.0 7.0 23.5 76.2
Building arithmetic sequences
v <- 1:10
v
## [1] 1 2 3 4 5 6 7 8 9 10
v <- seq(from=5, to=10, by=2)
v
## [1] 5 7 9
Building a vector by repeating elements
w <- rep(v, times = 2)
w
## [1] 5 7 9 5 7 9
Scalars are vectors with a single element
w <- 45.0
w
## [1] 45
A vector can be built from other vectors
z <- c(v, 2.5, w)
z
## [1] 5.0 7.0 9.0 2.5 45.0
Useful functions on vectors
v <- c(8, 4, 2, 3, 1, 9, 6)
length(v)
## [1] 7
max(v)
## [1] 9
min(v)
## [1] 1
which.min(v)
## [1] 5
sum(v)
## [1] 33
mean(v)
## [1] 4.714286
sd(v)
## [1] 3.039424
rev(v)
## [1] 6 9 1 3 2 4 8
sort(v)
## [1] 1 2 3 4 6 8 9
sort(v, decreasing=T)
## [1] 9 8 6 4 3 2 1
order(v)
## [1] 5 3 4 2 7 1 6
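The difference between sort() and order() is easy to miss: order() returns positions, not values. The sketch below reuses the vector v from above; the companion vector tags is hypothetical and serves only to show how a second vector can be reordered in parallel.

```r
v <- c(8, 4, 2, 3, 1, 9, 6)

# order(v) returns the positions that would sort v,
# so indexing v with them reproduces sort(v)
sorted_by_order <- v[order(v)]
identical(sorted_by_order, sort(v))

# The positions can reorder a second vector in parallel
tags <- c("a", "b", "c", "d", "e", "f", "g")  # hypothetical labels for v
tags_by_v <- tags[order(v)]
tags_by_v
```

This is why order() is the usual choice when several vectors (or the rows of a data frame) must be sorted by the values of one of them.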
Data type of the vector's elements
mode(v)
## [1] "numeric"
Logical vector (its elements are logical constants)
b <- c(TRUE, FALSE, F, T)
b
## [1] TRUE FALSE FALSE TRUE
mode(b)
## [1] "logical"
x <- 5 > 3
x
## [1] TRUE
mode(x)
## [1] "logical"
Character vector (its elements are character strings)
s <- c("character", "logical", "numeric", "complex")
mode(s)
## [1] "character"
All elements of a vector must be of the same type (otherwise R automatically converts them to a common type)
c(F, T, 5)
## [1] 0 1 5
c(2.5, 4, 8.1, T)
## [1] 2.5 4.0 8.1 1.0
c(4, 9, T, F, 12.6, "aaa")
## [1] "4" "9" "TRUE" "FALSE" "12.6" "aaa"
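The automatic conversion follows a fixed direction: logical values become numbers, and numbers become character strings. Conversions can also be requested explicitly. The variable names in this small sketch are chosen for illustration only.

```r
# Automatic conversion: a logical value is promoted to numeric
mixed <- c(TRUE, 2.5)
mode(mixed)

# Explicit conversions with the as.* family of functions
nums  <- as.numeric(c("4", "12.6"))   # strings to numbers
flags <- as.logical(c(0, 1, 5))       # zero is FALSE, non-zero is TRUE

# A string that is not a number becomes NA (R also issues a warning,
# which is suppressed here)
bad <- suppressWarnings(as.numeric("aaa"))
is.na(bad)
```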
Operations on vectors
Let us define two vectors:
v1 <- c(10,20,30,40)
v2 <- 1:4
Arithmetic operations are applied to elements at matching positions
v1 + v2
## [1] 11 22 33 44
v1 * v2
## [1] 10 40 90 160
Functions are applied element by element
v1^2
## [1] 100 400 900 1600
sqrt(v1)
## [1] 3.162278 4.472136 5.477226 6.324555
exp(v1)
## [1] 2.202647e+04 4.851652e+08 1.068647e+13 2.353853e+17
log2(v1)
## [1] 3.321928 4.321928 4.906891 5.321928
If the operands are not of equal length, the elements of the shorter vector are recycled cyclically during arithmetic operations
v1 * 10
## [1] 100 200 300 400
v1 + 1
## [1] 11 21 31 41
v1 + c(100, 200)
## [1] 110 220 130 240
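Recycling can be understood as an implicit rep() of the shorter vector. The sketch below makes that explicit and also shows the case where the longer length is not a multiple of the shorter one; R still computes a result there, but issues a warning. The variable names are illustrative.

```r
v1 <- c(10, 20, 30, 40)

# Recycling is equivalent to repeating the shorter vector explicitly
recycled <- v1 + c(100, 200)
explicit <- v1 + rep(c(100, 200), times = 2)
identical(recycled, explicit)

# Length 4 is not a multiple of 3: the operation raises a warning,
# which is captured here as a message string
msg <- tryCatch(v1 + c(1, 2, 3),
                warning = function(w) conditionMessage(w))
msg
```

Such warnings usually indicate a mistake in the lengths of the operands and are worth checking.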
Indexing vector elements
Let us define a vector:
x <- c(-10,20,-30,40,-50,60,-70,80)
x
## [1] -10 20 -30 40 -50 60 -70 80
Elements can be selected by listing the indices (positions) of interest (the first element of a vector is at position 1)
x[c(1,4,5)]
## [1] -10 40 -50
x[1:3]
## [1] -10 20 -30
Negative index values select all elements except the listed ones
x[-1]
## [1] 20 -30 40 -50 60 -70 80
x[c(-4,-6)]
## [1] -10 20 -30 -50 -70 80
x[-(1:3)]
## [1] 40 -50 60 -70 80
Elements can also be selected with a logical vector; this selects the elements at the positions where the logical vector is TRUE.
An element-wise comparison of a vector yields a logical vector
x > 0
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
Indexing with a logical vector (returns the elements at the positions of the TRUE constants)
x[x>0]
## [1] 20 40 60 80
x[x <= -20 | x > 50]
## [1] -30 -50 60 -70 80
x[x > 40 & x < 100]
## [1] 60 80
To test for equality we use the == operator. To test for inequality we use the != operator.
The which() function returns the indices at which the value is TRUE
which(x > 0)
## [1] 2 4 6 8
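Because TRUE counts as 1 and FALSE as 0 in arithmetic, logical vectors are also a convenient way to count or summarise elements that satisfy a condition. A short sketch on the same vector x; the result names are illustrative.

```r
x <- c(-10, 20, -30, 40, -50, 60, -70, 80)

n_positive     <- sum(x > 0)    # how many elements are positive
share_positive <- mean(x > 0)   # proportion of positive elements

any_large <- any(x > 100)       # is at least one element above 100?
all_far   <- all(abs(x) >= 10)  # are all elements at least 10 away from 0?
```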
Vector elements can be named
point <- c(4.7, 3.6, 2.5)
names(point) <- c('x', 'y', 'z')
point
## x y z
## 4.7 3.6 2.5
Elements can now be selected by name
point['x']
## x
## 4.7
point[c('x','z')]
## x z
## 4.7 2.5
If no indices are given, all elements of the vector are selected
point[] <- 0
point
## x y z
## 0 0 0
The following command gives a completely different result: it replaces the whole object with a single number
point <- 0
point
## [1] 0
Modifying vectors
Let us define a vector:
x <- c("a", "b", "c", "d")
Changing the values of elements
x[2] <- "BBBBB"
x
## [1] "a" "BBBBB" "c" "d"
x[c(1,3)] <- c("AAAAA", "CCCCC")
x
## [1] "AAAAA" "BBBBB" "CCCCC" "d"
Adding a new element
x[length(x)+1] = "EEEEE"
x
## [1] "AAAAA" "BBBBB" "CCCCC" "d" "EEEEE"
What happens if we do not define all elements of the vector?
x[10] <- "FFFFF"
x
## [1] "AAAAA" "BBBBB" "CCCCC" "d" "EEEEE" NA NA NA
## [9] NA "FFFFF"
At which positions are element values missing?
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
Removing elements from a vector
x <- x[-c(1,3)]
x
## [1] "BBBBB" "d" "EEEEE" NA NA NA NA "FFFFF"
x <- c(x[2],x[3])
x
## [1] "d" "EEEEE"
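Missing values (NA) propagate through most computations, so they usually have to be counted, filtered out, or skipped explicitly. A minimal sketch; the vectors values and marks are made up for illustration.

```r
values <- c("AAAAA", NA, "d", NA, "FFFFF")  # hypothetical vector with gaps

n_missing <- sum(is.na(values))     # number of missing elements
present   <- values[!is.na(values)] # keep only the defined elements

marks <- c(10, NA, 8)               # hypothetical numeric vector
mean_with_na    <- mean(marks)               # NA: one value is unknown
mean_without_na <- mean(marks, na.rm = TRUE) # skips NA, gives 9
```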
Factors
Let us define a vector:
gender <- c("f","m","m","m","f","m","f")
gender
## [1] "f" "m" "m" "m" "f" "m" "f"
Factors are used to model nominal variables
gender <- factor(gender)
gender
## [1] f m m m f m f
## Levels: f m
The "levels" argument defines the allowed values of the elements (smeri means "directions"; levo, desno, gor, dol mean left, right, up, down)
smeri <- factor(c('levo','levo','desno'), levels = c('levo','desno','gor','dol'))
smeri
## [1] levo levo desno
## Levels: levo desno gor dol
Listing the allowed values
levels(smeri)
## [1] "levo" "desno" "gor" "dol"
Only allowed values can be assigned to the elements of a factor
smeri[1] <- "posevno"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "posevno"): invalid factor
## level, NA generated
smeri
## [1] <NA> levo desno
## Levels: levo desno gor dol
smeri[1] <- "gor"
smeri
## [1] gor levo desno
## Levels: levo desno gor dol
Frequency table of the values
table(gender)
## gender
## f m
## 3 4
table(smeri)
## smeri
## levo desno gor dol
## 1 1 1 0
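Internally, a factor stores integer codes that point into its vector of levels, which explains why unused levels such as "dol" still appear in the frequency table with a count of 0. The sketch below illustrates this on a fresh copy of the factor; the result names are illustrative.

```r
smeri <- factor(c("levo", "levo", "desno"),
                levels = c("levo", "desno", "gor", "dol"))

# Each element is stored as the position of its value among the levels
codes <- as.integer(smeri)

# The original labels can be recovered from the codes
labels_back <- levels(smeri)[codes]

# droplevels() removes levels that do not occur in the data
used_only <- droplevels(smeri)
levels(used_only)
```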
Lists
A list is an ordered collection of objects
student <- list(id=12345,name="Marko",marks=c(10,9,10,9,8,10))
student
## $id
## [1] 12345
##
## $name
## [1] "Marko"
##
## $marks
## [1] 10 9 10 9 8 10
Accessing list components (by name)
student$id
## [1] 12345
student$name
## [1] "Marko"
student$marks
## [1] 10 9 10 9 8 10
Accessing list components (by index)
student[[1]]
## [1] 12345
student[[2]]
## [1] "Marko"
student[[3]]
## [1] 10 9 10 9 8 10
Adding a new component to a list
student$parents <- c("Ana", "Tomaz")
student
## $id
## [1] 12345
##
## $name
## [1] "Marko"
##
## $marks
## [1] 10 9 10 9 8 10
##
## $parents
## [1] "Ana" "Tomaz"
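Lists support two kinds of brackets, and the difference matters: single brackets return a shorter list, while double brackets return the component itself. A small sketch on the same student list; the variable key is illustrative.

```r
student <- list(id = 12345, name = "Marko", marks = c(10, 9, 10, 9, 8, 10))

sub_list <- student["marks"]    # a list with one component
values   <- student[["marks"]]  # the numeric vector itself

class(sub_list)
class(values)

# Only the extracted vector can be used directly in computations
avg_mark <- mean(values)

# Double brackets also accept a name stored in a variable
key <- "name"
student[[key]]
```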
Data frames
Building a data frame
height <- c(159, 185, 183, 170, 174, 165, 173, 169, 173, 158)
weight <- c(45, 89, 70, 80, 62, 86, 50, 58, 72, 50)
gender <- factor(c("f","m","m","m","f","m","f","f","m","f"))
student <- c(T, T, F, F, T, T, F, F, F, T)
df <- data.frame(gender, height, weight, student)
df
## gender height weight student
## 1 f 159 45 TRUE
## 2 m 185 89 TRUE
## 3 m 183 70 FALSE
## 4 m 170 80 FALSE
## 5 f 174 62 TRUE
## 6 m 165 86 TRUE
## 7 f 173 50 FALSE
## 8 f 169 58 FALSE
## 9 m 173 72 FALSE
## 10 f 158 50 TRUE
Some useful functions
summary(df)
## gender height weight student
## f:5 Min. :158.0 Min. :45.0 Mode :logical
## m:5 1st Qu.:166.0 1st Qu.:52.0 FALSE:5
## Median :171.5 Median :66.0 TRUE :5
## Mean :170.9 Mean :66.2
## 3rd Qu.:173.8 3rd Qu.:78.0
## Max. :185.0 Max. :89.0
names(df)
## [1] "gender" "height" "weight" "student"
nrow(df)
## [1] 10
ncol(df)
## [1] 4
Accessing elements of a data frame
df[5,]
## gender height weight student
## 5 f 174 62 TRUE
df[1:5,]
## gender height weight student
## 1 f 159 45 TRUE
## 2 m 185 89 TRUE
## 3 m 183 70 FALSE
## 4 m 170 80 FALSE
## 5 f 174 62 TRUE
df[,1]
## [1] f m m m f m f f m f
## Levels: f m
df[,c(1,3,4)]
## gender weight student
## 1 f 45 TRUE
## 2 m 89 TRUE
## 3 m 70 FALSE
## 4 m 80 FALSE
## 5 f 62 TRUE
## 6 m 86 TRUE
## 7 f 50 FALSE
## 8 f 58 FALSE
## 9 m 72 FALSE
## 10 f 50 TRUE
df[1,-3]
## gender height student
## 1 f 159 TRUE
df$height
## [1] 159 185 183 170 174 165 173 169 173 158
df[df$height < 180,]
## gender height weight student
## 1 f 159 45 TRUE
## 4 m 170 80 FALSE
## 5 f 174 62 TRUE
## 6 m 165 86 TRUE
## 7 f 173 50 FALSE
## 8 f 169 58 FALSE
## 9 m 173 72 FALSE
## 10 f 158 50 TRUE
df[df$gender == "m",]
## gender height weight student
## 2 m 185 89 TRUE
## 3 m 183 70 FALSE
## 4 m 170 80 FALSE
## 6 m 165 86 TRUE
## 9 m 173 72 FALSE
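Row conditions can be combined with & and |, and the base function subset() offers a more compact notation for the same kind of selection. The sketch below rebuilds the data frame from above so that it runs on its own; the result names are illustrative.

```r
height  <- c(159, 185, 183, 170, 174, 165, 173, 169, 173, 158)
weight  <- c(45, 89, 70, 80, 62, 86, 50, 58, 72, 50)
gender  <- factor(c("f", "m", "m", "m", "f", "m", "f", "f", "m", "f"))
student <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
df <- data.frame(gender, height, weight, student)

# Rows that satisfy two conditions at once
tall_students <- df[df$height > 170 & df$student, ]

# subset() evaluates the condition inside the data frame,
# and select picks the columns to keep
light_women <- subset(df, gender == "f" & weight < 60,
                      select = c(height, weight))
```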
Adding a new column to a data frame
df <- cbind(df, age = c(20, 21, 30, 25, 27, 19, 24, 27, 28, 24))
df
## gender height weight student age
## 1 f 159 45 TRUE 20
## 2 m 185 89 TRUE 21
## 3 m 183 70 FALSE 30
## 4 m 170 80 FALSE 25
## 5 f 174 62 TRUE 27
## 6 m 165 86 TRUE 19
## 7 f 173 50 FALSE 24
## 8 f 169 58 FALSE 27
## 9 m 173 72 FALSE 28
## 10 f 158 50 TRUE 24
df$name = c("Joan","Tom","John","Mike","Anna","Bill","Tina","Beth","Steve","Kim")
df
## gender height weight student age name
## 1 f 159 45 TRUE 20 Joan
## 2 m 185 89 TRUE 21 Tom
## 3 m 183 70 FALSE 30 John
## 4 m 170 80 FALSE 25 Mike
## 5 f 174 62 TRUE 27 Anna
## 6 m 165 86 TRUE 19 Bill
## 7 f 173 50 FALSE 24 Tina
## 8 f 169 58 FALSE 27 Beth
## 9 m 173 72 FALSE 28 Steve
## 10 f 158 50 TRUE 24 Kim
summary(df)
## gender height weight student age
## f:5 Min. :158.0 Min. :45.0 Mode :logical Min. :19.00
## m:5 1st Qu.:166.0 1st Qu.:52.0 FALSE:5 1st Qu.:21.75
## Median :171.5 Median :66.0 TRUE :5 Median :24.50
## Mean :170.9 Mean :66.2 Mean :24.50
## 3rd Qu.:173.8 3rd Qu.:78.0 3rd Qu.:27.00
## Max. :185.0 Max. :89.0 Max. :30.00
## name
## Length:10
## Class :character
## Mode :character
##
##
##
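A new column can also be computed from existing ones, and columns can be summarised per group. The sketch below is illustrative only: the derived column bmi (body mass index) is not part of the original material, and the data frame is rebuilt so that the example runs on its own.

```r
height <- c(159, 185, 183, 170, 174, 165, 173, 169, 173, 158)
weight <- c(45, 89, 70, 80, 62, 86, 50, 58, 72, 50)
gender <- factor(c("f", "m", "m", "m", "f", "m", "f", "f", "m", "f"))
df <- data.frame(gender, height, weight)

# Derived column (hypothetical example): weight in kg divided by
# the squared height in metres
df$bmi <- df$weight / (df$height / 100)^2

# Mean height for each level of the factor gender
mean_height <- tapply(df$height, df$gender, mean)
mean_height
```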
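SUPERVISED LEARNING
Classification
Download the file "PM10_Class.csv" to a local folder. Set that folder as the working directory of the R environment, either with the "setwd" command or from the menu by clicking File -> Change dir... For example: setwd("c:/tecaj/data/") (in R strings, path separators are written as forward slashes or doubled backslashes).
The file "PM10_Class.csv" contains weather and air pollution data for the period from 2013 to 2016.
# stringsAsFactors = TRUE reads text columns as factors, as the output below assumes
# (this is no longer the default in R 4.0 and later)
origData <- read.csv("PM10_Class.csv", stringsAsFactors = TRUE)
summary(origData)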
## PM10 Date Glob_radiation_max Glob_radiation_mean
## HIGH:221 2013-01-01: 1 Min. : 0.0 Min. : 0.000
## LOW :990 2013-01-02: 1 1st Qu.: 16.0 1st Qu.: 2.112
## 2013-01-03: 1 Median :108.0 Median : 20.625
## 2013-01-04: 1 Mean :176.9 Mean : 37.431
## 2013-01-05: 1 3rd Qu.:337.0 3rd Qu.: 66.188
## 2013-01-06: 1 Max. :619.0 Max. :138.375
## (Other) :1205
## Glob_radiation_min Wind_speed_max Wind_speed_mean Wind_speed_min
## Min. :0 Min. :0.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:0 1st Qu.:0.90 1st Qu.:0.5375 1st Qu.:0.2000
## Median :0 Median :1.30 Median :0.7250 Median :0.3000
## Mean :0 Mean :1.54 Mean :0.8905 Mean :0.4064
## 3rd Qu.:0 3rd Qu.:1.90 3rd Qu.:1.0500 3rd Qu.:0.5000
## Max. :0 Max. :6.80 Max. :4.6125 Max. :4.2000
##
## Wind_gust_max Wind_gust_mean Wind_gust_min Precipitation_mean
## Min. : 0.000 Min. : 0.000 Min. :0.000 Min. :0.0000
## 1st Qu.: 2.300 1st Qu.: 1.519 1st Qu.:0.800 1st Qu.:0.0000
## Median : 2.800 Median : 1.900 Median :1.100 Median :0.0000
## Mean : 3.521 Mean : 2.259 Mean :1.277 Mean :0.1559
## 3rd Qu.: 4.100 3rd Qu.: 2.544 3rd Qu.:1.500 3rd Qu.:0.0000
## Max. :14.900 Max. :10.162 Max. :8.900 Max. :4.8375
##
## Precipitation_sum Pressure_max Pressure_mean Pressure_min
## Min. : 0.000 Min. : 951.9 Min. : 947.3 Min. : 942.1
## 1st Qu.: 0.000 1st Qu.: 978.8 1st Qu.: 977.9 1st Qu.: 977.3
## Median : 0.000 Median : 982.9 Median : 982.2 Median : 981.5
## Mean : 1.281 Mean : 982.9 Mean : 982.0 Mean : 981.3
## 3rd Qu.: 0.200 3rd Qu.: 986.9 3rd Qu.: 985.9 3rd Qu.: 985.4
## Max. :38.500 Max. :1004.0 Max. :1003.5 Max. :1003.1
##
## Humidity_max Humidity_mean Humidity_min Temp_1500_max
## Min. : 39.1 Min. :35.41 Min. :32.40 Min. :-14.300
## 1st Qu.: 86.3 1st Qu.:79.73 1st Qu.:69.90 1st Qu.: -0.600
## Median : 92.0 Median :87.84 Median :81.10 Median : 4.800
## Mean : 89.9 Mean :85.55 Mean :79.04 Mean : 4.634
## 3rd Qu.: 95.8 3rd Qu.:93.29 3rd Qu.:90.00 3rd Qu.: 9.650
## Max. :100.0 Max. :99.71 Max. :99.10 Max. : 21.300
##
## Temp_1500_mean Temp_1500_min Temp_site_max Temp_site_mean
## Min. :-14.750 Min. :-15.100 Min. :-7.90 Min. :-9.675
## 1st Qu.: -1.619 1st Qu.: -2.400 1st Qu.: 4.20 1st Qu.: 3.344
## Median : 3.725 Median : 2.900 Median :11.40 Median :10.150
## Mean : 3.466 Mean : 2.666 Mean :10.75 Mean : 9.312
## 3rd Qu.: 8.456 3rd Qu.: 7.600 3rd Qu.:17.00 3rd Qu.:15.081
## Max. : 19.475 Max. : 18.800 Max. :29.10 Max. :24.087
##
## Temp_site_min
## Min. :-10.800
## 1st Qu.: 2.600
## Median : 9.100
## Mean : 8.343
## 3rd Qu.: 14.100
## Max. : 22.100
##
Description of the data:
Attribute            Meaning
PM10                 Nominal attribute, daily concentration of particulate matter with a diameter of 10 µm
Date                 Time of the measurement in YYYY-MM-DD format
Glob_radiation_max   Continuous attribute, maximum global radiation between 0:00 and 7:00
Glob_radiation_mean  Continuous attribute, mean global radiation between 0:00 and 7:00
Glob_radiation_min   Continuous attribute, minimum global radiation between 0:00 and 7:00
Wind_speed_max       Continuous attribute, maximum wind speed between 0:00 and 7:00
Wind_speed_mean      Continuous attribute, mean wind speed between 0:00 and 7:00
Wind_speed_min       Continuous attribute, minimum wind speed between 0:00 and 7:00
Wind_gust_max        Continuous attribute, maximum wind gust speed between 0:00 and 7:00
Wind_gust_mean       Continuous attribute, mean wind gust speed between 0:00 and 7:00
Wind_gust_min        Continuous attribute, minimum wind gust speed between 0:00 and 7:00
Precipitation_mean   Continuous attribute, mean precipitation (per hour) between 0:00 and 7:00
Precipitation_sum    Continuous attribute, total precipitation between 0:00 and 7:00
Pressure_max         Continuous attribute, maximum air pressure between 0:00 and 7:00
Pressure_mean        Continuous attribute, mean air pressure between 0:00 and 7:00
Pressure_min         Continuous attribute, minimum air pressure between 0:00 and 7:00
Humidity_max         Continuous attribute, maximum air humidity between 0:00 and 7:00
Humidity_mean        Continuous attribute, mean air humidity between 0:00 and 7:00
Humidity_min         Continuous attribute, minimum air humidity between 0:00 and 7:00
Temp_1500_max        Continuous attribute, maximum air temperature at an altitude of 1500 m between 0:00 and 7:00
Temp_1500_mean       Continuous attribute, mean air temperature at an altitude of 1500 m between 0:00 and 7:00
Temp_1500_min        Continuous attribute, minimum air temperature at an altitude of 1500 m between 0:00 and 7:00
Temp_site_max        Continuous attribute, maximum air temperature at the measuring site between 0:00 and 7:00
Temp_site_mean       Continuous attribute, mean air temperature at the measuring site between 0:00 and 7:00
Temp_site_min        Continuous attribute, minimum air temperature at the measuring site between 0:00 and 7:00
Glob_radiation_min takes only a single value (0 in every row), so it carries no information and we do not need it.
origData$Glob_radiation_min <- NULL
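Constant columns can also be found programmatically instead of by reading the summary. The following is a minimal sketch on a small made-up data frame (toy and its columns are hypothetical), not the procedure used in the course material.

```r
# Hypothetical toy data: column x2 has the same value in every row
toy <- data.frame(x1 = c(1, 2, 3),
                  x2 = c(0, 0, 0),
                  x3 = c("a", "b", "a"))

# A column is constant if it contains exactly one distinct value
is_constant <- sapply(toy, function(col) length(unique(col)) == 1)
constant_names <- names(toy)[is_constant]
constant_names

# Keep only the informative columns
toy_clean <- toy[, !is_constant, drop = FALSE]
names(toy_clean)
```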
Exploring and visualizing the data
Number of measurements (rows) in our data set
nrow(origData)
## [1] 1211
Number of attributes (columns)
ncol(origData)
## [1] 24
Frequency of the individual classes
table(origData$PM10)
##
## HIGH LOW
## 221 990
tabPM10 <- table(origData$PM10)
tabPM10
##
## HIGH LOW
## 221 990
Bar chart
barplot(tabPM10,
        main="Bar chart of PM10 particle concentration",
        ylab="Number of measurements",
        xlab="PM10 particle concentration")
[Figure: bar chart "Bar chart of PM10 particle concentration"; x-axis: PM10 particle concentration (HIGH, LOW); y-axis: Number of measurements]
Pie chart
pie(tabPM10, main="Pie chart of PM10 particle concentration")
[Figure: pie chart "Pie chart of PM10 particle concentration" with slices HIGH and LOW]
Histogram
hist(origData$Humidity_mean,
     main="Histogram of mean air humidity",
     xlab="Mean air humidity",
     ylab="Number of measurements")
[Figure: histogram "Histogram of mean air humidity"; x-axis: Mean air humidity; y-axis: Number of measurements]
Box plot of the mean air temperature
boxplot(origData$Temp_site_mean, main="Mean air temperature", ylab="Temperature in °C")
[Figure: box plot "Mean air temperature"; y-axis: Temperature in °C]
Box plot of the mean air temperature for the different PM10 concentrations
boxplot(Temp_site_mean ~ PM10, origData, main="Box plot", xlab="PM10", ylab="Air temperature in °C")
[Figure: box plot "Box plot"; x-axis: PM10 (HIGH, LOW); y-axis: Air temperature in °C]
Structure of the data frame
str(origData)
## 'data.frame': 1211 obs. of 24 variables:
## $ PM10 : Factor w/ 2 levels "HIGH","LOW": 1 1 1 1 1 2 2 2 1 2 ...
## $ Date : Factor w/ 1211 levels "2013-01-01","2013-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Glob_radiation_max : num 1 1 3 7 6 2 2 2 7 2 ...
## $ Glob_radiation_mean: num 0.125 0.125 0.375 0.875 0.75 0.25 0.25 0.25 0.875 0.25 ...
## $ Wind_speed_max : num 0.9 1.1 1.2 1.3 1.3 0.9 1.1 0.9 1.1 2 ...
## $ Wind_speed_mean : num 0.65 0.675 0.738 0.887 1 ...
## $ Wind_speed_min : num 0.3 0.3 0.4 0.6 0.5 0.3 0.3 0.5 0.3 0.4 ...
## $ Wind_gust_max : num 2.1 2.9 2.6 3.2 4.5 2.3 2.8 2.3 2.4 4.2 ...
## $ Wind_gust_mean : num 1.61 1.81 1.73 2.21 2.85 ...
## $ Wind_gust_min : num 1.1 0.9 1 1.3 1.3 0.8 1.5 1.2 0.9 1.5 ...
## $ Precipitation_mean : num 0 0 0 0 0 0 0 0 0 0.075 ...
## $ Precipitation_sum : num 0 0 0 0 0 0 0 0 0 0.6 ...
## $ Pressure_max : num 986 984 999 995 987 ...
## $ Pressure_mean : num 984 983 998 994 986 ...
## $ Pressure_min : num 982 982 997 994 985 ...
## $ Humidity_max : num 96.1 82.9 94.5 94.2 95.6 92.2 97.6 79.9 86 94 ...
## $ Humidity_mean : num 95.8 79.2 93.6 93.3 94.3 ...
## $ Humidity_min : num 95.4 76.9 92.9 92.5 93.3 81.8 92.6 67.3 81.2 83.8 ...
## $ Temp_1500_max : num -0.4 -2.3 -4.3 3.3 3.5 0 0.4 -2.1 1.6 -3.4 ...
## $ Temp_1500_mean : num -1.69 -2.99 -4.53 2.96 2.99 ...
## $ Temp_1500_min : num -2.5 -4.3 -4.7 2.6 2.4 -0.7 -0.5 -4.2 1.1 -3.6 ...
## $ Temp_site_max : num -1.8 3.4 3.3 1.3 0.9 4.5 1.4 1.7 2.3 5.9 ...
## $ Temp_site_mean : num -2.138 3.2 2.875 0.688 0.588 ...
## $ Temp_site_min : num -2.7 2.9 2.3 0.1 0.3 2 0.2 0.6 1.7 4.3 ...
The date is currently represented as a nominal variable and is not useful for modeling. R has built-in support for representing calendar dates
date <- as.Date(origData$Date)
Split the data chronologically into a training set and a test set
sel <- date < "2016-1-1"
train <- origData[sel,]
test <- origData[!sel,]
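The chronological split can be tried out on a few rows to see exactly which dates end up on which side of the cutoff. The data frame toy below is made up for illustration; the cutoff date mirrors the one used above.

```r
# Hypothetical toy data around the cutoff date
toy <- data.frame(Date  = c("2015-12-30", "2015-12-31", "2016-01-01", "2016-01-02"),
                  value = c(1, 2, 3, 4))

d      <- as.Date(toy$Date)
cutoff <- as.Date("2016-01-01")

# Everything strictly before the cutoff is used for training
is_train  <- d < cutoff
toy_train <- toy[is_train, ]
toy_test  <- toy[!is_train, ]
```

Splitting by time rather than at random keeps all test examples later than all training examples, which matches how such a model would be used in practice.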
Majority classifier
The majority class is the class with the most training examples
table(train$PM10) / length(train$PM10)
##
## HIGH LOW
## 0.187067 0.812933
Classification accuracy achieved by the trivial theory (every example is classified into the majority class). The majority class is LOW, so the accuracy equals the proportion of LOW examples in the test set, 0.829
table(test$PM10) / length(test$PM10)
##
## HIGH LOW
## 0.1710145 0.8289855
The accuracy of the majority classifier sets the lower bound on the accuracy of useful models!
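The majority classifier can be written as a tiny function. This is a sketch on made-up class labels (train_y and test_y are hypothetical), with the function name majority_class chosen for illustration.

```r
# Returns the most frequent class label in y
majority_class <- function(y) names(which.max(table(y)))

# Hypothetical class labels
train_y <- factor(c("LOW", "LOW", "HIGH", "LOW"), levels = c("HIGH", "LOW"))
test_y  <- factor(c("LOW", "HIGH", "LOW", "LOW", "HIGH"), levels = c("HIGH", "LOW"))

# The class is learned from the training labels only
maj <- majority_class(train_y)

# Baseline accuracy: share of test examples that belong to that class
baseline_acc <- mean(test_y == maj)
baseline_acc
```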
Decision tree
Decision tree learning is implemented in the rpart library. The library is part of the standard R distribution and does not need to be installed. Load it with the "library()" command
library(rpart)
Learning a decision tree
treeModel <- rpart(PM10 ~ ., train, usesurrogate=0)
treeModel
## n= 866
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 866 162 LOW (0.1870670 0.8129330)
##   2) Date=2013-01-01,2013-01-02,2013-01-03,[...],2015-08-06,2015-08-07 162 0 HIGH (1.0000000 0.0000000) *
##   3) Date=2013-01-06,2013-01-07,2013-01-08,[...],2015-08-16,2015-08-17 704 0 LOW (0.0000000 1.0000000) *
(Each node lists every training date assigned to it; the date lists are abbreviated here with [...].)
To draw the tree we use the rpart.plot function from the library of the same name, which must be installed first.
Libraries are installed with the "install.packages()" command. For example: install.packages("rpart.plot"). Once the library is installed, load it with the "library()" command.
library(rpart.plot)
rpart.plot(treeModel)
[Figure: decision tree with a single split on Date; root node: LOW 0.81, 100%; leaves: HIGH 0.00, 19% and LOW 1.00, 81%]
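The date (in its current form) is a misleading attribute, so the model is useless. Every date occurs only once, so a split on Date merely memorizes the training examples; on the test set the tree predicts LOW for every example and only matches the majority classifier
pred <- predict(treeModel, test, type="class")
obs <- test$PM10
table(obs, pred)
## pred
## obs HIGH LOW
## HIGH 0 59
## LOW 0 286
tab <- table(obs, pred)
Classification accuracy: the proportion of correctly classified examples
sum(diag(tab))/sum(tab)
## [1] 0.8289855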
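Accuracy alone hides the fact that this model never predicts HIGH. A per-class view makes that visible. The sketch below re-enters the confusion table shown above by hand; the per-class measure (often called recall or sensitivity) is an addition for illustration and is not part of the original material.

```r
# Confusion table from above: rows are observed classes, columns predictions
tab <- matrix(c(0, 0, 59, 286), nrow = 2,
              dimnames = list(obs = c("HIGH", "LOW"), pred = c("HIGH", "LOW")))

# Overall accuracy: correctly classified examples lie on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)

# Share of each observed class that was classified correctly
per_class <- diag(tab) / rowSums(tab)
per_class
```

None of the 59 HIGH days is recognised, even though the overall accuracy is about 0.83.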
The date can be converted into a numeric attribute
dayOfYear <- as.numeric(format(date,"%j"))
summary(dayOfYear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 85.0 165.0 171.8 254.5 365.0
myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear
train <- myData[sel,]
test <- myData[!sel,]
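The "%j" format used for dayOfYear returns the day of the year as a three-digit string, which as.numeric() then turns into a number. It can be checked on a few hand-picked dates; the dates and the extra month_num variable below are chosen for illustration only.

```r
d <- as.Date(c("2013-01-01", "2015-12-31", "2016-03-01"))

# "%j" gives "001", "365", "061"; as.numeric() drops the leading zeros
day_of_year <- as.numeric(format(d, "%j"))
day_of_year

# Other parts of a date can be extracted in the same way, e.g. the month
month_num <- as.numeric(format(d, "%m"))
month_num
```

Unlike the raw date, such a numeric attribute repeats every year, so values seen during training also occur in later data.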
treeModel <- rpart(PM10 ~ ., train) treeModel
## n= 866
#### node), split, n, loss, yval, (yprob)
## * denotes terminal node
#### 1) root 866 162 LOW (0.18706697 0.81293303)
## 2) Temp_site_mean< 3.58125 223 108 HIGH (0.51569507 0.48430493)
## 4) Wind_gust_mean< 2.33125 156 52 HIGH (0.66666667 0.33333333)
## 8) Pressure_max>=991.75 40 3 HIGH (0.92500000 0.07500000) *
## 9) Pressure_max< 991.75 116 49 HIGH (0.57758621 0.42241379)
## 18) Temp_site_mean< 0.51875 60 18 HIGH (0.70000000 0.30000000)
## 36) Temp_1500_mean>=-9.40625 48 10 HIGH (0.79166667 0.20833333)
## 72) Precipitation_sum< 1.15 38 4 HIGH (0.89473684 0.10526316) *
## 73) Precipitation_sum>=1.15 10 4 LOW (0.40000000 0.60000000) *
## 37) Temp_1500_mean< -9.40625 12 4 LOW (0.33333333 0.66666667) *
## 19) Temp_site_mean>=0.51875 56 25 LOW (0.44642857 0.55357143)
## 38) Wind_gust_min< 1.15 40 17 HIGH (0.57500000 0.42500000)
## 76) Temp_1500_mean>=-4.19375 26 7 HIGH (0.73076923 0.26923077) *
## 77) Temp_1500_mean< -4.19375 14 4 LOW (0.28571429 0.71428571) *
## 39) Wind_gust_min>=1.15 16 2 LOW (0.12500000 0.87500000) *
## 5) Wind_gust_mean>=2.33125 67 11 LOW (0.16417910 0.83582090) *
## 3) Temp_site_mean>=3.58125 643 47 LOW (0.07309487 0.92690513)
## 6) Temp_site_max< 8.45 122 25 LOW (0.20491803 0.79508197)
## 12) Wind_gust_mean< 1.66875 28 14 HIGH (0.50000000 0.50000000)
## 24) Wind_gust_min>=0.85 16 5 HIGH (0.68750000 0.31250000) *
## 25) Wind_gust_min< 0.85 12 3 LOW (0.25000000 0.75000000) *
## 13) Wind_gust_mean>=1.66875 94 11 LOW (0.11702128 0.88297872)
## 26) Glob_radiation_max>=70.5 27 9 LOW (0.33333333 0.66666667)
## 52) dayOfYear< 80.5 10 2 HIGH (0.80000000 0.20000000) *
## 53) dayOfYear>=80.5 17 1 LOW (0.05882353 0.94117647) *
## 27) Glob_radiation_max< 70.5 67 2 LOW (0.02985075 0.97014925) *
## 7) Temp_site_max>=8.45 521 22 LOW (0.04222649 0.95777351) *
rpart.plot(treeModel)
[Figure: plot of the decision tree printed above. The root node (LOW, 0.81, 100%) splits on Temp_site_mean < 3.6; further splits use Wind_gust_mean, Pressure_max, Temp_1500_mean, Precipitation_sum, Wind_gust_min, Temp_site_max, Glob_radiation_max and dayOfYear.]
pred <- predict(treeModel, test, type="class")
obs <- test$PM10
Let us compute the classification accuracy of the decision tree.
table(obs, pred)
## pred
## obs HIGH LOW
## HIGH 44 15
## LOW 22 264
tab <- table(obs, pred)
sum(diag(tab))/sum(tab)
## [1] 0.8927536
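A function for computing classification accuracy:
CA <- function(observed, predicted) {
  # confusion matrix: rows are observed classes, columns are predicted classes
  tab <- table(observed, predicted)
  # correct predictions lie on the diagonal
  sum(diag(tab))/sum(tab)
}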
CA(obs, pred)
## [1] 0.8927536
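As a quick check of the numbers above, the accuracy can be recomputed directly from the two confusion matrices printed in this section. This is only an illustrative sketch: the names tab_date, tab_doy and accuracy are assumptions introduced here, and the counts are copied by hand from the outputs above.

```r
# Confusion matrices copied from the outputs above
# (rows = observed class, columns = predicted class).
classes <- c("HIGH", "LOW")
tab_date <- matrix(c(0,  59,
                     0, 286), nrow = 2, byrow = TRUE,
                   dimnames = list(obs = classes, pred = classes))
tab_doy <- matrix(c(44,  15,
                    22, 264), nrow = 2, byrow = TRUE,
                  dimnames = list(obs = classes, pred = classes))

# Hypothetical helper: share of the diagonal (correct predictions).
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

acc_date <- accuracy(tab_date)  # 286 / 345, tree built on the raw Date
acc_doy  <- accuracy(tab_doy)   # 308 / 345, tree built with dayOfYear
print(c(acc_date, acc_doy))
```

The tree built on the raw Date predicts LOW for every test example, so its accuracy is simply the share of LOW examples in the test set. Any useful classifier should exceed that value.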
Random forest
The randomForest library must first be installed with the command install.packages("randomForest").
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
rfModel <- randomForest(PM10 ~ ., train)
pred <- predict(rfModel, test, type="class")
obs <- test$PM10
table(obs, pred)
## pred
## obs HIGH LOW
## HIGH 45 14
## LOW 20 266
CA(obs, pred)
## [1] 0.9014493
Artificial neural network
We will use the nnet library.
library(nnet)
We scale the continuous attributes to the interval [0,1].
First we find the range of values of each attribute
max_train <- apply(train[,-1], 2, max)
min_train <- apply(train[,-1], 2, min)
and then scale the data.
train_scaled <- scale(train[,-1], center = min_train, scale = max_train - min_train)
train_scaled <- data.frame(train_scaled)
train_scaled$PM10 <- train$PM10
All attribute values in the training set now lie in the interval [0,1].
summary(train_scaled)
## Glob_radiation_max Glob_radiation_mean Wind_speed_max Wind_speed_mean
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.02585 1st Qu.:0.01536 1st Qu.:0.1324 1st Qu.:0.1192
## Median :0.17690 Median :0.14860 Median :0.1912 Median :0.1572
## Mean :0.28845 Mean :0.27497 Mean :0.2310 Mean :0.1947
## 3rd Qu.:0.55856 3rd Qu.:0.51310 3rd Qu.:0.2794 3rd Qu.:0.2304
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## Wind_speed_min Wind_gust_max Wind_gust_mean Wind_gust_min
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.04762 1st Qu.:0.09353 1st Qu.:0.1432 1st Qu.:0.1011
## Median :0.07143 Median :0.13309 Median :0.1793 Median :0.1236
## Mean :0.09648 Mean :0.18501 Mean :0.2154 Mean :0.1444
## 3rd Qu.:0.11905 3rd Qu.:0.22302 3rd Qu.:0.2450 3rd Qu.:0.1685
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## Precipitation_mean Precipitation_sum Pressure_max Pressure_mean
## Min. :0.00000 Min. :0.000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.5214 1st Qu.:0.5577
## Median :0.00000 Median :0.000000 Median :0.5973 Median :0.6333
## Mean :0.03204 Mean :0.033421 Mean :0.5956 Mean :0.6272
## 3rd Qu.:0.00000 3rd Qu.:0.005195 3rd Qu.:0.6707 3rd Qu.:0.6998
## Max. :1.00000 Max. :1.000000 Max. :1.0000 Max. :1.0000
## Pressure_min Humidity_max Humidity_mean Humidity_min
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5914 1st Qu.:0.7750 1st Qu.:0.6857 1st Qu.:0.5577
## Median :0.6616 Median :0.8736 Median :0.8189 Median :0.7324
## Mean :0.6573 Mean :0.8366 Mean :0.7811 Mean :0.6999
## 3rd Qu.:0.7259 3rd Qu.:0.9388 3rd Qu.:0.9080 3rd Qu.:0.8711
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Temp_1500_max Temp_1500_mean Temp_1500_min Temp_site_max
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.3848 1st Qu.:0.3793 1st Qu.:0.3724 1st Qu.:0.3297
## Median :0.5365 Median :0.5385 Median :0.5310 Median :0.5297
## Mean :0.5286 Mean :0.5284 Mean :0.5212 Mean :0.5070
## 3rd Qu.:0.6713 3rd Qu.:0.6766 3rd Qu.:0.6696 3rd Qu.:0.6716
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Temp_site_mean Temp_site_min dayOfYear PM10
## Min. :0.0000 Min. :0.0000 Min. :0.0000 HIGH:162
## 1st Qu.:0.3899 1st Qu.:0.4141 1st Qu.:0.2170 LOW :704
## Median :0.5894 Median :0.6094 Median :0.4272
## Mean :0.5649 Mean :0.5842 Mean :0.4497
## 3rd Qu.:0.7315 3rd Qu.:0.7538 3rd Qu.:0.6593
## Max. :1.0000 Max. :1.0000 Max. :1.0000
The test set must also be scaled using the value ranges from the training set!
test_scaled <- scale(test[,-1], center = min_train, scale = max_train - min_train)
test_scaled <- data.frame(test_scaled)
test_scaled$PM10 <- test$PM10
In the test set, the values will not necessarily all lie in the interval [0,1]!
summary(test_scaled)
## Glob_radiation_max Glob_radiation_mean Wind_speed_max Wind_speed_mean
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.02585 1st Qu.:0.01445 1st Qu.:0.1324 1st Qu.:0.1165
## Median :0.17286 Median :0.15086 Median :0.1765 Median :0.1518
## Mean :0.27898 Mean :0.25930 Mean :0.2152 Mean :0.1891
## 3rd Qu.:0.48465 3rd Qu.:0.43306 3rd Qu.:0.2647 3rd Qu.:0.2168
## Max. :0.90436 Max. :0.99729 Max. :0.7500 Max. :0.9350
## Wind_speed_min Wind_gust_max Wind_gust_mean Wind_gust_min
## Min. :0.00000 Min. :-0.07194 Min. :-0.01245 Min. :0.00000
## 1st Qu.:0.04762 1st Qu.: 0.07914 1st Qu.: 0.13201 1st Qu.:0.08989
## Median :0.07143 Median : 0.12230 Median : 0.16936 Median :0.12360
## Mean :0.09752 Mean : 0.17233 Mean : 0.20537 Mean :0.14112
## 3rd Qu.:0.11905 3rd Qu.: 0.20863 3rd Qu.: 0.22540 3rd Qu.:0.16854
## Max. :0.76190 Max. : 0.76259 Max. : 0.84309 Max. :0.78652
## Precipitation_mean Precipitation_sum Pressure_max Pressure_mean
## Min. :0.00000 Min. :0.000000 Min. :0.2374 Min. :0.2833
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.5311 1st Qu.:0.5676
## Median :0.00000 Median :0.000000 Median :0.6089 Median :0.6457
## Mean :0.03266 Mean :0.032908 Mean :0.6211 Mean :0.6512
## 3rd Qu.:0.00000 3rd Qu.:0.005195 3rd Qu.:0.7062 3rd Qu.:0.7263
## Max. :0.94057 Max. :0.935065 Max. :1.0136 Max. :1.0274
## Pressure_min Humidity_max Humidity_mean Humidity_min
## Min. :0.3232 Min. :0.1133 Min. :0.1133 Min. :0.1034
## 1st Qu.:0.6058 1st Qu.:0.7750 1st Qu.:0.7041 1st Qu.:0.5787
## Median :0.6768 Median :0.8571 Median :0.8076 Median :0.7271
## Mean :0.6797 Mean :0.8280 Mean :0.7762 Mean :0.6975
## 3rd Qu.:0.7496 3rd Qu.:0.9080 3rd Qu.:0.8762 3rd Qu.:0.8426
## Max. :1.0321 Max. :0.9918 Max. :0.9893 Max. :0.9955
## Temp_1500_max Temp_1500_mean Temp_1500_min Temp_site_max
## Min. :0.08708 Min. :0.0756 Min. :0.0413 Min. :0.04054
## 1st Qu.:0.39888 1st Qu.:0.4036 1st Qu.:0.3953 1st Qu.:0.32162
## Median :0.53371 Median :0.5431 Median :0.5310 Median :0.51622
## Mean :0.54015 Mean :0.5418 Mean :0.5312 Mean :0.49692
## 3rd Qu.:0.68258 3rd Qu.:0.6841 3rd Qu.:0.6696 3rd Qu.:0.67568
## Max. :0.94101 Max. :0.9397 Max. :0.9233 Max. :0.89459
## Temp_site_mean Temp_site_min dayOfYear PM10
## Min. :0.07071 Min. :0.08207 Min. :0.01648 HIGH: 59
## 1st Qu.:0.37764 1st Qu.:0.39514 1st Qu.:0.27747 LOW :286
## Median :0.58386 Median :0.59574 Median :0.51923
## Mean :0.55604 Mean :0.57599 Mean :0.51866
## 3rd Qu.:0.74158 3rd Qu.:0.76292 3rd Qu.:0.76374
## Max. :0.96372 Max. :0.98176 Max. :1.00000
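The effect seen in the two summaries (training values exactly in [0,1], test values slightly outside) can be reproduced on a tiny example. The sketch below uses hypothetical toy data; the names train_toy, test_toy, min_toy and max_toy are assumptions introduced only for illustration.

```r
# Hypothetical toy data: a single attribute x.
train_toy <- data.frame(x = c(2, 4, 6, 10))
test_toy  <- data.frame(x = c(1, 5, 12))

# The value range is estimated on the training set only.
min_toy <- apply(train_toy, 2, min)
max_toy <- apply(train_toy, 2, max)

# Both sets are scaled with the training minimum and range.
train_toy_scaled <- scale(train_toy, center = min_toy, scale = max_toy - min_toy)
test_toy_scaled  <- scale(test_toy,  center = min_toy, scale = max_toy - min_toy)

print(range(train_toy_scaled))  # the training set spans exactly 0 to 1
print(range(test_toy_scaled))   # test values 1 and 12 fall outside [0, 1]
```

Reusing the training-set range keeps the same transformation for both sets, so the model sees test attributes on the scale it was trained on.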
For reproducible results, we can set the seed of the random number generator.
set.seed(7675353)
Training and evaluating the neural network
nnModel <- nnet(PM10 ~ ., train_scaled, size=5, maxit=1000, trace=FALSE)
pred <- predict(nnModel, test_scaled, type="class")
obs <- test$PM10
table(obs, pred)
## pred
## obs HIGH LOW
## HIGH 45 14
## LOW 22 264
CA(obs, pred)
## [1] 0.8956522
The forecast "today will be the same as yesterday"
pred <- test$PM10[-length(test$PM10)]
obs <- test$PM10[-1]
table(obs, pred)
## pred
## obs HIGH LOW
## HIGH 41 17
## LOW 17 269
CA(obs, pred)
## [1] 0.9011628
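The index shifting used for this baseline can be traced on a short sequence. The sketch below is a minimal illustration with hypothetical data (the names days_toy, pred_toy, obs_toy are assumptions); it also assumes that the rows are in chronological order and represent consecutive days, which the real data set may not fully guarantee.

```r
# Hypothetical sequence of daily classes in chronological order.
days_toy <- factor(c("LOW", "LOW", "HIGH", "HIGH", "LOW", "LOW"),
                   levels = c("HIGH", "LOW"))

pred_toy <- days_toy[-length(days_toy)]  # yesterday's class (last day dropped)
obs_toy  <- days_toy[-1]                 # today's class (first day dropped)

# Each prediction is paired with the class observed one day later.
tab_toy <- table(obs_toy, pred_toy)
acc_toy <- sum(diag(tab_toy)) / sum(tab_toy)
print(tab_toy)
print(acc_toy)  # 3 of the 5 day pairs agree
```

Because one day is lost at each end of the shift, the baseline is evaluated on one example fewer than the other models (344 instead of 345 in the table above).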
Probabilistic prediction
When classifying a new example, a classifier can return a probability distribution over all classes instead of a single class.
Probabilistic predictions of the decision tree
predMat <- predict(treeModel, test, type="prob")
head(predMat)
## HIGH LOW
## 867 0.89473684 0.1052632
## 868 0.89473684 0.1052632
## 869 0.73076923 0.2692308
## 870 0.12500000 0.8750000
## 871 0.04222649 0.9577735
## 872 0.02985075 0.9701493
The actual classes of the test examples
obsMat <- class.ind(test$PM10)
head(obsMat)
## HIGH LOW
## [1,] 1 0
## [2,] 1 0
## [3,] 1 0
## [4,] 1 0
## [5,] 0 1
## [6,] 0 1
We use the Brier score to assess the quality of probabilistic predictions (lower is better).
BrierScore <- function(observedMat, predictedMat)
{
  # squared differences summed over classes, averaged over examples
  sum((observedMat - predictedMat) ^ 2) / nrow(predictedMat)
}
BrierScore(obsMat, predMat)
## [1] 0.1721226
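To see what the Brier score measures, it helps to evaluate it by hand on two examples. The sketch below repeats the definition given above under the hypothetical name brier_toy and applies it to made-up matrices (obs_b, pred_b are assumptions, not data from the course).

```r
# Same definition as BrierScore above, repeated so the sketch is self-contained.
brier_toy <- function(observedMat, predictedMat) {
  sum((observedMat - predictedMat)^2) / nrow(predictedMat)
}

# Two hypothetical examples: the first is HIGH, the second is LOW.
obs_b <- matrix(c(1, 0,
                  0, 1), nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("HIGH", "LOW")))
pred_b <- matrix(c(0.8, 0.2,
                   0.4, 0.6), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("HIGH", "LOW")))

score_b       <- brier_toy(obs_b, pred_b)     # (0.04 + 0.04 + 0.16 + 0.16) / 2
score_perfect <- brier_toy(obs_b, obs_b)      # perfect predictions
score_worst   <- brier_toy(obs_b, 1 - obs_b)  # fully confident and always wrong
print(c(score_b, score_perfect, score_worst))
```

With two classes the score as defined here ranges from 0 (perfect) to 2 (confidently wrong on every example).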
Probabilistic predictions of the random forest
predMat <- predict(rfModel, test, type="prob")
BrierScore(obsMat, predMat)
## [1] 0.1489673
Probabilistic predictions of the neural network
pred <- predict(nnModel, test_scaled, type="raw")
head(pred)
## [,1]
## 867 0.6767817
## 868 0.3650151
## 869 0.1138574
## 870 0.9279571
## 871 0.9983989
## 872 0.9948844
For binary classification, the nnet model returns predictions for one class only (here the second class, LOW). Let us add the prediction for the other class.
predMat <- cbind(1-pred, pred)
head(predMat)
## [,1] [,2]
## 867 0.323218300 0.6767817
## 868 0.634984910 0.3650151
## 869 0.886142588 0.1138574
## 870 0.072042923 0.9279571
## 871 0.001601129 0.9983989
## 872 0.005115583 0.9948844
BrierScore(obsMat, predMat)
## [1] 0.1656211
A model that always predicts the prior class distribution
p0 <- table(train$PM10)/nrow(train)
p0
##
## HIGH LOW
## 0.187067 0.812933
p0Mat <- matrix(rep(p0, times=nrow(test)), nrow = nrow(test), byrow=TRUE)
colnames(p0Mat) <- names(p0)
head(p0Mat)
## HIGH LOW
## [1,] 0.187067 0.812933
## [2,] 0.187067 0.812933
## [3,] 0.187067 0.812933
## [4,] 0.187067 0.812933
## [5,] 0.187067 0.812933
## [6,] 0.187067 0.812933
BrierScore(obsMat, p0Mat)
## [1] 0.2840524
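The Brier score of the prior model can be verified without the data, because every test example of a given class contributes the same loss. The sketch below uses only numbers printed above (the prior probabilities and the test class counts 59 HIGH and 286 LOW); the variable names are assumptions introduced for illustration.

```r
# Prior probabilities and test class counts copied from the outputs above.
p0_toy <- c(HIGH = 0.187067, LOW = 0.812933)
n_high <- 59
n_low  <- 286

# Loss of one example: squared error summed over both classes.
loss_high <- unname((1 - p0_toy["HIGH"])^2 + (0 - p0_toy["LOW"])^2)
loss_low  <- unname((0 - p0_toy["HIGH"])^2 + (1 - p0_toy["LOW"])^2)

# Average loss over the whole test set.
brier_prior <- (n_high * loss_high + n_low * loss_low) / (n_high + n_low)
print(brier_prior)
```

The result agrees with the value returned by BrierScore(obsMat, p0Mat) up to the rounding of the printed probabilities, and it serves as the reference that the tree, forest and neural network scores should beat.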
Regression
Download the file "PM10_Reg.csv" to a local folder. Set this folder as the working directory of the R environment with the "setwd" command or from the menu by clicking File -> Change dir... For example: setwd("c:/tecaj/data/").
The file "PM10_Reg.csv" contains the same data as "PM10_Class.csv", except that the PM10 attribute is continuous.
origData <- read.csv("PM10_Reg.csv")
summary(origData$PM10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.80 14.40 20.30 25.11 30.25 114.90
hist(origData$PM10)
[Figure: histogram of origData$PM10 (x-axis: origData$PM10, y-axis: Frequency).]
boxplot(origData$PM10)
[Figure: box plot of origData$PM10.]
plot(date, origData$PM10)
[Figure: scatter plot of origData$PM10 against date (2013 to 2017).]
plot(dayOfYear, origData$PM10)
[Figure: scatter plot of origData$PM10 against dayOfYear.]
We remove a useless attribute and compute the day of the year for every example.
origData$Glob_radiation_min <- NULL
date <- as.Date(origData$Date)
dayOfYear <- as.numeric(format(date,"%j"))
We split the data chronologically into a training and a test set.
sel <- date < "2016-1-1"
train <- origData[sel,]
test <- origData[!sel,]
The "mean value" prediction
pred <- mean(train$PM10)
obs <- test$PM10
We assess the quality of the predictions with the mean absolute error (the average absolute difference between the predicted and the measured value).
mean(abs(obs-pred))
## [1] 12.37992
The predictions can also be assessed with the mean squared error.
mean((obs-pred)^2)
## [1] 311.9037
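Measures for evaluating learning in regression
# mean absolute error
mae <- function(obs, pred) {
  mean(abs(obs - pred))
}
# mean squared error
mse <- function(obs, pred) {
  mean((obs - pred)^2)
}
# relative mean absolute error (relative to always predicting mean.val)
rmae <- function(obs, pred, mean.val) {
  sum(abs(obs - pred)) / sum(abs(obs - mean.val))
}
# relative mean squared error (relative to always predicting mean.val; not a root)
rmse <- function(obs, pred, mean.val) {
  sum((obs - pred)^2)/sum((obs - mean.val)^2)
}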
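A small worked example may make the four measures easier to read. The sketch below repeats the definitions given above so that it runs on its own, and applies them to hypothetical values (obs_toy, pred_toy and mean_toy are assumptions, not course data).

```r
# Definitions repeated from above so the sketch is self-contained.
mae  <- function(obs, pred) mean(abs(obs - pred))
mse  <- function(obs, pred) mean((obs - pred)^2)
rmae <- function(obs, pred, mean.val) sum(abs(obs - pred)) / sum(abs(obs - mean.val))
rmse <- function(obs, pred, mean.val) sum((obs - pred)^2) / sum((obs - mean.val)^2)

# Hypothetical measured and predicted values; mean_toy plays the role of
# the training-set mean used as the reference prediction.
obs_toy  <- c(10, 20, 30)
pred_toy <- c(12, 18, 33)
mean_toy <- 20

mae_toy  <- mae(obs_toy, pred_toy)             # (2 + 2 + 3) / 3
mse_toy  <- mse(obs_toy, pred_toy)             # (4 + 4 + 9) / 3
rmae_toy <- rmae(obs_toy, pred_toy, mean_toy)  # 7 / 20
rmse_toy <- rmse(obs_toy, pred_toy, mean_toy)  # 17 / 200
print(c(mae_toy, mse_toy, rmae_toy, rmse_toy))
```

For the relative measures, a value of 1 means the model is no better than always predicting mean.val, and values below 1 indicate an improvement over that reference.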
Examples of regression models
We prepare the training and test sets.
myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear
train <- myData[sel,]
test <- myData[!sel,]
Linear model
linModel <- lm(PM10 ~ ., train)
linModel
##
## Call:
## lm(formula = PM10 ~ ., data = train)
##
## Coefficients:
## (Intercept) Glob_radiation_max Glob_radiation_mean
## -97.847517 -0.020467 0.043646
## Wind_speed_max Wind_speed_mean Wind_speed_min
## -1.930987 4.266583 -0.730581
## Wind_gust_max Wind_gust_mean Wind_gust_min
## -0.052030 -2.889754 -0.501599
## Precipitation_mean Precipitation_sum Pressure_max
## -2.300669 -0.017124 1.496791
## Pressure_mean Pressure_min Humidity_max
## -3.008344 1.695056 -0.391854
## Humidity_mean Humidity_min Temp_1500_max
## -0.109261 0.161037 -0.927552
## Temp_1500_mean Temp_1500_min Temp_site_max
## 5.632819 -2.712942 1.023707
## Temp_site_mean Temp_site_min dayOfYear
## -3.302149 -0.335961 -0.006636
pred <- predict(linModel, test)
obs <- test$PM10
mae(obs, pred)
## [1] 8.421626
mse(obs, pred)
## [1] 118.5413
rmae(obs, pred, mean(train$PM10))
## [1] 0.6802648
rmse(obs, pred, mean(train$PM10))
## [1] 0.3800574
Regression tree
library(rpart)
treeModel <- rpart(PM10 ~ ., train)
treeModel
## n= 866
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 866 222450.600 25.10566
## 2) Temp_site_max>=2.35 719 82378.800 21.02768
## 4) Temp_site_max>=7.55 545 39384.000 18.79541
## 8) Precipitation_mean>=0.03125 102 2920.188 11.91471 *
## 9) Precipitation_mean< 0.03125 443 30522.820 20.37968
## 18) Temp_1500_max< 15.45 389 25996.380 19.45964 *
## 19) Temp_1500_max>=15.45 54 1825.117 27.00741 *
## 5) Temp_site_max< 7.55 174 31772.870 28.01954
## 10) Wind_gust_mean>=2.03125 89 9681.488 22.06292 *
## 11) Wind_gust_mean< 2.03125 85 15627.110 34.25647
## 22) Pressure_max< 989.05 54 5752.155 28.95185 *
## 23) Pressure_max>=989.05 31 5708.570 43.49677
## 46) Temp_site_min>=3.05 17 1676.382 35.54706 *
## 47) Temp_site_min< 3.05 14 1653.235 53.15000 *
## 3) Temp_site_max< 2.35 147 69631.750 45.05170
## 6) Temp_1500_max< -3.85 76 17438.100 35.86316
## 12) Wind_gust_mean>=2.59375 23 4166.877 26.14348 *
## 13) Wind_gust_mean< 2.59375 53 10155.420 40.08113 *
## 7) Temp_1500_max>=-3.85 71 38908.520 54.88732
## 14) Temp_site_min>=-1.45 40 10627.410 42.93500 *
## 15) Temp_site_min< -1.45 31 15193.470 70.30968
## 30) Temp_site_min>=-3.5 16 5565.690 60.17500 *
## 31) Temp_site_min< -3.5 15 6231.444 81.12000 *
rpart.plot(treeModel)
[Figure: plot of the regression tree printed above. The root node (mean PM10 of 25, 100% of the training examples) splits on Temp_site_max >= 2.4; each node shows the mean PM10 and the percentage of training examples it covers.]
pred <- predict(treeModel, test)
obs <- test$PM10
mae(obs, pred)
## [1] 8.343685
mse(obs, pred)
## [1] 133.5494
rmae(obs, pred, mean(train$PM10))
## [1] 0.6739691
rmse(obs, pred, mean(train$PM10))
## [1] 0.4281752
Adding new attributes
The result of learning can be improved by adding new, informative attributes.
Example of a new attribute: "heating season"
plot(train$dayOfYear, train$PM10)
abline(v=124, col="red")
abline(v=273, col="red")
[Figure: scatter plot of train$PM10 against train$dayOfYear, with red vertical lines at days 124 and 273.]
as.numeric(format(as.Date("2016-5-3"),"%j"))
## [1] 124
as.numeric(format(as.Date("2016-9-29"),"%j"))
## [1] 273
heatingSeason <- dayOfYear <= 124 | dayOfYear >= 273
myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear
myData$heating <- heatingSeason
train <- myData[sel,]
test <- myData[!sel,]
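The heating-season rule above rests on two day-of-year thresholds that were derived from dates in 2016, a leap year. The sketch below checks those thresholds and applies the same rule to a few hypothetical days; the helper doy and the vectors days_toy and heating_toy are assumptions introduced for illustration.

```r
# Hypothetical helper: day of the year for a date given as text.
doy <- function(d) as.numeric(format(as.Date(d), "%j"))

# Boundary dates in a leap year (2016) and in a non-leap year (2015).
doy_2016 <- c(doy("2016-05-03"), doy("2016-09-29"))
doy_2015 <- c(doy("2015-05-03"), doy("2015-09-29"))
print(doy_2016)  # 124 273
print(doy_2015)  # 123 272

# The same rule as heatingSeason above, applied to a few sample days.
days_toy <- c(1, 124, 125, 272, 273, 365)
heating_toy <- days_toy <= 124 | days_toy >= 273
print(heating_toy)
```

In non-leap years the same calendar dates fall one day earlier in the count, so the fixed thresholds shift the season boundary by a day. For a coarse attribute like this one the difference is probably negligible.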
treeModel <- rpart(PM10 ~ ., train)
treeModel
## n= 866
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 866 222450.600 25.10566
## 2) Temp_site_max>=2.35 719 82378.800 21.02768
## 4) Temp_site_max>=7.55 545 39384.000 18.79541
## 8) Precipitation_mean>=0.03125 102 2920.188 11.91471 *
## 9) Precipitation_mean< 0.03125 443 30522.820 20.37968
## 18) Temp_1500_max< 15.45 389 25996.380 19.45964
## 36) heating< 0.5 243 8447.655 17.19218 *
## 37) heating>=0.5 146 14219.970 23.23356 *
## 19) Temp_1500_max>=15.45 54 1825.117 27.00741 *
## 5) Temp_site_max< 7.55 174 31772.870 28.01954
## 10) Wind_gust_mean>=2.03125 89 9681.488 22.06292 *
## 11) Wind_gust_mean< 2.03125 85 15627.110 34.25647
## 22) Pressure_max< 989.05 54 5752.155 28.95185 *
## 23) Pressure_max>=989.05 31 5708.570 43.49677
## 46) Temp_site_min>=3.05 17 1676.382 35.54706 *
## 47) Temp_site_min< 3.05 14 1653.235 53.15000 *
## 3) Temp_site_max< 2.35 147 69631.750 45.05170
## 6) Temp_1500_max< -3.85 76 17438.100 35.86316
## 12) Wind_gust_mean>=2.59375 23 4166.877 26.14348 *
## 13) Wind_gust_mean< 2.59375 53 10155.420 40.08113 *
## 7) Temp_1500_max>=-3.85 71 38908.520 54.88732
## 14) Temp_site_min>=-1.45 40 10627.410 42.93500 *
## 15) Temp_site_min< -1.45 31 15193.470 70.30968
## 30) Temp_site_min>=-3.5 16 5565.690 60.17500 *
## 31) Temp_site_min< -3.5 15 6231.444 81.12000 *
rpart.plot(treeModel)
[Figure: plot of the regression tree printed above. Compared with the previous tree it contains one additional split, heating = 0, below Temp_1500_max < 15.]
pred <- predict(treeModel, test)
obs <- test$PM10
mae(obs, pred)
## [1] 8.135822
mse(obs, pred)
## [1] 130.1761
rmae(obs, pred, mean(train$PM10))
## [1] 0.6571787
rmse(obs, pred, mean(train$PM10))
## [1] 0.4173601
Example of a new attribute: "temperature inversion"
tempInv <- origData$Temp_1500_max > origData$Temp_site_min
head(tempInv)
## [1] TRUE FALSE FALSE TRUE TRUE FALSE
boxplot(origData$PM10 ~ tempInv)
[Figure: box plots of origData$PM10 for tempInv = FALSE and tempInv = TRUE.]
myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear
myData$heating <- heatingSeason
myData$tempInv <- tempInv
train <- myData[sel,]
test <- myData[!sel,]
treeModel <- rpart(PM10 ~ ., train)
treeModel
## n= 866
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 866 222450.600 25.10566
## 2) Temp_site_max>=2.35 719 82378.800 21.02768
## 4) Temp_site_max>=7.55 545 39384.000 18.79541
## 8) Precipitation_mean>=0.03125 102 2920.188 11.91471 *
## 9) Precipitation_mean< 0.03125 443 30522.820 20.37968
## 18) Temp_1500_max< 15.45 389 25996.380 19.45964
## 36) heating< 0.5 243 8447.655 17.19218 *
## 37) heating>=0.5 146 14219.970 23.23356 *
## 19) Temp_1500_max>=15.45 54 1825.117 27.00741 *
## 5) Temp_site_max< 7.55 174 31772.870 28.01954
## 10) Wind_gust_mean>=2.03125 89 9681.488 22.06292 *
## 11) Wind_gust_mean< 2.03125 85 15627.110 34.25647
## 22) Pressure_max< 989.05 54 5752.155 28.95185 *
## 23) Pressure_max>=989.05 31 5708.570 43.49677
## 46) Temp_site_min>=3.05 17 1676.382 35.54706 *
## 47) Temp_site_min< 3.05 14 1653.235 53.15000 *
## 3) Temp_site_max< 2.35 147 69631.750 45.05170
## 6) tempInv< 0.5 105 31362.380 38.54190
## 12) Wind_gust_mean>=2.59375 24 4221.410 25.82917 *
## 13) Wind_gust_mean< 2.59375 81 22112.980 42.30864 *
## 7) tempInv>=0.5 42 22695.660 61.32619
## 14) Temp_site_mean>=-2.75 27 7322.336 50.32963
## 28) Wind_speed_mean>=0.625 18 3638.869 43.69444 *
## 29) Wind_speed_mean< 0.625 9 1306.080 63.60000 *
## 15) Temp_site_mean< -2.75 15 6231.444 81.12000 *
rpart.plot(treeModel)
[Figure: plot of the regression tree printed above. The branch for Temp_site_max < 2.4 now splits on tempInv = 0, followed by splits on Wind_gust_mean, Temp_site_mean and Wind_speed_mean.]
pred <- predict(treeModel, test)
obs <- test$PM10
mae(obs, pred)
## [1] 8.076779
mse(obs, pred)
## [1] 121.2421
rmae(obs, pred, mean(train$PM10))
## [1] 0.6524095
rmse(obs, pred, mean(train$PM10))
## [1] 0.3887165
Random forest
library(randomForest)
rfModel <- randomForest(PM10 ~ ., train)
pred <- predict(rfModel, test)
obs <- test$PM10
mae(obs, pred)
## [1] 7.041154
mse(obs, pred)
## [1] 86.36103
rmae(obs, pred, mean(train$PM10))
## [1] 0.5687559
rmse(obs, pred, mean(train$PM10))
## [1] 0.2768837
The forecast "today will be the same as yesterday"
pred <- test$PM10[-length(test$PM10)]
obs <- test$PM10[-1]
mae(obs, pred)
## [1] 7.579651
mse(obs, pred)
## [1] 120.8301
rmae(obs, pred, mean(train$PM10))
## [1] 0.6134656
rmse(obs, pred, mean(train$PM10))
## [1] 0.387831
Time series
Predicting the concentration of particulate matter can be framed as modeling a time series.
vals <- train[,"PM10"]
n <- nrow(train)
We build the training set so that each row contains four consecutive measurements of the particulate concentration (lag4 to lag1) together with the measurement that follows them (target):
lagged_train <- data.frame(lag4=vals[1:(n-4)], lag3=vals[2:(n-3)], lag2=vals[3:(n-2)], lag1=vals[4:(n-1)], target=vals[5:n])
lagged_train[1:10,]
## lag4 lag3 lag2 lag1 target
## 1 51.4 44.3 49.0 61.3 38.9
## 2 44.3 49.0 61.3 38.9 30.3
## 3 49.0 61.3 38.9 30.3 26.8
## 4 61.3 38.9 30.3 26.8 28.5
## 5 38.9 30.3 26.8 28.5 67.6
## 6 30.3 26.8 28.5 67.6 32.9
## 7 26.8 28.5 67.6 32.9 43.4
## 8 28.5 67.6 32.9 43.4 23.0
## 9 67.6 32.9 43.4 23.0 31.6
## 10 32.9 43.4 23.0 31.6 29.7
The test set is built in the same way:
vals <- test[,"PM10"]
n <- nrow(test)
lagged_test <- data.frame(lag4=vals[1:(n-4)], lag3=vals[2:(n-3)], lag2=vals[3:(n-2)], lag1=vals[4:(n-1)], target=vals[5:n])
Let us build the model.
lagged.rf <- randomForest(target ~ ., lagged_train)
pred <- predict(lagged.rf, lagged_test)
obs <- lagged_test$target
plot(obs, type="l")
points(pred, type="l", col="red")
[Figure: line plot of obs against Index, with the predictions drawn in red.]
We assess the quality of the predictions.
mae(obs, pred)
## [1] 7.236303
mse(obs, pred)
## [1] 118.9229
rmae(obs, pred, mean(train$PM10))
## [1] 0.5959766
rmse(obs, pred, mean(train$PM10))
## [1] 0.3947286
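The lagged data frames in this section were written out column by column. One possible alternative, shown here only as a sketch, is to build them with the base R function embed(), which makes the number of lags a parameter. The helper make_lagged and the short vector vals_toy are assumptions introduced for illustration; the values are the first few PM10 measurements visible in the lagged_train output above.

```r
# Hypothetical helper: build a data frame with k lagged columns and a target.
make_lagged <- function(vals, k = 4) {
  # embed() returns one row per window; column 1 holds the newest value
  # (time t) and column j holds the value at time t - (j - 1).
  m <- embed(vals, k + 1)
  # Reverse the column order so the oldest value comes first.
  df <- data.frame(m[, (k + 1):1, drop = FALSE])
  names(df) <- c(paste0("lag", k:1), "target")
  df
}

vals_toy <- c(51.4, 44.3, 49.0, 61.3, 38.9, 30.3, 26.8)
lagged_toy <- make_lagged(vals_toy, k = 4)
print(lagged_toy)
```

With seven values and four lags the result has three rows, and its rows match the first three rows of lagged_train printed above.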
Recurrent neural networks
Before the first use, TensorFlow must be installed with the command "install_keras()". Installation instructions:
https://keras.rstudio.com/reference/install_keras.html
library(keras)
This time we will use the "tanh" activation function, so we normalize the data to the interval [-1,1].
minV <- min(train$PM10)
maxV <- max(train$PM10)
train.scaled <- 2 * ((train$PM10 - minV) / (maxV - minV)) - 1
range(train.scaled)
## [1] -1 1
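A model trained on values in [-1,1] also produces outputs on that scale, so they would need to be mapped back to the original units before error measures such as mae are computed. The sketch below shows the scaling and its inverse on hypothetical values; the names x_toy, scaled_toy and restored_toy are assumptions, and the inverse step is a suggestion rather than part of the original material.

```r
# Hypothetical measurements.
x_toy <- c(2, 6, 10)
min_toy <- min(x_toy)
max_toy <- max(x_toy)

# Forward transform to [-1, 1], the same formula as for train.scaled above.
scaled_toy <- 2 * ((x_toy - min_toy) / (max_toy - min_toy)) - 1

# Inverse transform back to the original units.
restored_toy <- (scaled_toy + 1) / 2 * (max_toy - min_toy) + min_toy

print(scaled_toy)
print(restored_toy)
```

As with the [0,1] scaling used for nnet, the minimum and maximum should come from the training set and be reused unchanged for the test set.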
We will build the training set in the following form: input = particle concentration at time (t); output = particle concentration at time (t+1)