
Introduction to Machine Learning

Course material for the practical part of the course, as part of Akademija FRI

Petar Vračar, March 2020

Contents

INTRODUCTION TO R
Vectors (the basic data objects in R)
Factors
Lists
Data frames

SUPERVISED LEARNING
Classification
Regression

UNSUPERVISED LEARNING
Clustering
Association rules

INTRODUCTION TO R

R can be used as a calculator:

(50 + 1.45)/12.5
## [1] 4.116

Assignment operators:

x = 945
y <- sin(0.47)^2 * sqrt(5)
y^2 -> z

To get the current value of an object (variable), type its name:

x
## [1] 945
y
## [1] 0.4586309
z
## [1] 0.2103423

Listing objects and removing them from memory:

ls()
## [1] "x" "y" "z"
rm(y)
rm(x,z)

To remove all objects from memory:

rm(list=ls())

Vectors (the basic data objects in R)

Building a vector by listing the values of its elements:

v <- c(14,7,23.5,76.2)
v
## [1] 14.0 7.0 23.5 76.2

Building arithmetic sequences:

v <- 1:10
v
## [1] 1 2 3 4 5 6 7 8 9 10
v <- seq(from=5, to=10, by=2)
v
## [1] 5 7 9

Building a vector by repeating elements:

w <- rep(v, times = 2)
w
## [1] 5 7 9 5 7 9

Scalars are vectors with a single element:

w <- 45.0
w
## [1] 45

A vector can be built from other vectors:

z <- c(v, 2.5, w)
z
## [1] 5.0 7.0 9.0 2.5 45.0

Useful functions on vectors:

v <- c(8, 4, 2, 3, 1, 9, 6)
length(v)
## [1] 7
max(v)
## [1] 9
min(v)
## [1] 1
which.min(v)
## [1] 5
sum(v)
## [1] 33
mean(v)
## [1] 4.714286
sd(v)
## [1] 3.039424
rev(v)
## [1] 6 9 1 3 2 4 8
sort(v)
## [1] 1 2 3 4 6 8 9
sort(v, decreasing=T)
## [1] 9 8 6 4 3 2 1
order(v)
## [1] 5 3 4 2 7 1 6
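The difference between sort() and order() is easy to miss: order() returns the permutation of positions that would sort the vector, so indexing with it reproduces sort(). The sketch below illustrates this; the vector `names_v` is a hypothetical addition, used only to show how a second vector can be reordered to follow the first.

```r
v <- c(8, 4, 2, 3, 1, 9, 6)
idx <- order(v)                      # positions of the elements in ascending order
sorted_by_index <- v[idx]            # same values as sort(v)
same <- identical(sorted_by_index, sort(v))

# Hypothetical labels, reordered to follow the sorted values of v
names_v <- c("a", "b", "c", "d", "e", "f", "g")
labels_sorted <- names_v[idx]
```

This pattern is what makes order() useful for sorting the rows of a table by one of its columns.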

The data type of a vector's elements:

mode(v)
## [1] "numeric"

A logical vector (its elements are logical constants):

b <- c(TRUE, FALSE, F, T)
b
## [1] TRUE FALSE FALSE TRUE
mode(b)
## [1] "logical"
x <- 5 > 3
x
## [1] TRUE
mode(x)
## [1] "logical"

A vector of strings (its elements are character strings):

s <- c("character", "logical", "numeric", "complex")
mode(s)
## [1] "character"

All elements of a vector must be of the same type (otherwise R automatically coerces them to a common type):

c(F, T, 5)
## [1] 0 1 5
c(2.5, 4, 8.1, T)
## [1] 2.5 4.0 8.1 1.0
c(4, 9, T, F, 12.6, "aaa")
## [1] "4" "9" "TRUE" "FALSE" "12.6" "aaa"
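Automatic coercion can be undone explicitly, with some loss. The following sketch (the variable names are illustrative, not from the original material) converts a coerced character vector back to numbers and shows that logical values behave as 0 and 1 in arithmetic.

```r
mixed <- c(4, 9, TRUE, 12.6, "aaa")      # one string turns every element into a string
mixed_mode <- mode(mixed)

# Explicit conversion back to numbers: "TRUE" and "aaa" cannot be parsed and become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
nums <- suppressWarnings(as.numeric(mixed))
n_failed <- sum(is.na(nums))

# Logical values count as 0/1 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))
```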

Operations on vectors

Let us define two vectors:

v1 <- c(10,20,30,40)
v2 <- 1:4

Arithmetic operations are applied to corresponding elements:

v1 + v2
## [1] 11 22 33 44
v1 * v2
## [1] 10 40 90 160

Functions are applied element by element:

v1^2
## [1] 100 400 900 1600
sqrt(v1)
## [1] 3.162278 4.472136 5.477226 6.324555
exp(v1)
## [1] 2.202647e+04 4.851652e+08 1.068647e+13 2.353853e+17
log2(v1)
## [1] 3.321928 4.321928 4.906891 5.321928

If the operands differ in length, the elements of the shorter vector are recycled cyclically during arithmetic operations:

v1 * 10
## [1] 100 200 300 400
v1 + 1
## [1] 11 21 31 41
v1 + c(100, 200)
## [1] 110 220 130 240
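Recycling can be written out by hand to see what R does internally. This is a minimal sketch: the second part uses a length-3 operand (not in the original examples) to show that R still recycles when the lengths do not divide evenly, but raises a warning.

```r
v1 <- c(10, 20, 30, 40)
recycled <- v1 + c(100, 200)
# The same computation with the shorter operand expanded explicitly
explicit <- v1 + rep(c(100, 200), length.out = length(v1))
same <- identical(recycled, explicit)

# Length 3 does not divide length 4: the result is computed, but with a warning
warned <- tryCatch({ v1 + c(1, 2, 3); FALSE }, warning = function(w) TRUE)
```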

Addressing the elements of a vector

Let us define a vector:

x <- c(-10,20,-30,40,-50,60,-70,80)
x
## [1] -10 20 -30 40 -50 60 -70 80

Elements can be addressed by listing the indices (positions) of interest (the first element of a vector is at position 1):

x[c(1,4,5)]
## [1] -10 40 -50
x[1:3]
## [1] -10 20 -30

Negative indices mean that we want all elements except the listed ones:

x[-1]
## [1] 20 -30 40 -50 60 -70 80
x[c(-4,-6)]
## [1] -10 20 -30 -50 -70 80
x[-(1:3)]
## [1] 40 -50 60 -70 80

Elements can also be addressed with a logical vector; this selects the elements at the positions where the logical vector is TRUE.

The result of an element-wise comparison is a logical vector:

x > 0
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE

Addressing with a logical vector (returns the elements at the positions of the TRUE constants):

x[x>0]
## [1] 20 40 60 80
x[x <= -20 | x > 50]
## [1] -30 -50 60 -70 80
x[x > 40 & x < 100]
## [1] 60 80

Use the operator == to test for equality and the operator != to test for inequality.

The function which() returns the indices at which the value is TRUE:

which(x > 0)
## [1] 2 4 6 8
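Because TRUE counts as 1 and FALSE as 0, a logical vector can also be used for counting. The sketch below reuses the vector x from above; the derived variable names are illustrative.

```r
x <- c(-10, 20, -30, 40, -50, 60, -70, 80)
positive <- x > 0
n_positive <- sum(positive)          # number of TRUE values
share_positive <- mean(positive)     # proportion of TRUE values

# Indexing by positions from which() selects the same elements as the logical vector
idx <- which(positive)
same <- identical(x[idx], x[positive])
```

The same idiom, the mean of a logical comparison, gives the classification accuracy when predicted and observed classes are compared.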

The elements of a vector can be named:

point <- c(4.7, 3.6, 2.5)
names(point) <- c('x', 'y', 'z')
point
## x y z
## 4.7 3.6 2.5

Elements can now be addressed by name:

point['x']
## x
## 4.7
point[c('x','z')]
## x z
## 4.7 2.5

If no indices are given, all elements of the vector are addressed:

point[] <- 0
point
## x y z
## 0 0 0

The following command gives a completely different result:

point <- 0
point
## [1] 0

Modifying vectors

Let us define a vector:

x <- c("a", "b", "c", "d")

Changing the values of elements:

x[2] <- "BBBBB"
x
## [1] "a" "BBBBB" "c" "d"
x[c(1,3)] <- c("AAAAA", "CCCCC")
x
## [1] "AAAAA" "BBBBB" "CCCCC" "d"

Adding a new element:

x[length(x)+1] = "EEEEE"
x
## [1] "AAAAA" "BBBBB" "CCCCC" "d" "EEEEE"

What happens if we do not define all elements of the vector?

x[10] <- "FFFFF"
x
## [1] "AAAAA" "BBBBB" "CCCCC" "d" "EEEEE" NA NA NA
## [9] NA "FFFFF"

At which positions are element values missing?

is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE

Removing elements from a vector:

x <- x[-c(1,3)]
x
## [1] "BBBBB" "d" "EEEEE" NA NA NA NA "FFFFF"
x <- c(x[2],x[3])
x
## [1] "d" "EEEEE"
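Missing values (NA) appear often in real data, so it helps to know how to count and drop them. A minimal sketch on a small vector of its own (the numeric example with na.rm is an addition, not part of the original material):

```r
x <- c("a", "b")
x[5] <- "e"                      # positions 3 and 4 are filled with NA
n_missing <- sum(is.na(x))
x_clean <- x[!is.na(x)]          # keep only the defined elements

# Many numeric functions return NA unless told to skip missing values
m <- mean(c(1, NA, 3), na.rm = TRUE)
```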

Factors

Let us define a vector:

gender <- c("f","m","m","m","f","m","f")
gender
## [1] "f" "m" "m" "m" "f" "m" "f"

Factors are used to model nominal variables:

gender <- factor(gender)
gender
## [1] f m m m f m f
## Levels: f m

The "levels" argument defines the possible values of the elements (here the directions levo, desno, gor, dol, i.e. left, right, up, down):

smeri <- factor(c('levo','levo','desno'), levels = c('levo','desno','gor','dol'))
smeri
## [1] levo levo desno
## Levels: levo desno gor dol

Printing the list of allowed values:

levels(smeri)
## [1] "levo" "desno" "gor" "dol"

Only allowed values can be assigned to the elements of the factor:

smeri[1] <- "posevno"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "posevno"): invalid factor
## level, NA generated
smeri
## [1] <NA> levo desno
## Levels: levo desno gor dol
smeri[1] <- "gor"
smeri
## [1] gor levo desno
## Levels: levo desno gor dol

Frequency table of the values:

table(gender)
## gender
## f m
## 3 4
table(smeri)
## smeri
## levo desno gor dol
## 1 1 1 0
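Internally, a factor stores integer codes that point into its list of levels, which is why unused levels still show up in table(). The sketch below rebuilds the factor from its original definition (before the assignments above) and shows the codes as well as droplevels(), a base R function that removes unused levels.

```r
smeri <- factor(c("levo", "levo", "desno"),
                levels = c("levo", "desno", "gor", "dol"))
codes <- as.integer(smeri)               # position of each value in levels(smeri)
restored <- levels(smeri)[codes]         # mapping the codes back to labels
used_only <- levels(droplevels(smeri))   # levels that actually occur
```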

Lists

A list is an ordered collection of objects:

student <- list(id=12345,name="Marko",marks=c(10,9,10,9,8,10))
student
## $id
## [1] 12345
##
## $name
## [1] "Marko"
##
## $marks
## [1] 10 9 10 9 8 10

Addressing the components of a list (by name):

student$id
## [1] 12345
student$name
## [1] "Marko"
student$marks
## [1] 10 9 10 9 8 10

Addressing the components of a list (by index):

student[[1]]
## [1] 12345
student[[2]]
## [1] "Marko"
student[[3]]
## [1] 10 9 10 9 8 10

Adding a new component to a list:

student$parents <- c("Ana", "Tomaz")
student
## $id
## [1] 12345
##
## $name
## [1] "Marko"
##
## $marks
## [1] 10 9 10 9 8 10
##
## $parents
## [1] "Ana" "Tomaz"
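The examples above use double brackets. Single brackets also work on lists, but they return a sub-list rather than the component itself. A short sketch of the difference, using the same student list as first defined:

```r
student <- list(id = 12345, name = "Marko", marks = c(10, 9, 10, 9, 8, 10))

sub <- student["marks"]        # a list of length 1 that contains the marks
val <- student[["marks"]]      # the numeric vector itself, same as student$marks
mean_mark <- mean(val)         # works on the vector, not on the sub-list
```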

Data frames

Building a data frame:

height <- c(159, 185, 183, 170, 174, 165, 173, 169, 173, 158)
weight <- c(45, 89, 70, 80, 62, 86, 50, 58, 72, 50)
gender <- factor(c("f","m","m","m","f","m","f","f","m","f"))
student <- c(T, T, F, F, T, T, F, F, F, T)
df <- data.frame(gender, height, weight, student)
df

## gender height weight student

## 1 f 159 45 TRUE


## 2 m 185 89 TRUE

## 3 m 183 70 FALSE

## 4 m 170 80 FALSE

## 5 f 174 62 TRUE

## 6 m 165 86 TRUE

## 7 f 173 50 FALSE

## 8 f 169 58 FALSE

## 9 m 173 72 FALSE

## 10 f 158 50 TRUE

Some useful functions:

summary(df)

## gender height weight student

## f:5 Min. :158.0 Min. :45.0 Mode :logical

## m:5 1st Qu.:166.0 1st Qu.:52.0 FALSE:5

## Median :171.5 Median :66.0 TRUE :5

## Mean :170.9 Mean :66.2

## 3rd Qu.:173.8 3rd Qu.:78.0

## Max. :185.0 Max. :89.0

names(df)

## [1] "gender" "height" "weight" "student"

nrow(df)

## [1] 10
ncol(df)
## [1] 4

Accessing the elements of a data frame:

df[5,]

## gender height weight student

## 5 f 174 62 TRUE

df[1:5,]

## gender height weight student

## 1 f 159 45 TRUE

## 2 m 185 89 TRUE

## 3 m 183 70 FALSE

## 4 m 170 80 FALSE

## 5 f 174 62 TRUE

df[,1]

## [1] f m m m f m f f m f

## Levels: f m
df[,c(1,3,4)]

## gender weight student

## 1 f 45 TRUE

## 2 m 89 TRUE

## 3 m 70 FALSE


## 4 m 80 FALSE

## 5 f 62 TRUE

## 6 m 86 TRUE

## 7 f 50 FALSE

## 8 f 58 FALSE

## 9 m 72 FALSE

## 10 f 50 TRUE

df[1,-3]

## gender height student

## 1 f 159 TRUE

df$height

## [1] 159 185 183 170 174 165 173 169 173 158
df[df$height < 180,]

## gender height weight student

## 1 f 159 45 TRUE

## 4 m 170 80 FALSE

## 5 f 174 62 TRUE

## 6 m 165 86 TRUE

## 7 f 173 50 FALSE

## 8 f 169 58 FALSE

## 9 m 173 72 FALSE

## 10 f 158 50 TRUE

df[df$gender == "m",]

## gender height weight student

## 2 m 185 89 TRUE

## 3 m 183 70 FALSE

## 4 m 170 80 FALSE

## 6 m 165 86 TRUE

## 9 m 173 72 FALSE
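Row conditions can be combined with & and |, and a column can be summarized per group with tapply(). The sketch below rebuilds three columns of the data frame so that it runs on its own; the variable names `tall_women` and `mean_height` are illustrative.

```r
df <- data.frame(
  gender = factor(c("f", "m", "m", "m", "f", "m", "f", "f", "m", "f")),
  height = c(159, 185, 183, 170, 174, 165, 173, 169, 173, 158),
  weight = c(45, 89, 70, 80, 62, 86, 50, 58, 72, 50)
)

# Rows that satisfy both conditions at once
tall_women <- df[df$gender == "f" & df$height > 170, ]

# Mean height for each level of the factor gender
mean_height <- tapply(df$height, df$gender, mean)
```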

Adding a new column to a data frame:

df <- cbind(df, age = c(20, 21, 30, 25, 27, 19, 24, 27, 28, 24))
df

## gender height weight student age

## 1 f 159 45 TRUE 20

## 2 m 185 89 TRUE 21

## 3 m 183 70 FALSE 30

## 4 m 170 80 FALSE 25

## 5 f 174 62 TRUE 27

## 6 m 165 86 TRUE 19

## 7 f 173 50 FALSE 24

## 8 f 169 58 FALSE 27

## 9 m 173 72 FALSE 28

## 10 f 158 50 TRUE 24

df$name = c("Joan","Tom","John","Mike","Anna","Bill","Tina","Beth","Steve","Kim")
df
## gender height weight student age name

## 1 f 159 45 TRUE 20 Joan

## 2 m 185 89 TRUE 21 Tom

## 3 m 183 70 FALSE 30 John

## 4 m 170 80 FALSE 25 Mike

## 5 f 174 62 TRUE 27 Anna

## 6 m 165 86 TRUE 19 Bill

## 7 f 173 50 FALSE 24 Tina

## 8 f 169 58 FALSE 27 Beth

## 9 m 173 72 FALSE 28 Steve

## 10 f 158 50 TRUE 24 Kim

summary(df)

## gender height weight student age

## f:5 Min. :158.0 Min. :45.0 Mode :logical Min. :19.00

## m:5 1st Qu.:166.0 1st Qu.:52.0 FALSE:5 1st Qu.:21.75

## Median :171.5 Median :66.0 TRUE :5 Median :24.50

## Mean :170.9 Mean :66.2 Mean :24.50

## 3rd Qu.:173.8 3rd Qu.:78.0 3rd Qu.:27.00

## Max. :185.0 Max. :89.0 Max. :30.00

## name

## Length:10

## Class :character

## Mode :character

##
##
##

SUPERVISED LEARNING

Classification

Download the file "PM10_Class.csv" into a local folder. Set this folder as the working directory of the R environment with the "setwd" command, or from the menu by clicking File -> Change dir... For example: setwd("c:/tecaj/data/"). In an R string, a Windows path is written with forward slashes or with doubled backslashes.

The file "PM10_Class.csv" contains weather and air pollution data for the period from 2013 to 2016.

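Since R 4.0.0, read.csv() no longer converts strings to factors by default; the argument stringsAsFactors = TRUE reproduces the output shown here, in which PM10 and Date are factors.

origData <- read.csv("PM10_Class.csv", stringsAsFactors = TRUE)
summary(origData)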

## PM10 Date Glob_radiation_max Glob_radiation_mean

## HIGH:221 2013-01-01: 1 Min. : 0.0 Min. : 0.000

## LOW :990 2013-01-02: 1 1st Qu.: 16.0 1st Qu.: 2.112

## 2013-01-03: 1 Median :108.0 Median : 20.625

## 2013-01-04: 1 Mean :176.9 Mean : 37.431

## 2013-01-05: 1 3rd Qu.:337.0 3rd Qu.: 66.188

## 2013-01-06: 1 Max. :619.0 Max. :138.375

## (Other) :1205

## Glob_radiation_min Wind_speed_max Wind_speed_mean Wind_speed_min

## Min. :0 Min. :0.00 Min. :0.0000 Min. :0.0000

## 1st Qu.:0 1st Qu.:0.90 1st Qu.:0.5375 1st Qu.:0.2000

## Median :0 Median :1.30 Median :0.7250 Median :0.3000

## Mean :0 Mean :1.54 Mean :0.8905 Mean :0.4064

## 3rd Qu.:0 3rd Qu.:1.90 3rd Qu.:1.0500 3rd Qu.:0.5000

## Max. :0 Max. :6.80 Max. :4.6125 Max. :4.2000

##
## Wind_gust_max Wind_gust_mean Wind_gust_min Precipitation_mean

## Min. : 0.000 Min. : 0.000 Min. :0.000 Min. :0.0000

## 1st Qu.: 2.300 1st Qu.: 1.519 1st Qu.:0.800 1st Qu.:0.0000

## Median : 2.800 Median : 1.900 Median :1.100 Median :0.0000

## Mean : 3.521 Mean : 2.259 Mean :1.277 Mean :0.1559

## 3rd Qu.: 4.100 3rd Qu.: 2.544 3rd Qu.:1.500 3rd Qu.:0.0000

## Max. :14.900 Max. :10.162 Max. :8.900 Max. :4.8375

##
## Precipitation_sum Pressure_max Pressure_mean Pressure_min

## Min. : 0.000 Min. : 951.9 Min. : 947.3 Min. : 942.1

## 1st Qu.: 0.000 1st Qu.: 978.8 1st Qu.: 977.9 1st Qu.: 977.3

## Median : 0.000 Median : 982.9 Median : 982.2 Median : 981.5

## Mean : 1.281 Mean : 982.9 Mean : 982.0 Mean : 981.3

## 3rd Qu.: 0.200 3rd Qu.: 986.9 3rd Qu.: 985.9 3rd Qu.: 985.4

## Max. :38.500 Max. :1004.0 Max. :1003.5 Max. :1003.1

##
## Humidity_max Humidity_mean Humidity_min Temp_1500_max

## Min. : 39.1 Min. :35.41 Min. :32.40 Min. :-14.300

## 1st Qu.: 86.3 1st Qu.:79.73 1st Qu.:69.90 1st Qu.: -0.600

## Median : 92.0 Median :87.84 Median :81.10 Median : 4.800

## Mean : 89.9 Mean :85.55 Mean :79.04 Mean : 4.634

## 3rd Qu.: 95.8 3rd Qu.:93.29 3rd Qu.:90.00 3rd Qu.: 9.650

## Max. :100.0 Max. :99.71 Max. :99.10 Max. : 21.300

##
## Temp_1500_mean Temp_1500_min Temp_site_max Temp_site_mean

## Min. :-14.750 Min. :-15.100 Min. :-7.90 Min. :-9.675

## 1st Qu.: -1.619 1st Qu.: -2.400 1st Qu.: 4.20 1st Qu.: 3.344


## Median : 3.725 Median : 2.900 Median :11.40 Median :10.150

## Mean : 3.466 Mean : 2.666 Mean :10.75 Mean : 9.312

## 3rd Qu.: 8.456 3rd Qu.: 7.600 3rd Qu.:17.00 3rd Qu.:15.081

## Max. : 19.475 Max. : 18.800 Max. :29.10 Max. :24.087

##
## Temp_site_min

## Min. :-10.800

## 1st Qu.: 2.600

## Median : 9.100

## Mean : 8.343

## 3rd Qu.: 14.100

## Max. : 22.100

##

Description of the data:

Attribute             Meaning
PM10                  Nominal attribute, daily concentration of particulate matter with a diameter of 10 µm
Date                  Date of the measurement in the format YYYY-MM-DD
Glob_radiation_max    Continuous attribute, maximum global radiation between 0:00 and 7:00
Glob_radiation_mean   Continuous attribute, mean global radiation between 0:00 and 7:00
Glob_radiation_min    Continuous attribute, minimum global radiation between 0:00 and 7:00
Wind_speed_max        Continuous attribute, maximum wind speed between 0:00 and 7:00
Wind_speed_mean       Continuous attribute, mean wind speed between 0:00 and 7:00
Wind_speed_min        Continuous attribute, minimum wind speed between 0:00 and 7:00
Wind_gust_max         Continuous attribute, maximum wind gust speed between 0:00 and 7:00
Wind_gust_mean        Continuous attribute, mean wind gust speed between 0:00 and 7:00
Wind_gust_min         Continuous attribute, minimum wind gust speed between 0:00 and 7:00
Precipitation_mean    Continuous attribute, mean precipitation (per hour) between 0:00 and 7:00
Precipitation_sum     Continuous attribute, total precipitation between 0:00 and 7:00
Pressure_max          Continuous attribute, maximum air pressure between 0:00 and 7:00
Pressure_mean         Continuous attribute, mean air pressure between 0:00 and 7:00
Pressure_min          Continuous attribute, minimum air pressure between 0:00 and 7:00
Humidity_max          Continuous attribute, maximum air humidity between 0:00 and 7:00
Humidity_mean         Continuous attribute, mean air humidity between 0:00 and 7:00
Humidity_min          Continuous attribute, minimum air humidity between 0:00 and 7:00
Temp_1500_max         Continuous attribute, maximum air temperature at an altitude of 1500 m between 0:00 and 7:00
Temp_1500_mean        Continuous attribute, mean air temperature at an altitude of 1500 m between 0:00 and 7:00
Temp_1500_min         Continuous attribute, minimum air temperature at an altitude of 1500 m between 0:00 and 7:00
Temp_site_max         Continuous attribute, maximum air temperature at the measuring site between 0:00 and 7:00
Temp_site_mean        Continuous attribute, mean air temperature at the measuring site between 0:00 and 7:00
Temp_site_min         Continuous attribute, minimum air temperature at the measuring site between 0:00 and 7:00

Glob_radiation_min takes only a single value, so we do not need it.

origData$Glob_radiation_min <- NULL
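Columns with a single value can also be found programmatically instead of by reading the summary. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its columns are made up for illustration); the same two lines could be applied to origData.

```r
# Hypothetical toy data: column b is constant
toy <- data.frame(a = c(1, 2, 3), b = c(0, 0, 0), c = c("x", "y", "x"))

# A column is constant if it has exactly one distinct value
is_constant <- sapply(toy, function(col) length(unique(col)) == 1)
constant_names <- names(toy)[is_constant]

# Keep only the informative columns
toy_reduced <- toy[, !is_constant, drop = FALSE]
```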


Getting to know the data and visualization

Number of measurements (rows) in our data set:

nrow(origData)
## [1] 1211

Number of attributes (columns):

ncol(origData)
## [1] 24

Frequency of the individual classes:

table(origData$PM10)
##
## HIGH LOW
## 221 990
tabPM10 <- table(origData$PM10)
tabPM10
##
## HIGH LOW
## 221 990

Bar chart:

barplot(tabPM10,
main="Bar chart of PM10 particle concentration",
ylab="Number of measurements",
xlab="PM10 particle concentration")

[Figure: bar chart of PM10 particle concentration. x-axis: PM10 particle concentration (HIGH, LOW); y-axis: number of measurements, 0 to 800.]

Pie chart:

pie(tabPM10, main="Pie chart of PM10 particle concentration")

[Figure: pie chart of PM10 particle concentration with the sectors HIGH and LOW.]

Histogram:

hist(origData$Humidity_mean,
main="Histogram of mean air humidity",
xlab="Mean air humidity",
ylab="Number of measurements")

[Figure: histogram of mean air humidity. x-axis: mean air humidity, 40 to 100; y-axis: number of measurements, 0 to 300.]

Box plot of the mean air temperature:

boxplot(origData$Temp_site_mean, main="Mean air temperature", ylab="Temperature in °C")

[Figure: box plot of the mean air temperature. y-axis: temperature in °C, -10 to 25.]

Box plot of the mean air temperature for the different PM10 concentrations:

boxplot(Temp_site_mean ~ PM10, origData, main="Box plot", xlab="PM10", ylab="Air temperature in °C")

[Figure: box plots of the air temperature for the classes HIGH and LOW. x-axis: PM10; y-axis: air temperature in °C, -10 to 25.]

Structure of the data frame:

str(origData)

## 'data.frame': 1211 obs. of 24 variables:

## $ PM10 : Factor w/ 2 levels "HIGH","LOW": 1 1 1 1 1 2 2 2 1 2 ...

## $ Date : Factor w/ 1211 levels "2013-01-01","2013-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...

## $ Glob_radiation_max : num 1 1 3 7 6 2 2 2 7 2 ...

## $ Glob_radiation_mean: num 0.125 0.125 0.375 0.875 0.75 0.25 0.25 0.25 0.875 0.25 ...

## $ Wind_speed_max : num 0.9 1.1 1.2 1.3 1.3 0.9 1.1 0.9 1.1 2 ...

## $ Wind_speed_mean : num 0.65 0.675 0.738 0.887 1 ...

## $ Wind_speed_min : num 0.3 0.3 0.4 0.6 0.5 0.3 0.3 0.5 0.3 0.4 ...

## $ Wind_gust_max : num 2.1 2.9 2.6 3.2 4.5 2.3 2.8 2.3 2.4 4.2 ...

## $ Wind_gust_mean : num 1.61 1.81 1.73 2.21 2.85 ...

## $ Wind_gust_min : num 1.1 0.9 1 1.3 1.3 0.8 1.5 1.2 0.9 1.5 ...

## $ Precipitation_mean : num 0 0 0 0 0 0 0 0 0 0.075 ...

## $ Precipitation_sum : num 0 0 0 0 0 0 0 0 0 0.6 ...

## $ Pressure_max : num 986 984 999 995 987 ...

## $ Pressure_mean : num 984 983 998 994 986 ...

## $ Pressure_min : num 982 982 997 994 985 ...

## $ Humidity_max : num 96.1 82.9 94.5 94.2 95.6 92.2 97.6 79.9 86 94 ...

## $ Humidity_mean : num 95.8 79.2 93.6 93.3 94.3 ...

## $ Humidity_min : num 95.4 76.9 92.9 92.5 93.3 81.8 92.6 67.3 81.2 83.8 ...

## $ Temp_1500_max : num -0.4 -2.3 -4.3 3.3 3.5 0 0.4 -2.1 1.6 -3.4 ...

## $ Temp_1500_mean : num -1.69 -2.99 -4.53 2.96 2.99 ...

## $ Temp_1500_min : num -2.5 -4.3 -4.7 2.6 2.4 -0.7 -0.5 -4.2 1.1 -3.6 ...

## $ Temp_site_max : num -1.8 3.4 3.3 1.3 0.9 4.5 1.4 1.7 2.3 5.9 ...

## $ Temp_site_mean : num -2.138 3.2 2.875 0.688 0.588 ...

## $ Temp_site_min : num -2.7 2.9 2.3 0.1 0.3 2 0.2 0.6 1.7 4.3 ...

The date is currently represented as a nominal variable and is not useful for modelling. R has built-in support for representing calendar dates:

date <- as.Date(origData$Date)

We split the data chronologically into a training set and a test set:

sel <- date < "2016-1-1"
train <- origData[sel,]
test <- origData[!sel,]
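The chronological split can be checked on a small example. The sketch below uses a hypothetical toy data frame (only the Date column mirrors the real data) and verifies that every training date precedes every test date, which is the property that makes the evaluation realistic for time-ordered data.

```r
# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  Date  = c("2015-12-30", "2015-12-31", "2016-01-01", "2016-01-02"),
  value = c(1, 2, 3, 4)
)

d <- as.Date(toy$Date)
sel <- d < as.Date("2016-01-01")     # TRUE for rows before the cut-off date
toy_train <- toy[sel, ]
toy_test  <- toy[!sel, ]

# No training example is later than any test example
no_overlap <- max(d[sel]) < min(d[!sel])
```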

Majority classifier

The majority class is the class with the most training examples:

table(train$PM10) / length(train$PM10)
##
## HIGH LOW
## 0.187067 0.812933

Classification accuracy achieved by the trivial model (it assigns every example to the majority class):

table(test$PM10) / length(test$PM10)
##
## HIGH LOW
## 0.1710145 0.8289855

The accuracy of the majority classifier sets the lower bound on the accuracy of useful models!
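The majority classifier can be written out in a few lines. This is a sketch on hypothetical toy labels (`train_y` and `test_y` are made up); with the real data, train$PM10 and test$PM10 would take their place.

```r
# Hypothetical class labels for illustration
train_y <- factor(c("LOW", "LOW", "HIGH", "LOW"), levels = c("HIGH", "LOW"))
test_y  <- factor(c("LOW", "HIGH", "LOW", "LOW", "HIGH"), levels = c("HIGH", "LOW"))

# The majority class is determined on the training set only
majority <- names(which.max(table(train_y)))

# Accuracy of always predicting the majority class on the test set
baseline_acc <- mean(test_y == majority)
```

Any learned model should be compared against this baseline before its accuracy is interpreted.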

Decision tree

Decision tree learning is implemented in the rpart library. The library is part of the standard R distribution and does not need to be installed. We load it with the "library()" command:

library(rpart)

Learning a decision tree:

treeModel <- rpart(PM10 ~ ., train, usesurrogate=0)
treeModel
## n= 866
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 866 162 LOW (0.1870670 0.8129330)

## 2) Date=2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-09,2013-01-11,2013-01-15,2013-01-19,2013-01-20,2013-01-21,2013-01-26,2013-01-27,2013-01-28,2013-02-01,2013-02-05,2013-02-08,2013-02-11,2013-02-12,2013-02-13,2013-02-15,2013-02-16,2013-02-17,2013-02-18,2013-02-19,2013-02-20,2013-02-23,2013-02-24,2013-02-25,2013-02-26,2013-03-23,2013-03-28,2013-04-17,2013-04-18,2013-04-19,2013-04-30,2013-05-01,2013-05-02,2013-05-04,2013-06-18,2013-08-07,2013-08-08,2013-08-09,2013-10-05,2013-10-07,2013-10-08,2013-10-09,2013-10-10,2013-10-18,2013-10-24,2013-11-18,2013-11-28,2013-11-29,2013-11-30,2013-12-03,2013-12-04,2013-12-05,2013-12-06,2013-12-07,2013-12-08,2013-12-09,2013-12-10,2013-12-11,2013-12-12,2013-12-13,2013-12-14,2013-12-15,2013-12-16,2013-12-17,2013-12-18,2013-12-19,2013-12-20,2013-12-21,2014-01-01,2014-01-27,2014-01-28,2014-01-29,2014-01-30,2014-01-31,2014-02-01,2014-02-04,2014-02-05,2014-02-06,2014-02-19,2014-02-21,2014-02-25,2014-03-03,2014-03-07,2014-03-08,2014-03-09,2014-03-10,2014-03-11,2014-03-12,2014-03-13,2014-03-14,2014-03-15,2014-03-16,2014-03-17,2014-03-18,2014-03-21,2014-03-29,2014-03-31,2014-04-01,2014-04-02,2014-04-04,2014-10-07,2014-10-28,2014-10-29,2014-10-30,2014-10-31,2014-11-01,2014-11-02,2014-11-22,2014-11-23,2014-11-25,2014-11-26,2014-11-27,2014-11-28,2014-12-10,2014-12-11,2014-12-12,2014-12-16,2014-12-30,2015-01-01,2015-01-02,2015-01-03,2015-01-06,2015-01-07,2015-01-08,2015-01-09,2015-01-15,2015-01-20,2015-01-21,2015-01-27,2015-01-28,2015-01-29,2015-02-03,2015-02-04,2015-02-08,2015-02-10,2015-02-11,2015-02-14,2015-02-15,2015-02-16,2015-02-17,2015-02-19,2015-02-20,2015-02-28,2015-03-10,2015-03-11,2015-03-14,2015-03-15,2015-03-16,2015-03-17,2015-03-18,2015-03-19,2015-03-20,2015-03-21,2015-03-23,2015-03-24,2015-08-06,2015-08-07 162 0 HIGH (1.0000000 0.0000000) *

## 3) Date=2013-01-06,2013-01-07,2013-01-08,2013-01-10,2013-01-12,2013-01-13,2013-01-14,2013-01-16,2013-01-17,2013-01-18,2013-01-22,2013-01-23,2013-01-24,2013-01-25,2013-01-29,2013-01-30,2013-01-31,2013-02-02,2013-02-03,2013-02-04,2013-02-06,2013-02-09,2013-02-10,2013-02-14,2013-02-21,2013-02-22,2013-03-13,2013-03-14,2013-03-15,2013-03-16,2013-03-17,2013-03-18,2013-03-19,2013-03-20,2013-03-21,2013-03-22,2013-03-24,2013-03-25,2013-03-26,2013-03-27,2013-03-29,2013-03-30,2013-03-31,2013-04-01,2013-04-02,2013-04-03,2013-04-11,2013-04-12,2013-04-13,2013-04-14,2013-04-15,2013-04-16,2013-04-20,2013-04-21,2013-04-22,2013-04-23,2013-04-24,2013-04-25,2013-04-26,2013-04-27,2013-04-28,2013-04-29,2013-05-03,2013-05-05,2013-05-06,2013-05-07,2013-05-08,2013-05-09,2013-05-10,2013-05-11,2013-05-12,2013-05-13,2013-05-14,2013-05-15,2013-05-16,2013-05-17,2013-05-18,2013-05-19,2013-05-20,2013-05-21,2013-05-22,2013-05-23,2013-05-24,2013-05-25,2013-05-26,2013-05-27,2013-05-28,2013-05-29,2013-05-30,2013-05-31,2013-06-01,2013-06-02,2013-06-03,2013-06-04,2013-06-05,2013-06-06,2013-06-07,2013-06-08,2013-06-09,2013-06-10,2013-06-11,2013-06-12,2013-06-13,2013-06-14,2013-06-15,2013-06-16,2013-06-17,2013-06-19,2013-06-20,2013-06-21,2013-06-22,2013-06-23,2013-06-24,2013-06-27,2013-06-28,2013-06-29,2013-06-30,2013-07-01,2013-07-02,2013-07-03,2013-07-04,2013-07-05,2013-07-06,2013-07-07,2013-07-08,2013-07-09,2013-07-10,2013-07-11,2013-07-12,2013-07-13,2013-07-14,2013-07-15,2013-07-16,2013-07-17,2013-07-18,2013-07-19,2013-07-20,2013-07-21,2013-07-22,2013-07-23,2013-07-24,2013-07-25,2013-07-26,2013-07-27,2013-07-28,2013-07-29,2013-07-30,2013-07-31,2013-08-01,2013-08-02,2013-08-03,2013-08-04,2013-08-05,2013-08-06,2013-08-12,2013-08-13,2013-08-14,2013-08-15,2013-08-16,2013-08-17,2013-08-18,2013-08-19,2013-08-20,2013-08-21,2013-08-22,2013-08-23,2013-08-24,2013-08-26,2013-08-27,2013-08-28,2013-08-29,2013-08-30,2013-08-31,2013-09-01,2013-09-02,2013-09-03,2013-09-04,2013-09-05,2013-09-06,2013-09-07,2013-09-0
8,2013-09-11,2013-09-12,2013-09-13,2013-09-14,2013-09-15,2013-09-16,2013-09-17,2013-09-18,2013-09-19,2013-09-20,2013-09-21,2013-09-22,2013-09-23,2013-09-24,2013-09-25,2013-09-26,2013-09-27,2013-09-28,2013-09-29,2013-09-30,2013-10-01,2013-10-02,2013-10-03,2013-10-04,2013-10-06,2013-10-11,2013-10-12,2013-10-13,2013-10-14,2013-10-15,2013-10-16,2013-10-17,2013-10-19,2013-10-20,2013-10-21,2013-10-22,2013-10-23,2013-10-25,2013-10-26,2013-10-27,2013-10-28,2013-10-29,2013-10-30,2013-10-31,2013-11-01,2013-11-02,2013-11-03,2013-11-16,2013-11-17,2013-11-26,2013-11-27,2013-12-01,2013-12-02,2013-12-22,2013-12-23,2013-12-24,2013-12-25,2013-12-26,2013-12-27,2013-12-28,2013-12-29,2013-12-30,2013-12-31,2014-01-02,2014-01-03,2014-01-04,2014-01-05,2014-01-06,2014-01-07,2014-01-08,2014-01-09,2014-01-11,2014-01-12,2014-01-13,2014-01-14,2014-01-15,2014-01-16,2014-01-17,2014-01-18,2014-01-19,2014-01-20,2014-01-23,2014-01-24,2014-01-25,2014-01-26,2014-02-02,2014-02-03,2014-02-07,2014-02-08,2014-02-09,2014-02-10,2014-02-11,2014-02-12,2014-02-13,2014-02-14,2014-02-15,2014-02-16,2014-02-17,2014-02-18,2014-02-20,2014-02-22,2014-02-23,2014-02-26,2014-02-27,2014-02-28,2014-03-01,2014-03-02,2014-03-04,2014-03-05,2014-03-06,2014-03-19,2014-03-20,2014-03-22,2014-03-23,2014-03-24,2014-03-25,2014-03-26,2014-03-27,2014-03-28,2014-03-30,2014-04-03,2014-04-05,2014-04-06,2014-04-07,2014-04-08,2014-04-09,2014-04-10,2014-04-11,2014-04-12,2014-04-13,2014-04-14,2014-04-15,2014-04-16,2014-04-17,2014-04-18,2014-04-19,2014-04-20,2014-04-21,2014-04-22,2014-04-23,2014-04-24,2014-04-25,2014-04-26,2014-04-27,2014-04-28,2014-04-29,2014-04-30,2014-05-01,2014-05-02,2014-05-03,2014-05-04,2014-05-05,2014-05-06,2014-05-07,2014-05-08,2014-05-09,2014-05-10,2014-05-11,2014-05-13,2014-05-14,2014-05-15,2014-05-16,2014-05-17,2014-05-18,2014-05-19,2014-05-20,2014-05-21,2014-05-22,2014-05-23,2014-05-24,2014-05-25,2014-05-26,2014-05-27,2014-05-28,2014-05-29,2014-05-30,2014-05-31,2014-06-01,2014-06-02,2014-06-03,2014-06-04,2014-06
-05,2014-06-06,2014-06-07,2014-06-08,2014-06-09,2014-06-10,2014-06-11,2014-06-12,2014-06-17,2014-06-18,2014-06-19,2014-06-20,2014-06-21,2014-06-22,2014-06-23,2014-07-04,2014-07-05,2014-07-06,2014-07-07,2014-07-08,2014-07-09,2014-07-10,2014-07-11,2014-07-12,2014-07-13,2014-07-14,2014-07-15,2014-07-16,2014-07-17,2014-07-18,2014-07-19,2014-07-20,2014-07-21,2014-07-22,2014-07-23,2014-07-24,2014-07-25,2014-07-26,2014-07-27,2014-07-28,2014-07-29,2014-07-30,2014-07-31,2014-08-01,2014-08-02,2014-08-03,2014-08-04,2014-08-05,2014-08-06,2014-08-07,2014-08-08,2014-08-09,2014-08-10,2014-08-11,2014-08-12,2014-08-13,2014-08-14,2014-08-15,2014-08-16,2014-08-17,2014-08-18,2014-08-19,2014-08-20,2014-08-22,2014-08-23,2014-08-24,2014-08-25,2014-08-26,2014-08-27,2014-08-28,2014-08-29,2014-08-30,2014-08-31,2014-09-01,2014-09-02,2014-09-03,2014-09-04,2014-09-05,2014-09-06,2014-09-07,2014-09-08,2014-09-09,2014-09-10,2014-09-11,2014-09-12,2014-09-13,2014-09-14,2014-09-15,2014-09-16,2014-09-17,2014-09-18,2014-09-19,2014-09-20,2014-09-21,2014-09-22,2014-09-23,2014-09-24,2014-09-25,2014-09-26,2014-09-27,2014-10-08,2014-10-09,2014-10-10,2014-10-11,2014-10-12,2014-10-13,2014-10-14,2014-10-15,2014-10-16,2014-10-17,2014-10-18,2014-10-19,2014-10-20,2014-10-21,2014-10-23,2014-10-24,2014-10-25,2014-10-26,2014-10-27,2014-11-03,2014-11-04,2014-11-05,2014-11-06,2014-11-07,2014-11-08,2014-11-09,2014-11-10,2014-11-11,2014-11-12,2014-11-13,2014-11-14,2014-11-15,2014-11-16,2014-11-17,2014-11-18,2014-11-19,2014-11-20,2014-11-21,2014-11-24,2014-11-29,2014-11-30,2014-12-01,2014-12-02,2014-12-03,2014-12-04,2014-12-05,2014-12-06,2014-12-07,2014-12-08,2014-12-09,2014-12-13,2014-12-14,2014-12-15,2014-12-17,2014-12-18,2014-12-19,2014-12-20,2014-12-21,2014-12-22,2014-12-23,2014-12-24,2014-12-25,2014-12-26,2014-12-27,2014-12-28,2014-12-31,2015-01-04,2015-01-05,2015-01-10,2015-01-11,2015-01-12,2015-01-13,2015-01-14,2015-01-16,2015-01-17,2015-01-18,2015-01-19,2015-01-22,2015-01-23,2015-01-24,2015-01-25,2015-01-26,2015-
01-30,2015-01-31,2015-02-05,2015-02-06,2015-02-07,2015-02-09,2015-02-12,2015-02-13,2015-02-18,2015-02-21,2015-02-22,2015-02-23,2015-02-24,2015-02-25,2015-02-26,2015-02-27,2015-03-01,2015-03-02,2015-03-03,2015-03-04,2015-03-05,2015-03-06,2015-03-07,2015-03-08,2015-03-09,2015-03-12,2015-03-13,2015-03-22,2015-03-25,2015-03-26,2015-03-27,2015-03-28,2015-03-29,2015-03-30,2015-03-31,2015-04-01,2015-04-02,2015-04-03,2015-04-04,2015-04-05,2015-04-06,2015-04-07,2015-04-08,2015-04-09,2015-04-10,2015-04-11,2015-04-12,2015-04-13,2015-04-14,2015-04-15,2015-04-16,2015-04-17,2015-04-18,2015-04-19,2015-04-20,2015-04-21,2015-04-22,2015-04-23,2015-04-24,2015-04-25,2015-04-26,2015-04-27,2015-04-28,2015-04-29,2015-04-30,2015-05-01,2015-05-02,2015-05-03,2015-05-04,2015-05-05,2015-05-06,2015-05-07,2015-05-08,2015-05-10,2015-05-11,2015-05-12,2015-05-13,2015-05-14,2015-05-20,2015-05-21,2015-05-22,2015-05-23,2015-05-24,2015-05-25,2015-05-26,2015-05-27,2015-05-28,2015-05-29,2015-05-30,2015-05-31,2015-06-01,2015-06-02,2015-06-03,2015-06-04,2015-06-05,2015-06-06,2015-06-07,2015-06-08,2015-06-09,2015-06-10,2015-06-11,2015-06-12,2015-06-13,2015-06-14,2015-06-15,2015-06-16,2015-06-17,2015-06-18,2015-06-19,2015-06-20,2015-06-21,2015-06-22,2015-06-23,2015-06-24,2015-06-25,2015-06-26,2015-06-27,2015-06-28,2015-06-29,2015-06-30,2015-07-01,2015-07-02,2015-07-03,2015-07-04,2015-07-05,2015-07-06,2015-07-11,2015-07-12,2015-07-13,2015-07-14,2015-07-15,2015-07-16,2015-07-17,2015-07-18,2015-07-19,2015-07-20,2015-07-21,2015-07-22,2015-07-24,2015-07-25,2015-07-27,2015-07-28,2015-07-29,2015-07-30,2015-07-31,2015-08-01,2015-08-02,2015-08-03,2015-08-04,2015-08-05,2015-08-08,2015-08-09,2015-08-10,2015-08-11,2015-08-12,2015-08-13,2015-08-14,2015-08-15,2015-08-16,2015-08-17 704 0 LOW (0.0000000 1.0000000) *

To plot the tree we use the function rpart.plot from the library of the same name, which must first be installed.

Libraries are installed with the "install.packages()" command, for example install.packages("rpart.plot"). Once the library is installed, we load it with the "library()" command:

library(rpart.plot)
rpart.plot(treeModel)

[Figure: the decision tree drawn by rpart.plot. The root node (LOW, 0.81, 100%) splits on Date: the dates listed in node 2 above lead to a HIGH leaf (0.00, 19%), all other dates to a LOW leaf (1.00, 81%).]

Datum (v trenutni obliki) je zavajujoč atribut - model je neuporaben pred <- predict(treeModel, test, type="class")

obs <- test$PM10

(20)

table(obs, pred)

## pred

## obs HIGH LOW

## HIGH 0 59

## LOW 0 286 tab <- table(obs, pred)

Klasifikacijska točnost - delež pravilno klasificiranih primerov sum(diag(tab))/sum(tab)

## [1] 0.8289855
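The accuracy above is exactly what a classifier that always predicts LOW would achieve, so on its own it says little about the usefulness of the model. As a quick check, the following sketch rebuilds the confusion matrix from the counts printed above (the object name tab_all_low is hypothetical) and compares the accuracy with the share of the majority class.

```r
# Confusion matrix copied from the output above (rows: observed, columns: predicted).
tab_all_low <- matrix(c(0, 0, 59, 286), nrow = 2,
                      dimnames = list(obs = c("HIGH", "LOW"),
                                      pred = c("HIGH", "LOW")))

# Classification accuracy: correctly classified examples / all examples.
accuracy <- sum(diag(tab_all_low)) / sum(tab_all_low)

# Share of the most frequent observed class (here LOW).
majority_share <- max(rowSums(tab_all_low)) / sum(tab_all_low)

print(c(accuracy = accuracy, majority_share = majority_share))
```

Both numbers coincide (286/345), which confirms that the tree behaves like a majority-class predictor on the test set.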

The date can be converted into a numeric attribute (the day of the year).

dayOfYear <- as.numeric(format(date,"%j"))
summary(dayOfYear)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0    85.0   165.0   171.8   254.5   365.0

myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear

train <- myData[sel,]
test <- myData[!sel,]

treeModel <- rpart(PM10 ~ ., train)
treeModel

## n= 866

##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 866 162 LOW (0.18706697 0.81293303)

## 2) Temp_site_mean< 3.58125 223 108 HIGH (0.51569507 0.48430493)

## 4) Wind_gust_mean< 2.33125 156 52 HIGH (0.66666667 0.33333333)

## 8) Pressure_max>=991.75 40 3 HIGH (0.92500000 0.07500000) *

## 9) Pressure_max< 991.75 116 49 HIGH (0.57758621 0.42241379)

## 18) Temp_site_mean< 0.51875 60 18 HIGH (0.70000000 0.30000000)

## 36) Temp_1500_mean>=-9.40625 48 10 HIGH (0.79166667 0.20833333)

## 72) Precipitation_sum< 1.15 38 4 HIGH (0.89473684 0.10526316) *

## 73) Precipitation_sum>=1.15 10 4 LOW (0.40000000 0.60000000) *

## 37) Temp_1500_mean< -9.40625 12 4 LOW (0.33333333 0.66666667) *

## 19) Temp_site_mean>=0.51875 56 25 LOW (0.44642857 0.55357143)

## 38) Wind_gust_min< 1.15 40 17 HIGH (0.57500000 0.42500000)

## 76) Temp_1500_mean>=-4.19375 26 7 HIGH (0.73076923 0.26923077) *

## 77) Temp_1500_mean< -4.19375 14 4 LOW (0.28571429 0.71428571) *

## 39) Wind_gust_min>=1.15 16 2 LOW (0.12500000 0.87500000) *

## 5) Wind_gust_mean>=2.33125 67 11 LOW (0.16417910 0.83582090) *

## 3) Temp_site_mean>=3.58125 643 47 LOW (0.07309487 0.92690513)

## 6) Temp_site_max< 8.45 122 25 LOW (0.20491803 0.79508197)

## 12) Wind_gust_mean< 1.66875 28 14 HIGH (0.50000000 0.50000000)


## 24) Wind_gust_min>=0.85 16 5 HIGH (0.68750000 0.31250000) *

## 25) Wind_gust_min< 0.85 12 3 LOW (0.25000000 0.75000000) *

## 13) Wind_gust_mean>=1.66875 94 11 LOW (0.11702128 0.88297872)

## 26) Glob_radiation_max>=70.5 27 9 LOW (0.33333333 0.66666667)

## 52) dayOfYear< 80.5 10 2 HIGH (0.80000000 0.20000000) *

## 53) dayOfYear>=80.5 17 1 LOW (0.05882353 0.94117647) *

## 27) Glob_radiation_max< 70.5 67 2 LOW (0.02985075 0.97014925) *

## 7) Temp_site_max>=8.45 521 22 LOW (0.04222649 0.95777351) *

rpart.plot(treeModel)

[Figure: the decision tree drawn by rpart.plot(treeModel). The root node (LOW, 0.81, 100%) splits on Temp_site_mean < 3.6; further splits use Wind_gust_mean, Pressure_max, Temp_1500_mean, Precipitation_sum, Wind_gust_min, Temp_site_max, Glob_radiation_max and dayOfYear.]

pred <- predict(treeModel, test, type="class")
obs <- test$PM10

Let us compute the classification accuracy of the decision tree.

table(obs, pred)

##       pred
## obs    HIGH LOW
##   HIGH   44  15
##   LOW    22 264

tab <- table(obs, pred)
sum(diag(tab))/sum(tab)

## [1] 0.8927536

A function for computing classification accuracy:

CA <- function(observed, predicted) {
  tab <- table(observed, predicted)
  sum(diag(tab))/sum(tab)
}

CA(obs, pred)

## [1] 0.8927536
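Because only about 17% of the test examples are HIGH, overall accuracy hides how well the rarer class is recognised. As a hedged illustration (the helper name class_recall is hypothetical and not part of the course material), the sketch below computes, from the confusion matrix printed above, the share of correctly recognised examples within each observed class.

```r
# Confusion matrix of the decision tree, copied from the output above.
tab_tree <- matrix(c(44, 22, 15, 264), nrow = 2,
                   dimnames = list(obs = c("HIGH", "LOW"),
                                   pred = c("HIGH", "LOW")))

# Hypothetical helper: for every observed class, the proportion of its
# examples that were predicted correctly (diagonal / row total).
class_recall <- function(tab) {
  diag(tab) / rowSums(tab)
}

recall <- class_recall(tab_tree)
print(round(recall, 3))
```

With these counts the tree recognises roughly three quarters of the HIGH days (44 of 59) and about 92% of the LOW days (264 of 286).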

Random forest

The randomForest library must first be installed with the command install.packages("randomForest").

library(randomForest)

## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

rfModel <- randomForest(PM10 ~ ., train)
pred <- predict(rfModel, test, type="class")
obs <- test$PM10

table(obs, pred)

##       pred
## obs    HIGH LOW
##   HIGH   45  14
##   LOW    20 266

CA(obs, pred)

## [1] 0.9014493

Artificial neural network

We will use the nnet library.

library(nnet)

We scale the continuous attributes to the interval [0,1].

First we find the range of values of each attribute

max_train <- apply(train[,-1], 2, max)
min_train <- apply(train[,-1], 2, min)

and then scale the data.

train_scaled <- scale(train[,-1], center = min_train, scale = max_train - min_train)
train_scaled <- data.frame(train_scaled)
train_scaled$PM10 <- train$PM10

All attribute values in the training set now lie in the interval [0,1].

summary(train_scaled)

## Glob_radiation_max Glob_radiation_mean Wind_speed_max Wind_speed_mean

## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000

## 1st Qu.:0.02585 1st Qu.:0.01536 1st Qu.:0.1324 1st Qu.:0.1192

## Median :0.17690 Median :0.14860 Median :0.1912 Median :0.1572

## Mean :0.28845 Mean :0.27497 Mean :0.2310 Mean :0.1947

## 3rd Qu.:0.55856 3rd Qu.:0.51310 3rd Qu.:0.2794 3rd Qu.:0.2304

## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000

## Wind_speed_min Wind_gust_max Wind_gust_mean Wind_gust_min


## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000

## 1st Qu.:0.04762 1st Qu.:0.09353 1st Qu.:0.1432 1st Qu.:0.1011

## Median :0.07143 Median :0.13309 Median :0.1793 Median :0.1236

## Mean :0.09648 Mean :0.18501 Mean :0.2154 Mean :0.1444

## 3rd Qu.:0.11905 3rd Qu.:0.22302 3rd Qu.:0.2450 3rd Qu.:0.1685

## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000

## Precipitation_mean Precipitation_sum Pressure_max Pressure_mean

## Min. :0.00000 Min. :0.000000 Min. :0.0000 Min. :0.0000

## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.5214 1st Qu.:0.5577

## Median :0.00000 Median :0.000000 Median :0.5973 Median :0.6333

## Mean :0.03204 Mean :0.033421 Mean :0.5956 Mean :0.6272

## 3rd Qu.:0.00000 3rd Qu.:0.005195 3rd Qu.:0.6707 3rd Qu.:0.6998

## Max. :1.00000 Max. :1.000000 Max. :1.0000 Max. :1.0000

## Pressure_min Humidity_max Humidity_mean Humidity_min

## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000

## 1st Qu.:0.5914 1st Qu.:0.7750 1st Qu.:0.6857 1st Qu.:0.5577

## Median :0.6616 Median :0.8736 Median :0.8189 Median :0.7324

## Mean :0.6573 Mean :0.8366 Mean :0.7811 Mean :0.6999

## 3rd Qu.:0.7259 3rd Qu.:0.9388 3rd Qu.:0.9080 3rd Qu.:0.8711

## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000

## Temp_1500_max Temp_1500_mean Temp_1500_min Temp_site_max

## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000

## 1st Qu.:0.3848 1st Qu.:0.3793 1st Qu.:0.3724 1st Qu.:0.3297

## Median :0.5365 Median :0.5385 Median :0.5310 Median :0.5297

## Mean :0.5286 Mean :0.5284 Mean :0.5212 Mean :0.5070

## 3rd Qu.:0.6713 3rd Qu.:0.6766 3rd Qu.:0.6696 3rd Qu.:0.6716

## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000

## Temp_site_mean Temp_site_min dayOfYear PM10

## Min. :0.0000 Min. :0.0000 Min. :0.0000 HIGH:162

## 1st Qu.:0.3899 1st Qu.:0.4141 1st Qu.:0.2170 LOW :704

## Median :0.5894 Median :0.6094 Median :0.4272

## Mean :0.5649 Mean :0.5842 Mean :0.4497

## 3rd Qu.:0.7315 3rd Qu.:0.7538 3rd Qu.:0.6593

## Max. :1.0000 Max. :1.0000 Max. :1.0000

The test set is also scaled using the value ranges from the training set!

test_scaled <- scale(test[,-1], center = min_train, scale = max_train - min_train)
test_scaled <- data.frame(test_scaled)
test_scaled$PM10 <- test$PM10

In the test set the values do not necessarily all lie in the interval [0,1]!

summary(test_scaled)

## Glob_radiation_max Glob_radiation_mean Wind_speed_max Wind_speed_mean

## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000

## 1st Qu.:0.02585 1st Qu.:0.01445 1st Qu.:0.1324 1st Qu.:0.1165

## Median :0.17286 Median :0.15086 Median :0.1765 Median :0.1518

## Mean :0.27898 Mean :0.25930 Mean :0.2152 Mean :0.1891

## 3rd Qu.:0.48465 3rd Qu.:0.43306 3rd Qu.:0.2647 3rd Qu.:0.2168

## Max. :0.90436 Max. :0.99729 Max. :0.7500 Max. :0.9350

## Wind_speed_min Wind_gust_max Wind_gust_mean Wind_gust_min

## Min. :0.00000 Min. :-0.07194 Min. :-0.01245 Min. :0.00000

## 1st Qu.:0.04762 1st Qu.: 0.07914 1st Qu.: 0.13201 1st Qu.:0.08989

## Median :0.07143 Median : 0.12230 Median : 0.16936 Median :0.12360


## Mean :0.09752 Mean : 0.17233 Mean : 0.20537 Mean :0.14112

## 3rd Qu.:0.11905 3rd Qu.: 0.20863 3rd Qu.: 0.22540 3rd Qu.:0.16854

## Max. :0.76190 Max. : 0.76259 Max. : 0.84309 Max. :0.78652

## Precipitation_mean Precipitation_sum Pressure_max Pressure_mean

## Min. :0.00000 Min. :0.000000 Min. :0.2374 Min. :0.2833

## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.5311 1st Qu.:0.5676

## Median :0.00000 Median :0.000000 Median :0.6089 Median :0.6457

## Mean :0.03266 Mean :0.032908 Mean :0.6211 Mean :0.6512

## 3rd Qu.:0.00000 3rd Qu.:0.005195 3rd Qu.:0.7062 3rd Qu.:0.7263

## Max. :0.94057 Max. :0.935065 Max. :1.0136 Max. :1.0274

## Pressure_min Humidity_max Humidity_mean Humidity_min

## Min. :0.3232 Min. :0.1133 Min. :0.1133 Min. :0.1034

## 1st Qu.:0.6058 1st Qu.:0.7750 1st Qu.:0.7041 1st Qu.:0.5787

## Median :0.6768 Median :0.8571 Median :0.8076 Median :0.7271

## Mean :0.6797 Mean :0.8280 Mean :0.7762 Mean :0.6975

## 3rd Qu.:0.7496 3rd Qu.:0.9080 3rd Qu.:0.8762 3rd Qu.:0.8426

## Max. :1.0321 Max. :0.9918 Max. :0.9893 Max. :0.9955

## Temp_1500_max Temp_1500_mean Temp_1500_min Temp_site_max

## Min. :0.08708 Min. :0.0756 Min. :0.0413 Min. :0.04054

## 1st Qu.:0.39888 1st Qu.:0.4036 1st Qu.:0.3953 1st Qu.:0.32162

## Median :0.53371 Median :0.5431 Median :0.5310 Median :0.51622

## Mean :0.54015 Mean :0.5418 Mean :0.5312 Mean :0.49692

## 3rd Qu.:0.68258 3rd Qu.:0.6841 3rd Qu.:0.6696 3rd Qu.:0.67568

## Max. :0.94101 Max. :0.9397 Max. :0.9233 Max. :0.89459

## Temp_site_mean Temp_site_min dayOfYear PM10

## Min. :0.07071 Min. :0.08207 Min. :0.01648 HIGH: 59

## 1st Qu.:0.37764 1st Qu.:0.39514 1st Qu.:0.27747 LOW :286

## Median :0.58386 Median :0.59574 Median :0.51923

## Mean :0.55604 Mean :0.57599 Mean :0.51866

## 3rd Qu.:0.74158 3rd Qu.:0.76292 3rd Qu.:0.76374

## Max. :0.96372 Max. :0.98176 Max. :1.00000
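The summary above shows a few test values below 0 or above 1 (for example Wind_gust_max and Pressure_max). This is expected whenever the minimum and maximum are estimated on the training data only. A minimal sketch on invented numbers (the helper min_max and all values are hypothetical):

```r
# Toy data: one attribute, with test values outside the training range.
train_x <- c(2, 4, 10)
test_x  <- c(1, 6, 12)

# The range is estimated on the training data only.
min_x <- min(train_x)
max_x <- max(train_x)

# Hypothetical helper: min-max scaling with a given minimum and maximum.
min_max <- function(x, lo, hi) {
  (x - lo) / (hi - lo)
}

train_s <- min_max(train_x, min_x, max_x)  # 0.00 0.25 1.00
test_s  <- min_max(test_x,  min_x, max_x)  # -0.125 0.500 1.250

print(train_s)
print(test_s)
```

Reusing the training range keeps the two sets on the same scale; rescaling the test set with its own minimum and maximum would make the attribute values incomparable between the sets.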

If we want reproducible results, we can set the seed of the random number generator.

set.seed(7675353)

Training and evaluation of the neural network

nnModel <- nnet(PM10 ~ ., train_scaled, size=5, maxit=1000, trace=FALSE)
pred <- predict(nnModel, test_scaled, type="class")
obs <- test$PM10

table(obs, pred)

##       pred
## obs    HIGH LOW
##   HIGH   45  14
##   LOW    22 264

CA(obs, pred)

## [1] 0.8956522


The "today will be the same as yesterday" forecast

pred <- test$PM10[-length(test$PM10)]
obs <- test$PM10[-1]

table(obs, pred)

##       pred
## obs    HIGH LOW
##   HIGH   41  17
##   LOW    17 269

CA(obs, pred)

## [1] 0.9011628
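The indexing used above pairs every day with the previous one: dropping the last element gives yesterday's values, and dropping the first gives today's. A minimal sketch on a made-up sequence of classes (the vector days is invented for illustration, and the rows are assumed to be in chronological order):

```r
# Invented sequence of daily classes, in chronological order.
days <- factor(c("LOW", "LOW", "HIGH", "HIGH", "LOW", "LOW"),
               levels = c("HIGH", "LOW"))

# Yesterday's class is used as the prediction for today.
pred_toy <- days[-length(days)]  # days 1..5
obs_toy  <- days[-1]             # days 2..6

# Share of days on which the class did not change.
accuracy_toy <- mean(pred_toy == obs_toy)
print(accuracy_toy)
```

Note that this baseline is one example shorter than the test set, because the first day has no predecessor.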

Probabilistic prediction

When classifying a new example, classifiers can return a probability distribution over all classes instead of a single class.

Probabilistic predictions of the decision tree

predMat <- predict(treeModel, test, type="prob")
head(predMat)

## HIGH LOW

## 867 0.89473684 0.1052632

## 868 0.89473684 0.1052632

## 869 0.73076923 0.2692308

## 870 0.12500000 0.8750000

## 871 0.04222649 0.9577735

## 872 0.02985075 0.9701493

The actual classes of the test examples

obsMat <- class.ind(test$PM10)
head(obsMat)

## HIGH LOW

## [1,] 1 0

## [2,] 1 0

## [3,] 1 0

## [4,] 1 0

## [5,] 0 1

## [6,] 0 1

We use the Brier score to assess the quality of probabilistic predictions.

BrierScore <- function(observedMat, predictedMat)
{
  sum((observedMat - predictedMat) ^ 2) / nrow(predictedMat)
}

BrierScore(obsMat, predMat)

## [1] 0.1721226
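To make the scale of the Brier score easier to read, here is a small worked sketch on two invented test examples. The function is restated under the hypothetical name brier so that the block runs on its own; the matrices are made up.

```r
# Same definition as BrierScore in the text, repeated so the sketch is self-contained.
brier <- function(observedMat, predictedMat) {
  sum((observedMat - predictedMat)^2) / nrow(predictedMat)
}

# Two invented test examples: the first is HIGH, the second is LOW.
obs_toy <- matrix(c(1, 0,
                    0, 1), nrow = 2, byrow = TRUE,
                  dimnames = list(NULL, c("HIGH", "LOW")))

# Confident and mostly correct probabilistic predictions.
good <- matrix(c(0.9, 0.1,
                 0.2, 0.8), nrow = 2, byrow = TRUE)

# Confident but completely wrong predictions.
bad <- matrix(c(0, 1,
                1, 0), nrow = 2, byrow = TRUE)

print(brier(obs_toy, good))     # (0.01 + 0.01 + 0.04 + 0.04) / 2 = 0.05
print(brier(obs_toy, obs_toy))  # perfect predictions give 0
print(brier(obs_toy, bad))      # worst case for two classes is 2
```

Lower is better: with this definition the score for two classes ranges from 0 (perfect) to 2 (certain and always wrong).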


Probabilistic predictions of the random forest

predMat <- predict(rfModel, test, type="prob")
BrierScore(obsMat, predMat)

## [1] 0.1489673

Probabilistic predictions of the neural network

pred <- predict(nnModel, test_scaled, type="raw")
head(pred)

## [,1]

## 867 0.6767817

## 868 0.3650151

## 869 0.1138574

## 870 0.9279571

## 871 0.9983989

## 872 0.9948844

In binary classification the nnet model returns predictions for only one class. Let us add the prediction for the other class.

predMat <- cbind(1-pred, pred)
head(predMat)

## [,1] [,2]

## 867 0.323218300 0.6767817

## 868 0.634984910 0.3650151

## 869 0.886142588 0.1138574

## 870 0.072042923 0.9279571

## 871 0.001601129 0.9983989

## 872 0.005115583 0.9948844

BrierScore(obsMat, predMat)

## [1] 0.1656211

A model that always predicts the prior class distribution

p0 <- table(train$PM10)/nrow(train)
p0

##
##     HIGH      LOW
## 0.187067 0.812933

p0Mat <- matrix(rep(p0, times=nrow(test)), nrow = nrow(test), byrow=T)
colnames(p0Mat) <- names(p0)

head(p0Mat)

## HIGH LOW

## [1,] 0.187067 0.812933

## [2,] 0.187067 0.812933

## [3,] 0.187067 0.812933

## [4,] 0.187067 0.812933

## [5,] 0.187067 0.812933

## [6,] 0.187067 0.812933


BrierScore(obsMat, p0Mat)

## [1] 0.2840524

Regression

Download the file "PM10_Reg.csv" into a local folder. Set that folder as the working directory of the R environment with the "setwd" command, or from the menu by clicking File -> Change dir... For example: setwd("c:/tecaj/data/").

The file "PM10_Reg.csv" contains the same data as "PM10_Class.csv", except that the attribute PM10 is continuous.

origData <- read.csv("PM10_Reg.csv")
summary(origData$PM10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.80   14.40   20.30   25.11   30.25  114.90

hist(origData$PM10)

[Figure: histogram of origData$PM10; x-axis: origData$PM10 (0 to 120), y-axis: Frequency (0 to 500).]

boxplot(origData$PM10)

[Figure: box plot of origData$PM10; y-axis from 0 to 100.]

plot(date, origData$PM10)

[Figure: scatter plot of origData$PM10 (y-axis, 0 to 100) against date (x-axis, 2013 to 2017).]

plot(dayOfYear, origData$PM10)

[Figure: scatter plot of origData$PM10 (y-axis, 0 to 100) against dayOfYear (x-axis, 0 to 300).]

We remove the unusable attribute and compute the day of the year for every example.

origData$Glob_radiation_min <- NULL
date <- as.Date(origData$Date)
dayOfYear <- as.numeric(format(date,"%j"))

We split the data chronologically into a training and a test set.

sel <- date < "2016-1-1"

train <- origData[sel,]

test <- origData[!sel,]

The "mean value" forecast

pred <- mean(train$PM10)
obs <- test$PM10

We assess the quality of the forecast with the mean absolute error (the average absolute difference between the predicted and the measured value).

mean(abs(obs-pred))

## [1] 12.37992

The forecasts can also be assessed with the mean squared error.

mean((obs-pred)^2)

## [1] 311.9037

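Measures for evaluating learning in regression

# mean absolute error
mae <- function(obs, pred)
{
  mean(abs(obs - pred))
}

# mean squared error
mse <- function(obs, pred)
{
  mean((obs - pred)^2)
}

# relative mean absolute error (relative to always predicting mean.val)
rmae <- function(obs, pred, mean.val)
{
  sum(abs(obs - pred)) / sum(abs(obs - mean.val))
}

# relative mean squared error (relative to always predicting mean.val);
# note that this is not the root mean squared error
rmse <- function(obs, pred, mean.val)
{
  sum((obs - pred)^2)/sum((obs - mean.val)^2)
}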
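As a small worked example, the sketch below evaluates the four measures on three invented observations. The functions are restated compactly so that the block runs on its own, and the value 20 merely plays the role of the training-set mean.

```r
# The four measures, restated compactly so that the sketch runs on its own.
mae  <- function(obs, pred) mean(abs(obs - pred))
mse  <- function(obs, pred) mean((obs - pred)^2)
rmae <- function(obs, pred, mean.val) sum(abs(obs - pred)) / sum(abs(obs - mean.val))
rmse <- function(obs, pred, mean.val) sum((obs - pred)^2) / sum((obs - mean.val)^2)

# Invented observations and predictions; 20 plays the role of the training mean.
obs_toy  <- c(10, 20, 30)
pred_toy <- c(12, 18, 33)
mean_toy <- 20

print(mae(obs_toy, pred_toy))             # (2 + 2 + 3) / 3
print(mse(obs_toy, pred_toy))             # (4 + 4 + 9) / 3
print(rmae(obs_toy, pred_toy, mean_toy))  # 7 / 20 = 0.35
print(rmse(obs_toy, pred_toy, mean_toy))  # 17 / 200 = 0.085

# Predicting the mean value itself gives a relative error of exactly 1.
print(rmae(obs_toy, rep(mean_toy, 3), mean_toy))
```

The relative measures are therefore easy to interpret: values below 1 mean the model beats the trivial "mean value" forecast, and values above 1 mean it does worse.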

Examples of regression models

We prepare the training and test sets.

myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear

train <- myData[sel,]
test <- myData[!sel,]

Linear model

linModel <- lm(PM10 ~ ., train)
linModel

##
## Call:
## lm(formula = PM10 ~ ., data = train)
##
## Coefficients:

## (Intercept) Glob_radiation_max Glob_radiation_mean

## -97.847517 -0.020467 0.043646

## Wind_speed_max Wind_speed_mean Wind_speed_min

## -1.930987 4.266583 -0.730581

## Wind_gust_max Wind_gust_mean Wind_gust_min

## -0.052030 -2.889754 -0.501599

## Precipitation_mean Precipitation_sum Pressure_max

## -2.300669 -0.017124 1.496791

## Pressure_mean Pressure_min Humidity_max

## -3.008344 1.695056 -0.391854

## Humidity_mean Humidity_min Temp_1500_max

## -0.109261 0.161037 -0.927552

## Temp_1500_mean Temp_1500_min Temp_site_max

## 5.632819 -2.712942 1.023707


## Temp_site_mean Temp_site_min dayOfYear

## -3.302149 -0.335961 -0.006636

pred <- predict(linModel, test)
obs <- test$PM10

mae(obs, pred)

## [1] 8.421626

mse(obs, pred)

## [1] 118.5413

rmae(obs, pred, mean(train$PM10))

## [1] 0.6802648

rmse(obs, pred, mean(train$PM10))

## [1] 0.3800574

Regression tree

library(rpart)

treeModel <- rpart(PM10 ~ ., train)
treeModel

## n= 866
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 866 222450.600 25.10566

## 2) Temp_site_max>=2.35 719 82378.800 21.02768

## 4) Temp_site_max>=7.55 545 39384.000 18.79541

## 8) Precipitation_mean>=0.03125 102 2920.188 11.91471 *

## 9) Precipitation_mean< 0.03125 443 30522.820 20.37968

## 18) Temp_1500_max< 15.45 389 25996.380 19.45964 *

## 19) Temp_1500_max>=15.45 54 1825.117 27.00741 *

## 5) Temp_site_max< 7.55 174 31772.870 28.01954

## 10) Wind_gust_mean>=2.03125 89 9681.488 22.06292 *

## 11) Wind_gust_mean< 2.03125 85 15627.110 34.25647

## 22) Pressure_max< 989.05 54 5752.155 28.95185 *

## 23) Pressure_max>=989.05 31 5708.570 43.49677

## 46) Temp_site_min>=3.05 17 1676.382 35.54706 *

## 47) Temp_site_min< 3.05 14 1653.235 53.15000 *

## 3) Temp_site_max< 2.35 147 69631.750 45.05170

## 6) Temp_1500_max< -3.85 76 17438.100 35.86316

## 12) Wind_gust_mean>=2.59375 23 4166.877 26.14348 *

## 13) Wind_gust_mean< 2.59375 53 10155.420 40.08113 *

## 7) Temp_1500_max>=-3.85 71 38908.520 54.88732

## 14) Temp_site_min>=-1.45 40 10627.410 42.93500 *

## 15) Temp_site_min< -1.45 31 15193.470 70.30968

## 30) Temp_site_min>=-3.5 16 5565.690 60.17500 *

## 31) Temp_site_min< -3.5 15 6231.444 81.12000 *


rpart.plot(treeModel)

[Figure: the regression tree drawn by rpart.plot(treeModel). The root node (mean 25, 100%) splits on Temp_site_max >= 2.4; further splits use Temp_site_max, Precipitation_mean, Temp_1500_max, Wind_gust_mean, Pressure_max and Temp_site_min.]

pred <- predict(treeModel, test)
obs <- test$PM10

mae(obs, pred)

## [1] 8.343685

mse(obs, pred)

## [1] 133.5494

rmae(obs, pred, mean(train$PM10))

## [1] 0.6739691

rmse(obs, pred, mean(train$PM10))

## [1] 0.4281752

Adding new attributes

The learning result can be improved by adding new, informative attributes.

Example of a new attribute: "heating season"

plot(train$dayOfYear, train$PM10)
abline(v=124, col="red")

abline(v=273, col="red")

[Figure: scatter plot of train$PM10 (y-axis, 0 to 100) against train$dayOfYear (x-axis, 0 to 300), with red vertical lines at days 124 and 273.]

as.numeric(format(as.Date("2016-5-3"),"%j"))

## [1] 124

as.numeric(format(as.Date("2016-9-29"),"%j"))

## [1] 273

heatingSeason <- dayOfYear <= 124 | dayOfYear >= 273
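To see what the flag looks like for individual days, the following sketch applies the same rule to a few invented dates from 2016 (the vectors dates, doy and heating are hypothetical names used only here).

```r
# A few invented dates from 2016 (a leap year).
dates <- as.Date(c("2016-01-15", "2016-05-03", "2016-07-01",
                   "2016-09-29", "2016-11-20"))

# Day of the year as a number between 1 and 366.
doy <- as.numeric(format(dates, "%j"))

# Same rule as in the text: heating season up to day 124 or from day 273 on
# (both boundary days included).
heating <- doy <= 124 | doy >= 273

print(data.frame(date = dates, dayOfYear = doy, heating = heating))
```

Only 1 July falls outside the heating season here. Keep in mind that the day numbers 124 and 273 were read off for a leap year; in other years the same calendar dates are one day earlier in the count.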

myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear
myData$heating <- heatingSeason

train <- myData[sel,]
test <- myData[!sel,]

treeModel <- rpart(PM10 ~ ., train)
treeModel

## n= 866

##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 866 222450.600 25.10566

## 2) Temp_site_max>=2.35 719 82378.800 21.02768

## 4) Temp_site_max>=7.55 545 39384.000 18.79541

## 8) Precipitation_mean>=0.03125 102 2920.188 11.91471 *

## 9) Precipitation_mean< 0.03125 443 30522.820 20.37968

## 18) Temp_1500_max< 15.45 389 25996.380 19.45964

## 36) heating< 0.5 243 8447.655 17.19218 *


## 37) heating>=0.5 146 14219.970 23.23356 *

## 19) Temp_1500_max>=15.45 54 1825.117 27.00741 *

## 5) Temp_site_max< 7.55 174 31772.870 28.01954

## 10) Wind_gust_mean>=2.03125 89 9681.488 22.06292 *

## 11) Wind_gust_mean< 2.03125 85 15627.110 34.25647

## 22) Pressure_max< 989.05 54 5752.155 28.95185 *

## 23) Pressure_max>=989.05 31 5708.570 43.49677

## 46) Temp_site_min>=3.05 17 1676.382 35.54706 *

## 47) Temp_site_min< 3.05 14 1653.235 53.15000 *

## 3) Temp_site_max< 2.35 147 69631.750 45.05170

## 6) Temp_1500_max< -3.85 76 17438.100 35.86316

## 12) Wind_gust_mean>=2.59375 23 4166.877 26.14348 *

## 13) Wind_gust_mean< 2.59375 53 10155.420 40.08113 *

## 7) Temp_1500_max>=-3.85 71 38908.520 54.88732

## 14) Temp_site_min>=-1.45 40 10627.410 42.93500 *

## 15) Temp_site_min< -1.45 31 15193.470 70.30968

## 30) Temp_site_min>=-3.5 16 5565.690 60.17500 *

## 31) Temp_site_min< -3.5 15 6231.444 81.12000 *

rpart.plot(treeModel)

[Figure: the regression tree drawn by rpart.plot(treeModel). The root node (mean 25, 100%) splits on Temp_site_max >= 2.4; the node Temp_1500_max < 15 is split further on heating = 0 into leaves with means 17 (28%) and 23 (17%).]

pred <- predict(treeModel, test)
obs <- test$PM10

mae(obs, pred)

## [1] 8.135822

mse(obs, pred)

## [1] 130.1761

rmae(obs, pred, mean(train$PM10))

## [1] 0.6571787


rmse(obs, pred, mean(train$PM10))

## [1] 0.4173601

Example of a new attribute: "temperature inversion"

tempInv <- origData$Temp_1500_max > origData$Temp_site_min
head(tempInv)

## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

boxplot(origData$PM10 ~ tempInv)

[Figure: box plots of origData$PM10 (y-axis, 0 to 100) for tempInv = FALSE and tempInv = TRUE.]

myData <- origData
myData$Date <- NULL
myData$dayOfYear <- dayOfYear
myData$heating <- heatingSeason
myData$tempInv <- tempInv

train <- myData[sel,]
test <- myData[!sel,]

treeModel <- rpart(PM10 ~ ., train)
treeModel

## n= 866

##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 866 222450.600 25.10566

## 2) Temp_site_max>=2.35 719 82378.800 21.02768

## 4) Temp_site_max>=7.55 545 39384.000 18.79541


## 8) Precipitation_mean>=0.03125 102 2920.188 11.91471 *

## 9) Precipitation_mean< 0.03125 443 30522.820 20.37968

## 18) Temp_1500_max< 15.45 389 25996.380 19.45964

## 36) heating< 0.5 243 8447.655 17.19218 *

## 37) heating>=0.5 146 14219.970 23.23356 *

## 19) Temp_1500_max>=15.45 54 1825.117 27.00741 *

## 5) Temp_site_max< 7.55 174 31772.870 28.01954

## 10) Wind_gust_mean>=2.03125 89 9681.488 22.06292 *

## 11) Wind_gust_mean< 2.03125 85 15627.110 34.25647

## 22) Pressure_max< 989.05 54 5752.155 28.95185 *

## 23) Pressure_max>=989.05 31 5708.570 43.49677

## 46) Temp_site_min>=3.05 17 1676.382 35.54706 *

## 47) Temp_site_min< 3.05 14 1653.235 53.15000 *

## 3) Temp_site_max< 2.35 147 69631.750 45.05170

## 6) tempInv< 0.5 105 31362.380 38.54190

## 12) Wind_gust_mean>=2.59375 24 4221.410 25.82917 *

## 13) Wind_gust_mean< 2.59375 81 22112.980 42.30864 *

## 7) tempInv>=0.5 42 22695.660 61.32619

## 14) Temp_site_mean>=-2.75 27 7322.336 50.32963

## 28) Wind_speed_mean>=0.625 18 3638.869 43.69444 *

## 29) Wind_speed_mean< 0.625 9 1306.080 63.60000 *

## 15) Temp_site_mean< -2.75 15 6231.444 81.12000 *

rpart.plot(treeModel)

[Figure: the regression tree drawn by rpart.plot(treeModel). The root node (mean 25, 100%) splits on Temp_site_max >= 2.4; the branch Temp_site_max < 2.4 splits on tempInv = 0, followed by Wind_gust_mean >= 2.6, Temp_site_mean >= -2.7 and Wind_speed_mean >= 0.63.]

pred <- predict(treeModel, test)
obs <- test$PM10

mae(obs, pred)

## [1] 8.076779

mse(obs, pred)

## [1] 121.2421


rmae(obs, pred, mean(train$PM10))

## [1] 0.6524095

rmse(obs, pred, mean(train$PM10))

## [1] 0.3887165

Random forest

library(randomForest)

rfModel <- randomForest(PM10 ~ ., train)
pred <- predict(rfModel, test)
obs <- test$PM10

mae(obs, pred)

## [1] 7.041154

mse(obs, pred)

## [1] 86.36103

rmae(obs, pred, mean(train$PM10))

## [1] 0.5687559

rmse(obs, pred, mean(train$PM10))

## [1] 0.2768837

The "today will be the same as yesterday" forecast

pred <- test$PM10[-length(test$PM10)]
obs <- test$PM10[-1]

mae(obs, pred)

## [1] 7.579651

mse(obs, pred)

## [1] 120.8301

rmae(obs, pred, mean(train$PM10))

## [1] 0.6134656

rmse(obs, pred, mean(train$PM10))

## [1] 0.387831

Time series

Predicting the amount of particulate matter can be posed as a time-series modelling problem.

vals <- train[,"PM10"]
n <- nrow(train)

We build the training set so that each row contains four consecutive measurements of the particle concentration (lag4 to lag1) together with the measurement that follows them (target):

lagged_train <- data.frame(lag4=vals[1:(n-4)], lag3=vals[2:(n-3)], lag2=vals[3:(n-2)], lag1=vals[4:(n-1)], target=vals[5:n])
lagged_train[1:10,]

## lag4 lag3 lag2 lag1 target

## 1 51.4 44.3 49.0 61.3 38.9

## 2 44.3 49.0 61.3 38.9 30.3

## 3 49.0 61.3 38.9 30.3 26.8

## 4 61.3 38.9 30.3 26.8 28.5

## 5 38.9 30.3 26.8 28.5 67.6

## 6 30.3 26.8 28.5 67.6 32.9

## 7 26.8 28.5 67.6 32.9 43.4

## 8 28.5 67.6 32.9 43.4 23.0

## 9 67.6 32.9 43.4 23.0 31.6

## 10 32.9 43.4 23.0 31.6 29.7

The test set is built in the same way:

vals <- test[,"PM10"]
n <- nrow(test)

lagged_test <- data.frame(lag4=vals[1:(n-4)], lag3=vals[2:(n-3)], lag2=vals[3:(n-2)], lag1=vals[4:(n-1)], target=vals[5:n])

We build the model.

lagged.rf <- randomForest(target ~ ., lagged_train)
pred <- predict(lagged.rf, lagged_test)
obs <- lagged_test$target

plot(obs, type="l")
points(pred, type="l", col="red")

[Figure: line plot of the observed values obs (black) and the predictions pred (red) against Index (0 to 350); y-axis from 20 to 100.]

We assess the quality of the predictions.

mae(obs, pred)

## [1] 7.236303

mse(obs, pred)

## [1] 118.9229

rmae(obs, pred, mean(train$PM10))

## [1] 0.5959766

rmse(obs, pred, mean(train$PM10))

## [1] 0.3947286
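The lagged training and test sets in this section are built with the same indexing pattern twice. As a possible simplification, the sketch below wraps the construction in a helper based on the base R function embed(); the helper name make_lagged and its argument k are hypothetical and do not appear in the course material.

```r
# Hypothetical helper: build a data frame of k lagged values plus the target.
# embed(x, k + 1) returns rows (x[t], x[t-1], ..., x[t-k]), so the first
# column is the target and the remaining columns are lag1, ..., lagk.
make_lagged <- function(vals, k = 4) {
  m <- embed(vals, k + 1)
  colnames(m) <- c("target", paste0("lag", 1:k))
  # Reorder the columns to lagk, ..., lag1, target, as in the text.
  data.frame(m[, c(paste0("lag", k:1), "target"), drop = FALSE])
}

# First six training values, copied from the output above.
vals_toy <- c(51.4, 44.3, 49.0, 61.3, 38.9, 30.3)
lagged_toy <- make_lagged(vals_toy)
print(lagged_toy)
```

The two resulting rows agree with the first two rows of lagged_train shown earlier. The helper needs at least k + 1 values, and like the original code it assumes that consecutive rows are consecutive measurements.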

Recurrent neural networks

Before first use, TensorFlow must be installed with the command "install_keras()". Installation instructions:
https://keras.rstudio.com/reference/install_keras.html

library(keras)

This time we will use the "tanh" activation function. We will therefore normalize the data to the interval [-1,1].

minV <- min(train$PM10)
maxV <- max(train$PM10)

train.scaled <- 2 * ((train$PM10 - minV) / (maxV - minV)) - 1
range(train.scaled)

## [1] -1 1
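Predictions made on the scaled data would have to be mapped back to the original units before they can be compared with measured concentrations. The source does not show that step at this point, so the inverse transformation below is an assumption, illustrated on invented values (scale_pm, unscale_pm and x are hypothetical names).

```r
# Invented values standing in for train$PM10.
x <- c(2, 6, 10)
min_x <- min(x)
max_x <- max(x)

# Same transformation as in the text: map [min_x, max_x] onto [-1, 1].
scale_pm <- function(v) 2 * ((v - min_x) / (max_x - min_x)) - 1

# Hypothetical inverse: map values from [-1, 1] back to the original units.
unscale_pm <- function(s) (s + 1) / 2 * (max_x - min_x) + min_x

x_scaled <- scale_pm(x)
print(x_scaled)              # -1 0 1
print(unscale_pm(x_scaled))  # 2 6 10
```

As with the [0,1] scaling earlier, the minimum and maximum should come from the training data and be reused unchanged for the test data.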

We will build the training set in the following form: input = particle concentration at time (t); output = particle concentration at time (t+1).
