
Univerza v Ljubljani

Fakulteta za elektrotehniko, Biotehniška fakulteta, Ekonomska fakulteta, Fakulteta za družbene vede,

Fakulteta za matematiko in fiziko, Fakulteta za računalništvo in informatiko, Medicinska fakulteta

Tina Košuta

A nested sampling algorithm for quantifying protein copy number

in super-resolution microscopy

Magistrsko delo

Magistrski študijski program druge stopnje Uporabna statistika

Mentor: izred. prof. dr. Mihael Perman
Somentor: dr. Carlo Manzo

Ljubljana, 2021


University of Ljubljana

Faculty of Electrical Engineering, Biotechnical Faculty, Faculty of Economics, Faculty of Social Sciences,

Faculty of Mathematics and Physics, Faculty of Computer and Information Science, Faculty of Medicine

Tina Košuta

A nested sampling algorithm for quantifying protein copy number

in super-resolution microscopy

Master's thesis

Master's programme of Applied Statistics

Mentor: izred. prof. dr. Mihael Perman
Co-mentor: dr. Carlo Manzo

Ljubljana, 2021


Acknowledgement

Many thanks to my advisers, Carlo Manzo, who proposed the idea for my master thesis and helped me with his extensive knowledge to better understand the issue at hand, and Mihael Perman, who advised me on the mathematical and statistical theory. Thank you both for your patience and guidance.

I would also like to thank my committee members, Maja Pohar Perme and Jaka Smrekar, who critically assessed my work and suggested further improvements.


Povzetek

Magistrsko delo obravnava težavo kvantifikacije proteinov v super-resolucijski mikroskopiji. Metode super-resolucijske mikroskopije so bistveno izboljšale ločljivost na način, da naenkrat vzbudijo le del fluorescenčnih molekul in jih slikajo. Ta proces je izveden večkrat zaporedoma, končna slika pa je sestavljena iz vseh zaporednih slik. Vsak signal je obdelan in shranjen kot lokacija molekule v prostoru. Kljub izboljšanju ločljivosti je meja difrakcije teh metod med 20 in 50 nm, zaradi česar so molekule, ki so si bližje od te razdalje, na končni rekonstruirani sliki vidne kot gruča. Gručenje proteinov je biološka lastnost, ki nas zanima.

Ker je posamezna gruča na končni sliki lahko sestavljena iz ene ali več ločenih molekul in vsaka od molekul lahko odda več signalov, kvantifikacija s preprostim štetjem posameznih gruč ni mogoča in problem ni lahko rešljiv.

Namen magistrskega dela je bil raziskati uspešnost “nested sampling” metode v primerjavi s predhodno razvito metodo v Zanacchi et al. (2017). V omenjenem članku so avtorji razvili metodo, ki je skupno število proteinov ocenila kot mešani model več oligomernih stanj. Delež vsakega oligomernega stanja je bil ocenjen na podlagi števila lokalizacij v vsaki gruči. Rezultati v Zanacchi et al. (2017) so bili pridobljeni z numerično optimizacijo. V tej magistrski nalogi smo izvedli “nested sampling” na podlagi zgoraj opisanega mešanega modela in primerjali oba pristopa na simuliranih in realnih podatkih. Cilj obeh pristopov je bil najprej oceniti število različnih oligomernih stanj in nato oceniti še njihov delež v mešanem modelu.

Metode so bile ovrednotene s simulacijsko študijo in na podatkih, pridobljenih s STORM slikanjem. Simulacijska študija je pokazala boljše delovanje “nested sampling”, zlasti pri manjših vzorcih, medtem ko pri večjih vzorcih večjih razlik med metodami ni bilo. V analizi STORM podatkov, kjer so bili vsi vzorci veliki (n > 1000), so si bile ocenjene porazdelitve iz obeh metod zelo podobne.

Ključne besede: super-resolucijska mikroskopija, kvantifikacija števila proteinov, STORM, nested sampling, Bayesova statistika, robno verjetje


Abstract

This master thesis addresses the issue of protein cluster quantification in super-resolution microscopy. Methods in super-resolution microscopy, typically referred to as single-molecule localization microscopy, have greatly improved the optical resolution by sequentially exciting only a fraction of the fluorescent molecules and imaging their signals over time. Each signal is processed and stored as a localization in space. The improved resolution has further demonstrated molecular clustering as a relevant biological feature for cellular function. However, each cluster can be composed of several molecules, and each of them is subject to the stochasticity of molecular labeling and the complex photophysics of the fluorescent probes. This leads to a broad distribution of the number of localizations for each cluster size, which hampers the exact quantification of cluster stoichiometry.

The aim of this master thesis was to investigate the performance of nested sampling compared to the previously developed method in Zanacchi et al. (2017).

In the said article, the authors developed a method based on numerical approximation which estimated the total protein count as a mixture model of several oligomeric states. The proportions of each oligomeric state were estimated based on the number of localizations in each cluster. In this master thesis, we implemented nested sampling based on the mixture model described above and compared both approaches on simulated and real data. The goal of both approaches was first to estimate the number of different oligomeric states in the model and then their corresponding proportions.

The methods were evaluated in a simulation study and on data generated from STORM imaging. The simulation study showed better performance of nested sampling, especially in smaller samples, while in larger samples the methods performed similarly. In the STORM image analysis, where all the samples were large (n > 1000), all fitted distributions were almost identical.

Key words: super-resolution microscopy, protein quantification, STORM, nested sampling, Bayesian statistics, evidence


Table of contents

1 Introduction

2 Methods
2.1 Super-resolution microscopy and the quantification of protein copy
2.1.1 Super-resolution microscopy methods
2.1.2 Spatial resolution and construction of high-resolution image
2.1.3 Quantification of protein copy number using DNA origami platform as calibration tool
2.1.4 Data structure and numerical optimization
2.2 Bayesian statistics and nested sampling
2.2.1 Bayes' theorem and the history of Bayesian statistics
2.2.1.1 Bayes' theorem
2.2.1.2 Bayesian inference
2.2.2 Estimation of evidence
2.2.3 Nested sampling method
2.2.3.1 The nested sampling procedure
2.2.4 Likelihood
2.2.5 Prior
2.2.6 Generating a new object by MCMC
2.2.7 Posterior distribution and estimation of model parameters
2.3 Final model and comparison between methods
2.3.1 Simulation study
2.3.2 Analysis of STORM data
2.4 Evaluation of methods

3 Results and discussion
3.1 Simulation study
3.1.1 Model selection
3.1.2 Proportions of oligomeric states
3.1.2.1 Prevalence of monomers
3.1.2.2 Prevalence of trimers
3.1.2.3 Prevalence of pentamers
3.2 Analysis of STORM data

4 Conclusion

Literature

A Nested sampling for STORM image analysis

List of Figures

2.1 Diffraction limit in light microscopy.
2.2 Target molecule, primary and secondary antibody.
2.3 Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Nature Methods, A DNA origami platform for quantifying protein copy number in super-resolution, Francesca Cella Zanacchi et al., 2017.
2.4 Example of data structure.
2.5 A. Simulated image of protein clusters. The numbers in the picture represent the number of proteins in each cluster. B. Simulated localizations corresponding to the clusters in picture A. The number of localizations for each cluster follows a log-normal distribution (f_n). C. The reconstructed super-resolution image based on image B. D. Number of localizations per cluster size following the log-normal distribution (f_n). [1] - Reproduced by permission of the PCCP Owner Societies.
2.6 Example of conditional probability.
2.7 Tree diagram.
2.8 Bayesian inference.
2.9 Likelihood function with area Z.
2.10 Relationship between ξ and L*. The restriction on the likelihood (L*) translates to the restriction in the prior space via the cumulant prior mass restraint (ξ*). The sampling of a new object is subjected to the shrunken prior space.
2.11 New objects in the shrunken prior space at each iteration. The iterations start at the bottom of the figure.
2.12 Posterior representatives scattered randomly over the area Z.
2.13 Prevalence of monomers: sample distribution and corresponding weighted distribution for each oligomeric state.
2.14 Prevalence of trimers: sample distribution and corresponding weighted distribution for each oligomeric state.
2.15 Prevalence of pentamers: sample distribution and corresponding weighted distribution for each oligomeric state.
3.1 Proportion of correctly identified number of parameters in the model for each scenario in the simulation study.
3.2 RMSE of the chosen number of parameters.
3.3 Bias of the chosen number of parameters.
3.4 Standard deviation of the chosen number of parameters.
3.5 Distribution of the proportion of oligomeric states in the scenario with prevalent monomers.
3.6 Distribution of the proportion of oligomeric states in the scenario with prevalent trimers.
3.7 Distribution of the proportion of oligomeric states in the scenario with prevalent pentamers.
3.8 Fit of estimated distributions to the data from Image 3.
3.9 Fit of estimated distributions to the data from Image 4.

List of Tables

2.1 Prevalence of monomers: Table of proportions for each oligomeric state.
2.2 Prevalence of trimers: Table of proportions for each oligomeric state.
2.3 Prevalence of pentamers: Table of proportions for each oligomeric state.
3.1 Estimated weights from the chosen model based on selection criteria for each image.

List of Acronyms

Acronym Description

STED stimulated emission depletion
RESOLFT reversible saturable optical linear (fluorescence) transitions
SMLM single-molecule localization microscopy
(F)PALM (fluorescence) photoactivated localization microscopy
STORM stochastic optical reconstruction microscopy
PSF point spread function
DNA deoxyribonucleic acid
GFP green fluorescent protein
BIC Bayesian information criterion
AIC Akaike information criterion
MCMC Markov chain Monte Carlo
AIS annealed importance sampling
TPR true positive rate
RMSE root-mean-square error

1 Introduction

Super-resolution fluorescence microscopy was one of the biggest advances in analytical methods in the field of light microscopy in the past decades. The idea was so impactful that the authors who contributed the most to its development were awarded the Nobel prize in chemistry in 2014 [2]. The main goal of several methods in super-resolution fluorescence microscopy is to overcome the diffraction limit of light. In light microscopy, objects closer to each other than the diffraction limit (ca. 200-300 nm) are impossible to resolve. The only imaging method able to surpass the diffraction limit of visible light in biological imaging is electron microscopy, a method which requires expensive equipment and well-trained personnel and is not suitable for imaging living cells because of its sample preparation. Progress in light microscopy therefore enables us to resolve smaller objects in biological images without causing much damage to the sample (with the possibility of in-vivo imaging), often with equipment already available in many laboratories.

The initial focus (i.e. the imaging of biological structures in greater detail) has quickly expanded to other areas of interest, such as the quantification of specific molecules in the sample. In the last 15 years, many different approaches and methods that estimate the number of copies of a specific molecule have been proposed, with various degrees of success. The methods of quantification can be roughly divided into two approaches. The first one estimates the number of molecules from the final reconstructed image, mostly using two-dimensional density histograms of molecule localizations or fluorescence intensity. The second one is based on the coordinates obtained from the localization process [3]. Some methods simply estimate the number of molecules by counting the number of localizations and adjust the final number by correcting for overcounting due to fluorophore blinking or for undercounting because of inefficient molecule detection [4], [3]. Other methods take into account the spatial distribution patterns, estimating the number of molecules using the pair-correlation function [5] or Ripley's K function [6]. Another possible approach based on spatial distribution analyzes and quantifies the localization clusters [7], [3].

In this master thesis, we will describe a new method based on Bayesian analysis that estimates the number of proteins in protein clusters as developed in [8], but providing a more objective quantification with respect to the one used in previous works. In the method developed in [8], the first step is clustering the localizations based on proximity. Then, the quantity of proteins is estimated from the distribution of the number of localizations in the clusters. The implementation of the nested sampling algorithm enables us to compare different (nested) models using the calculated evidence and to estimate the proportions of the oligomeric states.


2 Methods

2.1 Super-resolution microscopy and the quantification of protein copy

2.1.1 Super-resolution microscopy methods

In recent decades, fluorescence microscopy has been one of the main tools in biological and medical research. The ability to non-invasively stain target molecules with high accuracy and in multiple colors has offered many applications in the study of cellular organelles, molecular organisation, interactions, etc. [9], [7]. The main disadvantage of optical microscopy, including fluorescence microscopy, is the diffraction limit of the microscope, which was discovered by Ernst Abbe in 1873 [10]. Due to diffraction, objects separated by a distance smaller than approximately half the wavelength of visible light (∼200-300 nm) are indistinguishable, see Figure 2.1. This limit has been surpassed by super-resolution techniques, which will be briefly described below. For further introductory reading about the methods see [9], [10], [7].

For the separation of fluorescent molecules below the diffraction limit, another dimension was needed - time. The key to distinguishing fluorescent molecules below the diffraction limit is to excite only part of the molecules, preventing all the molecules in the same diffraction volume (i.e. the volume where separation is impossible due to the diffraction limit) from being detected simultaneously. Most methods in super-resolution fluorescence microscopy take advantage of the transition between the 'on' and 'off' states of fluorescent molecules, such that only part of the molecules are excited and recorded at once. The process is repeated several times until the whole sample is imaged. From the individual images, the locations of the fluorescent molecules are obtained and later mapped to produce the final image. Despite the higher resolution, the resolution limit of these methods is 20-50 nm, which means that some smaller biological molecules, e.g. proteins, that are closer together are not individually countable, but appear as clusters in the final image. Numerous techniques have been developed to count the number of molecules in the sample, as described previously.

Figure 2.1: Diffraction limit in light microscopy.

The methods of super-resolution microscopy are divided into two approaches, deterministic and stochastic [10]. With deterministic methods, for example stimulated emission depletion (STED) [11] or reversible saturable optical linear (fluorescence) transitions (RESOLFT) [10], the sample is scanned at defined spatial coordinates using a doughnut-shaped light beam, which consists of a central light responsible for the excitation of molecules and an outer light for the depletion of fluorescence, so that only molecules in the inner circle are excited and recorded with high spatial precision. The doughnut-shaped light beam is needed because the area of a single (excitation) light beam would be too large, exciting multiple fluorescent molecules in a small area at once and preventing the detection of the molecules separately. The sample scanning is very fast, which makes these techniques particularly useful for recording images in time.

On the other hand, single-molecule localization microscopy (SMLM) methods such as (fluorescence) photoactivated localization microscopy [(F)PALM] [12], [13] or stochastic optical reconstruction microscopy (STORM) [14] stochastically enable only one fluorescent molecule in a diffraction-limited volume to be in the 'on' state and emit fluorescence until it is turned off or photo-bleached. Afterwards, a different molecule is turned 'on' and localized. The process is repeated until the mapping of the fluorescent molecules is completed.

2.1.2 Spatial resolution and construction of high-resolution image

A high-resolution image is reconstructed from the positions of the fluorophores obtained from the sequential images. The spatial resolution of the final image depends on the labeling density, the precision of the single-molecule localization and the reconstruction method [7], [3].

The labeling density has to be sufficiently high to provide good resolution. The Nyquist criterion, which depends on the pixel size and the signal-to-noise ratio (i.e. the ratio between the power of the signal and the power of the background noise), states that the distance between two individual localizations must be smaller than half of the desired resolution [15]. In case of low labeling density, the resolution is lower, resulting in inaccurate images and loss of detail. The labeling density should be taken into account before the imaging, since it is crucial in the sample preparation step [15], [7].

The first step in the reconstruction of the final image is to determine the precise location of a single molecule from the sequential images [15], [7], [3]. The image of a single point object (in our case a fluorophore) is blurred because of diffraction. The level of smearing depends on the quality of the imaging system but cannot be avoided. Several other factors affect the signal of a single molecule in such a way that its location is not easily determined. The biggest issue is that the signal from a single molecule is normally distributed around its original location (i.e. shot noise), which means that the signal of a single molecule appears at slightly different coordinates in different images [16]. Another problem is background noise, especially from fluorescent probes which are out of focus, or any background light, since this may lead to systematic bias in the final localization estimation [16].

Several software tools are available to tackle the problem of single-molecule localization (a list of some methods can be found at: http://bigwww.epfl.ch/palm/software/). The main idea used in those methods is to estimate the center of a single molecule using the point spread function (the PSF is a function of the imaging system and describes the shape of the smear obtained after imaging a point source of light, i.e. a fluorescent probe). The most accurate models for an isotropic point source (i.e. one that emits photons equally in all directions) are the Richards-Wolf model and the Gibson-Lanni model [16]. Despite their accuracy, the calculations needed to obtain the coordinates of the point emitter are very complex, thus many researchers approximate the PSF of the imaging system with a two-dimensional Gaussian function. The approximation is good for the in-focus part of the PSF, but can be poor in its tails. It is also not suitable for approximating the 3D PSF. Depending on the further analysis, the obtained localizations are used to recreate the final image, or their coordinates are saved and further analysed (mostly used in quantification) [16].

2.1.3 Quantification of protein copy number using DNA origami platform as calibration tool

As mentioned, the idea for this master thesis comes from the method developed in Ref. [8], where the authors developed the calibration tool for quantification of proteins in STORM images using DNA origami labeled with small organic fluorophores.

In STORM, the target molecules in the sample are normally labeled by a primary antibody (the antibody that specifically binds to the chosen target in the sample) and a secondary antibody (the antibody that binds to the Fc region of the primary antibody and is marked with a fluorescent dye), as seen in Figure 2.2. The labeling efficiency therefore depends on three steps: the conjugation of the secondary antibody with the fluorescent dye, the binding of the primary antibody onto the target molecule and the binding of the secondary antibody onto the primary antibody, resulting in a stochastic process [8], [15], [7].

Figure 2.2: Target molecule, primary and secondary antibody.

The estimation of the total number of target protein copies in the sample is affected by all the labeling steps described previously and by the photophysics of the fluorescent probes. As a consequence of these events, the target protein clusters produce a broad distribution of the number of localizations. Nevertheless, the number of localizations per cluster will increase with the increasing number of protein copies inside the clusters. By using a reference with exactly known quantities, treated in the same way as the sample, the number of protein copies can be calibrated and used to estimate the total number of copies in the sample [8].

The method relies on DNA origami (i.e. a structure obtained by folding DNA to create specific two- and three-dimensional shapes at the nanoscale level) as a platform on which exact quantities of target molecules were bound [8]. The calibration tool was then treated and imaged under the same conditions as the sample, yielding the calibration curves that were later used to estimate the total number of protein copies in the sample. The structure of the DNA origami was designed in such a way that the desired number of molecules could bind to the handles projecting from the chassis. In this way, the number of fluorophores and the distance between them can be controlled. Firstly, STORM imaging was performed on samples where the small fluorescent dye Alexa Fluor 647 was bound to a complementary antihandle sequence. Therefore, the attachment efficiency of the fluorescent dye to the DNA origami via the antihandle sequence depended only on the sequence of the oligos and was independent of the fluorophore. Secondly, the experiment was replicated using a modified dimeric Saccharomyces cerevisiae dynein motor with one protomer containing a SNAP-tag (a polypeptide which connects the protein and the antihandle DNA sequence) and the other one containing green fluorescent protein (GFP). After the protein attachment to the DNA origami via the SNAP-tag, the region of GFP was labeled with the primary and the labeled secondary antibody and then imaged. More information can be found in [8].

Figure 2.3: Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Nature Methods, A DNA origami platform for quantifying protein copy number in super-resolution, Francesca Cella Zanacchi et al., 2017.

Localizations of single molecules were obtained using the Insight3 software. Insight3 first identifies each single molecule in the image by a fluorescence intensity higher than the threshold and then calculates the x and y coordinates by fitting a two-dimensional Gaussian function. The final image is reconstructed by plotting each Gaussian point with a width equal to the localization precision (9 nm). Molecules closer than the localization precision are indistinguishable and thus treated as a single molecule [8], [17].

In [8], the list of single-molecule localizations was used to create discrete localization images. These were created by counting the number of localizations falling into each pixel of size 20 nm. The pictures were then convolved with a two-dimensional kernel (7x7 pixels) to obtain the density maps. From the density maps, a binary image was obtained using a constant threshold (the threshold was determined by imaging unlabeled samples to determine the background signal values). Areas with a density lower than the threshold were assigned the value 0 and discarded from further analysis. Next, only components with six-connected neighbouring pixels and a signal value larger than the threshold were further analysed. The construction of the binary image was only an intermediate step to determine the regions containing localizations. Further analysis of those regions was done on the raw localization data. A more detailed description can be found in [8] and [17].

The connected regions were analysed to obtain clusters of the localization data. The initial values for the number of clusters and the corresponding centroid coordinates were calculated by a peak-finding routine from the local maxima of the density map within the connected regions. Molecular localizations were assigned to the clusters based on their distance from the cluster centroids. The cluster centroids were iteratively recalculated. The clustering algorithm was stopped when the sum of the squared distances between the localizations and the centroids of the associated clusters converged. The cluster centroid positions, the number of localizations in each cluster and the cluster size (calculated as the standard deviation of the localization coordinates from the cluster centroid) were stored for further analysis [8], [17].
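As a rough illustration of this assignment-and-update loop, the R sketch below implements it under assumed inputs (`locs`, an n x 2 matrix of localization coordinates, and `centroids`, the initial centres from the peak-finding step); it is a simplified stand-in, not the routine used in [8] or [17].

```r
# Sketch: assign localizations to the nearest centroid, recompute centroids,
# and iterate until the sum of squared distances converges.
assign_clusters <- function(locs, centroids, max_iter = 100, tol = 1e-6) {
  prev_ss <- Inf
  for (it in seq_len(max_iter)) {
    # squared distance of every localization to every centroid (n x k matrix)
    d2 <- sapply(seq_len(nrow(centroids)), function(k)
      (locs[, 1] - centroids[k, 1])^2 + (locs[, 2] - centroids[k, 2])^2)
    cl <- max.col(-d2)                              # index of the nearest centroid
    ss <- sum(d2[cbind(seq_len(nrow(locs)), cl)])   # clustering objective
    for (k in seq_len(nrow(centroids)))             # recompute the centroids
      if (any(cl == k))
        centroids[k, ] <- colMeans(locs[cl == k, , drop = FALSE])
    if (abs(prev_ss - ss) < tol) break              # convergence of the sum of squares
    prev_ss <- ss
  }
  n_loc <- tabulate(cl, nbins = nrow(centroids))    # localizations per cluster
  cluster_size <- sqrt(tapply((locs[, 1] - centroids[cl, 1])^2 +
                              (locs[, 2] - centroids[cl, 2])^2, cl, mean))
  list(cluster = cl, centroids = centroids, n_loc = n_loc, size = cluster_size)
}
```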

2.1.4 Data structure and numerical optimization

The only information needed for the model estimation of the protein copy number in [8] and this master thesis is the number of localizations in each cluster.


Figure 2.4: Example of data structure.

Based on the empirical data from both experiments described in 2.1.3, the authors of [8] found that the log-normal distribution (f_1, with µ = 3.3 and σ = 0.85) provides the best fit for the cluster of size one (i.e. a cluster which contains only one target molecule, in our case a protein monomer, n = 1). The distribution of the number of localizations in clusters with n > 1 was assumed to correspond to the n-fold convolution of f_1.

f_n = f_{n-1} * f_1    (2.1)

where * represents convolution and f_1 represents the log-normal distribution

f_1(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)    (2.2)

Figure 2.5: A. Simulated image of protein clusters. The numbers in the picture represent the number of proteins in each cluster. B. Simulated localizations corresponding to the clusters in picture A. The number of localizations for each cluster follows a log-normal distribution (f_n). C. The reconstructed super-resolution image based on image B. D. Number of localizations per cluster size following the log-normal distribution (f_n). [1] - Reproduced by permission of the PCCP Owner Societies.

The distribution of the number of localizations in the sample can be viewed as a mixture of oligomeric states. Therefore, it is possible to derive the oligomeric states present in the sample and their corresponding proportions by fitting the data to a linear combination of the calibration distributions (f_n). From that, the total number of protein copies can be estimated. Even though the logic seems straightforward, the analysis of such models is not easy. Several issues arise in estimating the parameters of finite mixtures, such as multiple local maxima and the need to know the number of components.
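For intuition, the short R sketch below simulates such data under assumed, purely illustrative proportions: each cluster's oligomeric state is drawn according to the weights, and its number of localizations is the sum of that many independent draws from the monomer distribution f_1 (µ = 3.3, σ = 0.85), i.e. a draw from the n-fold convolution.

```r
# Sketch: simulate the number of localizations per cluster from a mixture of
# oligomeric states, using the calibration parameters quoted above.
set.seed(1)
mu <- 3.3; sigma <- 0.85
weights <- c(0.6, 0.3, 0.1)        # illustrative proportions of 1-, 2- and 3-mers

simulate_counts <- function(n_clusters, weights, mu, sigma) {
  states <- sample(seq_along(weights), n_clusters, replace = TRUE, prob = weights)
  # an n-mer emits the sum of n independent monomer counts (n-fold convolution)
  sapply(states, function(n) round(sum(rlnorm(n, meanlog = mu, sdlog = sigma))))
}

counts <- simulate_counts(2000, weights, mu, sigma)
hist(counts, breaks = 100, main = "Simulated localizations per cluster")
```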

In [8], this was done by a two-step numerical minimization. The objective function to minimize combines the negative log-likelihood and an entropy term. The optimization is performed by varying Nmax (i.e. the largest number of oligomeric states in the model), which means that the sample is treated as a finite mixture of convoluted log-normal distributions. Since Nmax is unknown, the fits for different scenarios (i.e. models with various Nmax) need to be compared. A problem arises because the likelihood increases with the increasing number of parameters. Measures like the Bayesian information criterion (BIC) [18] or the Akaike information criterion (AIC) [19] penalise the likelihood with the number of parameters in the model to prevent overfitting. However, they control model selection better in larger samples, while they perform rather poorly with smaller samples. The fit can be improved if the sample and protein properties are taken into account, e.g. the structure of the protein (for example, the investigated protein unit occurs as a dimer), etc. For more details see [8].

In this master thesis, we will try to improve the method described above with the implementation of the nested sampling algorithm. However, with only one variable (the number of localizations) available, the problems are compounded, since the distributions of the different oligomeric states overlap. The aim of our model was not classification, but rather to estimate the proportion of each group and the total number of proteins in the sample. If other information were available and the separation of the groups were better, the problem could be solved using different methods such as clustering, classification or regression.

2.2 Bayesian statistics and nested sampling

2.2.1 Bayes’ theorem and the history of Bayesian statistics

Bayesian statistics is one of the two approaches (besides the more commonly known frequentist approach) in the field of statistics. Reverend Thomas Bayes discovered the theorem known today as Bayes' theorem, which carries the main idea of the field, in An Essay Towards Solving a Problem in the Doctrine of Chances, published posthumously in 1763 by his friend Richard Price. The ideas presented by Bayes were later used and developed mostly by Laplace in the 19th century, but after his death they fell out of favour. The interest in Bayesian methods reemerged in the mid-20th century, when the topic was researched by de Finetti, Jeffreys, Lindley, Savage and others. Despite the development of statistical inference methods based on Bayes' theorem, the Bayesian methods were considered troublesome due to philosophical and practical concerns [20].

The main philosophical issue has been the objectiveness (or rather, the lack of it) of assigning prior probabilities [21]. Generally, two people can have different beliefs about the same problem, therefore the assignment of the prior probability is subjective. In Bayesian statistics the prior probability describes our belief that something is true based on all the relevant information available up to that point. From that point of view, two people with the same information about the process will choose the same prior probability [21].

On the other hand, the practical concerns were more of a computational nature. Updating the prior probability based on our data to obtain the posterior probability is either hard to calculate or often impossible with our current mathematical knowledge (e.g. integration in higher dimensions) [21]. In the 20th century, the concept of conjugate priors was introduced by Howard Raiffa and Robert Schlaifer [20]. A conjugate prior yields a closed-form expression for the posterior distribution, meaning that the posterior distribution is obtained without numerical integration. This happens when the combination of the likelihood and the prior probability leads to a posterior probability distribution from the same family of distributions (e.g. the exponential family). However, conjugate priors are only applicable in a few specific scenarios. If there are no analytic results, numerical integration is needed.

The numerical estimation of the posterior probability can be obtained via Markov chain Monte Carlo (MCMC) methods, which are computationally intensive; therefore the advance of Bayesian statistics was closely linked to the development of computational power [21], [20]. The MCMC methods can sample from very complex probability distributions. The collected sample can be further analysed to estimate various distribution parameters (e.g. mean, variance, median, etc.). The most commonly used methods in Bayesian statistics from the MCMC family of algorithms are the Metropolis-Hastings algorithm (i.e. an algorithm for drawing samples from any probability distribution when a function proportional to the probability density is known and its values can be calculated [22]) and the Gibbs sampler (i.e. a special case of the Metropolis-Hastings algorithm for when the joint distribution is not known or hard to sample from, but the conditional distribution of each variable is easier to sample from [23], [24]) [21]. More about MCMC in general can be found in section 2.2.6.


2.2.1.1 Bayes’ theorem

Bayes' theorem is mathematically stated in equation 2.3. The equation describes the relationship between the conditional and marginal probabilities of events A and B.

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}    (2.3)

where:

• P(A|B) is the probability of event A given the outcome of B (conditional probability)

• P(B|A) is the probability of event B given the outcome of A (conditional probability)

• P(A) is the probability of event A (marginal probability)

• P(B) is the probability of event B (marginal probability)

The conditional probabilities (e.g. P(A|B)) in equation 2.3 can also be stated as

P(A|B) = \frac{P(B \cap A)}{P(B)}    (2.4)

where P(B ∩ A) is the probability that both events A and B occur. The conditional probability therefore depends on the probability of the intersection of events A and B and on the probability of the event B.

Figure 2.6: Example of conditional probability.

For example, there is a bucket with three balls inside. One of them is red and the other two are green. The probability of drawing a ball of a specific colour corresponds to the proportion of that colour in the bucket. Therefore, in the first draw, the probability of drawing a green ball is 2/3 and a red one 1/3. The first ball we draw is green. We do not return the ball to the bucket and then we draw the second ball. What is the probability that the second ball is red?

After the first draw there are two balls left, one of each colour, therefore the conditional probability for the second ball to be red at this point is 1/2, which can be seen in figure 2.7. The conditional probability that the second ball is red, given that the first ball drawn was green, is 1/2. Equation 2.4 in this case gives:

P(A = second ball is red | B = first ball is green) = \frac{P(B \cap A)}{P(B)}

P(B = first ball is green) = \frac{2}{3}

P(B \cap A) = P(first ball is green \cap second ball is red) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}

P(A = second ball is red | B = first ball is green) = \frac{1/3}{2/3} = \frac{1}{2}
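As a quick sanity check of this arithmetic, the following few lines of R simulate the two draws without replacement and estimate the same conditional probability empirically.

```r
# Monte Carlo check: draw two balls without replacement from {red, green, green}
# and estimate P(second is red | first is green).
set.seed(1)
draws <- replicate(1e5, sample(c("red", "green", "green"), 2))
first_green <- draws[1, ] == "green"
mean(draws[2, first_green] == "red")   # close to 1/2
```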

2.2.1.2 Bayesian inference

Figure 2.7: Tree diagram. The branch probabilities are conditional on the parent node.

In the setting of Bayesian inference, the interpretation of Bayes' theorem describes how the probability of a hypothesis is updated given the evidence. If A in equation 2.3 represents the hypothesis H and B represents the evidence E, then equation 2.3 can be rewritten as:

P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}    (2.5)

where

• P(H) is the probability of the hypothesis H (the prior probability)

• P(E|H) is the probability of the evidence given the hypothesis (the likelihood)

• P(H|E) is the updated probability of the hypothesis given the evidence (the posterior probability)

• P(E) has several names in the literature, e.g. marginal likelihood, integrated likelihood or evidence. In this master thesis, the expression evidence will be used, to harmonize the term with John Skilling, the author of the nested sampling algorithm. The value of the evidence is given by the likelihood function marginalised over the parameter space.


Let us return to the previous example of drawing balls from a bucket, but with a different question. First, a ball is drawn from the bucket, but we do not know its colour. The ball is not returned to the bucket and another ball is drawn. The colour of the second ball is known. Based on the outcome of the second draw, what can we say about the colour of the first ball? For example, let the second ball be green. What is the conditional probability that the first ball was red?

Figure 2.8: Bayesian inference.

Before drawing, the prior probability that the ball is red is proportional to the frequency of the red colour, P(H) = 1/3. The evidence in our case is that the second ball is green. The probability of obtaining such evidence is P(E) = 2/3, since the second ball drawn is equally likely to be any of the three balls and two of them are green. The likelihood of the evidence (the second ball is green) given the hypothesis (the first ball is red) is P(E|H) = 1, because if the red ball is drawn in the first round only the green balls remain in the bucket, therefore the probability that the second ball will be green is 1. We can calculate the posterior probability as:


P(H = first ball is red) = \frac{1}{3}

P(E = second ball is green) = \frac{2}{3}

P(E = second ball is green | H = first ball is red) = 1

P(H = first ball is red | E = second ball is green) = \frac{1 \cdot 1/3}{2/3} = \frac{1}{2}

In general, methods in Bayesian statistics mainly focus on updating the prior distribution based on our data to obtain the posterior distribution for parameter estimation. Usually, the posterior distribution is estimated via MCMC methods and is proportional to the product of the prior and the likelihood. The evidence in this case plays the role of a normalising constant of the posterior probability and it is not necessary to estimate it in order to analyse the posterior probability distribution.

P(H|E)∝P(E|H)·P(H) (2.6)

However, the evidence plays an important role in model comparison. When the data is modeled with two or more competing models, the Bayes factor (i.e. the ratio between the evidences of two models) can be used as the model selection criterion. Therefore, when several models are proposed for modeling the data, this value can give us important insights, but its calculation is generally very hard.
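For example, with the log-evidences of two competing models in hand (the numbers below are hypothetical, purely for illustration), the Bayes factor is just the exponentiated difference:

```r
# Sketch: comparing two models by their (log-)evidence, as returned e.g. by
# the nested sampling runs described later. The values are illustrative.
logZ1 <- -1523.4   # hypothetical log-evidence of a model with 2 oligomeric states
logZ2 <- -1519.8   # hypothetical log-evidence of a model with 3 oligomeric states
bayes_factor <- exp(logZ2 - logZ1)   # evidence ratio in favour of the second model
bayes_factor
```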

2.2.2 Estimation of evidence

As previously mentioned, the estimation of the evidence and the posterior is usually challenging because of the high-dimensional integrals, which are hard or impossible to solve analytically with our current knowledge. This problem is mostly tackled by numerical approximations. Therefore, the estimation of the evidence has not been deeply investigated and the developed methods usually calculate it as a by-product of the posterior probability estimation.


Before diving into the evidence estimation, let us rewrite equation (2.3) in a more formal way. Given the model M, the parameters of the model θ and the data D, Bayes' theorem written in equation (2.7) describes how the probabilities are updated for the given data, giving the posterior probability, with the evidence as the normalising constant in the denominator on the right-hand side.

P(\theta|D, M) = \frac{P(D|\theta, M) \cdot P(\theta|M)}{P(D|M)}    (2.7)

Most Bayesian inference methods rely on the estimation of the (unnormalized) posterior probability using one of the MCMC sampling methods. In the following section, methods of evidence estimation based on MCMC methods will be described. Some methods using different approaches can be found in Doucet and Jasra (2006), Parise and Welling (2007), Rue, Martino and Chopin (2009), Wyse, Friel and Rue (2011), etc. [25]. An overview of 19 methods for evidence estimation in the field of Bayesian phylogenetics can be found in [26].

The first attempt at estimating the evidence comes from Tierney and Kadane (1986), via Laplace's method [27]. If the posterior distribution can be adequately approximated with a normal distribution and is highly peaked around its mode, the product of the prior and the likelihood can be approximated with a multivariate normal distribution. Integration of this approximation gives us an estimate of the evidence. Since the method makes multiple strong assumptions and only holds in very specific settings, it is rarely used.

In 1994, the idea of the harmonic mean estimator was presented by Newton and Raftery [28]. The idea is easy to implement since it is based on a sample drawn from the posterior distribution (θ^(1), ..., θ^(N)). For each value θ^(i), we calculate the likelihood and estimate the evidence as the harmonic mean of the likelihoods.

P(D|M) \approx \left( \frac{1}{N} \sum_{i=1}^{N} P(D|\theta^{(i)}, M)^{-1} \right)^{-1}    (2.8)

Despite its simplicity, the method is not widely accepted in the statistical community. The harmonic mean estimator often has infinite variance, which is not a desirable behaviour for any estimator, as we cannot estimate how close we are to the true value. Also, because the estimation is based on samples drawn from the posterior distribution (which is usually peaked around some value), the estimator is insensitive to changes in the prior probability, even though it is known that the value of the evidence is very sensitive to changes in the prior.
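The R sketch below illustrates the estimator of equation (2.8) on a toy conjugate model chosen purely for this example (so that the exact evidence is available for comparison); it is not part of the thesis pipeline.

```r
# Toy illustration of the harmonic mean estimator (2.8): y_i ~ N(theta, 1)
# with prior theta ~ N(0, 1). The posterior is known in closed form here,
# so no MCMC is needed to obtain the posterior sample.
set.seed(1)
y <- rnorm(20, mean = 1.5, sd = 1)
n <- length(y)

post_var  <- 1 / (n + 1)                          # exact posterior variance
post_mean <- post_var * sum(y)                    # exact posterior mean
theta <- rnorm(5000, post_mean, sqrt(post_var))   # "posterior sample"

loglik <- sapply(theta, function(t) sum(dnorm(y, t, 1, log = TRUE)))
# harmonic mean of the likelihoods, computed on the log scale for stability
x <- -loglik
logZ_hm <- -(max(x) + log(mean(exp(x - max(x)))))

# exact log-evidence of this conjugate model, for comparison
logZ_exact <- -0.5 * n * log(2 * pi) - 0.5 * log(n + 1) -
  0.5 * (sum(y^2) - sum(y)^2 / (n + 1))
c(exact = logZ_exact, harmonic_mean = logZ_hm)
```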

Chib (1995) [29] proposed a method which can be applied to the output of the Gibbs sampler [23], [24]. The estimation comes from rearranging the equation (2.7) to equation (2.9).

P(D|M) = \frac{P(D|\theta, M) \cdot P(\theta|M)}{P(\theta|D, M)}    (2.9)

where the (logarithm of the) prior and the likelihood can usually be evaluated in closed form, and the (logarithm of the) posterior is estimated from the draws of the Gibbs sampler. This approach can be extended to higher dimensions for all models where the posterior distribution can be sampled with the Gibbs sampler. Later (in 2001), the method was extended by Chib and Jeliazkov to the output of the Metropolis-Hastings algorithm. This method is one of the most widely adopted in practice.

Annealed importance sampling (AIS) was developed in 2001 by Neal [30]. The algorithm uses tempering to define an importance sampling function which is used to approximate the posterior distribution. It is the importance sampling algorithm extended by using a Markov transition kernel (e.g. a Metropolis-Hastings kernel). The estimator of the evidence is calculated as the average of the importance weights from the algorithm. AIS performs well for complicated posteriors (including multi-modal posteriors) and is a popular choice in practice.

The power posteriors method, developed in 2008 by Friel and Pettitt [31], is based on thermodynamic integration. They introduced the term power posterior for the tempered distribution defined in equation (2.10),


P_t(\theta|D, M) \propto \{P(D|\theta, M)\}^{t} \cdot P(\theta|M)    (2.10)

where t \in [0, 1] and P(D|M)_t = \int_{\theta} \{P(D|\theta, M)\}^{t} \cdot P(\theta|M) \, d\theta. When t = 0, the integral simplifies to the integral of the prior, giving 1, and when t = 1 the integral equals the true model evidence. The (logarithm of the) model evidence therefore equals

\log P(D|M) = \log \left\{ \frac{P(D|M)_{t=1}}{P(D|M)_{t=0}} \right\} = \int_{0}^{1} E_{\theta|t} \left[ \log P(D|\theta, M) \right] dt    (2.11)

The interval t \in [0, 1] is discretized and the evidence is estimated via the trapezoidal rule (equation 2.12).

\log P(D|M) \approx \sum_{j=1}^{m} (t_j - t_{j-1}) \left( \frac{E_{j-1} + E_j}{2} \right)    (2.12)

The numerical integration introduces an error into our estimate, therefore one should be careful when deciding on the temperature map, i.e. the discretization of t.
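The sketch below illustrates the power-posterior estimate of equations (2.11)-(2.12) on the same toy conjugate model as in the harmonic mean sketch above (an assumption made so that the tempered posterior can be sampled directly instead of by MCMC); the temperature map is an illustrative choice.

```r
# Sketch of thermodynamic integration with power posteriors on the toy model
# y_i ~ N(theta, 1), theta ~ N(0, 1); the power posterior is normal here.
set.seed(1)
y <- rnorm(20, mean = 1.5, sd = 1); n <- length(y)   # same toy data as above

temps <- seq(0, 1, length.out = 21)^5      # temperature map, denser near t = 0
E_loglik <- sapply(temps, function(t) {
  prec <- t * n + 1                        # precision of the power posterior
  m    <- t * sum(y) / prec                # mean of the power posterior
  th   <- rnorm(5000, m, sqrt(1 / prec))   # draws from the power posterior
  mean(sapply(th, function(s) sum(dnorm(y, s, 1, log = TRUE))))
})
# trapezoidal rule over the temperatures (equation 2.12)
logZ_pp <- sum(diff(temps) * (head(E_loglik, -1) + tail(E_loglik, -1)) / 2)
logZ_pp    # compare with logZ_exact from the harmonic mean sketch
```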

Despite numerous efforts, the calculation of the evidence has remained a hard issue to solve throughout the years. Several different approaches based on MCMC methods were described above, with various levels of success in achieving the goal and in ease of implementation. In the next section, the nested sampling method developed by Skilling in 2004 will be presented.

2.2.3 Nested sampling method

As mentioned, the evidence plays a key role in model selection within Bayesian inference. In our case, the first step in protein quantification was to select the model with the correct number of parameters. This was done by calculating the evidence for each model and choosing the one with the largest evidence. The only difference between the models was the number of parameters, which was increased sequentially from one to Nmax (based on knowledge of the investigated protein and the protein arrangement). The evidence for the individual models was estimated by the nested sampling method.

If we rewrite equation (2.7) to arrange the terms which can be calculated at a given point (i.e. the prior and the likelihood) on the right-hand side and leave the unknown terms on the left-hand side, we obtain

P(\theta|D, M) \cdot P(D|M) = P(\theta|M) \cdot P(D|\theta, M)
Posterior \cdot Evidence = Prior \cdot Likelihood
p(\theta) \cdot Z = \pi(\theta) \cdot L(\theta)    (2.13)

where the prior (π(θ)) and posterior (p(θ)) probabilities are normalized. In the above equation Z represents the evidence and L(θ) the likelihood. As written before in section 2.2.1.2, the evidence is also known as the marginal likelihood, a term coming from the fact that the evidence can be calculated as the likelihood function integrated over the prior space (2.14). This was the main idea from which Skilling [32], [33], [21] started the development of nested sampling.

Z = \int_{\theta} L(\theta) \, \pi(\theta) \, d\theta    (2.14)

At first, the problem seems to be solvable with conventional methods of numerical analysis, but with increasing dimensions of the model the evaluation gets much harder. Skilling approached the issue by transforming the multidimensional integral over the prior space into a one-dimensional integral using the prior mass variable defined in Eq. 2.15 [33], [21], [32].

The cumulant prior mass (ξ) of elements with likelihood greater than λ is defined as ξ(λ) in Eq. 2.15.

\xi(\lambda) = \int_{L(\theta) > \lambda} \pi(\theta) \, d\theta    (2.15)

Here λ is the likelihood restraint and can range from 0 to the maximum likelihood (Lmax). The prior mass ξ by nature represents a proportion and is thus bounded in the interval between 0 and 1. By limiting the likelihood we obtain the proportion of the prior mass whose elements have likelihood larger than the likelihood restraint. At the minimum value λ = 0, ξ = 1 (i.e. with no restraint on the likelihood, the whole cumulant prior mass is available and therefore equals 1), and at the maximum value λ = Lmax, ξ = 0 (i.e. when λ equals the maximum likelihood, no prior mass is left). By increasing λ, ξ(λ) is (strictly) decreasing from 1 to 0, as seen in Fig. 2.9.

Figure 2.9: Likelihood function with area Z.

With the inverse function of ξ(λ), L(ξ(λ)) ≡ λ, the multidimensional integral for the calculation of the evidence is transformed into a one-dimensional integral. The evidence can thus be calculated as seen in Eq. 2.16.

Z = \int_{0}^{1} L(\xi) \, d\xi    (2.16)

The evidence in Eq. 2.16 could be numerically estimated as the weighted sum of the likelihood values L_j with weights representing the change in the prior mass (∆ξ_j = ξ_{j−1} − ξ_j), if the change of the prior mass were known at each point. Since the change in prior mass is not known for each point, the issue is not solvable numerically, but statistically, by combining MCMC and the restricted likelihood.


2.2.3.1 The nested sampling procedure

The main focus of nested sampling is the calculation of the evidence, which serves as a model selection criterion. In our case, the optimal number of parameters in the model is not known. At the beginning of the analysis, only the maximum number of parameters is chosen (Nmax). Then, the models with a sequentially increasing number of parameters are fitted with nested sampling. For each model we estimate the evidence and the values of the parameters. The model with the highest evidence is chosen as the final model. Therefore, nested sampling is convenient since it enables us to compare the models and estimate the parameters at the same time.

The algorithm uses a set of n objects x, which are randomly sampled from the prior space while being subjected to an evolving likelihood restraint, from now on denoted L*. As seen in Fig. 2.10, the likelihood is tightly linked to the cumulant prior mass, thus the restriction on the likelihood (L*) also holds in the space of the cumulant prior mass (ξ) as the cumulant prior mass restraint (ξ*). Therefore, sampling objects whose likelihood is larger than L* leads to random sampling in the shrunken prior space where ξ < ξ* [33], [21], [32]. The main idea is to move towards smaller values of ξ by shrinking the prior space with increasing L*, to locate the prior regions with high likelihood.

At the beginning we randomly sample n objects from the prior space. The object with the lowest likelihood is selected and its likelihood plays the role of the likelihood restraint (L*), which translates into the prior space restraint (ξ*). This object is discarded from the set of objects. The remaining n − 1 objects are still uniformly distributed in ξ, but are now also subject to ξ*. In the next step, a new object obeying L* is generated by MCMC (see 2.2.6) from a randomly chosen remaining object, replacing the object discarded in the previous step [33], [21], [32], see Fig. 2.11.

Figure 2.10: Relationship between ξ and L*. The restriction on the likelihood (L*) translates to the restriction in the prior space via the cumulant prior mass restraint (ξ*). The sampling of a new object is subjected to the shrunken prior space.

The name nested sampling comes from this part of the algorithm, because the new domain is nested within the old domain. By iteratively repeating the approach, where we use the object with the lowest likelihood as L*, the cumulant prior mass shrinks in each iteration. The shrinkage ratio (of the cumulant prior mass) between iterations is defined as t_j = ξ_j/ξ_{j−1}, with known probability distribution (f(t) = n t^{n−1}). On the logarithmic scale (log t) the mean of the shrinkage ratio equals µ_{log t} = −1/n and the standard deviation sd_{log t} = 1/n. Therefore, after k steps, the prior mass is estimated to shrink to ξ_k = ∏_{j=1}^{k} t_j, or on the logarithmic scale log ξ_k ≈ −k/n with standard deviation √k/n [33], [21].

As mentioned before, the exact change in the prior mass is not known in each iteration, but it can be evaluated statistically by using t. On average, we expect the prior space to shrink by µ_{log t} = −1/n on the logarithmic scale (nested sampling is usually carried out on the logarithmic scale because of numerical stability and easier computation). Therefore, the estimated evidence is calculated by using the trapezoidal rule to compute the one-dimensional integral as

Z = \sum_{j=1}^{k} w_j L_j = \sum_{j=1}^{k} \Delta\xi_j \, L_j = \sum_{j=1}^{k} (\xi_{j-1} - \xi_j) \, L_j    (2.17)

and on the logarithmic scale, approximating the change in the prior mass from µ_{log t} = −1/n (so that log ξ_j ≈ −j/n),

\log Z \approx \log \sum_{j=1}^{k} \exp\{ \log \Delta\xi_j + \log L_j \}    (2.18)

Figure 2.11: New objects in the shrunken prior space at each iteration. The iterations start at the bottom of the figure.

There is no rigorous rule for the termination of the algorithm. At the beginning, L increases faster than ξ decreases, but eventually L flattens out and the decrease of ξ becomes faster than the increase of L. Most of the evidence is usually found in the prior region around ξ = e^{−H}, where H denotes the entropy (i.e. the average value of the information from all the possible outcomes of the random variable), defined as H = ∫ P(ξ) log P(ξ) dξ with P(ξ) = L(ξ)/Z, and estimated as H ≈ ∑_k (∆ξ_k L_k / Z) log(L_k / Z). This measure gives us some guidelines about the remaining evidence after its recalculation in each iteration, but cannot serve as an exact termination criterion [33], [21], [32].

Since the calculation of the evidence is done numerically, we should address the numerical uncertainty of the estimate. The uncertainty is best expressed on the logarithmic scale and follows equation 2.19. More details about the calculation can be found in [21], [33]. Another approach consists of sampling t several times to obtain a distribution of evidence values. Given the distribution of the evidence, we can simply calculate whatever statistic we need (e.g. standard deviation, quantiles, ...) to express the uncertainty.

\log Z \approx \log \left( \sum_{j=1}^{k} w_j L_j \right) \pm \sqrt{\frac{H}{n}}    (2.19)

The variation of nested sampling implemented in this master thesis was inspired by the work in Ref. [34]. The authors present the value of the 'remaining' evidence, which is calculated in the same way as the evidence, but instead of using the likelihood value of the object with the lowest likelihood, it sums the likelihoods of all the remaining objects and scales them by the appropriate weight in each iteration. This measure gives an estimate of the remaining evidence which is yet to be discovered in the following iterations. The ratio between the 'remaining' evidence and the evidence was implemented as the termination criterion for our model. More details about the model used in this master thesis can be found in Section 2.3.
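To tie the pieces together, here is a compact, generic R sketch of the loop just described, including a termination rule based on the ratio of the 'remaining' evidence to the accumulated evidence. It is a simplified illustration under assumed interfaces (`loglik`, `rprior`, `explore`), not the exact implementation used in this thesis.

```r
# Sketch of the nested sampling loop. `rprior` draws one point from the prior,
# `loglik` evaluates its log-likelihood, and `explore` should return a new prior
# point whose log-likelihood exceeds `logL_min` (e.g. the constrained MCMC move
# of Section 2.2.6). The fallback of re-drawing from the prior is correct but
# becomes very inefficient once the prior mass has shrunk.
nested_sampling <- function(loglik, rprior, n_live = 100, max_iter = 5000,
                            explore = NULL, tol = 1e-4) {
  live   <- replicate(n_live, rprior(), simplify = FALSE)
  live_L <- sapply(live, loglik)
  logZ   <- -Inf
  logaddexp <- function(a, b) max(a, b) + log1p(exp(-abs(a - b)))
  logw <- log(1 - exp(-1 / n_live))       # width of the first prior-mass shell
  for (j in seq_len(max_iter)) {
    worst <- which.min(live_L)
    logZ  <- logaddexp(logZ, logw + live_L[worst])   # accumulate w_j * L_j
    # 'remaining' evidence still carried by the live points (termination rule)
    logZ_rem <- logw + max(live_L) + log(sum(exp(live_L - max(live_L))))
    if (logZ_rem - logZ < log(tol)) break
    logL_min <- live_L[worst]
    repeat {                              # replace the discarded object
      start <- live[[sample.int(n_live, 1)]]
      cand  <- if (is.null(explore)) rprior() else explore(start, logL_min)
      if (loglik(cand) > logL_min) break
    }
    live[[worst]] <- cand
    live_L[worst] <- loglik(cand)
    logw <- logw - 1 / n_live             # prior mass shrinks by exp(-1/n) per step
  }
  logZ
}
```

In our setting, `rprior` would draw a weight vector from the flat Dirichlet prior of Section 2.2.5, `loglik` would be the binned Poisson likelihood of Section 2.2.4, and `explore` (suitably wrapped with the likelihood function) the constrained MCMC move sketched in Section 2.2.6.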

2.2.4 Likelihood

As previously mentioned in section 2.1.4, the proposed model for our data is a finite mixture model composed of multiple convoluted log-normal distributions.

The idea of the model itself is simple, but the estimation of the parameters is quite challenging. The goal of our model is to estimate the proportion of each (convoluted) log-normal distribution.

Finite mixture models represent the overall population as a weighted sum of its subpopulations. A finite mixture model is a mixture probability distribution as written in 2.20, where the overall population probability distribution is denoted f(x) and each subpopulation is defined by its probability distribution (g_i(x)) and its weight (w_i, i.e. the proportion of values coming from that subpopulation in the overall population). In the equation, i represents any of the n subpopulations. Because the weights are proportions, all of them must lie on the interval [0, 1] (w_i ∈ [0, 1]) and their sum must equal 1 (w_1 + ... + w_n = 1). More about finite mixture models can be found in [35].


f(x) = \sum_{i=1}^{n} w_i \, g_i(x)    (2.20)

The density of a sum of independent random variables is the convolution of their densities; therefore, the distribution of a sum of multiple univariate log-normal variables is given by the convolution of log-normal distributions. The issue arises from the fact that the convolution of log-normal distributions has no closed-form expression.

The convolutions of the log-normal were numerically approximated. First, we created a whole-number sequence in the range [0, 10^6). Then we discretized the log-normal distribution, since the localizations can only be whole numbers. The discretized log-normal distribution was calculated as:

g_1(x) = F_1(x + 1) - F_1(x)    (2.21)

where x is any whole number from the previously mentioned sequence, F_1 represents the cumulative distribution function of the univariate log-normal distribution f_1 (see equation 2.2) and g_1 represents the discretized distribution f_1. The obtained g_1 distribution was used as the basis for the estimation of the probability distributions of the convolutions of the log-normal. The probabilities for each whole number in the convoluted log-normal distributions were estimated using the convolution theorem.

The theorem states that if the functions (in our case there is just one, which is convolved with itself) are integrable and have Fourier transforms, the Fourier transform of the convolution equals the product of the Fourier transforms [36]. The Fourier transformation was carried out in R [37] with the fft function, which uses the Fast Fourier Transform.

\[
\mathcal{F}\{g * g\} = \mathcal{F}\{g\} \cdot \mathcal{F}\{g\} \qquad (2.22)
\]
By applying the inverse Fourier transform we obtain the convolution of g with itself:
\[
g * g = \mathcal{F}^{-1}\{\mathcal{F}\{g\} \cdot \mathcal{F}\{g\}\} \qquad (2.23)
\]
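A sketch of this construction in R, following Eqs. (2.21)–(2.23): the log-normal parameters `meanlog` and `sdlog` stand in for the monomer calibration values, and the grid length `n_max` is set much smaller than the $10^6$ used in the thesis, purely for illustration.

```r
## Sketch: calibration curves for k-mers via discretization and FFT self-convolution.
## meanlog, sdlog : parameters of the monomer log-normal (placeholders)
## max_size       : largest oligomer size for which a curve is needed
calibration_curves <- function(meanlog, sdlog, max_size, n_max = 2^14) {
  x  <- 0:(n_max - 1)
  g1 <- plnorm(x + 1, meanlog, sdlog) - plnorm(x, meanlog, sdlog)   # Eq. (2.21)
  G  <- fft(g1)                                                     # Fourier transform
  curves <- vector("list", max_size)
  for (k in 1:max_size) {
    ## k-fold self-convolution: inverse transform of G^k, Eqs. (2.22)-(2.23);
    ## the grid must be long enough that the result has negligible mass near
    ## n_max, otherwise the circular FFT convolution wraps around
    gk <- Re(fft(G^k, inverse = TRUE)) / n_max
    curves[[k]] <- pmax(gk, 0)          # clip tiny negative numerical noise
  }
  curves
}

# curves <- calibration_curves(meanlog = 3, sdlog = 0.5, max_size = 4)
```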


The aim is to create a calibration curve for each cluster size in our model, so that we can compare our sample distribution with the calibration curves scaled by the appropriate weights. The goodness of fit is estimated via the extended maximum likelihood method. This likelihood is calculated on binned data, where each bin of the histogram is treated as an independent Poisson random variable with expectation λ. In our case the bins are of size 1, so no binning is needed after collecting the sample, as the number of localizations is an integer. The histogram likelihood can be formed as a product of Poisson likelihoods, as written in Eq. (2.24), where M is the number of bins, b indexes the bins, $n_b$ is the observed content of bin b, and $f_b$ is the expected bin content (i.e. the calibration curves scaled by the weights) [38], [39].

\[
L = \prod_{b=1}^{M} \frac{f_b^{\,n_b}}{n_b!}\, e^{-f_b} \qquad (2.24)
\]

Taking the logarithm,
\[
\log L = \sum_{b=1}^{M} \left( n_b \log f_b - f_b - \log n_b! \right) \qquad (2.25)
\]

In the practical application, the proposed weights (i.e. the sampled point in nested sampling) are first multiplied by the calibration curves to obtain $f_b$. Then $\sum_{b=1}^{M} n_b \log f_b$ is calculated on the sample data.
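A sketch of this likelihood evaluation, assuming `counts` is the histogram of localizations per cluster (bins of size 1) and `curves` is a matrix whose columns are the calibration curves evaluated on the same grid; scaling the expected bin contents by the total number of clusters is an assumption of this sketch, not a detail given in the text.

```r
## Sketch: binned Poisson (extended maximum likelihood) log-likelihood, Eq. (2.25),
## dropping the constant term sum(log(n_b!)) which does not depend on the weights.
## counts  : observed number of clusters in each bin (placeholder)
## curves  : matrix with one calibration curve per column, same grid as counts
## weights : proposed mixture weights (summing to 1)
binned_loglik <- function(weights, counts, curves) {
  f <- as.vector(curves %*% weights) * sum(counts)  # expected bin contents f_b
  f <- pmax(f, .Machine$double.xmin)                # guard against log(0)
  sum(counts * log(f)) - sum(f)
}
```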

To sum up, ideally our model would be a simple finite mixture, but since the convolution of log-normals has no closed-form expression, the workaround was to calculate the calibration curves and scale them by the proposed weights. This scaled distribution was then compared with our sample distribution via the extended likelihood method to estimate the goodness of fit. The aim was to find the weights that brought the scaled calibration curves as close as possible to our sample distribution.


2.2.5 Prior

The prior we chose for the weights is the Dirichlet distribution. The Dirichlet distribution (its probability density function with respect to Lebesgue measure is given in Eq. 2.26) is a multivariate generalization of the beta distribution, meaning that its marginal distributions follow the beta distribution [40].

\[
f(x) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \qquad (2.26)
\]

where

\[
B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)} \qquad (2.27)
\]

A special case of the Dirichlet distribution arises when all elements of the vector α are equal. When all $\alpha_i = 1$, the distribution is known as the flat Dirichlet distribution. Its special property is that all marginal distributions are uniform [40]. The univariate uniform distribution $X \sim U(0,1)$ is a special case of the univariate beta distribution $\mathrm{Beta}(\alpha, \beta)$ with parameters $\alpha = 1$ and $\beta = 1$. The flat Dirichlet prior therefore yields uniform marginal distributions for all weights, while taking the multidimensionality into account.

With this approach we are able to obey the restrictions set on the weights (the weights $w_1, \ldots, w_n$ must satisfy $w_i \in [0,1]$ and $\sum_{i=1}^{n} w_i = 1$) and obtain a multivariate non-informative prior. Non-informativeness in our case means that we did not include any specific information in the prior, treating every point as equally likely as any other.
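A draw from the flat Dirichlet prior can be generated by normalizing independent Gamma(1, 1) (i.e. standard exponential) variates; a minimal sketch:

```r
## Sketch: draw one sample of n weights from the flat Dirichlet prior
## (all alpha_i = 1) by normalizing independent Gamma(1, 1) variates.
rdirichlet_flat <- function(n) {
  g <- rgamma(n, shape = 1, rate = 1)
  g / sum(g)
}

# w0 <- rdirichlet_flat(4)   # e.g. initial weights for four oligomeric states
# sum(w0) equals 1 up to floating point; each w0[i] lies in [0, 1]
```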

2.2.6 Generating a new object by MCMC

Markov chain Monte Carlo (MCMC) methods are a family of algorithms that enable us to sample from many probability distributions. A Markov chain is a stochastic process in which the probability of the next step depends only on the current state. The process is therefore memoryless, meaning that future events do not depend on the history of events that led up to the current state, only on the current state itself [41]. Monte Carlo methods are algorithms that obtain numerical results by repeated random sampling; the values they produce are independent and identically distributed (i.i.d.) [41]. MCMC methods combine the Markov chain and Monte Carlo approaches, resulting in chains of autocorrelated but randomly generated values drawn from a non-normalised probability distribution. The chains start from a set of randomly chosen points ('walkers') which move around randomly based on the probabilities for each move; at each step, the probabilities for the next move are recalculated. From the simulated non-normalised probability distribution we can calculate all of its basic properties (e.g. mean, variance, etc.). The results are only approximate because the values are simulated - we obtain a numerical approximation of the probability distribution [41], [21].

There are many different approaches in the field of MCMC, resulting in various methods such as the Metropolis-Hastings algorithm, Gibbs sampling, slice sampling, multiple-try Metropolis, reversible-jump MCMC, Hamiltonian Monte Carlo, etc. In nested sampling, various MCMC methods can be applied [21]. In the early publications of Skilling [21], it is stated that a simple random walk constrained by the likelihood is sufficient, but other MCMC methods can nonetheless be applied. The chosen method must obey the likelihood constraint, i.e. a new object can only be accepted if its likelihood is larger than the limiting value $L^*$ [21].

In our model we used the basic MCMC algorithm as described in Skilling [21]. When a new object satisfying the likelihood constraint $L(x) > L^*$ is needed, we already have $n - 1$ objects (the remaining walkers) that fulfil that condition. Therefore, we can use one of the surviving objects as a starting point from which we generate a new independent object by MCMC. First, we copy a randomly selected surviving object and then add an increment randomly sampled from a normal distribution (with initial $\mu = 0$ and $\sigma = 0.1$). In the case of a multidimensional model, we may change one or several parameters at once.

In our case, only one parameter was perturbed by MCMC, but because of the normalization all parameters were changed in the final object. Because our values represent proportions, they should lie between 0 and 1 (see Section 2.2.5). In case any of the proposed values falls below 0 or above 1, it is reflected back into the desired range. Alternatively, such a point can be rejected automatically, since it falls outside the prior range. The objects are then normalized to satisfy the conditions set on the weights as described in Section 2.2.5. The new object is accepted if its likelihood is greater than $L^*$; otherwise there is no movement.

The σ of the normal distribution responsible for generating the increment should be large enough that the movement from the initial point is reasonably fast, but small enough to explore nearby values. Since the appropriate size of σ is hard to judge in advance and the prior space shrinks during the algorithm, an adjustable σ is implemented as proposed in [21] (page 194). Its value is adapted based on the number of accepted points during MCMC, targeting an acceptance ratio of 50%. A positive by-product of this implementation is the reduction of the number of MCMC steps - in [21], the author claims that 20 steps are sufficient to generate new objects.
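A sketch of this exploration step, combining the reflection into [0, 1], the renormalization of the weights, the likelihood constraint, and a Skilling-style step-size adaptation; `loglik` and `logLstar` are placeholders for the log-likelihood function of Section 2.2.4 and the current limiting value.

```r
## Sketch: generate a new object by a constrained random walk, starting from a
## randomly chosen surviving object. All argument names are placeholders.
explore <- function(start, loglik, logLstar, sigma = 0.1, n_steps = 20) {
  w <- start
  accepted <- 0; rejected <- 0
  for (s in seq_len(n_steps)) {
    proposal <- w
    i <- sample(length(w), 1)                  # perturb one weight at a time
    proposal[i] <- proposal[i] + rnorm(1, 0, sigma)
    ## reflect values falling outside [0, 1] back into the prior range
    proposal[i] <- abs(proposal[i])
    if (proposal[i] > 1) proposal[i] <- 2 - proposal[i]
    proposal <- proposal / sum(proposal)       # renormalize: weights sum to 1
    if (loglik(proposal) > logLstar) {         # likelihood constraint L > L*
      w <- proposal; accepted <- accepted + 1
    } else {
      rejected <- rejected + 1
    }
    ## adapt sigma towards a ~50% acceptance ratio, as suggested in [21]
    if (accepted > rejected) sigma <- sigma * exp(1 / accepted)
    if (accepted < rejected) sigma <- sigma / exp(1 / rejected)
  }
  list(weights = w, sigma = sigma)
}
```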

2.2.7 Posterior distribution and estimation of model parameters

In general, the representative sample is selected from the posterior distribution, which is calculated as the prior probability scaled by the likelihood value. Alternatively, the random sampling of the posterior can also be carried out over the area of Z, as seen in Figure 2.12. In that case, we already have the posterior representatives in the list of discarded objects (i.e. the objects with the lowest likelihood from each iteration) from nested sampling [21], [33]. Since the evidence value is already calculated at this point, the posterior probability for each point can be estimated as

\[
p_j = \frac{L_j w_j}{Z} \qquad (2.28)
\]

The posterior probabilities are then normalized by the maximum probability from the said set [21]. From the given set of posterior representatives, the maximum number of representatives in the posterior probability sample is estimated by the entropy (i.e. the Shannon channel capacity) [33] as


\[
N = \exp\left(-\sum_{j=1}^{m} p_j \log p_j\right) \qquad (2.29)
\]

After obtaining the appropriate sample size, the posterior distribution is sampled by randomly choosing a posterior representative and accepting it with its probability (i.e. we generate a random number from the uniform distribution $U(0,1)$ and accept the representative if its probability is higher than the random number).

The procedure is repeated until we reach the size of the posterior sample N [21].
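A sketch of this selection procedure, assuming `L_discarded` and `w_discarded` hold the likelihoods and prior-mass widths of the discarded objects and `Z` is the evidence (all placeholders); for numerical stability a real implementation would work with log-likelihoods.

```r
## Sketch: draw an (approximately) equally weighted posterior sample from the
## discarded objects, using p_j = L_j * w_j / Z (Eq. 2.28) and N from Eq. 2.29.
posterior_sample <- function(L_discarded, w_discarded, Z) {
  p  <- L_discarded * w_discarded / Z
  N  <- round(exp(-sum(p * log(p), na.rm = TRUE)))  # entropy-based sample size
  pr <- p / max(p)                                  # normalize by the maximum
  keep <- integer(0)
  while (length(keep) < N) {
    j <- sample(length(p), 1)
    if (runif(1) < pr[j]) keep <- c(keep, j)        # accept with probability pr[j]
  }
  keep   # indices of the selected posterior representatives
}
```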

Figure 2.12: Posterior representatives scattered randomly over the area Z.

The parameters of the posterior distribution can be estimated from the sample obtained in the previous step or directly from all discarded objects from nested sampling. The estimation from the sampled posterior is done as regular parameter estimation from a sample, most commonly by estimating the mean of the posterior probability. When using the full list of discarded objects, the mean and standard deviation of the desired quantity can be estimated from the first and second moments (if they exist) scaled by the weights obtained in Eq. 2.28 [33]. The mean and standard deviation of a desired property x are calculated as
\[
\langle x \rangle = \sum_j p_j x_j, \qquad \sigma_x = \sqrt{\sum_j p_j x_j^2 - \langle x \rangle^2}
\]
