Plausible Reasoning for Scientific Problems: Belief Driven by Priors and Data.

Plausible reasoning requires constructing rational arguments by use of syllogisms, and their analysis by deductive and inductive logic. Using this method of reasoning, and expressing our beliefs in a scientific hypothesis numerically using probability theory, is one of my interests. In this post I try to condense the material from the first 4 chapters of [1], starting with a very brief description of syllogisms, logic, probabilities and sampling distributions, and finishing with a problem of scientific hypothesis testing. I have used material from various sources, and some of it links with the previous posts on model comparisons and mixture models. The two books that can be looked at for more details are:

[1] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

[2] Lavine, M. (2013). Introduction to Statistical Thought. Retrieved from http://www.stat.duke.edu/~michael/book.html

Syllogisms

Syllogisms are systematic ways of presenting logical arguments, and appeal to our intuitive reasoning abilities to draw conclusions. Usually we have two premises, a major and a minor, that lead to a conclusion.

Strong Syllogism

if\; A\; is\; true,\; then\; B\; is\; true

\frac{A\;is\; true }{\therefore B\; is\; true}

In the inverse form:

if\; A\; is\; true,\; then\; B\; is\; true

\frac{B\;is\; false }{\therefore A\; is\; false}

Weak Syllogism

In common scientific problems, we have to deal with weaker syllogisms. Following the example from [1]:

A ≡ It will rain at 10 am at the latest

B ≡ It will be cloudy before 10 am

if\; A\; is\; true,\; then\; B\; is\; true

\frac{B\;is\; true }{\therefore A\; becomes\; more\; plausible}

The major premise if A then B, creates a logical environment, where B is a logical consequence of A (not the physical causal consequence). Verification of the consequences of A (in this logical environment) increases our confidence in A. Another form can be expressed as:

\frac{A\;is\; false }{\therefore B\; becomes\; less\; plausible}

Here one of the possible reasons for B being true has been eliminated, hence we feel less confident about B.

Boolean Algebra

The ideas of plausible reasoning can be represented as symbolic logic using Boolean algebra. The symbols and operations we will use are

Symbol   Name              Other Names
AB       Logical Product   Conjunction, And, Intersection
A+B      Logical Sum       Disjunction, Or, Union

Two propositions A and B are logically equivalent, written A = B, when one is true if and only if the other is true, i.e. they have the same truth value. In other words, A is then a necessary and sufficient condition for B.

Some propositions and their meanings include:

\bar A \equiv A\; is\; false \newline Relation \;between \; A\;, \bar A \; is \; reciprocal \newline A = \bar A \; is\; false

The relationship between two propositions can be expressed as an implication

A \Rightarrow B \newline A\bar B \; is \; false \newline (\bar A + B) \; is \; true \newline A = AB \newline if \;A \; is \; true \; then \; B \; is \; true\newline if \;B \; is \; false \; then \; A \; is \; false\newline A \; is \; false \; says \; nothing \; about \; B\newline B \; is \; true \; says \; nothing \; about \; A\newline
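A quick way to convince oneself of these relations is to enumerate the truth table. The sketch below is only an illustration (the variable names are my own, not from [1]): it checks that "A\bar B is false", "\bar A + B is true" and "A = AB" pick out exactly the same rows, and that within those rows a false A leaves B undetermined while a true B leaves A undetermined.

```r
# Enumerate all four truth assignments of A and B
tt <- expand.grid(A = c(TRUE, FALSE), B = c(TRUE, FALSE))

# Three ways of stating the implication A => B, evaluated row by row
tt$AnotB_false <- !(tt$A & !tt$B)           # A AND (NOT B) is false
tt$notA_or_B   <- !tt$A | tt$B              # (NOT A) OR B is true
tt$A_equals_AB <- tt$A == (tt$A & tt$B)     # A has the same truth value as AB

# Rows of the truth table that are consistent with A => B
implies <- tt[tt$AnotB_false, ]

# Within those rows: which values can B take when A is false,
# and which values can A take when B is true?
B_when_A_false <- unique(implies$B[!implies$A])
A_when_B_true  <- unique(implies$A[implies$B])

print(tt)
```

Only the row with A true and B false is excluded; in the remaining three rows both `B_when_A_false` and `A_when_B_true` contain TRUE and FALSE.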

Quantitative Rules

These rules have been formally derived in Chapter 2 Ref [1] and it is a recommended read. I will just mention the rules briefly.

Product Rule

AB | C : A and B are true given C; broken down into parts it can be written as:

  • B | C is true
  • accepting B is true, decide that A is true A | BC

OR

  • A | C is true
  • accepting A is true, decide that B is true B | AC

In functional form this can be written as: (AB | C) = F[(B|C), (A|BC)]

With 3 propositions, like (ABC | D), we can break the problem down into parts and use e.g. BC as a single proposition:

  1. (ABC | D)
  2. (BC | D) (A | BCD)
  3. (C | D) (B|CD) (A | BCD)

Sum Rule

The logical product A \bar A is always false, while the logical sum A + \bar A is always true.

Primitive sum rule: (A|B)+(\bar A| B) = 1

Extended or generalised sum rule: (A+B|C) = (A|C)+(B|C)-(AB |C)
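As a small numerical illustration of the product and sum rules, the sketch below uses a made-up joint probability table for two propositions A and B (the background information C is left implicit, and all numbers are assumptions chosen for the example).

```r
# Hypothetical joint probabilities (rows: A true/false, columns: B true/false)
joint <- matrix(c(0.30, 0.20,
                  0.10, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(A = c("T", "F"), B = c("T", "F")))

p_A  <- sum(joint["T", ])   # p(A)
p_B  <- sum(joint[, "T"])   # p(B)
p_AB <- joint["T", "T"]     # p(AB)

# Conditional probabilities obtained by renormalising a column or a row
p_A_given_B    <- joint["T", "T"] / sum(joint[, "T"])
p_notA_given_B <- joint["F", "T"] / sum(joint[, "T"])
p_B_given_A    <- joint["T", "T"] / sum(joint["T", ])

# Product rule: both factorisations recover p(AB)
product_1 <- p_B * p_A_given_B
product_2 <- p_A * p_B_given_A

# Primitive sum rule: p(A|B) + p(not A|B) = 1
primitive_sum <- p_A_given_B + p_notA_given_B

# Generalised sum rule: p(A+B) = p(A) + p(B) - p(AB)
p_A_or_B_direct <- sum(joint) - joint["F", "F"]   # every cell except (A false, B false)
p_A_or_B_rule   <- p_A + p_B - p_AB
```

Both factorisations of the product rule give the same joint probability, and the direct count of the cells where A or B is true agrees with the generalised sum rule.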

Qualitative Properties

The statements of Logic can be expressed in the form of product and sum rules.

A \Rightarrow B \newline C \;is \;the \;major \;premise \newline C \equiv A \Rightarrow B \newline if \;A \;is \;true \;then \;B \;is \;true \newline p(B|AC) = \frac{p(AB|C)}{p(A|C)}\newline \newline if \;B \;is \;false \;then \;A \;is \;false \newline p(A\mid \bar B C) = \frac{p(A\bar B \mid C)}{p(\bar B \mid C)}\newline \newline if \;B \;is \;true \;then \;A \;becomes \;more \;plausible \newline p(A\mid B C) = p(A \mid C)\frac{p(B \mid AC)}{p(B \mid C)}\newline \newline if \;A \;is \;false \;then \;B \;becomes \;less \;plausible \newline p(B\mid \bar A C) = p(B \mid C)\frac{p(\bar A \mid BC)}{p(\bar A \mid C)}\newline
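To see these four statements with numbers, here is a minimal sketch that assumes a hypothetical joint distribution consistent with the major premise C (so the combination "A true, B false" has probability zero). The values are invented for illustration only.

```r
# Hypothetical joint probabilities under the premise C: A => B
p_AB       <- 0.2   # A true,  B true
p_AnotB    <- 0.0   # A true,  B false (excluded by A => B)
p_notAB    <- 0.3   # A false, B true
p_notAnotB <- 0.5   # A false, B false

p_A <- p_AB + p_AnotB   # p(A|C)
p_B <- p_AB + p_notAB   # p(B|C)

# Strong syllogisms
p_B_given_A    <- p_AB / p_A            # A true  => B is certain
p_A_given_notB <- p_AnotB / (1 - p_B)   # B false => A is impossible

# Weak syllogisms
p_A_given_B    <- p_A * p_B_given_A / p_B           # B true  => A more plausible
p_notA_given_B <- p_notAB / p_B
p_B_given_notA <- p_B * p_notA_given_B / (1 - p_A)  # A false => B less plausible

c(p_A = p_A, p_A_given_B = p_A_given_B,
  p_B = p_B, p_B_given_notA = p_B_given_notA)
```

With these assumed numbers, learning B raises the plausibility of A from 0.2 to 0.4, and learning that A is false lowers the plausibility of B from 0.5 to 0.375.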

Elementary Sampling Theory

The rules we have available are the product and sum rules described earlier, and using the principle of indifference (just one of the many principles), if B is the background information while (H_{1}, H_{2}, .. H_{N}) are equally likely mutually exclusive and exhaustive hypotheses then

P(H_{i}\mid B) = \frac{1}{N} \; \; 1 \leq i \leq N

As an example:

  • B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
  • Ri ≡ Red ball on ith draw.
  • Wi ≡ White ball on ith draw.

For the first draw,

P(R_{1}\mid B) = \frac{M}{N} \newline P(W_{1}\mid B) = 1-\frac{M}{N}

These probability values should not be confused with physical properties of the urn; they represent a state of knowledge or information before any actual ball is drawn. This state of knowledge changes when a new question is asked: what is the probability of red on the first two draws? (use the product rule)

P(R_{1} R_{2} \mid B)=P(R_{1} \mid B)P(R_{2} \mid R_{1}B) \newline P(R_{1} R_{2} \mid B)=\frac{M}{N}\frac{M-1}{N-1}
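The product rule result can be checked by brute force. The sketch below assumes a small hypothetical urn (N = 10, M = 4, values chosen only for illustration), enumerates every ordered pair of distinct balls, and counts the fraction in which both are red.

```r
N <- 10   # total number of balls (assumed for the example)
M <- 4    # number of red balls (assumed for the example)

# Product rule: P(R1 R2 | B) = P(R1 | B) * P(R2 | R1 B)
p_product_rule <- (M / N) * ((M - 1) / (N - 1))

# Brute force: label the balls and list all ordered pairs of distinct balls
balls <- c(rep("R", M), rep("W", N - M))
pairs <- expand.grid(first = seq_len(N), second = seq_len(N))
pairs <- pairs[pairs$first != pairs$second, ]   # drawing without replacement

# Fraction of equally likely ordered pairs in which both balls are red
p_enumerated <- mean(balls[pairs$first] == "R" & balls[pairs$second] == "R")

c(product_rule = p_product_rule, enumerated = p_enumerated)
```

Both numbers equal 12/90, because 4 x 3 of the 10 x 9 equally likely ordered pairs are red-red.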

Working this way (you can see the full derivation in [1]) we can look at any sequence of draws. If the question becomes how many ways two red balls can be drawn in three draws, then we are looking at the multiplicity of this event: R_{1}R_{2}W_{3}, R_{1}W_{2}R_{3}, W_{1}R_{2}R_{3}

which can be calculated using the binomial coefficient

\binom{n}{r} = \frac{n!}{r!\,(n-r)!}

So the question can be posed as:

  • B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
  • A ≡ ‘r red balls in n draws, in any order’
  • P(A|B) ≡ P (r | N, M, n) ≡ Hypergeometric distribution (if we sample without replacement)
  •  P(A|B) ≡ P (r | N, M, n) ≡ Binomial distribution. (sample with replacement)
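Base R provides both sampling distributions, so the quantities above can be computed directly. The sketch below uses an assumed urn (N = 10, M = 4) and asks for r = 2 red balls in n = 3 draws. Note that `dhyper()` takes the number of red balls, the number of white balls and the number of draws, in that order.

```r
# Hypothetical urn and question, values chosen only for illustration
N <- 10   # balls in the urn
M <- 4    # red balls
n <- 3    # draws
r <- 2    # red balls asked for

# Multiplicity: number of orderings with r reds in n draws
n_orderings <- choose(n, r)

# Hypergeometric (without replacement), from the counting formula and from dhyper()
p_hyper_manual <- choose(M, r) * choose(N - M, n - r) / choose(N, n)
p_hyper        <- dhyper(r, M, N - M, n)

# Binomial (with replacement): each draw is red with probability M/N
p_binom <- dbinom(r, size = n, prob = M / N)

c(orderings = n_orderings, hypergeometric = p_hyper, binomial = p_binom)
```

For these values there are 3 orderings, the hypergeometric probability is 0.3 and the binomial probability is 0.288.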

These probability distributions are called sampling distributions or direct probabilities: Given some hypothesis (e.g. contents M, N of the urn), what is the probability that we shall obtain some specified data D (e.g. some sequence of red and white balls). These sampling distributions make predictions about potential observations – and if the correct hypothesis is known then the predictions and observed data will agree closely.

As another example for calculating probabilities of events [2] let:

  • X ≡ set of outcomes e.g. rolling a six-sided die {1, 2, 3, 4, 5, 6}
  • Y ≡ subsets of X
  • μ : Y → [0, 1] where the function μ has the domain Y and has an image from 0 to 1.

We can determine the function μ using simulation, logic or experience. Following the example from [2] (the section on Basic Probability), we want to calculate the probability of winning on the come-out roll in the game of craps (we need to roll a 7 or 11 using two six-sided dice).

P(win on come-out roll) = P(7) + P(11)

The example in the book [2] shows how to calculate this mathematically or via simulation.

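## simulate one come-out roll: returns 1 for a win (7 or 11), 0 otherwise
sim.game = function() {
  r = sample(1:6, 2, replace = TRUE)
  if (sum(r) == 7 || sum(r) == 11) return(1) else return(0)
}
## proportion of wins in 1000 simulated games
win = replicate(1000, sim.game())
sum(win)/1000
[1] 0.224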
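The simulated proportion can be compared with the exact answer obtained by logic: of the 36 equally likely outcomes for two dice, 6 sum to 7 and 2 sum to 11. A minimal sketch of that enumeration (my own addition, not taken from [2]):

```r
# All 36 equally likely totals for two six-sided dice
totals <- outer(1:6, 1:6, "+")

n_seven  <- sum(totals == 7)    # outcomes that sum to 7
n_eleven <- sum(totals == 11)   # outcomes that sum to 11

# P(win on come-out roll) = P(7) + P(11)
p_win <- (n_seven + n_eleven) / length(totals)
p_win
```

This gives 8/36, about 0.222, which the simulation approaches as the number of simulated games grows.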

Elementary Hypothesis Testing

In a general scientific problem, we already have the data and a set of hypotheses and want to decide which hypothesis is more likely when looking at the information. The problem can be written down as propositions:

X = prior \; information \newline H = hypothesis \newline D = data \newline \newline The \; joint\; parameter\; space\; is\; expanded\; using\; product\; rule\newline P(HD\mid X) = P(D \mid X) P(H \mid DX) = P(H \mid X) P(D \mid HX)\newline Performing\; some\; acrobatics,\; the\; equation\; becomes\newline P(H \mid DX) = P(H \mid X) \frac{P(D \mid HX)}{P(D \mid X)} .. Eq(1)

This is of course the famous Bayes theorem, and the left side of equation 1 is the posterior probability – I tend to think of this (as most times your hypothesis will consist of parameters) as the point in the vector space where your parameters converge when they are being restricted by the priors and the data. The factor in equation 1, P(D|HX) is a function with 2 names depending on the context: 1) sampling distribution when H is fixed and D varies; 2) likelihood function L(H) when the data is a fixed parameter while H changes. Ref [2] has a nice section on likelihood functions and should be read with the associated R code in the book.
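The two readings of P(D|HX) can be made concrete with a binomial model. In the sketch below (an illustration with assumed numbers, not code from [2]) the same `dbinom()` function is first evaluated over all possible data for a fixed parameter, and then over a grid of parameter values for fixed data.

```r
n <- 10   # number of trials (assumed for the example)

# 1) Sampling distribution: hypothesis fixed (theta = 0.3), data r varies
theta_fixed <- 0.3
sampling <- dbinom(0:n, size = n, prob = theta_fixed)
sum(sampling)   # a proper distribution over the data: sums to 1

# 2) Likelihood function: data fixed (r = 3 observed), theta varies
r_obs      <- 3
step       <- 0.01
theta_grid <- seq(0, 1, by = step)
lik        <- dbinom(r_obs, size = n, prob = theta_grid)

# The likelihood peaks at r_obs / n
theta_hat <- theta_grid[which.max(lik)]

# It is not a distribution over theta: its area is 1 / (n + 1), not 1
lik_area <- sum(lik) * step
c(theta_hat = theta_hat, lik_area = lik_area)
```

The likelihood is therefore only meaningful up to a constant factor; it is the prior and the normalising denominator in equation 1 that turn it into a posterior probability.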

I show a slightly modified example from Ref [1] (where this problem is solved in multiple ways)

  • X ≡ There are 15 machines, making widgets, and 10 of those machines make good quality widgets, 4 make average quality widgets and 1 is out of tune and makes mostly bad widgets. Each machine has a tuning parameter, which is the proportion of bad widgets, and we model that using a beta distribution.
  • M1 ≡ On average the machine produces bad widgets with a proportion of 0.16
  • M2 ≡ On average the machine produces bad widgets with a proportion of 0.49
  • M3 ≡ On average the machine produces bad widgets with a proportion of 0.83

We use equation 1 to estimate the posterior parameters for each hypothesis or model.

[Image: bayesEquation]

We can treat this problem in various ways: using a parameter estimation approach, calculating 95% high density intervals to assign a p-value, or using a mixture distribution approach as shown in previous posts, e.g. here and here. However, the statement of the problem also shows us that we can assign a prior probability or weight to each hypothesis. The code snippet below shows the priors for the 3 models and the prior weights for each model; the full code can be found in the github repository.

### define three machines
## each prior returns the log density of theta, the proportion of bad widgets
## model m1 - machine makes few defective widgets
m1 = function(th) dbeta(th, 1, 5, log = TRUE)

## model m2 - machine makes an average number of defective widgets
m2 = function(th) dbeta(th, 5, 5, log = TRUE)

## model m3 - machine makes almost everything bad
m3 = function(th) dbeta(th, 5, 1, log = TRUE)

## define an array that represents number of models in our parameter space
## each index has a prior weight/probability of being selected
## this can be thought of as coming from a categorical distribution
mix.prior = c(m1 = 10/15, m2 = 4/15, m3 = 1/15)
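One way to see how such weights can move with the data is a conjugate update. The sketch below is an assumption on my part and not the code from the repository: it treats each widget as a Bernoulli draw, uses the Beta(a, b) predictive probability a/(a+b) of a bad widget as each model's marginal likelihood for the next observation, and renormalises the weights with equation 1. The function name `update_weights` and the example data are hypothetical, so the output should not be expected to reproduce the results reported in this post.

```r
# Sequentially update model weights for a stream of 0/1 observations.
# w:     prior weights of the models (sum to 1)
# shape: matrix with columns "a" and "b", one row of Beta parameters per model
# obs:   vector of observations, 1 = bad widget, 0 = good widget
update_weights <- function(w, shape, obs) {
  for (x in obs) {
    # predictive probability of a bad widget under each model
    p_bad <- shape[, "a"] / (shape[, "a"] + shape[, "b"])
    lik   <- if (x == 1) p_bad else 1 - p_bad
    # Bayes theorem for the model weights: prior * likelihood / evidence
    w <- w * lik / sum(w * lik)
    # conjugate update of each model's Beta parameters
    shape[, "a"] <- shape[, "a"] + x
    shape[, "b"] <- shape[, "b"] + (1 - x)
  }
  w
}

# Beta parameters matching the three priors, and the prior weights
shape <- rbind(m1 = c(a = 1, b = 5),
               m2 = c(a = 5, b = 5),
               m3 = c(a = 5, b = 1))
w0 <- c(m1 = 10/15, m2 = 4/15, m3 = 1/15)

# Hypothetical data, made up for illustration
w_after_one  <- update_weights(w0, shape, 1)
w_after_five <- update_weights(w0, shape, c(1, 1, 0, 1, 1))
round(rbind(prior = w0, after_one = w_after_one, after_five = w_after_five), 2)
```

Under these assumptions a single bad widget already moves the weights from (10/15, 4/15, 1/15) to (10/27, 12/27, 5/27), since the models that expect more bad widgets predicted that observation better.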

As the data is generated, the posterior weights of the models change relative to each other, depending on whether we see a bad widget (1) or a good widget (0). The results below show the posterior weights as the total number of good or bad widgets changes along with the number of observations.

        m1   m2   m3 Data - 1=Bad
 [1,] 0.78 0.20 0.02            1
 [2,] 0.53 0.42 0.05            1
 [3,] 0.64 0.34 0.02            0
 [4,] 0.46 0.49 0.05            1
 [5,] 0.33 0.59 0.08            1
 [6,] 0.25 0.64 0.11            1
 [7,] 0.19 0.66 0.16            1
 [8,] 0.15 0.65 0.20            1
 [9,] 0.12 0.63 0.25            1
[10,] 0.09 0.60 0.30            1
[11,] 0.08 0.57 0.36            1
[12,] 0.06 0.53 0.41            1
[13,] 0.05 0.49 0.46            1
[14,] 0.04 0.44 0.51            1
[15,] 0.07 0.58 0.35            0
[16,] 0.09 0.67 0.24            0
[17,] 0.12 0.72 0.17            0
[18,] 0.14 0.74 0.12            0
[19,] 0.17 0.74 0.09            0
[20,] 0.15 0.75 0.11            1
[21,] 0.13 0.75 0.12            1
[22,] 0.12 0.74 0.14            1

The figure below shows the same numbers and how one hypothesis becomes more plausible while the others become less plausible as evidence accumulates; the points represent the prior weights for each hypothesis.

[Figure: posterior weight of each hypothesis as evidence accumulates; points mark the prior weights]

Interestingly, hypothesis M3 is a very rare event, or almost a dead hypothesis, that is resurrected [1] to a level where we start to consider it very plausible by the time we have 14 data points (posterior weight 0.51 in the table above), as we had a lot of bad widgets. This example shows an elegant way to compare multiple hypotheses with respect to each other, with our belief driven in different directions by the prior information and the data. In a more complex model with many parameters, the denominator of equation 1, i.e. P(D | X), can be difficult to calculate, and I would perhaps try an MCMC based approach in a single finite mixture model, rather than an optimisation based approach.
