Skip to content

Latest commit

 

History

History
211 lines (133 loc) · 7.95 KB

lec2.org

File metadata and controls

211 lines (133 loc) · 7.95 KB

Lecture 2: Probabilistic Reasoning

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.

– James Clerk Maxwell (1850)

Agenda

All Cops Are Bad at Logic

Suppose some dark night a policeman walks down a street, apparently deserted. Suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying fistfulls of expensive jewelry. The policeman doesn’t hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion?

Deductive Reasoning

Organon of Aristotle gives deductive reasoning (apodeixis) as the result of repeated composition of two strong syllogisms:

1. if A is true, then B is true. A is true. therefore, B is true. 2. if A is true, then B is true. B is false. therefore, A is false.

Rarely we have adequate information for this kind of reasoning. Instead we fall back to weaker syllogisms (epagoge). Our evidence is not enough to deduce truthiness with certainty. However, we eliminate possible explanations of the evidence and so we feel more confident about consequences.

  1. if A is true, then B is true. B is true. therefore, A is more plausible.
  2. if A is true, then B is true. A is false. therefore, B is less plausible.

But our cop did not use even those weak syllogisms, relying on an entirely weaker major premise

5. if A is true, then B becomes more plausible. B is true. therefore, A is more plausible.

The Power of Plausible Reasoning

The cop’s reasoning suggests a conclusion with strong convincing power, almost as though it was deductive.

Not only whether something is more or less plausible, but the degree of plausibility.

As well, we make use of the context through prior information about past experience.

For example, suppose these sorts of things happen several times every night to every cop. And in every instance the cause was completely innocent. Very soon, ideally, police would learn to ignore these events.

Thinking Computer

You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that!

– John von Neumann (1948)

In this course we will consider mathematical models which reproduce mechanisms of thinking. We will construct these by prescribing a definite set of operations, algorithmically manipulating information corresponding to quantitative versions of the weak syllogisms. We are computing, in the presence of incomplete and uncertain information, with a calculus of probabilities.

Meet Our Robot

It helps to reason about a hypothetical robot, rather than our own thinking. For one, it is easier to revise and describe clearly. How should we define our robot’s behaviour for representing and reasoning about information?

Desiderata

  1. degrees of plausibility are represented by real numbers. Infinitesimally more plausibility coressponds to infinitessimially greater number.
  2. qualitative correspondence with common sense.
  3. consistently. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.

Not Deductive Logic

As we saw, the strong syllogisms are too restrictive to apply to real world reasoning problems.

However, we will recover deductive logic as a special case.

Not Causal Reasoning

We are interested in logical connections not physical causiation.

The major premise If A is true, then B is true is not interpreted as A is the physical cause of B. For example what would that say about the second strong syllogism? not-B is the physical cause of not-A is non-sense.

There is work in causal reasoning, and it is much more involved due to the consideration of contrapositives, (e.g. not-A)

Not Statistics

Statistics

  • Bespoke / theoretically motivated / human interpretable models
  • ad hoc procedures, summarizing methods, and tools
  • Rigorously proven and critically analyzed results (Kolmogorov)

Calculus of Probabilities

Probability theory is nothing but common sense reduced to calculation.

– Laplace, 1819

TL;DR: probability theory as extended logic

Product Rule

  • Relates the logical product AB to the plausibility of A and B separately.
  • Allows us to infer $AB| C$

Derived by breaking the statement down into elementary sequences

  1. Plausibility of $B| C$
  2. Given B is true, plausibility of $A | BC$.

If $A$ and $B$ are independent then $$A ⊥ B \iff P(A B) = P(A) * P(B)$$

Sum Rule

Captures that the plausibility of either $A$ or not-$A$ occurring, represented by $A + \bar A$, is certain.

$$ P(A) + P(\bar A) = 1$$

Qualitative Properties

Following these simple rules we can derive the strong syllogisms as a limiting case of plausible reasoning as the robot becomes more certain of its conclusions.

We can derive the weak syllogism if A is true then B is true. B is true. therefore A is more plausible using the product rule $$p(A | B C) = p(A | C) \frac{p(B | A C)}{p(B | C)}$$ and since $p(B | A C)=1$ we have the weak syllogism $$p(A | B C) \geq p(A | C)$$

We can also express the even weaker, policeman’s syllogism:

if A is true, then B is more plausible reads $p(B | A C) > p(B | C)$ and by the product rule we can derive the conclusion $p(A | B C) > p(A | C)$

Prior Information

Why can’t we make these statements with just $A$ and $B$? What is the role of the context, $C$?

Building our Robot

In general, we have variables $X = (X_1, …, X_N)$ that are either observed or unobserved. We want a model that captures the relationship between these variables. The approach of probabilistic generative models is to relate all variables by a learned joint probability distribution $p_θ(X)$.

The assumption here is that the variables, e.g. supervised learning data in the form of (input, label) pairs, were generated by some process that can be represented by some probability distribution $X ∼ p_\text{true}(X)$.

We will try to model this “true” distribution by introducing a parameterized family of distributions, $p_θ$. “Learning” our model will be finding the values of the parameters $θ$ for this joint probability distribution. We want the parameters for the specified distribution $p_θ(X)$ to “best match” the underlying data distribution $p_\text{true}(X)$.

However, we do not have access to $ptrue$ except through samples, our data/evidence/observations.

This course will investigate how to build these kinds of thinking machines by discussing

  • How should we specify $p_θ$?
  • What it means for $p_θ$ to “best match” $ptrue$?
  • How can we find the best parameters $θ$ from our observations $x ∼ ptrue$?

Probabilistic Machine Learning

With this perspective we can consider common machine learning tasks. It is helpful to distinguish between the different kinds of variables that are common in our problems:

  • input data, $X$.
  • discrete outputs, $C$, e.g. “labels”
  • continuous outputs, $Y$

If we model a joint distribution over these variables $p_θ(X,C,Y)$ we can use this model for familiar ML tasks by performing inference

  • Regression: $p(Y | X) = \frac{p(X,Y)}{p(X)} = \frac{p(X,Y)}{∫ p(X,Y) dY}$
  • Classification: $p(C | X) = \frac{p(X,C)}{p(X)} = \frac{p(X,C)}{∑_C p(X,C)}$