The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man’s mind.
– James Clerk Maxwell (1850)
Suppose some dark night a policeman walks down a street, apparently deserted. Suddenly he hears a burglar alarm, looks across the street, and sees a jewelry store with a broken window. Then a gentleman wearing a mask comes crawling out through the broken window, carrying fistfuls of expensive jewelry. The policeman doesn’t hesitate at all in deciding that this gentleman is dishonest. But by what reasoning process does he arrive at this conclusion?
The Organon of Aristotle presents deductive reasoning (apodeixis) as the repeated composition of two strong syllogisms:
1. If A is true, then B is true. A is true. Therefore, B is true.
2. If A is true, then B is true. B is false. Therefore, A is false.
Rarely do we have adequate information for this kind of reasoning. Instead we fall back on weaker syllogisms (epagoge): our evidence is not enough to establish truth with certainty, but by ruling out some possible explanations of the evidence it shifts how confident we are in the conclusions.
3. If A is true, then B is true. B is true. Therefore, A is more plausible.
4. If A is true, then B is true. A is false. Therefore, B is less plausible.
But our cop did not use even these weak syllogisms; he relied on a still weaker major premise:
5. If A is true, then B becomes more plausible. B is true. Therefore, A is more plausible.
The cop’s reasoning carries strong convincing power, almost as though it were deductive. We reason not only about whether something is more or less plausible, but about the degree of plausibility. We also make use of context, through prior information about past experience.
For example, suppose these sorts of things happen several times every night to every cop. And in every instance the cause was completely innocent. Very soon, ideally, police would learn to ignore these events.
You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that!
– John von Neumann (1948)
In this course we will consider mathematical models which reproduce mechanisms of thinking. We will construct these by prescribing a definite set of operations that algorithmically manipulate information, corresponding to quantitative versions of the weak syllogisms. In other words, we will compute with a calculus of probabilities in the presence of incomplete and uncertain information.
It helps to reason about a hypothetical robot, rather than our own thinking. For one, it is easier to revise and describe clearly. How should we define our robot’s behaviour for representing and reasoning about information?
- Degrees of plausibility are represented by real numbers. Infinitesimally greater plausibility corresponds to an infinitesimally greater number.
- Qualitative correspondence with common sense.
- Consistency: if a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
As we saw, the strong syllogisms are too restrictive to apply to real world reasoning problems.
However, we will recover deductive logic as a special case.
We are interested in logical connections, not physical causation.
The major premise “If A is true, then B is true” is not interpreted as “A is the physical cause of B.” For example, what would that interpretation say about the second strong syllogism? “not-B is the physical cause of not-A” is nonsense. There is work on causal reasoning, but it is much more involved, in part because of the treatment of contrapositives (e.g. not-B implies not-A holds logically, while the corresponding causal statement does not).
Statistics
- Bespoke / theoretically motivated / human interpretable models
- ad hoc procedures, summarizing methods, and tools
- Rigorously proven and critically analyzed results (Kolmogorov)
Probability theory is nothing but common sense reduced to calculation.
– Laplace, 1819
TL;DR: probability theory as extended logic
- The product rule relates the logical product $AB$ to the plausibilities of $A$ and $B$ separately. It allows us to infer $p(AB | C)$, derived by breaking the statement down into elementary steps:
  - the plausibility of $B | C$;
  - given $B$ is true, the plausibility of $A | BC$.
  Combining them, $p(AB | C) = p(A | BC)\, p(B | C)$.
- The sum rule captures that the plausibility of $A$ and that of its denial $\bar{A}$ must sum to certainty: $p(A | C) + p(\bar{A} | C) = 1$. More generally, the plausibility of either $A$ or $B$ (the logical sum $A + B$) is $p(A + B | C) = p(A | C) + p(B | C) - p(AB | C)$.
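As a quick sanity check, here is a minimal Python sketch; the joint table of plausibilities is made up purely for illustration and is not from the notes:

```python
import numpy as np

# Hypothetical joint plausibilities p(A, B | C) for binary propositions A and B.
# Rows index A in {False, True}, columns index B in {False, True}.
p_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_A = p_AB.sum(axis=1)   # p(A | C), marginalizing over B
p_B = p_AB.sum(axis=0)   # p(B | C), marginalizing over A

# Product rule: p(AB | C) = p(A | BC) p(B | C)
p_A_given_B = p_AB[1, 1] / p_B[1]            # p(A=true | B=true, C)
assert np.isclose(p_A_given_B * p_B[1], p_AB[1, 1])

# Sum rule: p(A | C) + p(not-A | C) = 1
assert np.isclose(p_A[1] + p_A[0], 1.0)

# Extended sum rule: p(A + B | C) = p(A | C) + p(B | C) - p(AB | C)
p_A_or_B = 1.0 - p_AB[0, 0]                  # complement of "neither A nor B"
assert np.isclose(p_A_or_B, p_A[1] + p_B[1] - p_AB[1, 1])
print("product and sum rules check out")
```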
Following these simple rules we can derive the strong syllogisms as a limiting case of plausible reasoning as the robot becomes more certain of its conclusions.
We can derive weak syllogism 3 (if A is true, then B is true; B is true; therefore, A is more plausible) using the product rule.
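Sketching the standard argument: rearranging the product rule gives

$$p(A | BC) = p(A | C)\,\frac{p(B | AC)}{p(B | C)},$$

and the major premise makes $B$ certain when $A$ is true, so $p(B | AC) = 1$ while $p(B | C) \le 1$. Hence $p(A | BC) \ge p(A | C)$: learning that $B$ is true can only raise the plausibility of $A$.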
We can also express the even weaker, policeman’s syllogism: the major premise “if A is true, then B becomes more plausible” reads $p(B | AC) > p(B | C)$, and the same rearrangement of the product rule then gives $p(A | BC) > p(A | C)$.
Why can’t we stop at statements about a handful of propositions?
In general, we have variables $x$. The assumption here is that these variables, e.g. supervised learning data in the form of (input, label) pairs, were generated by some process that can be represented by a probability distribution $p_{\text{true}}(x)$. We will try to model this “true” distribution by introducing a parameterized family of distributions $p_\theta(x)$. However, we do not have access to $p_{\text{true}}$ except through samples: our data/evidence/observations $x \sim p_{\text{true}}$.
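To make the setup concrete, here is a minimal Python sketch; the mixture standing in for $p_{\text{true}}$ and the Gaussian family for $p_\theta$ are illustrative assumptions, not anything prescribed by the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for p_true: we may draw samples from it, but we pretend we
# cannot inspect it directly (here, an arbitrary two-component mixture).
def sample_p_true(n):
    comp = rng.integers(0, 2, size=n)
    return np.where(comp == 0,
                    rng.normal(-2.0, 1.0, size=n),
                    rng.normal(1.0, 0.5, size=n))

x = sample_p_true(1000)  # our data / evidence / observations, x ~ p_true

# A parameterized family p_theta: a single Gaussian with theta = (mu, sigma).
def log_p_theta(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
```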
This course will investigate how to build these kinds of thinking machines by discussing:
- How should we specify $p_\theta$?
- What does it mean for $p_\theta$ to “best match” $p_{\text{true}}$?
- How can we find the best parameters $\theta$ from our observations $x \sim p_{\text{true}}$? (A small sketch follows this list.)
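One hedged answer to the third question, under the Gaussian family assumed in the sketch above: choose $\theta$ to maximize the average log-likelihood of the observations (later in the course this choice will be connected to minimizing a divergence from $p_{\text{true}}$). For a Gaussian the maximizer is available in closed form:

```python
import numpy as np

def fit_gaussian_mle(x):
    """Closed-form maximum-likelihood parameters of a Gaussian family."""
    mu_hat = x.mean()        # maximizes the average log-likelihood in mu
    sigma_hat = x.std()      # MLE uses the 1/n variance (ddof=0)
    return mu_hat, sigma_hat

# Placeholder observations standing in for x ~ p_true.
x = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=1000)
print(fit_gaussian_mle(x))
```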
With this perspective we can consider common machine learning tasks. It is helpful to distinguish between the different kinds of variables that are common in our problems:
- input data, $X$
- discrete outputs, $C$, e.g. “labels”
- continuous outputs, $Y$
If we model a joint distribution over these variables, these tasks reduce to computing conditionals:
- Regression: $p(Y | X) = \frac{p(X,Y)}{p(X)} = \frac{p(X,Y)}{\int p(X,Y)\, dY}$
- Classification: $p(C | X) = \frac{p(X,C)}{p(X)} = \frac{p(X,C)}{\sum_C p(X,C)}$ (see the sketch below)
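For instance, here is a minimal classification sketch; the discrete joint table is made up purely for illustration:

```python
import numpy as np

# Hypothetical joint distribution p(X, C): rows index a discrete input X,
# columns index the class label C. All entries sum to 1.
p_XC = np.array([[0.10, 0.05],
                 [0.20, 0.15],
                 [0.05, 0.45]])

x_observed = 1                                    # index of the observed input value

p_X = p_XC.sum(axis=1)                            # p(X) = sum_C p(X, C)
p_C_given_X = p_XC[x_observed] / p_X[x_observed]  # p(C | X) = p(X, C) / p(X)

print(p_C_given_X)   # -> [0.5714..., 0.4285...] for the row chosen above
```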