$$~$ \newcommand{\true}{\text{True}} \newcommand{\false}{\text{False}} \newcommand{\bP}{\mathbb{P}} $~$$
[summary: $$~$ \newcommand{\true}{\text{True}} \newcommand{\false}{\text{False}} \newcommand{\bP}{\mathbb{P}} $~$$
We can represent a [joint_probability_distribution_on_event probability distribution] $~$\bP(A,B)$~$ over two [event_probability events] $~$A$~$ and $~$B$~$ as a square:
We could also represent $~$\bP$~$ by [factoring_probability factoring], so using $~$\bP(A,B) = \bP(A)\; \bP(B \mid A)$~$ we'd make this picture:
]
Say we have two [event_probability events], $~$A$~$ and $~$B$~$, and a [joint_probability_distribution_on_event probability distribution] $~$\bP$~$ over whether or not they happen. We can represent $~$\bP$~$ as a square:
So for example, the probability $~$\bP(A,B)$~$ of both $~$A$~$ and $~$B$~$ occurring is the ratio of [the area of the dark red region] to [the area of the entire square]:
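For concreteness (with made-up numbers, not necessarily the ones in the pictures): if the whole square has area $~$1$~$ and the dark red block has area $~$0.24$~$, then $~$\bP(A,B) = 0.24$~$. The other three blocks give $~$\bP(A, \neg B)$~$, $~$\bP(\neg A, B)$~$, and $~$\bP(\neg A, \neg B)$~$ in the same way, and the four areas add up to $~$1$~$.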
Visualizing probabilities in a square is neat because we can draw simple pictures that highlight interesting facts about our probability distribution.
Below are some pictures illustrating:
independent events (What happens if the columns and the rows in our square both line up?)
[marginal_probability marginal probabilities] (If we're looking at a square of probabilities, where's the probability $~$\bP(A)$~$ of $~$A$~$ or the probability $~$\bP(\neg B)$~$?)
conditional probabilities (Can we find in the square the probability $~$\bP(B \mid A)$~$ of $~$B$~$ if we condition on seeing $~$A$~$? What about the conditional probability $~$\bP(A \mid B)$~$?)
[factoring_probability factoring a distribution] (Can we always write $~$\bP$~$ as a square? Why do the columns line up but not the rows?)
the process of computing joint probabilities from [factoring_probability factored probabilities]
Independent events
Here's a picture of the joint distribution of two independent events $~$A$~$ and $~$B$~$:
Now the rows for $~$\bP(B)$~$ and $~$\bP(\neg B)$~$ line up across the two columns. This is because $~$\bP(B \mid A) = \bP(B) = \bP(B \mid \neg A)$~$. When $~$A$~$ and $~$B$~$ are independent, updating on $~$A$~$ or $~$\neg A$~$ doesn't change the probability of $~$B$~$.
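For instance, with illustrative numbers (not necessarily the ones in the picture): take $~$\bP(A) = 0.6$~$ and $~$\bP(B) = 0.3$~$ with $~$A$~$ and $~$B$~$ independent. Then every joint probability is a product of marginals:
$$~$\bP(A, B) = 0.6 \times 0.3 = 0.18, \qquad \bP(\neg A, B) = 0.4 \times 0.3 = 0.12\ ,$~$$
so $~$\bP(B \mid A) = 0.18 / 0.6 = 0.3 = 0.12 / 0.4 = \bP(B \mid \neg A)$~$: the dark row has the same height in both columns.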
For more on this visualization of independent events, see the aptly named Two independent events: Square visualization.
Marginal probabilities
We can see the [marginal_probability marginal probabilities] of $~$A$~$ and $~$B$~$ by looking at some of the blocks in our square. For example, to find the probability $~$\bP(\neg A)$~$ that $~$A$~$ doesn't occur, we just need to add up all the blocks where $~$\neg A$~$ happens: $~$\bP(\neg A) = \bP(\neg A, B) + \bP(\neg A, \neg B)$~$.
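With made-up numbers: if $~$\bP(\neg A, B) = 0.1$~$ and $~$\bP(\neg A, \neg B) = 0.3$~$, then
$$~$\bP(\neg A) = 0.1 + 0.3 = 0.4\ ,$~$$
which is just the total area of the blue column.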
Here's the probability $~$\bP(A)$~$ of $~$A$~$, and the probability $~$\bP(\neg A)$~$ of $~$\neg A$~$:
Here's the probability $~$\bP(\neg B)$~$ of $~$\neg B$~$:
In these pictures we're dividing by the area of the whole square. Since the probability of anything at all happening is 1, we could leave that denominator out, but keeping it will be helpful for comparison when we look at conditional probabilities next.
Conditional probabilities
We can start with some probability $~$\bP(B)$~$, and then assume that $~$A$~$ is true to get a conditional probability $~$\bP(B \mid A)$~$ of $~$B$~$. Conditioning on $~$A$~$ being true is like restricting our whole attention to just the possible worlds where $~$A$~$ happens:
Then the conditional probability of $~$B$~$ given $~$A$~$ is the proportion of these $~$A$~$ worlds where $~$B$~$ also happens:
If instead we condition on $~$\neg A$~$, we get:
So our square visualization gives a nice way to see, at a glance, the conditional probabilities of $~$B$~$ given $~$A$~$ or given $~$\neg A$~$:
We don't get such nice pictures for $~$\bP(A \mid B)$~$:
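To put numbers on this (illustrative, not necessarily matching the pictures): suppose $~$\bP(A, B) = 0.24$~$, $~$\bP(A, \neg B) = 0.36$~$, $~$\bP(\neg A, B) = 0.1$~$, and $~$\bP(\neg A, \neg B) = 0.3$~$. Then
$$~$\bP(B \mid A) = \frac{\bP(A, B)}{\bP(A)} = \frac{0.24}{0.24 + 0.36} = 0.4\ ,$~$$
which we can read straight off the red column. By contrast,
$$~$\bP(A \mid B) = \frac{\bP(A, B)}{\bP(B)} = \frac{0.24}{0.24 + 0.1} \approx 0.71$~$$
requires collecting the dark blocks from both columns, which is why the picture isn't as tidy.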
Factoring a distribution
Recall the square showing our joint distribution $~$\bP$~$:
Notice that in the above square, the reddish blocks for $~$\bP(A,B)$~$ and $~$\bP(A,\neg B)$~$ are the same width and form a column; and likewise the blueish blocks for $~$\bP(\neg A,B)$~$ and $~$\bP(\neg A,\neg B)$~$. This is because we chose to [factoring_probability factor] our probability distribution starting with $~$A$~$:
$$~$\bP(A,B) = \bP(A)\; \bP(B \mid A)\ .$~$$
Let's use the [event_variable_equivalence equivalence] between [event_probability events] and [binary_variable binary random variables], so if we say $~$\bP(B = \true \mid A = \false)$~$ we mean $~$\bP(B \mid \neg A)$~$. For any choice of truth values $~$t_A \in \{\true, \false\}$~$ and $~$t_B \in \{\true, \false\}$~$, we have
$$~$\bP(A = t_A, B = t_B) = \bP(A = t_A)\; \bP(B = t_B \mid A = t_A)\ .$~$$
The first factor $~$\bP(A = t_A)$~$ tells us how wide to make the red column $~$(\bP(A = \true))$~$ relative to the blue column $~$(\bP(A = \false))$~$. Then the second factor $~$\bP(B = t_B \mid A = t_A)$~$ tells us the proportions of dark $~$(B = \true)$~$ and light $~$(B = \false)$~$ within the column for $~$A = t_A$~$.
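Continuing the illustrative numbers from above: if $~$\bP(A = \true) = 0.6$~$, $~$\bP(B = \true \mid A = \true) = 0.4$~$, and $~$\bP(B = \true \mid A = \false) = 0.25$~$, then the red column takes up $~$0.6$~$ of the square's width, the dark part of the red column is $~$0.4$~$ of that column's height, and the dark part of the blue column is $~$0.25$~$ of its height. Multiplying width by height recovers, for example, $~$\bP(A = \true, B = \true) = 0.6 \times 0.4 = 0.24$~$.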
We could just as well have factored by $~$B$~$ first:
$$~$\bP(A = t_A, B = t_B) = \bP(B = t_B)\; \bP(A = t_A \mid B = t_B)\ .$~$$
Then we'd draw a picture like this:
By the way, earlier when we factored by $~$A$~$ first, we got simple pictures of the probabilities $~$\bP(B \mid A)$~$ for $~$B$~$ conditioned on $~$A$~$. Now that we're factoring by $~$B$~$ first, we have simple pictures for the conditional probability $~$\bP(A \mid B)$~$:
and for the conditional probability $~$\bP(A \mid \neg B)$~$:
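Using the same illustrative numbers as before, factoring by $~$B$~$ first gives
$$~$\bP(B = \true) = 0.24 + 0.1 = 0.34, \qquad \bP(A = \true \mid B = \true) = \frac{0.24}{0.34} \approx 0.71\ ,$~$$
so the dark row takes up $~$0.34$~$ of the square's height, and about $~$0.71$~$ of that row is red. Both factorings describe the same joint distribution; they just slice the square along different axes first.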
Computing joint probabilities from factored probabilities
Let's say we know the factored probabilities for $~$A$~$ and $~$B$~$, factoring by $~$A$~$. That is, we know $~$\bP(A = \true)$~$, and we also know $~$\bP(B = \true \mid A = \true)$~$ and $~$\bP(B = \true \mid A = \false)$~$. How can we recover the joint probability $~$\bP(A = t_A, B = t_B)$~$ that $~$A = t_A$~$ is the case and also $~$B = t_B$~$ is the case?
Since
$$~$\bP(B = \false \mid A = \true) = \frac{\bP(A = \true, B = \false)}{\bP(A = \true)}\ ,$~$$
we can multiply the prior $~$\bP(A)$~$ by the conditional $~$\bP(\neg B \mid A)$~$ to get the joint $~$\bP(A, \neg B)$~$:
$$~$\bP(A = \true)\; \bP(B = \false \mid A = \true) = \bP(A = \true, B = \false)\ .$~$$
If we do this at the same time for all the possible truth values $~$t_A$~$ and $~$t_B$~$, we get back the full joint distribution:
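As a worked example (same illustrative numbers as above): starting from $~$\bP(A = \true) = 0.6$~$, $~$\bP(B = \true \mid A = \true) = 0.4$~$, and $~$\bP(B = \true \mid A = \false) = 0.25$~$, we get
$$~$\begin{aligned}
\bP(A, B) &= 0.6 \times 0.4 = 0.24 \\
\bP(A, \neg B) &= 0.6 \times 0.6 = 0.36 \\
\bP(\neg A, B) &= 0.4 \times 0.25 = 0.1 \\
\bP(\neg A, \neg B) &= 0.4 \times 0.75 = 0.3
\end{aligned}$~$$
and the four joint probabilities sum to $~$1$~$, as they should.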
[todo: information theory. a couple things, then point to another page. eg show example when two things have lots of mutual info.]