Hmm. I'm not an expert, but some of this seems definitely not to be accurate. Some of the "Bullshit" turns out perhaps to be quite important.
Take the statement:
> Markov Chain is essentially a fancy name for a random walk on a graph
Is that really true? I definitely don't think so. To my understanding, a Markov process is a stochastic process that has the additional (aka "Markov") property that it is "memoryless". That is, to estimate the next state you only need to know the state now, not any of the history. It becomes a Markov chain if it is a discrete, rather than continuous process.
There are lots of random walks on graphs that satisfy this definition. Like say you have a graph and you just specify the end points and say "walk 5 nodes at random from this starting node, what is the expectation that you end up at a specific end node". This could be a Markov process. At any point to estimate the state you only need to know the state now.
But lots of random walks on graphs do not have the Markov property. For example, say I did the exact same thing as the previous example, so I have a random graph and a start and target node and I say "Walk n nodes at random from the starting node. What's the expectation that at some point you visit the target node". Now I have introduced a dependency on the history and my process is no longer memoryless. It is a discrete stochastic process and it is a random walk on a graph but is not a Markov chain.
An example of Markov and non-Markov processes in real life: if I have a European option on a stock, I only care about what the stock price is at the expiry date. But if I have a barrier option, or my option has some knock-in/knock-out/autocallable features, then it has a path dependence, because I care about whether at any point in its trajectory the price hit the barrier level, not just the price at the end. So the price process for the barrier option is non-Markov.
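A minimal Monte-Carlo sketch of that distinction (my own toy illustration, not from the article; the GBM dynamics, strike, and barrier level are made-up numbers): the European payoff needs only the final price of each simulated path, while the up-and-out barrier payoff needs the whole trajectory.

    # Sketch: path dependence of a barrier option vs a European option.
    # All parameters (S0, strike, barrier, vol) are invented for illustration.
    import random, math

    def simulate_path(s0=100.0, mu=0.0, sigma=0.2, steps=252, dt=1/252):
        path = [s0]
        for _ in range(steps):
            z = random.gauss(0.0, 1.0)
            path.append(path[-1] * math.exp((mu - 0.5 * sigma**2) * dt
                                            + sigma * math.sqrt(dt) * z))
        return path

    K, B, n = 100.0, 120.0, 5000
    euro, barrier = 0.0, 0.0
    for _ in range(n):
        p = simulate_path()
        euro += max(p[-1] - K, 0.0)          # needs only the final price
        if max(p) < B:                       # needs the whole trajectory (knock-out check)
            barrier += max(p[-1] - K, 0.0)
    print("European:", euro / n, "Up-and-out barrier:", barrier / n)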
I won’t pretend to know the technical details (as the other replies do) but I want to make a point for the “pedagogical” effect here, which I agree with the author. The way I interpret the article, it’s not supposed to be a deep, theoretical treatise on the subject; more of an introductory, “intuitive” take on it. This works for those who need to either learn the concept to begin with, or refresh their memories if they don’t work with it every day. I think it’s a given that any intuitive take on a mathematical concept will always oversimplify things, with the underlying assumption that, if you actually need to know more, you’re going to have to dive deeper somewhere else. The most important thing I think is to help the reader build a rough conceptual understanding of the concept such that they can reason about it, instead of simply memorizing the terms.
I see where you're coming from, but in that particular case, the "intuitive" explanation (walk on a graph) is far less intuitive for me than the proper explanation, that a Markov process is memoryless.

That said, I used MCMC in the past to do physics simulation, where the Markov property also applies to the underlying physical process. So maybe it's just me.
No single explanation works for most. That's why you need multiple ways to explain things, and it's where the standard education system fails at disseminating information.
>I won’t pretend to know the technical details (as the other replies do) but I want to make a point for the “pedagogical” effect here, which I agree with the author. The way I interpret the article, it’s not supposed to be a deep, theoretical treatise on the subject; more of an introductory, “intuitive” take on it.
There are ways to teach things at a conceptual level without teaching them incorrectly though, especially when writing for an adult audience.
Yeah, the hard part of MCMC is going from finite state and time spaces to continuous. The general concepts and intuition get a LOT more complicated once you leave finite state/time, even though they are ultimately fairly intuitive generalizations of the finite case. Maybe this is a hot take, but I did do my PhD on MCMC and to be honest I think most of the details for the more general case are superfluous for a lot of people who just want to use MCMC, since it boils down to "it basically works algorithmically the same way as the easier-to-understand case of finite state/time, but beware that the actual details of proving that get very involved." The conclusion of the theory for MCMC is that it ends up being pretty darn lenient in terms of caveats and edge cases, so in practice, as a user, there aren't that many gotchas if you're not really well versed in the deepest parts of the theory.
A counterpoint is that finding texts and references that do handle the more advanced cases in depth is quite difficult, bordering on impossible if you want something approachable. I recreated the general proofs for at least the continuous state case as a chapter in my dissertation and stringing together an end-to-end proof with the proper references required quite a lot of digging and discussions with my (co-)advisor.
Yep, and that undercuts the "people" (I'll stay polite) trying to hide this intuitive "pedagogical" perspective for some agenda of their own...
The Markov property is about state, yes, but can't you expand the state to accommodate your non-Markov example?

As in, the state you track is no longer the probability that you have ended at a given node at time T, but instead includes a new vector of the probabilities that you have visited each node at any time in the past (which can be obtained from the PDF of location from the previous time step plus stochastic "diffusion").
So, we're randomly walking over a graph, but the graph is not the same as the graph used in MCMC. The MCMC graph is the state with random transitions that must model what you want to observe. That separation does not violate the statement that "it's just a random walk"; it just severely complicates it, I suppose.
This is very typical in reinforcement learning. You just expand the state to include some more time periods. It definitely raises some academic eyebrows (since it's not technically memoryless), but hey, if it works, it works.
It is memoryless just in a different state space.
In the state space that includes all of the time periods, with infinitesimal granularity since the birth of the universe.
Welcome to a well formulated POMDP.
Stochastic process with the Markov property: Past and future are conditionally independent given the present. The general version of conditionally independent is from probability theory based on measure theory and the Radon-Nikodym theorem (with von Neumann's novel proof in Rudin, Real and Complex Analysis), but an easier introduction is in Erhan Çınlar, Introduction to Stochastic Processes.
In a Poisson process the time until the next event has the exponential distribution and, thus, from a simple calculus manipulation, is memoryless, so the process is a Markov process.
E.g., time from now until a molecule of hydrogen peroxide H2O2 decomposes to water and oxygen is independent of when it was made from water and oxygen. That is the basis of half life, the same distribution until decomposition starting now no matter when the chemical or particle was created.
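For what it's worth, that memorylessness is a one-line calculation for an exponential waiting time T with rate \lambda (the standard textbook identity):

    P(T > s + t \mid T > s) = \frac{P(T > s + t)}{P(T > s)} = \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t)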
In WWII, searching at sea, e.g., for enemy submarines, was important, and from that came Bernard O. Koopman, Search and Screening, 1946, with an argument that encounters between two ships followed a Poisson process (exponentially distributed time to the next encounter), i.e., a Markov process.
In grad school, there was a question about how long US submarines would last in war at sea. Well, take n Red ships and m Blue ships with, for each ship, position, speed, and detection radius, and, for each Red-Blue pair, given a detection, probabilities that Red dies, Blue dies, both die, or neither dies (right, these four have to be non-negative and add to 1). Now we have specified a Markov process that we can evaluate with a relatively simple Monte-Carlo simulation.
I had written a random number generator in assembler using an Oak Ridge formula, typed quickly, and did the simulation. The work had a review by a famous probability prof and passed when I explained how the law of large numbers applied. So, some pure and applied math and computing worked, but some politics didn't!
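A minimal sketch of that kind of simulation, just to make the setup concrete (everything here is invented: the area, fleet sizes, speeds, detection radius, and outcome probabilities; it is not the original study's code):

    # Sketch of the Red/Blue encounter model described above; all parameters are made up.
    import random, math

    AREA, DETECT_R, STEPS, TRIALS = 100.0, 5.0, 200, 200
    P_RED_DIES, P_BLUE_DIES, P_BOTH_DIE = 0.3, 0.4, 0.1   # remaining 0.2: neither dies

    def new_ship():
        return {"x": random.uniform(0, AREA), "y": random.uniform(0, AREA),
                "heading": random.uniform(0, 2 * math.pi), "speed": 1.0, "alive": True}

    blue_survivors = 0
    for _ in range(TRIALS):
        red = [new_ship() for _ in range(5)]
        blue = [new_ship() for _ in range(5)]
        for _ in range(STEPS):
            for s in red + blue:
                if s["alive"]:
                    s["heading"] += random.gauss(0, 0.3)   # random course change
                    s["x"] = (s["x"] + s["speed"] * math.cos(s["heading"])) % AREA
                    s["y"] = (s["y"] + s["speed"] * math.sin(s["heading"])) % AREA
            for r in red:
                for b in blue:
                    if (r["alive"] and b["alive"]
                            and math.hypot(r["x"] - b["x"], r["y"] - b["y"]) < DETECT_R):
                        u = random.random()                # outcome of a detection
                        if u < P_RED_DIES:
                            r["alive"] = False
                        elif u < P_RED_DIES + P_BLUE_DIES:
                            b["alive"] = False
                        elif u < P_RED_DIES + P_BLUE_DIES + P_BOTH_DIE:
                            r["alive"] = b["alive"] = False
        blue_survivors += sum(b["alive"] for b in blue)
    print("average Blue survivors per trial:", blue_survivors / TRIALS)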
This Red/Blue submarine problem seems to be a better fit for ABM simulation, rather than Monte Carlo based on Markov processes.
IRL this will be path dependent since both sides will learn from past actions and the probabilities will be changing, i.e. the memorylessness (Markov) property will not hold.

In ABM the ships (agents) can move in 2D space, which makes detection easier.

Also, obviously there are lots of externalities, like supplies of weapons, food, and sailors, ceasefires, surrenders, politics, etc.
All of the above is easier to simulate using ABM, rather than Monte Carlo.
I once failed an interview with a hedge fund because they asked me a variant on that red ships/blue ships problem and at the time I knew nothing about probability. They also hadn’t given me lunch which didn’t help.
Your overall point might be correct but your example does not prove your point.
A Markov chain is just the path taken through the course of a Markov process. The terms 'chain' and 'process' are sometimes conflated in this context, but this is the most common distinction I've seen. As such, you can run a Markov process for some number of steps N times, and then ask how many generated chains contain the property you are interested in. The process is memoryless, but the chain is the result of the process and therefore contains memory.
I agree 'random walk' is a superset of 'Markov process', but IMO when someone says random walk, they normally make assumptions that qualify it as a Markov chain. Therefore it's useful as a teaching device to just call it a random walk.
Interesting. I'd not heard the term Markov Chain used to describe a path so I checked my copy of "All of Statistics" by Larry Wasserman and he (slightly) disagrees with both of us and says

"A Markov chain is a stochastic process for which the distribution of X_t depends only on X_{t-1}."

So he doesn't need it to be a discrete process but he also doesn't think it's a path. I guess the terminology is not 100% standardized. But anyhow thanks for making me think about this. Always interesting.

But a random walk is precisely a stochastic process for which the _next state_ depends only on the _current state_. In terms of graphs (where _random walk_ comes from), the next node is decided by randomly selecting a neighbor of the current node.
That's not true for random walks in general I don't think. A random walk is a process derived from taking random steps in some mathematical space. It can include jumps and it can include memory.
Take a "path-avoiding" random walk. At time t the distribution of the next step depends on whether or not I have at some point hit any of the adjacent nodes in the current path. That's not the current state, that's memory.
A random walk on a graph is a stochastic process that starts at a vertex and, at each step, moves to a randomly chosen neighboring vertex. Formally:
Given a graph, a random walk is a sequence of vertices [v_1, v_2, ..., v_k] such that each v_{i+1} is selected uniformly at random from the neighbors of v_i.
In weighted graphs, the next vertex is chosen with probability proportional to edge weights.
I’m pretty sure it’s not a requirement that the distribution be uniform, and it doesn’t have to be path-independent either, as per the example I gave - a random walk where you’re not allowed to visit a node more than once.
Here's a reference for the definition I'm using:
https://www.cs.yale.edu/homes/spielman/561/lect10-18.pdf
It's from lecture notes (pdf) from a course in Spectral Graph Theory by Professor Daniel Spielman at Yale.
Those notes look interesting thanks. I’ve really never heard of someone saying you have to have uniform probability for a random walk on a graph. In fact the context where I’m most familiar with them (lattice/grid pricers) it’s always specified something like “the probability of branch a is p and b is 1-p” (ie explicitly not uniform) and they aren’t a weighted graph in the normal sense.
It doesn't have to be uniform or memory less.
However, in general when people mention random walks without further qualifications they are usually talking about the uniform and memoryless case.
That vanilla case can be generalized extensively, on kinds of state spaces, kinds of index sets, kinds of dependence and so on.
That makes sense. Thanks.
But also "the memory" of the random walk can be encoded in a state itself. In your example you can just keep accumulating the visited nodes, so your state space will be now the space of tuples of nodes from your initial space.
A Markov chain is commonly understood to be a time-discrete Markov process. Intuitively, it’s a “chain” because you can “single out” its states at individual points in time rather than over intervals. That’s also Markov’s original definition. Instead of Wasserman one can look up the notion in Wikipedia. A path is a notion relevant to the state space of a process - it’s a realization of states at every single point in time for the time span of interest.
Not all Markov processes have stationary distributions, and of those that do, not all correspond to a non-normalized probability function.
It therefore has some merit to think about MCMC as a random walk on a graph rather than Markov processes, because the “graph” needs to have some properties in order for the Markov process to be useful for MCMC. For example every “node” in the “graph” needs to be reachable from every other “node” (ergodicity).
Could you explain further please? I agree with what you're saying but I don't understand how it applies to what I said, so there's definitely something I could learn here.
Edit: Thanks.
Sure. As I see MCMC it’s basically a mathematical trick that lets you sample from probability distributions even if you only know the relative probability of different samples. It’s based on the observation that some Markov processes have a stationary distribution that is identical to a distribution you may want to sample from. But you need to carefully construct the Markov process for that observation to hold, and the properties you need to ensure are most easily understood as properties of a random walk graph. So the interesting subset of Markov processes are most easily understood as such random walks on a graph.
I think the core of the subject is that you only need to know relative probabilities, p_j/p_k to be able to take a sample of a distribution.
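To make that concrete, here is a minimal Metropolis sketch (my own toy example, not from the article): it samples from an unnormalized density f using nothing but the ratio f(proposal)/f(current), with a symmetric random-walk proposal.

    # Sketch: Metropolis sampling from an unnormalized density; target and step size are made up.
    import random, math

    def f(x):
        # unnormalized bimodal density; no normalizing constant needed
        return math.exp(-(x - 2) ** 2) + 0.5 * math.exp(-(x + 2) ** 2)

    x, samples = 0.0, []
    for _ in range(50000):
        proposal = x + random.gauss(0.0, 1.0)                 # symmetric random-walk proposal
        if random.random() < min(1.0, f(proposal) / f(x)):    # only the ratio p_j/p_k matters
            x = proposal
        samples.append(x)

    print("sample mean:", sum(samples) / len(samples))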
(2015) Popular in:
2022 (233 points, 53 comments) https://news.ycombinator.com/item?id=33332437
2016 (125 points, 20 comments) https://news.ycombinator.com/item?id=12537043
2015 (114 points, 17 comments) https://news.ycombinator.com/item?id=9331808
Science communication is so important. I write scientific papers and I always write a blog post about the paper later, because nobody understands the scientific paper -- not even the scientists. The scientists regularly read my blog instead. The "scientific style" has become so obtuse and useless that even the professionals read the blog instead. True insanity.
What would happen if you wrote the paper like a blog post and submitted it to journals? (With some structural changes to meet journal requirements, like having an abstract)
I had to learn Bayesian econometrics mostly on my own[1]. Fortunately, Jeff Miller[2] created a series of fantastic YouTube videos to explain Markov Chain Monte Carlo in detail. Personally, I prefer to learn mathematical concepts with equations and develop the intuition on my own. If you have the same preference, you will find his videos really helpful: https://www.youtube.com/watch?v=12eZWG0Z5gY
[1] I am lucky to know people who are fantastic Bayesian modelers and they helped me polish my concepts.
[2] https://jwmi.github.io/index.html
So he sets up a toy problem (drawing from a baby names distribution), then never explains how to solve this problem.
The intuition is, you set up a graph where the vertices are names, and the edges are based on name similarity. Two names are neighbors if e.g. their edit distance is within some limit. You start at a random name; then, looking at a neighbor, you flip a biased coin based on the ratio of the P(x) of the neighbor to that of your current name, and if heads you move to the neighbor.
I'm sure this is wrong in many and subtle ways, but when I read an article like this I expect some intuition like this to be imparted.
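For what it's worth, here is a toy version of that walk (certainly not the article's actual solution; the names, weights, and edit-distance cutoff are all made up), with a Hastings degree correction added since names can have different numbers of neighbors:

    # Toy sketch of the name-graph walk described above; data and cutoff are invented.
    import random

    weights = {"anna": 40, "ann": 30, "anne": 25, "hanna": 20, "hannah": 15, "jane": 10}

    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # neighbors: names within edit distance 2
    neighbors = {n: [m for m in weights if m != n and edit_distance(n, m) <= 2] for n in weights}

    current, counts = "anna", {n: 0 for n in weights}
    for _ in range(100000):
        nxt = random.choice(neighbors[current])
        # Metropolis-Hastings acceptance; the len(...) factors correct for names
        # having different numbers of neighbors (proposal asymmetry)
        accept = (weights[nxt] * len(neighbors[current])) / (weights[current] * len(neighbors[nxt]))
        if random.random() < min(1.0, accept):
            current = nxt
        counts[current] += 1
    print(counts)   # roughly proportional to `weights` once the chain has mixed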
> The “bullshit” here is the implicit claim of an author that such jargon is needed. Maybe it is to explain advanced applications (like attempts to do “inference in Bayesian networks”), but it is certainly not needed to define or analyze the basic ideas.
"The bullshit here is the implicit claim of an author that German language is needed. Maybe it is for advanced applications (like Goethe's poetry), but it is certainly not needed to describe basic ideas."
(proceeds to explain the same basic concepts 10x more verbose than in any math textbook on the subject)
Math/statistics nomenclature is certainly not perfect (think of it as a general utilities library that has been in active development for 200+ years), but it is widely used for a reason: once you learn the language it becomes second nature (very much the same as knowing all the details of the standard library API in your language of choice) allowing to communicate arbitrarily complex abstract ideas with speed and precision.
Yes, it's pretty simple. Different audiences prefer different styles. Experts prefer brevity and jargon; they don't need explanations of concepts they already understand. Beginners prefer less jargon and inline explanations, as they'd otherwise need to hunt for references to understand the text.
Common pattern where a bright spark asks, 'why you all so complicated?' Proceeds to assume we're dealing with a finite graph / set.
All the complication is needed to handle the fact that the state can be a random vector of real numbers in a possibly varying dimensional space. It's not jerking off on jargon for its own sake.
Sure, there are simple cases - doesn't make the general case 'bullshit'.
Interesting! I also make extensive notes of mathematical and computational concepts (and “buzzwords”) with a “no bullshit” title, and it works great. It’s great for quick refreshers once in a while.
If I ever write about "Markov chains without the Math/Jargon/BS" I'll use the clip from the "Ten Second Tom" scene from 50 First Dates[1] and a host of other sci-fi movies about time loops[2,3] to illustrate the memorylessness[4] of the Markov property[5].
---
1. https://www.youtube.com/watch?v=iN_BDcKhtWk
2. https://en.wikipedia.org/wiki/Time_loop
3. https://en.wikipedia.org/wiki/List_of_films_featuring_time_l...
4. https://en.wikipedia.org/wiki/Memorylessness
5. https://en.wikipedia.org/wiki/Markov_property
Unfortunate that the equation rendering doesn't work in the body text.
Works fine here, Firefox on Android
I experienced this issue, but it works on another device, and after a refresh it worked on the original device. Good luck...
There is definite truth to the idea stats people have a habit of writing deliberately impenetrable prose. (Probably due to proximity to economists and political scientists).
I happened to need to implement a markov chain for playing rock, paper, scissors in https://luduxia.com/showdown - once you actually understand it the code is short and took no more than 20 minutes to write, and was by far the easiest part of the entire effort, which was surprising. Vertically centering a div is harder, and involves dealing with even more jargon.
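For anyone curious what such a thing can look like, here is a generic sketch of a first-order Markov predictor for rock-paper-scissors (my own illustration, not the code behind the linked site): it counts the opponent's observed transitions and plays the counter to their most likely next move.

    # Sketch: first-order Markov predictor for rock-paper-scissors.
    import random
    from collections import defaultdict

    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}
    counts = defaultdict(lambda: defaultdict(int))   # counts[prev][next] = observed transitions
    prev = None

    def respond(opponent_move):
        global prev
        if prev is not None:
            counts[prev][opponent_move] += 1          # record the opponent's transition
        prev = opponent_move
        history = counts[opponent_move]
        if not history:                               # no data yet: play randomly
            return random.choice(list(BEATS))
        predicted = max(history, key=history.get)     # most likely next opponent move
        return BEATS[predicted]                       # play the move that beats it

    # Example: an opponent who tends to repeat their last move.
    for move in ["rock", "rock", "rock", "paper", "rock", "rock"]:
        print(move, "->", respond(move))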
MCMC is bullshit to begin with. Too slow… and alternatives are better for many reasons. It should not be the standard method taught to students when it comes to Bayesian inference.
Feels too long
I only want the python code
it's better to understand the theory before putting up some faulty production code.