Rendered at 03:13:32 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
alok-g 19 hours ago [-]
I find several things confusing in this article.
>> ... A GP works by constructing an infinite amount of guesses or functions of the true process you want to approximate. As you accumulate more observations, it changes the shape of these functions to match the data, and hence the true process (just like the way you change your mind after getting new information)
Why is the 'true process' changing here? I understand our best guess or model is changing with new observations, but the true process should not be changing. If it actually is, then the formulation should be changed to isolate the parameters that are feeding back to it.
>> ... A GP works by constructing an infinite amount of guesses or functions of the true process you want to approximate. As you accumulate more observations, it changes the shape of these functions to match the data, ... A GP is simply a distribution over functions (or guesses). Because we have an infinite amount of guesses, the expected true guess (or best model) is the mean of all plausible guesses.
So is the shape of each function changing? OK. What is the 'distribution' over the functions doing? Is that also changing? Is the said 'distribution' just a flat mean of these functions?
>> GP(m(x), k(x, x'))
What is 'x' here? (Sigh! We need to learn to define the variables before using them.) I can infer that x' is not the derivative of x.
>> In the context of GPs, a kernel or covariance function k(x, x') = Cov(f(x), f(x')), encodes which function values should vary together.
It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.
>> I will use the rest of this post to go over different kernel representations and their visualizations.
The plots now have y and x, and x1 and x2. How are these related?
And with k(x, x') = Cov(f(x), f(x')), what is 'f' for the various kernel functions being plotted.
The rest of the post looks fine as plots of the various functions given. But given the above, I have not understood their importance as kernel functions or use for GP.
llamaz 15 hours ago [-]
> But given the above, I have not understood their importance as kernel functions or use for GP.
If the author has a CV attached to their blog, the purpose is to signal competence and the target audience is future employers.
magicalhippo 14 hours ago [-]
I was similarly confused, but after a few rounds with Gemini 3.5 Flash (extended) it cleared things up some, for me anyway.
> What is 'x' here?
So as I understand it, a Gaussian Process is defined in terms of a set of random variables which are indexed, typically by either time (t), or space (x). So in the concrete example, x here would be the amount of cheese inserted into the magical machine. In general the "index" can be a vector. Say if the magical machine instead required inserting both cheese and milk to produce some amount of gold, the index x would be two-dimensional, to represent the various amounts of cheese and milk you inserted.
> It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.
Right, it's general, and it's kinda confusing to use f when everything else seems to use X_t or similar. Here f is actually a random variable indexed by x, so one example could be
f(x) = r_1 + x * r_2
where r_1 and r_2 are two independent random variables with the standard normal distribution. In this case f(x) represents all possible lines, and f(3) gives you a random variable for index 3, so r_1 + 3 * r_2, that also follows a normal distribution thanks to how normal random variables behave when added and scaled.
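To make that concrete, here is a minimal sketch (standard-library Python, my own illustration rather than anything from the article) that draws many random lines f(x) = r_1 + x * r_2 and estimates Cov(f(1), f(3)) empirically. For this particular f the covariance works out to Var(r_1) + x * x' * Var(r_2) = 1 + x * x', so the estimate should land near 4. The seed, the sample count, and the function names are assumptions.

```python
import random

def sample_line(rng):
    """Draw one random line f(x) = r_1 + x * r_2 with r_1, r_2 ~ N(0, 1)."""
    r1 = rng.gauss(0.0, 1.0)
    r2 = rng.gauss(0.0, 1.0)
    return lambda x: r1 + x * r2

def empirical_cov(x_a, x_b, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Cov(f(x_a), f(x_b)) over random lines."""
    rng = random.Random(seed)
    vals_a, vals_b = [], []
    for _ in range(n_samples):
        f = sample_line(rng)  # one realization of the whole function
        vals_a.append(f(x_a))
        vals_b.append(f(x_b))
    mean_a = sum(vals_a) / n_samples
    mean_b = sum(vals_b) / n_samples
    total = sum((a - mean_a) * (b - mean_b) for a, b in zip(vals_a, vals_b))
    return total / (n_samples - 1)

cov_estimate = empirical_cov(1.0, 3.0)
cov_exact = 1.0 + 1.0 * 3.0  # Var(r_1) + x * x' * Var(r_2)
print(f"estimated {cov_estimate:.3f}, exact {cov_exact:.3f}")
```

Each call to sample_line is one "realization" of f; the covariance is a statement about how f(1) and f(3) move together across realizations.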
> The plots now have y and x, and x1 and x2. How are these related?
The left plot shows three realizations of y = f(x), i.e. for three different choices (samples) of the random variables that go into f(x). The right-hand plot shows the output of the kernel function for two indices x and x'. In the first example, the kernel function was the dot product between the two inputs, but given the indices are 1-dimensional that reduces to just k(x, x') = x * x'.
Back to the example, you can feed the machine various amounts of cheese and record the various amounts of gold you get back. The amounts of cheese are the indices which you use with the kernel function you picked, which you run through the Gaussian Process regression math, and you get a new function which takes an index (amount of cheese) and returns a normal distribution that predicts the amount of gold for that index (amount of cheese).
The process spits out the mean and the variance of the normal distribution, so you can look at the variance to determine how certain you can be about the prediction which will be centered around the mean.
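For anyone who wants to see that regression math in code, here is a rough sketch in standard-library Python. It is not the article's code: the RBF kernel, the zero prior mean, the tiny noise term, and the cheese/gold numbers are all assumptions made up for illustration. It computes the usual posterior mean k_*^T K^{-1} y and variance k(x_*, x_*) - k_*^T K^{-1} k_*.

```python
import math

def rbf(x, y, length=1.0):
    """Squared-exponential kernel (an assumed choice, not from the article)."""
    return math.exp(-0.5 * ((x - y) / length) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

def gp_predict(xs, ys, x_new, kernel=rbf, noise=1e-6):
    """Posterior mean and variance at x_new, assuming a zero prior mean."""
    K = [[kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [kernel(x_new, a) for a in xs]
    alpha = solve(K, ys)      # K^{-1} y
    v = solve(K, k_star)      # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Hypothetical observations: amounts of cheese in, amounts of gold out.
cheese = [1.0, 2.0, 3.0]
gold = [0.5, 1.2, 0.9]

mean_near, var_near = gp_predict(cheese, gold, 2.0)   # at an observed input
mean_far, var_far = gp_predict(cheese, gold, 10.0)    # far from any data
print(mean_near, var_near)
print(mean_far, var_far)
```

At an input you have already observed the variance collapses toward the noise level, while far from the data it returns to the prior variance, which is the "how certain can I be" signal described above.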
As I understand it, the point of the left plot is that you can use it to get an idea for which kernel function to use for your measured data. And as mentioned you can easily make new kernel functions by adding (OR-like) and multiplying (AND-like) other kernel functions.
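A small sketch of that composition idea, again in plain Python and with names of my own choosing (add, mul, gram are hypothetical helpers, not from the article). The sum is large when either component kernel is large, the product only when both are, which is where the OR-like / AND-like intuition comes from.

```python
import math

def linear(x, y):
    """Dot-product kernel for 1-D inputs: k(x, x') = x * x'."""
    return x * y

def rbf(x, y, length=1.0):
    """Squared-exponential kernel; `length` is a hyperparameter."""
    return math.exp(-0.5 * ((x - y) / length) ** 2)

def add(k1, k2):
    """Sum of two kernels (OR-like): high if either kernel is high."""
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    """Product of two kernels (AND-like): high only if both are high."""
    return lambda x, y: k1(x, y) * k2(x, y)

def gram(kernel, xs):
    """Covariance (Gram) matrix of the kernel over the inputs xs."""
    return [[kernel(a, b) for b in xs] for a in xs]

trend_or_local = add(linear, rbf)
trend_and_local = mul(linear, rbf)

xs = [0.0, 1.0, 2.0]
for name, k in [("sum", trend_or_local), ("product", trend_and_local)]:
    print(name, [[round(v, 3) for v in row] for row in gram(k, xs)])
```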
Also the author made a mistake: he mentioned kernel functions are parameterless, but he meant non-parametric. The kernel functions he shows, like the periodic kernel, have hyperparameters l and p, for example.
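For reference, one commonly used form of the periodic kernel is below. I'm assuming the article's version matches this up to notation; the output scale \sigma^2 is my addition and may not appear there. Here p is the period and \ell is the length scale, and both are hyperparameters of the kernel even though the model over functions is non-parametric.

```latex
k_{\mathrm{per}}(x, x') = \sigma^2 \exp\!\left( -\frac{2 \sin^2\!\left( \pi \lvert x - x' \rvert / p \right)}{\ell^2} \right)
```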
At least that's my current understanding.
saeranv 3 hours ago [-]
> Why is the 'true process' changing here? I understand our best guess or model is changing with new observations, but the true process should not be changing. If it actually is, then the formulation should be changed to isolate the parameters that is feeding back to it.
He's not saying the true process is changing, just the functions that are being sampled from the GP. The true process refers to the true, underlying function so it's deterministic if you have correctly identified all its inputs.
> So is the shape of each function changing?
Yes, the function changes shape as you get more data because the parameters governing that function (that we define in the kernel) are updated with new observational data, so that over time it converges to the 'true' process/function we are trying to discover.
> What is the 'distribution' over the functions doing? Is that also changing? Is the said 'distribution' just flat mean of these functions?
I think you're confused because the example given with cheese is really confusing when we're trying to understand the functions as arising from a multivariate distribution. So, I'll try to clarify that part. GPs are typically used to represent some function where the input is time or distance. This is why it's called a 'process' - because the variables in a random process are indexed by space or time. So in this 1D example, in the X domain, [x1, x2, x3] represents something like fixed increments of increasing cheese. f(X) represents the gold amount. Now imagine gold can take any value from 0-100. Now plot all possible values of f(x1) on the x-axis of a grid, f(x2) on the y-axis of the grid, and f(x3) on the z-axis of the grid. We have 100^3 points in this 3D grid. If we select one point, its x, y, z coordinates correspond to the f(x1), f(x2) and f(x3) gold amounts. The dimension index corresponds (typically) to something like time or distance. In this example it's cheese.
In a GP, we're modeling the sampled f(X) point as if it's from a 3D multivariate normal distribution. So sampling one point gives us the gold amount for cheese amount 1, 2, and 3. This is the 'function', and as we sample more points, we get more 'functions' that give us varying gold amounts for cheese amount 1, 2, and 3. And because it's a multivariate distribution, we can capture correlations between dimensions, so the amount of gold you get for cheese-1 should influence how much gold you get at cheese-2 because it's close by. This relationship is defined by the covariance function of the Gaussian.
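Here is a rough sketch of that picture in standard-library Python. Assumptions on my part: an RBF kernel as the covariance function, a zero mean (so the sampled "gold" values are centered on 0 rather than living in 0-100), and a tiny jitter on the diagonal for numerical stability. Each draw is one point in the 3D space, i.e. one 'function' evaluated at cheese amounts 1, 2, and 3.

```python
import math
import random

def rbf(x, y, length=1.0):
    """Assumed covariance function: nearby inputs get high covariance."""
    return math.exp(-0.5 * ((x - y) / length) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a small positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sample_function(chol, rng):
    """One draw from N(0, K): multiply standard normals by the Cholesky factor."""
    n = len(chol)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(chol[i][k] * z[k] for k in range(i + 1)) for i in range(n)]

cheese = [1.0, 2.0, 3.0]
K = [[rbf(a, b) + (1e-9 if i == j else 0.0)
      for j, b in enumerate(cheese)] for i, a in enumerate(cheese)]
chol = cholesky(K)

rng = random.Random(0)
draws = [sample_function(chol, rng) for _ in range(20_000)]

def corr(i, j):
    """Empirical correlation between f(x_i) and f(x_j) across the draws."""
    n = len(draws)
    mi = sum(d[i] for d in draws) / n
    mj = sum(d[j] for d in draws) / n
    cov = sum((d[i] - mi) * (d[j] - mj) for d in draws) / n
    vi = sum((d[i] - mi) ** 2 for d in draws) / n
    vj = sum((d[j] - mj) ** 2 for d in draws) / n
    return cov / math.sqrt(vi * vj)

print("corr(cheese-1, cheese-2):", round(corr(0, 1), 3))
print("corr(cheese-1, cheese-3):", round(corr(0, 2), 3))
```

With this kernel the gold at cheese-1 is more strongly correlated with the gold at cheese-2 than with the gold at cheese-3, which is the "close by" effect described above.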
> GP(m(x), k(x, x')) What is 'x' here? (Sigh! We need to learn to define the variables before using.) I can infer that x' is not derivative of x.
x refers to some amount of cheese (the input), and k(x, x') just means that the kernel consumes any two values in our X vector (e.g. [x1, x3] or [x1, x2]).
> "In the context of GPs, a kernel or covariance function k(x, x') = Cov(f(x), f(x')), encodes which function values should vary together." It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.
I believe it is the same f actually. He's saying the kernel function takes in two values of x (cheese), and outputs the covariance between their output gold amounts. This illustrates his previous point that the "closeness" between x values should be reflected in the gold amounts.
> The plots now have y and x, and x1 and x2. How are these related?
y is gold. x is cheese. x1, x2 correspond to the first two x-values in the linear plot.
> And with k(x, x') = Cov(f(x), f(x')), what is 'f' for the various kernel functions being plotted.
f(X) is the approximation of the "true" process we're trying to learn from observational data. The observations are tuples of cheese and gold amounts, so f(x), f(x') are just the corresponding gold amounts; we don't actually model that function explicitly. The Gaussian distribution we are sampling functions from just models correlations between our variables, so it represents the function implicitly.
zahirbmirza 1 days ago [-]
I can't wait to read/view this in detail. Super exciting, thank you.
RickJWagner 1 days ago [-]
To the author: Thank you, quite readable. I like the thumbnail explanations.
ranger_danger 1 days ago [-]
Does not appear to have anything to do with operating systems... looks AI related
dullcrisp 1 days ago [-]
Sadly those of us hoping to learn about making popcorn are forced to look elsewhere.
trumpdong 1 days ago [-]
"AI" now refers to things like ChatGPT - this is ML related (machine learning) which is the thing that used to be called AI six years ago
ranger_danger 1 days ago [-]
I don't think there is a single universally applicable definition of "AI" that even most people would agree with.
thephyber 1 days ago [-]
Yes, kernel is an overloaded term. This is about functions running on GPUs, not Operating System core functionality.
srean 23 hours ago [-]
Not specifically those either.
This is about inner product functions in a specific kind of Hilbert space, a notion that is very useful in many branches of applied mathematics, machine learning and functional analysis included.
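In symbols, and hedging that this is the standard textbook statement rather than something the article spells out: a positive-definite kernel k is an inner product of feature maps \varphi into a Hilbert space \mathcal{H}, and in the reproducing kernel Hilbert space it also evaluates functions.

```latex
k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}},
\qquad
f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.
```

A tiny example: k(x, x') = 1 + x x' is the ordinary dot product of \varphi(x) = (1, x) and \varphi(x') = (1, x').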
The name collision is unfortunate.
One unfortunate difficulty is that these kernels don't map so well to GPU kernels unless explicitly embedded in high-dimensional spaces. This is one of the reasons why kernel methods have recently fallen out of favour in machine learning - lack of mechanical sympathy. Note this is just one reason, there are others.