An optimization algorithm is best thought of as a priority queue of things to learn, and the thing that’s important to prove is that your algorithm learns the good things first.

Hmm, could be.

But is there a formal theoretical explanation of deep learning methods and why they must work?

Something like:

“Given specific topologies (see section A) of the N-dimensional space representing all independent variables describing a specific phenomenon, this theorem shows that at least some subspaces of this space will contain independent variables that can be represented as a neural network Z with the matrix T, where columns represent active layers and nonzero cells represent active network nodes.”

section A: list of topologies….

This is of course ‘wishful thinking’ on my part, but I have been looking for something that explains (to a layman like myself) why a neural network, once trained using existing methods (enumerated), would be a close-to-theoretically-optimal classifier.

I didn’t find this argument very convincing. I’m still having trouble refuting NNAEPR; in that approach, it is shown that part of why neural networks seem to always be amenable to pruning is not necessarily due to “lottery tickets” which perform extremely well without much training, but rather due to immense redundancy in the encoding of multiple layers leading to collinearity which can be removed.
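To make the collinearity point concrete, here is a minimal sketch. It is not the NNAEPR procedure itself, only a toy illustration under stated assumptions: a single hidden ReLU layer in which one unit's incoming weights are exactly twice another's. The weights, inputs, and function names are all made up for the example. Because ReLU is positively homogeneous, the duplicated unit can be removed and its outgoing weight folded into the surviving unit without changing the network's output.

```python
def relu(v):
    return max(0.0, v)


def forward(x, hidden_weights, output_weights):
    """One hidden ReLU layer followed by a linear output (no biases)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)))
              for row in hidden_weights]
    return sum(a * h for a, h in zip(output_weights, hidden))


# Hypothetical weights: hidden unit 1 is exactly 2x hidden unit 0,
# so their activations are collinear for every input.
hidden_full = [[1.0, -2.0], [2.0, -4.0], [0.5, 1.0]]
output_full = [0.7, -0.2, 1.5]

# Prune unit 1 and fold its contribution into unit 0:
# a0*h0 + a1*(2*h0) == (a0 + 2*a1)*h0
hidden_pruned = [hidden_full[0], hidden_full[2]]
output_pruned = [output_full[0] + 2.0 * output_full[1], output_full[2]]

inputs = [[1.0, 0.0], [0.3, -1.2], [-2.0, 0.5], [0.0, 3.0]]
max_gap = max(
    abs(forward(x, hidden_full, output_full)
        - forward(x, hidden_pruned, output_pruned))
    for x in inputs
)
print("largest output difference after pruning:", max_gap)
```

In a trained network the redundancy would be approximate rather than exact, so removing it would change the output slightly instead of not at all.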

The key thing that makes the lottery ticket hypothesis compelling to me is that the subnetwork identified can only be retrained if it is initialized in the exact same way. It might be due to collinearity that it can be shrunk, but that doesn’t explain why it can’t start small.

[…] and statistics tells us what happens to all big sums. They become normal distributions, and they become relatively tighter and tighter around their mean as the number of terms in the sum increases. […]

Now I’m curious: The Central Limit Theorem is only applicable to a series of independent, identically distributed random variables with equal expected value and finite variance – do a model’s learned parameters meet these requirements? And does anyone know of a publication or textbook that explains how this can be modelled?

It applies to more than just that. The only thing that’s really critical is that the variances are finite, and comparable in size to each other such that no subset of variables dominates. Even if the variables aren’t independent, as long as you can extract enough uncorrelated terms by decomposition, then it still applies to the sum of those terms.
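A small simulation can illustrate the claim that the terms need not be identically distributed. This is only a sketch under assumptions chosen for the example: the three distributions, the term counts, and the use of excess kurtosis as a rough "how normal is it" measure are not from the article. The terms are independent, zero-mean, and have finite variances of comparable size.

```python
import random
import statistics


def excess_kurtosis(samples):
    """Sample excess kurtosis; roughly 0 for normally distributed data."""
    mean = statistics.fmean(samples)
    centered = [x - mean for x in samples]
    variance = statistics.fmean([c ** 2 for c in centered])
    fourth = statistics.fmean([c ** 4 for c in centered])
    return fourth / variance ** 2 - 3.0


def mixed_sum(rng, n_terms):
    """Sum n_terms independent, zero-mean terms that cycle through three
    different distributions with finite, comparable variances."""
    total = 0.0
    for i in range(n_terms):
        kind = i % 3
        if kind == 0:
            total += rng.uniform(-1.0, 1.0)      # symmetric, variance 1/3
        elif kind == 1:
            total += rng.expovariate(1.0) - 1.0  # skewed, variance 1
        else:
            total += rng.choice((-0.5, 0.5))     # two-point, variance 1/4
    return total


rng = random.Random(0)
few = [mixed_sum(rng, 3) for _ in range(20000)]
many = [mixed_sum(rng, 150) for _ in range(5000)]

k_few = excess_kurtosis(few)
k_many = excess_kurtosis(many)
print(f"excess kurtosis, 3 terms:   {k_few:.3f}")
print(f"excess kurtosis, 150 terms: {k_many:.3f}")
```

With only three terms the skewed exponential term is still clearly visible in the sum; with 150 terms the excess kurtosis should be much closer to zero, which is what a near-normal sum looks like.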

Interesting, thank you for your explanation. And also for the article. I enjoyed reading it a lot.

[…] as long as you can extract enough uncorrelated terms by decomposition, then it still applies to the sum of those terms.

This might be a typo: did you mean to write that one can extract enough independent terms by decomposition to make the CLT applicable? That would make sense to me; it should be possible. Otherwise I don’t see how uncorrelatedness implies the required independence in this scenario.

Yes, sorry, I was using uncorrelated as a stand-in for independent. I’m not sure off the top of my head whether uncorrelated is sufficient. I think it might be, because we’re dealing with sums and variances, and that tends to be how these things work out, but I’d have to check.
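For what it's worth, zero correlation alone does not seem to be enough in general. A standard counterexample, sketched here with made-up parameters, multiplies every term in a sum by the same random scale factor. The terms are then pairwise uncorrelated and identically distributed with finite variance, but they are not independent, and the normalized sum is a mixture of two normals rather than a normal, however many terms are added. Versions of the CLT for dependent terms do exist, but they need extra conditions beyond zero correlation.

```python
import random
import statistics


def excess_kurtosis(samples):
    """Sample excess kurtosis; roughly 0 for normally distributed data."""
    mean = statistics.fmean(samples)
    centered = [x - mean for x in samples]
    variance = statistics.fmean([c ** 2 for c in centered])
    fourth = statistics.fmean([c ** 4 for c in centered])
    return fourth / variance ** 2 - 3.0


def draw_terms(rng, n_terms, shared_scale):
    """Zero-mean terms. With shared_scale=True, every term in one draw is
    multiplied by the same random factor, so the terms are uncorrelated
    but not independent."""
    scale = rng.choice((0.5, 1.5)) if shared_scale else 1.0
    return [scale * rng.gauss(0.0, 1.0) for _ in range(n_terms)]


def experiment(rng, n_terms, n_samples, shared_scale):
    """Return (sample covariance of the first two terms,
    excess kurtosis of the sum divided by sqrt(n_terms))."""
    products, sums = [], []
    for _ in range(n_samples):
        terms = draw_terms(rng, n_terms, shared_scale)
        products.append(terms[0] * terms[1])  # the terms have mean zero
        sums.append(sum(terms) / n_terms ** 0.5)
    return statistics.fmean(products), excess_kurtosis(sums)


rng = random.Random(1)
results = {}
for shared in (False, True):
    results[shared] = experiment(rng, 50, 10000, shared)
    cov, kurt = results[shared]
    print(f"shared scale={shared}: covariance={cov:.3f}, "
          f"excess kurtosis of sum={kurt:.3f}")
```

In both cases the sample covariance between terms should be near zero, but only the independent case should give a sum whose excess kurtosis is near zero.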

Is that post saying that VC dimension is a pointless concept?

Some interesting ideas here! Especially the bits about convergence not being important in practice and about early stopping vs. regularization.
