Currently, this computes the approximation of negentropy, which is the objective function to maximize.
To understand this, let w be a single row vector of W, let x be a single data vector, and let v be a standard normal random variable. To find this one independent component, we maximize
J(w^T x) \approx ( Expec[G(w^T x)] - Expec[G(v)] )^2,
where G is the function selected by opts.G_function. As long as the matrix W (capital "W") is orthogonal, which we do enforce, w^T x satisfies the requirement that its variance be one. To extend this to the whole matrix W, take the sum over all of its rows, so the problem is: maximize{ \sum_w J(w^T x) }.
In practice, batchSize should be much greater than one, so "data" consists of many columns. Denoting the data matrix as X, we obtain the expectations by taking sample means. In other words, we take the previously stored "user" matrix, W*X, apply the function G to it elementwise, and THEN take the mean of each row, i.e. mean(G(W*X),2). The per-row mean gives what we want, since each row applies the same row of W to the different x (column) vectors in our data.
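For concreteness, here is a minimal plain-Scala sketch of this batch estimate, using the logcosh G; it is illustrative only (plain arrays rather than BIDMach matrices, and Expec[G(v)] estimated by Monte Carlo rather than a precomputed constant):

    import scala.util.Random

    object NegentropyApprox {
      // G for the logcosh contrast: G(u) = log(cosh(u))
      def G(u: Double): Double = math.log(math.cosh(u))

      // Approximates sum_j ( Expec[G(w_j^T x)] - Expec[G(v)] )^2 over a batch.
      // W is n x n (rows are the w_j); X is n x batchSize (columns are samples).
      def negentropy(W: Array[Array[Double]], X: Array[Array[Double]]): Double = {
        val n = W.length
        val m = X(0).length
        // Expec[G(v)] for v ~ N(0,1), estimated here by simple Monte Carlo.
        val rng = new Random(0)
        val EGv = (1 to 100000).map(_ => G(rng.nextGaussian())).sum / 100000.0
        var total = 0.0
        for (j <- 0 until n) {
          // sample mean of G(w_j^T x^(i)) over the columns of X
          var meanG = 0.0
          for (i <- 0 until m) {
            var dot = 0.0
            for (k <- 0 until n) dot += W(j)(k) * X(k)(i)
            meanG += G(dot)
          }
          meanG /= m
          total += (meanG - EGv) * (meanG - EGv)
        }
        total
      }
    }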
An n x batchSize matrix, where each column corresponds to a data sample.
An intermediate matrix that stores the w_j^T * x^(i) values.
The current pass through the data.
This performs the matrix fixed point update to the estimated W = A^{-1}:
W^+ = W + diag(alpha_i) * [ diag(beta_i) - Expec[g(Wx)*(Wx)^T] ] * W,
where g = G', beta_i = -Expec[(Wx)_i * g(Wx)_i], and alpha_i = -1/(beta_i - Expec[g'(Wx)_i]). We need to be careful to take expectations of the appropriate terms. The gwtx and g_wtx matrices hold useful intermediate values computed over the full data matrix X rather than a single column/element x. The above update for W^+ goes in updatemats(0), except for the additive W term, since that is handled by the ADAGrad updater.
I don't THINK anything here changes if the data is not white, since one of Hyvärinen's papers implied that the update here includes an approximation to the inverse covariance matrix.
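For reference, a plain-Scala sketch of this fixed point step is below. It mirrors the formula above with the logcosh choice g = tanh and g' = 1 - tanh^2; the array-based matrix helpers are illustrative, not BIDMach's actual matrix code:

    object FixedPointUpdate {
      type Mat = Array[Array[Double]]

      // naive dense matrix product
      def matmul(a: Mat, b: Mat): Mat = {
        val (n, k, m) = (a.length, b.length, b(0).length)
        Array.tabulate(n, m)((i, j) => (0 until k).map(p => a(i)(p) * b(p)(j)).sum)
      }

      def g(u: Double): Double  = math.tanh(u)                       // g = G' for logcosh
      def gp(u: Double): Double = 1.0 - math.tanh(u) * math.tanh(u)  // g'

      // One update W^+ = W + diag(alpha) * ( diag(beta) - Expec[g(Y)*Y^T] ) * W, with Y = W*X.
      def update(w: Mat, x: Mat): Mat = {
        val n = w.length
        val m = x(0).length
        val y = matmul(w, x)                      // n x batchSize; row j holds the w_j^T x^(i) values
        // Expec[g(Y)*Y^T]: average of g(y^(i)) * (y^(i))^T over the batch (an n x n matrix)
        val EgyyT = Array.tabulate(n, n)((a, b) =>
          (0 until m).map(i => g(y(a)(i)) * y(b)(i)).sum / m)
        // beta_i = -Expec[y_i * g(y_i)],  alpha_i = -1 / (beta_i - Expec[g'(y_i)])
        val beta  = Array.tabulate(n)(i => -((0 until m).map(c => y(i)(c) * g(y(i)(c))).sum / m))
        val alpha = Array.tabulate(n)(i => -1.0 / (beta(i) - (0 until m).map(c => gp(y(i)(c))).sum / m))
        // bracket = diag(beta) - Expec[g(Y)*Y^T], then delta = bracket * W
        val bracket = Array.tabulate(n, n)((a, b) => (if (a == b) beta(a) else 0.0) - EgyyT(a)(b))
        val delta = matmul(bracket, w)
        Array.tabulate(n, n)((a, b) => w(a)(b) + alpha(a) * delta(a)(b))  // W + diag(alpha) * delta
      }
    }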
An n x batchSize matrix, where each column corresponds to a data sample.
An intermediate matrix that stores the w_j^T * x^(i) values.
The current pass through the data.
Store data in "user" for use in the next mupdate() call, and updates the moving average if necessary.
Store data in "user" for use in the next mupdate() call, and updates the moving average if necessary. Also "orthogonalizes" the model matrix after each update, as required by the algorithm.
First, it checks if this is the first pass over the data, and if so, updates the moving average assuming that the number of data samples in each block is the same for all blocks. After the first pass, the data mean vector is fixed in modelmats(1). Then the data gets centered via: "data ~ data - modelmats(1)".
We also use "user ~ mm * data" to store all (w_jT) * (x{i}) values, where w_jT is the jth row of our estimated W = A{-1}, and x{i} is the i^{th} sample in this block of data. These values are later used as part of fixed point updates.
An n x batchSize matrix, where each column corresponds to a data sample.
An intermediate matrix that stores the w_j^T * x^(i) values.
The current pass through the data.
Independent Component Analysis, using FastICA. It has the ability to center and whiten data. It is based on the method presented in:
A. Hyvärinen and E. Oja. Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5):411-430, 2000.
In particular, we provide the logcosh, exponential, and kurtosis "G" functions.
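For reference, the usual forms of these contrast functions and their derivatives, as given in the paper above, are sketched below in plain Scala (the exact scalings and constants used by this implementation may differ):

    object ContrastFunctions {
      // logcosh:     G(u) = log(cosh(u)),      g(u) = G'(u) = tanh(u)
      def logcoshG(u: Double): Double = math.log(math.cosh(u))
      def logcoshg(u: Double): Double = math.tanh(u)

      // exponential: G(u) = -exp(-u^2 / 2),    g(u) = u * exp(-u^2 / 2)
      def expG(u: Double): Double = -math.exp(-u * u / 2)
      def expg(u: Double): Double = u * math.exp(-u * u / 2)

      // kurtosis:    G(u) = u^4 / 4,           g(u) = u^3
      def kurtG(u: Double): Double = u * u * u * u / 4
      def kurtg(u: Double): Double = u * u * u
    }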
This algorithm computes the following modelmats array:
> modelmats(0) stores the inverse of the mixing matrix. If X = A*S represents the data, then it's the estimated A^{-1}, which we assume is square and invertible for now.
> modelmats(1) stores the mean vector of the data, which is computed entirely on the first pass. This means once we estimate A^{-1} in modelmats(0), we need to first shift the data by this amount, and then multiply to recover the (centered) sources. Example:

    modelmats(0) * (data - modelmats(1))
Here, data is an n x N matrix, whereas modelmats(1) is an n x 1 matrix. For efficiency reasons, we assume a constant batch size for each block of data when we take the mean across all batches. This holds for every batch except (usually) the last one, which is almost never enough to make a difference.
Thus, modelmats(1) helps to center the data. The whitening in this algorithm happens during the updates to W, in both the orthogonalization and the fixed point steps. The former uses the computed covariance matrix, and the latter relies on W^T*W approximating the inverse covariance matrix. It is fine if the data is already pre-whitened before being passed to BIDMach.
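As a concrete (toy) illustration of the recovery expression above, here is a small plain-Scala sketch that uses a known 2 x 2 mixing matrix and its exact inverse in place of the learned modelmats(0) and modelmats(1); it only demonstrates the algebra, not the BIDMach API:

    object RecoverSources {
      type Mat = Array[Array[Double]]

      def matmul(a: Mat, b: Mat): Mat = {
        val (n, k, m) = (a.length, b.length, b(0).length)
        Array.tabulate(n, m)((i, j) => (0 until k).map(p => a(i)(p) * b(p)(j)).sum)
      }

      def main(args: Array[String]): Unit = {
        val A    = Array(Array(2.0, 1.0), Array(1.0, 1.0))             // mixing matrix
        val Ainv = Array(Array(1.0, -1.0), Array(-1.0, 2.0))           // its inverse (stands in for modelmats(0))
        val S    = Array(Array(1.0, -1.0, 2.0), Array(0.5, 0.0, -0.5)) // 2 sources, 3 samples
        val X    = matmul(A, S)                                        // observed data, X = A*S

        // column mean of X (stands in for modelmats(1))
        val m  = X(0).length
        val mu = X.map(row => row.sum / m)

        // centered recovery: modelmats(0) * (data - modelmats(1))
        val centered  = Array.tabulate(X.length, m)((r, c) => X(r)(c) - mu(r))
        val recovered = matmul(Ainv, centered)  // equals S with its own column mean subtracted
        recovered.foreach(row => println(row.mkString("  ")))
      }
    }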
Currently, we are thinking about the following extensions:
> Allowing ICA to handle non-square mixing matrices. Most research about ICA assumes that A is n x n.
> Improving the way we handle the computation of the mean, so it doesn't rely on the last batch being of similar size to all prior batches. Again, this is minor, especially for large data sets.
> Thinking of ways to make this scale better to a large variety of datasets.
For additional references, see Aapo Hyvärinen's other papers, and visit: http://research.ics.aalto.fi/ica/fastica/