This paper was submitted to IEEE transations on signal processing and
it was rejected as below. Reviewer 1 had some useful comments, the
paper should certainly have include much more comprehensive testing of
the algorithm and it was a mistake to test on ogg files. However, I
decided not to carry on; for a side project, I had spent quite a lot
of time on this and, in retrospect, if it was ever really going to
work very nicely it should not have been so tricky to get it working
at all.


Reviewer Comments:

Reviewer: 1

Recommendation: R - Reject (A Major Rewrite Is Required; Encourage Resubmission)

Comments: The author describes a new algorithm for learning sets of
(time-domain) filters in a ``source-filter'' modelling of sound, based
on nonnegativity (and also sparsity) of the activation
coefficients. The proposed algorithm is technically sound and novel,
but I have serious concerns regarding the presentation of the work,
and most importantly, the lack of significant results.

First, there are indeed practically no results. I agree that
qualitative evaluation of learning algorithms is rather tricky, but
what is usually done is at least to display the learnt filters and
activation coefficient patterns obtained from a given signal or a
class of signals, so as to get a visual idea of what is actually
retrieved. Then quantitative evaluation can be carried out with
respect to a given task, such as coding, denoising, source separation,
music transcription, etc. Giving a mere SNR between the original and
reconstructed signal is clearly not enough to give a sense of the
output of the algorithm.

Second, the author states that the data was obtained from a Ogg file,
i.e, the output of a coding algorithm which has already done its best
to eliminate redundancy. I do not think that learning features from
such data makes much sense ! I urge the author to work with raw data.

Third, I did not find the presentation of the work very accessible. It
is not clear what is exactly factorized into what ? What is the model
you want to fit to your data ? Would you be able to express it in
matrix form ? I did not find the notations very suited to a signal
processing audience : e.g, the distinction between signals and
matrices is not clear, I did not find the $\bar{i}$ shorthand to
denote sums very helpful, etc. I did not find mixture of discrete
signals and continuous convolution very elegant either. These are
however less important remarks.

Fourth, references about related works in learning of audio features
are missing. I thinking in particular about recent papers from Lewicki
:

E. Smith and M. S. Lewicki, Efficient Auditory Coding, Nature, 439 (7079), 2006.
M. S. Lewicki, Efficient coding of natural sounds, Nature Neuroscience, 5 (4): 356-363, 2002.
http://www.cnbc.cmu.edu/cplab/publications.html

I also believe that more discussion about the work of Smaragdis and Plumbley is required, with comparative experimental results.

Other comments :

I did not see where you need NMF in the paper ? 

Do you explicitly need Eq. (7) to (11) somewhere ?  

More discussion about the relevance of assuming h(t) nonnegative would be welcome.  

In Eq (14), s -> s(t) ?  

There's practically no difference between Algorithm 1 and 2; I'd suggest you merge them using "switch ... case ... end" structure so as to gain space ?


Reviewer: 2

Recommendation: R - Reject (Paper Is Not Of Sufficient Quality Or
Novelty To Be Published In This Transactions)

Comments: The author proposes two algorithms to solve a non-negative
matrix deconvolution problem under sparsity constraints. The proposed
algorithms are very heuristic and not sufficiently justified
concerning their convergence. Also, there is no enough simulations to
support the performance and the comparison between the two
algorithms. Many works in this field are not cited, for example those
treating this problem in the source separation community.