J. B. Allen and D. A. Berkley, Image method for efficiently simulating small-room acoustics, The Journal of the Acoustical Society of America, vol.65, issue.4, pp.943-950, 1979.

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper et al., Deep Speech 2: End-to-end speech recognition in English and Mandarin, Proc. Intl. Conference on Machine Learning, pp.173-182, 2016.

V. Andrei, H. Cucu, and C. Burileanu, Detecting overlapped speech on short timeframes using deep learning, Proc. Interspeech Conf, 2017.

V. Andrei, H. Cucu, A. Buzo, and C. Burileanu, Counting competing speakers in a time frame - human versus computer, Proc. Interspeech Conf, 2015.

X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland et al., Speaker diarization: A review of recent research, IEEE Trans. Audio, Speech, Lang. Process, vol.20, issue.2, pp.356-370, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00733397

T. Arai, Estimating number of speakers by the modulation characteristics of speech, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, vol.2, p.197, 2003.

S. Araki, T. Nakatani, H. Sawada, and S. Makino, Stereo source separation and source counting with MAP estimation with Dirichlet prior considering spatial aliasing problem, Proc. Intl. Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pp.742-750, 2009.

S. Arberet, R. Gribonval, and F. Bimbot, A robust method to count and locate audio sources in a multichannel underdetermined mixture, IEEE Trans. Signal Process, vol.58, issue.1, pp.121-133, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00305435

C. Arteta, V. Lempitsky, and A. Zisserman, Counting in the wild, European Conference on Computer Vision, pp.483-498, 2016.

J. Berger, Statistical Decision Theory and Bayesian Analysis, 1985.

J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyper-parameter optimization, Advances in neural information processing systems, pp.2546-2554, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00642998

K. Boakye, O. Vinyals, and G. Friedland, Two's a crowd: Improving speaker diarization by automatically identifying and excluding overlapped speech, Proc. Interspeech Conf, 2008.

L. Boominathan, S. S. Kruthiventi, and R. V. Babu, Crowdnet: A deep convolutional network for dense crowd counting, Proc. ACM Intl. Conference on Multimedia (ACMMM), pp.640-644, 2016.

A. S. Bregman, Auditory scene analysis: The perceptual organization of sound, 1994.

A. B. Chan and N. Vasconcelos, Bayesian Poisson regression for crowd counting, Proc. IEEE Intl. Conference on Computer Vision (ICCV), pp.545-551, 2009.

P. Chattopadhyay, R. Vedantam, R. R. Selvaraju, D. Batra, and D. Parikh, Counting everyday objects in everyday scenes, Proc. Intl. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

K. Choi, G. Fazekas, M. Sandler, and K. Cho, Convolutional recurrent neural networks for music classification, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.2392-2396, 2017.

K. P. Choi, On the medians of gamma distributions and an equation of Ramanujan, Proceedings of the American Mathematical Society, vol.121, pp.245-251, 1994.

F. Chollet et al., Keras, https://github.com/fchollet/keras, 2015.

S. Dieleman and B. Schrauwen, End-to-end learning for music audio, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.6964-6968, 2014.

L. Drude, A. Chinaev, D. H. Vu, and R. Haeb-Umbach, Source counting in speech mixtures using a variational EM approach for complex Watson mixture models, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.6834-6838, 2014.

N. Fallah, H. Gu, K. Mohammad, S. A. Seyyedsalehi, K. Nourijelyani et al., Nonlinear Poisson regression using neural networks: a simulation study, Neural Computing and Applications, vol.18, issue.8, p.939, 2009.

D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, Speaker diarization using deep neural network embeddings, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.4930-4934, 2017.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett et al., DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993.

J. T. Geiger, F. Eyben, B. W. Schuller, and G. Rigoll, Detecting overlapping speech with long short-term memory recurrent neural networks, Proc. Interspeech Conf, pp.1668-1672, 2013.

E. M. Grais and M. D. Plumbley, Single channel audio source separation using convolutional denoising autoencoders, Proc. GlobalSIP, pp.1265-1269, 2017.

G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel et al., The ETAPE Corpus for the Evaluation of Speech-based TV Content Processing in the French Language, Proc. Eighth International Conference on Language Resources and Evaluation (LREC), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00712591

E. A. Habets, Room impulse response (RIR) generator, 2016.

G. Hagerer, V. Pandit, F. Eyben, and B. Schuller, Enhancing LSTM RNN-based speech overlap detection by artificially mixed data, Proc. Audio Eng. Soc. Conference on Semantic Audio, 2017.

J. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, Deep clustering: Discriminative embeddings for segmentation and separation, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.31-35, 2016.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. R. Mohamed et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, vol.29, issue.6, pp.82-97, 2012.

S. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions, Int. J. Uncertain. Fuzziness Knowl.-Based Syst, vol.6, issue.2, pp.107-116, 1998.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput, vol.9, issue.8, pp.1735-1780, 1997.

M. Hrúz and M. Kunešová, Convolutional neural network in the task of speaker change detection, International Conference on Speech and Computer, pp.191-198, 2016.

M. Huijbregts, D. A. van Leeuwen, and F. de Jong, Speech overlap detection in a two-pass speaker diarization system, Proc. Interspeech Conf, 2009.

D. Huron, Voice denumerability in polyphonic music of homogeneous timbres, Music Perception: An Interdisciplinary Journal, vol.6, issue.4, pp.361-382, 1989.

T. F. Jaeger, Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models, Journal of memory and language, vol.59, issue.4, pp.434-446, 2008.

W. S. Jevons, The power of numerical discrimination, Nature, vol.3, issue.67, pp.281-282, 1871.

Y. Jiao, M. Tu, V. Berisha, and J. Liss, Online speaking rate estimation using recurrent neural networks, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.5245-5249, 2016.

M. Kashino and T. Hirahara, One, two, many -judging the number of concurrent talkers, J. Acoust. Soc. Am, vol.99, issue.4, pp.2596-2603, 1996.

T. Kawashima and T. Sato, Perceptual limits in a simulated cocktail party, Attention, Perception, & Psychophysics, vol.77, pp.2108-2120, 2015.

A. Khan, S. Gould, and M. Salzmann, Deep convolutional neural networks for human embryonic cell counting, European Conference on Computer Vision, pp.339-348, 2016.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proc. ICLR, 2014.

A. Lefevre, F. Bach, and C. Févotte, Itakura-Saito nonnegative matrix factorization with group sparsity, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.21-24, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00567344

S. Leglaive, R. Hennequin, and R. Badeau, Singing voice detection with deep recurrent neural networks, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.121-125, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01110035

B. Loesch and B. Yang, Source number estimation and clustering for underdetermined blind source separation, Proc. Intl. Workshop Acoust. Echo Noise Control (IWAENC), 2008.

M. Marsden, K. McGuinness, S. Little, and N. E. O'Connor, Fully convolutional crowd counting on highly congested scenes, 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2017.

C. E. McCulloch and J. M. Neuhaus, Generalized Linear Mixed Models, 2006.

A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah et al., DCASE 2017 challenge setup: Tasks, datasets and baseline system, Proc. DCASE 2017 Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01627981

A. Mesaros, T. Heittola, and T. Virtanen, TUT database for acoustic scene classification and sound event detection, Proc. European Signal Processing Conf. (EUSIPCO), 2016.

S. Mirzaei and Y. Norouzi, Blind audio source counting and separation of anechoic mixtures using the multichannel complex NMF framework, Signal Processing, vol.115, pp.27-37, 2015.

H. Osser and F. Peng, A cross cultural study of speech rate, Language and Speech, vol.7, issue.2, pp.120-125, 1964.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.5206-5210, 2015.

S. Pasha, J. Donley, and C. Ritz, Blind speaker counting in highly reverberant environments by clustering coherence features, Asia-Pacific Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2017.

D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, Source counting in real-time sound source localization using a circular microphone array, IEEE Signal Processing Workshop on Sensor Array and Multichannel (SAM), pp.521-524, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00772688

P. Comon and C. Jutten, Handbook of Blind Source Separation, 2010.

J. Pons, T. Lidy, and X. Serra, Experimenting with musically motivated convolutional neural networks, Intl. Workshop on Content-Based Multimedia Indexing (CBMI), pp.1-6, 2016.

J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, Timbre analysis of music audio signals with convolutional neural networks, Proc. European Signal Processing Conf. (EUSIPCO), 2017.

V. S. Ramaiah and R. R. Rao, Speaker diarization system using HXLPS and deep neural network, Alexandria Engineering Journal, 2017.

S. H. Rezatofighi, V. K. Bg, A. Milan, E. Abbasnejad, A. Dick et al., DeepSetNet: Predicting sets with deep neural networks, Proc. IEEE Intl. Conference on Computer Vision (ICCV), 2017.

M. Rouvier, P. Bousquet, and B. Favre, Speaker diarization through speaker embeddings, Proc. European Signal Processing Conf. (EUSIPCO), pp.2082-2086, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01194233

M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin et al., An open-source state-of-the-art toolbox for broadcast news diarization, Proc. Interspeech Conf, 2013.
URL : https://hal.archives-ouvertes.fr/hal-01433449

T. Sainath, O. Vinyals, A. Senior, and H. Sak, Convolutional, long short-term memory, fully connected deep neural networks, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.4580-4584, 2015.

H. Sayoud and S. Ouamour, Proposal of a new confidence parameter estimating the number of speakers - an experimental investigation, Journal of Information Hiding and Multimedia Signal Processing, vol.1, issue.2, pp.101-109, 2010.

J. Schlüter, Learning to pinpoint singing voice from weakly labeled examples, Proc. Intl. Society for Music Information Retrieval Conference (ISMIR), pp.44-50, 2016.

J. Schlüter and T. Grill, Exploring data augmentation for improved singing voice detection with neural networks, Proc. Intl. Society for Music Information Retrieval Conference (ISMIR), pp.121-126, 2015.

M. Schoeffler, F. Stöter, H. Bayerlein, B. Edler, and J. Herre, An experiment about estimating the number of instruments in polyphonic music: a comparison between internet and laboratory results, Proc. Intl. Society for Music Information Retrieval Conference (ISMIR), pp.389-394, 2013.

S. Seguí, O. Pujol, and J. Vitria, Learning to count with deep object features, Proc. Intl. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp.90-96, 2015.

M. L. Seltzer, D. Yu, and Y. Wang, An investigation of deep neural networks for noise robust speech recognition, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.7398-7402, 2013.

N. Shokouhi and J. H. Hansen, Teager-Kaiser energy operators for overlapped speech detection, IEEE/ACM Trans. Audio, Speech, Lang. Process, vol.25, issue.5, pp.1035-1047, 2017.

K. Simonyan, A. Vedaldi, and A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, CoRR, 2013.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. ICLR, 2015.

J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, Striving for simplicity: The all convolutional net, CoRR, abs/1412.6806, 2014.

F. Stöter, M. Schoeffler, B. Edler, and J. Herre, Human ability of counting the number of instruments in polyphonic music, Proceedings of Meetings on Acoustics, vol.19, 2013.

F. Stöter, S. Chakrabarty, B. Edler, and E. A. Habets, Classification vs. regression in supervised learning for single channel speaker count estimation, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, 2018.

G. ten Hoopen and J. Vos, Effect on numerosity judgment of grouping of tones by auditory channels, Attention, Perception, & Psychophysics, vol.26, pp.374-380, 1979.

S. Uhlich, F. Giron, and Y. Mitsufuji, Deep neural network based instrument extraction from music, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.2135-2139, 2015.

O. Walter, L. Drude, and R. Haeb-Umbach, Source counting in speech mixtures by nonparametric Bayesian estimation of an infinite Gaussian mixture model, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.459-463, 2015.

C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao, Deep people counting in extremely dense crowds, Proc. ACM Intl. Conference on Multimedia (ACMMM), pp.1299-1302, 2015.

D. Wang, X. Zhang, and Z. Zhang, THCHS-30: A free Chinese speech corpus, 2015.

C. Xu, S. Li, G. Liu, Y. Zhang, E. Miluzzo et al., Crowd++: Unsupervised speaker count with smartphones, Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp.43-52, 2013.

S. H. Yella, A. Stolcke, and M. Slaney, Artificial neural network features for speaker diarization, IEEE Workshop on Spoken Language Technology (SLT), pp.402-406, 2014.

R. Yin, H. Bredin, and C. Barras, Speaker change detection in broadcast TV using bidirectional long short-term memory networks, Proc. Interspeech Conf., ISCA, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01690244

D. Yu, M. Kolbaek, Z. Tan, and J. Jensen, Permutation invariant training of deep models for speaker-independent multi-talker speech separation, Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, 2017.

C. Zhang, H. Li, X. Wang, and X. Yang, Cross-scene crowd counting via deep convolutional neural networks, Proc. Intl. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp.833-841, 2015.

J. Zhang, S. Ma, M. Sameki, S. Sclaroff, M. Betke et al., Salient object subitizing, Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp.4045-4054, 2015.

E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, Convolutional recurrent neural networks for polyphonic sound event detection, IEEE/ACM Trans. Audio, Speech, Lang. Process, vol.25, issue.6, pp.1291-1303, 2017.