Classification de textes : de nouvelles pondérations adaptées aux petits volumes [Text classification: new weightings suited to small volumes]

Flavien Bouillot 1, 2
1 ADVANSE - ADVanced Analytics for data SciencE
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
Abstract : Classification is a constant and largely unconscious part of everyday life. When we have to make a decision about something (an object, an event, a person), we instinctively think of similar elements in order to adapt our choices and behavior. Assigning an element to a particular category relies on past experience and on the characteristics of the element. The more numerous and accurate the past experiences, the more relevant the decision. The same holds when we need to categorize a document based on its content, for example to determine whether it is a children's story or a philosophical treatise. This task is of course easier when a large number of works from both categories are available and when each work contains a large number of words. In this thesis we address precisely the case where few training documents are available and where the documents contain a limited number of words. To this end we propose a new approach based on new weighting schemes, which lets us determine accurately the weight to give to each word of a document. To optimize processing, the approach is configurable: five parameters make it adaptable regardless of the classification problem at hand. Numerous experiments were conducted on various types of documents, in different languages and in different configurations. Depending on the corpus, they show that our proposal achieves better results than the best approaches in the literature for small datasets. The use of parameters adds complexity, since optimal values must then be determined. Finding the best settings and the best algorithms is a difficult task, and its difficulty is formalized by the No Free Lunch theorem. We address this second problem by proposing a new meta-classification approach based on the concepts of distance and semantic similarity.
Specifically, we propose new meta-features suited to document classification. This original approach achieves results comparable to the best approaches in the literature while providing additional features. In conclusion, the work presented in this manuscript has been integrated into several technical implementations: one in the Weka software, one in an industrial prototype, and a third in the product of the company that funded this work.
Document type :
Theses

Cited literature [162 references]

https://hal-lirmm.ccsd.cnrs.fr/tel-01379336
Contributor : Pascal Poncelet
Submitted on : Tuesday, October 11, 2016 - 1:58:17 PM
Last modification on : Thursday, May 24, 2018 - 3:59:25 PM
Long-term archiving on : Saturday, February 4, 2017 - 6:52:41 PM

Identifiers

  • HAL Id : tel-01379336, version 1

Citation

Flavien Bouillot. Classification de textes : de nouvelles pondérations adaptées aux petits volumes. Base de données [cs.DB]. Université de Montpellier, 2015. Français. ⟨tel-01379336⟩
