A Text Mining Pipeline for Mining the Quantum Cascade Laser Properties
Abstract
The development of terahertz quantum cascade lasers (QCLs) has opened up
great potential for industrial applications. These lasers emit terahertz
electromagnetic waves, in the frequency range from about 100 GHz to 10 THz.
There is a need to understand how the structure of a laser influences its
performance in order to optimize the design process.
One way of collating this information is to build ontologies and knowledge bases
that capture the various QCL designs and their performance characteristics. The
majority of laser design data is contained in the scientific literature. The main
drawback of such textual data sources is their unstructured nature. The complexity
of laser designs and the varying language styles of authors make this information
difficult to retrieve. Consequently, existing methods need improvement in order to
retrieve laser information with high precision (a minimal number of incorrect
records extracted) and high recall (a minimal number of correct records missed).
In this paper, we tackle this challenge by proposing a text mining pipeline for
mining QCL properties, which extends the grammar rules of a conditional random
field (CRF) based model using a rule-based approach. The properties of interest
are the heterostructure (laser stacking properties), working temperature, lasing
frequency, laser thickness, and optical power. We evaluate the pipeline on sample
open-access journal papers from AIP, OPTICA, and IOP publishers.
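
As a rough illustration of what a rule-based extraction layer for these
properties might look like, the minimal sketch below uses regular expressions
to pick numeric property mentions out of a single sentence. The rule set, the
function name extract_properties, and the sample sentence are all hypothetical
and are not taken from the pipeline described here; the CRF component is not
shown, and heterostructure extraction is omitted because it is not a simple
number-plus-unit pattern.

```python
import re

# Hypothetical number-plus-unit rules, one per property.
# These are illustrative assumptions, not the paper's actual grammar rules.
NUMBER = r"(\d+(?:\.\d+)?)"
RULES = {
    "working_temperature": re.compile(NUMBER + r"\s*(K)\b"),
    "lasing_frequency": re.compile(NUMBER + r"\s*(THz|GHz)\b"),
    "optical_power": re.compile(NUMBER + r"\s*(mW|W)\b"),
    "laser_thickness": re.compile(NUMBER + r"\s*(µm|um|nm)\b"),
}


def extract_properties(sentence):
    """Return {property: [(value, unit), ...]} for every rule that matches."""
    found = {}
    for name, pattern in RULES.items():
        matches = [(float(value), unit) for value, unit in pattern.findall(sentence)]
        if matches:
            found[name] = matches
    return found


# Made-up sentence in the style of a QCL paper, for demonstration only.
sample = "The 10 um thick device lased at 3.2 THz up to 150 K with a peak power of 5 mW."
print(extract_properties(sample))
```

Rules of this kind are easy to audit and tend to favour precision, but they
miss mentions that are phrased unusually, which is one reason to combine them
with a statistical model such as a CRF.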