A decision support system for eco-efficient biorefinery process comparison using a semantic approach

Enzymatic hydrolysis of the main components of lignocellulosic biomass is one of the promising methods for upgrading it into biofuels. Biomass pre-treatment is an essential step to reduce cellulose crystallinity, increase surface area and porosity, and separate the major constituents of biomass. The scientific literature in this domain is growing fast and could be a valuable source of data. As these abundant scientific data are mostly in textual format and heterogeneously structured, using them to compute biomass pre-treatment efficiency is not straightforward. This paper presents the implementation of a Decision Support System (DSS) based on an original pipeline coupling knowledge engineering (KE) based on semantic web technologies, soft computing techniques and environmental factor computation. The DSS allows data found in the literature to be used to assess the environmental sustainability of biorefinery systems. The pipeline makes it possible to: (1) structure and integrate relevant experimental data, (2) assess data source reliability, (3) compute and visualize green indicators taking into account data imprecision and source reliability. This pipeline has been made possible by innovative research on the coupling of ontologies, uncertainty management and propagation. In this first version, data acquisition is done by experts and facilitated by a termino-ontological resource. Data source reliability assessment is based on domain knowledge and done by experts. The operational prototype has been used by field experts on a realistic use case (rice straw). The results obtained have validated the usefulness of the system. Further work will address the question of a higher level of automation for data acquisition and data source reliability assessment.

The bioconversion of lignocellulosic biomass has been extensively studied in the past 30 years. Enzymatic hydrolysis of the main components of the biomass is one of the promising methods for upgrading it into biofuels (Fig. 1). The structural heterogeneity and complexity of cell wall constituents, such as the crystallinity of cellulose microfibrils and the specific surface area and porosity of matrix polymers, are responsible for the recalcitrance of cellulosic materials. Biomass pre-treatment is consequently an essential step to reduce cellulose crystallinity, increase surface area and porosity, and separate the major constituents of biomass (e.g. cellulose, hemicellulose, lignin, phenolic acids). The objective of such pre-treatments depends on the process type and biomass structure. Pre-treatment methods can be divided into different categories: mechanical, physical, chemical, physicochemical and biological, or various combinations of these (Fig. 2). Each method has its drawbacks, such as energy consumption, corrosion of processing tools, water consumption, introduction of inhibiting effects, or a high number of separation and purification steps. Low or no water consumption during lignocellulosic pre-treatment can decrease the generated effluents and also reduce the energy input for the biomass pre-treatment (Zhu and Pan, 2010; Barakat et al., 2014).
In recent years, environmentally friendly pre-treatments such as milling or ultrasonic, plasma and wet explosion treatments have been studied for biomasses such as woods, bagasse, rice and wheat straw (Kumar et al., 2009; Zhu and Pan, 2010; Adapa et al., 2011; Schultz Jensen et al., 2011; Sheikh et al., 2013). Currently, these processes are not cost-effective, not only because of high investment costs but also because they can be very energy-intensive. For example, the total energy requirement of milling processes depends on the physicochemical properties of the biomass and on the ratio of the particle size distributions of the material before and after milling, this ratio being strongly dependent on the equipment or machine used. The environmental factor, energy consumption and energy efficiency are classically used to compare the performances, efficiencies and environmental impacts of different pre-treatment processes (Zhu and Pan, 2010; Barakat et al., 2014; Chuetor et al., 2015). However, survey articles concerning these three criteria for chemical, physicochemical and mechanical treatment of lignocellulosic biomass remain scarce. Moreover, the rapidly increasing scientific literature in this domain would make such surveys quickly obsolete. To take advantage of this huge and potentially valuable source of information, innovative tools able to continuously integrate new information are required.
The main obstacle holding back the use of those scientific data is their textual format and heterogeneous structure. Our first aim in this paper is to show the relevance of semantic web based KE methods to structure the experimental information and express it in a standardized vocabulary. Such structuring can be done using an ontology (the semantic part of our model) to represent the experimental data of interest (see step 1 in Fig. 3). Ontologies are knowledge representation models that facilitate linkage of open data and offer automated reasoning tools. Once structured in ontologies, collected information and data are made homogeneous and can be processed to compute criteria allowing the comparison of processes.
Our second aim in this paper is to demonstrate the feasibility of a pipeline (see Fig. 3) that takes as inputs process data found in scientific documents and whose final output is a ranking of those processes integrating data source reliability. Note that our system is partially inspired by previous semantic approaches used to facilitate "a priori" calculation of environmental indicators in industrial symbiosis (Trokanas et al., 2015; Raafat et al., 2013).
To illustrate our proposal, we present a first attempt to compare different pre-treatment processes (Fig. 2) in terms of sugar yield after enzymatic hydrolysis and of environmental factor, by reusing data already published in the scientific literature. Energy efficiency is out of the scope of this paper as there is a lack of data about energy consumption in the current literature. The illustrative example concerns glucose extraction from rice straw and compares the four processes presented in Fig. 2. These processes may include a sequence of unit operations, as shown in Table 1.

[Fig. 3: pipeline. Step 1, knowledge structuring (RDF DB); Step 2, Efactor computation; Step 3.]

Table 1. Definition of process types in terms of unit operations.

Milling
Pre-milling + Ultrafine milling
Pre-milling + Physicochemical treatment + Press and separation + Ultrafine milling
Pre-milling + Physicochemical treatment + Press and separation

The scope of application, as well as its relevance within the field of Biorefineries, has been defined in close collaboration with the 3BCAR network. The 3BCAR French Carnot network (http://www.3bcar.fr) brings together researchers from seven laboratories in France (500 researchers) involved in the design of processes transforming biomass into bioenergy, bio-based materials and molecules. In the framework of the IC2ACV project financed by the 3BCAR network, the need arose for a Decision Support System (DSS) able to help researchers involved in Biorefinery design. The purpose of this DSS is to assist decision makers in making rational choices based on data and knowledge expressed by domain experts in the scientific literature. Collaborating with the 3BCAR network is a real asset, as its researchers act as a stakeholder advisory board and help to specify the project scope by defining the key parameters that must be included in the generic tool. Moreover, a preliminary communication (Busset et al., 2015) indicates that this tool is of great interest for the international research community in the field of Biorefinery and for industrial companies. The DSS developed in this context aims at achieving progress in the complex issue of assessing the environmental impact of alternative Biorefinery systems. The presented DSS has the following functionalities.
(i) It is able to annotate, store and maintain potentially incomplete or imprecise data extracted from the scientific literature, in dedicated databases containing biorefinery system characteristics (e.g. glucose yield, mass balance in terms of water, chemical reagents) and process parameters (e.g. milling rotation speed, treatment duration, temperature, pressure);
(ii) It computes environmental impact indicators (called Efactor in the following) integrating data reliability aspects;
(iii) It ranks the different Biorefinery processes according to their environmental impact.
The underlying pipeline, shown in Fig. 3, is based on an original coupling of KE methods, soft computing techniques and environmental factor computation to assess the environmental sustainability of Biorefinery systems (see steps 2 and 3 of Fig. 3). To our knowledge, there is currently no data treatment pipeline similar to the one designed and implemented in this study. This pipeline takes as input a set of scientific papers extracted from bibliographical resources on the Web and generates as output a ranking of biorefinery processes based on environmental impact assessment.
The paper is organized as follows. Our approach is compared to the state of the art in Section 2. Functional specifications of the DSS are introduced in Section 3. The corresponding software architecture is detailed in Section 4. Implementation details of the graphical user interface in Java, the RDF database management system with Jena, the environmental impact calculations, as well as the DSS assessment are presented in Section 5. Section 6 concludes the paper.

Comparison with the state of the art
Our DSS, which will be described in more detail in Section 4, is a pipeline composed of three steps: (1) annotation, guided by an ontology, of experimental data published in scientific papers, (2) annotated data extraction and Efactor indicator computation, (3) process ranking, including yield and Efactor indicator visualization in graphical maps. To our knowledge, no similar methodology exists in the literature, in particular for steps 2 and 3. We can however provide some elements of comparison for the first step of the pipeline: experimental data structuring using the semantic tool (including data annotation and querying guided by an ontology) proposed in this paper, called @Web (for Annotated Tables from the Web). In general, relevant experimental data are scattered in different parts of the document and expressed in different formats. For example, in Biorefinery-related papers, unit operation controlled parameters are often described in sentences within the Material and Method section, while experimental results are presented in tables located in the Results and Discussion section. In this context, the automation of information extraction and annotation from text and tables is a major issue that should be discussed. Let us first discuss the automatic extraction from text of relevant and related pieces of information. The literature on this topic is twofold.
On the one hand, a substantial amount of work on binary relation extraction has been done. The first approaches to discover relations between entities focused on a limited linguistic context and relied on discovering co-occurrences and manually designed pattern matching (Huang et al., 2004). Rule-based techniques defined as regular expressions over words or part-of-speech (POS) tags have been used to construct linguistic or syntactic patterns (Proux, 2000; Hao et al., 2005; Hawizy et al., 2011; Raja et al., 2013). However, manually defined rules require heavy human effort. Later on, machine learning based approaches, e.g. Support Vector Machines (SVMs) (Minard et al., 2011), were widely employed (Rosario, 2005; Miwa et al., 2009; Van Landeghem et al., 2009; Zhang, 2011) to solve classification tasks (Rosario and Hearst, 2004). Those methods have shown their usefulness but require a large amount of annotated training data, which usually takes tremendous human effort to produce. Being based on numerical models, they may not be directly understandable by the final user.
On the other hand, the extraction of n-ary relations (i.e. relations having more than 2 arguments) is a more complex issue, though it is needed in our application context. Existing work divides n-ary relation extraction into three main steps. The first step consists in identifying entities (or arguments) using resources such as ontologies or dictionaries. The second one consists in identifying the trigger word of the relation using dictionary-based methods or rule-based approaches to construct patterns from dependency parse results (Le Minh et al., 2011), or using machine learning methods (Bjorne et al., 2009; Buyko et al., 2009; Bui et al., 2011; Zhou et al., 2014) in order to predict the trigger word of the relation. Finally, in the third step, binary relations are constructed using the trigger word, and machine learning methods are used to classify whether or not binary relations belong to the searched n-ary relation, but with a substantial loss of accuracy. Relation extraction methods are in general based on those three independent steps. In our context, arguments of the n-ary relation can be implicitly expressed in the text and usually appear in several sentences. Therefore, state-of-the-art methods, which assume the presence of a trigger word, are not directly usable.

Let us now discuss automatic extraction of relevant information from tables. State-of-the-art methods and tools (Knoblock et al., 2012; Buche et al., 2013; Tian et al., 2013; Zhang, 2014) make the assumption that data tables are organized in the same way as in relational databases: a data table is composed of vertical columns (each column corresponding to a single feature, for example biomass, temperature, etc.), themselves composed of cells. Unfortunately, this assumption is not always valid for data tables published in scientific articles. Various features may be present in the same column (temperature and treatment duration may be given in the header of a column corresponding to the process yield) or tables may have two entries (vertical and horizontal). Therefore, robust automatic data table pre-processing must be designed and validated in order to apply state-of-the-art tools to data tables extracted from scientific documents.
Since automatic extraction of relevant information from text and tables of scientific articles is still an active research topic, it is not yet ready for use in an operational system. Hence, in our pipeline (see Fig. 3), annotation is performed manually, the ontology being used to guide the annotator. In this way, we may consider the annotation process as semi-automated.
The only tool comparable with @Web to implement the first step of the DSS is, to the best of our knowledge, Rosanne (Rijgersberg et al., 2011), an Excel "add-in" application built upon the OM ontology, an ontology of quantities and units of measure. Rosanne allows quantities and units of measure associated with columns of an Excel table to be annotated using concepts from OM. As @Web does, Rosanne manages the notion of phenomenon, very similar to the notion of symbolic concept in @Web, which represents non-numerical data, for example process name or type of material. The main difference is that @Web defines the notion of relation, which links together data (studied object with controlled parameters and results) in order to represent a whole experiment. This notion is important in the DSS, being used to extract annotated data in order to compute Efactor indicators. There is no such notion, nor an equivalent one, available in Rosanne. The authors of Rosanne chose to develop an Excel "plug-in" that brings semantics to data in Excel, a tool widely used in the scientific community. Being a Web application, @Web was naturally designed as a collaborative platform to share documents (for example scientific articles) associated with annotated tables. Moreover, @Web proposes an end-user graphical interface to query annotated tables (see Section 4.1), which is not available in the current version of Rosanne. For its part, Rosanne proposes an interesting functionality to merge several annotated tables sharing a column annotated with the same concept. In conclusion, the @Web and Rosanne tools are complementary and are based on a partly common ontological representation, the quantity and unit component of @Web being very close to the one used by OM. Some perspectives are given in the conclusion regarding that complementarity.

Functional specifications of the system
Since the DSS functional specifications depend on the users, the first step was to identify the potential users. They were identified in the 3BCAR network of researchers. Then, the functional specifications were determined during the IC2ACV project meetings, gathering several experts of the 3BCAR network. Finally, the functional specifications were refined during annual meetings of 3BCAR about IC2ACV results.
The following functional specifications are implemented in the IC2ACV DSS:
1. gathering, integrating and structuring heterogeneous data available in the scientific literature about biomass transformation processes;
2. defining a simple and generic ontological model of the information which must be identified and annotated in the scientific papers. The model must be simple because it should be easily updated by biorefinery experts (who are not computer scientists) and generic in order to be useful for other kinds of data managed by 3BCAR researchers (for instance packaging characteristics);
3. allowing open access to original data and associated units of measure. This can be done thanks to permalinks and a dedicated querying system managing unit conversion to facilitate data reusability;
4. assessing the reliability of data (sources) and taking it into account in the environmental impact assessment;
5. managing imprecise data, since experimental data associated with biomass (e.g. glucose rate) and biomass processes (e.g. glucose yield) are subject to uncertainty;
6. taking into account the biological variability associated with biomass processes and the subsequent uncertainty propagation during the environmental impact indicator computation;
7. computing environmental factor indicators (mass balance indicators called Efactors);
8. visualizing the ranking of biomass processes according to process yield and Efactors.

Architecture of the decision support system
This section details the three steps of the data treatment pipeline. In the first step, experimental data published in scientific papers are annotated thanks to an ontology implemented in OWL 2 DL and assessed in terms of their source reliability. Annotated data are stored in an RDF database and available in open access via permalinks, a SPARQL endpoint and a dedicated querying system guided by the ontology. The second step consists in extracting annotated data from the RDF database to compute Efactor indicators and data reliability scores. This extraction is done using SPARQL queries generated by the dedicated querying system guided by the ontology. Process yields, Efactor indicators and data reliability scores can be visualized in the third step as graphical maps. This last step provides a synthetic and global overview of biomass pre-treatment process ranking.

Heterogeneous experimental data integration (step 1)
To facilitate the integration of scientific data coming from heterogeneous sources, one relevant solution is to use ontologies (Noy, 2004; Doan et al., 2012). An ontology defines a set of primitives to model a domain of interest: classes, attributes (or properties) and relations between members of the classes (Guarino et al., 2009). The ontology is used to create and/or reuse standardized vocabularies and to index data sources with those vocabularies in order to allow data source interoperability. Our system uses @Web to capitalize on experimental data extracted from scientific documents found on the Web. Its main components are the following. @Web implements a complete workflow (see Fig. 4) to manage experimental data: extraction and semantic annotation of data from scientific documents, data source reliability assessment and uniform querying of the collected data stored in a database open on the Web. @Web relies on an Ontological and Terminological Resource (OTR) which guides the semantic annotation and querying of scientific data. An OTR associates a terminological component with an ontology in order to establish a clear distinction between the linguistic expressions in different languages (i.e. the terms) and the notion they denote (i.e. the concept) (Roche et al., 2009; McCrae et al., 2011; Cimiano et al., 2011). For instance, the English terms "Grasses and energetic plants" and "Energy crops" and the French term "Plantes énergétiques" denote the same symbolic concept Grasses and energetic plants. The OTR is designed to model scientific experiments. It is composed of two layers: a generic one and a specific one dedicated to a given application domain. Since the OTR is at the heart of the scientific data capitalization workflow, @Web can be reused for different application domains: only the specific part of the OTR must be redefined to reuse @Web for a new domain (see Touhami et al., 2011 for a reuse in food packaging).
Let us point out that the OTR satisfies functional specification 2 of the IC2ACV DSS presented in Section 3. @Web is composed of two sub-systems (see Fig. 4). The first one is an annotation sub-system for the acquisition and annotation, with concepts of the OTR, of experimental data extracted from scientific documents; those annotated data are stored in a database. This sub-system also allows the reliability of data sources to be assessed using the approach of Destercke et al. (2013). The second sub-system is a flexible querying system based on the approach presented in Destercke et al. (2011). @Web is implemented using the semantic web standards (XML, RDF, OWL, SPARQL): the OTR is defined in OWL 2 DL, annotated tables in XML/RDF and the querying in SPARQL. We present in Section 4.1.1 the Biorefinery OTR used in @Web. Section 4.1.2 details the model used to assess data source reliability.

Biorefinery OTR model
The OTR is designed to represent scientific experiments in order to annotate data tables in a given domain (see Touhami et al., 2011 for more details). We chose to represent an experiment by using n-ary relations between several experimental parameters and a given result. This structures information in a simple way, as requested by functional specification 2 (see Section 3). As recommended by W3C (Noy et al., 2006), we used the design pattern which represents an n-ary relation thanks to a concept associated with its arguments via properties. Let us illustrate this notion by using the example of the n-ary relation Biomass Glucose Composition Relation (see Fig. 5). This relation is characterized by 4 arguments: (1) the glucose rate, which is the experimental result, (2) the biomass, which is the studied object, and the associated experimental parameters, namely (3) the biomass state (untreated or treated) and (4) the experiment number reported in the document. This relation is used to create annotated tables, as shown in Table 2, which presents an example of an annotated table extracted from the scientific document (Hideno et al., 2009) that determines the glucose rate of rice straw in two different experiments. The columns of the annotated table correspond to the arguments of the relation Biomass Glucose Composition Relation.
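The n-ary relation pattern can be illustrated with a minimal Python sketch. It is not the @Web implementation: the relation instance is simply modelled as a node linked to each of its arguments by a property, and the triples are kept in a plain list instead of an RDF store. The property names, the instance identifier and the numerical value are hypothetical placeholders, not data taken from Hideno et al. (2009).

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object


def nary_relation(instance_id, relation_concept, arguments):
    """Reify an n-ary relation: one node typed by the relation concept,
    plus one triple per argument (the pattern described by Noy et al., 2006)."""
    triples = [Triple(instance_id, "rdf:type", relation_concept)]
    for prop, value in arguments.items():
        triples.append(Triple(instance_id, prop, value))
    return triples


def arguments_of(triples, instance_id):
    """Return the arguments attached to one relation instance."""
    return {t.predicate: t.obj for t in triples
            if t.subject == instance_id and t.predicate != "rdf:type"}


# One row of an annotated table. All names and values are illustrative only.
row = nary_relation(
    "row_1",
    "Biomass_Glucose_Composition_Relation",
    {
        "hasResult:Glucose_rate": (35.0, "%"),        # placeholder value
        "hasStudiedObject:Biomass": "Rice straw",
        "hasParameter:Biomass_state": "Untreated",
        "hasParameter:Experiment_number": 1,
    },
)
```

Each row of an annotated table then corresponds to one such relation instance, and each column to one argument property, which is what makes the later extraction of a whole experiment possible with a single query on the relation node.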
An excerpt of the global structure of the Biorefinery OTR is presented in Fig. 6. The conceptual component of the Biorefinery OTR is composed of a core ontology to represent n-ary relations between experimental data and a domain ontology to represent specific concepts of a given application domain. In the up core ontology, the generic concepts Relation and Argument represent respectively n-ary relations and arguments. The representation of n-ary relations between experimental data requires a particular focus on the management of quantities and their associated units of measure. In the down core ontology, the generic concepts Dimension, UM_Concept, Unit_Concept and Quantity allow the management of quantities and their associated units of measure. The sub-concepts of the generic concept Symbolic_Concept represent the non-numerical arguments of n-ary relations between experimental data. The domain ontology contains specific concepts of a given application domain, in this paper the Biorefinery domain. They appear as sub-concepts of the generic concepts of the core ontology. In the Biorefinery OTR, relations represent either experiments which characterize biomass (see Fig. 5) or experiments involving unit operations performed on biomass. For instance, the milling operation is represented by the n-ary relation Milling Solid Quantity Output Relation (see Fig. 7). It is characterized by 7 arguments and represents the milling solid quantity output, which is the experimental result of milling for a given biomass, associated with a set of experimental parameters: the biomass input quantity, the total pre-treatment energy used for the milling, the treatment duration, the milling rotation speed and the type of milling.
In the Biorefinery OTR, all concepts are represented as OWL classes, hierarchically organized by the subsumption relation subClassOf and pairwise disjoint.
The terminological component of the Biorefinery OTR contains the domain-related set of terms used to annotate data tables. Sub-concepts of the generic concepts Relation, Symbolic_Concept and Quantity, as well as instances of the generic concept Unit_Concept, are all denoted by at least one term of the terminological component. Each of these sub-concepts or instances is, in a given language, denoted by a preferred label and optionally by a set of alternative labels, which correspond to synonyms or abbreviations.
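As an illustration only, the link between terms and concepts can be sketched in Python as a small lookup structure. The class and function names are hypothetical, and the choice of which term is the preferred label and which is an alternative one is an assumption made for the example; only the three terms and the concept they denote come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptLabels:
    """Labels of one concept, per language (preferred and alternative)."""
    pref: dict = field(default_factory=dict)  # language -> preferred label
    alt: dict = field(default_factory=dict)   # language -> list of alternative labels


def build_index(terminology):
    """Map each lower-cased term, whatever its language, to its concept."""
    index = {}
    for concept, labels in terminology.items():
        for term in labels.pref.values():
            index[term.lower()] = concept
        for terms in labels.alt.values():
            for term in terms:
                index[term.lower()] = concept
    return index


# Assumed split between preferred and alternative labels.
terminology = {
    "Grasses_and_energetic_plants": ConceptLabels(
        pref={"en": "Grasses and energetic plants", "fr": "Plantes énergétiques"},
        alt={"en": ["Energy crops"]},
    ),
}

index = build_index(terminology)
```

With such an index, an annotator typing any of the known terms is led to the same concept, which is the behaviour expected from the separation between the terminological and conceptual components.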

Reliability assessment scores for Biorefinery
Table 2. Excerpt of the annotated table on biomass composition.
In the terminological component of the OTR, labels are associated with a concept or an instance thanks to SKOS (Simple Knowledge Organization System) labelling properties, recommended by W3C to represent controlled vocabularies associated with concepts (see the "Grasses and energetic plants" example given in the introduction of Section 4.1).

When gathering data from various documents, the question rapidly arises as to how reliable these data or these documents are. @Web proposes a reliability estimation tool, presented in detail in Destercke et al. (2013), whose basis we recall here. This tool aims at providing an automatic, a priori (that is, avoiding a specific examination) estimation of the document and data reliability from a set of meta-information related to the data and the document.
To this effect, $S$ groups $A_1, \ldots, A_S$ of meta-information important to assess reliability are first defined in accordance with decision makers and domain experts, a group $A_i$ taking $C_i$ values $a_{i1}, \ldots, a_{iC_i}$. For instance, the $C_i$ values for the group "Sugar analysis method" would be the different available methods. Various types of meta-information, summarized in Table 3, have been considered for the Biorefinery data sources: ...
• meta-information related to the means used to produce data. In papers based on experiments in Life Science, such information is typically included in a section called Material and Method, which thoroughly describes the experimental protocol and material. Some methods may be known to be less accurate than others, but are still chosen for practical considerations;
• meta-information on the data source itself, for instance the source type (e.g. scientific publication, technical report), the source reputation, citation data;
• meta-information related to statistical procedures: presence of repetitions, uncertainty quantification (i.e. variance, confidence interval), elaboration of an experimental design.
In practice, the groups are made so that their impact on reliability can be estimated independently, while a group $A_i$ may contain multiple criteria (e.g. number of citations and publication date).
For each possible value of each group, the method then consists in assessing the reliability of a document or data having this particular value. After the groups have been formed, for each value $a_{ij}$, $i = 1, \ldots, S$, $j = 1, \ldots, C_i$, an expert of the field from which data are collected gives his/her opinion on how reliable the data whose meta-information is $a_{ij}$ are. This opinion is expressed linguistically, chosen from a limited set of modalities (or combinations of them), e.g. very unreliable, slightly unreliable, neutral, slightly reliable, very reliable and unknown. The number of modalities is limited (usually 5 or 7), accounting for known limitations of human cognitive abilities (Miller, 1956). In practice, several experts of the field belonging to the 3BCAR network, specialists of different unit operations (respectively mechanical, physicochemical and biological), have been interviewed, resulting in a consensual opinion. Each document $o$ is then associated with $S$ linguistic assessments (according to the value taken by the corresponding meta-information). A missing value in the meta-information is simply treated as the linguistic assessment unknown in terms of reliability. In order to apply refined fusion techniques able to deal with potentially conflicting information (as some meta-information may indicate an unreliable document, while others may designate a rather reliable one), the linguistic assessments are translated into a numerical format, using the notion of fuzzy sets, which are adequate numerical models of linguistic values. In order to have enough flexibility, these fuzzy sets are defined on an ordered finite reliability space $\Theta = \{\theta_1, \ldots, \theta_5\}$ of 5 elements, $\theta_1$ being the lowest reliability value and $\theta_5$ the highest. The number of elements could be higher than 5, but this is a reasonable choice. Indeed, that number should remain odd in order to have a neutral element, not too low so that fuzzy sets corresponding to different terms can be numerically quite distinguishable, and not too high so that computational problems do not arise. Each modality is then transformed into a fuzzy set on $\Theta$ (see Fig. 8 for an illustration).
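The translation from linguistic assessments to fuzzy sets can be sketched as follows in Python. This is only an illustration under stated assumptions: the membership degrees are hypothetical (the actual shapes are those of Fig. 8, which is not reproduced here), the unknown modality is assumed to carry no information (membership 1 everywhere), and the group names and expert opinions are invented for the example.

```python
THETA = (1, 2, 3, 4, 5)  # ranks of theta_1 (lowest reliability) to theta_5 (highest)

# Hypothetical membership degrees, one per element of THETA.
FUZZY_SETS = {
    "very unreliable":     (1.0, 0.5, 0.0, 0.0, 0.0),
    "slightly unreliable": (0.5, 1.0, 0.5, 0.0, 0.0),
    "neutral":             (0.0, 0.5, 1.0, 0.5, 0.0),
    "slightly reliable":   (0.0, 0.0, 0.5, 1.0, 0.5),
    "very reliable":       (0.0, 0.0, 0.0, 0.5, 1.0),
    "unknown":             (1.0, 1.0, 1.0, 1.0, 1.0),  # assumed: no information
}


def assessments_for(meta_info, expert_opinions):
    """Translate the meta-information of one document into fuzzy sets.

    meta_info: group name -> observed value (absent or None if missing)
    expert_opinions: group name -> {value -> linguistic modality}
    """
    result = {}
    for group, opinions in expert_opinions.items():
        value = meta_info.get(group)
        modality = opinions.get(value, "unknown")  # missing value -> unknown
        result[group] = FUZZY_SETS[modality]
    return result


def alpha_cut(membership, alpha):
    """Elements of THETA whose membership degree is at least alpha."""
    return {t for t, mu in zip(THETA, membership) if mu >= alpha}


# Invented expert opinions for two groups of meta-information.
opinions = {
    "Source type": {"scientific publication": "very reliable",
                    "technical report": "neutral"},
    "Repetitions": {"yes": "slightly reliable", "no": "slightly unreliable"},
}
doc = {"Source type": "technical report"}  # "Repetitions" is missing
fuzzy = assessments_for(doc, opinions)
```

In this example the document receives the neutral fuzzy set for its source type and the unknown fuzzy set for the missing group, so the missing information does not favour any reliability level.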
The $S$ fuzzy sets $\mu_o^1, \ldots, \mu_o^S$ corresponding to the group reliabilities of document $o$ are then merged together using evidence theory and a maximal coherent subset approach, which allows conflicting evidence to be taken into account. Such an approach aims at reconciling all sources while keeping as much information as possible, meaning that missing information (i.e. presence of the unknown modality) does not impact the result. The potential conflict in meta-information (i.e. assessment of high reliability for one aspect but of low reliability for another one) is reflected in the imprecision of the final model: the presence of conflict will end up in a quite imprecise estimation of the reliability, while its absence will result in a quite precise estimation. The result of this merging is a mass distribution $m_o : 2^{\Theta} \to [0, 1]$ which reflects the global reliability of $o$ (see Destercke et al., 2013 for more details). The mass $m_o$ provides an accurate synthesis of the different meta-information contributions, and its analysis could allow one to automatically identify subsets of conflicting and of coherent meta-information. Yet, it is still too complex to be analysed at a glance. For this reason, a further summary is provided in the form of a lower expected reliability score

$$\underline{E}_o = \sum_{A \subseteq \Theta} m_o(A) \, \inf_{\theta \in A} f(\theta),$$

where $f(\theta_i) = i$, that is, each $\theta_i$ is replaced by its corresponding rank (a natural choice, even if other functions could be chosen). The upper score $\overline{E}_o$ is obtained with the same formula, replacing $\inf$ by $\sup$. The length or imprecision of $[\underline{E}_o, \overline{E}_o]$ reflects to which extent the various pieces of meta-information are consistent: $\underline{E}_o$ provides in some way a "worst" possible reliability, while $\overline{E}_o$ corresponds to a "best" possible reliability, the two being close to each other when information is consistent. These scores can then be used in the querying system to rank annotated data associated with documents according to their reliability, where one can adopt different strategies: e.g., optimistic (pessimistic) by ranking according to $\overline{E}_o$ ($\underline{E}_o$).
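The interval summary can be illustrated with a short Python sketch. It assumes that the two scores are the lower and upper expectations of the rank function $f(\theta_i) = i$ under the mass distribution, which is our reading of the inf/sup description above; the merging step itself (maximal coherent subsets) is not implemented, and the two mass distributions are hypothetical.

```python
def expected_reliability(mass):
    """Lower and upper expected rank for a mass distribution.

    mass: dict mapping a non-empty frozenset of ranks (a focal set)
          to its mass; the masses are expected to sum to 1.
    """
    total = sum(mass.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("masses must sum to 1")
    # f(theta_i) = i, so inf and sup over a focal set are its min and max rank.
    lower = sum(m * min(focal) for focal, m in mass.items())
    upper = sum(m * max(focal) for focal, m in mass.items())
    return lower, upper


# Hypothetical merged results for two documents.
consistent = {frozenset({4}): 0.7, frozenset({4, 5}): 0.3}
conflicting = {frozenset({3}): 0.4, frozenset({1, 2, 3, 4, 5}): 0.6}

narrow = expected_reliability(consistent)
wide = expected_reliability(conflicting)
```

The first document, whose meta-information agrees, yields a narrow interval close to rank 4, whereas the second, where most of the mass sits on the whole reliability space, yields a much wider interval. A pessimistic ranking would sort documents by the first element of the pair and an optimistic one by the second.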
These values will also be used to compute a global reliability score R within a collection of documents which includes several experiments of the same biorefinery process (see Section 4.2). For a given set of documents $o_1, \ldots, o_n$, the global reliability score R is a value in the interval [0, 1], with R = 0 for a set of unreliable documents and R = 1 for a set of very reliable documents. As the reliability scores $[\underline{E}_{o_i}, \overline{E}_{o_i}]$ associated with documents $o_i$ are imprecise, we compute standardized bounds $[\underline{R}, \overline{R}]$, normalized with respect to $E_{min}$ and $E_{max}$, the minimal and maximal reliability scores ($E_{min} = 1$, $E_{max} = 5$ in this paper). More details are presented in (Destercke et al., 2013), in particular various means to analyse the reliability results, such as the benefits one can retrieve from imprecise assessments or the way to detect subgroups of agreeing/disagreeing meta-information.
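One possible way to obtain standardized bounds is sketched below in Python. It is only an assumption about the form of the standardization: the document bounds are averaged and then rescaled with a min-max normalization so that the result lies in [0, 1]. The function name `global_reliability` is hypothetical, and the exact formula used in the DSS may differ.

```python
from typing import List, Tuple

E_MIN, E_MAX = 1.0, 5.0  # minimal and maximal reliability scores in this paper


def global_reliability(scores: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Standardized bounds for a set of documents.

    ASSUMPTION: document bounds are averaged, then min-max normalized.
    """
    if not scores:
        raise ValueError("at least one document is required")
    n = len(scores)
    mean_low = sum(low for low, _ in scores) / n
    mean_up = sum(up for _, up in scores) / n
    scale = E_MAX - E_MIN
    return (mean_low - E_MIN) / scale, (mean_up - E_MIN) / scale


# A single document with an imprecise score such as [1.5, 4.98]
# leads to equally imprecise standardized bounds.
print(global_reliability([(1.5, 4.98)]))
```

Under this assumption, a set of documents all scored at $E_{min}$ gives R = 0 and a set all scored at $E_{max}$ gives R = 1, as required by the definition above.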

Eco-design indicators (steps 2 and 3)
As seen in the previous section, the first step of the pipeline presented in Fig. 3 allows knowledge extracted from heterogeneous data sources to be annotated and its reliability to be assessed. The second step consists in extracting annotated data from the RDF database to compute environmental impact indicators. Following functional specification 7 listed in Section 2, we now present the computation of mass balance indicators, denoted Efactor. The Efactor indicator can be seen as the total input quantity of matter not valorised into glucose but required to produce 1 kg of glucose. This indicator is often used in survey papers dedicated to the comparison of biorefinery processes (Zhu and Pan, 2010; Barakat et al., 2014; Chuetor et al., 2015). All kinds of matter which are inputs of the process are taken into account (for example, the biomass, water and chemical reagents). For a given set of n documents $o_1, \ldots, o_n$, we consider for each document $o_i$ the m experimental settings described in $o_i$, denoted $e_{i1}, \ldots, e_{im}$. In the following, we call experimental setting the set of controlled parameter adjustments for the given process (for example, in a milling, different durations are tested, resulting in different experimental settings). Each experimental setting is associated with a given biomass, denoted $biomass(e_{ij})$, which belongs to the set of l studied biomasses $b_1, \ldots, b_l$. This $biomass(e_{ij})$ has been assigned (during the first step) to a given biorefinery process, denoted $process(e_{ij})$, which belongs to the set of k alternative processes $p_1, \ldots, p_k$ (see Fig. 2), which will be compared in Section 5.2. Following functional specification 7 expressed by 3BCAR researchers, a matter balance indicator, denoted $Efactor(o_i, p, b)$, can be computed for the experimental settings $\{e_{ij}\}$ belonging to a given document $o_i$.
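Remark. As biomass quantities differ in the considered experiments, all values are normalized for 1 kg of initial biomass in order to compute comparable Efactor indicators. Efactor is defined as in (Chuetor et al., 2015):
$$Efactor = \frac{BiomassQty + ChemicalReagentQty + SolventQty - GlucoseReleasedQty}{GlucoseReleasedQty}$$
where BiomassQty is the initial biomass quantity (kg) and: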
• ChemicalReagentQty is the chemical reagent product quantity used in the process (kg).
• SolventQty is the quantity of solvent (water and/or solution) used in the process (kg).
• GlucoseReleasedQty (kg) is a quantity defined as the biomass quantity (input of the enzymatic hydrolysis unit operation) multiplied by the glucose rate (available in the raw biomass) and the glucose yield, which depends on the considered experimental setting.
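The mass balance can be sketched in a few lines of Python. The sketch reads the prose definition (input matter not valorised into glucose, per kg of glucose produced) as total inputs minus released glucose, divided by released glucose; this reading, the function names and the numeric values in the example are assumptions made for illustration.

```python
def glucose_released(biomass_kg: float, glucose_rate: float, glucose_yield: float) -> float:
    """Glucose released (kg): biomass quantity x glucose rate x glucose yield.

    glucose_rate and glucose_yield are fractions in [0, 1].
    """
    return biomass_kg * glucose_rate * glucose_yield


def efactor(biomass_kg: float, reagent_kg: float, solvent_kg: float, glucose_kg: float) -> float:
    """Input matter not valorised into glucose, per kg of glucose released.

    ASSUMPTION: Efactor = (sum of inputs - glucose) / glucose.
    """
    if glucose_kg <= 0:
        raise ValueError("released glucose must be positive")
    return (biomass_kg + reagent_kg + solvent_kg - glucose_kg) / glucose_kg


# Hypothetical dry process on 1 kg of biomass: no reagent, no solvent.
g = glucose_released(1.0, glucose_rate=0.37, glucose_yield=0.89)
print(efactor(1.0, 0.0, 0.0, g))

# Hypothetical wet process: 10 kg of solvent for the same glucose output.
print(efactor(1.0, 0.0, 10.0, g))
```

Since the indicator decreases when more glucose is released, an interval on the glucose yield maps to an Efactor interval whose lower bound comes from the highest yield and whose upper bound comes from the lowest yield.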
Functional specifications 5 and 6 expressed by 3BCAR researchers consist in (i) taking into account the uncertainty recorded in experimental data results and (ii) propagating uncertainties in the Efactor computation. The experimental results considered in this study are GlucoseRate and GlucoseYield. For each experiment, the available results may be given as a scalar value, as an interval, or as a tuple of the mean value and the standard deviation over the experimental repetitions. Consequently, experimental results GlucoseRate (resp. GlucoseYield) can be considered as a sample drawn from a random variable. We have noticed that, in all documents, the GlucoseYield random variable depends on experimental settings, which is not the case for the GlucoseRate random variable, whose sampling shows no variation. In the following, we propose, for a given document $o_i$, a given biomass $b \in \{b_1, \ldots, b_l\}$ and a given process $p \in \{p_1, \ldots, p_k\}$, two ways to compute $Efactor(o_i, p, b)$. The first one consists in selecting the best experimental setting presented in document $o_i$ and computing $Efactor^{best}(o_i, p, b)$. As the uncertainty level is not always available in experimental data results, we propose a second way to compute Efactor, which consists in taking into account the information provided by the entire set of settings and computing $Efactor^{all}(o_i, p, b)$. It is an indirect way to provide information about the uncertainty associated with the experimental data results of a document $o_i$. We also define, for a given biomass b and a given process p, an Efactor indicator calculated for the entire set of settings of the entire set of n documents $o_1, \ldots, o_n$.
Computing Efactor for the best experimental setting in document $o_i$: Having in mind the imprecision expressed for the random variable GlucoseYield, a pessimistic point of view will prefer to guarantee the highest minimal GlucoseYield, while an optimistic one will prefer to guarantee the highest maximal GlucoseYield. In this paper, we have chosen the pessimistic point of view to select the best experimental setting. Let us consider $\overline{GY}_{ij}$ (resp. $\sigma_{GY_{ij}}$) the mean value (resp. the standard deviation) associated with the GlucoseYield random variable of experimental setting j described in document $o_i$.
We assume that the sample is drawn from a normal distribution, the sample size being unknown (this is a reasonable assumption in such experiments). We recall that under this assumption, the 95% confidence interval of the GlucoseYield random variable is defined by $[\overline{GY}_{ij} - 2\sigma_{GY_{ij}}, \overline{GY}_{ij} + 2\sigma_{GY_{ij}}]$. We consider this interval for each experimental setting and select the setting whose lower bound is maximal. For the four experimental settings described in (Amiri et al., 2014), the results are presented in Fig. 9 for the following rice straw pre-treatment process type: "Pre Milling then Physicochemical treatment then Press and Separation" (called PM PC PS). The best experimental setting corresponds to the pessimistic choice discussed above, i.e. the one having the maximal lower bound of the 95% confidence interval associated with the GlucoseYield random variable ([33.07, 36.47]).
Computing Efactor for all of the settings in document $o_i$: In this case, we want to take into account the GlucoseYield values obtained for all of the settings. As settings are interdependent, we define the global GlucoseYield as the interval including the 95% confidence intervals associated with the GlucoseYield values obtained in all of the settings (see Fig. 9).
Computing Efactor for the entire set of settings of the entire set of documents: An aggregated mass balance indicator for a set of n documents $o_1, \ldots, o_n$ associated with a given biomass $b \in \{b_1, \ldots, b_l\}$ and a given process $p \in \{p_1, \ldots, p_k\}$ must also be computed, denoted $Efactor(p, b)$. In this case, we consider that experimental settings described in different documents are independent.
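The selection of the best setting and the construction of the global interval can be sketched as follows in Python. The (mean, standard deviation) pairs are hypothetical: one of them is chosen so that its interval equals the [33.07, 36.47] interval quoted above, and the others are made up. The function names are also illustrative.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def ci95(mean: float, std: float) -> Interval:
    """Approximate 95% interval under normality: mean +/- 2 standard deviations."""
    return (mean - 2.0 * std, mean + 2.0 * std)


def best_setting(settings: List[Tuple[float, float]]) -> int:
    """Pessimistic choice: index of the setting with the highest lower bound."""
    intervals = [ci95(mean, std) for mean, std in settings]
    return max(range(len(intervals)), key=lambda j: intervals[j][0])


def global_interval(settings: List[Tuple[float, float]]) -> Interval:
    """Smallest interval containing the 95% intervals of all settings."""
    intervals = [ci95(mean, std) for mean, std in settings]
    return (min(low for low, _ in intervals), max(up for _, up in intervals))


# Hypothetical GlucoseYield (mean %, std %) for three settings of one document.
settings = [(30.0, 1.0), (34.77, 0.85), (36.0, 3.0)]

print(best_setting(settings))     # index of the pessimistic best setting
print(global_interval(settings))  # interval used when all settings are considered
```

Note that the third setting has the highest mean but a wide interval, so the pessimistic criterion prefers the second one; an optimistic criterion would compare upper bounds instead.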

Implementation
In this section, we detail the implementation of the data treatment pipeline presented in Fig. 3. In Section 5.1, we describe the @Web software. Section 5.2 deals with the implementation of Efactor indicator computation and visualization.

@Web
We have presented in Section 4.1.1 the Biorefinery OTR, which has been defined to model experimental data in the domain of biorefinery pre-treatment processes. Biorefinery OTR is used in @Web for the task of experimental data source annotation and querying, using n-ary relation concepts. @Web relies on the generic part of the OTR model (see the core ontology in Fig. 7) and allows the management of the domain ontology of Biorefinery OTR with its associated terminology. As @Web relies on the generic part of the OTR model, several OTRs dedicated to different application domains can be managed simultaneously in @Web. For instance, in our current implementation, an OTR dedicated to gas transfer in packaging materials has also been defined and is available at http://www6.inra.fr/cati-icat-atweb. Note that the core ontology, which has been designed to be non-modifiable, is not accessible to ontology managers. Moreover, we made the choice to manage units in a transversal way by defining only one OTR of units of measure, because some units of measure may be used in different OTRs. Units can therefore be used by all the OTRs defined in @Web. The current version of the OTR of units of measure is available at http://www6.inra.fr/cati-icat-atweb (section @Web platform, thumbnail Ontology, option Unit Ontology).
Recorded tutorials of the current @Web version are available online (http://www6.inra.fr/cati-icat-atweb/Tutorials). Here we focus our presentation on the annotation sub-system of @Web, which implements the five sub-steps presented in Fig. 10. The annotation sub-system of @Web implements a complete workflow to extract experimental data from scientific documents and semantically annotate them with n-ary relation concepts defined in the Biorefinery OTR.
In the first sub-step, called Document selection, relevant documents according to the OTR are retrieved from the Web and manually selected by a domain expert. This selection may be done using classical bibliographical tools (for instance Web of Science 9 ). Documents can be uploaded in @Web from a desktop or from a collaborative repository management tool using Mendeley 10 . After document loading, @Web manages their bibliographical references as well as their entire text both in HTML and PDF formats. Documents are grouped in topics. Four topics have been defined for biorefinery and correspond to the four pre-treatment processes (see Fig. 2 and Table 1) which are compared in Section 5.2. For instance, (Amiri et al., 2014), whose experimental settings have been presented in Fig. 9, has been stored in the PM PC PS topic, which corresponds to the PM PC PS pre-treatment process presented in Table 1.
The second sub-step is dedicated to document reliability assessment using the model presented in Section 4.1.2. In the current version, meta-information associated with each document is manually entered in order to compute the reliability score. In Fig. 11, the reliability score reflects an imprecise assessment $[\underline{E}_o, \overline{E}_o] = [1.5, 4.98]$, due to a conflict between expert opinions associated with meta-information: "citation age and citation number" and "source type" are considered as very reliable, whereas "Enzymatic hydrolysis reproducibility" and "Biochemical and physicochemical analysis reproducibility" are considered as hardly reliable because only the average value of experimental results associated with those unit operations is given in the document.
All operations involving belief functions needed to compute reliability scores have been implemented in an R package. The package is called belief (Maillet et al., 2010), and it includes basic functions to manipulate belief functions and associated mass assignments (currently on finite spaces only).
In the third sub-step shown in Fig. 10, called Table extraction, data tables are automatically extracted from HTML versions of documents using tag analysis. The discovered tables are then presented to the domain expert for validation, as they represent a synthesis of some experimental data published in the document and may be used to facilitate manual entry.
The fourth sub-step, called Table annotation, corresponds to the manual semantic annotation of the selected data tables using the concepts of the Biorefinery OTR. Taking into account the actual content of the original table, the annotator selects from the n-ary relation concepts defined in the Biorefinery OTR those relevant to annotate the table. For instance, in Fig. 12, the expert selects several n-ary relations including Enzymatic hydrolysis output solid constituent quantity relation and Milling Solid Quantity Output Relation from the list of n-ary relation concepts defined in the Biorefinery OTR. The signatures of both n-ary relation concepts are visualized in a table, one signature per row. This guides the expert in his/her entry task, ensuring that no argument of the selected n-ary relation concepts is forgotten. This is important for data reusability. This example shows that several relations may be used in a given annotated table in order to annotate experimental data associated with a complete pre-treatment process such as the one presented in Table 5. This table presents an example of an annotated table in @Web extracted from the scientific document (Hideno et al., 2009), which describes a biorefinery pre-treatment process composed of a sequence of four unit operations occurring in experiments 1 and 2. The columns of the annotated table correspond to arguments of the relation Milling Solid Quantity Output Relation (see Fig. 6). For instance, we can see on the first row that the first unit operation is a cutting milling, an instance of the relation Milling Solid Quantity Output Relation. The third row shows that the third unit operation of this process is another milling, dry ball milling, another instance of the relation Milling Solid Quantity Output Relation.
During the manual data entry guided by the OTR, @Web offers assistance in several tasks. For instance, when entering quantity values and their associated units of measure, the expert may select a unit in the list of units associated with the quantity in the Biorefinery OTR. The expert may also drag and drop the quantitative values from the original table to the annotated table, which makes data entry easier and reduces the risk of errors. As requested in functional specification 5 listed in Section 3, it is possible to enter a quantitative value as an interval or a mean/standard deviation pair. For instance, in Table 5, the quantity Output solid constituent quantity is defined as the precise value 5 g for the Cutting milling treatment in row no. 1 and as the interval [3.1e-2, 4.9e-2] g for the Enzymatic hydrolysis treatment in row no. 4. Missing data are denoted by the interval [-inf, +inf]. @Web may also assist the expert in the symbolic concept entry task, by allowing him/her to navigate in the hierarchy of symbolic concepts belonging to the Biorefinery OTR. For instance, in Fig. 13, the expert may navigate into the Biomass concept hierarchy from the Biorefinery OTR and see the labels of the selected concept in the upper right corner of the snapshot.
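The different ways of entering a quantity can be reduced to a single interval representation, as in the following minimal Python sketch. The helper `to_interval` is hypothetical and does not describe the internal data model of @Web; converting a mean/standard deviation pair with two standard deviations is an assumption borrowed from the 95% confidence interval convention used for GlucoseYield.

```python
import math
from typing import Optional, Tuple

Interval = Tuple[float, float]


def to_interval(
    value: Optional[float] = None,
    bounds: Optional[Interval] = None,
    mean_std: Optional[Tuple[float, float]] = None,
) -> Interval:
    """Map one annotated quantity to a closed interval (hypothetical helper)."""
    if value is not None:
        # Precise value: degenerate interval.
        return (value, value)
    if bounds is not None:
        low, high = bounds
        if low > high:
            raise ValueError("interval bounds must be ordered")
        return (low, high)
    if mean_std is not None:
        mean, std = mean_std
        # ASSUMPTION: mean +/- 2 standard deviations.
        return (mean - 2.0 * std, mean + 2.0 * std)
    # Missing data: the whole real line.
    return (-math.inf, math.inf)


print(to_interval(value=5.0))               # precise value, e.g. 5 g
print(to_interval(bounds=(3.1e-2, 4.9e-2))) # interval value
print(to_interval())                        # missing data
```

A single representation of this kind lets downstream computations treat precise, imprecise and missing values uniformly.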
In the fifth and last step of annotation, called Storage, the annotated data tables are stored in an RDF triple store which can be queried through either an end-user querying interface or a SPARQL endpoint for open data access.
The data annotated by the annotation sub-system of @Web may be queried through an end-user interface, which implements functional specification 3 presented in Section 3. A detailed presentation of the flexible bipolar querying method which has been used to implement the querying sub-system of @Web is given in (Destercke et al., 2011, 2013). It should be noticed that this querying method simultaneously performs three kinds of reasoning: (1) inference using the specialization relation defined in the Biorefinery OTR, (2) ranking according to fuzzy pattern matching between preferences expressed in the query and imprecise data, (3) ranking according to preferences expressed about data source reliability. In this paper, we present the implementation of the querying sub-system through an example which illustrates the way data needed for Efactor indicator computation are extracted from the RDF triple store. For instance, the query presented in Fig. 14 has been built in order to compute the Efactor indicator associated with pre-treatment processes of type Organosolv pre-treatment for rice straw. First, the user selects the ontology IC2ACV, which is the name associated with the Biorefinery OTR in @Web. Secondly, the user selects one of the concept relations defined in IC2ACV to build the query. In the example of Fig. 14, the Physicochemical pre-treatment solid quantity output relation has been selected because its arguments allow the elaboration of the matter balance required to compute the Efactor. Selection criteria can be expressed on relation arguments. They may be mandatory or desirable. Mandatory means that only instances of the relation Physicochemical pre-treatment solid quantity output relation which fulfil the selection criterion will be retrieved. In the example of Fig. 14, only results associated with Rice straw will be retrieved. Desirable criteria allow the ranking of results to be refined. In Fig. 14, instances of Physicochemical pre-treatment solid quantity output relation which correspond to Organosolv treatment will be ranked first.
Fig. 15 presents the results associated with the query expressed in Fig. 14. We can see that the first four results correspond to the 4 experimental settings presented in Fig. 9 for Amiri et al. (2014). The biomass (resp. solvent and chemical reagent) quantity which is required to compute Efactor is presented in the Biomass quantity column (resp. Acid quantity). Results may be downloaded in a CSV file for Efactor computation.
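The distinction between mandatory and desirable criteria can be illustrated with a deliberately simplified Python sketch. It uses crisp equality tests on plain dictionaries, whereas the actual querying method relies on fuzzy pattern matching, ontology-based inference and reliability preferences; the function `query`, the field names and the rows are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def query(rows: List[Row], mandatory: Row, desirable: Row) -> List[Row]:
    """Filter on mandatory criteria, then rank by satisfied desirable criteria."""
    # Mandatory criteria: rows that fail any of them are discarded.
    kept = [r for r in rows if all(r.get(k) == v for k, v in mandatory.items())]

    def score(row: Row) -> int:
        # Desirable criteria only influence the ranking.
        return sum(1 for k, v in desirable.items() if row.get(k) == v)

    # sorted() is stable, so rows with equal scores keep their original order.
    return sorted(kept, key=score, reverse=True)


ROWS = [
    {"biomass": "Rice straw", "treatment": "Steam explosion", "document": "doc_A"},
    {"biomass": "Wheat straw", "treatment": "Organosolv", "document": "doc_B"},
    {"biomass": "Rice straw", "treatment": "Organosolv", "document": "doc_C"},
]

result = query(ROWS, mandatory={"biomass": "Rice straw"}, desirable={"treatment": "Organosolv"})
print([r["document"] for r in result])
```

In this toy example the wheat straw row is excluded by the mandatory criterion, and the Organosolv row is ranked before the other rice straw row.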

Eco-design indicator computation and visualization
In this section, we present the implementation of Efactor computation and visualization, which corresponds to the second step of the pipeline presented in Fig. 3. In Section 4.2, we have defined three kinds of Efactor indicators for a given set of n documents $o_1, \ldots, o_n$ and, for each document $o_i$, the m experimental settings described in $o_i$, denoted $e_{i1}, \ldots, e_{im}$:
• $Efactor^{best}(o_i, p, b)$ computes Efactor for the best experimental setting in document $o_i$.
• $Efactor^{all}(o_i, p, b)$ computes Efactor for all settings in document $o_i$.
• $Efactor(p, b)$ computes Efactor for the entire set of settings of the entire set of documents.
In the implementation, we consider that the set of documents on which Efactor has been computed corresponds to a topic in @Web. In this article, we have considered four topics, each of them associated with one of the four pre-treatment processes presented in Fig. 2. We have seen in the previous section that we use @Web queries to extract CSV data files in order to compute the Efactor associated with a given topic. We have implemented the computation of the three Efactor indicators in an Excel file in which the data extracted from @Web have previously been stored. Graphical representations are generated by VBA code executed in the Excel file, which displays an X-Y plot for a given topic and a given biomass, where X corresponds to Efactor and Y to glucose yield. For instance, in Fig. 16, we show a ranking of pre-treatments based on Efactor computation for the best experiment of the considered documents. Each point corresponds to a given pre-treatment of rice straw presented in a given document. For each point, the category of pre-treatment is represented by a geometric symbol (see the legend of Fig. 16, for instance for PM UFM).
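As an alternative to the spreadsheet step, the computation on an exported CSV file could be scripted. The following Python sketch is only an illustration: the column names, the example rows and the Efactor formula (inputs not valorised into glucose, per kg of glucose) are assumptions, and the real @Web export layout may differ.

```python
import csv
import io
from typing import Dict

# Hypothetical CSV export: one row per experimental setting.
CSV_TEXT = """document,biomass_kg,reagent_kg,solvent_kg,glucose_rate,glucose_yield
doc_A,0.005,0.0,0.0,0.37,0.89
doc_B,0.010,0.002,0.100,0.37,0.72
"""


def efactor_per_document(csv_text: str) -> Dict[str, float]:
    """Compute an Efactor per row, with quantities expressed per kg of biomass."""
    results: Dict[str, float] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        biomass = float(row["biomass_kg"])
        # Express reagent and solvent quantities per kg of initial biomass.
        reagent = float(row["reagent_kg"]) / biomass
        solvent = float(row["solvent_kg"]) / biomass
        # Glucose released per kg of biomass (rate and yield as fractions).
        glucose = float(row["glucose_rate"]) * float(row["glucose_yield"])
        # ASSUMPTION: Efactor = (inputs - glucose) / glucose.
        results[row["document"]] = (1.0 + reagent + solvent - glucose) / glucose
    return results


for document, value in efactor_per_document(CSV_TEXT).items():
    print(document, round(value, 2))
```

The resulting dictionary could then feed any plotting tool to reproduce an Efactor versus glucose yield chart.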
Reliability scores associated with each document, whose computation has been presented in Section 4.1.2, have been represented in two colors for each point. The surrounding (resp. inner) color corresponds to the upper bound (resp. lower bound). For instance, the point "CM then dry BM" 11 (corresponding to pre-treatment category PM UFM) has a glucose yield around 90% and a low Efactor. It is associated with reliability scores which correspond to an imprecise assessment due to disagreeing meta-information, represented by an external circle painted in red and an internal one in green (see Reliability index in Fig. 16).
Fig. 17 presents a ranking of pre-treatments realized on rice straw based on Efactor computation for all experimental settings of the four topics. Each point corresponds to a given rice straw pre-treatment studied in the entire set of documents. For instance, the point PM PC PS corresponds to the Efactor associated with topic PM PC PS, computed using the $Efactor^{all}$ indicators presented in Table 4. It integrates 23 experimental settings extracted from 6 documents. For each topic, reliability scores associated with each document have been merged into a global reliability score, as defined in Section 4.1.2.

DSS assessment and discussion
Results obtained on rice straw with the DSS have been presented to 3BCAR experts in biorefinery. Those results have been positively assessed by experts, who used tables and graphics associated with Efactor indicators produced by the DSS to perform the following analysis. Fig. 17 shows that the highest glucose yield (86% ± 2%) from rice straw was obtained after wet disk milling (PM PC UFM PS). Nitric acid, oxidizing and ionic liquid pre-treatment (PM PC PS) also achieves a good glucose yield (72.03% ± 3.32%). But these experimental conditions result in a high $Efactor_{min}$, estimated at about 70.6 (resp. 33.90) for PM PC UFM PS (resp. PM PC PS). In Fig. 16, it must be noticed that a low Efactor (2.03 ± 0.14) was estimated for Cutting Milling (CM) coupled to Ball Milling (BM), with about 90% of glucose yield (89.4% ± 2%), even if data source reliability is not fully established (see reliability indicator in Fig. 16 and associated metadata in Fig. 11). In general, water or chemical pre-treatments of rice straw produced more glucose compared to mechanical or dry pre-treatments (mechanical, torrefaction, ...), but also generated more effluents, with a high Efactor. Results presented in Fig. 16 clearly show that dry pre-treatments (milling, torrefaction, ...) are simpler technologies which are in general less effective in the production of glucose, but do not need any chemical or water inputs. They have a low environmental impact (low Efactor), thus minimizing waste generation while maximizing the value of the lignocellulosic biomass.
The results obtained on the rice straw use case demonstrate the feasibility and usefulness of the DSS data treatment pipeline. In this experimentation, users have particularly appreciated the following functionalities:
• The DSS makes it possible to continuously enrich the RDF database with new scientific data and to compare them with already stored scientific data.
• The OTR provides a simple reading grid to homogenize heterogeneous textual data, even if the annotation remains manual.
• The DSS makes it easy to navigate from a graphical representation of indicators (see Section 5.2) to detailed annotated data stored in the RDF database. It has been recognized to be useful to design new experimental study protocols. For instance, Fig. 16 made biorefinery experts think of designing a new experimental protocol to study more precisely the impact of torrefaction and particle size on the glucose yield. The range of particle sizes is easily available by consulting the annotated tables (see Table 5).
The current version of the Efactor computation step remains partly manual, as data extracted from the database using @Web queries (see Section 5.1) must be put in an Excel file to compute the indicator. We are currently working on an advanced version which will extract the data directly from the RDF database.
As discussed in Section 2, the complete automation of experimental data extraction from textual documents is still a challenge to be met. We believe that the combination of KE and text mining methods will permit essential advances. In the short term, our approach will focus on adding assistants in the @Web software, in order to speed up manual annotation guided by the OTR. In the very near future, we will implement an assistant based on the coupling of OTR and text mining approaches, to complete n-ary relation annotation by suggesting the most relevant sentences in which a given argument of the n-ary relation appears. Moreover, the time cost necessary to annotate experimental data should be compared to the one required to produce similar data in the laboratory.
In our experimentation, it took the annotators about 80 days to design the ontology, select and read scientific papers and manually enter more than 400 experimental results concerning 4 biomasses and 6 pre-treatment processes described in 32 publications. This comes to an average time of 0.2 day per experimental result. It must be put in perspective with the time spent to produce experimental results in the laboratory. In our experiment, the annotators, who also carried out experiments in the laboratory, made 36 laboratory experiments in 160 days. This represents an average of 4.4 days per experimental result. The time spent for manual entry of an experimental result from the literature was considered by end users to be small enough compared to the time spent to produce a similar experimental result at lab scale (ratio 1/22). Moreover, the DSS design is envisioned in an iterative approach in which annotated data will be reused to develop new functionalities such as the one presented in Section 6.
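The orders of magnitude quoted above follow directly from the reported figures, taking 400 as the number of annotated results (the text states "more than 400", so 0.2 is an upper estimate):

$$\frac{80\ \text{days}}{400\ \text{results}} = 0.2\ \text{day per result}, \qquad \frac{160\ \text{days}}{36\ \text{experiments}} \approx 4.4\ \text{days per experiment}, \qquad \frac{0.2}{4.4} \approx \frac{1}{22}.$$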

Conclusion and prospect
In this paper, we have proposed a decision support system for eco-efficient biorefinery process selection based on an ontology-based semantic approach. The ontology is used to guide the annotation of potentially incomplete or imprecise experimental data retrieved from the bibliography in order to store them in a structured database. Moreover, a model has been used to assess the reliability of data sources. Finally, a ranking of biorefinery processes has been computed in terms of glucose yield and Efactor indicators, taking into account data imprecision and reliability.
The interest of Efactor is to give an overview of the process mass balance. It may be considered as a "local" indicator (because it is based on data at process scale) and an "inventory level" indicator (because it uses physical flows). It complements the classical process yield in order to rank different options. The less input-consuming and waste-generating process is generally preferred. Used with the reliability indicators, Efactor gives interesting information at an early decision stage at research or laboratory scale. Other indicators could be considered regarding the environmental balance. Therefore, a perspective of our work will be to compute life cycle indicators to complete the list of assessment criteria for decision makers. The application of life cycle assessment (LCA) according to the ISO 14040 standards (ISO, 2006) would deepen the analysis through the extension of the system boundaries (inclusion of the life cycle of inputs with the use of life cycle inventory databases) and through the calculation of potential environmental impacts. Impact assessment aims at transforming inventory results into environmental indicators (also called impact categories). To compute those environmental indicators, we will have to deal with the lack of data in the literature concerning, for example, the energy consumption and energy efficiency of chemical, physicochemical and mechanical treatments of rice straw. Ideally, these indicators will complement the Efactor and glucose yield. More generally, this approach may be applied to any kind of biomass (food or non-food) transformation process. Consequently, the number of publications which could be valorized using this approach is potentially very high. Moreover, it must be noticed that the first step of the treatment pipeline (data integration) may be applied to many kinds of scientific data in order to perform numerical treatments (meta-analysis, decision support tools, ...). For example, we already use this approach to create decision support systems which determine the optimal selection and dimensioning of food packaging (Guillard et al., 2015), by reusing literature data about matter transfer. Another exciting perspective would be to develop interoperability between @Web and Rosanne, to take the best of both tools in order to improve automatic annotation of relevant information from scientific documents.