HERMIS International Journal of Computers, Mathematics and its Applications. 2003, Vol. 4, pp. 163-173. LEA.
The AROMA System for Intelligent Text Mining
by JOHN KONTOS, IOANNA MALAGARDI AND JOHN PEROS
Abstract-- The AROMA system for Intelligent Text Mining, consisting of three subsystems, is presented and its operation analyzed. The first subsystem extracts knowledge from sentences; the knowledge extraction process is speeded up by a search algorithm. The second subsystem is based on a causal reasoning process that generates new knowledge by combining knowledge extracted by the first subsystem, together with an explanation of the reasoning. The third subsystem performs a simulation that generates the time-dependent course and textual descriptions of the behavior of a nonlinear system. The knowledge of the structure and parameter values of the equations, extracted automatically from the input scientific texts, is combined with prerequisite knowledge such as ontology and default values of relation arguments. The application of the system is demonstrated with two examples concerning cell apoptosis, compiled from a collection of MEDLINE abstracts of papers used to support the model of a biomedical system.
Index Terms—intelligent text mining, biomedical text mining, knowledge extraction, knowledge discovery from text, simulation, history
I. INTRODUCTION
We define Intelligent Text Mining as the process of creating novel knowledge from texts that is not stated in them, by combining sentences with deductive reasoning. An early implementation of this kind of Intelligent Text Mining is reported in [8]. A review of different kinds of text mining, including Intelligent Text Mining, is presented in [14]. There are two possible methodologies for applying deductive reasoning to texts. The first methodology is based on the translation of the texts into some formal representation of their “content”, with which deduction is performed. The advantage of this methodology is the simplicity and availability of the required inference engine, but its serious disadvantage is the need to preprocess all the texts and store their translation into a formal representation every time something changes. In particular, in the case of scientific texts, what may change is some part of the prerequisite knowledge, such as the ontology used for deducing new knowledge. The second methodology eliminates the need for translation of the texts into a formal representation, because an inference engine capable of performing deductions “on the fly”, i.e. directly from the original texts, is implemented. The only disadvantage of this second methodology is that a more complex inference engine must be built than the one needed by the first; its strong advantage is that the translation into a formal representation is avoided. In [8] the second methodology is chosen, and the method developed is therefore called ARISTA, i.e. Automatic Representation Independent Syllogistic Text Analysis.
In [8] one of the application examples concerned medical text mining related to the mechanics of the human respiratory system. Biomedical text mining is now recognized as a very important field of study [18]. More specifically, an early attempt was made in [8] to implement model-based question answering using the scientific text as a knowledge base describing the model. The idea of connecting text mining with simulation and question answering was pursued further by our group, as reported in [12] and [13] as well as in the present paper. We consider model discovery, model updating and question answering as worthy final aims of intelligent text mining.
The AROMA (ARISTA Oriented Modeling Adaptation) system for Intelligent Text Mining accomplishes knowledge discovery using our representation-independent method ARISTA, which performs causal reasoning “on the fly” directly from text in order to extract qualitative or quantitative model parameters as well as to generate an explanation of the reasoning. The application of the system is demonstrated with two biomedical examples concerning cell apoptosis, compiled from a collection of MEDLINE paper abstracts related to a recent proposal of a relevant mathematical model: a mathematical model of the p53-mdm2 feedback system. The variables of the model correspond to the concentrations of the proteins p53 and mdm2 respectively. A basic characteristic of the behavior of such biological systems is the occurrence of oscillations for certain values of the parameters of their equations, and the non-linearity of the model causes these oscillations to differ from simple sine waves. The solution of the model equations is accomplished with a Prolog program that provides an interface for manipulation of the model by the user and for the generation of textual descriptions of the model behavior. This manipulation is based on the comparison of the experimental data with the graphical and textual simulator output, and the dependence of the behavior of the model on its parameters may be studied using our system. The AROMA system consists of three subsystems, briefly described below.
The first subsystem achieves the extraction of knowledge from individual sentences that is similar to traditional information extraction from texts. In order to speed up the whole knowledge acquisition process a search algorithm is applied on a table of combinations of keywords characterizing the sentences of the text.
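The speed-up described above can be illustrated with a small sketch. The actual subsystem is written in Prolog; the Python fragment below is only a hypothetical illustration of the idea of a table of keyword combinations: sentences are first indexed by the causal verbs and entity names they contain, so that only sentences matching a verb/entity combination need to be parsed in full. The word lists are assumptions chosen for this example.

```python
# Illustrative sketch (not the authors' Prolog code): index sentences by the
# causal verbs and entity keywords they contain, then keep only candidates.
CAUSAL_VERBS = {"regulates", "inhibits", "enhances", "activated"}
ENTITIES = {"p53", "mdm2", "dna"}

def keyword_table(sentences):
    """Map each sentence id to the causal verbs and entities it mentions."""
    table = {}
    for sid, text in sentences.items():
        words = {w.strip(".,").lower() for w in text.split()}
        table[sid] = (words & CAUSAL_VERBS, words & ENTITIES)
    return table

def candidate_sentences(table):
    """Keep only sentences with at least one causal verb and one entity."""
    return [sid for sid, (verbs, ents) in table.items() if verbs and ents]

sentences = {
    "325": "The p53 protein regulates the mdm2 gene",
    "001": "The cell cycle has several phases",
}
print(candidate_sentences(keyword_table(sentences)))  # only "325" qualifies
```

Only the first sentence contains both a causal verb and an entity name, so the costlier parsing step is applied to it alone.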
The second subsystem is based on a causal reasoning process that generates new knowledge by combining “on the fly” knowledge extracted by the first subsystem. Part of the knowledge generated is used as parametric input to the third subsystem and controls the model adaptation.
The third subsystem is based on system modeling that is used to generate time dependent numerical values that are compared with experimental data. This subsystem is intended for simulating in a semi-quantitative way the dynamics of nonlinear systems. The characteristics of the model such as structure and parameter polarities of the equations are extracted automatically from scientific texts and are combined with prerequisite knowledge such as ontology and default process and entity knowledge.
In our papers [11], [12] and [13] parts of the AROMA System have been described separately. In the present paper we describe how the integrated system uses causal knowledge discovered from texts as a basis for modeling systems whose study is reported in texts analyzed by our system. We are thus aiming at automating part of the cognitive process of model discovery based on experimental data and supported by domain knowledge extracted automatically from scientific texts. A brief historical review of the technology of simulation is first presented below.
II. THE HISTORY OF SIMULATION
Computer Simulation is playing an increasingly valuable role within science and technology because it is one of the most powerful problem-solving techniques available today. Simulation has its roots in a variety of disciplines including Computer Science, Operations Research, Management Science, Statistics and Mathematics, and is widely utilized in industry and academia as well as in government organizations. Each of these groups influenced the directions of simulation as it developed historically. The history of simulation may be traced to ancient efforts or wishes for the construction of machines or automata simulating inanimate or animate systems. According to the hypothesis of de Sola Price [5], the Ancient Greek Antikythera Mechanism, dated to approximately 80 B.C., is a mechanized calendar which calculated the places of the Sun and the Moon through the cycle of years and months. These calculations could be considered as performing a simulation. Therefore this mechanism may be considered as the first known constructed simulator in the history of this technology.
Human action was already being simulated in the mechanical theatre of the Ancient Greek scientist and engineer Heron of Alexandria (about 100 A.D.). Using a system of cords, pulleys and levers bound to counterweights, as well as sound effects and changing scenery, Heron was able to create an illusion that brought legends to life. The conception of human beings as machines reaches back to antiquity. As early as the second century A.D., the famous Ancient Greek physician Galen conceived his pneumatic model of the human body in terms of the hydraulic technology of his age [4]. The development of simulators in the modern era predates the advent of digital computers, with the first flight simulator being credited to Edward A. Link in 1927, and a lot of work followed using mechanical, electrical and electronic analogue computers. One of the earliest simulation studies of biochemical reactions was carried out by Chance during the Second World War [16]. In a remarkable paper where experiment, analysis and simulation were all used, Chance showed the validity of the Michaelis-Menten model for the action of the enzyme peroxidase. The simulations were carried out with a mechanical differential analyser. The differential analyser belongs to a class of computers known as analogue computers. These computers were designed to take input in terms of continuous variables, such as hydraulic pressure or voltage. These were then processed by various units, which altered them according to particular mathematical functions including derivatives and integrals. Analogue computers were programmed by coupling various elementary devices, and in our field of interest they would solve a set of differential equations in a continuous simulation of the model. Note that unlike digital computers, there is no round-off error involved in calculations processed with analogue computers [16].
Digital simulation was developed after the appearance of the digital computer. The history of digital simulation has been described as comprising four periods:
(1) the advent (circa 1950 - 1960),
(2) the era of simulation programming languages (circa 1960 - 1980),
(3) the era of simulation support environments (circa 1980 - 1990), and
(4) the very modern era (circa 1990 - today).
The advent begins with the inception of digital computers and is marked by early theories of model representation and execution. For example, the fixed-time increment and variable-time increment time flow mechanisms (TFMs) were proposed during this period. The use of random methods for the statistical evaluation of simulation results was also formulated during the advent. Once the general principles and techniques had been identified, special-purpose languages for simulation emerged. The history of simulation programming languages has been organized into five periods of similar developments, starting in 1955 and ending in 1986, namely: The Period of Search (1955-1960); The Advent (1961-1965); The Formative Period (1966-1970); The Expansional Period (1971-1978); and The Period of Consolidation and Regeneration (1979-1986) [19].
The proliferation of simulation programming languages (SPLs) like Simula, Simscript, GPSS and MODSIM marks the second period in this history. The design and evolution of SPLs also helped refine the principles underlying simulation. The dominant conceptual frameworks (or world views) for discrete event simulation -- event scheduling, process interaction, and activity scanning -- were defined during the second period, largely as a result of SPL research. Also during this period the first cohesive theories of simulation modeling were formulated, e.g. DEVS (discrete event system specification) and its basis in general systems theory [19].
The third period in the history of digital simulation is evidenced by a shift of focus from the development of the simulation program toward the broader life cycle of a simulation project. This means extending software and methodological support to such activities as problem formulation, objectives identification and presentation of simulation results. This was the era of the integrated simulation support environment (ISSE). Environment research occurred throughout the simulation communities: within academia, industry and governments. Also during this third period a great interest emerged within the U.S. Department of Defense in interoperable, networked simulators.
In the contemporary era, emphasis and direction seem to vary across the primary simulation communities. In the commercial sector, environments and languages remain a major focus. The objective seems to be maximizing market share through specialization; for example, commercial environments specializing in communications network simulation. Within academia environments are also a focus, but these environments appear to favour generality of purpose over specialization. The majority of modelling methodology activity remains in the academic sector, and most of the work involving the execution of simulation models on parallel computers is also occurring in university laboratories.
One of the pioneers in the field is John McLeod [15] who in 1947 went to work at the U.S. Naval Air Missile Test Center (now Pacific Missile Range) Point Mugu, California. While there, he sparked and supervised development of the Guidance Simulation Laboratory which, within a few years, became one of the leading American simulation facilities. In 1952 he organized the Simulation Council (now The Society for Modeling and Simulation International) and with his wife Suzette began publication of the Simulation Council Newsletter. From 1956 to 1963 he worked as Design Specialist, Space Navigation and Data Processing, with General Dynamics/Astronautics in San Diego. While there, he received a grant to support testing of an extra-corporeal perfusion device (heart-lung machine) which he had developed on his own time. He also acted as co-founder of the San Diego Symposium for Biomedical Engineering, and edited the proceedings of the first symposium, held in 1961.
The first anaesthesia simulator was developed at the University of Southern California in the late 1960s. It featured spontaneous ventilation, a heartbeat, temporal and carotid pulses and blood pressure; it opened and closed its mouth, blinked its eyes, exhibited muscle fasciculation and coughed. It responded to four intravenous drugs: thiopental, succinylcholine, epinephrine, and atropine. Additionally it responded to oxygen and nitrous oxide. This first anaesthesia simulator was based upon “scripting”: a script prescribes the consequences of an action [2].
A team at the University of Florida resurrected the early technology in the late 1980s [3]. In concert with a team of computer scientists and engineers, the Human Patient Simulator (HPS) was created and introduced in the early 1990s. What distinguishes the Human Patient Simulator from the first anaesthesia simulator is that it is based upon modelling. The software that runs the HPS uses complex mathematical equations, which define the many factors comprising the cardiovascular and respiratory systems of humans. If a drug or event affects one or more factors, the new equation will describe the resulting changes. Thus, if an intervention is correct and timely we will see improvement in the simulated patient’s condition. If the intervention is incorrect the simulated patient’s condition will deteriorate and ultimately lead to cardiac arrest and death.
As far as Greek specialists are concerned, it should be noted that J. Kontos started research in simulation as early as 1960, with the simulation of part of the respiratory system and of magnetohydrodynamic systems [6], [7].
III. BIOMEDICAL SYSTEM MODEL DISCOVERY SUPPORT
There is some research on computational methods for the application of inductive learning to the discovery of new knowledge. However, the knowledge induced by such methods usually has little relation to the formalisms and concepts used by scientists and engineers, and experts in some domains may reject the output of a learning system unless it is related to their prior knowledge. In contrast, the use of models in science and engineering may provide an explanation that includes variables, objects, or mechanisms that are unobserved but that help predict the behavior of observed variables. Explanations also make use of general concepts, ontologies and relations for explaining experimental findings using scientific models.
We will focus here on a particular class of system models consisting of processes that describe one or more causal relations between input variables and output variables. A process may also include conditions, stated as threshold tests on its input variables, that describe when it is active. This knowledge is expressed in terms of differential equations when it involves change over time, or algebraic equations when it involves instantaneous effects. A process model consists of a set of processes that link observable input variables with observable output variables, possibly through unobserved theoretical terms. The concept of process is fundamental to our original early proposal of the ARISTA method and its modeling application in [8].
Process models are often designed to characterize the behavior of dynamical systems that change over time, though they can also handle systems in equilibrium. The data produced by such systems differ from those that arise in most induction tasks in a variety of ways. First, the variables are primarily continuous, since they represent quantitative measurements of the system under study. Second, the observed values are not independently and identically distributed, since those observed at later time steps depend on those measured earlier. The dynamical systems explained by our models are viewed as deterministic. The observations themselves may well contain noise, but we assume that the processes themselves are always active whenever their conditions are met and that their equations have the same form all the time. We use this assumption because scientists and engineers often treat the systems they study as deterministic.
A biomedical system model discovery support system like AROMA may revolutionize scientific discovery by providing computer tools for the automatic checking of the validity of a model as supported by experimental findings. With such tools the updating of models may also be facilitated whenever experimental findings are reported that disagree with a generally accepted model. Therefore model discovery support is a worthy final aim for Intelligent Text Mining.
IV. THE GENERAL ARCHITECTURE OF OUR AROMA SYSTEM
The general architecture of our system is shown in Figure 1; it consists of three subsystems, namely the Knowledge Extraction Subsystem, the Causal Reasoning Subsystem and the Simulation Subsystem. These subsystems are briefly described below. The texts of the example application presented here are compiled from the MEDLINE abstracts of the 73 papers used as references in [1]. Most of these papers are used in [1] to support the discovery of a quantitative model of protein-concentration oscillations related to cell apoptosis, constructed as a set of differential equations. We aim to automate part of such a cognitive process with our system. The collection of MEDLINE abstracts is processed by a preprocessor module so that it takes the form required by our Prolog programs, i.e. one sentence per line.
A. The Knowledge Extraction Subsystem
This subsystem integrates partial causal knowledge extracted from a number of different texts. This knowledge is expressed in natural language using causal verbs such as “regulate”, “enhance” and “inhibit”. These verbs usually take as arguments entities such as the protein names and gene names that occur in the biomedical texts used for the present applications; in this way causal relations between the entities are expressed. The input files used for this subsystem contain abstracts downloaded from MEDLINE. A lexicon containing words such as causal verbs and stopwords is also input to this subsystem. The subsystem produces an output file that contains parts of sentences collected from the original sentences of different abstracts; this output file is used for reasoning by the second subsystem. The operation of the subsystem is based on the recognition of a causal verb or verb group. After this recognition the complements of the verb are chunked by processing the neighboring left and right context of the verb. This is accomplished by using a number of stopwords such as conjunctions and relative pronouns. The input texts are first submitted to a preprocessing module of the subsystem that automatically converts each sentence into a set of Prolog facts representing, for each word, numerical information identifying the sentence that contains the word and the word’s position in the sentence. This set of Prolog facts has nothing to do with a logical representation of the “content” of the sentences, as seems to have been inaccurately reported in [10]. It should be emphasized that we do not deviate from our ARISTA method with this translation; we simply “annotate” each word with information concerning its position within the text. It should be noted that this annotation is not added to the original text but is represented as the set of Prolog facts mentioned above.
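The verb-centered chunking described above can be sketched as follows. This is a minimal Python illustration, not the authors' Prolog implementation: the complements of a causal verb are taken to be the words to its left and right, cut off at the nearest stopword. The verb and stopword lists are assumptions for this example.

```python
# A minimal sketch of stopword-bounded chunking around a causal verb.
STOPWORDS = {"which", "that", "because", "and", "but", "while"}
CAUSAL_VERBS = {"regulates", "inhibits", "enhances"}

def chunk_around_verb(sentence):
    """Return (left complement, verb, right complement) or None."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in CAUSAL_VERBS:
            left, right = [], []
            for x in reversed(words[:i]):      # scan left until a stopword
                if x.lower() in STOPWORDS:
                    break
                left.insert(0, x)
            for x in words[i + 1:]:            # scan right until a stopword
                if x.lower() in STOPWORDS:
                    break
                right.append(x)
            return " ".join(left), w, " ".join(right)
    return None

print(chunk_around_verb("The p53 protein regulates the mdm2 gene"))
# ("The p53 protein", "regulates", "the mdm2 gene")
```

A stopword such as “because” truncates the complement, which is how the sentence fragments shown in the examples of Sections V and VI are obtained.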
B. The Causal Reasoning Subsystem
The output of the first subsystem is used as input to the second subsystem that combines causal knowledge in natural language form to produce by automatic deduction conclusions not mentioned explicitly in the input text. The operation of this subsystem is based on the ARISTA method [8]. The sentence fragments containing causal knowledge are parsed and the entity-process pairs are recognized. The user questions are analysed and reasoning goals are extracted from them. The answers to the user questions are generated automatically by a reasoning process together with explanations in natural language form. This is accomplished by the chaining “on the fly” of causal statements using prerequisite knowledge such as ontology to support the reasoning process. A second output of this subsystem consists of both qualitative and quantitative information that is input to the third subsystem and controls the adaptation of the model of the biomedical system.
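The chaining step can be sketched as follows. The sketch below is a hypothetical Python rendering of the idea (the actual subsystem performs this in Prolog via the ARISTA method): causal links extracted from text fragments are joined when the effect of one matches the cause of the next, yielding a derived causal link together with a natural-language explanation. The link tuples and source labels are assumptions for this example.

```python
# Hedged sketch of causal chaining "on the fly": two extracted causal links
# are combined into a derived link, with an explanation citing the sources.
links = [
    ("p53", "+", "mdm2", "sentence 325"),
    ("mdm2", "-", "p53", "sentence 927"),
]

def chain(cause, effect):
    """Find a two-step causal chain cause -> mid -> effect and explain it."""
    for c1, s1, mid, src1 in links:
        for c2, s2, e2, src2 in links:
            if c1 == cause and mid == c2 and e2 == effect:
                sign = "+" if s1 == s2 else "-"       # compose polarities
                explanation = (f"{cause} {s1}causes {mid} ({src1}) and "
                               f"{mid} {s2}causes {effect} ({src2})")
                return sign, explanation
    return None

sign, explanation = chain("p53", "p53")
print(sign)         # "-": a positive link followed by a negative one
print(explanation)
```

The derived sign is what controls the model adaptation passed to the third subsystem.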
C. The Simulation Subsystem
The third subsystem is used for modelling the dynamics of the biomedical system specified on the basis of the MEDLINE abstracts processed by the first subsystem. The characteristics of the model, such as the structure and parameter values of the equations, will eventually be extracted from the input texts and combined with prerequisite knowledge such as ontology and default process and entity knowledge. Considering the above example, two coupled first-order differential equations are used as the approximate mathematical model of the biomedical system, in rough correspondence with the model proposed in [1]. A basic characteristic of the behaviour of such a system is the occurrence of oscillations for certain values of the parameters of the equations. The equations in finite difference form that approximate the differential equations are:
Δx= a1*x + b1*y + c1*x*y (1)
Δy= a2*y + b2*delay(d,x) (2)
where Δx denotes the difference between the value of the variable x at the next time instant and its value at the present time, and delay(d,x) denotes the value of x d units of time earlier. Time is taken to advance in discrete steps. The variables x and y correspond to the concentrations of the proteins p53 and mdm2 respectively. The symbols a1, b1, c1, a2 and b2 stand for the parameters of the equations. It is noted that the multiplicative term c1*x*y renders equation (1) non-linear. This non-linearity causes the oscillations to differ from simple sine waves. The solution of these equations is accomplished with a Prolog program that provides an interface for manipulating the parameters of the model. An important module of the simulation subsystem is one that generates text describing the behaviour of the variables of the model over time.
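Equations (1) and (2) can be iterated numerically as sketched below. This Python fragment is only an illustration of the finite-difference scheme (the system's own simulator is a Prolog program); the parameter values, initial conditions and delay are assumptions chosen for the example, not the values used in [1].

```python
# Illustrative iteration of equations (1) and (2) in finite difference form:
#   Δx = a1*x + b1*y + c1*x*y          (1)
#   Δy = a2*y + b2*delay(d, x)         (2)
def simulate(a1, b1, c1, a2, b2, d, x0=1.0, y0=0.0, steps=200):
    """Return the trajectories of x (p53) and y (mdm2) over discrete time."""
    xs, ys = [x0], [y0]
    for t in range(steps):
        x, y = xs[-1], ys[-1]
        delayed_x = xs[-1 - d] if t >= d else xs[0]   # delay(d, x)
        dx = a1 * x + b1 * y + c1 * x * y             # equation (1)
        dy = a2 * y + b2 * delayed_x                  # equation (2)
        xs.append(x + dx)
        ys.append(y + dy)
    return xs, ys

# Hypothetical parameter values, for illustration only.
xs, ys = simulate(a1=0.1, b1=-0.5, c1=-0.05, a2=-0.3, b2=0.4, d=5)
```

Varying the parameters and the delay d in such a scheme is how the dependence of the model behaviour on the parameters, including the onset of oscillations, can be studied.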
V. A FIRST EXAMPLE OF BIOMEDICAL TEXT MINING
An illustrative subset of the sentences used in this first example is the following, where the reference numbers with which the authors of [1] cite the papers are given in parentheses:
The p53 protein is activated by DNA damage. (23)
Expression of Mdm2 is regulated by p53. (32)
Mdm2 increase inhibits p53 activity. (17)
Using these sentences our system discovers automatically the qualitative causal process model with a negative feedback loop that can be summarized as:
DNA damage +causes p53 +causes mdm2 -causes p53
where “+causes” means “causes increase” and “-causes” means “causes decrease or inhibition”,
by answering the question:
Is there a process loop of p53?
This question is internally represented as the Prolog goal: “cause(P1,p53,P2,p53,S)”, where P1 and P2 are two process names that the system extracts from the texts and characterize the behavior of p53. S stands for the overall effect of the feedback loop found i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” or “negative” since a positive causal connection is followed by a negative one.
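The computation of S can be sketched in a few lines. This is a hypothetical Python illustration of the sign rule stated above, under the assumption that the overall loop polarity is the product of the polarities along the chain: an odd number of negative links gives a negative feedback loop.

```python
# Sketch of the overall feedback-loop polarity S: an odd number of "-"
# links along the causal chain yields a negative feedback loop.
def loop_sign(polarities):
    return "-" if polarities.count("-") % 2 == 1 else "+"

print(loop_sign(["+", "-"]))  # the p53 -> mdm2 -> p53 loop: "-"
```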
The short answer automatically generated by our system is:
Yes.
The loop is: p53 activity -causes p53 production.
The long answer automatically generated by our system is:
Using sentence 17 with inference rule IR4
since the DEFAULT process of p53 is production
using sentence 32
the EXPLANATION is:
since increase is equivalent to expression
p53 production -causes activity of p53
because
p53 production +causes expression of Mdm2
and
increase of Mdm2 -causes activity of p53
It should be noted that the combination of sentences (17) and (32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge.
This knowledge is inserted manually in our system as Prolog facts and can be stated as:
“the DEFAULT process of p53 is ‘production’” or
in Prolog: “default(p53,production).”.
“the process ‘increase’ is equivalent to the process ‘expression’” or
in Prolog “equivalent(increase,expression).”.
The above analysis of the text fragments of the first example is partially based on the following prerequisite knowledge which is also manually inserted as Prolog facts:
kind_of("the","determiner").
kind_of("is","copula").
kind_of("of","preposition").
kind_of("p53","entity_noun").
kind_of("protein","entity_noun").
kind_of("DNA","entity_noun").
kind_of("Mdm2","entity_noun").
kind_of("activated","causal_connector").
kind_of("inhibits","causal_connector").
kind_of("regulated","causal_connector").
kind_of("damage","process").
kind_of("expression","process").
kind_of("increase","process").
kind_of("activity","process").
The above prerequisite knowledge base fragment contains both general linguistic and domain-dependent ontological knowledge about the words occurring in the corpus. In practice, of course, these two parts of knowledge are distinguished and handled differently by the inference rules of the reasoning module.
VI. A SECOND EXAMPLE OF BIOMEDICAL TEXT MINING
The second example text is also compiled from two MEDLINE abstracts of papers used by [1] as references. These two abstracts downloaded from MEDLINE again contain knowledge concerning the interaction of the proteins p53 and mdm2, which are involved in the life cycle of the cell. The first abstract consists of six sentences, two of which are selected by the first subsystem; from these the following fragments are extracted automatically.
“The p53 protein regulates the mdm2 gene”
“regulates both the activity of the p53 protein”
These fragments are then automatically transformed as Prolog facts in order to be processed by the second subsystem as shown below:
t(“325”, “The p53 protein regulates the mdm2 gene”).
t(“326”, “regulates both the activity of the p53 protein”).
The identifiers 325 and 326 denote that these fragments are extracted from sentences 5 and 6 of text 32.
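The identifier convention just described can be made explicit with a tiny sketch, assuming (as in the examples here) a single-digit sentence number appended to the text number; this decoding helper is hypothetical, not part of the system.

```python
# Sketch of the sentence-identifier convention: "325" = text 32, sentence 5.
def decode(sentence_id):
    """Split an identifier into (text number, sentence number)."""
    return int(sentence_id[:-1]), int(sentence_id[-1])

print(decode("325"))  # (32, 5)
print(decode("927"))  # (92, 7)
```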
The second abstract consists of seven sentences, two of which are selected by the first subsystem; from these the following fragments are extracted automatically:
“The mdm2 gene enhances the tumorigenic potential of cells”
“The mdm2 oncogene can inhibit p53_mediated transactivation”
and expressed in the form of Prolog facts as:
t(“923”, “The mdm2 gene enhances the tumorigenic potential of cells”).
t(“927”, “The mdm2 oncogene can inhibit p53_mediated transactivation”).
Using the sentences of the second example our system discovers the causal negative feedback loop:
p53 +causes mdm2 -causes p53
where “+causes” means “causes increase” and “-causes” means “causes decrease or inhibition”,
by answering the question:
Is there a process loop of p53?
This question is internally represented as the Prolog goal:
“cause(P1,p53,P2,p53,S)”
where P1 and P2 are two process names that the system extracts from the texts and characterize the behavior of p53. S stands for the overall effect of the feedback loop found i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” since a positive causal connection is followed by a negative one.
The short answer automatically generated by our system is:
Yes.
The loop is: p53 activity -causes p53 production.
The long answer automatically generated by our system is:
the QUESTION is:
Get process loop of p53
OR
cause(P1,p53,P2,p53,S)
USING INFERENCE RULE IR4a
since the DEFAULT entity of p53_mediated is p53
USING sentence 927 with inference rule IR4
USING INFERENCE RULE IR4b
USING sentence 325
the EXPLANATION is:
since oncogene is a kind of gene
p53 protein -causes p53
because
p53 protein +causes gene of mdm2
and
oncogene of mdm2 -causes p53_mediated transactivation of p53
It should be noted that the combination of the sentences from texts (92) and (32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge.
This knowledge is inserted manually in our system as Prolog facts and can be stated as:
“the DEFAULT entity of p53_mediated is p53” or in Prolog: “default(p53_mediated,p53).”.
“oncogene is a kind of gene” or in Prolog: “kind_of(oncogene,gene).”.
VII. EXAMPLES OF OPERATION OF THE SIMULATION SUBSYSTEM
The results of two examples of simulation are produced by two different sets of values of the parameters a1, a2, b1, b2, c1 and d of the following equations:
Δx= a1*x + b1*y + c1*x*y (1a)
Δy= a2*y + b2*delay(d,x) (2a)
VIII. CONCLUSIONS
We presented our AROMA system for Intelligent Text Mining. The system consists of three main subsystems. The first subsystem achieves the extraction of knowledge from sentences related to the structure and the parameters of the biomedical system simulated. The second subsystem is based on a reasoning process that generates new knowledge by combining “on the fly” knowledge extracted by the first subsystem, together with explanations of the reasoning. The third subsystem is a system simulator written in Prolog that generates the time behavior of the model’s variables. Two important features of the simulation subsystem are the use of structure and parameter knowledge automatically extracted from scientific texts and the automatic generation of texts describing the behavior of the system being simulated. Our final aim is to be able to model biomedical systems by integrating knowledge extracted from different texts and to give the user a facility for questioning these models during a collaborative man-machine model discovery procedure. The model-based question answering we are aiming at may support both biomedical researchers and medical practitioners.
REFERENCES
[1] Bar-Or, R. L. et al (2000). Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, vol. 97, No 21 pp. 11250-11255, October.
[2] Denson JS and Abrahamson S (1969). A computer-controlled patient simulator. JAMA 208:504-508.
[3] Good, M.L. and Gravenstein, J.S. (1996). Anesthesia simulators and training devices. In: Prys Roberts C, Brown BR Jr (eds), International Practice of Anaesthesia, Vol 2. Oxford, Butterworth-Heinemann, pp. 2/167/1-11.
[4] Grau, O. (2000). The History of Telepresence: Automata, Illusion and the Rejection of the Body. In: Ken Goldberg (ed.), The Robot in the Garden: Telerobotics and Telepistemology on the Internet, Cambridge, Mass., pp. 226-246.
[5] Kean, V.J. (1995). The Ancient Greek Computer from Rhodes known as the Antikythera Mechanism. Efstathiadis Group, Greece.
[6] Kontos, J. (1965). Computation of Instability Growth Rates in Finite Conductivity Magnetohydrodynamics. Nuclear Fusion, Vol 5, No 2.
[7] Kontos, J. et al (1966), A Control Engineering Approach to Magnetohydrodynamic Stability. Proceedings of the I.E.E., Vol 113, No 3.
[8] Kontos, J. (1992) ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology. vol. 34, No 9, pp.611-616.
[9] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, vol 26, No. 2, pp. 103-122, October.
[10] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems vol. 26, No. 2, pp. 103-122, October.
[11] Kontos, J. and Malagardi, I. (2001) A Search Algorithm for Knowledge Acquisition from Texts. HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens.
[12] Kontos, J., Elmaoglou, A. and Malagardi, I. (2002a). ARISTA Causal Knowledge Discovery from Texts. In: Discovery Science 2002, Proceedings of the 5th International Conference DS 2002, Luebeck, Germany. Springer Verlag, pp. 348-355.
[13] Kontos, J., Malagardi, I., Peros, J., Elmaoglou, A. (2002b). System Modeling by Computer using Biomedical Texts. Res Systemica, Volume 2, Special Issue, October 2002. (http://www.afscet.asso.fr/resSystemica/accueil.html).
[14] Kroetze, J.A. et al. (2003). Differentiating Data- and Text-Mining Terminology. Proceedings of SAICSIT 2003, pp. 93-101.
[15] McLeod J. and Osborne J. (1966). "Physiological simulation in general and particular," pp. 127-138, Natural Automata and Useful Simulations, edited by H. H. Pattee, E. A. Edelsack, Louis Fein, A. B. Callahan , Spartan Books, Washington DC.
[16] Mendes, P. & Kell, D.B. (1996). Computer simulation of biochemical kinetics. In BioThermoKinetics of the living cell (eds. H.V. Westerhoff, J.L. Snoep, F.E. Sluse, J.E. Wijker and B.N. Kholodenko), pp. 254-257. BioThermoKinetics Press, Amsterdam.
[17] de Solla Price, D. (1974). Gears from the Greeks. American Philosophical Society.
[18] Weeber, M. et al. (2000). Text-Based Discovery in Biomedicine: The architecture of the DAD-system. Proceedings of the 2000 AMIA Annual Fall Symposium, pp. 903-907. Hanley and Belfus, Philadelphia, PA.
[19] Zeigler, B.P. (1976). Theory of Modelling and Simulation, John Wiley and Sons, New York.
The AROMA System for Intelligent Text Mining
by JOHN KONTOS, IOANNA MALAGARDI AND JOHN PEROS
Abstract-- The AROMA system for Intelligent Text Mining consisting of three subsystems is presented and its operation analyzed. The first subsystem extracts knowledge from sentences. The knowledge extraction process is speeded-up by a search algorithm. The second subsystem is based on a causal reasoning process that generates new knowledge by combining knowledge extracted by the first subsystem as well as an explanation of the reasoning. The third subsystem is performing simulation that generates time-dependent course and textual descriptions of the behavior of a nonlinear system. The knowledge of structure and parameter values of the equations extracted automatically from the input scientific texts are combined with prerequisite knowledge such as ontology and default values of relation arguments. The application of the system is demonstrated by the use of two examples concerning cell apoptosis compiled from a collection of MEDLINE abstracts of papers used for supporting the model of a biomedical system.
Index Terms—intelligent text mining, biomedical text mining, knowledge extraction, knowledge discovery from text, simulation, history
I. INTRODUCTION
We define Intelligent Text Mining as the process of creating novel knowledge from texts that is not stated in them by combining sentences with deductive reasoning. An early implementation of this kind of Intelligent Text Mining is reported in [8]. A review of different kinds of text mining including Intelligent Text Mining is presented in [14]. There are two possible methodologies of applying deductive reasoning on texts. The first methodology is based on the translation of the texts into some formal representation of their “content” with which deduction is performed. The advantage of this methodology depends on the simplicity and availability of the required inference engine but its serious disadvantage is the need for preprocessing all the texts and storing their translation into a formal representation every time something changes. In particular in the case of scientific texts what may change is some part of the prerequisite knowledge such as ontology used for deducing new knowledge. The second methodology eliminates the need for translation of the texts into a formal representation because an inference engine capable of performing deductions “on the fly” i.e. directly from the original texts is implemented. The only disadvantage of this second methodology is that a more complex inference engine than the one needed by the first one must be built. The strong advantage of the second methodology is that the translation into a formal representation is avoided. In [8] the second methodology is chosen and the method developed is therefore called ARISTA i.e. Automatic Representation Independent Syllogistic Text Analysis.
In [8] one of the application examples concerned medical text mining related to the human respiratory system mechanics. Biomedical text mining is now recognized as a very important field of study [18]. More specifically an early attempt was made in [8] to implement model-based question answering using the scientific text as knowledge base describing the model. The idea of connecting text mining with simulation and question answering was pursued further by our group as reported in [12] and [13] as well as in the present paper. We consider model discovery, updating and question answering as worthy final aims of intelligent text mining.
The AROMA (ARISTA Oriented Modeling Adaptation) system for Intelligent Text Mining accomplishes knowledge discovery using our knowledge representation independent method ARISTA that performs causal reasoning “on the fly” directly from text in order to extract qualitative or quantitative model parameters as well as to generate an explanation of the reasoning. The application of the system is demonstrated by the use of two biomedical examples concerning cell apoptosis and compiled from a collection of MEDLINE paper abstracts related to a recent proposal of a relevant mathematical model. A basic characteristic of the behavior of such biological systems is the occurrence of oscillations for certain values of the parameters of their equations. The solution of these equations is accomplished with a Prolog program. This program provides an interface for manipulation of the model by the user and the generation of textual descriptions of the model behavior. This manipulation is based on the comparison of the experimental data with the graphical and textual simulator output. This is an application to a mathematical model of the p53-mdm2 feedback system. The dependence of the behavior of the model on these parameters may be studied using our system. The variables of the model correspond to the concentrations of the proteins p53 and mdm2 respectively. The non-linearity of the model causes the appearance of oscillations that differ from simple sine waves. The AROMA system for Intelligent Text Mining consists of three subsystems as briefly described below.
The first subsystem achieves the extraction of knowledge from individual sentences that is similar to traditional information extraction from texts. In order to speed up the whole knowledge acquisition process a search algorithm is applied on a table of combinations of keywords characterizing the sentences of the text.
The second subsystem is based on a causal reasoning process that generates new knowledge by combining “on the fly” knowledge extracted by the first subsystem. Part of the knowledge generated is used as parametric input to the third subsystem and controls the model adaptation.
The third subsystem is based on system modeling that is used to generate time dependent numerical values that are compared with experimental data. This subsystem is intended for simulating in a semi-quantitative way the dynamics of nonlinear systems. The characteristics of the model such as structure and parameter polarities of the equations are extracted automatically from scientific texts and are combined with prerequisite knowledge such as ontology and default process and entity knowledge.
In our papers [11], [12] and [13] parts of the AROMA System have been described separately. In the present paper we describe how the integrated system uses causal knowledge discovered from texts as a basis for modeling systems whose study is reported in texts analyzed by our system. We are thus aiming at automating part of the cognitive process of model discovery based on experimental data and supported by domain knowledge extracted automatically from scientific texts. A brief historical review of the technology of simulation is first presented below.
II. THE HISTORY OF SIMULATION
Computer Simulation is playing an increasingly valuable role within science and technology because it is one of the most powerful problem-solving techniques available today. Simulation has its roots in a variety of disciplines including Computer Science, Operations Research, Management Science, Statistics and Mathematics, and is widely utilized in industry and academia as well as in government organizations. Each of these groups influenced the directions of simulation as they developed historically. The History of simulation may be traced to ancient efforts or wishes for the construction of machines or automata simulating inanimate or animate systems. According to the hypothesis of de Sola Price [5] the Ancient Greek Antikythera Mechanism that is dated approximately at 80 B.C. is a mechanized calendar which calculated the places of the Sun and the Moon through the cycle of years and months. These calculations could be considered as performing a simulation. Therefore this mechanism may be considered as the first known constructed simulator in the history of this technology.
Human action was already being simulated in the mechanical theatre of the Ancient Greek scientist and engineer Heron of Alexandria (about 100 A.D.). Using a system of cords, pulleys and levers bound to counterweights, as well as sound effects and changing scenery, Heron was able to create an illusion that brought legends to life. The conception of human beings as machines reaches back to antiquity. As early as the second century A.D., the famous Ancient Greek physician Galen conceived his pneumatic model of the human body in terms of the hydraulic technology of his age [4]. The development of simulators in the modern era predates the advent of digital computers, with the first flight simulator being credited to Edward A. Link in 1927 and a lot of work that followed using mechanical, electrical and electronic analogue computers. One of the earliest simulation studies of biochemical reactions was carried out by Chance during the second world-war [16]. In a remarkable paper where experiment, analysis and simulation were all used, Chance showed the validity of the Michaelis-Menten model for the action of the enzyme peroxidase. The simulations were carried out with a mechanical differential analyser. The differential analyser belongs to a class of computers known as analogue computers. These computers were designed to take input in terms of continuous variables, such as hydraulic pressure or voltage. These were then processed by various units, which altered them according to particular mathematical functions including derivatives and integrals. Analogue computers were programmed by coupling various elementary devices, and in our field of interest they would solve a set of differential equations, in a continuous simulation of the model. Note that unlike digital computers, there is no round-off error involved in calculations processed with analogue computers [16].
Digital simulation was developed after the appearance of the digital computer. The history of digital simulation has been described as comprised of 4 periods:
(1) the advent (circa 1950 - 1960),
(2) the era of simulation programming languages (circa 1960 - 1980),
(3) the era of simulation support environments (circa 1980 - 1990), and
(4) the very modern era (circa 1990 - today).
The advent begins with the inception of digital computers and is marked by early theories of model representation and execution. For example, the fixed-time increment and variable-time increment time flow mechanisms (TFMS) were proposed during this period. The use of random methods for statistical evaluation of simulation results were also formulated during the advent. Once the general principles and techniques had been identified special purpose languages for simulation emerged. The history of simulation programming languages has been organized in five periods of similar developments. The five periods start in 1955 and end in 1986 namely: The Period of Search (1955-1960); The Advent (1961-1965); The Formative Period (1966-1970); The Expansional Period (1971-1978) and The Period of Consolidation and Regeneration (1979-1986) [19].
The proliferation of simulation programming languages (SPLs) like Simula, Simscript, GPSS and MODSIM mark the second period in this history. The design and evolution of SPLs also helped refine the principles underlying simulation. The dominant conceptual frameworks (or world views ) for discrete event simulation -- event scheduling, process interaction, and activity scanning -- were defined during the second period largely as a result of SPL research. Also during this period the first cohesive theories of simulation modeling were formulated, e.g. DEVS (discrete event system specification) and its basis in general systems theory [19].
The third period in the history of digital simulation is evidenced by a shift of focus from the development of the simulation program toward the broader life cycle of a simulation project. This means extending software and methodological support to such activities as problem formulation, objectives identification and presentation of simulation results. This was the era of the integrated simulation support environment (ISSE). Environment research occurred throughout the simulation communities: within academia, industry and within governments. Also during this third period a great interest emerged within the U.S.A. Department of Defence in interoperable, networked simulators.
In the contemporary era emphasis and direction seem to vary across the primary simulation communities. In the commercial sector environments and languages remain a major focus. The objective seems to be maximized market share through specialization; for example commercial environments specializing in communications network simulation. Within academia environments are also a focus, but these environments appear to favour generality of purpose over specialization. The majority of modelling methodology activity remains in the academic sector and most of the work involving the execution of simulation models on parallel computers is also occurring in university laboratories.
One of the pioneers in the field is John McLeod [15] who in 1947 went to work at the U.S. Naval Air Missile Test Center (now Pacific Missile Range) Point Mugu, California. While there, he sparked and supervised development of the Guidance Simulation Laboratory which, within a few years, became one of the leading American simulation facilities. In 1952 he organized the Simulation Council (now The Society for Modeling and Simulation International) and with his wife Suzette began publication of the Simulation Council Newsletter. From 1956 to 1963 he worked as Design Specialist, Space Navigation and Data Processing, with General Dynamics/Astronautics in San Diego. While there, he received a grant to support testing of an extra-corporeal perfusion device (heart-lung machine) which he had developed on his own time. He also acted as co-founder of the San Diego Symposium for Biomedical Engineering, and edited the proceedings of the first symposium, held in 1961.
The first anaesthesia simulator was developed at the University of Southern California in the late 1960s. It featured spontaneous ventilation, a heart beat, temporal and carotid pulses, blood pressure, opened and closed mouth, blinks eyes, muscle fasciculation, and coughed. It responded to 4 intravenous drugs: thiopental, succinylcholine, epinephrine, and atropine. Additionally it responded to oxygen and nitrous oxide. The first anesthesia simulator was based upon “scripting”. A script prescribes the consequences of an action [2].
A team at the University of Florida resurrected the early technology in the late 1980s [3]. In concert with a team of computer scientists and engineers, the Human Patient Simulator [HPS] was created and introduced in the early 1990s. What distinguishes the Human Patient Simulator from the first anaesthesia simulator is that it is based upon modelling. The software that runs the HPS uses complex mathematical equations, which define the many factors comprising the cardiovascular and respiratory systems of humans. If a drug or event affects one or more factors, the new equation will describe the resulting changes. Thus, if an intervention is correct and timely we will see improvement in the simulated patient’s condition. If the intervention is incorrect the simulated patient’s condition will deteriorate and ultimately lead to cardiac arrest and death.
As far as Greek specialists are concerned it should be noted that J. Kontos started research in simulation since 1960 with the simulation e.g. of part of the respiratory system and magnetohydrodynamic systems [6], [7].
III. BIOMEDICAL SYSTEM MODEL DISCOVERY SUPPORT
There exists some research on computational methods for the application of inductive learning methods in discovery of new knowledge. However the knowledge induced by such methods usually has little relation to the formalisms and concepts used by scientists and engineers. Experts in some domains may reject output of a learning system, unless it is related to their prior knowledge. In contrast the use of models in science and engineering may provide an explanation that includes variables, objects, or mechanisms that are unobserved, but that help predict the behavior of observed variables. Explanations also make use of general concepts and ontologies or relations for explaining experimental findings using scientific models.
We will focus here on a particular class of system models consisting of processes that describe one or more causal relations between input variables and output variables. A process may also include conditions, stated as threshold tests on its input variables, that describe it when it is active. This knowledge is expressed in terms of differential equations when it involves change over time or algebraic equations when it involves instantaneous effects. A process model consists of a set of processes that link observable input variables with observable output variables, possibly through unobserved theoretical terms. The concept of process is fundamental to our original early proposal of the ARISTA method and its modeling application in [8].
Process models are often designed to characterize the behavior of dynamical systems that change over time, though they can also handle systems in equilibrium. The data produced by such systems differ from those that arise in most induction tasks in a variety of ways. First, these variables are primarily continuous, since they represent quantitative measurements of the system under study. Second, the observed values are not independently and identically distributed, since those observed at later time steps depend on those measured earlier. The dynamical systems explained by our models are viewed as deterministic. The observations themselves may well contain noise but we assume that the processes themselves are always active whenever their conditions are met and that their equations have the same form all the time. We use this assumption because scientists and engineers often treat the systems they study as deterministic.
A biomedical system model discovery support system like AROMA may revolutionize scientific discovery by providing computer tools for the automatic checking of the validity of a model as supported by experimental findings. With such tools the updating of models may also be facilitated whenever experimental findings are reported that disagree with a generally accepted model. Therefore model discovery support is a worthy final aim for Intelligent Text Mining.
IV. THE GENERAL ARCHITECTURE OF OUR AROMA SYSTEM
The general architecture of our system is shown in Figure 1. and consists of three subsystems namely the Knowledge Extraction Subsystem, the Causal Reasoning Subsystem and the Simulation Subsystem. These subsystems are briefly described below. The texts of the example application presented here are compiled from the MEDLINE abstracts of papers used by [1] as references that amount to 73 items. Most of these papers are used in [1] to support the discovery of a quantitative model of protein concentration oscillations related to cell apoptosis constructed as a set of differential equations. We are aiming at automating part of such a cognitive process by our system. The collection of the MEDLINE abstracts is processed by a preprocessor module so that they take the form required by our Prolog programs i.e. one sentence per line.
A. The Knowledge Extraction Subsystem
This subsystem integrates partial causal knowledge extracted from a number of different texts. This knowledge is expressed in natural language using causal verbs such as “regulate”, “enhance” and “inhibit”. These verbs usually take as arguments entities such as protein names and gene names that occur in the biomedical texts that we use for the present applications. In this way causal relation between the entities are expressed. The input files used for this subsystem contain abstracts downloaded from MEDLINE. A lexicon containing words such as causal verbs and stopwords are also input to this subsystem. An output file is produced by the system that contains parts of sentences collected from the original sentences of different abstracts. These output file is used for reasoning by the second subsystem. The operation of the subsystem is based on the recognition of a causal verb or verb group. After this recognition complements of the verbs are chunked by processing the neighboring left and right context of the verb. This is accomplished by using a number of stopwords such as conjunctions and relative pronouns. The input texts are submitted first to a preprocessing module of the subsystem that converts automatically each sentence into a form consisting of Prolog facts that represent numerically information concerning the identification of the sentence that contains the word and its position in the sentence. This set of Prolog facts has nothing to do with logical representation of the “content” of the sentences as it seems to have been inaccurately reported in [10]. It should be emphasized that we do not deviate from our ARISTA method with this translation. We simply “annotate” each word with information concerning its position within the text. It should be noted that our annotation is not added in the original text but it is represented as the set of Prolog facts mentioned above.
B. The Causal Reasoning Subsystem
The output of the first subsystem is used as input to the second subsystem that combines causal knowledge in natural language form to produce by automatic deduction conclusions not mentioned explicitly in the input text. The operation of this subsystem is based on the ARISTA method [8]. The sentence fragments containing causal knowledge are parsed and the entity-process pairs are recognized. The user questions are analysed and reasoning goals are extracted from them. The answers to the user questions are generated automatically by a reasoning process together with explanations in natural language form. This is accomplished by the chaining “on the fly” of causal statements using prerequisite knowledge such as ontology to support the reasoning process. A second output of this subsystem consists of both qualitative and quantitative information that is input to the third subsystem and controls the adaptation of the model of the biomedical system.
C. The Simulation Subsystem
The third subsystem is used for modelling the dynamics of the biomedical system specified on the basis of the MEDLINE abstracts processed by the first subsystem. The characteristics of the model such as structure and parameter values will eventually be extracted from the input texts combined with prerequisite knowledge such as ontology and default process and entity knowledge. Considering the above example two coupled first order differential equations are used as the approximate mathematical model of the biomedical system in rough correspondence with the model proposed in [1]. A basic characteristic of the behaviour of such a system is the occurrence of oscillations for certain values of the parameters of the equations. The equations in finite difference form that approximate the differential equations are:
Δx= a1*x + b1*y + c1*x*y (1)
Δy= a2*y + b2*delay(d,x) (2)
Where Δx means the difference between the value of the variable x at the present time and the value of the variable x at the next time instant and delay(d,x) means the value of x before d units of time. Time is taken to advance in discrete steps. The variables x and y correspond to the concentrations of the proteins p53 and mdm2 respectively. The symbols a1, b1, c1, a2, b2 stand for the parameters of the equations. It is noted that multiplicative term c1*x*y renders equation (1) non-linear. This non-linearity causes the appearance of the oscillations to differ from simple sine waves. The solution of these equations is accomplished with a Prolog program that provides an interface for manipulating the parameters of the model. An important module of the simulation subsystem is one that generates text describing the behaviour of the variables of the model on true.
V. A FIRST EXAMPLE OF BIOMEDICAL TEXT MINING
An illustrative subset of sentences used in this first illustrative example is the following where the reference numbers of the papers with which the authors of [1] refer to are given in parentheses:
The p53 protein is activated by DNA damage. (23)
Expression of Mdm2 is regulated by p53. (32)
Mdm2 increase inhibits p53 activity. (17)
Using these sentences our system discovers automatically the qualitative causal process model with a negative feedback loop that can be summarized as:
DNA damage +causes p53 +causes mdm2 -causes p53
Where +causes means “causes increase” and -causes means “causes decrease or inhibition”
by answering the question:
Is there a process loop of p53?
This question is internally represented as the Prolog goal: “cause(P1,p53,P2,p53,S)”, where P1 and P2 are two process names that the system extracts from the texts and characterize the behavior of p53. S stands for the overall effect of the feedback loop found i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” or “negative” since a positive causal connection is followed by a negative one.
The short answer automatically generated by our system is:
Yes.
The loop is p53 activity –causes p53 production.
The long answer automatically generated by our system is:
Using sentence 17 with inference rule IR4
since the DEFAULT process of p53 is
using sentence 32
the EXPLANATION is:
since
p53 production –causes activity of p53
because
p53 production +causes expression of Mdm2
and
increase of Mdm2 –causes activity of p53
It should be noted that the combination of sentences (17) and (32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge.
This knowledge is inserted manually in our system as Prolog facts and can be stated as:
“the DEFAULT process of p53 is ‘production’” or
in Prolog: “default(p53,production).”.
“the process ‘increase’ is equivalent to the process ‘expression’” or
in Prolog “equivalent(increase,expression).”.
The above analysis of the text fragments of the first example is partially based on the following prerequisite knowledge which is also manually inserted as Prolog facts:
kind_of(“the”,“determiner”)
kind_of(“is”,“copula”)
kind_of(“of”,“preposition”)
kind_of(“p53”,“entity_noun”)
kind_of(“protein”,“entity_noun”)
kind_of(“DNA”,“entity_noun”)
kind_of(“Mdm2”,“entity_noun”)
kind_of(“activated”,“causal_connector”)
kind_of(“inhibits”,“causal_connector”)
kind_of(“regulated”,“causal_connector”)
kind_of(“damage”,“process”)
kind_of(“expression”,“process”)
kind_of(“increase”,“process”)
kind_of(“activity”,“process”)
The above prerequisite knowledge base fragment contains both general linguistic and domain dependent ontological knowledge about the words occurring in the corpus. In practice of course these two parts of knowledge are and handled differently by the inference rules of the reasoning module.
VI. A SECOND EXAMPLE OF BIOMEDICAL TEXT MINING
The second example text is also compiled from two MEDLINE abstracts of papers used by [1] as references. These two abstracts downloaded from MEDLINE again contain knowledge concerning the interaction of the proteins p53 and mdm2. These proteins are involved in the life cycle of the cell. The first abstract consists of six sentences from which two are selected by the first subsystem from which the following fragments are extracted automatically.
“The p53 protein regulates the mdm2 gene” “regulates both the activity of the p53 protein”
These fragments are then automatically transformed as Prolog facts in order to be processed by the second subsystem as shown below:
t(“325”, “The p53 protein regulates the mdm2 gene”).
t(“326”, “regulates both the activity of the p53 protein”).
The numbers 325 and 326 denote that these fragments are extracted from the sentences 5 and 6 of the text 32.
The second abstract consists of seven sentences from which two are selected by the first subsystem from which the following fragments are extracted automatically
“The mdm2 gene enhances the tumorigenic potential of cells”
“The mdm2 oncogene can inhibit p53_mediated transactivation”
and expressed in the form of Prolog facts as:
t(“923”, “The mdm2 gene enhances the tumorigenic potential of cells”).
t(“927”, “The mdm2 oncogene can inhibit p53_mediated transactivation”).
Using the sentences of the second example our system discovers the causal negative feedback loop:
p53 +causes mdm2 -causes p53
Where +causes means “causes increase” and -causes means “causes decrease or inhibition”
by answering the question:
Is there a process loop of p53?
This question is internally represented as the Prolog goal:
“cause(P1,p53,P2,p53,S)”
where P1 and P2 are two process names that the system extracts from the texts and characterize the behavior of p53. S stands for the overall effect of the feedback loop found i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” since a positive causal connection is followed by a negative one.
The short answer automatically generated by our system is:
Yes.
The loop is p53 activity -causes p53 production.
The long answer automatically generated by our system is:
the QUESTION is:
Get process loop of p53
OR
cause(P1,p53,P2,p53,S)
USING INFERENCE RULE IR4a
since the DEFAULT entity of
USING sentence 927 with inference rule IR4
USING INFERENCE RULE IR4b
USING sentence 325
the EXPLANATION is:
since
p53 protein -causes p53
because
p53 protein +causes gene of mdm2
and
oncogene of mdm2 -causes p53_mediated transactivation of p53
It should be noted that the combination of the sentences from texts (92) and (32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge.
This knowledge is inserted manually into our system as Prolog facts and can be stated as:
the DEFAULT entity of
VII. EXAMPLES OF OPERATION OF THE SIMULATION SUBSYSTEM
The results of the two simulation examples are produced by two different sets of values of the parameters a1, a2, b1, b2, c1 and d in the following equations:
Δx= a1*x + b1*y + c1*x*y (1a)
Δy= a2*y + b2*delay(d,x) (2a)
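A minimal sketch of how equations (1a) and (2a) can be iterated is given below, in Python rather than the Prolog of the actual simulation subsystem. It assumes Δ denotes a forward difference (x[t+1] = x[t] + Δx) and that delay(d, x) is the value of x taken d steps earlier (with the initial value used while t < d); the parameter values in the example call are made up, since the paper's two parameter sets are not reproduced here.

```python
def simulate(a1, b1, c1, a2, b2, d, x0, y0, steps):
    """Iterate the difference equations (1a)-(2a):
       x[t+1] = x[t] + a1*x[t] + b1*y[t] + c1*x[t]*y[t]
       y[t+1] = y[t] + a2*y[t] + b2*x[t-d]
    delay(d, x) is read as the value of x d steps earlier,
    taken as x0 while the history is shorter than d."""
    xs, ys = [x0], [y0]
    for t in range(steps):
        delayed_x = xs[t - d] if t >= d else xs[0]
        dx = a1 * xs[t] + b1 * ys[t] + c1 * xs[t] * ys[t]
        dy = a2 * ys[t] + b2 * delayed_x
        xs.append(xs[t] + dx)
        ys.append(ys[t] + dy)
    return xs, ys

# Example run with illustrative (not the paper's) parameter values:
xs, ys = simulate(a1=-0.1, b1=0.2, c1=-0.01, a2=-0.1, b2=0.1, d=2,
                  x0=1.0, y0=0.5, steps=50)
```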
VIII. CONCLUSIONS
We presented our AROMA system for Intelligent Text Mining. The system consists of three main subsystems. The first subsystem extracts from sentences knowledge related to the structure and the parameters of the biomedical system being simulated. The second subsystem is based on a reasoning process that generates new knowledge by combining “on the fly” the knowledge extracted by the first subsystem, together with explanations of the reasoning. The third subsystem is a system simulator written in Prolog that generates the time behavior of the model’s variables. Two important features of the simulation subsystem are the use of structure and parameter knowledge automatically extracted from scientific texts and the automatic generation of texts describing the behavior of the system being simulated. Our final aim is to model biomedical systems by integrating knowledge extracted from different texts and to give the user a facility for questioning these models during a collaborative man-machine model discovery procedure. The model-based question answering we are aiming at may support both biomedical researchers and medical practitioners.
REFERENCES
[1] Bar-Or, R. L. et al (2000). Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, vol. 97, No 21 pp. 11250-11255, October.
[2] Denson JS and Abrahamson S (1969). A computer-controlled patient simulator. JAMA 208:504-508.
[3] Good, M.L. and Gravenstein, J.S. (1996). Anesthesia simulators and training devices. In Prys Roberts C, Brown BR Jr (eds) International Practice of Anaesthesia, Vol 2. Oxford, Butterworth-Heinemann, pp. 2/167/1-11.
[4] Grau, O. (2000). The History of Telepresence: Automata, Illusion and the Rejection of the Body. In: Ken Goldberg (ed.): The Robot in the Garden: Telerobotics and Telepistemology on the Internet, Cambridge, Mass., pp. 226-246.
[5] Kean, V.J. (1995). The Ancient Greek Computer from Rhodes known as the Antikythera Mechanism. Efstathiadis Group, Greece.
[6] Kontos, J. (1965). Computation of Instability Growth Rates in Finite Conductivity Magnetohydrodynamics. Nuclear Fusion, Vol 5, No 2.
[7] Kontos, J. et al (1966), A Control Engineering Approach to Magnetohydrodynamic Stability. Proceedings of the I.E.E., Vol 113, No 3.
[8] Kontos, J. (1992) ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology. vol. 34, No 9, pp.611-616.
[9] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, vol 26, No. 2, pp. 103-122, October.
[10] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems vol. 26, No. 2, pp. 103-122, October.
[11] Kontos, J. and Malagardi, I. (2001) A Search Algorithm for Knowledge Acquisition from Texts. HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens.
[12] Kontos, J., Elmaoglou, A. and Malagardi, I. (2002a). ARISTA Causal Knowledge Discovery from Texts. Discovery Science 2002, Luebeck, Germany. Proceedings of the 5th International Conference DS 2002, Springer Verlag, pp. 348-355.
[13] Kontos, J., Malagardi, I., Peros, J. and Elmaoglou, A. (2002b). System Modeling by Computer using Biomedical Texts. Res Systemica, Volume 2, Special Issue, October 2002. (http://www.afscet.asso.fr/resSystemica/accueil.html).
[14] Kroetze, J.A. et al. (2003). Differentiating Data- and Text-Mining Terminology. Proceedings of SAICSIT 2003, pp. 93-101.
[15] McLeod, J. and Osborne, J. (1966). "Physiological simulation in general and particular," pp. 127-138. In Natural Automata and Useful Simulations, edited by H. H. Pattee, E. A. Edelsack, Louis Fein and A. B. Callahan, Spartan Books, Washington DC.
[16] Mendes, P. & Kell, D.B. (1996). Computer simulation of biochemical kinetics. In BioThermoKinetics of the living cell (eds. H.V. Westerhoff, J.L. Snoep, F.E. Sluse, J.E. Wijker and B.N. Kholodenko), pp. 254-257. BioThermoKinetics Press, Amsterdam.
[17] de Solla Price, D. (1974). Gears from the Greeks. American Philosophical Society.
[18] Weeber, M. et al. (2000). Text-Based Discovery in Biomedicine: The architecture of the DAD-system. Proceedings of the 2000 AMIA Annual Fall Symposium, pp. 903-907. Hanley and Belfus, Philadelphia, PA.
[19] Zeigler, B.P. (1976). Theory of Modelling and Simulation, John Wiley and Sons, New York.