Monday, 30 November 2009

Motion Verbs and Vision

Ioanna Malagardi and John Kontos
HERCMA 2007 8th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens
 (abridged)
Abstract--In this paper we use the analysis of the definitions of Motion Verbs to understand and describe action sequences such as those recorded in a video. In the computer processing of verbs of motion, the definitions are given as input to a system whose output is a grouping of the verbs together with synthesized definitions of these verbs in terms of primitives. In the system presented here the input action sequence is analyzed using the semantics of primitive motion verbs and the way they combine to synthesize complex verbs that summarize the action sequence. A future application of this work could be the automatic generation of texts describing moving images obtained by artificial vision systems. Such texts may be helpful for people with vision disabilities.
Index Terms – Cognitive Vision, Moving Images, Motion Verbs, Semantic Ontology, Video

1 INTRODUCTION
In this paper we use the analysis of the definitions of Motion Verbs to understand and describe action sequences such as those recorded in a video. Motion Verbs are analyzed in terms of primitive verbs by means of definition chains, as described below. A primitive motion verb can be classified according to pictorial criteria that may be obtained by the comparative analysis of a sequence of images. This classification can be inherited by non-primitive verbs in accordance with their dependence on primitive verbs.
A variety of approaches have been proposed for the processing of action sequences recorded in a video. Most of these approaches refer to one or more of three levels of representation, namely, image level, logic level and natural language level. Motion verbs are useful for the third level representation.
In previous work [1] and [2] we presented a system of programs that concerns the processing of definitions of 486 verbs of motion as they are presented in a dictionary. This processing aimed at the exploitation of dictionaries for Natural Language Processing systems. A recent proposal for the organization of a Machine Readable Dictionary is based on the structure and development of the Brandeis Semantic Ontology (BSO), a large generative lexicon ontology and lexical database. The BSO has been designed to allow for more widespread access to Generative Lexicon-based lexical resources and help researchers in a variety of natural language computational tasks [3].
The semantic representation of images resulting from knowledge-assisted semantic image analysis, e.g. [4], can be used to identify a primitive motion verb describing a sequence of a few images. Other efforts for the semantic annotation and representation of image sequences aim at building tools for the pre-processing of images and are briefly presented below.

2 RELATED WORK
Cho et al. [5] propose to measure similarity between trajectories in a video using motion verbs. They use a hierarchical model based on is_a and part_of relations in combination with antonym relations. The ontological knowledge required by their system is extracted from WordNet.
They created five base elements to represent the motion of moving objects, namely “approach”, “go to”, “go into”, “depart” and “leave”. They use motion verbs to represent the movement of objects as high level features from the cognitive point of view. According to this paper the problem of bridging the gap between high level semantics and low level video features is still open. The method proposed in the present paper is a novel contribution towards the solution of this bridging problem.
The University of Karlsruhe group of Dahlkamp and Nagel [6], [7] is developing a system for cognitive vision applied to the understanding of inner-city vehicular road traffic scenes. They argue that an adequate natural language description of developments in a real-world scene can be taken as a proof of “understanding what is going on”. In addition to vehicle manoeuvres, the lane structure of inner-city roads and road intersections is extracted from images and provides a reference for both the prediction of vehicle movements and the formulation of textual descriptions. Individual actions of an agent vehicle are associated with verb phrases that can be combined with a noun phrase referring to the agent vehicle to construct a single sentence in isolation. The next step is to concatenate individual manoeuvres into admissible sequences of occurrences. Such knowledge about vehicular behaviour is represented internally as a situation graph formed by situation nodes connected by prediction edges. They organize situation nodes not only according to their temporal concatenation but also according to a degree of conceptual refinement. For example, an abstract situation node called “cross” (for crossing an intersection) is refined into a subgraph that consists of a concatenation of three situation nodes, namely (1) drive_to_intersection, (2) drive_on_intersection, (3) leave_intersection. Such a refinement can take place recursively. A subordinate situation node inherits all predicates from its superordinate situation nodes. A path through a directed situation graph tree implies that the agent executes the actions specified in the most detailed situation node reached at each point in time during traversal; that is, such a path implies the behaviour associated with a concatenation of the actions encountered along it.
Using these situation graph trees the above mentioned system generates a list of elementary sentences describing simple events. These sentences are analogous to the sentences we input to our system, which analyses office scenes instead of traffic scenes and recognizes higher level event structures.
Recently learning systems have been developed for the detection and representation of events in videos. For example, A. Hakeem and M. Shah [8] propose an extension of the CASE representation of natural languages that facilitates the interface between users and the computer. CASE is a representation proposed by Fillmore [9]. They propose two critical extensions to CASE that concern the involvement of multiple agents and the inclusion of temporal information.
M. Fleischman et al. (2006) [10] present a methodology to facilitate learning temporal structure in event recognition. They modeled complex events using a lexicon of hierarchical patterns of movement, which were mined from a large corpus of unannotated video data. These patterns act as features for a tree kernel based Support Vector Machine that is trained on a small set of manually annotated events. Two distinct types of information are encoded by these patterns. First, by abstracting to the level of events as opposed to lower level observations of motion, the patterns allow for the encoding of more fine grained temporal relations than traditional HMM approaches. Additionally, hierarchical patterns of movement have the ability to capture global information about an event.

3 GREEK MOTION VERB PRIMITIVES
In the computer processing of verbs of motion, the definitions are given as input to a system of Prolog programs, which outputs a grouping of the verbs and synthesized definitions of these verbs.
A set of 486 verb entries related to motion was used as input to a system that produced groups of them on the basis of chains of their definitions. The definitions of the verbs that occur in each definition are in turn retrieved from the lexicon, and in this way chains of definitions are formed. These chains end up in circularities that correspond to reaching basic verbs. The cyclic parts of the definition chains were detected automatically by the system and had to be eliminated before the chains could be used. Eliminating the circularity that occurs in a chain requires the choice of a suitable verb as the terminal of the chain, and each such choice requires the adoption of some “ontology”. The verb at the end of a chain was used as the criterion of verb grouping.
The results of the automatic grouping were compared with groupings in Greek, German and English that were done manually. The construction of chains was then applied to the automatic construction of definitions using a small number of verbs that appeared at the end of the definition chains and were named “basic”, representing primitive actions. The English translations of some of the primitive Greek motion verbs obtained by our system are: touch, take, put, stir, raise, push, walk. These translations are used in the system presented in this paper in order to make it intelligible to a wider audience.
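
The chain construction and grouping described in this section can be illustrated with a minimal sketch. It is written in Python rather than the Prolog of the original system, and everything in it is an assumption made for illustration: the toy lexicon, the reduction of each definition to a single head verb, and the PREFERRED set that stands in for the “ontology” used to pick a terminal verb when a chain becomes circular.

```python
# Hypothetical toy lexicon: each verb maps to the head verb of its definition.
# The entries are illustrative only; they are not the 486 Greek entries.
DEFINITIONS = {
    "transport": "carry",
    "carry": "take",
    "take": "grasp",
    "grasp": "take",   # circular pair: take -> grasp -> take
    "stroll": "walk",
    "walk": "move",
    "move": "move",    # a verb defined by itself, a one-step cycle
}

# Stand-in for the "ontology": verbs accepted as basic (an assumption).
PREFERRED = {"take", "move"}


def definition_chain(verb, definitions):
    """Follow definitions from `verb` until a verb repeats.

    Returns (chain, cycle): the verbs visited, and the cyclic tail of the
    chain, whose members are the candidates for the terminal verb.
    """
    chain = []
    position = {}
    current = verb
    while current in definitions and current not in position:
        position[current] = len(chain)
        chain.append(current)
        current = definitions[current]
    if current in position:
        cycle = chain[position[current]:]
    else:
        # No lexicon entry for this verb: treat it as the terminal.
        chain.append(current)
        cycle = [current]
    return chain, cycle


def choose_terminal(cycle, preferred=PREFERRED):
    """Pick the first preferred verb of the cycle, else its first member."""
    for verb in cycle:
        if verb in preferred:
            return verb
    return cycle[0]


def group_by_terminal(definitions):
    """Group every verb under the terminal verb of its definition chain."""
    groups = {}
    for verb in definitions:
        _, cycle = definition_chain(verb, definitions)
        groups.setdefault(choose_terminal(cycle), []).append(verb)
    return groups


groups = group_by_terminal(DEFINITIONS)
```

With this toy lexicon, “transport”, “carry”, “take” and “grasp” fall into the group of “take”, and “stroll”, “walk” and “move” into the group of “move”. A different PREFERRED set would break the circularities differently, which is the sense in which the choice of terminal depends on an ontology.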

4 UNDERSTANDING AND DESCRIPTION OF MOVING IMAGES
The choice of one or more primitive verbs for the automatic description of a motion sequence is based on the abstract logical description of this sequence. The abstract logical description is supposed to contain declarations of the position and state of the different entities depicted in the images from which the action sequence is extracted. The comparative logical analysis of the semantic representations of images resulting from knowledge-assisted semantic image analysis is used to identify a primitive motion verb describing a sequence of a few images. The synthesized definitions of more complex motion verbs, together with other domain knowledge, are used to generate text that describes a longer part of the action sequence with these verbs. A system that we implemented for the description in English of action sequences is described below.

5 SYSTEM DESCRIPTION
The system consists of an input module that accepts formal descriptions of action sequences. These sequences are analyzed by a primitive action recognizer module that provides input to a complex verb recognizer and finally an output module generates the sequence description. The system was implemented in Prolog and two examples of its operation are given below. The three main modules of our system are briefly described below giving indicative Prolog rules that are used for the accomplishment of its basic function. Finally a simple example is given of how the system could be augmented in order to be able to answer natural language questions about the evolution of the action sequence that was input. This constitutes an image grounded human computer interface for multimodal natural language question answering systems. Such an interface could be a test of whether “Cognitive Vision” is achieved. An early system for the generation of visual scenes with a multimodal interface was reported in [11].

5.1 The Input Module
The Input Module accepts a sequence of facts representing images using a predicate herewith named “frame” that constitute abstract descriptions of the images. The “frame” predicate records information concerning the time of taking the image, location of the acting agent and the state of the agent and all other entities of interest.
The following are examples of rules defining the semantics of the primitive verbs “take” and “put” and which are used for the recognition of the occurrence of primitive actions by combining successive image formal descriptions:

THE TAKE RULE:

take(A,X,L1,C5):-frame(T1,L,_,_,L1,_),
frame(T2,L,_,_,hand,_),T2=T1+1,L1<>"hand",
entities(A,_,X,_),
c(" the ",A,C1),c(C1," took the ",C2),
c(C2,X,C3),c(C3," from the ",C4),c(C4,L1,C5).

THE PUT RULE:

put(A,X,D):-frame(T1,L,_,_,hand,_),
frame(T2,L,_,_,D,_),T2=T1+1,D<>"hand",!,
entities(A,_,X,_),write(" the ",A," put the ",X," at the ",D).

Where :

c(X,Y,Z):-concat(X,Y,Z) that constructs the concatenation Z of the strings X and Y.

The positions of the objects in the microcosm are stated as below:

position(desk,1).
position(door,3).
position(bookcase,6).

The possible states of the door and the book are given by: state(opened,open).

The states of the agent are classified as stationary and moving. E.g. the stationary states are defined by:

stationary(sitting).
stationary(standing).

5.2 The Complex Verb Recognizer Module
The Complex Verb Recognizer Module uses the semantics of the complex verbs that appear in the action sequence descriptions, expressed in terms of the primitive verbs used for the low level description of the actions depicted by short sequences of images. The following is an example of a rule defining the semantics of a complex verb, “transport”, that is used by the Complex Verb Recognizer Module.

THE TRANSPORT RULE

transport(A,X,L1,L2):-write(" because "),nl,
take(A,X,L1,C),!,write(C),
write(" and "),nl,put(A,X,L2),
L1<>L2,write(" it follows that "),nl,
write("The ",A," transported the ",X," from the ",L1, " to the ",L2),nl.

5.3 The Action Sequence Description Generation Module
The output of our system is a sentence describing briefly the input action sequence and an explanation giving the reasons that support this description using primitive verbs. The operation of this module is closely related to the operation of the complex verb recognizer module and generates a single sentence description together with an explanation that supports the description generation.

6 EXAMPLES USED FOR THE EVALUATION OF THE SYSTEM

6.1 The First Example of Action Sequence Description
A simple example is presented here that was used for the evaluation of the feasibility of our approach. Consider the microcosm of an office environment. A video taken of an agent acting in such an environment may depict the agent approaching a bookcase in another room, taking a book from it and placing the book on her desk. The sequence of images may show the following sequence of actions:

1. The agent is sitting at her desk.
2. The agent is getting up and walking to the door of her room.
3. The agent opens and goes through the door of her room.
4. The agent approaches the bookcase.
5. The agent takes a book from the bookcase.
6. The agent approaches her desk and puts the book on it.
7. The agent sits at her desk and opens the book.

This action sequence may finally be described by the sentence “The agent transported a book from the bookcase to her desk”.
The above action sequence is first represented as a set of facts using the “frame” predicate as follows:

frame(1,1,sitting,closed,bookcase,closed).
frame(2,1,standing,closed,bookcase,closed).
frame(3,2,walking,closed,bookcase,closed).
frame(4,3,walking,closed,bookcase,closed).
frame(5,3,standing,open,bookcase,closed).
frame(6,3,walking,open,bookcase,closed).
frame(7,4,walking,open,bookcase,closed).
frame(8,5,walking,open,bookcase,closed).
frame(9,6,standing,open,bookcase,closed).
frame(10,6,standing,open,hand,closed).
frame(11,5,walking,open,hand,closed).
frame(12,4,walking,open,hand,closed).
frame(13,3,walking,open,hand,closed).
frame(14,2,walking,open,hand,closed).
frame(15,1,standing,open,hand,closed).
frame(16,1,standing,open,desk,closed).
frame(17,1,sitting,open,desk,open).
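
As an illustration of how the take, put and transport rules of Section 5 apply to the frames above, the following is a minimal Python sketch; it is not the authors' Prolog implementation. The field names given to the six arguments of “frame” are an interpretation inferred from the examples, since the source does not name them, and the requirement that the put happens after the take is an addition that the Prolog rules shown earlier do not state.

```python
from collections import namedtuple

# Field names are an interpretation of the six "frame" arguments (assumption).
Frame = namedtuple("Frame", "time agent_pos agent_state door book_loc book_state")

# The 17 frames of the first example, copied from the facts listed above.
FRAMES = [Frame(*f) for f in [
    (1, 1, "sitting", "closed", "bookcase", "closed"),
    (2, 1, "standing", "closed", "bookcase", "closed"),
    (3, 2, "walking", "closed", "bookcase", "closed"),
    (4, 3, "walking", "closed", "bookcase", "closed"),
    (5, 3, "standing", "open", "bookcase", "closed"),
    (6, 3, "walking", "open", "bookcase", "closed"),
    (7, 4, "walking", "open", "bookcase", "closed"),
    (8, 5, "walking", "open", "bookcase", "closed"),
    (9, 6, "standing", "open", "bookcase", "closed"),
    (10, 6, "standing", "open", "hand", "closed"),
    (11, 5, "walking", "open", "hand", "closed"),
    (12, 4, "walking", "open", "hand", "closed"),
    (13, 3, "walking", "open", "hand", "closed"),
    (14, 2, "walking", "open", "hand", "closed"),
    (15, 1, "standing", "open", "hand", "closed"),
    (16, 1, "standing", "open", "desk", "closed"),
    (17, 1, "sitting", "open", "desk", "open"),
]]


def book_moves(frames):
    """Yield (time, old_location, new_location) whenever the book changes
    place between two successive frames while the agent stays in place."""
    for a, b in zip(frames, frames[1:]):
        if (b.time == a.time + 1 and a.agent_pos == b.agent_pos
                and a.book_loc != b.book_loc):
            yield b.time, a.book_loc, b.book_loc


def describe_transport(frames, agent="agent", thing="book"):
    """Return an explanation and description if a take is followed by a put
    at a different place, otherwise None."""
    moves = list(book_moves(frames))
    # take: the book moves from some place into the hand
    take = next((m for m in moves if m[2] == "hand"), None)
    if take is None:
        return None
    # put: the book later moves from the hand to some place
    put = next((m for m in moves if m[1] == "hand" and m[0] > take[0]), None)
    if put is None or put[2] == take[1]:
        return None
    source, dest = take[1], put[2]
    return (f"because the {agent} took the {thing} from the {source} "
            f"and the {agent} put the {thing} at the {dest} "
            f"it follows that the {agent} transported the {thing} "
            f"from the {source} to the {dest}")


print(describe_transport(FRAMES))
```

On these frames the sketch finds the take between frames 9 and 10 and the put between frames 15 and 16, which is where the book location changes from “bookcase” to “hand” and from “hand” to “desk”.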

6.2 The Second Example of Action Sequence Description
The second example concerns the same microcosm as above but with a different action sequence, which may be described by the sentence “The agent transported a book from her desk to the bookcase”. This action sequence is represented as a set of facts using the “frame” predicate as follows:

frame(1,1,sitting,closed,desk,closed).
frame(2,1,standing,closed,hand,closed).
frame(3,2,walking,closed,hand,closed).
frame(4,3,walking,closed,hand,closed).
frame(5,3,standing,open,hand,closed).
frame(6,3,walking,open,hand,closed).
frame(7,4,walking,open,hand,closed).
frame(8,5,walking,open,hand,closed).
frame(9,6,standing,open,hand,closed).
frame(10,6,standing,open,bookcase,closed).
frame(11,5,walking,open,bookcase,closed).
frame(12,4,walking,open,bookcase,closed).
frame(13,3,walking,open,bookcase,closed).
frame(14,2,walking,open,bookcase,closed).
frame(15,1,standing,open,bookcase,closed).
frame(16,1,standing,open,bookcase,closed).
frame(17,1,sitting,open,bookcase,open).

entities(agent,door,book,book).

7 QUESTION ANSWERING FROM ACTION SEQUENCES
A future expansion of our system is the implementation of a question answering module that may answer questions about the evolution of action in an action sequence. For example, when the question "when is the door opened?" is asked, the time is given as output. The processing of such a question can be accomplished by rules like:

q1:-q(1,Q),f(Q,when,R1),f(R1,is,R2),f(R2,the,R3), f(R3,door,R4),f(R4,QS,""),state(QS,S),
ans(S,T),write("The time is ",T),nl.
ans(S,T):-frame(T,_,_,S,_,_).

Where:

f(X,W,Z):-fronttoken(X,W,Z) that puts the first word of X in W and the rest in Z.

The processing of such questions involves the syntactic and semantic analysis of the questions. The semantic analysis is grounded on the input formal representations of the images depicting an action sequence.
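
The behaviour of the rule q1 can be mirrored in a short Python sketch, given here only as an illustration and not as the authors' implementation. It assumes that the fourth argument of “frame” holds the state of the door, as the rule ans(S,T) suggests, and it uses the first six frames of the first example as data. The function name and the handling of the question mark are hypothetical.

```python
# From the fact state(opened,open): question word -> state recorded in frames.
STATE = {"opened": "open"}

# First six frames of the first example:
# (time, agent position, agent state, door, book location, book state).
FRAMES = [
    (1, 1, "sitting", "closed", "bookcase", "closed"),
    (2, 1, "standing", "closed", "bookcase", "closed"),
    (3, 2, "walking", "closed", "bookcase", "closed"),
    (4, 3, "walking", "closed", "bookcase", "closed"),
    (5, 3, "standing", "open", "bookcase", "closed"),
    (6, 3, "walking", "open", "bookcase", "closed"),
]


def answer_when_door(question, frames):
    """Answer questions of the form 'when is the door <state word>?'."""
    words = question.lower().rstrip("?").split()
    # Syntactic analysis: the question must match the fixed word pattern.
    if len(words) != 5 or words[:4] != ["when", "is", "the", "door"]:
        return None
    # Semantic analysis: map the last word to a state stored in the frames.
    state = STATE.get(words[4])
    if state is None:
        return None
    # Grounding: return the time of the first frame showing that state.
    for time, _pos, _agent_state, door, _book_loc, _book_state in frames:
        if door == state:
            return f"The time is {time}"
    return None


print(answer_when_door("when is the door opened?", FRAMES))
```

For these frames the answer is “The time is 5”, the first frame in which the door is recorded as open.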

8 DESCRIPTION OF VISUALIZATIONS OF BRAIN FUNCTIONS
Using modern technology some cognitive functions of the human brain can now be visualized and observed in real time. One example is the observation of the reading process of a human using a MEG (Magnetoencephalogram) [12]. The MEG is obtained by a system that collects magnetic signals from over 200 points on the skull of a human; these signals are processed by computer to give values for the electrical excitation at different areas inside the brain. These deduced excitations are supposed to correspond to activations of the corresponding points of the brain during the performance of a cognitive function. A strong advantage of a MEG system is its time resolution, which is about 4 msecs and provides the capability of detailed observations.
Some brain MEG data from an experiment during which a human reads and says a word are given in Table 1. The data were provided by Prof. Andreas Papanikolaou, University of Texas. These data result when a human reads aloud a word projected to him while being monitored by a MEG system. This human is supposed to perform the cognitive functions of first reading the word silently, then processing it for recognition and finally saying it aloud.

N   TIME     X      Y      Z     BRAIN AREA

 1  256.71  -3.23  -3.80   6.92  VISUAL
 2  260.64  -2.90  -3.34   6.71  VISUAL
 3  264.58  -2.60  -3.05   6.51  VISUAL
 4  268.51  -2.31  -2.84   6.33  VISUAL
 5  272.44  -2.03  -2.63   6.18  VISUAL
 6  276.37  -1.78  -2.45   6.04  VISUAL
 7  280.31  -1.53  -2.27   5.86  VISUAL
 8  284.24  -1.28  -2.08   5.61  VISUAL
 9  288.17  -0.85  -1.76   5.08  VISUAL
10  343.22  -3.23  -2.67   3.13  VISUAL
11  347.15  -3.26  -2.74   3.25  VISUAL
12  351.09  -3.64  -3.04   3.18  VISUAL
13  355.02  -3.80  -3.24   3.09  VISUAL
14  358.95  -3.80  -3.35   2.97  VISUAL
15  370.75   1.51   5.78   6.02  SPEECH
16  374.68   1.60   5.62   5.82  SPEECH
17  378.61   1.75   5.64   5.67  SPEECH
18  382.54   1.83   5.70   5.53  SPEECH
19  386.48   1.91   5.75   5.40  SPEECH
20  390.41   2.03   5.85   5.31  SPEECH
21  394.34   2.18   5.94   5.23  SPEECH
22  465.12  -1.35   3.74   7.89  MOTION
23  469.05  -1.46   3.35   7.35  MOTION
24  472.98  -1.50   2.83   6.84  MOTION
25  555.56  -0.59   1.57   7.65  MOTION
26  559.49  -0.69   1.93   7.60  MOTION
27  563.42  -0.65   1.89   7.19  MOTION
28  567.36  -0.68   1.85   6.94  MOTION

TABLE 1: The MEG data from one experiment.

Every cognitive action consists of a number of point activations N. The activations of the different brain areas during the above cognitive actions are supposed to be as follows:

The Visual (V) area activated for silent reading of the word.
The Speech (S) area activated for word processing.
The Motion (M) area activated for saying the word.

We may use motion verbs to describe the dynamics of the real time visualization of such cognitive phenomena, considering the MEG point activations as elementary events. Using a logical representation of these events, high level descriptions of the observed cognitive actions can be generated in natural language using the system presented in the present paper. Such a description will require the use of an anatomical database that provides the correspondence between the numerical coordinates of the points of activation and the medical names of the anatomical regions of the brain in which these points lie. Example descriptions are: “Activation of the speech area follows activation of the vision area” and “Activation of the motion area follows activation of the speech area”.
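
A minimal Python sketch of how such descriptions could be produced from elementary activation events is given below. It is an illustration under stated assumptions, not part of the system described: the brain area of each activation is taken directly from the labels of Table 1 instead of being looked up in an anatomical database, only the first and last activation of each area are included, and the function and variable names are hypothetical.

```python
# (N, TIME, BRAIN AREA) for the first and last activation of each area in
# Table 1; the coordinates are omitted because the area label is used directly.
ACTIVATIONS = [
    (1, 256.71, "VISUAL"), (14, 358.95, "VISUAL"),
    (15, 370.75, "SPEECH"), (21, 394.34, "SPEECH"),
    (22, 465.12, "MOTION"), (28, 567.36, "MOTION"),
]

# Wording used for each area in the generated sentences.
AREA_NAME = {"VISUAL": "vision", "SPEECH": "speech", "MOTION": "motion"}


def describe_order(activations):
    """Order the areas by their earliest activation and describe each
    consecutive pair with the verb 'follows'."""
    onset = {}
    for _n, time, area in activations:
        onset[area] = min(time, onset.get(area, time))
    ordered = sorted(onset, key=onset.get)
    return [
        f"Activation of the {AREA_NAME[later]} area follows "
        f"activation of the {AREA_NAME[earlier]} area"
        for earlier, later in zip(ordered, ordered[1:])
    ]


for sentence in describe_order(ACTIVATIONS):
    print(sentence)
```

For the onset times of Table 1 this yields the two example descriptions quoted above.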

9 CONCLUSION
In the present work we use the analysis of the definitions of Motion Verbs to understand and describe action sequences such as those included in a video.
The feasibility of useful performance of our system was evaluated using two examples of the processing of action sequences, explaining how the output descriptive sentence is generated by the system, and an example of work in progress on the description of brain MEG imaging sequences.
A future application of this work could be the automatic generation of texts describing moving images obtained by artificial vision systems. Such texts may be helpful for people with vision disabilities.
Finally, it was shown how the system could be augmented in the direction of multimodal question answering.

ACKNOWLEDGMENT
We thank Prof. Andreas Papanikolaou, University of Texas for the provision of the MEG data.


REFERENCES
[1] J. Kontos, I. Malagardi and M. Pegou, “Processing of Verb Definitions from Dictionaries” 3rd International Conference of Greek Linguistics pp. 954-961, 1997. Athens (in Greek).
[2] I. Malagardi, “Grouping of Modern Greek Verbs related to Motion using their Definitions” Journal of Glossologia, Athens Greece. 11-12 2000. pp. 282-294 (in Greek).
[3] J. Pustejovsky, C. Havasi, R. Saur, P. Hanks, and A. Rumshisky, “Towards a generative lexical resource: The Brandeis Semantic Ontology” Submitted to LREC 2006, Genoa.
[4] P. Panagi, S. Dasiopoulou, G.Th. Papadopoulos, I. Kompatsiaris and M.G. Strintzis, “A Genetic Algorithm Approach to Ontology-Driven Semantic Image Analysis” 3rd IEE International Conference of Visual Information Engineering (VIE), K-Space Research on Semantic Multimedia Analysis for Annotation and Retrieval special session, 2006. Bangalore, India.
[5] M. Cho, C. Choi and P. Kim, “Measuring Similarity between Trajectories using Motion Verbs in Semantic Level”, ICACT2007, pp. 511-515, 2007, Korea.
[6] H.-H. Nagel, “Steps toward a Cognitive Vision System” AI Magazine 25(2), pp. 31-50, 2004.
[7] H. Dahlkamp, H.-H. Nagel, A. Ottlik, P. Reuter, “A Framework for Model-Based Tracking Experiments in Image Sequences” International Journal of Computer Vision. 73(2), pp. 139-157, 2007.
[8] A. Hakeem, M. Shah, “Learning, detection and representation of multi-agent events in videos” Artificial Intelligence, 2007 Elsevier, (in press).
[9] C.J. Fillmore, “The case for CASE”, in : E. Bach, R. Harms (Eds), Universals in Linguistic Theory, Holt, Rinehart and Winston, New York, pp. 1-88, 1968.
[10] M. Fleischman, P. Decamp and D. Roy, “Mining temporal patterns of movement for video content classification”, International Multimedia Conference. Proceedings of the 8th ACM international workshop on Multimedia information retrieval. Poster Session. 2006. Santa Barbara, California USA.
[11] J. Kontos, I. Malagardi and D. Trikkalidis, “Natural Language Interface to an Agent”. EURISCON ’98 Third European Robotics, Intelligent Systems & Control Conference Athens. Published in Conference Proceedings “Advances in Intelligent Systems: Concepts, Tools and Applications” (Kluwer) pp.211-218, 1998.
[12] R. Salmelin, “Clinical Neurophysiology of Language: The MEG Approach” (Invited Review), Clinical Neurophysiology, ELSEVIER Ireland Ltd, 118, pp. 237-254, 2007.


Sunday, 29 November 2009

QUESTION ANSWERING FROM PROCEDURAL SEMANTICS TO MODEL DISCOVERY


Prof. John Kontos and Dr. Ioanna Malagardi
Encyclopedia of Human Computer Interaction, Edited by Dr. Claude Ghaoui. Idea Group Inc., Hershey, USA (2005)

(abridged)
ABSTRACT
A series of systems that can answer questions from various data or knowledge sources are briefly described and future developments are proposed. The line of development of ideas starts with procedural semantics and leads to human-computer interfaces that support researchers in the study of causal models of systems. The early implementations of question answering systems were based on procedural semantics. Deductive systems appeared later that produce answers implicit in the original database used for answering. Some systems were then developed that used collections of texts instead of databases for extracting the answers. Finally it is described how the information extracted from scientific and technical texts is used by modern systems for answering questions concerning the behaviour of causal models using appropriate linguistic and deduction mechanisms. It is predicted that the perfection of such systems will revolutionize the discovery practice of scientists and engineers.

HISTORICAL INTRODUCTION
Question Answering (QA) is one of the branches of Artificial Intelligence (AI) that involves the processing of human language by computer. QA systems accept questions in natural language and generate answers, often also in natural language. The answers are derived from databases, text collections or knowledge bases. The main aim of QA systems is to generate a short answer to a question rather than a list of possibly relevant documents. As it becomes more and more difficult to find answers on the WWW using standard search engines, the technology of QA systems will become increasingly important. A series of systems that can answer questions from various data or knowledge sources are briefly described below. These systems provide a friendly interface to information systems, which is particularly important for users who are not computer experts. The line of development of ideas starts with procedural semantics and leads to interfaces that support researchers in the discovery of parameter values of causal models of systems under scientific study. QA systems were first developed roughly during the decade 1960-1970 (Simmons, 1970). A few of the QA systems that were implemented during this decade are:

The BASEBALL System (Green et al, 1961)
The FACT RETRIEVAL System (Cooper, 1964)
The DELFI Systems (Kontos and Papakonstantinou, 1970; Kontos and Kossidas, 1971)

The BASEBALL System
This system was implemented in the Lincoln Laboratory and it was the first QA system reported in the literature according to the references cited in the first book with a collection of AI papers (Feigenbaum and Feldman, 1963). The inputs were questions in English about games played by baseball teams. The system transformed the sentences to a form that permitted search of a systematically organized memory store for the answers. Both the data and the dictionary were list structures and questions were limited to a single clause.

The FACT RETRIEVAL System
The system was implemented using the COMIT compiler-interpreter system as programming language. A translation algorithm was incorporated into the input routines. This algorithm generates the translation of all information sentences and all question sentences into their logical equivalents.

The DELFI System
The DELFI system answers natural language questions about the space relations between a set of objects. The questions may contain relative clauses nested to unlimited depth. They were automatically translated into retrieval procedures consisting of general purpose procedural components, which retrieved information from a database containing data about the properties of the objects and their space relations.

The DELFI II System
The DELFI II system (Kontos and Kossidas, 1971) was an implementation of the second edition of the DELFI system, augmented by deductive capabilities. In this system the procedural semantics of the questions were expressed using macro-instructions, which were submitted to a macro-processor that expanded them into full programs using a set of macro-definitions. Every macro-instruction corresponded to a procedural semantic component. In this way a program was generated that corresponded to the question and could be compiled and executed in order to generate the answer. DELFI II was used in two new applications. These applications concerned the processing of the database of the personnel of an organization and the answering of questions by deduction from a database with airline flight schedules using the rules:

If flight F1 flies to city C1 and flight F2 departs from city C1 then F2 follows F1.
If flight F1 follows flight F2 and the time of departure of F1 is at least two hours later than the time of arrival of F2 then F1 connects with F2.
If flight F1 connects with flight F2 and F2 departs from city C1 and F1 flies to city C2 then C2 is reachable from C1.

Given a database that contains the data:

F1 departs from Athens at 9 and arrives at Rome at 11
F2 departs from Rome at 14 and arrives at Paris at 15
F3 departs from Rome at 10 and arrives at London at 12

If the question “Is Paris reachable from Athens?” is submitted to the system then the answer it gives is “yes”, because F2 follows F1 and the time of departure of F2 is three hours later than the time of arrival of F1. It should be noted also that F1 departs from Athens and F2 flies to Paris.

If the question “Is London reachable from Athens?” is submitted to the system then the answer it gives is “no”, because F3 follows F1 but the time of departure of F3 is one hour earlier than the time of arrival of F1. It should be noted here that F1 departs from Athens and F3 flies to London.
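
The three rules and the two questions above can be reproduced with a minimal Python sketch. It is an illustration only and does not reflect the macro-processor implementation of DELFI II; the function names and the tuple layout of the flight data are assumptions. As in the rules, only journeys made of exactly two connecting flights are considered.

```python
# Flight data from the example: flight -> (origin, departure, destination, arrival).
FLIGHTS = {
    "F1": ("Athens", 9, "Rome", 11),
    "F2": ("Rome", 14, "Paris", 15),
    "F3": ("Rome", 10, "London", 12),
}


def follows(later, earlier):
    """Rule 1: `later` follows `earlier` if it departs from the city
    that `earlier` flies to."""
    return FLIGHTS[later][0] == FLIGHTS[earlier][2]


def connects(later, earlier):
    """Rule 2: `later` connects with `earlier` if it follows it and departs
    at least two hours after `earlier` arrives."""
    return (follows(later, earlier)
            and FLIGHTS[later][1] - FLIGHTS[earlier][3] >= 2)


def reachable(destination, origin):
    """Rule 3: `destination` is reachable from `origin` if some flight into
    `destination` connects with some flight out of `origin`."""
    return any(
        connects(later, earlier)
        and FLIGHTS[earlier][0] == origin
        and FLIGHTS[later][2] == destination
        for later in FLIGHTS
        for earlier in FLIGHTS
    )


print("yes" if reachable("Paris", "Athens") else "no")
print("yes" if reachable("London", "Athens") else "no")
```

The first question succeeds because F2 leaves Rome three hours after F1 arrives; the second fails because F3 leaves Rome one hour before F1 arrives, so the connection rule does not hold.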

BACKGROUND

The SQL QA Systems
In order to facilitate the commercial application of the results of research work like that described so far, it was necessary to adapt the methods used to the industrial database environment. One important adaptation was the implementation of the procedural semantics interpretation of natural language questions using a commercially available database retrieval language. The SQL QA systems implemented by different groups, including the authors', followed this direction by using SQL (Structured Query Language) so that the questions can be answered from any commercial database system.
The domain of an illustrative application of our SQL QA system involves information about different countries. The representation of the knowledge of the domain of application connected a verb phrase like “exports” or “has capital” to the corresponding table of the database that the verb is related to. This connection between the verbs and the tables provided the facility of the system to locate the table a question refers to using the verbs of the question. During the analysis of questions by the system an ontology related to the domain of application may be used for the correct translation of ambiguous questions to appropriate SQL queries. Some theoretical analysis of SQL QA systems has appeared recently (Popescu et al, 2003) and a recent system with a relational database is described in Samsonova et al [2003].
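
The connection between verb phrases and database tables can be sketched as follows. This is a minimal Python illustration using the standard sqlite3 module, not the SQL QA system itself: the table and column names, the sample rows, the single supported question pattern (“Which country <verb phrase> <value>?”) and all function names are assumptions introduced for the example.

```python
import sqlite3

# Verb phrase -> (table, column holding the object of the verb).
# Table and column names are hypothetical.
VERB_TABLE = {
    "has capital": ("capitals", "city"),
    "exports": ("exports", "product"),
}


def build_db():
    """Create a small in-memory database with illustrative rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE capitals (country TEXT, city TEXT)")
    db.execute("CREATE TABLE exports (country TEXT, product TEXT)")
    db.executemany("INSERT INTO capitals VALUES (?, ?)",
                   [("Greece", "Athens"), ("Italy", "Rome")])
    db.executemany("INSERT INTO exports VALUES (?, ?)",
                   [("Greece", "olive oil"), ("Italy", "wine")])
    return db


def to_sql(question):
    """Translate 'Which country <verb phrase> <value>?' into an SQL query.

    The verb phrase selects the table; the rest of the question becomes
    the query parameter. Returns (sql, params) or None.
    """
    text = question.strip().rstrip("?")
    prefix = "which country "
    if not text.lower().startswith(prefix):
        return None
    rest = text[len(prefix):]
    for verb, (table, column) in VERB_TABLE.items():
        if rest.lower().startswith(verb + " "):
            value = rest[len(verb) + 1:]
            # Table and column come from the fixed mapping above, never
            # from user input; the value is passed as a bound parameter.
            return f"SELECT country FROM {table} WHERE {column} = ?", (value,)
    return None


def answer(db, question):
    """Run the translated query and return the matching countries."""
    translated = to_sql(question)
    if translated is None:
        return None
    sql, params = translated
    return [row[0] for row in db.execute(sql, params)]


db = build_db()
print(answer(db, "Which country has capital Rome?"))
print(answer(db, "Which country exports olive oil?"))
```

A fuller system would also need the ontology mentioned above to resolve ambiguous questions; the sketch only shows how the verb of the question locates the table.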

QA from Texts Systems
Some QA systems use collections of texts instead of databases for extracting answers. Most such systems are able to answer simple “factoid” questions only. Factoid questions seek an entity involved in a single fact. Some recent publications on QA from texts are Diekema [2003], Doan-Nguyen and Kosseim [2004], Harabagiu et al [2003], Kosseim et al [2003], Nyberg et al [2002], Plamondon and Kosseim [2002], Ramakrishnan [2004], Roussinof and Robles-Flores [2004] and Waldinger et al [2003]. Some future directions of QA from texts are proposed in Maybury [2003]. An international competition between systems for question answering from texts has been organized by NIST (National Institute of Standards and Technology) (Voorhees, 2001).
In what follows it is described how the information extracted from scientific and technical texts may be used by future systems for the answering of complex questions concerning the behaviour of causal models using appropriate linguistic and deduction mechanisms. An important function of such systems is the automatic generation of a justification or explanation of the answer provided.

The ARISTA System
The ARISTA system is a QA system that answers questions by acquiring knowledge from natural language texts. It was first presented in Kontos [1992]. The system is based on the representation-independent method, also called ARISTA, which finds the appropriate causal sentences in a text and chains them in order to discover causal chains.
This method achieves causal knowledge extraction through deductive reasoning performed in response to a user's question. This method is an alternative to the traditional method of translating texts into a formal representation before using their content for deductive question answering from texts. The main advantage of the ARISTA method is that since texts are not translated into any representation formalism retranslation is avoided whenever new linguistic or extra linguistic prerequisite knowledge has to be used for improving the text processing required for question answering.
An example text that is an extract from a medical physiology book in the domain of pneumonology and in particular of lung mechanics enhanced by a few general knowledge sentences was used as a first illustrative example of primitive knowledge discovery from texts (Kontos, 1992). The ARISTA system was able to answer questions from that text that require the chaining of causal knowledge acquired from the text and produced answers that were not explicitly stated in the input texts.

The use of Information Extraction
A system using information extraction from texts for QA was presented in Kontos and Malagardi [1999]. Its ultimate aim was the creation of flexible information extraction tools capable of accepting natural language questions and generating answers containing information either extracted directly from the text or obtained by applying deductive inference. The domains examined were oceanography, medical physiology and ancient Greek law (Kontos and Malagardi, 1999). The system consisted of two main subsystems. The first subsystem extracted knowledge from individual sentences in a way similar to traditional information extraction from texts (Cowie and Lehnert, 1996; Grishman, 1997), while the second subsystem was based on a reasoning process that combined the knowledge extracted by the first subsystem to answer questions without the use of a template representation.

QUESTION ANSWERING FOR MODEL DISCOVERY

The AROMA System
A modern development in the area of QA that points to the future is our implementation of the AROMA (ARISTA Oriented Model Adaptation) system. This is a model-based QA system that may support researchers in the discovery of parameter values of procedural models of systems by answering “What if” questions (Kontos et al, 2002). “What if” questions are considered here to involve the computation of data describing the behaviour of a simulated model of a system.
The knowledge discovery process relies on the search for causal chains that in turn relies on the search for sentences containing appropriate natural language phrases. In order to speed up the whole knowledge acquisition process the search algorithm described in Kontos and Malagardi [2001] was used for finding the appropriate sentences for chaining. The increase in speed results because the repeated sentence search is made a function of the number of words in the connecting phrases. This number is usually smaller than the number of sentences of the text that may be arbitrarily large.
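One way to obtain this kind of behaviour is sketched below in Python, under the assumption that an inverted index from words to sentence numbers is built once, so that each later search costs one lookup per word of the connecting phrase. The sketch is illustrative only: the algorithm of Kontos and Malagardi [2001] may differ in its details, and the function names are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,;:"


def build_index(sentences):
    """Map each lowercased word to the set of sentence numbers containing it."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            index[word.strip(PUNCTUATION)].add(number)
    return index


def find_sentences(index, phrase):
    """Return the sorted numbers of the sentences containing every phrase word.

    The work done here depends on the number of words in the phrase,
    not on the number of sentences scanned. Word order is ignored,
    which is a simplification.
    """
    words = [w.strip(PUNCTUATION) for w in phrase.lower().split()]
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

The index is built in one pass over the text, after which repeated searches for different connecting phrases reuse it.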

The Knowledge Extraction Subsystem
This subsystem integrates partial causal knowledge extracted from a number of different texts. This knowledge is expressed in natural language using causal verbs such as “regulate”, “enhance” and “inhibit”. These verbs usually take as arguments entities such as entity names and process names that occur in the texts that we use for the applications. In this way causal relations are expressed between the entities, processes or entity-process pairs.
The input texts are submitted first to a preprocessing module of the subsystem that converts automatically each sentence into a form that shows word data with numerical information concerning the identification of the sentence that contains the word and its position in that sentence. This conversion has nothing to do with logical representation of the content of the sentences. It should be emphasized that we do not deviate from our ARISTA method with this conversion. We simply annotate each word with information concerning its position within the text. This form of sentences is then parsed and partial texts with causal knowledge are generated.
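The word annotation step can be pictured with a short Python sketch. It assumes, for illustration only, that sentences end with a period and that words are separated by white space; the function name `annotate` and the triple layout are hypothetical.

```python
def annotate(text):
    """Return (word, sentence_number, position) triples, both numbered from 1.

    Sentence splitting on periods is a deliberate simplification.
    """
    annotated = []
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for sentence_number, sentence in enumerate(sentences, start=1):
        for position, word in enumerate(sentence.split(), start=1):
            annotated.append((word, sentence_number, position))
    return annotated
```

The words themselves are left untouched, so the annotated form carries only positional information and no logical representation of the sentence content.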

The Causal Reasoning Subsystem
The output of the first subsystem is used as input to the second subsystem, which combines causal knowledge in natural language form to produce, by deduction, answers and model data not mentioned explicitly in the input text. The operation of this subsystem is based on the ARISTA method. The sentence fragments containing causal knowledge are parsed and the entity-process pairs are recognized. The user questions are processed and reasoning goals are extracted from them. The answers to the user questions that are generated automatically by the reasoning process contain explanations in natural language form. All this is accomplished by the chaining of causal statements using prerequisite knowledge such as an ontology to support the reasoning process.

The Simulation Subsystem
The third subsystem is used for modelling the dynamics of a system specified on the basis of the texts processed by the first and second subsystem. The data of the model such as structure and parameter values are extracted from the input texts combined with prerequisite knowledge such as ontology and default process and entity knowledge. The solution of the equations describing the system is accomplished with a program that provides an interface with which the user may test the simulation outputs and manipulate the structure and the parameters of the model.

FUTURE TRENDS
The architecture of the AROMA system points to future trends in the field of QA by serving, among other things, the processing of “What if” questions. These are questions about what will happen to a system under certain conditions. Implementing systems for answering “What if” questions will be an important research goal in the future (Maybury, 2003).
Another future trend is the development of systems that may conduct an explanatory dialog with their human user by answering “Why” questions using the simulated behaviour of system models. A “Why question” seeks the reason for the occurrence of certain system behavior.
The work on model discovery QA systems paves the way towards important developments and justifies effort leading to the development of tools and resources aiming at the solution of the problems of model discovery based on larger and more complex texts. These texts may report experimental data that may be used to support the discovery and adaptation of models with computer systems.

CONCLUSION
A series of systems that can answer questions from various data or knowledge sources were briefly described. These systems provide a friendly interface to the user of information systems that is particularly important for users that are not computer experts. The line of development of systems starts with procedural semantics systems and leads to interfaces that support researchers for the discovery of model parameter values of simulated systems. If these efforts for more sophisticated human-computer interfaces succeed then a revolution may take place in the way research and development is conducted in many scientific fields. This revolution will make computer systems even more useful for research and development.

REFERENCES
Cooper, W. S. (1964). Fact Retrieval and Deductive Question–Answering Information Retrieval Systems. Journal of the ACM, Vol. 11, No. 2, pp. 117-137.
Cowie J., and Lehnert, W., (1996). Information Extraction. Communications of the ACM. Vol. 39, No. 1, pp. 80-91.
Diekema, A., R. (2003). What do You Mean? Finding Answers to Complex Questions. New Directions on Question Answering. Papers from 2003 AAAI Spring Symposium. The AAAI Press. USA, pp. 87-93.
Doan-Nguyen, H. and Kosseim, L. (2004). Improving the Precision of a Closed-Domain Question Answering System with Semantic Information. Proceedings of Recherche d’Information Assistée par Ordinateur (RIAO-2004). pp. 850-859. Avignon, France. April 2004.
Feigenbaum, E., A. and Feldman, J. (1963). Computers and Thought. McGraw Hill. New York.
Green B. F. et al. (1961). BASEBALL: An Automatic Question Answerer. Proceedings of the Western Joint Computer Conference 19, pp. 219-224.
Grishman R., (1997). Information Extraction: Techniques and Challenges. In Pazienza, M. T. Information Extraction. LNAI Tutorial. Springer, pp. 10-27.
Harabagiu, S., M., Maiorano, S., J. and Pasca, M., A. (2003). Open-domain textual question answering techniques. Natural Language Engineering, Vol. 9 (3), pp. 231-267.
Kontos J. and Papakonstantinou G., (1970). A question-answering system using program generation. In Proceedings of the ACM International Computing Symposium, Bonn, Germany, pp. 737-750.
Kontos J. and Kossidas A., (1971). On the Question-Answering System DELFI and its Application. Proceedings of the Symposium on Artificial Intelligence, Rome, Italy, pp. 31-36.
Kontos J. (1992) ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, Vol. 34, No 9, pp. 611-616.
Kontos J. and Malagardi I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, Vol 26, No. 2, pp. 103-122, October.
Kontos J. and Malagardi I. (2001) A Search Algorithm for Knowledge Acquisition from Texts. HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens, Greece, pp. 226-230.
Kontos J., Elmaoglou A. and Malagardi I. (2002). ARISTA Causal Knowledge Discovery from Texts. Proceedings of the 5th International Conference on Discovery Science DS 2002, Luebeck, Germany, pp. 348-355.
Kontos, J., Malagardi, I., Peros, J. (2003). The AROMA System for Intelligent Text Mining. HERMIS International Journal of Computer Mathematics and its Applications. Vol. 4. pp. 163-173. LEA.
Kontos, J. (2004). Artificial Intelligence. Chapter in Book “Cognitive Science: The New Science of the Mind”. Gutenberg. Athens 2004. pp. 43-153 (in Greek).
Kosseim, L., Plamondon, L. and Guillemette, L., J. (2003). Answer Formulation for Question-Answering. Proceedings of the Sixteenth Conference of the Canadian Society for Computational Studies of Intelligence. (AI’2003). Lecture Notes in Artificial Intelligence no 2671, pp. 24-34. Springer Verlag. June 2003. Halifax Canada.
Maybury, M., T. (2003). Toward a Question answering Roadmap. New Directions on Question Answering. Papers from 2003 AAAI Spring Symposium. The AAAI Press. USA, pp. 8-11.
Nyberg, E. et al. (2002). The JAVELIN Question-Answering System at TREC 2002. NIST Special Publication: 11th Text Retrieval Conference (TREC), 2002.
Plamondon, L. and Kosseim, L. (2002). QUANTUM: A Function-Based Question Answering System. Proceedings of The Fifteenth Canadian Conference on Artificial Intelligence (AI’2002). Lecture Notes in Artificial Intelligence no. 2338, pp. 281-292. Springer Verlag. Berlin. May 2002. Calgary, Canada.
Popescu, A., Etzioni, O. and Kautz, H. (2003). Towards a Theory of Natural Language Interfaces to Databases. Proceedings of IUI’03, Miami, Florida, USA, pp. 149-157.
Ramakrishnan, G. et al. (2004). Is Question Answering an Acquired Skill? WWW2004. May 2004, New York, USA.
Roussinof, D., Robles-Flores, J., A. (2004). Web Question Answering: Technology and Business Applications. Proceedings of the Tenth Americas Conference on Information Systems. New York. August 2004.
Samsonova, M., Pisarev, A. and Blagov M. (2003). Processing of natural language queries to a relational database. Bioinformatics. Vol. 19 Suppl. 1. pp. i241-i249
Simmons R., F. (1970). Natural Language Question-Answering Systems: 1969. Computational Linguistics, Vol. 13, No 1, January, pp. 15-30.
Voorhees E., M. (2001). The TREC Question Answering Track. Natural Language Engineering, Vol. 7, Issue 4, pp. 361-378.
Waldinger, R. et al. (2003). Deductive Question Answering from Multiple Resources. New Directions in Question Answering, AAAI 2003.

Terms and Definitions

Ontology: A structure that represents taxonomic or meronomic relations between entities.
Question Answering System: A computer system that can answer a question posed to it by a human being using pre-stored information from a database, a text collection or a knowledge base.
Procedural Semantics: A method for the translation of a question by a computer program into a sequence of actions that retrieve or combine parts of information necessary for answering the question.
Causal Relation: A relation between the members of an entity-process pair where the first member is the cause of the second member that is the effect of the first member.
Causal Chain: A sequence of instances of causal relations such that the effect of each instance but the last one is the cause of the next one in sequence.
Model: A set of causal relations that specify the dynamic behavior of a system.
Explanation: A sequence of statements of the reasons for the behavior of the model of a system.
Model Discovery: The discovery of a set of causal relations that predict the behavior of a system.
“What if” question: A question about what will happen to a system under given conditions or inputs.
“Why” question: A question about the reason for the occurrence of certain system behavior.




Friday, 27 November 2009

Cosmetic Surgery






QUESTION ANSWERING AND Rhetoric ANALYSIS

QUESTION ANSWERING AND Rhetoric ANALYSIS of Biomedical Texts in the AROMA System


 JOHN KONTOS, IOANNA MALAGARDI and JOHN PEROS

7th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens.

Abstract--Question answering with intelligent knowledge management of biomedical texts and analysis of rhetoric relations in the AROMA system is presented. The development of the AROMA system aims at the creation of an intelligent tool for the support of the discovery and adaptation of biomedical models based on data extracted from natural language texts. The system operation includes three main functions, namely question answering, text mining and simulation. The question answering function generates model-based answers and their explanations. The operation of AROMA allows the exploitation of rhetoric relations between a “basic” text that proposes a model of a biomedical system and parts of the abstracts of papers that present experimental findings supporting the model. An important use of AROMA concerns the comparison of experimental data with the model proposed in the basic text. The AROMA system consists of three subsystems. The first subsystem extracts knowledge, including rhetoric relations, from biomedical texts. The second subsystem answers questions with causal knowledge extracted by the first subsystem and generates explanations using rhetoric relation knowledge in addition to other knowledge. The third subsystem simulates the time-dependent behavior of a model, from which textual descriptions of the waveforms are generated automatically.
Index Terms-- rhetoric relations, intelligent biomedical text mining, knowledge discovery from text, simulation, question answering, explanation, p53, mdm2.

I. INTRODUCTION

The new AROMA (Automatic Rhetoric Organizer for Model Analysis) system is presented in the present paper and an illustrative example of application is described. This system is an intelligent computer tool for question answering and text mining [5] of biomedical knowledge including the management of rhetoric knowledge. Parts of the older version of the system were presented in [6], [7], [8], [9], [10].
The development of the AROMA system aims at the creation of an intelligent tool for the support of the discovery and adaptation of biomedical models based on data extracted from natural language texts. The system operation includes two main functions namely question answering and text mining. The question answering function generates model based explanations of the answers. The expanded operation of AROMA allows the exploitation of rhetoric relations between a basic text that proposes a model of a biomedical system and parts of the text of papers that present experimental data supporting the model.
In a typical application of the system a theoretical text presenting the model of a system under scientific investigation is taken as the “basic” one, and a computerized method is used to analyze the relationship of the model to the texts of the papers that provide experimental support for the model or theory it proposes. Text mining is applied both to the basic paper and to the supporting papers in order to extract the knowledge fragments that have to be rhetorically related and computationally processed. An important aspect of the application of the system is the answering of natural language questions for the computer-aided comparison of new experimental data with the predictions of the model proposed in the “basic” paper. The friendly man-machine interface of the system is capable of answering such questions and generating explanations that help the user in tracing the support of the model by experimental and background domain knowledge.
It is envisaged that the AROMA System may prove useful for the support of the discovery activity of research scientists with the intelligent management of scientific knowledge mined from texts. Text mining differs from data mining [2] in that it uses unstructured texts for the collection of knowledge rather than structured sources like databases. We define intelligent text mining as the process of creating novel knowledge from texts that is not stated in them by combining sentences with deductive reasoning. An early implementation of this kind of intelligent text mining was reported by us in [4]. A review of different kinds of text mining including intelligent text mining is presented in [11].
There are two possible methodologies of applying deductive reasoning to texts. The first methodology is based on the translation of the texts into some formal representation of their “content” with which deduction is performed [12]. The advantage of this methodology depends on the simplicity and availability of the required inference engine but its serious disadvantage is the need for reprocessing all the texts and storing their translation into a formal representation every time something changes in their domain. In the case of scientific texts what may change is some part of the background knowledge such as the ontology used for deducing new knowledge. The second methodology eliminates the need for translation of the texts into a formal representation because an inference engine capable of performing deductions “on the fly”, i.e. directly from the original texts, is implemented.
A disadvantage of this second methodology is that a more complex inference engine than the one needed by the first one must be built. The implementation of such inference engines has however proved feasible for causal knowledge. The strong advantage of the second methodology is that the translation into a formal representation is avoided. In [4] we proposed the second methodology and the method we developed was therefore called ARISTA i.e. Automatic Representation Independent Syllogistic Text Analysis. The ARISTA method is used in the AROMA system for its basic reasoning functions.
An early attempt was also made by us in [4] to implement a model-based question answering system using the scientific text as a knowledge base describing a qualitative model. One of our early application examples concerned medical text mining related to human respiratory system mechanics [4]. Biomedical text mining is now recognized as a very important field of study [3] and [17] particularly for molecular biology. The system GeneWays [16] is a typical example of a biomedical text analysis system for the extraction of molecular pathway data. The idea of combining text mining with simulation and question answering was pursued further by our group as reported in [7], [8], [9] and [10] as well as in the present paper.
The general architecture of our AROMA system consists of three subsystems namely the Knowledge Extraction Subsystem, the Causal Reasoning Subsystem and the Simulation Subsystem. These subsystems are briefly described below.

II. THE GENERAL ARCHITECTURE OF THE AROMA SYSTEM

A. The Knowledge Extraction Subsystem

This subsystem integrates partial causal knowledge extracted from a number of different texts. This knowledge is expressed in natural language using causal verbs such as “regulate”, “enhance” and “inhibit”. These verbs usually take as arguments entities such as protein names and gene names that occur in the biomedical texts that we use for the present applications. In this way causal relations between entities and processes are expressed. A lexicon containing words such as causal verbs and stop words is used by this subsystem. An output file is produced by the subsystem that contains parts of sentences collected from the original sentences of different input texts. This output file is used for reasoning by the second subsystem. The input files used for this subsystem in the example of p53-mdm2 dynamics contain texts downloaded from MEDLINE. The operation of the subsystem is based on the recognition of noun phrases and verb groups and their relations.
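
As a rough illustration of how a lexicon of causal verbs and stop words could be applied to a sentence fragment, consider the following Python sketch. The lexicon contents, the sign assigned to each verb and the function name are assumptions made for this example only; the actual subsystem relies on the recognition of noun phrases and verb groups, which is not reproduced here.

```python
# Hypothetical lexicon: causal verb form -> sign of the causal influence.
# Treating "regulates" as a positive influence is an assumption of this sketch.
CAUSAL_VERBS = {
    "regulates": "+",
    "enhances": "+",
    "inhibits": "-",
    "inhibit": "-",
}
# Hypothetical stop words dropped from the cause and effect phrases.
STOP_WORDS = {"the", "a", "an", "can", "both"}


def extract_causal(fragment):
    """Return (cause, sign, effect) for the first causal verb found, else None."""
    words = fragment.rstrip(".").split()
    for i, word in enumerate(words):
        sign = CAUSAL_VERBS.get(word.lower())
        if sign is not None:
            cause = [w for w in words[:i] if w.lower() not in STOP_WORDS]
            effect = [w for w in words[i + 1:] if w.lower() not in STOP_WORDS]
            return " ".join(cause), sign, " ".join(effect)
    return None
```

Applied to a fragment such as “The mdm2 oncogene can inhibit p53_mediated transactivation”, the sketch separates the cause phrase from the effect phrase and labels the influence as negative.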

B. The Causal Reasoning Subsystem

The output of the first subsystem is used as input to the second subsystem, which combines background knowledge with causal knowledge in natural language form to produce, by automatic deduction, conclusions not mentioned explicitly in the input text. The operation of this subsystem is based on the ARISTA method [4] and results in the on-the-fly recognition of causal relations of the form “causes(process1, entity1, process2, entity2, manner)”.
The pair (process1, entity1) stands for the cause, the pair (process2, entity2) stands for the effect and “manner” stands for the kind of causality i.e. whether it is positive or negative. The sentence fragments containing causal knowledge are analyzed and the entity-process pairs are recognized. The user questions are analyzed and reasoning goals are extracted from them. The answers to the user questions are generated automatically by a reasoning process together with explanations in natural language form. This is accomplished by the chaining “on the fly” of causal statements using background knowledge such as an ontology to support the reasoning process.
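
The chaining of causal statements of this form can be sketched in Python as follows. This is a simplified illustration rather than the ARISTA inference engine: facts are assumed to be already extracted, links are matched on entity names only, no ontology is consulted, and the process names used in any example facts are hypothetical.

```python
from collections import namedtuple

# Mirrors causes(process1, entity1, process2, entity2, manner); manner is "+" or "-".
Causes = namedtuple("Causes", "process1 entity1 process2 entity2 manner")


def combine(sign_a, sign_b):
    """Overall sign of two chained influences: equal signs give '+', else '-'."""
    return "+" if sign_a == sign_b else "-"


def find_loops(facts, entity, max_depth=5):
    """Return (chain, overall_sign) pairs for chains that start and end at entity."""
    loops = []

    def extend(chain, sign):
        last = chain[-1]
        if last.entity2 == entity:
            loops.append((chain, sign))
            return
        if len(chain) >= max_depth:
            return
        for fact in facts:
            # Chain on the entity only; matching processes is left out here.
            if fact.entity1 == last.entity2 and fact not in chain:
                extend(chain + [fact], combine(sign, fact.manner))

    for fact in facts:
        if fact.entity1 == entity:
            extend([fact], fact.manner)
    return loops
```

A positive link followed by a negative link yields an overall negative sign, which is how a negative feedback loop would be reported by this sketch.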

C. The Simulation Subsystem

The third subsystem is used for modelling the dynamics of the biomedical system specified in the “basic” text. The characteristics of the model such as structure and parameter values are extracted from the input texts combined with background knowledge such as ontology and default process and entity knowledge [9]. Considering the p53-mdm2 example two coupled first order differential equations were used as the approximate mathematical model of the biomedical system in rough correspondence with the model proposed in [1]. A basic characteristic of the behaviour of such a system is the occurrence of oscillations for certain values of the parameters of the equations. Two finite difference equations that approximate the differential equations system of the model are:

Δx= a1*x + b1*y + c1*x*y (1)
Δy= a2*y + b2*delay(d,x) (2)

where Δx means the difference between the value of the variable x at the next time instant and its value at the present time, and delay(d,x) in equation (2) stands for the value that x had d units of time before the present time. Time is taken to advance in discrete steps. The symbols x and y are the variables that represent the concentrations of the proteins p53 and mdm2 respectively. The symbols a1, b1, c1, a2, b2 are the coefficients of the equations that represent the parameters of the biomedical system. Note that the multiplicative term c1*x*y renders equation (1) non-linear. This non-linearity causes the appearance of the oscillations to differ from simple sine waves. The equations are solved with a Prolog program that provides an interface for manipulating the parameters of the model. An important module of the simulation subsystem generates text describing the behaviour of the variables of the model over time. The above information concerning the connection of the parts of the differential equations with the parts of the biomedical system being modelled is formalized using the rhetorical relations explained below.
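
A minimal Python sketch of iterating equations (1) and (2) is given below for illustration; the system itself uses a Prolog program. The sketch assumes that Δx is the value at the next step minus the current value, that the time step is one unit, and that delay(d,x) returns the initial value of x for the first d steps. The function name and argument order are hypothetical, and no coefficient values are suggested because none are stated here.

```python
def simulate(a1, b1, c1, a2, b2, d, x0, y0, steps):
    """Iterate equations (1) and (2) and return the lists of x and y values."""
    xs, ys = [x0], [y0]
    for t in range(steps):
        x, y = xs[t], ys[t]
        # delay(d, x): value of x d steps earlier; before step d the
        # initial value x0 is used (an assumption of this sketch).
        x_delayed = xs[t - d] if t >= d else x0
        dx = a1 * x + b1 * y + c1 * x * y   # equation (1)
        dy = a2 * y + b2 * x_delayed        # equation (2)
        xs.append(x + dx)
        ys.append(y + dy)
    return xs, ys
```

The returned lists can be inspected or plotted to see whether a given choice of coefficients and delay produces oscillations.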

III. THE RHETORIC RELATIONS

Research on the rhetoric or discourse analysis of biomedical texts has started only recently [15]. In [15] an annotation scheme is proposed for a rhetorical analysis of biology articles using the “zoning” method. This method characterizes parts of a scientific text using an annotation scheme to identify these parts or “zones”.
In our system we apply the rhetoric relations approach [13, 14] in contrast with the zoning approach. This approach connects parts of sentences and other textual fragments using rhetoric relations.
About 50 rhetoric relations have been theoretically defined in [13] but only 4 were computationally defined in [14] namely “contrast”, “cause-explanation-evidence”, “elaboration” and “condition” that were defined at a much coarser level of granularity for practical reasons. It is however proposed to enrich this approach of analysis by using a few more rhetoric relations necessary for the representation of the content of scientific texts related to models of systems.
We distinguish between two kinds of purely textual rhetoric relations namely internal to the “basic” paper (symbolized as “inr”) and external to it (symbolized as “exr”). A third kind of relations (symbolized as mbr) is proposed that formalizes the appearance of the waveforms of the time behaviour of the model variables. These waveforms are treated as “numerical narratives” with a “rhetoric” structure of events equivalent to the “pattern structure” of the waveforms where peaks and valleys or maxima and minima play the role of events. All these kinds of relations are briefly described below and illustrated using examples from [1] where “coerel” and “varrel” are of the inr kind, “parbib” and “entfra” are of the exr kind and “behrel” and “tfollows” are of the mbr kind and are defined below.
The formal representation of the rhetoric relations used by our system consists of a single predicate rr(PAR_1,…,PAR_n) where the arguments PAR_1 to PAR_n stand for n parameters that define a rhetoric relation between two rhetorically related “objects”. These objects may be of different kinds such as sentences or other text fragments, equations, equation variables, physical quantities, citations, references or parts of waveforms representing the time behavior of model variables. The formal representation:

rr(relation_name,relation_kind,

first_object_kind,first_object_identifier,

second_object_kind,second_object_identifier).

is used below for the illustration of some rhetorical relations used by our system.

a. Internal Relations (inr)

1) The relation name “coerel” stands for relations between a coefficient “c” of a mathematical model and the corresponding entity property or parameter “ep” of the biomedical system modeled.

Or formally: rr(coerel,inr,coefficient,c,parameter,ep).

2) The relation name “varrel” stands for relations between a variable “x” of an equation of the model and a physical entity “e” or an entity property “ep” of the biomedical system modeled.

Or formally: rr(varrel,inr,variable,x,entity,e) and rr(varrel,inr,variable,x,entityp,ep).

b. External Relations (exr)

3) The relation name “parbib” stands for relations between a parameter “p” of the biomedical system and a bibliographic reference “r” to a paper that contains experimental data that support the inclusion and possibly the numerical value of this parameter.

Or formally: rr(parbib,exr,parameter,p,reference,r).

4) The relation name “entfra” stands for relations between an entity “e” of a biomedical system and a text fragment labeled “f” of a reference text labeled “r” and is symbolized by “r_f” that presents the experimental data that support the inclusion and possibly the numerical values of the properties of the entity “e”.

Or formally: rr(entfra,exr,ent,e,fragment,r_f).

c. Model Behavior Relations (mbr)

5) The relation name “behrel” stands for relations between some coefficient “c” of the model and the time behavior of a variable “v” of the model.

Or formally: rr(behrel,mbr,coefficient,c,variable,v).

6) The relation name “tfollows” stands for time ordering relations between a part in position “wp1” of a waveform representing the time variation of a variable v1 of the model such as a peak or a valley symbolized as wp1_v1 and a part in position “wp2” of a waveform representing the time variation of variable v2 and symbolized as wp2_v2. The kind of object “waveform part” is symbolized as “wpart”.

Or formally: rr(tfollows,mbr,wpart,wp1_v1,wpart,wp2_v2).
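
To make the model behavior relations more concrete, the following Python sketch detects the peaks of two sampled waveforms and emits “tfollows” relations as tuples shaped like the rr predicate. It rests on several assumptions that are not stated in the text: a waveform part is identified by the sample index of a peak, the first object of the relation is taken to follow the second one in time, and each peak is related only to the most recent earlier peak of the other variable. The function names are hypothetical.

```python
def find_peaks(values):
    """Return the indices of strict local maxima of a sampled waveform."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


def tfollows_relations(name1, wave1, name2, wave2):
    """Relate each peak of wave1 to the latest earlier peak of wave2.

    Each result is a tuple shaped like
    rr(tfollows, mbr, wpart, wp1_v1, wpart, wp2_v2).
    """
    relations = []
    peaks2 = find_peaks(wave2)
    for p1 in find_peaks(wave1):
        earlier = [p2 for p2 in peaks2 if p2 < p1]
        if earlier:
            p2 = earlier[-1]
            relations.append(("tfollows", "mbr",
                              "wpart", f"{p1}_{name1}",
                              "wpart", f"{p2}_{name2}"))
    return relations
```

Valleys could be handled in the same way by searching for local minima, and the resulting tuples could then serve as the events of the “numerical narrative” of a waveform.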

IV. SOME RHETORIC RELATIONS IN THE P53-MDM2 EXAMPLE

In the p53-mdm2 example the rhetoric relations extracted concern the text [1], which proposes a mathematical model for the interaction of the proteins p53 and mdm2, as well as the MEDLINE abstracts of papers related to that model, whether or not they are cited as references in [1]. These abstracts were downloaded from MEDLINE and contain knowledge of experimental results concerning the interaction of the proteins p53 and mdm2. These proteins are involved in the life cycle of the cell and interact through a negative feedback system. Some rhetoric relations found in the text [1] are:

r1:rr(coerel,inr,coefficient, sourcep53, parameter,synthesis_rate_of_the_p53_protein)
extracted from the sentence: “Here the coefficient sourcep53 specifies the synthesis rate of the p53 protein.”
r2:rr(coerel,inr,coefficient,activity,parameter,p53's_sequence-specific_DNA_binding_activity)
extracted from the sentence: “The coefficient activity can include p53's sequence-specific DNA binding activity”
r3:rr(varrel,inr,variable, degradation(t),entity,rate_of_degradation)
extracted from the sentence: “The variable degradation(t) measures the rate of degradation”
r4:rr(coerel,inr,coefficient, p1,entity, rate_of_p53-independent_mdm2_transcription)
and
r5:rr(parbib,exr,entity, mdm2,reference,26)
extracted from the sentence: “Here the coefficient p1 denotes the rate of p53-independent mdm2 transcription and translation (24)”

The abstracts of two papers presenting experimental data supporting the qualitative model proposed by [1] named with the labels “32” and “92” can be used to illustrate the question answering process of the AROMA system and the use of external rhetoric relations in the explanations generated by the question answering process.

The first abstract, labeled “32” as listed in [1], consists of six sentences. The first subsystem of AROMA selects two of them and automatically extracts the following two sentence fragments:

1.“The p53 protein regulates the mdm2 gene”
2.“regulates both the activity of the p53 protein”

These fragments are then automatically transformed to Prolog facts in order to be processed by the second subsystem as shown below:

t(“32_5”, “The p53 protein regulates the mdm2 gene”).
t(“32_6”, “regulates both the activity of the p53 protein”).

The labels 32_5 and 32_6 denote that these fragments are extracted from the sentences 5 and 6 of the text with label 32.

The second abstract, labeled “92”, was found independently of [1] and consists of seven sentences, of which two are selected by the first subsystem of AROMA. The following two sentence fragments are extracted from them automatically:

3.“The mdm2 gene enhances the tumorigenic potential of cells”
4.“The mdm2 oncogene can inhibit p53_mediated transactivation”
and expressed in the form of Prolog facts as:
t(“92_3”, “The mdm2 gene enhances the tumorigenic potential of cells”).
t(“92_7”, “The mdm2 oncogene can inhibit p53_mediated transactivation”).

The labels 92_3 and 92_7 denote that these fragments are extracted from the sentences 3 and 7 of the text with label 92.

Using the above sentence fragments of the p53-mdm2 example our system discovers the causal negative feedback loop by appropriate chaining of the relevant sentence fragments and represents it as:

p53 +causes mdm2 -causes p53

Where

+causes means “causes increase” and
-causes means “causes decrease or inhibition”.

This is effected by answering the question:

Is there a process loop of p53?
This question is internally represented as the Prolog goal:

“cause(P1,p53,P2,p53,S)”
where P1 and P2 are two process names that the system extracts from the texts and that characterize the behavior of the entity p53. S stands for the overall effect of the feedback loop found, i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” since a positive causal connection is followed by a negative one.
The short answer automatically generated by our system is:
Yes.
The loop is p53 activity -causes p53 production.

By a short answer we mean a simple answer not connected to any explanation of the reasoning followed for the derivation of the answer.
The long answer automatically generated by our system together with an appropriate explanation is as follows:

The QUESTION is:
“Is there a process loop of p53 ? ”

Represented internally in Prolog as:
cause(P1,p53,P2,p53,S).

USING INFERENCE RULE IR4a
since the DEFAULT entity of p53_mediated is p53

with rhetoric relations:

rr(entfra,exr,entity,p53,fragment,92_7)
rr(entfra,exr,entity,mdm2,fragment,92_7)

USING INFERENCE RULE IR4b
with rhetoric relations:

rr(entfra,exr,entity,p53,fragment,32_5)
rr(entfra,exr,entity,mdm2,fragment,32_5)

the EXPLANATION is:

since oncogene is a kind of gene, p53 protein -causes p53

because

p53 protein +causes gene of mdm2
and
oncogene of mdm2 -causes p53 mediated transactivation of p53.

It should be noted that the combination of sentence fragments (92_7) and (32_5) in a causal chain that forms a closed negative feedback loop is based on two facts of background knowledge.

This background knowledge is inserted manually in our system as Prolog facts and can be stated as:

the DEFAULT entity of p53_mediated is p53, or default(p53_mediated, p53).

oncogene is a kind of gene, or kind_of(oncogene, gene).
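
How these two background facts let the fragments 32_5 and 92_7 close into a loop can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' Prolog inference rules: the dictionary encoding, the (kind, name) representation of an entity mention and the helper names `generalise`, `same_entity` and `owner` are all hypothetical.

```python
# Background facts copied from the text; the dictionary encoding is an assumption.
DEFAULT = {"p53_mediated": "p53"}   # default(p53_mediated, p53)
KIND_OF = {"oncogene": "gene"}      # kind_of(oncogene, gene)

def generalise(term):
    """Follow kind_of links upward until no more general term is known."""
    while term in KIND_OF:
        term = KIND_OF[term]
    return term

def same_entity(a, b):
    """Two (kind, name) mentions match if the names agree and the kinds
    generalise to the same term."""
    (kind_a, name_a), (kind_b, name_b) = a, b
    return name_a == name_b and generalise(kind_a) == generalise(kind_b)

def owner(modifier):
    """Entity that a modifier such as 'p53_mediated' refers to by default, if known."""
    return DEFAULT.get(modifier)

# Fragment 32_5: p53 protein +causes gene of mdm2
# Fragment 92_7: oncogene of mdm2 -causes p53_mediated transactivation
link_ok = same_entity(("gene", "mdm2"), ("oncogene", "mdm2"))
loop_closes = owner("p53_mediated") == "p53"
```

In this sketch `link_ok` says that the effect of the first fragment and the cause of the second denote the same entity, and `loop_closes` says that the effect of the second fragment leads back to p53. Both have to hold before the two fragments can be chained into a closed loop.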

The above analysis of the text fragments of the example is partially based on the following background linguistic and domain knowledge which is manually inserted as Prolog facts:

Linguistic Knowledge:

kind_of(“the”,“determiner”).
kind_of(“is”,“copula”).
kind_of(“of”,“preposition”).
kind_of(“activated”,“causal_connector”).
kind_of(“inhibits”,“causal_connector”).
kind_of(“regulated”,“causal_connector”).

Domain Knowledge:

kind_of(“protein”,“entity”).
kind_of(“DNA”,“entity”).
kind_of(“p53”,“entity”).
kind_of(“Mdm2”,“entity”).
kind_of(“damage”,“process”).
kind_of(“expression”,“process”).
kind_of(“increase”,“process”).
kind_of(“activity”,“process”).

The above knowledge base fragment contains both linguistic and domain knowledge to support the analysis of the sentences occurring in the corpus. In practice these two parts of knowledge are handled differently by the inference rules of the reasoning module.

V. AUTOMATIC DESCRIPTION OF THE BEHAVIOR OF THE MODEL VARIABLES

The system can automatically generate descriptions of the time behaviour of the concentrations of the two proteins, which are represented by the two variables of the model, using the “tfollows” rhetoric relation. More details may be found in [9] and [10]. An example is shown below of an automatically produced description of the numerical results obtained by solving the model equations, where T stands for time and the protein names are capitalized as P53 and MDM2 for emphasis only:

“A peak of P53 at T=3.6 P53=28830 is followed by a peak of MDM2 at T=6.4 MDM2=16550 which is followed by a valley of P53 at T=8.6 P53=-6100 which is followed by a valley of MDM2 at T=14.6 MDM2=9360 which is followed by a peak of P53 at T=17 P53=670”.
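
A description of this kind can be produced by locating the peaks and valleys of each variable and joining them in time order. The Python sketch below illustrates the idea. It is not the authors' Prolog module, and the function names and the strict local-extremum test are assumptions.

```python
def find_extrema(times, values):
    """Return (kind, t, v) for each strict local peak or valley of a sampled series."""
    events = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur > nxt:
            events.append(("peak", times[i], cur))
        elif cur < prev and cur < nxt:
            events.append(("valley", times[i], cur))
    return events

def merge_events(series):
    """Merge the extrema of several named series into one time-ordered list
    of (t, kind, name, value) tuples."""
    merged = []
    for name, (times, values) in series.items():
        for kind, t, v in find_extrema(times, values):
            merged.append((t, kind, name, v))
    return sorted(merged)

def describe(events):
    """Chain events into one sentence in the style of the example above."""
    if not events:
        return ""
    parts = [f"a {kind} of {name} at T={t} {name}={v}" for t, kind, name, v in events]
    first = parts[0][0].upper() + parts[0][1:]
    if len(parts) == 1:
        return first
    return first + " is followed by " + " which is followed by ".join(parts[1:])
```

Passing the five events of the example to `describe` as tuples such as `(3.6, "peak", "P53", 28830)` reproduces the wording of the quoted description. `find_extrema` and `merge_events` show one way in which such events might be obtained from the simulator output.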

VI. CONCLUSIONS

In the present paper we presented the question answering function of our AROMA system with intelligent knowledge management and rhetoric analysis of biomedical texts related to the modeling of a biomedical system. The AROMA system that we have developed consists of three main subsystems. The first subsystem achieves the extraction of knowledge from texts that is related to the structure and the parameters of the biomedical system simulated. The second subsystem is based on a reasoning process that answers questions by combining causal knowledge extracted by the first subsystem with background knowledge and generates explanations of the reasoning followed. The third subsystem is a system simulator written in Prolog that generates the time behavior of the model’s variables. An important feature of the system presented here is its capacity for model-based non-factoid question answering, which draws on rhetoric relation recognition and intelligent causal knowledge extraction from scientific texts and is accompanied by explanation generation and by automatic generation of textual descriptions of the dynamic behavior of the model of a biomedical system.

References

[1] Bar-Or, R. L. et al (2000). Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, vol. 97, No 21 pp. 11250-11255.
[2] Barrera J. et al (2004). An environment for knowledge discovery in biology. Computers in Biology and Medicine, vol 34 pp 427-447.
[3] Hirschman L. et al (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, vol. 18, 12, pp 1553-1561.
[4] Kontos, J. (1992). ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology. vol. 34, No 9, pp 611-616.
[5] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, vol 26, No. 2, pp. 103-122.
[6] Kontos, J. and Malagardi, I. (2001). A Search Algorithm for Knowledge Acquisition from Texts. Proceedings of HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens.
[7] Kontos, J. et al (2002a). ARISTA Causal Knowledge Discovery from Texts Discovery Science 2002 Luebeck, Germany. Proceedings of the 5th International Conference DS 2002 Springer Verlag. pp 348-355.
[8] Kontos, J. et al (2002b). System Modeling by Computer using Biomedical Texts. Res Systemica, volume 2, Special Issue, October 2002. (http://www.afscet.asso.fr/resSystemica/accueil.html).
[9] Kontos J. et al (2003a). The Simulation Subsystem of the AROMA System HERCMA 2003. Proceedings of 6th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens.
[10] Kontos J. et al (2003b). The AROMA System for Intelligent Text Mining. HERMIS, vol 4, pp 163-173.
[11] Kroetze J. A. et al (2003). Differentiating Data- and Text-Mining Terminology. Proceedings of SAICSIT 2003, pp 93-101.
[12] Kuehne S. E. and Forbus K. D. (2004). Capturing QP-relevant Information from Natural Language Text. Proceedings of the 18th International Workshop on Qualitative Reasoning, August, Evanston, Illinois, USA.
[13] Mann, W. C. and Thompson, S. A. (1988). Rhetorical structure theory: toward a functional theory of text organization. Text, 8(3), pp 243-281.
[14] Marcu, D. & Echihabi, A. (2002). An unsupervised approach to recognizing discourse relations. ACL 2002.
[15] Mizuta, Y. & Collier N. (2004). An Annotation Scheme for Rhetorical Analysis of Biology Articles. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). Lisbon, Portugal.
[16] Rzhetsky, A. et al (2004). GeneWays: a system for extracting, analyzing, visualising and integrating molecular pathway data. Journal of Biomedical Informatics. vol 37, pp. 43-53.
[17] Shatkay, H. and Feldman R. (2003). Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology, 10 (3) 821-855.






Wednesday, 25 November 2009

A System for Intelligent Text Mining

  The AROMA System for Intelligent Text Mining
(abridged)

JOHN KONTOS, IOANNA MALAGARDI AND JOHN PEROS

HERMIS International Journal of Computer Mathematics and its Applications. 2003 Vol. 4. pp. 163-173. LEA.

Abstract-- The AROMA system for Intelligent Text Mining, consisting of three subsystems, is presented and its operation analyzed. The first subsystem extracts knowledge from sentences. The knowledge extraction process is sped up by a search algorithm. The second subsystem is based on a causal reasoning process that generates new knowledge by combining knowledge extracted by the first subsystem, together with an explanation of the reasoning. The third subsystem performs simulation that generates the time-dependent course of a nonlinear system and textual descriptions of its behavior. The knowledge of the structure and parameter values of the equations, extracted automatically from the input scientific texts, is combined with prerequisite knowledge such as ontology and default values of relation arguments. The application of the system is demonstrated by the use of two examples concerning cell apoptosis compiled from a collection of MEDLINE abstracts of papers used for supporting the model of a biomedical system.
Index Terms—intelligent text mining, biomedical text mining, knowledge extraction, knowledge discovery from text, simulation, history

I. INTRODUCTION

We define Intelligent Text Mining as the process of creating novel knowledge from texts that is not stated in them by combining sentences with deductive reasoning. An early implementation of this kind of Intelligent Text Mining is reported in [8]. A review of different kinds of text mining including Intelligent Text Mining is presented in [14]. There are two possible methodologies for applying deductive reasoning to texts. The first methodology is based on the translation of the texts into some formal representation of their “content” with which deduction is performed. The advantage of this methodology lies in the simplicity and availability of the required inference engine, but its serious disadvantage is the need to preprocess all the texts again and store their translation into a formal representation every time something changes. In particular, in the case of scientific texts what may change is some part of the prerequisite knowledge, such as the ontology used for deducing new knowledge. The second methodology eliminates the need for translation of the texts into a formal representation, because it implements an inference engine capable of performing deductions “on the fly”, i.e. directly from the original texts. The only disadvantage of this second methodology is that it requires a more complex inference engine than the first. Its strong advantage is that the translation into a formal representation is avoided. In [8] the second methodology is chosen and the method developed is therefore called ARISTA, i.e. Automatic Representation Independent Syllogistic Text Analysis.
In [8] one of the application examples concerned medical text mining related to the human respiratory system mechanics. Biomedical text mining is now recognized as a very important field of study [18]. More specifically an early attempt was made in [8] to implement model-based question answering using the scientific text as knowledge base describing the model. The idea of connecting text mining with simulation and question answering was pursued further by our group as reported in [12] and [13] as well as in the present paper. We consider model discovery, updating and question answering as worthy final aims of intelligent text mining.
The AROMA (ARISTA Oriented Modeling Adaptation) system for Intelligent Text Mining accomplishes knowledge discovery using our knowledge representation independent method ARISTA that performs causal reasoning “on the fly” directly from text in order to extract qualitative or quantitative model parameters as well as to generate an explanation of the reasoning. The application of the system is demonstrated by the use of two biomedical examples concerning cell apoptosis and compiled from a collection of MEDLINE paper abstracts related to a recent proposal of a relevant mathematical model. A basic characteristic of the behavior of such biological systems is the occurrence of oscillations for certain values of the parameters of their equations.
The solution of these equations is accomplished with a Prolog program. This program provides an interface for manipulation of the model by the user and the generation of textual descriptions of the model behavior. This manipulation is based on the comparison of the experimental data with the graphical and textual simulator output. This is an application to a mathematical model of the p53-mdm2 feedback system. The dependence of the behavior of the model on these parameters may be studied using our system. The variables of the model correspond to the concentrations of the proteins p53 and mdm2 respectively. The non-linearity of the model causes the appearance of oscillations that differ from simple sine waves. The AROMA system for Intelligent Text Mining consists of three subsystems as briefly described below.
The first subsystem achieves the extraction of knowledge from individual sentences that is similar to traditional information extraction from texts. In order to speed up the whole knowledge acquisition process a search algorithm is applied on a table of combinations of keywords characterizing the sentences of the text.
The second subsystem is based on a causal reasoning process that generates new knowledge by combining “on the fly” knowledge extracted by the first subsystem. Part of the knowledge generated is used as parametric input to the third subsystem and controls the model adaptation.
The third subsystem is based on system modeling that is used to generate time dependent numerical values that are compared with experimental data. This subsystem is intended for simulating in a semi-quantitative way the dynamics of nonlinear systems. The characteristics of the model such as structure and parameter polarities of the equations are extracted automatically from scientific texts and are combined with prerequisite knowledge such as ontology and default process and entity knowledge.
In our papers [11], [12] and [13] parts of the AROMA System have been described separately. In the present paper we describe how the integrated system uses causal knowledge discovered from texts as a basis for modeling systems whose study is reported in texts analyzed by our system. We are thus aiming at automating part of the cognitive process of model discovery based on experimental data and supported by domain knowledge extracted automatically from scientific texts. A brief historical review of the technology of simulation is first presented below.

II. THE HISTORY OF SIMULATION

Computer Simulation is playing an increasingly valuable role within science and technology because it is one of the most powerful problem-solving techniques available today. Simulation has its roots in a variety of disciplines including Computer Science, Operations Research, Management Science, Statistics and Mathematics, and is widely utilized in industry and academia as well as in government organizations. Each of these groups influenced the directions of simulation as they developed historically. The history of simulation may be traced to ancient efforts or wishes to construct machines or automata simulating inanimate or animate systems.
According to the hypothesis of de Sola Price [5] the Ancient Greek Antikythera Mechanism that is dated approximately at 80 B.C. is a mechanized calendar which calculated the places of the Sun and the Moon through the cycle of years and months. These calculations could be considered as performing a simulation. Therefore this mechanism may be considered as the first known constructed simulator in the history of this technology.
Human action was already being simulated in the mechanical theatre of the Ancient Greek scientist and engineer Heron of Alexandria (about 100 A.D.). Using a system of cords, pulleys and levers bound to counterweights, as well as sound effects and changing scenery, Heron was able to create an illusion that brought legends to life. The conception of human beings as machines reaches back to antiquity. As early as the second century A.D., the famous Ancient Greek physician Galen conceived his pneumatic model of the human body in terms of the hydraulic technology of his age [4].
The development of simulators in the modern era predates the advent of digital computers, with the first flight simulator being credited to Edward A. Link in 1927 and a lot of work that followed using mechanical, electrical and electronic analogue computers.
One of the earliest simulation studies of biochemical reactions was carried out by Chance during the Second World War [16]. In a remarkable paper where experiment, analysis and simulation were all used, Chance showed the validity of the Michaelis-Menten model for the action of the enzyme peroxidase. The simulations were carried out with a mechanical differential analyser. The differential analyser belongs to a class of computers known as analogue computers. These computers were designed to take input in terms of continuous variables, such as hydraulic pressure or voltage. These were then processed by various units, which altered them according to particular mathematical functions including derivatives and integrals. Analogue computers were programmed by coupling various elementary devices, and in our field of interest they would solve a set of differential equations, in a continuous simulation of the model. Note that unlike digital computers, there is no round-off error involved in calculations processed with analogue computers [16].
Digital simulation was developed after the appearance of the digital computer.

The history of digital simulation has been described as comprising four periods:

(1) the advent (circa 1950 - 1960),
(2) the era of simulation programming languages (circa 1960 - 1980),
(3) the era of simulation support environments (circa 1980 - 1990), and
(4) the very modern era (circa 1990 - today).

The advent begins with the inception of digital computers and is marked by early theories of model representation and execution. For example, the fixed-time increment and variable-time increment time flow mechanisms (TFMs) were proposed during this period. Random methods for the statistical evaluation of simulation results were also formulated during the advent. Once the general principles and techniques had been identified, special purpose languages for simulation emerged.
The history of simulation programming languages has been organized in five periods of similar developments. The five periods start in 1955 and end in 1986 namely: The Period of Search (1955-1960); The Advent (1961-1965); The Formative Period (1966-1970); The Expansional Period (1971-1978) and The Period of Consolidation and Regeneration (1979-1986) [19].
The proliferation of simulation programming languages (SPLs) like Simula, Simscript, GPSS and MODSIM marks the second period in this history. The design and evolution of SPLs also helped refine the principles underlying simulation. The dominant conceptual frameworks (or world views) for discrete event simulation -- event scheduling, process interaction, and activity scanning -- were defined during the second period largely as a result of SPL research. Also during this period the first cohesive theories of simulation modeling were formulated, e.g. DEVS (discrete event system specification) and its basis in general systems theory [19].
The third period in the history of digital simulation is evidenced by a shift of focus from the development of the simulation program toward the broader life cycle of a simulation project. This means extending software and methodological support to such activities as problem formulation, objectives identification and presentation of simulation results. This was the era of the integrated simulation support environment (ISSE). Environment research occurred throughout the simulation communities: within academia, industry and governments. Also during this third period a great interest in interoperable, networked simulators emerged within the U.S. Department of Defense.
In the contemporary era emphasis and direction seem to vary across the primary simulation communities. In the commercial sector environments and languages remain a major focus. The objective seems to be maximized market share through specialization; for example commercial environments specializing in communications network simulation. Within academia environments are also a focus, but these environments appear to favour generality of purpose over specialization. The majority of modelling methodology activity remains in the academic sector and most of the work involving the execution of simulation models on parallel computers is also occurring in university laboratories.
One of the pioneers in the field is John McLeod [15] who in 1947 went to work at the U.S. Naval Air Missile Test Center (now Pacific Missile Range) Point Mugu, California. While there, he sparked and supervised development of the Guidance Simulation Laboratory which, within a few years, became one of the leading American simulation facilities. In 1952 he organized the Simulation Council (now The Society for Modeling and Simulation International) and with his wife Suzette began publication of the Simulation Council Newsletter. From 1956 to 1963 he worked as Design Specialist, Space Navigation and Data Processing, with General Dynamics/Astronautics in San Diego. While there, he received a grant to support testing of an extra-corporeal perfusion device (heart-lung machine) which he had developed on his own time. He also acted as co-founder of the San Diego Symposium for Biomedical Engineering, and edited the proceedings of the first symposium, held in 1961.
The first anaesthesia simulator was developed at the University of Southern California in the late 1960s. It featured spontaneous ventilation, a heartbeat, temporal and carotid pulses, and blood pressure; it could open and close its mouth, blink its eyes, exhibit muscle fasciculation, and cough. It responded to four intravenous drugs: thiopental, succinylcholine, epinephrine, and atropine. Additionally it responded to oxygen and nitrous oxide. The first anaesthesia simulator was based upon “scripting”. A script prescribes the consequences of an action [2].
A team at the University of Florida resurrected the early technology in the late 1980s [3]. In concert with a team of computer scientists and engineers, the Human Patient Simulator [HPS] was created and introduced in the early 1990s. What distinguishes the Human Patient Simulator from the first anaesthesia simulator is that it is based upon modelling. The software that runs the HPS uses complex mathematical equations, which define the many factors comprising the cardiovascular and respiratory systems of humans. If a drug or event affects one or more factors, the new equation will describe the resulting changes. Thus, if an intervention is correct and timely we will see improvement in the simulated patient’s condition. If the intervention is incorrect the simulated patient’s condition will deteriorate and ultimately lead to cardiac arrest and death.
As far as Greek specialists are concerned, it should be noted that J. Kontos began research in simulation in 1960, for example with the simulation of part of the respiratory system and of magnetohydrodynamic systems [6], [7].

III. BIOMEDICAL SYSTEM MODEL DISCOVERY SUPPORT

There exists some research on computational methods for the application of inductive learning methods in discovery of new knowledge. However the knowledge induced by such methods usually has little relation to the formalisms and concepts used by scientists and engineers. Experts in some domains may reject output of a learning system, unless it is related to their prior knowledge. In contrast the use of models in science and engineering may provide an explanation that includes variables, objects, or mechanisms that are unobserved, but that help predict the behavior of observed variables. Explanations also make use of general concepts and ontologies or relations for explaining experimental findings using scientific models.
We will focus here on a particular class of system models consisting of processes that describe one or more causal relations between input variables and output variables. A process may also include conditions, stated as threshold tests on its input variables, that describe when it is active. This knowledge is expressed in terms of differential equations when it involves change over time or algebraic equations when it involves instantaneous effects. A process model consists of a set of processes that link observable input variables with observable output variables, possibly through unobserved theoretical terms. The concept of process is fundamental to our original early proposal of the ARISTA method and its modeling application in [8].
Process models are often designed to characterize the behavior of dynamical systems that change over time, though they can also handle systems in equilibrium. The data produced by such systems differ from those that arise in most induction tasks in a variety of ways. First, these variables are primarily continuous, since they represent quantitative measurements of the system under study. Second, the observed values are not independently and identically distributed, since those observed at later time steps depend on those measured earlier. The dynamical systems explained by our models are viewed as deterministic. The observations themselves may well contain noise but we assume that the processes themselves are always active whenever their conditions are met and that their equations have the same form all the time. We use this assumption because scientists and engineers often treat the systems they study as deterministic.
A biomedical system model discovery support system like AROMA may revolutionize scientific discovery by providing computer tools for the automatic checking of the validity of a model as supported by experimental findings. With such tools the updating of models may also be facilitated whenever experimental findings are reported that disagree with a generally accepted model. Therefore model discovery support is a worthy final aim for Intelligent Text Mining.

IV. THE GENERAL ARCHITECTURE OF OUR AROMA SYSTEM

The general architecture of our system is shown in Figure 1 and consists of three subsystems, namely the Knowledge Extraction Subsystem, the Causal Reasoning Subsystem and the Simulation Subsystem. These subsystems are briefly described below. The texts of the example application presented here are compiled from the MEDLINE abstracts of the papers used by [1] as references, which amount to 73 items. Most of these papers are used in [1] to support the discovery of a quantitative model of protein concentration oscillations related to cell apoptosis constructed as a set of differential equations. We are aiming at automating part of such a cognitive process by our system. The collection of the MEDLINE abstracts is processed by a preprocessor module so that they take the form required by our Prolog programs, i.e. one sentence per line.

A. The Knowledge Extraction Subsystem

This subsystem integrates partial causal knowledge extracted from a number of different texts. This knowledge is expressed in natural language using causal verbs such as “regulate”, “enhance” and “inhibit”. These verbs usually take as arguments entities such as protein names and gene names that occur in the biomedical texts that we use for the present applications. In this way causal relations between the entities are expressed. The input files used for this subsystem contain abstracts downloaded from MEDLINE. A lexicon containing words such as causal verbs and stopwords is also input to this subsystem. An output file is produced by the system that contains parts of sentences collected from the original sentences of different abstracts. This output file is used for reasoning by the second subsystem.
The operation of the subsystem is based on the recognition of a causal verb or verb group. After this recognition complements of the verbs are chunked by processing the neighboring left and right context of the verb. This is accomplished by using a number of stopwords such as conjunctions and relative pronouns. The input texts are submitted first to a preprocessing module of the subsystem that converts automatically each sentence into a form consisting of Prolog facts that represent numerically information concerning the identification of the sentence that contains the word and its position in the sentence. This set of Prolog facts has nothing to do with logical representation of the “content” of the sentences as it seems to have been inaccurately reported in [10]. It should be emphasized that we do not deviate from our ARISTA method with this translation. We simply “annotate” each word with information concerning its position within the text. It should be noted that our annotation is not added in the original text but it is represented as the set of Prolog facts mentioned above.
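
The word annotation and the verb-centred chunking described above can be sketched as follows. This is a Python illustration, not the authors' Prolog preprocessor. The tuple layout (word, sentence number, position), the tiny lexicon and the stopword set are assumptions made for the example.

```python
CAUSAL_VERBS = {"regulates", "enhances", "inhibit", "inhibits"}  # hypothetical lexicon entries
STOPWORDS = {"and", "which", "that", "but"}                      # hypothetical chunk boundaries

def annotate(sentences):
    """Return (word, sentence_id, position) facts, one per word; numbering starts at 1."""
    facts = []
    for sid, sentence in enumerate(sentences, start=1):
        for pos, word in enumerate(sentence.rstrip(".").split(), start=1):
            facts.append((word, sid, pos))
    return facts

def chunk(facts, sid):
    """Find the first causal verb of sentence sid and collect its left and
    right context, stopping at the nearest stopword on each side."""
    words = [w for w, s, _ in facts if s == sid]
    for i, w in enumerate(words):
        if w.lower() in CAUSAL_VERBS:
            left = []
            for x in reversed(words[:i]):
                if x.lower() in STOPWORDS:
                    break
                left.insert(0, x)
            right = []
            for x in words[i + 1:]:
                if x.lower() in STOPWORDS:
                    break
                right.append(x)
            return " ".join(left), w, " ".join(right)
    return None

# The clause after "and" is invented here only to show where chunking stops.
facts = annotate(["The p53 protein regulates the mdm2 gene and is widely studied."])
fragment = chunk(facts, 1)
```

The original sentence is never altered in this sketch: the facts live beside it, which mirrors the statement above that the annotation is kept as a separate set of facts.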

B. The Causal Reasoning Subsystem

The output of the first subsystem is used as input to the second subsystem that combines causal knowledge in natural language form to produce by automatic deduction conclusions not mentioned explicitly in the input text. The operation of this subsystem is based on the ARISTA method [8]. The sentence fragments containing causal knowledge are parsed and the entity-process pairs are recognized. The user questions are analysed and reasoning goals are extracted from them. The answers to the user questions are generated automatically by a reasoning process together with explanations in natural language form. This is accomplished by the chaining “on the fly” of causal statements using prerequisite knowledge such as ontology to support the reasoning process. A second output of this subsystem consists of both qualitative and quantitative information that is input to the third subsystem and controls the adaptation of the model of the biomedical system.

C. The Simulation Subsystem

The third subsystem is used for modelling the dynamics of the biomedical system specified on the basis of the MEDLINE abstracts processed by the first subsystem. The characteristics of the model such as structure and parameter values will eventually be extracted from the input texts combined with prerequisite knowledge such as ontology and default process and entity knowledge. Considering the above example two coupled first order differential equations are used as the approximate mathematical model of the biomedical system in rough correspondence with the model proposed in [1]. A basic characteristic of the behaviour of such a system is the occurrence of oscillations for certain values of the parameters of the equations.

The equations in finite difference form that approximate the differential equations are:

Δx = a1*x + b1*y + c1*x*y (1)
Δy = a2*y + b2*delay(d,x) (2)

Where Δx means the difference between the value of the variable x at the present time and the value of the variable x at the next time instant, and delay(d,x) means the value of x d units of time earlier. Time is taken to advance in discrete steps. The variables x and y correspond to the concentrations of the proteins p53 and mdm2 respectively. The symbols a1, b1, c1, a2, b2 stand for the parameters of the equations. Note that the multiplicative term c1*x*y renders equation (1) non-linear. This non-linearity causes the appearance of the oscillations to differ from simple sine waves. The solution of these equations is accomplished with a Prolog program that provides an interface for manipulating the parameters of the model. An important module of the simulation subsystem is one that generates text describing the behaviour of the variables of the model over time.
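Equations (1) and (2) can be stepped forward in time as in the following minimal sketch. It is a Python illustration, not the Prolog simulator of the system. It assumes that the next value of each variable is its present value plus the corresponding difference, and that x equals its initial value x0 at all times before the start of the simulation; the function name and the parameter values in the demonstration are hypothetical and are not taken from [1].

```python
def simulate(a1, b1, c1, a2, b2, d, x0, y0, steps):
    """Iterate equations (1) and (2) for the given number of discrete time steps.

    Returns two lists holding the values of x and y at times 0..steps.
    """
    xs = [x0]
    ys = [y0]
    for t in range(steps):
        x, y = xs[t], ys[t]
        # delay(d, x): value of x d time units earlier.
        # Assumption: x is taken equal to x0 before time 0.
        x_delayed = xs[t - d] if t - d >= 0 else x0
        dx = a1 * x + b1 * y + c1 * x * y   # equation (1), non-linear term c1*x*y
        dy = a2 * y + b2 * x_delayed        # equation (2), delayed coupling
        xs.append(x + dx)
        ys.append(y + dy)
    return xs, ys


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the code.
    xs, ys = simulate(a1=0.1, b1=0.0, c1=-0.2, a2=-0.1, b2=0.1,
                      d=2, x0=1.0, y0=0.5, steps=50)
    for t in range(0, 51, 10):
        print(t, round(xs[t], 4), round(ys[t], 4))
```

Whether oscillations appear depends on the parameter values and on the delay d; the sketch makes no claim about which values produce them.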

V. A FIRST EXAMPLE OF BIOMEDICAL TEXT MINING

An illustrative subset of the sentences used in this first example follows; the numbers in parentheses are the reference numbers under which the authors of [1] cite the source papers:

The p53 protein is activated by DNA damage. (23)
Expression of Mdm2 is regulated by p53. (32)
Mdm2 increase inhibits p53 activity. (17)

Using these sentences, our system automatically discovers a qualitative causal process model with a negative feedback loop that can be summarized as:

DNA damage +causes p53 +causes mdm2 -causes p53

where +causes means “causes increase” and -causes means “causes decrease or inhibition”.

The system arrives at this model by answering the question: Is there a process loop of p53?

This question is internally represented as the Prolog goal “cause(P1,p53,P2,p53,S)”, where P1 and P2 are two process names that the system extracts from the texts and that characterize the behavior of p53. S stands for the overall effect of the feedback loop found, i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” or “negative”, since a positive causal connection is followed by a negative one.
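The way the overall sign S follows from the signs of the individual causal connections can be sketched as below. This is a Python illustration of sign composition along a causal chain, not the system's Prolog predicate cause/5 or its inference rules; the link representation (cause, effect, sign) and the function name are assumptions, and process names are omitted for simplicity.

```python
def find_loop(links, entity):
    """Search depth-first for a causal chain leading from entity back to itself.

    links is a list of (cause, effect, sign) tuples with sign +1 or -1.
    Returns (chain, overall sign), or None if no loop exists.
    The overall sign is the product of the signs along the chain.
    """
    def search(current, sign, chain, visited):
        for cause, effect, s in links:
            if cause != current:
                continue
            new_sign = sign * s
            new_chain = chain + [(cause, effect, s)]
            if effect == entity:
                return new_chain, new_sign
            if effect not in visited:
                found = search(effect, new_sign, new_chain, visited | {effect})
                if found is not None:
                    return found
        return None

    return search(entity, 1, [], {entity})


if __name__ == "__main__":
    # Links corresponding to sentences (23), (32) and (17) of the example.
    links = [
        ("DNA damage", "p53", +1),
        ("p53", "mdm2", +1),
        ("mdm2", "p53", -1),
    ]
    result = find_loop(links, "p53")
    if result is not None:
        chain, sign = result
        print(chain, "negative" if sign < 0 else "positive")
```

A positive link followed by a negative one gives a product of -1, which corresponds to the negative feedback loop reported by the system.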

The short answer automatically generated by our system is: Yes.

The loop is p53 activity -causes p53 production.

The long answer automatically generated by our system is:

Using sentence 17 with inference rule IR4

since the DEFAULT process of p53 is production

using sentence 32

the EXPLANATION is: since increase is equivalent to expression

p53 production -causes activity of p53

because

p53 production +causes expression of Mdm2

and

increase of Mdm2 -causes activity of p53

It should be noted that the combination of sentences (17) and (32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge.

This knowledge is inserted manually in our system as Prolog facts and can be stated as:

“the DEFAULT process of p53 is ‘production’” or

in Prolog: “default(p53,production).”.

“the process ‘increase’ is equivalent to the process ‘expression’” or

in Prolog “equivalent(increase,expression).”.

The above analysis of the text fragments of the first example is partially based on the following prerequisite knowledge which is also manually inserted as Prolog facts:

kind_of("the","determiner").
kind_of("is","copula").
kind_of("of","preposition").
kind_of("p53","entity_noun").
kind_of("protein","entity_noun").
kind_of("DNA","entity_noun").
kind_of("Mdm2","entity_noun").
kind_of("activated","causal_connector").
kind_of("inhibits","causal_connector").
kind_of("regulated","causal_connector").
kind_of("damage","process").
kind_of("expression","process").
kind_of("increase","process").
kind_of("activity","process").

The above prerequisite knowledge base fragment contains both general linguistic and domain-dependent ontological knowledge about the words occurring in the corpus. In practice, of course, these two kinds of knowledge are handled differently by the inference rules of the reasoning module.
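The role of the kind_of facts can be pictured with a small lookup sketch. It is written in Python rather than Prolog and only illustrates how each word of a sentence could be labelled with its category before parsing; the dictionary layout, the case-insensitive matching and the label “unknown” for words absent from the lexicon are assumptions of the sketch.

```python
# The lexicon mirrors the kind_of facts listed above.
LEXICON = {
    "the": "determiner",
    "is": "copula",
    "of": "preposition",
    "p53": "entity_noun",
    "protein": "entity_noun",
    "DNA": "entity_noun",
    "Mdm2": "entity_noun",
    "activated": "causal_connector",
    "inhibits": "causal_connector",
    "regulated": "causal_connector",
    "damage": "process",
    "expression": "process",
    "increase": "process",
    "activity": "process",
}

# Assumption: matching ignores letter case, so "The" matches "the".
_LOWERCASE_LEXICON = {word.lower(): kind for word, kind in LEXICON.items()}


def tag_words(words):
    """Pair every word with its lexicon category, or "unknown" if it has none."""
    return [(w, _LOWERCASE_LEXICON.get(w.lower(), "unknown")) for w in words]


if __name__ == "__main__":
    sentence = "The p53 protein is activated by DNA damage"
    for word, kind in tag_words(sentence.split()):
        print(word, kind)
```

The word “by” does not occur in the lexicon fragment above, so the sketch labels it “unknown”.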

VI. A SECOND EXAMPLE OF BIOMEDICAL TEXT MINING

The second example text is also compiled from two MEDLINE abstracts of papers used by [1] as references. These two abstracts, again downloaded from MEDLINE, contain knowledge concerning the interaction of the proteins p53 and mdm2, which are involved in the life cycle of the cell. The first abstract consists of six sentences, of which two are selected by the first subsystem; the following fragments are extracted from them automatically:

“The p53 protein regulates the mdm2 gene”
“regulates both the activity of the p53 protein”

These fragments are then automatically transformed into Prolog facts in order to be processed by the second subsystem, as shown below:

t("325", "The p53 protein regulates the mdm2 gene").
t("326", "regulates both the activity of the p53 protein").

The identifiers 325 and 326 denote that these fragments were extracted from sentences 5 and 6 of text 32.
The second abstract consists of seven sentences, of which two are selected by the first subsystem; the following fragments are extracted from them automatically:

“The mdm2 gene enhances the tumorigenic potential of cells”
“The mdm2 oncogene can inhibit p53_mediated transactivation”

and expressed in the form of Prolog facts as:

t("923", "The mdm2 gene enhances the tumorigenic potential of cells").
t("927", "The mdm2 oncogene can inhibit p53_mediated transactivation").
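The identifier scheme of these facts can be decoded as in the following Python sketch. It assumes that the last digit of an identifier is the sentence number and the preceding digits are the text number, which holds for the four identifiers shown here but would need a different encoding for texts with ten or more sentences; the function names are hypothetical.

```python
def split_fragment_id(fragment_id):
    """Split an identifier such as "325" into (text number, sentence number).

    Assumption: the sentence number occupies exactly the last digit.
    """
    return int(fragment_id[:-1]), int(fragment_id[-1])


def group_by_text(facts):
    """Group (identifier, fragment) pairs by the text they were extracted from."""
    grouped = {}
    for fragment_id, fragment in facts:
        text_number, sentence_number = split_fragment_id(fragment_id)
        grouped.setdefault(text_number, []).append((sentence_number, fragment))
    return grouped


if __name__ == "__main__":
    facts = [
        ("325", "The p53 protein regulates the mdm2 gene"),
        ("326", "regulates both the activity of the p53 protein"),
        ("923", "The mdm2 gene enhances the tumorigenic potential of cells"),
        ("927", "The mdm2 oncogene can inhibit p53_mediated transactivation"),
    ]
    print(group_by_text(facts))
```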

Using the sentences of the second example our system discovers the causal negative feedback loop:

p53 +causes mdm2 -causes p53

where +causes means “causes increase” and -causes means “causes decrease or inhibition”.

The system arrives at this loop by answering the question:

Is there a process loop of p53?

This question is internally represented as the Prolog goal:

“cause(P1,p53,P2,p53,S)”

where P1 and P2 are two process names that the system extracts from the texts and that characterize the behavior of p53. S stands for the overall effect of the feedback loop found, i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-”, since a positive causal connection is followed by a negative one.

The short answer automatically generated by our system is: Yes.

The loop is p53 activity -causes p53 production.

The long answer automatically generated by our system is:

the QUESTION is:
Get process loop of p53
OR
cause(P1,p53,P2,p53,S)
USING INFERENCE RULE IR4a
since the DEFAULT entity of p53_mediated is p53
USING sentence 927 with inference rule IR4
USING INFERENCE RULE IR4b
USING sentence 325

the EXPLANATION is:
since oncogene is a kind of gene
p53 protein -causes p53

because
p53 protein +causes gene of mdm2
and
oncogene of mdm2 -causes p53_mediated transactivation of p53

It should be noted that the combination of sentences 927 and 325 (from texts 92 and 32) in a causal chain that forms a closed negative feedback loop is based on two facts of prerequisite ontological knowledge.

This knowledge is inserted manually in our system as Prolog facts and can be stated as:

the DEFAULT entity of p53_mediated is p53, or in Prolog: default(p53_mediated, p53).

oncogene is a kind of gene, or in Prolog: kind_of(oncogene, gene).


VIII. CONCLUSIONS

We presented our AROMA system for Intelligent Text Mining. The AROMA system that we developed consists of three main subsystems. The first subsystem achieves the extraction of knowledge from sentences that is related to the structure and the parameters of the biomedical system simulated. The second subsystem is based on a reasoning process that generates new knowledge by combining “on the fly” knowledge extracted by the first subsystem as well as explanations of the reasoning. The third subsystem is a system simulator written in Prolog that generates the time behavior of the model’s variables. Two important features of the simulation subsystem are the use of structure and parameter knowledge automatically extracted from scientific texts and the automatic generation of texts describing the behavior of the system being simulated. Our final aim is to be able to model biomedical systems by integrating knowledge extracted from different texts and give the user a facility for questioning these models during a collaborative man-machine model discovery procedure. The model based question answering we are aiming at may support both biomedical researchers and medical practitioners.

REFERENCES

[1] Bar-Or, R. L. et al (2000). Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, vol. 97, No 21 pp. 11250-11255, October.
[2] Denson JS and Abrahamson S (1969). A computer-controlled patient simulator. JAMA 208:504-508.
[3] Good, M.L. and Gravenstein, J.S. (1996). Anesthesia simulators and training devices, in Prys Roberts C, Brown BR Jr (eds) International Practice of Anaesthesia, Vol 2. Oxford, Butterworth-Heinemann, pp. 2/167/1-11.
[4] Grau, O. (2000). The History of Telepresence: automata, illusion and the rejection of the body, in: Ken Goldberg (ed.): The Robot in the Garden: Telerobotics and Telepistemology on the Internet, Cambridge/Mass, pp. 226-246.
[5] Kean, V., J. (1995). The Ancient Greek Computer from Rhodes known as the Antikythera Mechanism. Efstathiadis Group. Greece.
[6] Kontos, J. (1965). Computation of Instability Growth Rates in Finite Conductivity Magnetohydrodynamics. Nuclear Fusion, Vol 5, No 2.
[7] Kontos, J. et al (1966), A Control Engineering Approach to Magnetohydrodynamic Stability. Proceedings of the I.E.E., Vol 113, No 3.
[8] Kontos, J. (1992) ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology. vol. 34, No 9, pp.611-616.
[9] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, vol 26, No. 2, pp. 103-122, October.
[10] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems vol. 26, No. 2, pp. 103-122, October.
[11] Kontos, J. and Malagardi, I. (2001) A Search Algorithm for Knowledge Acquisition from Texts. HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens.
[12] Kontos, J., Elmaoglou, A. and Malagardi, I. (2002a) ARISTA Causal Knowledge Discovery from Texts Discovery Science 2002 Luebeck, Germany. Proceedings of the 5th International Conference DS 2002 Springer Verlag. pp. 348-355.
[13] Kontos, J., Malagardi, I., Peros, J., Elmaoglou, A. (2002b). System Modeling by Computer using Biomedical Texts. Res Systemica Volume 2 Special Issue. October 2002. (http://www.afscet.asso.fr/resSystemica/accueil.html).
[14] Kroetze, J. A. et al. (2003) Differentiating Data- and Text-Mining Terminology. Proceedings of SAICSIT 2003. pp. 93-101.
[15] McLeod J. and Osborne J. (1966). "Physiological simulation in general and particular," pp. 127-138, Natural Automata and Useful Simulations, edited by H. H. Pattee, E. A. Edelsack, Louis Fein, A. B. Callahan, Spartan Books, Washington DC.
[16] Mendes, P. & Kell, D.B. (1996). Computer simulation of biochemical kinetics. In BioThermoKinetics of the living cell (eds. H.V. Westerhoff, J.L. Snoep, F.E. Sluse, J.E. Wijker and B.N. Kholodenko), pp. 254-257. BioThermoKinetics Press, Amsterdam.
[17] Price, D. de Solla (1974). Gears from the Greeks. American Philosophical Society.
[18] Weeber, M. et al. (2000). Text-Based Discovery in Biomedicine: The architecture of the DAD-system. Proceedings of the 2000 AMIA Annual Fall Symposium. pp. 903-907. Hanley and Belfus, Philadelphia, PA.
[19] Zeigler, B.P. (1976). Theory of Modelling and Simulation, John Wiley and Sons, New York.