We report two experiments that investigate the effects of sentence context on bilingual lexical access in Spanish and English. Highly proficient Spanish-English bilinguals read sentences in Spanish and English that included a marked word to be named. The word was either a cognate with similar orthography and/or phonology in the two languages, or a matched non-cognate control. Sentences appeared in one language alone (i.e., Spanish or English) and target words were not predictable on the basis of the preceding semantic context. In Experiment 1, we mixed the language of the sentence within a block such that sentences appeared in an alternating run in Spanish or in English. These conditions partly resemble normally occurring inter-sentential code-switching. In these mixed-language sequences, cognates were named faster than non-cognates in both languages. There were no effects of switching the language of the sentence. In Experiment 2, with Spanish-English bilinguals matched closely to those who participated in the first experiment, we blocked the language of the sentences to encourage language-specific processes. The results were virtually identical to those of the mixed-language experiment. In both cases, target cognates were named faster than non-cognates, and the magnitude of the effect did not change according to the broader context. Taken together, the results support the predictions of the Bilingual Interactive Activation + Model (Dijkstra and van Heuven, 2002) in demonstrating that bilingual lexical access is language non-selective even under conditions in which language-specific cues should enable selective processing. They also demonstrate that, in contrast to lexical switching from one language to the other, inter-sentential code-switching of the sort in which bilinguals frequently engage imposes no significant costs on lexical processing.
bilingualism; language switching; switch costs; lexical access; sentence context; cognates
The phonemic restoration effect refers to the tendency for people to hallucinate a phoneme replaced by a non-speech sound (e.g., a tone) in a word. This illusion can be influenced by preceding sentential context providing information about the likelihood of the missing phoneme. The saliency of the illusion suggests that supportive context can affect relatively low (phonemic or lower) levels of speech processing. Indeed, a previous event-related brain potential (ERP) investigation of the phonemic restoration effect found that the processing of coughs replacing high versus low probability phonemes in sentential words differed from each other as early as the auditory N1 (120-180 ms post-stimulus); this result, however, was confounded by physical differences between the high and low probability speech stimuli, thus it could have been caused by factors such as habituation and not by supportive context. We conducted a similar ERP experiment avoiding this confound by using the same auditory stimuli preceded by text that made critical phonemes more or less probable. We too found the robust N400 effect of phoneme/word probability, but did not observe the early N1 effect. We did however observe a left posterior effect of phoneme/word probability around 192-224 ms -- clear evidence of a relatively early effect of supportive sentence context in speech comprehension distinct from the N400.
Phonemic restoration effect; speech comprehension; N400; ERP
When looking for the referents of novel nouns, adults and young children are sensitive to cross-situational statistics (Yu and Smith, 2007; Smith and Yu, 2008). In addition, the linguistic context that a word appears in has been shown to act as a powerful attention mechanism for guiding sentence processing and word learning (Landau and Gleitman, 1985; Altmann and Kamide, 1999; Kako and Trueswell, 2000). Koehne and Crocker (2010, 2011) investigate the interaction between cross-situational evidence and guidance from the sentential context in an adult language learning scenario. Their studies reveal that these learning mechanisms interact in a complex manner: they can be used in a complementary way when context helps reduce referential uncertainty; they influence word learning about equally strongly when cross-situational and contextual evidence are in conflict; and contextual cues block aspects of cross-situational learning when both mechanisms are independently applicable. To address this complex pattern of findings, we present a probabilistic computational model of word learning which extends a previous cross-situational model (Fazly et al., 2010) with an attention mechanism based on sentential cues. Our model uses a framework that seamlessly combines the two sources of evidence in order to study their emerging pattern of interaction during the process of word learning. Simulations of the experiments of Koehne and Crocker (2010, 2011) reveal an overall pattern of results that is in line with their findings. Importantly, we demonstrate that our model does not need to explicitly assign priority to either source of evidence in order to produce these results: learning patterns emerge as a result of a probabilistic interaction between the two cue types. Moreover, using a computational model allows us to examine the developmental trajectory of the differential roles of cross-situational and sentential cues in word learning.
probabilistic modeling; cross-situational word learning; syntactic bootstrapping; context-based attention mechanisms
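As an illustration of how cross-situational evidence and sentential attention could be combined in one probabilistic update, the following is a minimal sketch loosely in the spirit of an incremental alignment-based learner. It is not the authors' model: the class name, the smoothing constants, the `attention` dictionary and the toy words and referents are all assumptions made for this example.

```python
from collections import defaultdict


class CrossSituationalLearner:
    """Toy incremental word learner (illustrative assumption, not the published model)."""

    def __init__(self, smoothing=1e-3, n_referents=100):
        # assoc[word][referent] accumulates alignment mass across observations.
        self.assoc = defaultdict(lambda: defaultdict(float))
        self.smoothing = smoothing
        self.n_referents = n_referents  # assumed size of the referent inventory

    def meaning_prob(self, word, referent):
        """Smoothed estimate of p(referent | word)."""
        row = self.assoc[word]
        total = sum(row.values())
        return (row.get(referent, 0.0) + self.smoothing) / (
            total + self.smoothing * self.n_referents
        )

    def observe(self, words, referents, attention=None):
        """Update associations from one utterance paired with one scene.

        attention maps (word, referent) pairs to a multiplicative weight
        (default 1.0); this is a hypothetical stand-in for sentential cues.
        """
        attention = attention or {}
        updates = []
        for referent in referents:
            # Each referent distributes one unit of alignment mass over the words.
            scores = {
                w: self.meaning_prob(w, referent) * attention.get((w, referent), 1.0)
                for w in words
            }
            norm = sum(scores.values())
            for w in words:
                updates.append((w, referent, scores[w] / norm))
        # Apply updates only after all alignments are computed.
        for w, referent, mass in updates:
            self.assoc[w][referent] += mass

    def best_referent(self, word):
        """Referent with the highest estimated probability for this word."""
        return max(self.assoc[word], key=lambda r: self.meaning_prob(word, r))


learner = CrossSituationalLearner()
learner.observe(["dax", "blick"], ["DOG", "BALL"])
learner.observe(["dax", "wug"], ["DOG", "CUP"])
print(learner.best_referent("dax"))  # co-occurrence across scenes favours DOG
```

In this sketch neither evidence source is given explicit priority: the attention weight simply rescales the alignment scores before normalisation, so its influence depends on how much cross-situational evidence has already accumulated.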
Older adults (as a group) are less likely than younger adults to engage in an anticipatory mode of language comprehension, failing to successfully pre-activate information about upcoming likely (predictable) words during online processing. To assess (within one set of materials) age-related changes in the use of sentential context to affect processing of predictable words and in the consequences of violating predictions, event-related brain potentials were recorded while older adults read sentences that varied in sentence-level constraint and expectancy of sentence-final words. Strongly constraining sentences were completed by their most expected, predictable words and weakly constraining sentences were completed by their most expected, less predictable words. Both types of sentences also were completed by unexpected (but plausible) words. Older adults showed reduced and delayed effects of sentential context on processing predictable words. Whereas younger adults elicit an enhanced positive ERP (starting around 500 ms post-stimulus onset, largest over prefrontal electrode sites), specifically for unexpected words that violate strong expectancies for a different word, older adults as a group did not exhibit this neural consequence of disconfirmed predictions. Older adults were instead more likely to show a left-lateralized frontal negativity for predictable items. This ERP response has been attributed to processes needed to revisit contextual material in forming an interpretation of message-level meaning, which may be more likely when anticipatory modes of comprehension are not engaged. Taken together, the results suggest that normal aging can affect allocation of resources to different cognitive and neural pathways in achieving comprehension outcomes.
language; event-related potentials; sentential context; N400; frontal negativity
To extract physician-asserted drug side effects from electronic medical record clinical narratives.
Materials and methods
Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System.
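To make the idea of a keyword-based pattern matching rule concrete, here is a minimal sketch of one such rule. The term lists, the cue phrases and the single "effect + causal cue + drug" pattern are hypothetical examples chosen for illustration; they are not the rules developed in the study, and the machine learning component is not shown.

```python
import re

# Hypothetical mini-lexicons; a real system would use far larger resources.
SIDE_EFFECTS = ["weight gain", "nausea", "tremor", "rash"]
DRUGS = ["lithium", "sertraline", "olanzapine"]
CAUSAL_CUES = ["due to", "secondary to", "caused by", "from"]


def alternation(terms):
    """Build a regex alternation, longest terms first so they win over prefixes."""
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


# One illustrative rule: <side effect> <causal cue> <drug>
RULE = re.compile(
    rf"\b(?P<effect>{alternation(SIDE_EFFECTS)})\s+"
    rf"(?:{alternation(CAUSAL_CUES)})\s+"
    rf"(?P<drug>{alternation(DRUGS)})\b",
    re.IGNORECASE,
)


def extract_pairs(sentence):
    """Return (side effect, drug) pairs matched by the rule in one sentence."""
    return [
        (m.group("effect").lower(), m.group("drug").lower())
        for m in RULE.finditer(sentence)
    ]


print(extract_pairs("Patient reports tremor secondary to lithium."))
```

A rule of this kind only fires on clear indication words; sentences that describe a side effect less directly would fall through, which is the gap a sentence-level classifier is meant to cover.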
The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction.
The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction.
Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.
Natural language processing; machine learning; information extraction; electronic medical record; Information storage and retrieval (text and images); discovery; and text and data mining methods; Other methods of information extraction; Natural-language processing; bioinformatics; Ontologies; Knowledge representations, Controlled terminologies and vocabularies; Information Retrieval; HIT Data Standards; Human-computer interaction and human-centered computing; Providing just-in-time access to the biomedical literature and other health information; Applications that link biomedical knowledge from diverse primary sources (includes automated indexing); Linking the genotype and phenotype
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
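The sentence-level category search described above can be sketched as follows. This is a toy illustration under stated assumptions, not Textpresso's implementation: the four-entry lexicon, the category names used as dictionary keys, and the function names are hypothetical.

```python
import re

# Hypothetical lexicon mapping terms to ontology categories.
LEXICON = {
    "lin-12": "gene",
    "glp-1": "gene",
    "interacts": "association",
    "regulates": "regulation",
}


def tag_sentence(sentence):
    """Return {category: set of matched terms} for one sentence."""
    tags = {}
    for token in re.findall(r"[A-Za-z0-9-]+", sentence.lower()):
        category = LEXICON.get(token)
        if category:
            tags.setdefault(category, set()).add(token)
    return tags


def search(sentences, required):
    """Keep sentences with at least `n` distinct terms for each required category."""
    hits = []
    for sentence in sentences:
        tags = tag_sentence(sentence)
        if all(len(tags.get(cat, ())) >= n for cat, n in required.items()):
            hits.append(sentence)
    return hits


corpus = [
    "lin-12 interacts with glp-1 in the germline.",
    "lin-12 is expressed in the vulva.",
    "The worm moves.",
]
# Query: two distinct genes plus an association term in the same sentence.
print(search(corpus, {"gene": 2, "association": 1}))
```

Restricting the match to a single sentence is what distinguishes this kind of query from a document-level keyword search: both gene names and the relating term must co-occur in the same sentence.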
Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
With the increasing availability of full-text scientific papers online, new tools, such as Textpresso, will help to extract information and knowledge from research literature.
Non-adjacent dependencies are thought to be more costly to process than dependencies in which dependents immediately follow or precede what they depend on. Locality effects have been found in English, while in languages with rich case marking (German and Hindi) sentence-final structures show anti-locality effects. The motivation of the current study is to test whether locality effects generalize to a language that is typologically different from those investigated so far. Hungarian is a “topic prominent” language; it permits variation in word order for semantic reasons, including SVO word order. Hungarian also has a rich morphological system (e.g., a rich case system) and postpositions to indicate grammatical functions. In the present ERP study, Hungarian subject–verb dependencies were compared by manipulating the mismatch of number agreement between the sentence-initial noun phrase and the sentence-final intransitive verb, as well as the complexity of the intervening sentence material that interrupts the dependency. Possible lexical class and frequency or cloze-probability effects were revealed for the first two words of the intervening material when a separate baseline was used for each word. At the third word of the intervening material and at the main verb, ERPs were not modulated by complexity, but at the verb they were modulated by grammaticality: ungrammatical sentences enlarged the amplitude of both the LAN and P600 components. These results are in line with studies suggesting that the retrieval of the first element of a dependency is not influenced by distance from the second element, as the first element is directly accessible when needed for integration (e.g., McElree, 2000).
ERP (Event-Related Potentials); Locality effect; Anti-locality effect; Anterior Negativity (LAN); P600; Nonadjacent dependency; Sentence comprehension
The ability to create new meanings from combinations of words is one important function of the language system. We investigated the neural correlates of combinatorial semantic processing using fMRI. During scanning, participants performed a rating task on auditory word or pseudoword strings that differed in the presence of combinatorial and word-level semantic information. Stimuli included normal sentences comprised of thematically related words that could be readily combined to produce a more complex meaning, semantically incongruent sentences in which content words were randomly replaced with other content words, pseudoword sentences, and versions of these three sentence types in which syntactic structure was removed by randomly re-ordering the words. Several regions showed greater BOLD signal for stimuli with words than for those with pseudowords, including the left angular gyrus, left superior temporal sulcus, and left inferior frontal gyrus, suggesting that these areas are involved in semantic access at the single word level. In the angular and inferior frontal gyri these differences emerged early in the course of the hemodynamic response. An effect of combinatorial semantic structure was observed in the left angular gyrus and left lateral temporal lobe, which showed greater activation for normal compared to semantically incongruent sentences. These effects appeared later in the time course of the hemodynamic response, beginning after the entire stimulus had been presented. The data indicate a complex spatiotemporal pattern of activity associated with computation of word and sentence-level semantic information, and suggest a particular role for the left angular gyrus in processing overall sentence meaning.
The present study examines the involvement of syntactic and semantic/conceptual processes in the comprehension of pronouns in Dutch using event-related brain potentials (ERPs), replicating and extending an earlier study in German. Dutch and German are closely related and share the same logic in referring to non-diminutive and diminutive NPs (i.e., adding an affix which changes the syntactic gender into neutral). Both languages distinguish male (hij/er (he)) and female pronouns (zij/sie (she)), as well as a pronoun that refers to an entity of neutral gender (het/es (it)). However, the neutral pronoun het in Dutch is not only a pronoun; it is also the article of a neutral noun. To investigate the influence of this word class ambiguity on pronoun resolution, as well as to establish the generality of the finding of the German study, we manipulated syntactic and biological gender congruency between a personal pronoun and its antecedent in Dutch.
In Dutch, sentences with the word-class (pronoun/article) ambiguous pronoun het elicited an early negative shift (150–280 ms) which continued in the time frame of the N400. For sentences with a syntactically and biologically incongruent pronoun a P600 (in absence of an N400) was obtained, which was independent of the morphological form of the referent.
The neurophysiological pattern found for Dutch stimuli was clearly different from the German study, indicating that the processing of pronouns in these two languages differs. This can be explained in terms of language specific characteristics concerning the word class ambiguous neutral pronoun het. Moreover, in contrast to the findings in the German study, there was no clear effect caused by the morphological form of the referent. Additionally, in Dutch, the pronoun resolution in sentences with a non-diminutive antecedent seems to reflect processes of revision (P600 in absence of an N400), whereas for German evidence was found for clear involvement of conceptual/semantic processes as well as structure building processes (N400/P600 complex).
This study examined neural activity associated with establishing causal relationships across sentences during online comprehension. ERPs were measured while participants read and judged the relatedness of three-sentence scenarios in which the final sentence was highly causally related, intermediately related, or causally unrelated to its context. Lexico-semantic co-occurrence was matched across the three conditions using Latent Semantic Analysis. Critical words in causally unrelated scenarios evoked a larger N400 than words in both highly causally related and intermediately related scenarios, regardless of whether they appeared before or at the sentence-final position. At midline sites, the N400 to intermediately related sentence-final words was attenuated to the same degree as to highly causally related words, but otherwise the N400 to intermediately related words fell in between that evoked by highly causally related and causally unrelated words. No modulation of the Late Positivity/P600 component was observed across conditions. These results indicate that both simple and complex causal inferences can influence the earliest stages of semantically processing an incoming word. Further, they suggest that causal coherence, at the situation level, can influence incremental word-by-word discourse comprehension, even when semantic relationships between individual words are matched.
causal coherence; discourse; ERP; N400; P600; inference; language; situation model
To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches for the detection of a multitude of relation types which in addition are able to process large text corpora are needed but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactical parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100 node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. 
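As a rough illustration of how PropBank-style role labels can be turned into semantic relations, the sketch below maps labelled propositions to (agent, predicate, theme) triples. The input dictionaries are a hypothetical, simplified representation invented for this example and do not reproduce SENNA's actual output format; the role names A0, A1 and V follow common PropBank-style conventions.

```python
def propositions_to_relations(propositions):
    """Convert role-labelled propositions into (A0, predicate, A1) triples.

    Each proposition is assumed to be a dict with optional keys
    "V" (predicate), "A0" (agent-like argument) and "A1" (theme-like argument).
    Propositions missing any of the three are skipped.
    """
    relations = []
    for prop in propositions:
        agent, verb, theme = prop.get("A0"), prop.get("V"), prop.get("A1")
        if agent and verb and theme:
            relations.append((agent, verb, theme))
    return relations


# Hypothetical labelled output for two sentences.
labelled = [
    {"V": "phosphorylates", "A0": "AKT1", "A1": "GSK3B"},
    {"V": "increases", "A1": "expression"},  # no agent, so no relation
]
print(propositions_to_relations(labelled))
```

Because the predicate itself becomes the relation type, an approach along these lines is not tied to a fixed set of relations, at the price of missing relations whose arguments the labeller fails to identify.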
The presented approach bridges the gap between fast, cooccurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.
Philologists reconstructing ancient texts from variously miscopied manuscripts anticipated information theorists by centuries in conceptualizing information in terms of probability. An example is the editorial principle difficilior lectio potior (DLP): in choosing between otherwise acceptable alternative wordings in different manuscripts, “the more difficult reading [is] preferable.” As philologists at least as early as Erasmus observed (and as information theory's version of the second law of thermodynamics would predict), scribal errors tend to replace less frequent and hence entropically more information-rich wordings with more frequent ones. Without measurements, it has been unclear how effectively DLP has been used in the reconstruction of texts, and how effectively it could be used. We analyze a case history of acknowledged editorial excellence that mimics an experiment: the reconstruction of Lucretius's De Rerum Natura, beginning with Lachmann's landmark 1850 edition based on the two oldest manuscripts then known. Treating words as characters in a code, and taking the occurrence frequencies of words from a current, more broadly based edition, we calculate the difference in entropy information between Lachmann's 756 pairs of grammatically acceptable alternatives. His choices average 0.26±0.20 bits higher in entropy information (95% confidence interval, P = 0.005), as against the single bit that determines the outcome of a coin toss, and the average 2.16±0.10 bits (95%) of (predominantly meaningless) entropy information if the rarer word had always been chosen. As a channel width, 0.26±0.20 bits/word corresponds to a 0.79 (+0.09/−0.15) likelihood of the rarer word being the one accepted in the reference edition, which is consistent with the observed 547/756 = 0.72±0.03 (95%).
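The arithmetic linking bits per word to a likelihood can be sketched as follows. This is an interpretation, not a method stated in the text: it assumes that "channel width" refers to the capacity of a binary symmetric channel, C = 1 − H(p), and inverts that relation numerically. The word counts in the surprisal example are hypothetical.

```python
import math


def surprisal_bits(count, total):
    """Information content, in bits, of a word seen `count` times in `total` tokens."""
    return -math.log2(count / total)


def binary_entropy(p):
    """Entropy in bits of a binary outcome with probability p (0 < p < 1)."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity in bits of a binary symmetric channel that is correct with probability p."""
    return 1.0 - binary_entropy(p)


def probability_from_capacity(capacity, tol=1e-9):
    """Invert bsc_capacity on [0.5, 1) by bisection (capacity rises with p there)."""
    lo, hi = 0.5, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bsc_capacity(mid) < capacity:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical counts: a word half as frequent carries one extra bit.
extra_bits = surprisal_bits(1, 1000) - surprisal_bits(2, 1000)

# Under the binary-symmetric-channel assumption, 0.26 bits maps to roughly 0.79.
p = probability_from_capacity(0.26)
print(round(extra_bits, 3), round(p, 2))
```

Under the same assumption, the interval bounds of 0.06 and 0.46 bits map to likelihoods of about 0.64 and 0.88, in line with the asymmetric range quoted above.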
Statistically informed application of DLP can recover substantial amounts of semantically meaningful entropy information from noise; hence the extension copiosior informatione lectio potior, “the reading richer in information [is] preferable.” New applications of information theory promise continued refinement in the reconstruction of culturally fundamental texts.
The cerebral hemispheres have been shown to be differentially sensitive to sentence-level information; in particular, it has been suggested that only the left hemisphere (LH) makes predictions about upcoming items, while the right (RH) processes words in a more integrative fashion. The current study used event-related potentials to jointly examine the effects of expectancy and sentential constraint on word processing. Expected and unexpected but plausible words matched for contextual fit were inserted into strongly and weakly constraining sentence frames and presented to the left and right visual fields (LVF and RVF). Consistent with the prediction/integration view, the P2 was sensitive to constraint: words in strongly constraining contexts elicited larger P2s than those in less predictive contexts, for RVF/lh presentation only. N400 responses for both VFs departed from the typical pattern of amplitudes graded by cloze probability. Expected endings in strongly and weakly constraining contexts were facilitated to a similar degree with RVF/lh presentation, and expected endings in weakly constraining contexts were not facilitated compared to unexpected endings in those contexts for LVF/rh presentation. These data suggest that responses seen for central presentation reflect contributions from both hemispheres. Finally, a late positivity, larger for unexpected endings in strongly constraining contexts, observed for these stimuli with central presentation was not seen here for either VF. Thus, some phenomena observed with central presentation may be an emergent property of mechanisms that require interhemispheric cooperation. These data highlight the importance of understanding hemispheric asymmetries and their implications for normal language processing.
The brain is able to acquire information about an unknown word’s meaning from a highly constraining sentence context with minimal exposure. In this study, we investigate the potential contributions of the cerebral hemispheres to this ability. Undergraduates first read weakly or strongly constraining sentences completed by known or unknown (novel) words. Subsequently, their knowledge of these words was assessed via a lexical decision task in which they served as visual primes for lateralized target words varying in their semantic relationship to the primes (unrelated, identical or synonymous). As expected, smaller N400 amplitudes were seen for target words preceded by identical (vs. unrelated) known word primes, regardless of visual field of presentation. When Unknown words served as primes, N400 reductions to synonymous target words were observed only if the prime had appeared under High sentential constraint; targets appearing in the LVF/RH elicited a small N400 effect and modulation of a subsequent late positivity whereas those in the RVF/LH elicited modulation on the late positivity only. Unknown words initially seen in Low constraint contexts showed priming effects only in a late positivity and only in the RVF/LH. Strength of contextual constraint clearly seems to impact the hemispheres’ rapid acquisition of novel word meanings. N400 modulation for novel words under strong contextual constraint in the LVF/RH suggests that fast-mapped lexical representations may initially activate meanings that are weakly, distantly, associatively or thematically related. More extensive and bilateral semantic processing seems to occur at longer processing latencies (post N400).
ERPs; N400; word learning; fast-mapping; cerebral hemispheres; language; priming
Functional neuroimaging assessments of residual cognitive capacities, including those that support language, can improve diagnostic and prognostic accuracy in patients with disorders of consciousness. Due to the portability and relative inexpensiveness of electroencephalography, the N400 event-related potential component has been proposed as a clinically valid means to identify preserved linguistic function in non-communicative patients. Across three experiments, we show that changes in both stimuli and task demands significantly influence the probability of detecting statistically significant N400 effects — that is, the difference in N400 amplitudes caused by the experimental manipulation. In terms of task demands, passively heard linguistic stimuli were significantly less likely to elicit N400 effects than task-relevant stimuli. Due to the inability of the majority of patients with disorders of consciousness to follow task commands, the insensitivity of passive listening would impede the identification of residual language abilities even when such abilities exist. In terms of stimuli, passively heard normatively associated word pairs produced the highest detection rate of N400 effects (50% of the participants), compared with semantically-similar word pairs (0%) and high-cloze sentences (17%). This result is consistent with a prediction error account of N400 magnitude, with highly predictable targets leading to smaller N400 waves, and therefore larger N400 effects. Overall, our data indicate that non-repeating normatively associated word pairs provide the highest probability of detecting single-subject N400s during passive listening, and may thereby provide a clinically viable means of assessing residual linguistic function. We also show that more liberal analyses may further increase the detection-rate, but at the potential cost of increased false alarms.
• The N400 is a candidate marker of linguistic function after severe brain injury.
• The probability of detecting N400s is dependent on task demands.
• Passive listening is less sensitive than command-following.
• The probability of detecting N400s is dependent on stimuli choices.
• Word-pairs generated from association norms provide the highest sensitivity.
Vegetative state; Minimally conscious state; N400; Sensitivity; Language
Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Lately, researchers have applied data mining and machine learning techniques to these databases. They successfully built breast cancer classifiers that can help in early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format and/or solely consist of text reports. BI-RADS feature extraction from free text and consistency checks between recorded predictive variables and text reports are crucial to addressing this problem.
We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS features extraction algorithm for clinical data mining. It consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts’ input. It parses sentences detecting BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text, filtering out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.
BI-RADS; free text; lexicon; mammography; clinical data mining
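The three-stage scheme described above (syntax analyzer, concept finder, negation detector) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' algorithm: the small lexicon, the negation cues and the rule "a cue anywhere earlier in the sentence negates the concept" are hypothetical, and ultrasound filtering is omitted.

```python
import re

# Hypothetical fragment of a lexicon mapping surface terms to concepts.
LEXICON = {
    "mass": "MASS",
    "calcification": "CALCIFICATION",
    "calcifications": "CALCIFICATION",
    "architectural distortion": "ARCHITECTURAL_DISTORTION",
}
NEGATION_CUES = ("no", "without", "absent", "negative for")  # assumed cue list


def split_sentences(text):
    """Stage 1: split free text into sentences at terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_concepts(sentence):
    """Stage 2: return (concept, character offset) for each lexicon term found."""
    lowered = sentence.lower()
    found = []
    for term, concept in LEXICON.items():
        for match in re.finditer(rf"\b{re.escape(term)}\b", lowered):
            found.append((concept, match.start()))
    return found


def is_negated(sentence, position):
    """Stage 3: crude check for a negation cue before the concept in the sentence."""
    prefix = sentence[:position].lower()
    return any(re.search(rf"\b{re.escape(cue)}\b", prefix) for cue in NEGATION_CUES)


def extract_features(report):
    """Map each detected concept to True (asserted) or False (only negated)."""
    features = {}
    for sentence in split_sentences(report):
        for concept, position in find_concepts(sentence):
            asserted = not is_negated(sentence, position)
            features[concept] = features.get(concept, False) or asserted
    return features


report = (
    "There is a spiculated mass in the left breast. "
    "No suspicious calcifications are seen."
)
print(extract_features(report))
```

Handling negation per sentence keeps a negated finding in one sentence from cancelling an asserted finding elsewhere in the report, which is why sentence splitting comes first.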
Text definitions for entities within bio-ontologies are a cornerstone of the effort to gain a consensus in understanding and usage of those ontologies. Writing these definitions is, however, a considerable effort and there is often a lag between specification of the main part of an ontology (logical descriptions and definitions of entities) and the development of the text-based definitions. The goal of natural language generation (NLG) from ontologies is to take the logical description of entities and generate fluent natural language. The application described here uses NLG to automatically provide text-based definitions from an ontology that has logical descriptions of its entities, so avoiding the bottleneck of authoring these definitions by hand.
To produce the descriptions, the program collects all the axioms relating to a given entity, groups them according to common structure, realises each group through an English sentence, and assembles the resulting sentences into a paragraph, to form as ‘coherent’ a text as possible without human intervention. Sentence generation is accomplished using a generic grammar based on logical patterns in OWL, together with a lexicon for realising atomic entities. We have tested our output for the Experimental Factor Ontology (EFO) using a simple survey strategy to explore the fluency of the generated text and how well it conveys the underlying axiomatisation. Two rounds of survey and improvement show that overall the generated English definitions are found to convey the intended meaning of the axiomatisation in a satisfactory manner. The surveys also suggested that one form of generated English will not be universally liked; that intrusion of too much ‘formal ontology’ was not liked; and that too much explicit exposure of OWL semantics was also not liked.
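The collect, group, realise and assemble steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of axioms, the two sentence templates, the `PROPERTY_PHRASES` lexicon and the toy axioms are all invented here and do not reproduce the OWL-based grammar or any actual EFO content.

```python
from collections import OrderedDict

# Hypothetical lexicon that realises property names as English verb phrases.
PROPERTY_PHRASES = {
    "derives_from": "derives from",
    "bearer_of": "is the bearer of",
}

# Toy axioms as (kind, subject, property, filler); invented for illustration.
AXIOMS = [
    ("subclass", "HeLa", None, "cell line"),
    ("some", "HeLa", "derives_from", "Homo sapiens"),
    ("some", "HeLa", "derives_from", "cervix"),
    ("some", "HeLa", "bearer_of", "cervical carcinoma"),
    ("subclass", "cervix", None, "organ"),
]


def join_list(items):
    """Realise a list of fillers as 'a', 'a and b' or 'a, b and c'."""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def describe(entity, axioms):
    """Collect the entity's axioms, group them by structure, realise each group."""
    groups = OrderedDict()
    for kind, subject, prop, filler in axioms:
        if subject == entity:
            # Axioms with the same kind and property share one sentence.
            groups.setdefault((kind, prop), []).append(filler)

    sentences = []
    for (kind, prop), fillers in groups.items():
        if kind == "subclass":
            sentences.append(f"{entity} is a kind of {join_list(fillers)}.")
        elif kind == "some":
            phrase = PROPERTY_PHRASES.get(prop, prop)
            sentences.append(f"{entity} {phrase} {join_list(fillers)}.")
    # Assemble the sentences into a single paragraph.
    return " ".join(sentences)
```

With the toy data, `describe("HeLa", AXIOMS)` merges the two `derives_from` axioms into one sentence, which is the kind of aggregation that keeps generated definitions from reading as a list of raw axioms.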
Our prototype tools can generate reasonable paragraphs of English text that can act as definitions. The definitions were found acceptable by our survey and, as a result, the developers of EFO are sufficiently satisfied with the output that the generated definitions have been incorporated into EFO. Whilst not a substitute for hand-written textual definitions, our generated definitions are a useful starting point.
An on-line version of the NLG text definition tool can be found at http://swat.open.ac.uk/tools/. The questionnaire and sample generated text definitions may be found at http://mcs.open.ac.uk/nlg/SWAT/bio-ontologies.html.
It is well known that both semantic and syntactic information play a role in pronoun resolution in sentences. However, it is unclear what the relative contribution of these sources of information is for the establishment of a coreferential relationship between the pronoun and the antecedent in combination with a local structural case constraint on the pronoun (i.e. case assignment of a pronoun under preposition governing). In a prepositional phrase in German and Dutch, it is the preposition that assigns case to the pronoun. Furthermore, in these languages different overtly case-marked pronouns are used to refer to male and female persons. Thus, one can manipulate biological/syntactic gender features separately from case marking features.
The major aim of this study was to determine what the influence of gender information in combination with a local structural case constraint is on the processing of a personal pronoun in a sentence.
Event-related brain potential (ERP) experiments were performed in German and in Dutch. In a word-by-word sentence reading study in each language, gender congruency between the antecedent and the pronoun was manipulated and/or case assignment by the preposition was violated while ERPs of young native speakers were recorded.
The German and the Dutch ERP data showed an enlarged negativity broadly distributed starting approximately 350 ms after onset of the pronoun followed by a late positivity for gender violations. For syntactic incongruencies without gender violations only a positivity was present. The Dutch data showed an earlier onset of the positivity in comparison to German.
Finding negativities and positivities for conditions with a gender violation indicates that pronoun resolution with gender incongruency between the pronoun and the antecedent suffers from semantic as well as syntactic integration problems. The presence of a positivity for the syntactically incongruent conditions without gender violations suggests that the processing of incorrect case marking without a gender violation gives rise to syntactic but not semantic integration problems. We suggest that the more prominent case violation in Dutch caused the earlier onset of the positivity in the Dutch study. In addition, the pattern of ERP effects shows that both case and gender information are used almost immediately, implying that the local structural constraint affects the resolution process with more processing activity than for a pronoun for which only one source of information is violated or incongruent.
Children in grades one to four completed two sentence construction tasks: (a) Write one complete sentence about a topic prompt (sentence integrity, Study 1); and (b) Integrate two sentences into one complete sentence without changing meaning (sentence combining, Study 2). Most, but not all, children in first through fourth grade could write just one sentence. The sentence integrity task was not correlated with sentence combining until fourth grade, when in multiple regression, sentence integrity explained unique variance in sentence combining, along with spelling. Word-level skills (morphology in first and spelling in second through fourth grade) consistently explained unique variance in sentence combining. Thus, many beginning writers have syntactic knowledge of what constitutes a complete sentence, but not until fourth grade do both syntax and transcription contribute uniquely to flexible translation of ideas into the syntax of a written sentence. In Study 3, eleven syntactic categories were identified in single- and multi-sentence composing from second to fifth grade. Complex clauses (independent plus subordinate) occurred more often on single-sentence composing, but single independent clauses occurred more often on multi-sentence composing. For multi-sentence text, more single, independent clauses were produced by pen than keyboard in grades 3 to 7. The most frequent category of complex clauses in multi-sentence texts varied with genre (relative for essays and subordinate for narratives). Thus, in addition to syntax-level sentence construction and word-level transcription, amount of translation (number of sentences), mode of transcription, and genre for multiple sentence text also influence translation of ideas into written language of child writers. Results of these studies employing descriptive linguistic analyses are discussed in reference to cognitive theory of writing development.
Sentence construction; Single-sentence composing; Multi-sentence composing; Syntactic level of language; Written syntax
Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources.
We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% to 90%.
The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may viably be used to train IE components, such as semantic role labellers. The corpus and annotation guidelines are freely available for academic purposes.
During auditory language comprehension, listeners need to rapidly extract meaning from the continuous speech-stream. It is a matter of debate when and how contextual information constrains the activation of lexical representations in meaningful contexts. Electrophysiological studies of spoken language comprehension have identified an event-related potential (ERP) that was sensitive to phonological properties of speech, which was termed the phonological mismatch negativity (PMN). With the PMN, early lexical processing could potentially be distinguished from processes of semantic integration in spoken language comprehension. However, the sensitivity of the PMN to phonological processing per se has been questioned, and it has additionally been suggested that the “PMN” is not separable from the N400, an ERP that is sensitive to semantic aspects of the input. Here, we investigated whether a separable PMN exists and whether it reflects purely phonological aspects of the speech input. In the present experiment, ERPs were recorded from healthy young adults (N = 24) while they listened to sentences and word lists, in which we manipulated semantic and phonological expectation and congruency of the final word. ERPs sensitive to phonological processing were elicited only when phonological expectancy was violated in lists of words, but not during normal sentential processing. This suggests a differential role of phonological processing in more or less meaningful contexts and indicates a very early influence of the overall context on lexical processing in sentences.
Language; Speech comprehension; Phonology; Semantics; Event-related potential (ERP)
Detection of sentences that describe protein-protein interactions (PPIs) in biomedical publications is a challenging and unresolved pattern recognition problem. Many state-of-the-art approaches for this task employ kernel classification methods, in particular support vector machines (SVMs). In this work we propose a novel data integration approach that utilises semantic kernels and a kernel classification method that is a probabilistic analogue to SVMs. Semantic kernels are created from statistical information gathered from large amounts of unlabelled text using lexical semantic models. Several semantic kernels are then fused into an overall composite classification space. In this initial study, we use simple features in order to examine whether the use of combinations of kernels constructed using word-based semantic models can improve PPI sentence detection.
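The idea of fusing several kernels into one composite classification space can be illustrated with a fixed-weight sum of Gram matrices, which is a valid kernel whenever the weights are non-negative. The Python sketch below is an assumption-laden simplification: the two base kernels, the toy feature vectors and the hand-picked weights are hypothetical, and the probabilistic multiple kernel learning that infers the weights in the paper is not reproduced here.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Plain Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def cosine_kernel(x, y):
    """Cosine similarity, standing in for a word-based semantic kernel."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def gram_matrix(kernel, data):
    """Evaluate a kernel on every pair of items."""
    return [[kernel(x, y) for y in data] for x in data]


def combine(gram_matrices, weights):
    """Weighted sum of Gram matrices; weights are assumed non-negative."""
    n = len(gram_matrices[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, gram_matrices)) for j in range(n)]
        for i in range(n)
    ]


# Toy sentence feature vectors (hypothetical).
DATA = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
COMPOSITE = combine(
    [gram_matrix(gaussian_kernel, DATA), gram_matrix(cosine_kernel, DATA)],
    [0.5, 0.5],
)
```

The composite matrix can be passed to any classifier that accepts a precomputed kernel. In a learned setting the weights would be estimated from labelled data, and a large weight would mark a kernel as more discriminative.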
We show that combinations of semantic kernels lead to statistically significant improvements in recognition rates and receiver operating characteristic (ROC) scores over the plain Gaussian kernel, when applied to a well-known labelled collection of abstracts. The proposed kernel composition method also allows us to automatically infer the most discriminative kernels.
The results from this paper indicate that using semantic information from unlabelled text, and combinations of such information, can be valuable for classification of short texts such as PPI sentences. This study, however, is only a first step in evaluation of semantic kernels and probabilistic multiple kernel learning in the context of PPI detection. The method described herein is modular, and can be applied with a variety of feature types, kernels, and semantic models, in order to facilitate full extraction of interacting proteins.
Natural language processing is an important tool in biomedicine, and fails without successful segmentation of words and sentences. Tokenization is a form of segmentation that identifies boundaries separating semantic units, for example words, dates, numbers and symbols, within a text. We sought to construct a highly generalizable tokenization algorithm with no prior knowledge of characters or their function, based solely on the inherent statistical properties of token and sentence boundaries. Tokenizing clinician-entered free text, we achieved precision and recall of 92% and 93%, respectively, compared to a whitespace token boundary detection algorithm. We classified over 80% of punctuation characters correctly, based on manual disambiguation with high inter-rater agreement (kappa = 0.916). Our algorithm effectively discovered properties of whitespace and punctuation in the corpus without prior knowledge of either. Given the dynamic nature of biomedical language, and the variety of distinct sublanguages used, the effectiveness and generalizability of our novel tokenization algorithm make it a valuable tool.
Receiving extraneous articles in response to a query submitted to MEDLINE/PubMed is common. When submitting a multi-word query (which is the majority of queries submitted), the presence of all query words within each article may be a necessary condition for retrieving relevant articles, but not sufficient. Ideally a relationship between the query words in the article is also required. We propose that if two words occur within an article, the probability that a relation between them is explained is higher when the words occur within adjacent sentences versus remote sentences. Therefore, sentence-level concurrence can be used as a surrogate for existence of the relationship between the words.
To avoid irrelevant articles, one solution would be to increase the search specificity. Another solution is to estimate a relevance score to sort the retrieved articles. However, among the >30 retrieval services available for MEDLINE, only a few estimate a relevance score, and none detects and incorporates the relation between the query words as part of the relevance score.
We have developed "Relemed", a search engine for MEDLINE. Relemed increases specificity and precision of retrieval by searching for query words within sentences rather than the whole article. It uses sentence-level concurrence as a statistical surrogate for the existence of relationship between the words. It also estimates a relevance score and sorts the results on this basis, thus shifting irrelevant articles lower down the list.
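Sentence-level matching as a proxy for a relation between query words can be sketched as a small scoring function in Python. The four-level score below (same sentence, adjacent sentences, same article, no match) is a hypothetical scheme chosen for illustration. The actual Relemed relevance metric is not specified in this abstract, and the function names and example articles are invented.

```python
import re


def sentence_word_sets(text):
    """Split text into sentences and return the lowercase word set of each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [set(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]


def relevance(query, article):
    """Hypothetical score: 3 same sentence, 2 adjacent sentences, 1 article, 0 none."""
    terms = set(query.lower().split())
    sentences = sentence_word_sets(article)
    if not terms <= set().union(*sentences):
        return 0  # at least one query word is missing from the article
    if any(terms <= words for words in sentences):
        return 3  # all query words occur together in one sentence
    if any(terms <= (a | b) for a, b in zip(sentences, sentences[1:])):
        return 2  # all query words occur within two adjacent sentences
    return 1      # words are present but only in remote sentences


def rank(query, articles):
    """Sort articles so that the highest-scoring ones come first."""
    return sorted(articles, key=lambda article: relevance(query, article), reverse=True)


ARTICLES = [
    "Ibuprofen was given daily.",
    "Aspirin was given daily. Patients were followed for a year. Stroke incidence was recorded.",
    "Aspirin was given daily. Stroke incidence fell.",
    "Aspirin reduces the risk of stroke. Other drugs were not studied.",
]
```

Ranking `ARTICLES` for the query `"aspirin stroke"` moves the article that mentions both words in one sentence to the top and the article lacking a query word to the bottom, mirroring the behaviour described above.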
In two case studies, we demonstrate that the most relevant articles appear at the top of the Relemed results, while this is not necessarily the case with a PubMed search. We have also shown that a Relemed search includes not only all the articles retrieved by PubMed, but potentially additional relevant articles, due to the extended 'automatic term mapping' and text-word searching features implemented in Relemed.
By using sentence-level matching, Relemed can deliver higher specificity, thus eliminating more false-positive articles. By introducing an appropriate relevance metric, the most relevant articles on which the user wishes to focus are listed first. Relemed also shrinks the displayed text, and hence the time spent scanning the articles.
Studies demonstrating the involvement of motor brain structures in language processing typically focus on time windows beyond the latencies of lexical-semantic access. Consequently, such studies remain inconclusive regarding whether motor brain structures are recruited directly in language processing or through post-linguistic conceptual imagery. In the present study, we introduce a grip-force sensor that allows online measurements of language-induced motor activity during sentence listening. We use this tool to investigate whether language-induced motor activity remains constant or is modulated in negative, as opposed to affirmative, linguistic contexts.
Participants listened to spoken action target words in either affirmative or negative sentences while holding a sensor in a precision grip. The participants were asked to count the sentences containing the name of a country to ensure attention. The grip force signal was recorded continuously. The action words elicited an automatic and significant enhancement of the grip force starting at approximately 300 ms after target word onset in affirmative sentences; however, no comparable grip force modulation was observed when these action words occurred in negative contexts.
Our findings demonstrate that this simple experimental paradigm can be used to study the online crosstalk between language and the motor systems in an ecological and economical manner. Our data further confirm that the motor brain structures that can be called upon during action word processing are not mandatorily involved; the crosstalk is asymmetrically governed by the linguistic context and not vice versa.