Related Articles

1.  When Language Switching has No Apparent Cost: Lexical Access in Sentence Context 
We report two experiments that investigate the effects of sentence context on bilingual lexical access in Spanish and English. Highly proficient Spanish-English bilinguals read sentences in Spanish and English that included a marked word to be named. The word was either a cognate with similar orthography and/or phonology in the two languages, or a matched non-cognate control. Sentences appeared in one language alone (i.e., Spanish or English) and target words were not predictable on the basis of the preceding semantic context. In Experiment 1, we mixed the language of the sentence within a block such that sentences appeared in an alternating run in Spanish or in English. These conditions partly resemble normally occurring inter-sentential code-switching. In these mixed-language sequences, cognates were named faster than non-cognates in both languages. There were no effects of switching the language of the sentence. In Experiment 2, with Spanish-English bilinguals matched closely to those who participated in the first experiment, we blocked the language of the sentences to encourage language-specific processes. The results were virtually identical to those of the mixed-language experiment. In both cases, target cognates were named faster than non-cognates, and the magnitude of the effect did not change according to the broader context. Taken together, the results support the predictions of the Bilingual Interactive Activation + Model (Dijkstra and van Heuven, 2002) in demonstrating that bilingual lexical access is language non-selective even under conditions in which language-specific cues should enable selective processing. They also demonstrate that, in contrast to lexical switching from one language to the other, inter-sentential code-switching of the sort in which bilinguals frequently engage, imposes no significant costs to lexical processing.
PMCID: PMC3668438  PMID: 23750141
bilingualism; language switching; switch costs; lexical access; sentence context; cognates
2.  Sentence-Based Attentional Mechanisms in Word Learning: Evidence from a Computational Model 
When looking for the referents of novel nouns, adults and young children are sensitive to cross-situational statistics (Yu and Smith, 2007; Smith and Yu, 2008). In addition, the linguistic context that a word appears in has been shown to act as a powerful attention mechanism for guiding sentence processing and word learning (Landau and Gleitman, 1985; Altmann and Kamide, 1999; Kako and Trueswell, 2000). Koehne and Crocker (2010, 2011) investigate the interaction between cross-situational evidence and guidance from the sentential context in an adult language learning scenario. Their studies reveal that these learning mechanisms interact in a complex manner: they can be used in a complementary way when context helps reduce referential uncertainty; they influence word learning about equally strongly when cross-situational and contextual evidence are in conflict; and contextual cues block aspects of cross-situational learning when both mechanisms are independently applicable. To address this complex pattern of findings, we present a probabilistic computational model of word learning which extends a previous cross-situational model (Fazly et al., 2010) with an attention mechanism based on sentential cues. Our model uses a framework that seamlessly combines the two sources of evidence in order to study their emerging pattern of interaction during the process of word learning. Simulations of the experiments of Koehne and Crocker (2010, 2011) reveal an overall pattern of results that is in line with their findings. Importantly, we demonstrate that our model does not need to explicitly assign priority to either source of evidence in order to produce these results: learning patterns emerge as a result of a probabilistic interaction between the two cue types. Moreover, using a computational model allows us to examine the developmental trajectory of the differential roles of cross-situational and sentential cues in word learning.
PMCID: PMC3387725  PMID: 22783211
probabilistic modeling; cross-situational word learning; syntactic bootstrapping; context-based attention mechanisms
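As a rough illustration of the cross-situational learning described in entry 2, the sketch below accumulates word-referent evidence across ambiguous scenes and normalizes it into probabilities. It is a simplified count-based stand-in, not the model of Fazly et al. (2010) or the authors' extension. The function names, the toy words and referents, and the `attention` weights (a hypothetical way to mimic sentential cues) are all assumptions made for illustration.

```python
from collections import defaultdict

def learn(situations, attention=None):
    """Accumulate word-referent association scores across situations.

    situations: list of (words, referents) pairs.
    attention: optional dict mapping (word, referent) to a positive weight;
               a hypothetical stand-in for sentential cues (default 1.0).
    Returns a dict: word -> {referent: probability}.
    """
    attention = attention or {}
    scores = defaultdict(lambda: defaultdict(float))
    for words, referents in situations:
        for w in words:
            # Share one unit of evidence among the visible referents,
            # weighted by any contextual attention.
            weights = {r: attention.get((w, r), 1.0) for r in referents}
            total = sum(weights.values())
            for r, wt in weights.items():
                scores[w][r] += wt / total
    # Normalize each word's scores so they sum to 1.
    return {
        w: {r: s / sum(rs.values()) for r, s in rs.items()}
        for w, rs in scores.items()
    }

def best_referent(model, word):
    """Return the referent with the highest probability for a word."""
    return max(model[word], key=model[word].get)

# Toy data: each scene is individually ambiguous, but the pairings
# become clear across scenes.
situations = [
    (["dax", "blick"], ["DOG", "BALL"]),
    (["dax", "wug"], ["DOG", "CUP"]),
    (["blick", "wug"], ["BALL", "CUP"]),
]
model = learn(situations)
print(best_referent(model, "dax"))  # DOG
```

Passing an `attention` dictionary shifts evidence toward a referent within a single scene, which loosely mirrors how a contextual cue could reduce referential uncertainty.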
3.  The phonemic restoration effect reveals pre-N400 effect of supportive sentence context in speech perception 
Brain research  2010;1361:54-66.
The phonemic restoration effect refers to the tendency for people to hallucinate a phoneme replaced by a non-speech sound (e.g., a tone) in a word. This illusion can be influenced by preceding sentential context providing information about the likelihood of the missing phoneme. The saliency of the illusion suggests that supportive context can affect relatively low (phonemic or lower) levels of speech processing. Indeed, a previous event-related brain potential (ERP) investigation of the phonemic restoration effect found that the processing of coughs replacing high versus low probability phonemes in sentential words differed from each other as early as the auditory N1 (120-180 ms post-stimulus); this result, however, was confounded by physical differences between the high and low probability speech stimuli, thus it could have been caused by factors such as habituation and not by supportive context. We conducted a similar ERP experiment avoiding this confound by using the same auditory stimuli preceded by text that made critical phonemes more or less probable. We too found the robust N400 effect of phoneme/word probability, but did not observe the early N1 effect. We did however observe a left posterior effect of phoneme/word probability around 192-224 ms -- clear evidence of a relatively early effect of supportive sentence context in speech comprehension distinct from the N400.
PMCID: PMC2963680  PMID: 20831863
Phonemic restoration effect; speech comprehension; N400; ERP
4.  To predict or not to predict: Age-related differences in the use of sentential context 
Psychology and aging  2012;27(4):975-988.
Older adults (as a group) are less likely than younger adults to engage in an anticipatory mode of language comprehension, failing to successfully pre-activate information about upcoming likely (predictable) words during online processing. To assess (within one set of materials) age-related changes in the use of sentential context to affect processing of predictable words and in the consequences of violating predictions, event-related brain potentials were recorded while older adults read sentences that varied in sentence-level constraint and expectancy of sentence-final words. Strongly constraining sentences were completed by their most expected, predictable words and weakly constraining sentences were completed by their most expected, less predictable words. Both types of sentences also were completed by unexpected (but plausible) words. Older adults showed reduced and delayed effects of sentential context on processing predictable words. Whereas younger adults elicit an enhanced positive ERP (starting around 500 ms post-stimulus onset, largest over prefrontal electrode sites), specifically for unexpected words that violate strong expectancies for a different word, older adults as a group did not exhibit this neural consequence of disconfirmed predictions. Older adults were instead more likely to show a left-lateralized frontal negativity for predictable items. This ERP response has been attributed to processes needed to revisit contextual material in forming an interpretation of message-level meaning, which may be more likely when anticipatory modes of comprehension are not engaged. Taken together, the results suggest that normal aging can affect allocation of resources to different cognitive and neural pathways in achieving comprehension outcomes.
PMCID: PMC3685629  PMID: 22775363
language; event-related potentials; sentential context; N400; frontal negativity
5.  Drug side effect extraction from clinical narratives of psychiatry and psychology patients 
Objective
To extract physician-asserted drug side effects from electronic medical record clinical narratives.
Materials and methods
Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System.
Results
The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction.
Discussion
The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction.
Conclusion
Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.
PMCID: PMC3241172  PMID: 21946242
Natural language processing; machine learning; information extraction; electronic medical record; Information storage and retrieval (text and images); discovery; and text and data mining methods; Other methods of information extraction; Natural-language processing; bioinformatics; Ontologies; Knowledge representations, Controlled terminologies and vocabularies; Information Retrieval; HIT Data Standards; Human-computer interaction and human-centered computing; Providing just-in-time access to the biomedical literature and other health information; Applications that link biomedical knowledge from diverse primary sources (includes automated indexing); Linking the genotype and phenotype
6.  Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature 
PLoS Biology  2004;2(11):e309.
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at or via WormBase at
With the increasing availability of full-text scientific papers online, new tools, such as Textpresso, will help to extract information and knowledge from research literature.
PMCID: PMC517822  PMID: 15383839
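The sentence-level category search described in entry 6 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Textpresso's implementation: the lexicon, the gene names (`foo-1`, `bar-2`, `baz-3`), the corpus sentences, and the functions `split_sentences`, `tag_sentence`, and `search` are all invented here, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical mini-lexicon: term -> category. The gene names are invented
# placeholders; the abstract reports a real lexicon of about 14,500 entries.
LEXICON = {
    "foo-1": "gene",
    "bar-2": "gene",
    "baz-3": "gene",
    "interacts": "association",
    "binds": "association",
    "regulates": "regulation",
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tag_sentence(sentence):
    """Return (term, category) pairs for lexicon terms found in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
    return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

def search(text, required):
    """Return sentences that contain enough distinct terms per category.

    required: dict mapping category -> minimum number of distinct terms.
    """
    hits = []
    for sentence in split_sentences(text):
        tags = tag_sentence(sentence)
        ok = all(
            len({term for term, cat in tags if cat == category}) >= n
            for category, n in required.items()
        )
        if ok:
            hits.append(sentence)
    return hits

# Invented example corpus.
corpus = (
    "FOO-1 interacts with BAR-2 in this assay. "
    "BAZ-3 is mentioned alone here. "
    "BAR-2 regulates a downstream pathway."
)

# Query: two distinct genes plus one association term in the same sentence.
results = search(corpus, {"gene": 2, "association": 1})
print(results)  # ['FOO-1 interacts with BAR-2 in this assay.']
```

Restricting matches to a single sentence is what makes a query such as "two genes and an interaction term" more selective than a document-level keyword search.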
7.  Time course of semantic processes during sentence comprehension: an fMRI study 
NeuroImage  2007;36(3):924-932.
The ability to create new meanings from combinations of words is one important function of the language system. We investigated the neural correlates of combinatorial semantic processing using fMRI. During scanning, participants performed a rating task on auditory word or pseudoword strings that differed in the presence of combinatorial and word-level semantic information. Stimuli included normal sentences comprised of thematically related words that could be readily combined to produce a more complex meaning, semantically incongruent sentences in which content words were randomly replaced with other content words, pseudoword sentences, and versions of these three sentence types in which syntactic structure was removed by randomly re-ordering the words. Several regions showed greater BOLD signal for stimuli with words than for those with pseudowords, including the left angular gyrus, left superior temporal sulcus, and left inferior frontal gyrus, suggesting that these areas are involved in semantic access at the single word level. In the angular and inferior frontal gyri these differences emerged early in the course of the hemodynamic response. An effect of combinatorial semantic structure was observed in the left angular gyrus and left lateral temporal lobe, which showed greater activation for normal compared to semantically incongruent sentences. These effects appeared later in the time course of the hemodynamic response, beginning after the entire stimulus had been presented. The data indicate a complex spatiotemporal pattern of activity associated with computation of word and sentence-level semantic information, and suggest a particular role for the left angular gyrus in processing overall sentence meaning.
PMCID: PMC1941617  PMID: 17500009
8.  Finding the Right Word: Hemispheric Asymmetries in the Use of Sentence Context Information 
Neuropsychologia  2007;45(13):3001-3014.
The cerebral hemispheres have been shown to be differentially sensitive to sentence-level information; in particular, it has been suggested that only the left hemisphere (LH) makes predictions about upcoming items, while the right (RH) processes words in a more integrative fashion. The current study used event-related potentials to jointly examine the effects of expectancy and sentential constraint on word processing. Expected and unexpected but plausible words matched for contextual fit were inserted into strongly and weakly constraining sentence frames and presented to the left and right visual fields (LVF and RVF). Consistent with the prediction/integration view, the P2 was sensitive to constraint: words in strongly constraining contexts elicited larger P2s than those in less predictive contexts, for RVF/lh presentation only. N400 responses for both VFs departed from the typical pattern of amplitudes graded by cloze probability. Expected endings in strongly and weakly constraining contexts were facilitated to a similar degree with RVF/lh presentation, and expected endings in weakly constraining contexts were not facilitated compared to unexpected endings in those contexts for LVF/rh presentation. These data suggest that responses seen for central presentation reflect contributions from both hemispheres. Finally, a late positivity, larger for unexpected endings in strongly constraining contexts, observed for these stimuli with central presentation was not seen here for either VF. Thus, some phenomena observed with central presentation may be an emergent property of mechanisms that require interhemispheric cooperation. These data highlight the importance of understanding hemispheric asymmetries and their implications for normal language processing.
PMCID: PMC2066191  PMID: 17659309
9.  Getting it right: Word learning across the hemispheres 
Neuropsychologia  2013;51(5):825-837.
The brain is able to acquire information about an unknown word’s meaning from a highly constraining sentence context with minimal exposure. In this study, we investigate the potential contributions of the cerebral hemispheres to this ability. Undergraduates first read weakly or strongly constraining sentences completed by known or unknown (novel) words. Subsequently, their knowledge of these words was assessed via a lexical decision task in which they served as visual primes for lateralized target words varying in their semantic relationship to the primes (unrelated, identical or synonymous). As expected, smaller N400 amplitudes were seen for target words preceded by identical (vs. unrelated) known word primes, regardless of visual field of presentation. When Unknown words served as primes, N400 reductions to synonymous target words were observed only if the prime had appeared under High sentential constraint; targets appearing in the LVF/RH elicited a small N400 effect and modulation of a subsequent late positivity whereas those in the RVF/LH elicited modulation on the late positivity only. Unknown words initially seen in Low constraint contexts showed priming effects only in a late positivity and only in the RVF/LH. Strength of contextual constraint clearly seems to impact the hemispheres’ rapid acquisition of novel word meanings. N400 modulation for novel words under strong contextual constraint in the LVF/RH suggests that fast-mapped lexical representations may initially activate meanings that are weakly, distantly, associatively or thematically-related. More extensive and bilateral semantic processing seems to occur at longer processing latencies (post N400).
PMCID: PMC3656665  PMID: 23416731
ERPs; N400; word learning; fast-mapping; cerebral hemispheres; language; priming
10.  The reliability of the N400 in single subjects: Implications for patients with disorders of consciousness 
NeuroImage : Clinical  2014;4:788-799.
Functional neuroimaging assessments of residual cognitive capacities, including those that support language, can improve diagnostic and prognostic accuracy in patients with disorders of consciousness. Due to the portability and relative inexpensiveness of electroencephalography, the N400 event-related potential component has been proposed as a clinically valid means to identify preserved linguistic function in non-communicative patients. Across three experiments, we show that changes in both stimuli and task demands significantly influence the probability of detecting statistically significant N400 effects — that is, the difference in N400 amplitudes caused by the experimental manipulation. In terms of task demands, passively heard linguistic stimuli were significantly less likely to elicit N400 effects than task-relevant stimuli. Due to the inability of the majority of patients with disorders of consciousness to follow task commands, the insensitivity of passive listening would impede the identification of residual language abilities even when such abilities exist. In terms of stimuli, passively heard normatively associated word pairs produced the highest detection rate of N400 effects (50% of the participants), compared with semantically-similar word pairs (0%) and high-cloze sentences (17%). This result is consistent with a prediction error account of N400 magnitude, with highly predictable targets leading to smaller N400 waves, and therefore larger N400 effects. Overall, our data indicate that non-repeating normatively associated word pairs provide the highest probability of detecting single-subject N400s during passive listening, and may thereby provide a clinically viable means of assessing residual linguistic function. We also show that more liberal analyses may further increase the detection-rate, but at the potential cost of increased false alarms.
• The N400 is a candidate marker of linguistic function after severe brain injury.
• The probability of detecting N400s is dependent on task demands.
• Passive listening is less sensitive than command-following.
• The probability of detecting N400s is dependent on stimuli choices.
• Word-pairs generated from association norms provide the highest sensitivity.
PMCID: PMC4055893  PMID: 24936429
Vegetative state; Minimally conscious state; N400; Sensitivity; Language
11.  Differences in the processing of anaphoric reference between closely related languages: neurophysiological evidence 
BMC Neuroscience  2008;9:55.
The present study examines the involvement of syntactic and semantic/conceptual processes in the comprehension of pronouns in Dutch using the technique of event-related brain potentials (ERPs) replicating and extending an earlier study in German. Dutch and German are closely related and share the same logic in referring to non-diminutive and diminutive NPs (i.e. adding an affix which changes the syntactic gender into neutral). Both languages separate male (hij/er (he)) and female pronouns (zij/sie (she)), as well as a pronoun that refers to an entity of neutral gender, (het/es (it)). However, the neutral pronoun het in Dutch is not only a pronoun, it also is the article of a neutral noun. To investigate the influence of this word class ambiguity on pronoun resolution, as well as to establish the generality of the finding of the German study we manipulated syntactic and biological gender congruency between a personal pronoun and its antecedent in Dutch.
In Dutch, sentences with the word-class (pronoun/article) ambiguous pronoun het elicited an early negative shift (150–280 ms) which continued in the time frame of the N400. For sentences with a syntactically and biologically incongruent pronoun a P600 (in absence of an N400) was obtained, which was independent of the morphological form of the referent.
The neurophysiological pattern found for Dutch stimuli was clearly different from the German study, indicating that the processing of pronouns in these two languages differs. This can be explained in terms of language specific characteristics concerning the word class ambiguous neutral pronoun het. Moreover, in contrast to the findings in the German study, there was no clear effect caused by the morphological form of the referent. Additionally, in Dutch, the pronoun resolution in sentences with a non-diminutive antecedent seems to reflect processes of revision (P600 in absence of an N400), whereas for German evidence was found for clear involvement of conceptual/semantic processes as well as structure building processes (N400/P600 complex).
PMCID: PMC2446385  PMID: 18588672
12.  Electrophysiological differentiation of phonological and semantic integration in word and sentence contexts 
Brain research  2006;1146:85-100.
During auditory language comprehension, listeners need to rapidly extract meaning from the continuous speech-stream. It is a matter of debate when and how contextual information constrains the activation of lexical representations in meaningful contexts. Electrophysiological studies of spoken language comprehension have identified an event-related potential (ERP) that was sensitive to phonological properties of speech, which was termed the phonological mismatch negativity (PMN). With the PMN, early lexical processing could potentially be distinguished from processes of semantic integration in spoken language comprehension. However, the sensitivity of the PMN to phonological processing per se has been questioned, and it has additionally been suggested that the “PMN” is not separable from the N400, an ERP that is sensitive to semantic aspects of the input. Here, we investigated whether or not a separable PMN exists and if it reflects purely phonological aspects of the speech input. In the present experiment, ERPs were recorded from healthy young adults (N = 24) while they listened to sentences and word lists, in which we manipulated semantic and phonological expectation and congruency of the final word. ERPs sensitive to phonological processing were elicited only when phonological expectancy was violated in lists of words, but not during normal sentential processing. This suggests a differential role of phonological processing in more or less meaningful contexts and indicates a very early influence of the overall context on lexical processing in sentences.
PMCID: PMC1853329  PMID: 16952338
Language; Speech comprehension; Phonology; Semantics; Event-related potential (ERP)
13.  Child writers’ construction and reconstruction of single sentences and construction of multi-sentence texts: contributions of syntax and transcription to translation 
Reading and writing  2011;24(2):151-182.
Children in grades one to four completed two sentence construction tasks: (a) Write one complete sentence about a topic prompt (sentence integrity, Study 1); and (b) Integrate two sentences into one complete sentence without changing meaning (sentence combining, Study 2). Most, but not all, children in first through fourth grade could write just one sentence. The sentence integrity task was not correlated with sentence combining until fourth grade, when in multiple regression, sentence integrity explained unique variance in sentence combining, along with spelling. Word-level skills (morphology in first and spelling in second through fourth grade) consistently explained unique variance in sentence combining. Thus, many beginning writers have syntactic knowledge of what constitutes a complete sentence, but not until fourth grade do both syntax and transcription contribute uniquely to flexible translation of ideas into the syntax of a written sentence. In Study 3, eleven syntactic categories were identified in single- and multi- sentence composing from second to fifth grade. Complex clauses (independent plus subordinate) occurred more often on single-sentence composing, but single independent clauses occurred more often on multi-sentence composing. For multi-sentence text, more single, independent clauses were produced by pen than keyboard in grades 3 to 7. The most frequent category of complex clauses in multi-sentence texts varied with genre (relative for essays and subordinate for narratives). Thus, in addition to syntax-level sentence construction and word-level transcription, amount of translation (number of sentences), mode of transcription, and genre for multiple sentence text also influence translation of ideas into written language of child writers. Results of these studies employing descriptive linguistic analyses are discussed in reference to cognitive theory of writing development.
PMCID: PMC3048336  PMID: 21383865
Sentence construction; Single-sentence composing; Multi-sentence composing; Syntactic level of language; Written syntax
14.  Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts 
PLoS ONE  2009;4(7):e6393.
To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist focusing specifically on the detection of a limited set of relation types. For systems biology, generic approaches for the detection of a multitude of relation types which in addition are able to process large text corpora are needed but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural network based Semantic Role Labeling (SRL) program, for the large scale extraction of semantic relations from the biomedical literature. A comparison of processing times of SENNA and other SRL systems or syntactical parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank) conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100 node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences resulting in precision/recall values of 0.71/0.43. We show that the accuracy as well as processing speed of the proposed semantic relation extraction approach is sufficient for its large scale application on biomedical text. The proposed approach is highly generalizable regarding the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. 
The presented approach bridges the gap between fast, cooccurrence-based approaches lacking semantic relations and highly specialized and computationally demanding NLP approaches.
PMCID: PMC2712690  PMID: 19636432
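Entry 14 reports precision and recall (0.71/0.43) but no combined score. For readers who want a single figure, the sketch below applies the standard F-measure formula to those two values. The resulting number is derived here and is not reported in the abstract; the function name `f_beta` is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall.

    beta = 1 weights both equally (the usual F1 score).
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values taken from the abstract of entry 14.
f1 = f_beta(0.71, 0.43)
print(round(f1, 3))  # 0.536
```

Because the harmonic mean is pulled toward the smaller value, the combined score sits closer to the recall of 0.43 than to the precision of 0.71.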
15.  Neurophysiology of Hungarian subject–verb dependencies with varying intervening complexity 
Non-adjacent dependencies are thought to be more costly to process than sentences wherein dependents immediately follow or precede what they depend on. In English, locality effects have been revealed, while in languages with rich case marking (German and Hindi) sentence-final structures show anti-locality effects. The motivation of the current study is to test whether locality effects can be directly applied to a language that is typologically different from those investigated so far. Hungarian is a “topic prominent” language; it permits variation in word order for semantic reasons, including SVO word order. Hungarian also has a rich morphological system (e.g., rich case system) and postpositions to indicate grammatical functions. In the present ERP study, Hungarian subject–verb dependencies were compared by manipulating the mismatch of number agreement between the sentence's initial noun phrase and the sentence's final intransitive verb as well as the complexity of the intervening sentence material, interrupting the dependencies. Possible lexical class and frequency or cloze-probability effects for the first two words of the intervening sentence material were revealed when a separate baseline was used for each word. At the third word of the intervening material and at the main verb, ERPs were not modulated by complexity, but at the verb ERPs were enhanced by grammaticality. Ungrammatical sentences enlarged the amplitude of both LAN and P600 components at the main verb. These results are in line with studies suggesting that the retrieval of the first element of a dependency is not influenced by distance from the second element, as the first element is directly accessible when needed for integration (e.g., McElree, 2000).
PMCID: PMC3598571  PMID: 21740931
ERP (Event-Related Potentials); Locality effect; Anti-locality effect; Anterior Negativity (LAN); P600; Nonadjacent dependency; Sentence comprehension
16.  Protein interaction sentence detection using multiple semantic kernels 
Detection of sentences that describe protein-protein interactions (PPIs) in biomedical publications is a challenging and unresolved pattern recognition problem. Many state-of-the-art approaches for this task employ kernel classification methods, in particular support vector machines (SVMs). In this work we propose a novel data integration approach that utilises semantic kernels and a kernel classification method that is a probabilistic analogue to SVMs. Semantic kernels are created from statistical information gathered from large amounts of unlabelled text using lexical semantic models. Several semantic kernels are then fused into an overall composite classification space. In this initial study, we use simple features in order to examine whether the use of combinations of kernels constructed using word-based semantic models can improve PPI sentence detection.
We show that combinations of semantic kernels lead to statistically significant improvements in recognition rates and receiver operating characteristic (ROC) scores over the plain Gaussian kernel, when applied to a well-known labelled collection of abstracts. The proposed kernel composition method also allows us to automatically infer the most discriminative kernels.
The results from this paper indicate that using semantic information from unlabelled text, and combinations of such information, can be valuable for classification of short texts such as PPI sentences. This study, however, is only a first step in evaluation of semantic kernels and probabilistic multiple kernel learning in the context of PPI detection. The method described herein is modular, and can be applied with a variety of feature types, kernels, and semantic models, in order to facilitate full extraction of interacting proteins.
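The idea of fusing several kernels into one composite space can be illustrated with a minimal sketch. It assumes a fixed, hand-chosen weighted sum of kernels, which is a simplification and not the probabilistic multiple kernel learning method the abstract describes. The Gaussian and linear kernels below are stand-ins for the semantic kernels, and all function names and weights are hypothetical.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """Plain Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def linear_kernel(x, y):
    """Dot-product kernel; a stand-in for a semantic kernel."""
    return sum(a * b for a, b in zip(x, y))

def composite_kernel(x, y, kernels, weights):
    """Weighted sum of base kernels (weights are assumed, not learned here)."""
    return sum(w * k(x, y) for k, w in zip(kernels, weights))

def gram_matrix(points, kernels, weights):
    """Composite kernel evaluated on every pair of points."""
    return [[composite_kernel(p, q, kernels, weights) for q in points]
            for p in points]

# Hypothetical two-dimensional sentence features.
sentences = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gram = gram_matrix(sentences, [gaussian_kernel, linear_kernel], [0.5, 0.5])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the resulting Gram matrix can be handed to any kernel classifier. In a multiple kernel learning setting the weights would be inferred from data, which is what allows the most discriminative kernels to be identified.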
PMCID: PMC3116455  PMID: 21569604
17.  An Unsupervised Machine Learning Approach to Segmentation of Clinician-Entered Free Text 
Natural language processing is an important tool in biomedicine, and fails without successful segmentation of words and sentences. Tokenization is a form of segmentation that identifies boundaries separating semantic units, for example words, dates, numbers and symbols, within a text. We sought to construct a highly generalizable tokenization algorithm with no prior knowledge of characters or their function, based solely on the inherent statistical properties of token and sentence boundaries. Tokenizing clinician-entered free text, we achieved precision and recall of 92% and 93%, respectively, compared to a whitespace token boundary detection algorithm. We classified over 80% of punctuation characters correctly, based on manual disambiguation with high inter-rater agreement (kappa = 0.916). Our algorithm effectively discovered properties of whitespace and punctuation in the corpus without prior knowledge of either. Given the dynamic nature of biomedical language, and the variety of distinct sublanguages used, the effectiveness and generalizability of our novel tokenization algorithm make it a valuable tool.
PMCID: PMC2655800  PMID: 18693949
18.  Mathematical Philology: Entropy Information in Refining Classical Texts' Reconstruction, and Early Philologists' Anticipation of Information Theory 
PLoS ONE  2010;5(1):e8661.
Philologists reconstructing ancient texts from variously miscopied manuscripts anticipated information theorists by centuries in conceptualizing information in terms of probability. An example is the editorial principle difficilior lectio potior (DLP): in choosing between otherwise acceptable alternative wordings in different manuscripts, “the more difficult reading [is] preferable.” As philologists at least as early as Erasmus observed (and as information theory's version of the second law of thermodynamics would predict), scribal errors tend to replace less frequent and hence entropically more information-rich wordings with more frequent ones. Without measurements, it has been unclear how effectively DLP has been used in the reconstruction of texts, and how effectively it could be used. We analyze a case history of acknowledged editorial excellence that mimics an experiment: the reconstruction of Lucretius's De Rerum Natura, beginning with Lachmann's landmark 1850 edition based on the two oldest manuscripts then known. Treating words as characters in a code, and taking the occurrence frequencies of words from a current, more broadly based edition, we calculate the difference in entropy information between Lachmann's 756 pairs of grammatically acceptable alternatives. His choices average 0.26±0.20 bits higher in entropy information (95% confidence interval, P = 0.005), as against the single bit that determines the outcome of a coin toss, and the average 2.16±0.10 bits (95%) of (predominantly meaningless) entropy information if the rarer word had always been chosen. As a channel width, 0.26±0.20 bits/word corresponds to a 0.79 (+0.09/−0.15) likelihood of the rarer word being the one accepted in the reference edition, which is consistent with the observed 547/756 = 0.72±0.03 (95%).
Statistically informed application of DLP can recover substantial amounts of semantically meaningful entropy information from noise; hence the extension copiosior informatione lectio potior, “the reading richer in information [is] preferable.” New applications of information theory promise continued refinement in the reconstruction of culturally fundamental texts.
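The arithmetic behind two of the quantities in this abstract can be sketched as follows. The sketch assumes that the information content of a word is its surprisal, -log2 of its relative frequency, and that the 95% interval on 547/756 is a normal-approximation binomial interval. The abstract does not state how its interval was computed, so both the function names and the interval method are assumptions, and the word counts in the first example are hypothetical.

```python
import math

def surprisal_bits(count, total):
    """Information content, in bits, of a word seen `count` times among `total` words."""
    return -math.log2(count / total)

def information_gain_bits(chosen_count, rejected_count, total):
    """Extra bits carried by the chosen reading relative to the rejected one."""
    return surprisal_bits(chosen_count, total) - surprisal_bits(rejected_count, total)

def proportion_with_ci(successes, trials, z=1.96):
    """Observed proportion and the half-width of its normal-approximation interval."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p, half_width

# Hypothetical counts: the chosen word is four times rarer than its alternative.
gain = information_gain_bits(chosen_count=2, rejected_count=8, total=1000)

# Figures from the abstract: the rarer word was accepted in 547 of 756 pairs.
p, half_width = proportion_with_ci(547, 756)
print(f"{p:.2f} ± {half_width:.2f}")  # prints "0.72 ± 0.03"
```

The corpus size cancels in the difference, so the gain depends only on the ratio of the two counts: a word four times rarer carries two more bits.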
PMCID: PMC2800184  PMID: 20084117
19.  Information Extraction for Clinical Data Mining: A Mammography Case Study 
Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Lately, researchers have applied data mining and machine learning techniques to these databases. They successfully built breast cancer classifiers that can help in early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format and/or solely consist of text reports. BI-RADS feature extraction from free text and consistency checks between recorded predictive variables and text reports are crucial to addressing this problem.
We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS features extraction algorithm for clinical data mining. It consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts’ input. It parses sentences detecting BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text, filtering out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.
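As a quick consistency check on the reported figures, the F1-score is the harmonic mean of precision and recall. The helper below is a minimal sketch with a hypothetical function name; it is not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # conventional value when both are zero
    return 2 * precision * recall / (precision + recall)

# Figures from the abstract: 97.7% precision and 95.5% recall.
print(round(f1_score(0.977, 0.955), 2))  # prints 0.97
```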
PMCID: PMC3676897  PMID: 23765123
BI-RADS; free text; lexicon; mammography; clinical data mining
20.  Establishing causal coherence across sentences: an ERP study 
Journal of cognitive neuroscience  2010;23(5):1230-1246.
This study examined neural activity associated with establishing causal relationships across sentences during online comprehension. ERPs were measured while participants read and judged the relatedness of three-sentence scenarios in which the final sentence was highly causally related, intermediately related, or causally unrelated to its context. Lexico-semantic co-occurrence was matched across the three conditions using a Latent Semantic Analysis. Critical words in causally unrelated scenarios evoked a larger N400 than words in both highly causally related and intermediately related scenarios, regardless of whether they appeared before or at the sentence-final position. At midline sites, the N400 to intermediately related sentence-final words was attenuated to the same degree as to highly causally related words, but otherwise the N400 to intermediately related words fell in between that evoked by highly causally related and causally unrelated words. No modulation of the Late Positivity/P600 component was observed across conditions. These results indicate that both simple and complex causal inferences can influence the earliest stages of semantically processing an incoming word. Further, they suggest that causal coherence, at the situation level, can influence incremental word-by-word discourse comprehension, even when semantic relationships between individual words are matched.
PMCID: PMC3141815  PMID: 20175676
causal coherence; discourse; ERP; N400; P600; inference; language; situation model
21.  A Multi-Classifier Based Guideline Sentence Classification System 
Healthcare Informatics Research  2011;17(4):224-231.
An efficient clinical process guideline (CPG) modeling service was designed that uses an enhanced intelligent search protocol. The need for a search system arises from the requirement for CPG models to be able to adapt to dynamic patient contexts, allowing them to be updated based on new evidence that arises from medical guidelines and papers.
A sentence category classifier combined with the AdaBoost.M1 algorithm was used to evaluate the contribution of the CPG to the quality of the search mechanism. Three annotators each tagged 340 sentences hand-chosen from the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure (JNC7) clinical guideline. The three annotators then carried out cross-validations of the tagged corpus. A transformation function is also used that extracts a predefined set of structural feature vectors determined by analyzing the sentential instance in terms of the underlying syntactic structures and phrase-level co-occurrences that lie beneath the surface of the lexical generation event.
The additional sub-filtering using a combination of multi-classifiers was found to be more effective than a single conventional Term Frequency-Inverse Document Frequency (TF-IDF)-based search system in pinpointing the page containing or adjacent to the guideline information.
We found that transformation has the advantage of exploiting the structural and underlying features which go unseen by the bag-of-words (BOW) model. We also realized that integrating a sentential classifier with a TF-IDF-based search engine enhances the search process by maximizing the probability of the automatically presented relevant information required in the context generated by the guideline authoring environment.
PMCID: PMC3259557  PMID: 22259724
Knowledge Bases; Data Mining; Natural Language Processing
22.  Inclusion of Ethical Issues in Dementia Guidelines: A Thematic Text Analysis 
PLoS Medicine  2013;10(8):e1001498.
Clinical practice guidelines (CPGs) aim to improve professionalism in health care. However, current CPG development manuals fail to address how to include ethical issues in a systematic and transparent manner. The objective of this study was to assess the representation of ethical issues in general CPGs on dementia care.
Methods and Findings
To identify national CPGs on dementia care, five databases of guidelines were searched and national psychiatric associations were contacted in August 2011 and in June 2013. A framework for the assessment of the identified CPGs' ethical content was developed on the basis of a prior systematic review of ethical issues in dementia care. Thematic text analysis and a 4-point rating score were employed to assess how ethical issues were addressed in the identified CPGs. Twelve national CPGs were included. Thirty-one ethical issues in dementia care were identified by the prior systematic review. The proportion of these 31 ethical issues that were explicitly addressed by each CPG ranged from 22% to 77%, with a median of 49.5%. National guidelines differed substantially with respect to (a) which ethical issues were represented, (b) whether ethical recommendations were included, (c) whether justifications or citations were provided to support recommendations, and (d) to what extent the ethical issues were explained.
Ethical issues were inconsistently addressed in national dementia guidelines, with some guidelines including most and some including few ethical issues. Guidelines should address ethical issues and how to deal with them to help the medical profession understand how to approach care of patients with dementia, and for patients, their relatives, and the general public, all of whom might seek information and advice in national guidelines. There is a need for further research to specify how detailed ethical issues and their respective recommendations can and should be addressed in dementia guidelines.
Editors’ Summary
In the past, doctors tended to rely on their own experience to choose the best treatment for their patients. Faced with a patient with dementia (a brain disorder that affects short-term memory and the ability to carry out normal daily activities), for example, a doctor would use his/her own experience to help decide whether the patient should remain at home or would be better cared for in a nursing home. Similarly, the doctor might have to decide whether antipsychotic drugs might be necessary to reduce behavioral or psychological symptoms such as restlessness or shouting. However, over the past two decades, numerous evidence-based clinical practice guidelines (CPGs) have been produced by governmental bodies and medical associations that aim to improve standards of clinical competence and professionalism in health care. During the development of each guideline, experts search the medical literature for the current evidence about the diagnosis and treatment of a disease, evaluate the quality of that evidence, and then make recommendations based on the best evidence available.
Why Was This Study Done?
Currently, CPG development manuals do not address how to include ethical issues in CPGs. A health-care professional is ethical if he/she behaves in accordance with the accepted principles of right and wrong that govern the medical profession. More specifically, medical professionalism is based on a set of binding ethical principles—respect for patient autonomy, beneficence, non-malfeasance (the “do no harm” principle), and justice. In particular, CPG development manuals do not address disease-specific ethical issues (DSEIs), clinical ethical situations that are relevant to the management of a specific disease. So, for example, a DSEI that arises in dementia care is the conflict between the ethical principles of non-malfeasance and patient autonomy (freedom-to-move-at-will). Thus, healthcare professionals may have to decide to physically restrain a patient with dementia to prevent the patient doing harm to him- or herself or to someone else. Given the lack of guidance on how to address ethical issues in CPG development manuals, in this thematic text analysis, the researchers assess the representation of ethical issues in CPGs on general dementia care. Thematic text analysis uses a framework for the assessment of qualitative data (information that is word-based rather than number-based) that involves pinpointing, examining, and recording patterns (themes) among the available data.
What Did the Researchers Do and Find?
The researchers identified 12 national CPGs on dementia care by searching guideline databases and by contacting national psychiatric associations. They developed a framework for the assessment of the ethical content in these CPGs based on a previous systematic review of ethical issues in dementia care. Of the 31 DSEIs included by the researchers in their analysis, the proportion that were explicitly addressed by each CPG ranged from 22% (Switzerland) to 77% (USA); on average the CPGs explicitly addressed half of the DSEIs. Four DSEIs—adequate consideration of advanced directives in decision making, usage of GPS and other monitoring techniques, covert medication, and dealing with suicidal thinking—were not addressed in at least 11 of the CPGs. The inclusion of recommendations on how to deal with DSEIs ranged from 10% of DSEIs covered in the Swiss CPG to 71% covered in the US CPG. Overall, national guidelines differed substantially with respect to which ethical issues were included, whether ethical recommendations were included, whether justifications or citations were provided to support recommendations, and to what extent the ethical issues were clearly explained.
What Do These Findings Mean?
These findings show that national CPGs on dementia care already address clinical ethical issues but that the extent to which the spectrum of DSEIs is considered varies widely within and between CPGs. They also indicate that recommendations on how to deal with DSEIs often lack the evidence that health-care professionals use to justify their clinical decisions. The researchers suggest that this situation can and should be improved, although more research is needed to determine how ethical issues and recommendations should be addressed in dementia guidelines. A more systematic and transparent inclusion of DSEIs in CPGs for dementia (and for other conditions) would further support the concept of medical professionalism as a core element of CPGs, note the researchers, but is also important for patients and their relatives who might turn to national CPGs for information and guidance at a stressful time of life.
Additional Information
Please access these Web sites via the online version of this summary at
Wikipedia contains a page on clinical practice guidelines (note: Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
The US National Guideline Clearinghouse provides information on national guidelines, including CPGs for dementia
The Guidelines International Network promotes the systematic development and application of clinical practice guidelines
The American Medical Association provides information about medical ethics; the British Medical Association provides information on all aspects of ethics and includes an essential tool kit that introduces common ethical problems and practical ways to deal with them
The UK National Health Service Choices website provides information about dementia, including a personal story about dealing with dementia
MedlinePlus provides links to additional resources about dementia and about Alzheimer's disease, a specific type of dementia (in English and Spanish)
The UK Nuffield Council on Bioethics provides the report Dementia: ethical issues and additional information on the public consultation on ethical issues in dementia care
PMCID: PMC3742442  PMID: 23966839
23.  Working Memory Is Partially Preserved during Sleep 
PLoS ONE  2012;7(12):e50997.
Although several cognitive processes, including speech processing, have been studied during sleep, working memory (WM) has never been explored up to now. Our study assessed the capacity of WM by testing speech perception when the level of background noise and the sentential semantic length (SSL) (amount of semantic information required to perceive the incongruence of a sentence) were modulated. Speech perception was explored with the N400 component of the event-related potentials recorded to sentence final words (50% semantically congruent with the sentence, 50% semantically incongruent). During sleep stage 2 and paradoxical sleep: (1) without noise, a larger N400 was observed for (short and long SSL) sentences ending with a semantically incongruent word compared to a congruent word (i.e. an N400 effect); (2) with moderate noise, the N400 effect (observed at wake with short and long SSL sentences) was attenuated for long SSL sentences. Our results suggest that WM for linguistic information is partially preserved during sleep with a smaller capacity compared to wake.
PMCID: PMC3517624  PMID: 23236418
24.  Neural correlates of semantic and syntactic processes in the comprehension of case marked pronouns: Evidence from German and Dutch 
BMC Neuroscience  2006;7:23.
It is well known that both semantic and syntactic information play a role in pronoun resolution in sentences. However, it is unclear what the relative contribution of these sources of information is for the establishment of a coreferential relationship between the pronoun and the antecedent in combination with a local structural case constraint on the pronoun (i.e. case assignment of a pronoun under preposition governing). In a prepositional phrase in German and Dutch, it is the preposition that assigns case to the pronoun. Furthermore, in these languages different overtly case-marked pronouns are used to refer to male and female persons. Thus, one can manipulate biological/syntactic gender features separately from case marking features.
The major aim of this study was to determine what the influence of gender information in combination with a local structural case constraint is on the processing of a personal pronoun in a sentence.
Event-related brain potential (ERP) experiments were performed in German and in Dutch. In a word by word sentence reading study in German and Dutch, gender congruency between the antecedent and the pronoun was manipulated and/or case assignment by the preposition was violated while ERPs of young native speakers were recorded.
The German and the Dutch ERP data showed an enlarged negativity broadly distributed starting approximately 350 ms after onset of the pronoun followed by a late positivity for gender violations. For syntactic incongruencies without gender violations only a positivity was present. The Dutch data showed an earlier onset of the positivity in comparison to German.
Finding negativities and positivities for conditions with a gender violation indicates that pronoun resolution with gender incongruency between the pronoun and the antecedent suffers from semantic as well as syntactic integration problems. The presence of a positivity for the syntactically incongruent conditions without gender violations suggests that the processing of incorrect case marking without a gender violation gives rise to syntactic but not semantic integration problems. We suggest that the more prominent case violation in Dutch caused the earlier onset of the positivity in the Dutch study. In addition, the pattern of ERP effects shows that both case and gender information are used almost immediately implying that the local structural constraint affects the resolution process with more processing activity than for a pronoun of which only one source of information is violated or incongruent.
PMCID: PMC1456979  PMID: 16526952
25.  Real-Time Parallel Processing of Grammatical Structure in the Fronto-Striatal System: A Recurrent Network Simulation Study Using Reservoir Computing 
PLoS ONE  2013;8(2):e52946.
Sentence processing takes place in real-time. Previous words in the sentence can influence the processing of the current word in the timescale of hundreds of milliseconds. Recent neurophysiological studies in humans suggest that the fronto-striatal system (frontal cortex, and striatum – the major input locus of the basal ganglia) plays a crucial role in this process. The current research provides a possible explanation of how certain aspects of this real-time processing can occur, based on the dynamics of recurrent cortical networks, and plasticity in the cortico-striatal system. We simulate prefrontal area BA47 as a recurrent network that receives on-line input about word categories during sentence processing, with plastic connections between cortex and striatum. We exploit the homology between the cortico-striatal system and reservoir computing, where recurrent frontal cortical networks are the reservoir, and plastic cortico-striatal synapses are the readout. The system is trained on sentence-meaning pairs, where meaning is coded as activation in the striatum corresponding to the roles that different nouns and verbs play in the sentences. The model learns an extended set of grammatical constructions, and demonstrates the ability to generalize to novel constructions. It demonstrates how early in the sentence, a parallel set of predictions are made concerning the meaning, which are then confirmed or updated as the processing of the input sentence proceeds. It demonstrates how on-line responses to words are influenced by previous words in the sentence, and by previous sentences in the discourse, providing new insight into the neurophysiology of the P600 ERP scalp response to grammatical complexity. This demonstrates that a recurrent neural network can decode grammatical structure from sentences in real-time in order to generate a predictive representation of the meaning of the sentences. 
This can provide insight into the underlying mechanisms of human cortico-striatal function in sentence processing.
PMCID: PMC3562282  PMID: 23383296
