The initial PubFocus search is performed using several complementary search menus: basic and advanced menus, each with various search limits, and a detailed menu that allows experienced users to construct their own search queries (Fig. ; Fig. ). The syntax of the search queries is identical to that of PubMed. Previously generated PubMed queries can be imported into PubFocus by copying the entire query string into the search field under the detailed menu. Initial search results are arranged chronologically (20 records are displayed at a time) and can be viewed in three alternative modes: brief, summary and abstract. In each mode both the title and the authors' names are hyperlinked to allow interactive navigation of the search output. Four hyperlinks are commonly provided for each author's name (AND, OR, NOT and ONLY), allowing quick focus on search subsets that include or exclude any particular author, as well as a simple link-out to all publications by any given author (Fig. ). Additionally, for each record the impact factor (IF) of the journal and the number of forward citations are obtained by interfacing with parallel databases (Fig. ). The impact factor is obtained from a locally hosted, manually built journal ranking database, which includes current impact factors for 7,525 unique source titles (based on the 2005 edition of Journal Citation Report® by The Thomson Corporation). Volumetric data on forward citations are obtained through automated parsing of HTML outputs for the individual PubMed Central (PMC) records (commonly known as "cited in PMC" data) and matching Google™ Scholar records. This information can be used to judge the rank of a publication, as a higher impact factor and a higher volume of forward citations would indicate a higher standing of the article.
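The four per-author hyperlinks can be pictured as four variations on the current query. The following is a minimal sketch only: the exact query strings generated by PubFocus are not given here, so the function name `author_link_queries`, the use of the PubMed `[Author]` field tag, and the reading of ONLY as "this author alone" are assumptions made for illustration.

```python
def author_link_queries(base_query: str, author: str) -> dict[str, str]:
    """Return one refined PubMed-style query per author hyperlink.

    Hypothetical helper: the query shapes below are an assumed reading
    of the AND / OR / NOT / ONLY links, not PubFocus's actual output.
    """
    tagged = f"{author}[Author]"  # PubMed author field tag
    return {
        "AND": f"({base_query}) AND {tagged}",   # keep only this author's records
        "OR": f"({base_query}) OR {tagged}",     # widen with this author's records
        "NOT": f"({base_query}) NOT {tagged}",   # exclude this author's records
        "ONLY": tagged,                          # link-out to all of the author's papers
    }


queries = author_link_queries("hair follicle", "Smith J")
for label, query in queries.items():
    print(f"{label:>4}: {query}")
```

Wrapping the original query in parentheses keeps its own Boolean operators from interacting with the appended author clause.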
PubFocus provides a set of statistical tools for bibliometric analysis of relevant records. In the first step, data on relevant records are extracted from the PubMed server in packets of 50 XML-formatted records (up to a total of 2500 records per analysis) and written into a local temporary database (Fig. ). Remote extraction of the XML-formatted records was chosen over loading the entire PubMed database (approximately 31.6 – 46.3 GB in size [18]) into the local relational database in order to keep the application lightweight and easy to transfer between servers, and to avoid the need for frequent updates and maintenance. In addition, this method removes the need to develop a separate search engine: relevance of citations to the search query is determined by the PubMed search engine. In the second step, the local database is enriched with IFs and volumetric data on forward citations (Fig. ; Fig. ). The user can choose to collect data on forward citations from three alternative sources: PubMed Basic, PubMed Central or Google™ Scholar (Fig. ). Data acquisition from PubMed Basic is fast, yet it only establishes the presence or absence of forward references within the PubMed Central database (i.e. "yes or no" mode). Alternatively, both PubMed Central and Google™ Scholar provide numeric values of forward citations, but data harvesting from these sources can be lengthy and is recommended for smaller sets of records. In general, Google™ Scholar provides higher values of forward citations for the same publication than PubMed Central does. In a recent formal study, forward citation data from Google™ Scholar showed a substantial degree of overlap with those from the proprietary Web of Science® [19]. While forward citations appear in the Google™ Scholar database earlier in the lifecycle of the publication than in the PubMed Central database, Google™ Scholar "returns a smaller number of citing references" than Web of Science® but "provides a large set of unique citing material" [19]. While "it is clear that Google™ Scholar provides unique citing material", "the exact composition of this citing material should also be more thoroughly examined so that scholars will have a clear idea what is and is not included in Google™ Scholar searches".
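The packet-wise extraction described above (50 records per request, at most 2500 per analysis) amounts to a simple paging scheme. The sketch below is illustrative only. It assumes retrieval through the NCBI E-utilities `efetch` endpoint with a stored search history (`WebEnv`, `query_key`), which is one common way to page through PubMed results; the text does not state which interface PubFocus actually uses, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities efetch (not confirmed as PubFocus's method).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
PACKET_SIZE = 50      # records per request, as stated in the text
MAX_RECORDS = 2500    # cap per analysis, as stated in the text


def packet_offsets(total_hits, packet_size=PACKET_SIZE, max_records=MAX_RECORDS):
    """Return (start, count) pairs covering the first `max_records` hits."""
    limit = min(total_hits, max_records)
    return [
        (start, min(packet_size, limit - start))
        for start in range(0, limit, packet_size)
    ]


def efetch_url(web_env, query_key, start, count):
    """Build the URL for one packet of XML-formatted records."""
    params = {
        "db": "pubmed",
        "retmode": "xml",
        "WebEnv": web_env,        # history handle from a prior search (assumed)
        "query_key": query_key,
        "retstart": start,
        "retmax": count,
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# A query with 120 hits needs three packets: 50, 50 and 20 records.
for start, count in packet_offsets(120):
    print(efetch_url("EXAMPLE_WEBENV", 1, start, count))
```

Each returned packet would then be parsed and written to the temporary local database before the next one is requested.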
Statistical analysis is performed once data acquisition is complete. Basic statistics employ a simple volumetric approach (similar to the analysis done by Web of Science®) to compute, sort, and provide semi-graphical output of the results (Fig. ). Basic statistics can be viewed by accessing the "Basic Statistics" tab. The analysis covers publication trends over the years, top publishing first authors (commonly the scientists who contributed most to the paper), top publishing last authors (commonly principal investigators), top fields of research, top research topics, top publication sources based on volume or impact factor, publication types, and publication languages. In addition, search-narrowing tools allow selection and display of subsets of relevant records matching any of the parameters listed above. For example, one can choose to display only records published by the top three principal investigators or only records published in certain journals (such as a small subset of "high-profile" articles published in journals with high impact factors, like Nature or Science). Moving back and forth between subsets of the initial search is done using the temporary local database, without additional time-consuming external data harvesting (Fig. → → → ; Fig. ).
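The volumetric approach is essentially counting. As a rough sketch under stated assumptions, the tallies and a narrowing step could look as follows; the record layout (`year`, `authors`, `journal`), the function names and the sample data are all hypothetical and do not reflect the PubFocus database schema.

```python
from collections import Counter

# Made-up records for illustration only.
records = [
    {"year": 2004, "authors": ["Lee A", "Kim B", "Park C"], "journal": "Nature"},
    {"year": 2005, "authors": ["Lee A", "Park C"], "journal": "Science"},
    {"year": 2005, "authors": ["Kim B", "Stone D", "Park C"], "journal": "Nature"},
]


def basic_statistics(records, top_n=3):
    """Count records per year, first author, last author and journal."""
    with_authors = [r for r in records if r["authors"]]
    return {
        "per_year": sorted(Counter(r["year"] for r in records).items()),
        "top_first_authors": Counter(
            r["authors"][0] for r in with_authors
        ).most_common(top_n),
        "top_last_authors": Counter(
            r["authors"][-1] for r in with_authors
        ).most_common(top_n),
        "top_journals": Counter(r["journal"] for r in records).most_common(top_n),
    }


def narrow_by_last_author(records, names):
    """Search narrowing: keep records whose last author is in `names`."""
    wanted = set(names)
    return [r for r in records if r["authors"] and r["authors"][-1] in wanted]


stats = basic_statistics(records)
top_pis = [name for name, _ in stats["top_last_authors"]]
subset = narrow_by_last_author(records, top_pis)
print(stats["per_year"], len(subset))
```

Because narrowing only filters rows that are already held locally, it needs no further requests to external sources.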
We have also developed means for integrating biomedical databases into the assisted citation analytics (Fig. ). We have designed a standard MySQL format and integrated MySQL full-text search to allow automatic extraction of matching terms from titles and abstracts of relevant citations (Fig. ). These terms are sorted either by their occurrence rate or by built-in ontology categories. Matching terms are presented in the form of semi-graphical output, allowing the same selection and search-narrowing procedures as throughout the rest of the PubFocus portal (Fig. ). While multiple databases qualified for integration, we chose the NCI (National Cancer Institute) thesaurus and the MGD (Mouse Genome Database) mammalian gene orthology (Fig. ). Both of these databases are large and can be useful for a wide scientific audience rather than for small interest groups only. The NCI thesaurus represents a "major effort to integrate molecular and clinical cancer-related information within a unified biomedical informatics framework, with controlled terminology as its foundational layer". It includes some 49,000 biomedical concepts and 146,000 synonyms "separated into 20 logically distinct kinds" including anatomical terms, diagnostic terms, diseases, drugs and chemicals etc. [20, 21]. The NCI thesaurus has not previously been integrated into any PubMed citation analytics application. The MGD mammalian gene orthology includes around 65,500 symbols and names and an additional 24,000 synonyms of mammalian genes. It is one of the most comprehensive and inclusive mammalian gene databases [22].
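To illustrate the idea of term extraction and the two sort orders, the sketch below matches a tiny vocabulary against citation text. It is a stand-in written in plain Python with regular expressions, not the MySQL full-text search used by PubFocus, and the vocabulary entries, category names and function names are made up for the example.

```python
import re
from collections import Counter

# Hypothetical mini-vocabulary: synonym -> (preferred term, category).
VOCABULARY = {
    "melanoma": ("Melanoma", "Diseases"),
    "malignant melanoma": ("Melanoma", "Diseases"),
    "cisplatin": ("Cisplatin", "Drugs and Chemicals"),
    "skin": ("Skin", "Anatomy"),
}


def match_terms(texts):
    """Count, per preferred term, how many texts mention any of its synonyms."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        found = {
            concept
            for synonym, concept in VOCABULARY.items()
            if re.search(rf"\b{re.escape(synonym)}\b", lowered)
        }
        counts.update(found)  # each text counts at most once per term
    return counts


def by_occurrence(counts):
    """Most frequent terms first; ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][0]))


def by_category(counts):
    """Group by category, most frequent terms first within each group."""
    return sorted(counts.items(), key=lambda kv: (kv[0][1], -kv[1], kv[0][0]))


texts = [
    "Cisplatin response in malignant melanoma",
    "Melanoma of the skin",
    "Skin grafts",
]
counts = match_terms(texts)
print(by_occurrence(counts))
print(by_category(counts))
```

Collapsing synonyms onto a preferred term before counting avoids listing the same concept several times in the output.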
Additional statistics employ custom algorithms to provide more accurate ranking of the search results. In particular, PubFocus uses several determinants of publication and author impact in an attempt to automatically identify the most prominent publications and the authors whose papers can be considered significant within the given field of search.
1) Combined Impact Factor (CombIF) is calculated as CombIF = IF × citations-over-age index to account for:
a) Impact factor (IF) of the publication source.
b) Age of the publication and presence of forward references in any one of the three databases mentioned above (citations-over-age index). The citations-over-age index is calculated based either on Table (if PubMed Basic mode is used) or Table (if either PubMed Central or Google™ Scholar mode is used). In general, the citations-over-age index boosts the IF value of new and cited articles in proportion to the number of forward citations and reduces the IF value of old articles that have not been cited ("dead-end" articles).
Table 1. Algorithm of citations-over-age index determination in PubMed Basic mode
Table 2. Algorithm of citations-over-age index determination in either PubMed Central or Google™ Scholar mode
2) Cumulative impact factor (CIF) is calculated as CIF = ∑(IF) and represents the cumulative IF of the journals in which any given author has published.
3) Author's Rank (AR) was introduced to make important adjustments to CIF. It is calculated as AR = ∑(CombIF × contribution index) to account for:
a) Combined Impact Factor (CombIF, see above).
b) The role of the author in the publication (contribution index). The contribution index of the first and last authors (the most contributing authors) on the paper is set at 1, keeping the CombIF value unchanged. The contribution level of the middle authors shows a high degree of variability [23]. Often, however, the contribution of the second author is greater than that of the remaining middle authors [24]. Therefore, the contribution index of second authors is set at 1/2, and for the remaining middle authors it is set at 1/3, diluting the initial CombIF value. An alternative strategy, in which the contribution indices of all authors of a paper sum to 1, was dismissed because in papers with long author lists each contribution would be diluted to an insignificant extent.
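The three determinants above can be sketched in a few lines. This is an illustration under stated assumptions rather than the PubFocus implementation: the citations-over-age index is taken as a given input (`coa_index`) because the lookup tables are not reproduced here, the record layout and function names are hypothetical, and CombIF is multiplied by the contribution index, which is the reading under which indices of 1/2 and 1/3 dilute the value as described.

```python
from collections import defaultdict


def contribution_index(position, n_authors):
    """1 for first and last authors, 1/2 for the second, 1/3 for other middle authors."""
    if position == 0 or position == n_authors - 1:
        return 1.0
    if position == 1:
        return 0.5
    return 1.0 / 3.0


def comb_if(impact_factor, coa_index):
    """CombIF = IF x citations-over-age index (index supplied by the caller)."""
    return impact_factor * coa_index


def rank_authors(papers):
    """Return (CIF, AR) dictionaries keyed by author name."""
    cif = defaultdict(float)
    ar = defaultdict(float)
    for paper in papers:
        combined = comb_if(paper["if"], paper["coa_index"])
        n_authors = len(paper["authors"])
        for position, author in enumerate(paper["authors"]):
            cif[author] += paper["if"]  # CIF: plain sum of journal IFs
            ar[author] += combined * contribution_index(position, n_authors)
    return dict(cif), dict(ar)


# Made-up example: two papers with assumed index values.
papers = [
    {"if": 10.0, "coa_index": 2.0, "authors": ["A", "B", "C", "D"]},
    {"if": 4.0, "coa_index": 0.5, "authors": ["B", "A"]},
]
cif, ar = rank_authors(papers)
print(sorted(ar.items(), key=lambda kv: kv[1], reverse=True))
```

In this example authors A and B share the same CIF (14), yet A ranks well above B by AR (22 versus 12), because A is first or last author on both papers while B is second author on the more heavily weighted one.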
Ranked results can be viewed in the form of sorted citation records by accessing the "Sorted Results" tab (Fig. ; Fig. ). Processed publications can be sorted by publication date, first author, last author, impact factor, forward citations or CombIF (Fig. ). Sorting can be done either in ascending or descending modes. Sorting by CombIF in descending mode is the most informative, as it outputs publications with the highest established impact first. In addition, by accessing the "Basic Statistics" tab the user can view distribution of either impact factors, volume of forward citations or CombIFs among all publications within the database.
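The sorting options listed above map onto a single keyed sort. The snippet below is a minimal sketch in which the field names and sample values are hypothetical; only the list of sortable attributes and the ascending/descending choice come from the description.

```python
# Assumed field names for the sortable attributes named in the text.
SORT_KEYS = (
    "date", "first_author", "last_author",
    "impact_factor", "citations", "comb_if",
)


def sort_records(records, key="comb_if", descending=True):
    """Sort processed records by one of the supported attributes."""
    if key not in SORT_KEYS:
        raise ValueError(f"unsupported sort key: {key!r}")
    return sorted(records, key=lambda r: r[key], reverse=descending)


# Made-up records for illustration only.
sample = [
    {"date": "2003-05-01", "first_author": "Lee A", "last_author": "Park C",
     "impact_factor": 4.0, "citations": 12, "comb_if": 6.0},
    {"date": "2005-02-10", "first_author": "Kim B", "last_author": "Stone D",
     "impact_factor": 30.0, "citations": 3, "comb_if": 45.0},
    {"date": "2001-11-20", "first_author": "Adams E", "last_author": "Park C",
     "impact_factor": 9.0, "citations": 0, "comb_if": 2.0},
]

for record in sort_records(sample, "comb_if"):
    print(record["first_author"], record["comb_if"])
```

Dates are stored here as ISO-formatted strings so that alphabetical order coincides with chronological order.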
By accessing the "1st Authors" and "PI Authors" tabs the user can view ranking tables of first and last authors, respectively, within the given field of search (Fig. ). Ranking is provided based on both CIF and AR (see above). Two sets of tables are generated. The first set accounts only for the publications where the respective authors are either first or last in the list of authors. The second set accounts for all publications, including those where the respective authors are not the primary contributors (i.e. middle authors). The ranking table by AR on all publications (last table on the page) is the most informative, as it ranks authors based on their cumulative established impact within the field of search.
Upon completion of data acquisition for any given search query, the user can quickly browse between the "Sorted Results", "Basic Statistics", "1st Authors", "PI Authors", "NCI Thesaurus" and "Gene Orthology" tabs without reinitiating the often lengthy data acquisition process (Fig. ). Until a new search is performed, statistical analysis is done by accessing data stored in the temporary local database.