We have described a new metric that characterizes the richness of metadata in a given database, record or other collection. High MCI scores identify the most commonly filled fields in existing records and could be used to automatically select the most useful fields for display in tables or web interfaces (i.e., the richest or most commonly complete subsets of the data), or to empirically validate the content of a ‘minimum information’ specification [2]. The fields most frequently filled in a given collection are good candidates to be formalized by a community as a ‘core’ requirement. A mismatch (for example, if fields marked as ‘core’ in a standard are difficult to collect, or if fields with 100% compliance are not included) suggests that the standard might need to be revised; this applies, for example, to the GSC definition of new habitat-specific metadata fields (‘environmental packages’) [5].
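As a rough illustration of how frequently filled fields might be picked out as candidates for a ‘core’ set, the following Python sketch computes the fraction of records in which each field holds a value and keeps the fields above a cut-off. The record layout (a list of dictionaries), the field names, the function names and the threshold are all assumptions made for this example; they are not part of the MCI definition or of any standard.

```python
def field_fill_rates(records, fields):
    """Return {field: fraction of records in which the field is non-empty}."""
    if not records:
        return {f: 0.0 for f in fields}
    rates = {}
    for f in fields:
        # A field counts as filled if it is present and neither None nor "".
        filled = sum(1 for r in records if r.get(f) not in (None, ""))
        rates[f] = filled / len(records)
    return rates


def candidate_core_fields(records, fields, threshold=0.9):
    """Fields whose fill rate meets an (assumed) threshold, in input order."""
    rates = field_fill_rates(records, fields)
    return [f for f in fields if rates[f] >= threshold]


# Hypothetical records, for illustration only.
records = [
    {"organism": "E. coli", "habitat": "gut", "depth": ""},
    {"organism": "B. subtilis", "habitat": "soil", "depth": None},
    {"organism": "P. marinus", "habitat": "", "depth": "50 m"},
]
fields = ["organism", "habitat", "depth"]

rates = field_fill_rates(records, fields)
core = candidate_core_fields(records, fields, threshold=0.6)
print(rates)
print(core)
```

Comparing such a list with the fields a standard marks as ‘core’ would expose the kind of mismatch discussed above, in either direction.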
MCI scores, as defined here, take into account only the simple presence or absence of values. It is clearly important to make sure that these values are valid (for example, that they are not uninformative ‘placeholders’ entered into required fields by reluctant data submitters, or otherwise inappropriate information). Likewise, sheer quantity of metadata is not necessarily optimal, and care needs to be taken to generate and interpret MCI scores in a manner appropriate to the data at hand. MCI scores are best used when the exact variables in the total list of expected fields are well defined and transparent to the user (i.e., ideally selected from a minimum standard).
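The effect of placeholder values on a presence/absence score can be sketched as follows. This example assumes that the MCI of a record is the percentage of expected fields that hold a value; the placeholder list, the field names and the function names are hypothetical and would need to be adapted to the conventions of a particular database.

```python
# Assumed set of uninformative placeholder strings (lower-case); illustrative only.
PLACEHOLDERS = {"n/a", "na", "none", "unknown", "not available", "-"}


def is_filled(value, reject_placeholders=False):
    """True if the value counts as present under the chosen rule."""
    if value is None:
        return False
    text = str(value).strip()
    if not text:
        return False
    if reject_placeholders and text.lower() in PLACEHOLDERS:
        return False
    return True


def mci(record, expected_fields, reject_placeholders=False):
    """Percentage of expected fields that are filled in one record."""
    if not expected_fields:
        raise ValueError("expected_fields must not be empty")
    filled = sum(
        is_filled(record.get(f), reject_placeholders) for f in expected_fields
    )
    return 100.0 * filled / len(expected_fields)


# Hypothetical record in which two fields hold placeholders.
record = {
    "organism": "E. coli",
    "habitat": "unknown",
    "isolation_site": "N/A",
    "depth": "",
}
expected = ["organism", "habitat", "isolation_site", "depth"]

naive = mci(record, expected)                             # presence/absence only
strict = mci(record, expected, reject_placeholders=True)  # placeholders rejected
print(naive, strict)
```

Here the simple presence/absence score is 75.0, but it falls to 25.0 once placeholders are excluded, which shows how strongly unvalidated values can inflate a score.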
MCI scores will ideally be used to make targeted improvements to databases over time. They could also be used to track the evolution of databases and their contents: for example, to signal significant updates in content even when the total number of entries remains the same, to report progress to funders, or to reward the work of curators who contribute the relevant information. Methods that help to define the pivotal contributions of curators, and to reward their efforts on behalf of the wider community, are needed.
MCI scores could be further refined in several ways; for example, to include only fields matching certain criteria (e.g., string, number, regular expression-compliant, or curated versus calculated values), or those using terms from recognized ontologies. This would be particularly useful for judging compliance with a given standard like MIGS – since free text is not allowed, formal validation could be done using, for example, GCDML [14] (for genomics) or the ISA-Tab (multi-omic) format [15]. MCI scores could also be broken down to cover ‘required’ and ‘optional’ fields separately.
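Two of these refinements, counting only values that match a pattern and reporting ‘required’ and ‘optional’ fields separately, can be combined in a short sketch. The field names, the regular expressions and the grouping below are assumptions chosen for illustration; they are not taken from MIGS, GCDML or ISA-Tab, and a real implementation would validate against the relevant specification.

```python
import re

# Hypothetical per-field patterns; fields without a pattern only need a value.
VALIDATORS = {
    "collection_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "latitude": re.compile(r"-?\d+(\.\d+)?"),
}


def is_valid(field, value):
    """True if the value is present and matches the field's pattern, if any."""
    if value is None:
        return False
    text = str(value).strip()
    if not text:
        return False
    pattern = VALIDATORS.get(field)
    return pattern is None or pattern.fullmatch(text) is not None


def mci_by_group(record, groups):
    """Percentage of valid fields per named group (empty groups are skipped)."""
    scores = {}
    for name, fields in groups.items():
        if not fields:
            continue
        valid = sum(is_valid(f, record.get(f)) for f in fields)
        scores[name] = 100.0 * valid / len(fields)
    return scores


groups = {
    "required": ["organism", "collection_date"],
    "optional": ["latitude", "depth"],
}
record = {
    "organism": "E. coli",
    "collection_date": "spring 2009",  # present, but fails the date pattern
    "latitude": "52.2",
    "depth": "10 m",
}

scores = mci_by_group(record, groups)
print(scores)
```

In this example the free-text date is present but not counted, so the ‘required’ group scores 50.0 while the ‘optional’ group scores 100.0.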
Further refinement of MCI scores would require more thorough validation of metadata, making maximum use of mappings between minimal information requirements, recommended terminologies and any formats used. New efforts emerging from the community are laying the basis for such a multi-dimensional validation process: data standardization efforts such as the ISA Commons [16] offer common metadata tracking frameworks that can better underpin and facilitate the development of improved validation methods.
Where databases such as PRIDE [17] allow free use of controlled vocabularies to extend records (i.e., user-defined fields), the list of identifiable fields may appear disproportionately large (each term used becomes a field, making for a very sparse matrix). MCI requires adaptation for use in such data structures, but even in its basic form it can be useful for determining whether one or more core (minimum) sets of metadata can be identified (i.e., subsets of the data with MCI scores well above average).
When calculating MCI scores, it is important to consider that databases may also contain markedly different subsets (for example, delineated by technique or taxon); appropriate partitioning of records before calculation would address this.
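The value of partitioning can be seen in a minimal sketch that computes a collection-level score overall and then separately for each subset. It assumes that the MCI of a collection is the percentage of filled cells in the records-by-fields matrix; the partition key (`technique`), the field names and the function names are hypothetical.

```python
from collections import defaultdict


def collection_mci(records, fields):
    """Percentage of filled (non-empty) cells across all records and fields."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return 100.0 * filled / total


def mci_by_partition(records, fields, key):
    """Group records by the value of `key`, then score each group separately."""
    parts = defaultdict(list)
    for r in records:
        parts[r.get(key, "unspecified")].append(r)
    return {name: collection_mci(group, fields) for name, group in parts.items()}


# Hypothetical records from two techniques with different metadata needs.
records = [
    {"technique": "WGS", "organism": "E. coli", "assembly": "v1"},
    {"technique": "WGS", "organism": "B. subtilis", "assembly": "v2"},
    {"technique": "16S", "organism": "uncultured", "assembly": ""},
    {"technique": "16S", "organism": "", "assembly": ""},
]
fields = ["organism", "assembly"]

overall = collection_mci(records, fields)
by_technique = mci_by_partition(records, fields, "technique")
print(overall, by_technique)
```

The overall score of 62.5 hides the fact that one subset is complete (100.0) and the other is sparse (25.0), in part because a field such as `assembly` may simply not apply to it.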
In summary, the MCI scores individual records according to the completeness of their metadata and of their component fields, providing valuable insights into the provenance, value and cost of those records. As such, it serves as an objective and quantifiable metric for metadata capture and highlights the scholarly work required to develop curated collections [18]. We look forward to the time when other databases utilize MCI scores, as this will also provide a basis for qualitative comparison between these resources.