April 2, 2018
The new NIH Strategic Plan for Data Science aligns strongly with many areas of emerging interest within the scientific community and within the membership of the American Society for Cell Biology (ASCB). The current challenges facing Data Science revolve around several points, including:
- Data infrastructure and the need to modernize the data ecosystem to handle large amounts of genetic data, microscopy images, and a variety of other scientific data generating large files
- Data management, analytics, and tools to create and maintain publicly available and easily accessible databases for the purposes of sharing published data
- Workforce development to train scientists on data science analysis platforms and techniques
The NIH Strategic Plan for Data Science seeks to address these concerns to make data science more accessible to collaborators, other researchers, and the general public. This Plan offers many valuable suggestions to guide scientific research forward. The ASCB would like to provide feedback and recommendations on some specific points of the plan particularly relevant for the broad cell biology community, as addressed below.
- To address requirements for data storage, NIH recommends (Goal 1) Support a Highly Efficient and Effective Biomedical Research Data Infrastructure. In Objective 1-1, NIH recommends leveraging current capabilities within the private sector “either through strategic partnerships or procurement, to create a workable Platform as a Service environment.” While we agree that private sector capabilities in cloud computing and data infrastructure are strong, ASCB is reminded of the vast capabilities already available to NIH. Since 1988, the National Center for Biotechnology Information (NCBI) has developed and hosted a variety of biomedical and genetic information and processing tools, including BLAST (protein/gene search and alignment), Genome (searchable genome sequences), PubMed (online repository of citations for scientific publications), and PubMed Central (linked to PubMed, containing full-text primary scientific literature). Since 2005, under the NIH Public Access Policy, all scientific peer-reviewed publications generated using NIH funds have been deposited in PubMed Central, a policy that has since extended across several funding sources. Given these requirements, currently over 2100 scientific journals, as well as 4.7 million individual articles, are accessible in PubMed Central, resulting in its serving as a de facto repository for almost all scientific publications since 2005. Given the strength of the already-existing NIH biotechnology and computing technology within NCBI, the ASCB recommends NIH continue to leverage interactions with the private sector, while hosting and maintaining the technology within NCBI on government servers.
- Beyond data storage, NIH also recommends in Goal 2 to Promote Modernization of the Data-Resources Ecosystem. Current requirements for data storage vary across ICs, as do funding mechanisms to develop, host, and maintain these resources. Objective 2-2, Support the Storage and Sharing of Individual Datasets, recommends expanding NIH Data Commons to allow submission, open sharing, and indexing of individual, FAIR datasets. Although PubMed Central serves as a repository of scientific publications, it often remains very challenging to access raw datasets (e.g., genomic data and raw microscopy images) via PubMed’s links to scientific publications and “Supplementary Datasets” on the scientific journal’s website. Given the trend towards online-only journals, the ASCB recommends that NIH develop this Objective to provide an easy and rapid ability to submit raw data and supplementary data in a universal and easily accessible/searchable format. To fully implement these search capabilities and expand the repository capabilities of NIH Data Commons, once developed, the ASCB recommends encouraging authors to make all supporting Supplementary data available in publications receiving NIH funding support. Further, within Objective 2-2, NIH recommends promoting the use of NIH Common Data Elements Repository. When developing this Repository, the ASCB recommends developing an intuitive interface to catalog and store raw data utilized in publications, providing a centralized repository for raw/native image and data formats, rather than presenting only a subset of information in figures within primary literature. Given the potential burden (time, cost, and storage), the ASCB recommends continuing to examine this proposal, and first identifying an intuitive, rapid, and simple solution to upload and catalog these data files prior to implementing this Goal.
- To efficiently modernize and maintain large Data Science files, NIH seeks in Goal 3 to Support the Development and Dissemination of Advanced Data Management, Analytics, and Visualization tools. Specifically, within Objective 3-1, NIH will Support Useful, Generalizable, and Accessible Tools and Workflows.
In Objective 3-1, NIH states, “Historically, because data resources have generally been funded through NIH research grants, applicants have emphasized development of new tools in order to meet innovation expectations associated with conducting research. This strategy can shift the focus of data resources away from their core function of providing reliable and efficient access to high-quality data. In addition, coupling review and funding of data resources to tool development can inhibit the type of open competition among developers that allows support of the most innovative and useful tools.” These requirements will become critical as data science continues to expand. The ASCB agrees with this strategy, and recommends identification of specific funding mechanisms for these databases, as well as continued annual appropriations for the maintenance and growth of these networks. Current support of University-level data systems often requires PI-level support from their R-level grants, and the ASCB recommends that appropriated funds not be supplemented from individual researchers’ grant support on either the University or National level.
- In recognizing the importance of Data Science to the scientific community, Goal 4 proposes to Enhance Workforce Development for Biomedical Data Science. In this goal, NIH and the scientific community “recognize that data scientists perform far more than a support function, as data science has evolved to be an investigative domain in its own right.” To address the importance of data scientists, Objective 4-1 acknowledges NIH will begin to “develop training programs for its staff to improve their knowledge and skills in areas related to data science.” While we appreciate the importance of data scientists within the intramural NIH workforce, the ASCB recommends expanding this training environment for data scientists to provide support for both intramural and extramural scientists. NIH has piloted programs along these lines in the past, including the BEST (Broadening Experiences in Scientific Training) program, initially piloted at 17 universities to explore ways to broaden and improve the biomedical career environment. Although tailored to provide opportunities to explore careers outside academia, these programs often provided graduate students in the biomedical sciences with their first exposure to bioinformatics. The ASCB recommends expanding on the BEST initiative to promote specific training in advanced bioinformatics and biostatistics at research universities, either through specialized “certificate” programs or electives for graduate students. Through these programs, graduate students will learn the skills required to successfully perform bioinformatics and data science analysis, providing both NIH and the extramural workforce with scientists who have the skills and resources necessary to combine data scientist and “wet lab” researcher into one individual.
- Goal 2: Promote Modernization of the Data-Resources Ecosystem, and its Objective 2-1 Modernize the Data Repository Ecosystem, may have the largest impact on the ASCB Membership.
Many ASCB members work with one or more model organisms. Over the last 20 years, funding agencies have invested a significant part of their budgets in large-scale research endeavors, from high-throughput RNAi or mutagenesis screens, to genomic sequencing, RNASeq, protein structure predictions, or biochemical characterization of protein:protein interactions. The sheer wealth of data and publications created by these approaches makes it extremely challenging and time-consuming for an individual researcher to aggregate and connect all this information by relying solely on the original literature. One key way NIH and the community addressed these challenges is by creating dedicated model organism databases, funded in part by the NIH, ranging from FlyBase to Xenbase to WormBase to ZFIN. These sites are essential to the work of ASCB members and the broader cell biology community, providing centralized and easy access to diverse data such as:
- Time/place of gene expression based on microarray, RNASeq, and fluorescence in situ hybridization data
- How many splice variants and potential orthologues exist
- What the loss of function phenotype is in various stages of development
- Known and predicted genetic and protein interaction partners, from large scale interactome screens
- Sources of resources such as RNAi constructs, mutant lines, cDNA clones, tagged versions of protein
- A list of all publications and meeting abstracts referencing a particular gene
- Information about predicted conserved protein domains and sequence homology
Like the Model Organism Stock Centers, these centralized databases serve many thousands of NIH-funded investigators across the nation, and thus are an extremely cost-effective approach. To garner this information compiled from hundreds of different primary sources without such a database would take days instead of mere minutes, wasting valuable time and NIH-funded salaries. They also provide an essential source of information for researchers from smaller academic institutions in the US or across the globe that may not have access to all the necessary research publications containing the primary data.
Objective 2-1 makes a distinction between “Databases” and “Knowledgebases”. The existing model organism databases currently are a complex amalgamation of Data, Knowledge and Tools, reflecting their origins in the 1980s. The Draft Plan (p. 4) notes that “Historically, NIH has often supported data resources using funding approaches designed for research projects, which has led to a misalignment of objectives and review expectations…Funding for tool development and data resources has become entangled, making it difficult to assess the utility of each independently and to optimize value and efficiency.” It also notes (Objective 3-1) that “Historically, because data resources have generally been funded through NIH research grants, applicants have emphasized development of new tools in order to meet innovation expectations associated with conducting research. This strategy can shift the focus of data resources away from their core function of providing reliable and efficient access to high-quality data. In addition, coupling review and funding of data resources to tool development can inhibit the type of open competition among developers that allows support of the most innovative and useful tools.”
A thoughtful revision that helps solve these challenges is thus very timely, but needs to respect certain key objectives, to best serve and avoid unintended harm to the communities served.
- Transitions need to be implemented over time, ensuring that key resources will remain available through the transition. The goals of any changes should be transparent and communicated clearly to the community. For example, the recent announcement by NIH/NHGRI suggesting 15-20% cuts in the funding of FlyBase was not accompanied by a sufficiently clear explanation to the community of the goals of this effort.
- As the model organism databases are updated/restructured, it is important to ask the community what it wants and needs and use that to drive re-allocation of resources to the most useful data/knowledge/tools, and to continually track usage of different parts of the data/knowledge bases to determine which are most useful to the community.
- Separating funding for core data, curated knowledge and tool development will allow reviewers to assess each component and each team on their own strengths, and thus would be a step forward. However, ultimately the user should see a seamless interface in which all of these components are integrated, helping achieve one of the core goals of the Strategic Plan (p. 4): “The current data-resource ecosystem tends to be ‘siloed’ and is not optimally integrated or interconnected.”
- Currently, the different model organism data/knowledge bases are “siloed”, without a common structure or easy ways to interconnect knowledge across models. A revision should address this missed opportunity to adopt best practices across platforms, which currently hampers outside tool developers as they need to adapt their tools to the quite different architectures of different data/knowledge bases.
- Finally, one goal of the Draft Plan is to “give citizen scientists access to appropriate data, tools, and educational resources.” It will be important to develop funding models and standards that ensure that no data or knowledge are sequestered behind paywalls, and to seek to ensure that newly developed tools are available to all.
As NIH continues to develop its Strategic Plan for Data Science, the ASCB stands ready to provide feedback and share our views with you in more detail. Thank you for your consideration of this emerging and important topic.