Part 3 Specific recommendations
Note: Reproducible and open scientific practices and tools are constantly evolving. The principles outlined in this manifesto are designed to be as timeless and generic as possible. However, specific recommendations on individual tools, services, and workflows are likely to be modified and updated over time.
In this section we discuss the rationale for our specific recommendations and compare some of the options.
3.1 Recommended tools and services to use
Open science encompasses a vast and continuously increasing number of diverse tools and services. This encouraging growth indicates that open science is actively evolving and that there is a rich network of people and organizations devoted to improving current scientific practices. A downside to this abundance is that it can act as a barrier for researchers who want to work more in the open. In particular, the range of tool choices and the lack of guidance on what to use risk overwhelming and discouraging researchers seeking to open up their workflow for the first time.
To address this problem, the ROS framework provides heavily opinionated recommendations on open tools, workflows, and services. Below is a brief summary of the specific recommendations we make, followed by more detailed explanations and comparisons between tools, services, and workflows.
3.1.1 Summary list
- File management and version control: Git, combined with GitHub or GitLab
- Statistical and/or programming language: R or Python
- For writing documents: Pandoc Markdown (e.g. R Markdown)
- Analytic platform: RStudio (for R) or JupyterLab (for Python)
- Writing platform: RStudio
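- Dissemination for getting a DOI and for discoverability: Zenodo (code and project files), figshare (posters and slides), and preprint archives such as bioRxiv, PeerJ Preprints, or OSF Preprints (manuscripts)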
- All activities: For R projects, preferably everything is done in RStudio. See the workflow section below for more detail. For Python projects the environment is a bit more complicated and we are still thinking through how it would look.
3.1.2 Detailed explanations
3.1.2.1 Project management phase
Project management is an often overlooked and underappreciated aspect of many research projects. Part of project management is managing and organizing files and folders, keeping track of changes to the files and documents, and doing common tasks used in any typical research project (e.g. spellchecking of documents, autocompletion of code, and easily searching through all files in a project).
For the main tasks, we recommend using RStudio for projects using R and JupyterLab for projects using Python. Both are open source, value and contribute to open source and openness in general, are well maintained and documented, and widely used. RStudio in particular has a very nice integration with multiple ROS aspects, such as for R Markdown, project management, and Git integration. While there are several other similar tools to use when working with Python (e.g. PyCharm or Spyder), RStudio is really the only candidate tool when working with R as no other tool compares for a variety of reasons.
TODO: Add explanation on why to choose JupyterLab over others.
For tracking changes to files, instead of doing this manually by saving
different versions of documents, we recommend using
version control software to automate and simplify this step of the workflow.
Git is the most established version control software in the open source,
software development, and package development world. It is widely used in
the ROS community. It allows you to annotate the changes you make and later
revert back to any point in the file’s history. RStudio has an excellent
interface to Git. Using version control also focuses the project into a single
folder structure and emphasizes the production of a final “product” from the
coding and writing. It also facilitates dissemination and publication (discussed
later). Use tools with Git integration such as those found in RStudio for R
projects. With Python projects, while Git integration is more varied, we recommend
the Git extension for JupyterLab and the Git plugin for PyCharm.
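The annotate-and-revisit cycle described above can be sketched with plain Git commands driven from Python. This is only an illustration under stated assumptions: the file name `manuscript.md`, the commit messages, and the placeholder author identity are hypothetical, and the sketch assumes the `git` command-line program is installed. In day-to-day work the same steps would normally be done through the Git interface in RStudio or JupyterLab.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its standard output."""
    cmd = [
        "git",
        # Placeholder identity so the sketch runs without any global Git config.
        "-c", "user.name=Example Researcher",
        "-c", "user.email=researcher@example.org",
        "-c", "commit.gpgsign=false",
        *args,
    ]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout


def demo_history() -> tuple[list[str], str]:
    """Commit two versions of a file, then read back the history and the first version."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "--quiet")
        draft = repo / "manuscript.md"  # hypothetical file name

        # First version, saved with a message describing the change.
        draft.write_text("# Draft\n\nFirst version of the introduction.\n")
        git(repo, "add", "manuscript.md")
        git(repo, "commit", "--quiet", "-m", "Add first draft of introduction")

        # Second version overwrites the file; no copies with new names are needed.
        draft.write_text("# Draft\n\nRevised introduction.\n")
        git(repo, "add", "manuscript.md")
        git(repo, "commit", "--quiet", "-m", "Revise introduction")

        # The annotated history, newest commit first.
        messages = git(repo, "log", "--format=%s").splitlines()
        # The file as it was one commit ago, recovered from the history.
        first_version = git(repo, "show", "HEAD~1:manuscript.md")
        return messages, first_version


if __name__ == "__main__":
    if shutil.which("git"):
        messages, first_version = demo_history()
        print(messages)
        print(first_version)
```

The working folder only ever contains one `manuscript.md`; earlier versions live in the repository history and can be inspected or restored when needed.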
TODO: Include a table of comparisons here and how each matches the principles.
3.1.2.2 Data analysis phase
Scientific analyses are frequently conducted in spreadsheet software such as Microsoft Excel, Google Sheets, or LibreOffice Calc. These tend to be the default choice because people are used to them, but none of these offer the transparency or reproducibility necessary for scientific analyses. This is partly because they often rely on proprietary source formats which cannot be read without access to specific software. This means that if the vendor stops supporting that software, the analysis results become inaccessible. This is discussed more in the writing section.
Specifically for data analysis, using code to write down every step of the analysis increases transparency and reproducibility of the results. There is no longer any need to email the authors of a paper to ask questions about the analysis process, as the code should detail how the results were obtained from the raw data. The analysis script files used in a scientific output should be contained in the same repository as the scientific output (e.g. manuscript, poster, slides). This is important both in a collaborative setting and for individuals revisiting analyses conducted at an earlier stage. Instead of remembering the order of buttons clicked in graphical software, having the analysis written down in a code recipe ensures that no manual errors are introduced and the same results are easily obtained by running the code again. This also saves significant amounts of time since it allows the automation of analyses so that when new data come in, the same code can be run on that data with just one command instead of having to redo every step of the analysis (as is often the case for spreadsheet software).
Many state of the art data analysis packages are quickly made available in open source coding languages, often by a scientist, researcher, or other community member who initially developed or used the analysis procedure. In contrast, proprietary software (both coding languages and graphical programs) often lags behind in the implementation of new algorithms and prevents contributions from community members, and when new functionality becomes available it is often only accessible after paying for the latest version or an additional toolbox. Proprietary software also has the disadvantage of being closed source, which then excludes it from being part of a ROS workflow.
There are many programming and statistical computing languages available, both open source and proprietary. However, of them all we recommend using R and Python. Both languages are open source, have active communities, are working at being more welcoming and inclusive, have very well developed packages and extensions for all types of analysis projects, are well maintained and documented, are (mostly) readable, are widely used in the scientific community, and are the two most widely used languages in the world for data science. The R community in particular is very active in addressing and working to fix equity and fairness issues such as including and welcoming under-represented groups. Some examples from both communities include the R-Ladies, PyLadies, and NumFOCUS initiatives.
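As a minimal sketch of what a "code recipe" can look like, the Python script below computes a per-group summary from raw data using only the standard library. The column names (`group`, `score`), the tiny inline dataset, and the function name `summarize` are hypothetical placeholders chosen for illustration; a real project would read the raw data file stored in the repository.

```python
import csv
import statistics
from typing import Iterable


def summarize(lines: Iterable[str], group_col: str, value_col: str) -> dict:
    """Return the sample size and mean of `value_col` for each level of `group_col`."""
    groups: dict[str, list[float]] = {}
    for row in csv.DictReader(lines):
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {
        name: {"n": len(values), "mean": statistics.mean(values)}
        for name, values in sorted(groups.items())
    }


# Hypothetical raw data, inlined so the sketch is self-contained.
RAW = """group,score
control,4.0
control,6.0
treatment,7.0
treatment,9.0
"""

if __name__ == "__main__":
    # In a real project, pass an open file instead, for example:
    #     with open("data/raw.csv", newline="") as f:
    #         print(summarize(f, "group", "score"))
    print(summarize(RAW.splitlines(), "group", "score"))
```

Because every step is written down, rerunning the script on an updated data file repeats the whole analysis without any manual clicking.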
TODO: Include a table of comparisons here and how each matches the principles.
3.1.2.3 Writing phase
Scientific writing is often done in word processors such as Google Docs or Microsoft Word. These tend to be the default choice because people are used to them or because they learned them first (since they are default programs on computers). However, these types of programs do not offer transparency in the writing process. This is because they are based on proprietary source formats that cannot be easily read and that require their respective programs to write in them. If a vendor decides to stop supporting that format, or if a researcher’s institution can’t afford a license, the text of that document will be inaccessible.
More commonly, if one finds a document written using an older version of the
software, there is no guarantee it can be opened in
the new version of the software. Opening the same document in different
versions of the same software or on different computers could render different
results (such as when opening a Windows PowerPoint presentation on a Mac).
Documents can only be opened by people who can afford to purchase the products
sold by the vendor. Storing either data or manuscripts in such formats means
that they can be lost forever or could be inaccessible to certain groups of
people. In contrast, writing in an open, text-based source format means that
the document can be opened by anyone with access to a computer or mobile device.
Open, text-based formats are commonly referred to as plain text documents.
Although plain text itself cannot be formatted into headings, bold font, etc.,
the addition of text-based markup, such as
symbols surrounding a word to mark it as bold,
enables text editors to display plain text as formatted documents. There are
several plain-text “markup languages”, such as LaTeX or HTML, but many of
these have verbose markup that makes them inefficient to type and difficult to
learn. Markdown is a markup language that was designed from markup
conventions used over email so it is simple to learn and easy to type. A flavor
of Markdown called Pandoc Markdown is specialized for scholarly communication
and supports features required in scientific writing such as automatic figure
referencing, in text citations, and bibliography insertion (including plain
text formats such as BibTeX). Pandoc Markdown documents can also be
converted to a large range of output formats, including Word documents,
beautifully typeset LaTeX PDFs, or web friendly HTML files. R Markdown is
an extension of Pandoc Markdown that allows R and Python code to be executed
within and inserted into a document, increasing document-level reproducibility.
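To make the description above concrete, here is a small sketch of a Pandoc Markdown source and the Pandoc commands that would convert it. Everything specific in it is an assumption for illustration: the file names, the citation key `example2020`, and the bibliography file `references.bib` are hypothetical placeholders. The sketch only builds the commands rather than running them; `--citeproc` is the citation-processing option in Pandoc 2.11 and later, and PDF output additionally needs a LaTeX installation.

```python
# A hypothetical manuscript: a YAML header, a heading, bold text, and a citation.
MANUSCRIPT = """\
---
title: "Example manuscript"
bibliography: references.bib
---

# Introduction

Open formats keep documents **accessible** [@example2020].

# References
"""

# Pandoc infers the output format from the file extension.
OUTPUTS = ("manuscript.docx", "manuscript.pdf", "manuscript.html")


def pandoc_command(source: str, output: str) -> list[str]:
    """Build the Pandoc call that converts `source` into `output`."""
    return ["pandoc", source, "--citeproc", "-o", output]


if __name__ == "__main__":
    print(MANUSCRIPT)
    for output in OUTPUTS:
        # Print the commands; run them in a terminal where Pandoc is installed.
        print(" ".join(pandoc_command("manuscript.md", output)))
```

The same plain-text source feeds all three outputs, so the document is written once and the formatting is left to Pandoc.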
Since Markdown is just plain text, changes can be easily tracked using Git and collaboration can happen on GitHub or GitLab. There are also promising online text editors emerging which support Markdown with track changes to ease the transition for people used to conventional word processors, e.g. Authorea and Stencila. Taken together, the Markdown format is an open plain text format that is accessible and usable on all operating systems, has an active community of users, is well maintained and documented (e.g. the Pandoc Markdown manual or the R Markdown Book), can be converted into a wide range of document types (see the Pandoc Markdown about page for examples), is designed for simplicity and readability, and has flavors dedicated to scholarly communication.
3.1.2.4 Dissemination phase
As mentioned in the introductory paragraph, publishing findings under an open access license increases the exposure the research gets. Importantly, data and the analytic code used should also be made openly and publicly available, to fulfill the three components of open science (open data, open source, and open access). When a manuscript, slide, or poster has been finalized and is ready to be published or presented, a few steps should be taken.
There are several aspects to the dissemination phase of a project, including discoverability and “archival” status, meaning there is some historical “timestamp” to your scientific output.
The key to creating a “timestamp” is obtaining a Digital Object Identifier, or DOI. When a DOI is created for a research “object” or output like a manuscript or poster, it is archived and findable by the DOI string. This maintains a historical record and allows the output to be easily citable. All research output, such as data, code, and manuscripts, should get a DOI.
Given that the project will be in a Git repository, obtaining a DOI for the code and project files is easy with Zenodo, especially with GitHub’s Zenodo integration. For posters and slides, figshare is the recommended service for obtaining a DOI. For manuscripts, preprint archives like bioRxiv, PeerJ Preprints, or OSF Preprints are recommended for obtaining a DOI. The advantage of using preprint archives is that they allow the manuscript to be easily discoverable, as almost all of these services are indexed in Google Scholar.
NOTE: Of the preprint archives, currently only OSF has an API.
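The idea that an output is "findable by the DOI string" can be illustrated with a short sketch that turns a DOI into a link through the `https://doi.org/` resolver. The helper name `doi_url` and the example DOI are hypothetical; the DOI shown is a made-up placeholder under Zenodo's `10.5281` prefix, not a real record.

```python
DOI_RESOLVER = "https://doi.org/"

# Common ways a DOI is written in reference lists and on web pages.
KNOWN_PREFIXES = ("https://doi.org/", "http://dx.doi.org/", "doi:")


def doi_url(doi: str) -> str:
    """Normalize a DOI string and return the resolver link for it."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10.".
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + cleaned


if __name__ == "__main__":
    # Placeholder DOI for illustration only.
    print(doi_url("doi:10.5281/zenodo.0000000"))
```

Citing the resolver link rather than a repository or publisher URL means the reference keeps working even if the hosting location changes.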
TODO: Expand on the reasoning for using Zenodo, figshare, and the others
TODO: Decide if bioRxiv is best solution given lack of API, unlike OSF and other preprint archives
TODO: Check into best solutions for posters/slides
3.2 Recommended workflow and processes
3.2.1 Summary list
TODO: Add list here
3.2.2 Detailed explanations
TODO: Add explanations here
3.2.2.1 For R projects
TODO: Complete this section. It is currently INCOMPLETE!