Final Project
Schedule
Milestone | Due date |
---|---|
Proposal | Apr 25, 4:30 PM |
Draft Submission | May 28, 4:30 PM |
Peer Review | Jun 1, 1:30 PM |
Final Submission | Jun 7, 11:59 PM |
Instructions
The final project should produce a computational narrative in R markdown which answers a relevant research problem using methods introduced in this course. A computational narrative is a document which combines prose, code, and visualizations to explain and communicate scientific results. Unlike a traditional article, the code and data producing the results are transparently presented in the document itself. This course requires a computational narrative instead of an article in order to focus students’ attention on the methodology, computation, and reproducibility issues which this course should promote.
The submission will consist of an R markdown document compiled into either a PDF or, if the document contains interactive material, an HTML file, together with a repository containing the data and code necessary to reproduce the analysis. The details of this process are described in the final project details.
The computational narrative should be no longer than a research note in a political science journal, 5,000 words of text (excluding code).
The computational narrative should be self-contained and written for a general social science audience.
The computational narrative should effectively and informatively communicate its research design, methods, and contribution using text, code, and visualizations. The computational narrative should include no unnecessary material.
The computational narrative must address a research problem or question using an appropriate methodology and design, incorporating methods introduced and discussed in this course.
As with any project worked on during graduate school, the author should use it to further research that could lead to a publication. However, given the time constraints of the quarter, it is recognized that it may not be possible to develop a novel contribution to a body of knowledge. Thus, projects will be primarily assessed on the appropriate application of methodology while taking the question as given.
Replications of existing papers are acceptable provided that the author conducts their own analysis.
It is acceptable to use this work in other courses.
It is possible to collaborate on this assignment, but it requires pre-approval of the instructor.
The computational narrative should not include a lengthy discussion of previous literature. But if it makes an original contribution it should clearly identify a gap in this literature and state the original contribution.
Citations should use the citation capabilities in R markdown; see the R markdown documentation.
Projects will not be judged on their ability to achieve statistical significance. They will be judged on the appropriateness of the research design and methods applied to the chosen research question. The results that arise from appropriate use of design and methods will not be held against the researcher.
The use of visualization to convey results is preferable to tables or the printed output of code.
The use of asterisks or symbols to represent statistical significance is discouraged. Tables should include standard errors rather than \(z\)-scores, \(t\)-scores, or \(p\)-values. Tables of regression results should be nicely formatted and selective.
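As a rough illustration of these reporting guidelines, the sketch below builds a selective results table that keeps estimates and standard errors and then draws the slope coefficients with approximate 95% intervals. It uses base R and the built-in `mtcars` data purely as a stand-in; the model, the variable names, and the 1.96 multiplier are assumptions made for the example, not a prescribed workflow.

```r
# Illustrative only: mtcars is a built-in R dataset, not course data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Keep estimates and standard errors; leave out t-values and p-values.
coefs <- summary(fit)$coefficients
results <- data.frame(
  term = rownames(coefs),
  estimate = unname(coefs[, "Estimate"]),
  std_error = unname(coefs[, "Std. Error"])
)

# Approximate 95% intervals (estimate +/- 1.96 standard errors) for plotting.
results$lower <- results$estimate - 1.96 * results$std_error
results$upper <- results$estimate + 1.96 * results$std_error

# Dot-and-interval plot of the slope coefficients using base graphics.
slopes <- results[results$term != "(Intercept)", ]
rows <- seq_len(nrow(slopes))
plot(slopes$estimate, rows,
     xlim = range(slopes$lower, slopes$upper, 0),
     ylim = c(0.5, nrow(slopes) + 0.5),
     yaxt = "n", pch = 19,
     xlab = "Coefficient estimate", ylab = "")
axis(2, at = rows, labels = slopes$term, las = 1)
segments(slopes$lower, rows, slopes$upper, rows)
abline(v = 0, lty = 2)
```

When coefficients are on very different scales, as they are here, it may be clearer to rescale the predictors or to plot each coefficient in its own panel.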
Authors reporting that results are “statistically significant” should use the 0.05 level or lower. Results at the 0.10 level should not be referred to as “statistically significant”.
However, the strength of evidence necessary to justify findings cannot be captured by any single criterion, such as the conventional 0.05 level of statistical significance. Relevant considerations include “a range of criteria beyond statistical significance, including substantive significance, theoretical aptness, the importance of the problem under study, and the feasibility of obtaining additional evidence” (APSR). See the ASA Statement on p-Values for their appropriate use.
Equations and formulas are important for the presentation of statistical arguments. Authors should make the mathematical presentation as clear as possible. Clear and consistent notation and formatting of equations should be used. All symbols used in equations need to be clearly defined. To ensure readability of the paper, authors should choose a notation that makes the argument as easy to follow as possible. Equations are part of the text and thus they should contain appropriate punctuation. Equations should be numbered consecutively, with sub-numbering used as appropriate, e.g. equations 1a and 1b.
Mathematical notation should use the appropriate R markdown math capabilities. See this for an introduction.
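A minimal sketch of how the equation guidelines above might look in an R markdown document is given below. The model and the symbols are hypothetical and chosen only for illustration; `\tag` is one possible way to produce sub-numbered labels such as 1a and 1b, and it is assumed here to render in both PDF and HTML output.

```latex
The first specification is
$$
y_i = \alpha + \beta x_i + \varepsilon_i , \tag{1a}
$$
where $y_i$ is the outcome for unit $i$, $x_i$ is the predictor,
$\alpha$ and $\beta$ are coefficients, and $\varepsilon_i$ is the error term.
Adding a control variable $z_i$ with coefficient $\gamma$ gives
$$
y_i = \alpha + \beta x_i + \gamma z_i + \varepsilon_i . \tag{1b}
$$
```

Each displayed equation ends with the punctuation the surrounding sentence requires, and every symbol is defined where it first appears.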
To the extent possible, variable names should be readable and clearly denote what the variable is.
Provide a clear description of the data and the context necessary to understand the data. A non-expert reader should be able to understand the meaning and scale of the relevant variables and important features of the analysis.
Advice
As in any graduate course, strive to produce analysis that could lead to a publishable article. [1]
The main focus of the computational narrative should be on data, methods, and results. Justify your modeling choices with reference to theory. Present findings in terms any intelligent person could understand, regardless of their statistical knowledge. This should not limit the sophistication of your methods. It does require you to explain results from complicated models in approachable terms.
Do not spend too much time on literature reviews or theory, but do not neglect hypothesis building. Hypotheses can be clearly explicated without recourse to numbered lists. By the time the reader reaches the results, they should know what to expect, what would be surprising, and why.
- Research that asks substantively important, interesting, novel, or controversial questions is better (potentially much better) than research that does not, all else equal.
- Papers that explain their empirical findings in ways non-specialists can understand are better than papers that do not, all else equal.
- Model specifications informed by test statistics, substantive knowledge, and theory are better than model specifications based solely on test statistics, all else equal.
Reproducibility
- Code must run without error and successfully replicate.
- The repository must contain a README.md file which describes the contents of the repository, the software and data dependencies, installation instructions, and instructions on how to reproduce the analysis.
- If the analysis requires data too large to include in the repository, instructions on how to obtain the necessary data must be included. If possible, a script to automate the process is preferred.
- If the analysis uses data that cannot be distributed for privacy or intellectual property reasons, this must be documented, and the student must discuss it with the instructor prior to submission.
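For data too large to keep in the repository, a small download script is one way to automate retrieval. The sketch below is only an illustration in base R: `fetch_data` is a hypothetical helper name, and the URL and file names in the example call are placeholders rather than a real data source.

```r
# Hypothetical helper: download a data file only if it is not already present.
fetch_data <- function(url, dest) {
  if (!file.exists(dest)) {
    # Create the destination directory if needed, then download in binary mode.
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  # Return the local path invisibly so it can be passed to a reader function.
  invisible(dest)
}

# Example call (placeholder URL and file name, shown as a comment):
# fetch_data("https://example.org/large-data.csv",
#            here::here("data", "large-data.csv"))
```

Skipping the download when the file already exists keeps repeated runs of the analysis fast and avoids unnecessary requests to the data provider.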
Organization
- The repository must be organized as an R project.
- Code should follow the tidyverse style guide. You can use the lintr and styler packages to check and format your code.
- To the extent possible, good scientific practices should be followed, as outlined in [Good enough practices in scientific computing](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510).
- Do not use absolute paths in the code. See this post for a discussion of this. Use the here package to
- Do not use install.packages in the computational narrative. Instructions on installing dependencies should go in the README or a separate script.
- Include a code chunk at the top of the R markdown document which loads the packages which will be used.
- Include a code chunk at the end of the R markdown document which includes information from sessionInfo() to document the original computational environment.
- Do not include unnecessary output (e.g. extraneous print statements) or messages. Each piece of code and output in the computational narrative should serve some purpose toward communicating the research.
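The organization rules above can be sketched roughly as follows. This is a minimal illustration, not a required template: it assumes the here package is already installed, and the directory and file names (`data`, `survey.csv`) are hypothetical.

```r
# Top of the document: load every package the analysis uses.
# Assumes the packages are already installed; installation belongs in the README.
library("here")

# Build paths relative to the project root instead of hard-coding absolute paths.
# "data" and "survey.csv" are hypothetical names used only for illustration.
survey_path <- here("data", "survey.csv")

# ... analysis code would go here ...

# End of the document: record the R version and the loaded packages.
sessionInfo()
```

In an R markdown document, the first and last pieces would normally sit in their own code chunks, with the analysis chunks in between.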
GitHub and Submission Workflow
To get started on the project:
- Fork https://github.com/UW-POLS503/pols-503-2018-projects to your own account, e.g. https://github.com/jrnold/2018-503-208-projects.
- Edit README.md as appropriate, removing instructions and adding your name and the title of your project.
Work on the project within your fork.
Assignments will be submitted by opening an appropriately named issue and assigning it to the instructors.
Computational Narratives
A computational narrative is a document that combines text, code, and code output to communicate a data analysis or scientific research. R markdown is one program that can be used to produce computational narratives. Even if students are familiar with statistics and programming, they may be less familiar with computational narratives than with the format of journal articles and books.
The following references provide some discussion of computational narratives. Many of these reference Jupyter, a similar, Python-based tool for producing notebooks.
- The state of Jupyter: How Project Jupyter got here and where we are headed
- Fernando Perez and Brian E. Granger (2015) Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science
- Programming, meh … Let’s Teach How to Write Computational Essays Instead
- What is a Computational Essay?
- “Literate computing” and computational reproducibility: IPython in the age of data-driven journalism
Since students may not have seen computational narratives, the following examples are provided. Most of them are produced by R markdown. Some are produced by the Python program Jupyter. These are not meant to be definitive guides for how to structure your computational narrative. Many of these were produced for different purposes than the project for this course. However, they do provide examples of effectively combining text, code, and output to communicate data analysis.
- Stan Case Studies are generally well formatted examples of this style.
- Getting Started with GDELT
- Brian Keegan, The Need for Openness in Data Journalism (python/Jupyter)
- David Robinson, Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half
- The Case Studies in The Tidy Text Package, https://www.tidytextmining.com/twitter.html.
- Words Growing or Shrinking in Hacker News Titles: A Tidy Analysis
- Gender and Verbs across 10,000 stories: a tidy analysis
- Broockman, David, and Kalla, Joshua, and Aronow, Peter. 2014. Irregularities in LaCour (2014). This is an unusual document that provides evidence of research fraud in the LaCour and Green paper. However, note how it uses knitr and R to do so. The paper is making an argument, with points supported by figures and results, with the code that produced them also visible.
- Mapping the GDELT data and some Russian protests too
[1] This advice is derived with minimal editing from Chris Adolph, Writing Empirical Papers: 6 Rules & 12 Recommendations.