CS66 Final Project Proposal

Due: Friday, April 19 at 11:59pm


Overview

The goal of the final project is to gain experience with the entire scientific process: motivation, hypothesis development, data collection/cleaning, algorithm development/application, evaluation of results, interpretation, and conclusion. Below I’ll describe the main components your project should include and provide a few dataset resources. You are welcome to choose a non-data-driven project (options below), but make sure your proposal is very thorough about what you will do in this case.

This document describes the project proposal and getting started on the project - there will be separate instructions for the oral presentation and exactly what to submit at the end of the semester. The project is worth 15% of your overall grade.

Timeline and Logistics

For the oral presentations, each student will have roughly 4-5 minutes to speak (plus time for questions). This will be scaled for group size.

I would encourage you to work in pairs, but if you would prefer to work individually or in a group of 3, that is fine. Project expectations will scale linearly with the number of people in your group. You are welcome to work across lab sections (but lab will still be required during the last three weeks of classes, and it would be ideal if both people could come to the same lab).

I am not doing formal random partners for the project, but if you would like to be matched with someone, send me an email and let me know if there were partners throughout the semester that you worked particularly well with.


Proposal

The goal of the proposal is to help you start working on your final project and assembling the different resources you want to use (literature, software, data, etc). Broadly, your project should include:

  1. A dataset and a goal (e.g. phoneme identification from audio signals)
  2. An algorithm or set of algorithms you will develop and/or apply to this dataset
  3. A scientific question you are trying to answer (e.g. “Will SVMs or neural networks perform better on my dataset?” or “How will pre-processing a dataset or subsampling features affect the results?”)
  4. A way to evaluate and interpret the results
  5. References

For the proposal, briefly outline each of these sections (details below) in a 1-page PDF document. Submit your proposal to me by email and cc all group members. I will provide feedback (and make your group a git repo) in the order I receive proposals. Include a title and all partner names. If you are working in a group of 3, also include a workload breakdown (i.e. what each person will do).

1. Dataset and Goal

What dataset will you use for this project? Regardless of what you’re interested in, there is probably a dataset that is related. Please be as specific as possible - don’t assume your ideal data is available until you’ve actually downloaded and viewed it. Include a reference/link to your data in your proposal. Here are a few databases to get started. Note that I work mostly with biological data and do not have direct experience with most of these datasets. After you download the data, make sure that you can actually view it and that you understand the features and labels (if applicable). What is n (the number of examples)? What is p (the number of features)? Make sure the dataset includes enough examples that you could reasonably learn from it (I would say n<1000 would be concerning).
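As a quick sanity check after downloading, something like the following sketch can report n and p. It assumes a CSV file with a header row, one example per row, and the label in the last column; your data may well be organized differently. The function name summarize_csv and the tiny inline sample are made up for illustration.

```python
import csv
import io


def summarize_csv(handle, label_column=None):
    """Return (n, p, feature_names) for a CSV with a header row.

    Assumptions (hypothetical layout): one example per row, and the
    label is the last column unless label_column is given.
    """
    reader = csv.reader(handle)
    header = next(reader)
    if label_column is None:
        label_column = header[-1]
    features = [name for name in header if name != label_column]
    n = sum(1 for row in reader if row)  # blank lines are skipped
    return n, len(features), features


# Made-up sample standing in for a real downloaded file
sample = io.StringIO(
    "height,weight,label\n"
    "1.2,3.4,a\n"
    "2.1,0.5,b\n"
    "0.7,1.9,a\n"
)
n, p, features = summarize_csv(sample)
print(f"n = {n} examples, p = {p} features: {features}")
if n < 1000:
    print("Warning: n < 1000, which may be too few examples to learn from")
```

For a real file, pass an open file handle instead of the sample, for instance one created with open(path, newline="").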

2. Software/Methods

What software/methods will you use for the project? You can write your own method(s) or use existing software, but there should be some programming component. You could compare two existing methods, or compare your own method to an existing method. Even if you’re planning to use existing software, there will likely be a significant programming component since you’ll have to get the data in the right format, learn how to run the program, and evaluate/visualize the results.

If you are using any existing software, include a reference.

3. Motivation and Scientific Question

Why is this an interesting or relevant topic? You could talk about your personal motivation for this topic, or why it would be interesting to investigate in general. What scientific question are you trying to answer? Briefly describe a hypothesis about the results you expect (e.g. “I think SVMs will be more accurate than logistic regression on this dataset because X, Y, Z.”).

4. Results, Evaluation, and Interpretation

What type of results do you expect from your project? How will you evaluate and/or visualize the results?
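For a classification project, one simple starting point for evaluation is accuracy plus a table of (true label, predicted label) counts. The sketch below is only an illustration under that assumption; the labels and predictions are made up, the helper names accuracy and confusion_counts are hypothetical, and other metrics or plots may suit your question better.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels and predictions, for illustration only
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]

acc = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)

print(f"accuracy = {acc:.2f}")
for true_label in sorted(set(y_true)):
    for pred_label in sorted(set(y_true) | set(y_pred)):
        print(f"true={true_label} predicted={pred_label}: "
              f"{counts[(true_label, pred_label)]}")
```

Looking at the individual counts, and not just the overall accuracy, shows which classes are being confused with each other.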

5. References

Include a list of at least two references (Google Scholar is a great place to start). If you would like paper recommendations based on your topic, let me know. For the reference format, include:

Example:

Langmead, Trapnell, Pop, et al. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biology (2009).

If your references include online sources (software, datasets, etc) or books, feel free to modify this format (e.g. the title of the site and a link are usually enough if there is not an obvious author).


Alternative Project Styles

  1. Implementation based. If you would like to explore a large-scale software project, that is a great option. Ideas: you could implement a research paper that does not have code available online. Or you could implement an algorithm we have talked about in class but used existing software for (e.g. SVM, neural network, or related method). If you pursue this option, this software should be something you write from scratch yourself as part of this class, not something you have written before.

  2. Theory based. I am also open to a more theoretical/mathematical project that does not involve data or substantial coding. If you have a concrete idea/plan for such a project, let me know (I have already talked to a few people about this). This option is fairly open-ended, but you should be creating something new or finding a new way to interpret complex/vague theory from the literature.

  3. Simulated data. A variation of the default project format would be to simulate your own data. In some areas of biology and physics, this is a common way to test new algorithms. If you’re interested in this approach, include some details in the proposal about how you will add this step.


Getting started after the proposal

I would recommend submitting your proposal as soon as possible so you can get feedback and get started. I’ll provide feedback on your proposal and create a GitHub repository for your project. The repository should include all the code you write for the project. Only include data if the dataset is very small (e.g. a fraction of the data or a test example). The repository should also include your “lab notebook” (see details below) and eventually your slides for the presentation. Summary of what should be on git:

Meeting with me

There are two project check-ins: one during lab on April 17 (to discuss project ideas and work on the proposal) and one during lab on May 1 (by which point you should have started running an algorithm on your dataset).

Lab Notebook

Instead of a formal writeup, you should keep a lab notebook throughout the project (beginning after your git repo is created, but you’re welcome to start now if you like). I actually keep an online lab notebook for all my research projects. It helps me remember what I did last and also tracks the time I’m spending on each project. I also use this same document to keep a list of TODOs.

You and your partners should share a lab notebook so you can keep each other updated about your progress. I’m not going to grade this part formally or read every entry - just check that it’s there and looks reasonable. It is more for you to keep track of the project.

Here is an entry from my own work:

Sara: 03-07-18 (2hrs)