The goal of the final project is to gain experience with the entire scientific process: motivation, hypothesis development, data collection/cleaning, algorithm development/application, evaluation of results, interpretation, and conclusion. Below I’ll describe the main components your project should include and provide a few dataset resources. You are welcome to choose a non-data-driven project (options below), but make sure to discuss this with me in advance.
You are encouraged to work in pairs for the final project. Group work is an important component of Computer Science (and other fields!), and we haven’t emphasized it throughout the semester. This is an opportunity to practice collaboration and to create something bigger than each person could individually. If you would like a random partner, please email me by the end of Thanksgiving break. You are welcome to work across lab sections (but we will still have required lab the last few weeks of classes, and it would be ideal if both people could come to the same lab). If you prefer to work individually or in a group of 3, please also email me by the end of Thanksgiving break.
The goal of the proposal is to help you start working on your final project and assembling the different resources you want to use (literature, software, data, etc). Broadly, your project should include:
For the proposal, briefly outline each of these sections (details below) in a short email to me. Submit your proposal to me by email and cc your partner. I will confirm the proposal (and make your group a git repo) in the order I receive proposals.
What dataset will you use for this project? Regardless of what you’re interested in, there is probably a dataset that is related. Please be as specific as possible - don’t assume your ideal data is available until you’ve actually downloaded and viewed it. Include a reference/link to your data in your proposal. Here are a few databases to get started. Note that I work mostly with biological data and do not have direct experience with most of these datasets. After you download the data, make sure that you can actually view it and that you understand the features and label (if applicable). What is n? What is p? Make sure the dataset includes enough examples that you could reasonably learn from it (I would say n < 1000 would be concerning).
Kaggle: Wide variety of datasets (may need to create an account).
UCI Machine Learning Repository: Also contains a wide variety of datasets (options on the left allow you to search by task, attribute type, etc which can be very useful).
ImageNet: Large database of images (larger than CIFAR-10, which is also an option).
Climate data: If you are interested, I would encourage you to choose a climate-oriented project. There are a number of government and climate research sites with data. Here are a few:
Wikipedia Data List: Up-to-date list of datasets organized by category (may or may not be freely available).
1000 genomes (human DNA data): If you’re interested in exploring a biological project, let me know. This specific dataset contains DNA data from humans around the world, but there are many other datasets from other species.
The 50 Best Free Datasets for Machine Learning: There are a number of these types of lists floating around - this one looks decent, but make sure that you can actually download the data.
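As a quick way to check n and p after downloading a dataset, here is a minimal sketch using only the Python standard library. It assumes the data is a CSV file with a header row and at most one label column; the function name `describe_csv`, the column names, and the values are made up for illustration.

```python
import csv
import io


def describe_csv(handle, label_column=None):
    """Return (n, p): the number of examples and the number of features."""
    reader = csv.reader(handle)
    header = next(reader)                  # first row is assumed to hold column names
    rows = [row for row in reader if row]  # skip blank lines
    n = len(rows)
    # If one column is the label, it does not count as a feature.
    p = len(header) - (1 if label_column in header else 0)
    return n, p


# Tiny in-memory stand-in for a downloaded file (made-up values).
sample = io.StringIO(
    "height,weight,species\n"
    "1.2,3.4,a\n"
    "2.1,0.7,b\n"
    "0.9,5.5,a\n"
)
n, p = describe_csv(sample, label_column="species")
print(f"n = {n}, p = {p}")
```

For a real file, you would pass an open file handle instead, for example `open("my_data.csv", newline="")`, where the filename is a placeholder.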
What software/methods will you use for the project? You can write your own method(s) or use existing software, including the code we have written in labs during the semester. You could compare two existing methods, or compare your own method to an existing method. Even if you’re planning to use existing software, there will likely be a programming component since you’ll have to get the data in the right format, learn how to run the program, and evaluate/visualize the results.
If you are using any existing software, include a reference.
Why is this an interesting or relevant topic? You could talk about your personal motivation for this topic, or why it would be interesting to investigate in general. What scientific question are you trying to answer? Briefly describe a hypothesis about the results you expect (e.g. I think SVMs will be more accurate than Logistic Regression on this dataset because X, Y, Z).
What type of results do you expect from your project? How will you evaluate and/or visualize the results?
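One possible way to evaluate a classification project is to compare accuracy and confusion counts for two methods on the same test set. The sketch below is only an illustration: the labels and the predictions of "method A" and "method B" are made up, and your project may call for a different metric entirely.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label matches the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Made-up test-set labels and predictions from two hypothetical methods.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam"]
method_a = ["spam", "ham", "spam", "spam", "ham", "ham"]
method_b = ["spam", "ham", "ham", "spam", "ham", "ham"]

for name, y_pred in [("method A", method_a), ("method B", method_b)]:
    print(name, "accuracy:", round(accuracy(y_true, y_pred), 3))
    for (true, pred), count in sorted(confusion_counts(y_true, y_pred).items()):
        print(f"  true={true:<4} predicted={pred:<4} count={count}")
```

The confusion counts show which kinds of mistakes each method makes, which is often more informative than a single accuracy number.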
Include references if necessary.
Implementation based. If you would like to explore a large-scale software project, that is a great option. Ideas: you could implement a research paper that does not have code available online. Or you could implement an algorithm we have talked about in class but used existing software for (e.g. SVM, neural network, or related method). If you pursue this option, this software should be something you write from scratch yourself as part of this class, not something you have written before.
Theory based. I am also open to a more theoretical/mathematical project that does not involve data or substantial coding. If you have a concrete idea/plan for such a project, let me know (I have already talked to a few people about this). This option is fairly open-ended, but you should be creating something new or finding a new way to interpret complex/vague theory from the literature.
Simulated data. A variation of the default project format would be to simulate your own data. In some areas of biology and physics, this is a common way to test new algorithms. If you’re interested in this approach, include some details in the proposal about how you will add this step.
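As one illustration of what a simulation step could look like, the sketch below draws 2-D points from two Gaussian clusters so that the true label of every example is known. The function name, the choice of Gaussian clusters, and all parameter values are assumptions for the example, not a required design.

```python
import random


def simulate_two_classes(n_per_class, separation, seed=0):
    """Draw 2-D points from two Gaussian clusters with labels 0 and 1."""
    rng = random.Random(seed)  # fixed seed so the dataset can be regenerated exactly
    data = []
    for label, center in [(0, 0.0), (1, separation)]:
        for _ in range(n_per_class):
            x1 = rng.gauss(center, 1.0)
            x2 = rng.gauss(center, 1.0)
            data.append(((x1, x2), label))
    rng.shuffle(data)  # mix the two classes together
    return data


dataset = simulate_two_classes(n_per_class=500, separation=3.0, seed=42)
features, label = dataset[0]
print(len(dataset), "examples; first example:", features, label)
```

Because the generating process is known, you can vary `separation` to make the problem easier or harder and see how an algorithm responds.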
I would recommend submitting your proposal as soon as possible so you can get started. I’ll confirm your proposal and create a GitHub repository for your project. The repository should include all the code you write for the project. Only include data if the dataset is very small (e.g. a fraction of the data or a test example). The repository should also include your “lab notebook” (see details below). Summary of what should be on git:
There are two project check-ins, one during lab on December 3 (working on the proposal) and one during lab on December 10 (finishing up).
Instead of a formal writeup, you should keep a lab notebook throughout the project (beginning after your git repo is created, but you’re welcome to start now if you like). I actually keep an online lab notebook for all my research projects. It helps me remember what I did last and also tracks the time I’m spending on each project. I also use this same document to keep a list of TODOs.
You and your partners should share a lab notebook so you can keep each other updated about your progress. I’m not going to grade this part formally or read every entry - just check that it’s there and looks reasonable. It is more for you to keep track of the project.
Here is an entry from my own work:
Sara: 03-07-18 (2hrs)
Note: there will be an alternative presentation time besides the last day of class - I will create a poll for this time.
The main deliverable for the final project is the presentation. I chose a presentation over a paper since I think presentation is an important skill that needs more emphasis across the curriculum. This is an opportunity to practice presenting and receiving feedback. I also wanted everyone in the class to be able to see the other projects, which doesn’t always happen with final papers.
In addition to the presentation, you should also submit (on git):
I’ll go through each of these pieces below. When thinking about what to include in your git repo, keep a reproducibility perspective in mind. From your lab notebook, references, code, and slides, I should be able to reproduce your project and results exactly.
Each person will have 4-5 minutes total to present. The time does not scale exactly linearly with group size, since there is some startup cost to doing a presentation. Roughly we will do:
We will have a bit of time after each presentation for questions and transition to the next group. I will have a timer that will go off when your group has 1 min left. The best way to make sure you are hitting the right time is to practice.
In terms of presentation content, you should (very briefly) include all the main components you mentioned in your proposal, as well as future work:
Introduce your topic and goal in a creative or visual way. Whenever you give a presentation, there will be those in the audience less interested in the topic than you are, who might question the “point” of your topic or thesis. Give them a reason to pay attention. Often this involves placing your topic in a larger context, using an image the audience can relate to, telling a personal story, or posing a question you’ll answer later in the talk.
Briefly explain your dataset and/or chosen methods. Try to pick one detail or aspect that you found interesting or challenging. If you are using methods we’ve talked about in class, you could expand on how you prepared the data. If you are implementing or using a new method, tie it to our class material and then explain how it is different or novel. Overall, try to briefly give the project a narrative; explain your thought-process throughout the project.
Display your results in a visual way. Negative results are results too, and can definitely be included. How did you evaluate and interpret your results? If they did not match your expectations, what might be going on?
In a few words, what were your main takeaways from the project? What would you do if you had 6 months to work on this project instead of a few weeks? What aspects would you change or extend further?
Having your video on is required while you are presenting (email me if this will not work for you).
Avoid text-heavy slides; try to use images and diagrams to convey information.
Include citations for any figures/info you use that you did not create (on the slide where you use it).
You do not need to include a full list of references in the slides.
For groups with more than one person, feel free to divide the presentation however you like, as long as each person gets equal time.
As an audience member, be respectful to the other presenters. Be on time and give them your full attention (this counts toward your participation grade).
Time permitting, each person should ask at least one question to another group. Sometimes keeping a question in mind is a good way to stay engaged.
In addition to keeping track of what you have done so far, also include a list of references at the end. This should include anything that you made use of - papers, datasets, external software. Think about the standard of reproducibility when creating your lab notebook.
Except for external software, include all code that was necessary to obtain your final results (including citations for code you did not write). Keep your code organized and commented. You can include some small example datasets, but avoid putting large data files on git since this can cause problems. Err on the side of including more results though (output files, figures, etc).
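To catch large data files before they end up on git, a small check like the one below can be run from the top of your repository. It is only a sketch: the 1 MB threshold is an arbitrary assumption, and the function name `find_large_files` is made up for this example.

```python
import os


def find_large_files(root, max_bytes=1_000_000):
    """Return (path, size) pairs for files under root larger than max_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own internal directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # e.g. a broken symbolic link
            if size > max_bytes:
                large.append((path, size))
    # Largest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)


for path, size in find_large_files("."):
    print(f"{size / 1e6:8.1f} MB  {path}")
```

Anything it lists is a candidate for leaving out of the repository and describing in your lab notebook instead (where it came from and how to download it).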