AIcrowd | Classification - Building collapse detection

Round 1: Completed #classification

VITA Lab, EPFL

🕵️ Overview

The goal of this challenge is to create your own classification model that determines whether a building collapses, based on earthquake data. Your task is to investigate and propose your own model to outperform your peers. You must understand any preprocessing and any architecture you use, as you will need to describe it in your code submission and explain it during the poster presentation.

The challenge ends on the 26th of May at 23:59. This is also the week before the final exam, so do not postpone submitting your predictions.

💾 Dataset

You can find the dataset on Moodle CIVIL-226 as well as a description of the data.

There are two kinds of CSV files: one containing the features and one containing the scaling factor and the result.

You will need to combine these two files using the scaling procedure described in the Data description. As you will notice, some data may be missing or may have been incorrectly reported. It is up to you to decide how to deal with such entries, and we expect your decisions to be explained in your code and poster.
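As a rough illustration only, combining the two files and handling bad entries might look like the sketch below. Everything specific in it is an assumption made for the example: the column names (`building_id`, `drift`, `acceleration`, `scale`, `label`), the `-999` sentinel, the median imputation and the "multiply by the factor" rule. The actual scaling procedure is the one in the Data description on Moodle.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two CSV files (real names/columns are on Moodle).
features_csv = io.StringIO(
    "building_id,drift,acceleration\n"
    "1,0.10,2.0\n"
    "2,,3.0\n"
    "3,0.30,-999\n"
)
targets_csv = io.StringIO(
    "building_id,scale,label\n"
    "1,2.0,0\n"
    "2,1.0,1\n"
    "3,0.5,1\n"
)

features = pd.read_csv(features_csv)
targets = pd.read_csv(targets_csv)

# Join the two tables on the shared key (assumed here to be `building_id`).
data = features.merge(targets, on="building_id", how="inner", validate="one_to_one")

feature_cols = ["drift", "acceleration"]

# Assumption: -999 marks an incorrectly reported value; treat it as missing.
data[feature_cols] = data[feature_cols].replace(-999, np.nan)

# One possible policy: remember which rows had gaps, then fill with the median.
data["had_missing"] = data[feature_cols].isna().any(axis=1)
data[feature_cols] = data[feature_cols].fillna(data[feature_cols].median())

# Placeholder scaling rule (assumption): multiply each feature by the row's factor.
data[feature_cols] = data[feature_cols].mul(data["scale"], axis=0)

print(data)
```

Whatever policy you choose (dropping rows, imputing, flagging), state it explicitly in the notebook. If you impute, consider computing the statistics on the training split only.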

📝 Code

Your code should be a notebook named train.ipynb. It should contain everything from loading the dataset to making your predictions. The notebook should be well documented and organized. You can take inspiration from the notebooks given as exercises during the semester. Be aware that part of the grading will rely on the clarity of your notebook. You should motivate the decisions you made directly in the notebook.

Optional. You may also upload additional files and notebooks. In this case, you MUST submit a README file explaining clearly what each file/notebook is for and how to reproduce the results you have obtained. When submitting multiple files, upload them together as a zip.

Edit: Please also include the name of your team on AICrowd (either in the README or in the main train.ipynb notebook).

Your final code and poster should be submitted on Moodle by 30/05 23:59.

🚀 Submission

Please submit only as a TEAM: simply click on "create a team" at the top right of the challenge page.

For the challenge, you must submit your predicted test set target columns here on AICrowd. You are allowed up to 5 daily submissions so manage your time and progress carefully (edit: changed to 10 daily submissions per team).

To submit, please upload a .csv file containing a single column with the header 'label'. In other words, you upload only the missing column of test_set.csv, filled with your predictions.

Make sure you do not include any other columns from the test set in your CSV file: it MUST have exactly 1 column. The evaluation compares your entries line by line with the true 'target' values and gives you your score on the leaderboard.

WARNING: The predicted values must be integers, either 0 or 1, where 0 corresponds to "NO" and 1 corresponds to "YES".
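A minimal sketch of writing such a file with pandas is shown below. The probabilities, the 0.5 threshold and the file name `submission.csv` are made up for the example; only the single 'label' column and the 0/1 integer values come from the rules above.

```python
import numpy as np
import pandas as pd

# Hypothetical model output: one collapse probability per test row.
probabilities = np.array([0.12, 0.87, 0.50, 0.49])

# Threshold at 0.5 (an assumption) and cast to plain integers 0 / 1.
predictions = (probabilities >= 0.5).astype(int)

submission = pd.DataFrame({"label": predictions})

# Sanity checks before uploading: one column, only 0s and 1s.
assert list(submission.columns) == ["label"]
assert submission["label"].isin([0, 1]).all()

# index=False keeps the row index out of the file, so it has exactly one column.
submission.to_csv("submission.csv", index=False)
```

It is also worth checking that the number of rows matches the number of rows in test_set.csv, since the entries are compared line by line.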

To send us your project, you will be able to upload your code and poster through a separate link on Moodle. Please always include the names and SCIPER numbers of all your teammates in the README file (or directly in the notebook) and on the poster.

🖊 Evaluation Criteria

For the Challenge:

The top 5 teams will get bonus points for their grade, proportional to their ranking in the leaderboard. The primary score of the challenge is Accuracy.
If you make no submission, however, you will lose points for failing to submit an acceptable classification model, which is what we ask of you in this project.

For the Code:

You will submit on Moodle a zip with just your notebook and your poster. Do not include the data, as you won't be able to upload your submission to Moodle if you do so.
Please make it tidy and add documentation when needed. Readability counts towards the grading. Your code should be able to reproduce (or come very close to) your best AICrowd submission.
Please make sure your script loads the submitted data with a relative path (e.g., load_the_csv('data/train_set.csv')), and not with an absolute path (e.g., load_the_csv('MyDrive/Users/alice/data_folder/train_set.csv') )
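For instance, a small loader built on `pathlib` keeps the path relative to the folder the notebook is run from. The helper name `load_train_set` and the constants are hypothetical; only the `data/train_set.csv` location is taken from the example above.

```python
from pathlib import Path

import pandas as pd

# Relative to the folder the notebook is run from, not to a user's home directory.
DATA_DIR = Path("data")
TRAIN_PATH = DATA_DIR / "train_set.csv"


def load_train_set(path: Path = TRAIN_PATH) -> pd.DataFrame:
    """Load the training set, failing early with a clear message if it is missing."""
    if not path.exists():
        raise FileNotFoundError(
            f"Expected the dataset at '{path}'. Place the CSV files in '{DATA_DIR}/'."
        )
    return pd.read_csv(path)
```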

For the Poster:

You will present your models, creative ideas and results in the form of a poster that you must submit with your final code by 30/05 23:59. Please note that the poster replaces the written report you may have often produced in previous courses, which would not teach you much.
The poster must briefly explain what your code does and present the main ideas you implemented to solve the task. Please add the names and SCIPER numbers of your teammates in the README, as well as your AICrowd team name and the ID of your best submission.

Prizes:

Best results, the team winning the leaderboard, will get a prize and will be presenting their approach to the class on 02/06.
Best poster, the team with the clearest and nicest poster will get a prize and will be presenting their approach to the class on 02/06.
Optional: Most original approach.

🔗 Resources

Poster requirements:

Think of it as mid-way between a report (structure) and a collage of slides, where you can have both bullet points and a few full sentences of explanation.

Key Components:

Title: Your project title, teammates
Predicting: Briefly explain the motivation for your topic, what you built, and the results. It’s easier to think of this as a quick summary of the inputs and outputs. (3 sentences max)
Data: Exactly where did your data come from and what does it contain? (i.e., What is in the rows and columns? Are examples labeled with ground truth? etc.) (1-2 sentences max)
Features: How many features have you selected and which features are the raw input data vs. features you have derived? Why are they appropriate for this task? (2-3 sentences max)
Models: Exactly which model(s) are you using or are worth showing? Write out the basic math formulas if applicable and clearly note any modifications or additions. If you have more than one model, make subsections for each. (3-4 sentences max)
Results: Make a compact table of results. Each row should be a different model. The columns should be the training accuracy and the test accuracy. List how many samples are in each of the training and testing data sets. Obviously, these sets should be different. (1-2 sentences max + 1 table max)
Discussion: This is where you share your thoughts about your project. (Hopefully you have a few interesting interpretations!) Briefly summarize what happened. Briefly explain whether or not you expected your results. If your results were good, explain why. If they were not good, explain why. (5 sentences max)
Future: If you had more time to work on this or add a creative idea, what would you do first? (2-3 sentences max)
References: Papers you read to create your model or succeed in the project
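The results table described under "Results" could be produced along the lines of the sketch below. It runs on synthetic data from scikit-learn, the two models are arbitrary examples, and a held-out split of the training data stands in for the test set (an assumption, since the true test labels are not available to you).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Keep 20% of the labeled data aside; it is never used for fitting.
X_train, X_held, y_train, y_held = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append(
        {
            "model": name,
            "train accuracy": accuracy_score(y_train, model.predict(X_train)),
            "test accuracy": accuracy_score(y_held, model.predict(X_held)),
        }
    )

# One row per model, one column per split.
results = pd.DataFrame(rows).set_index("model")

print(f"{len(X_train)} training samples, {len(X_held)} held-out samples")
print(results.round(3))
```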

Source: http://cs229.stanford.edu/projects.html

Examples of posters from ML conferences: https://web.archive.org/web/20201128110223/https://postersession.ai///#

Extra Guidelines:

Methodology:

Your choice of method needs to be well motivated and you need to show evidence that your work has an effect.

The simplest way to do so is to start with a simple model as a baseline, evaluate it, find a way to improve it, evaluate again and repeat. Explain the process that leads to your various improvements, evaluate the results carefully and present evidence using plots and tables. When comparing two models, make sure you tuned the hyper-parameters for both models beforehand. Comparing untuned / ill-defined models is not meaningful.
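One way to follow this loop with scikit-learn is sketched here: a trivial baseline is scored first, then each candidate model is tuned with a grid search before being compared. The data is synthetic, and the candidate models and hyper-parameter grids are illustrative choices, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_score = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()

# Each candidate comes with its own (small, illustrative) hyper-parameter grid.
candidates = {
    "logistic regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1, 10]},
    ),
    "decision tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, 8, None]},
    ),
}

# Tune every candidate the same way so the comparison is fair.
tuned_scores = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    tuned_scores[name] = (search.best_score_, search.best_params_)

print(f"baseline: {baseline_score:.3f}")
for name, (score, params) in tuned_scores.items():
    print(f"{name}: {score:.3f} with {params}")
```

Both candidates are scored with the same 5-fold cross-validation, so differences between them are not an artifact of one model being left at its defaults.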

Source: https://github.com/epfml/ML_course/blob/master/projects/Project_Guidelines.pdf

Code:

README (if applicable): The README should contain the full instructions on how to run your code, how to reproduce your obtained results, and give an overview of the architecture of your code (what are the different files and what they contain). You will also need to specify which libraries should be installed.

Modularisation: Avoid copy-pasting of code as much as possible. Define re-usable functions instead.

Documentation: Clear variable and function names are even better than comments. Indent your code properly. Use the Python docstring convention to explain what a function does. Write multiple short functions with explicit names rather than a single 200-line run function. The more readable your code is, the more likely you are to be understood and given points.
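As an illustration of these points, here are two short, explicitly named functions with docstrings. The function names and the docstring layout are just one possible style, chosen for the example.

```python
import numpy as np


def to_binary_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to integer labels 0 or 1.

    Args:
        probabilities: array-like of floats in [0, 1].
        threshold: values greater than or equal to it map to 1.

    Returns:
        A NumPy integer array of 0s and 1s with the same shape as the input.
    """
    return (np.asarray(probabilities) >= threshold).astype(int)


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels.

    Args:
        y_true: array-like of true labels.
        y_pred: array-like of predicted labels, same shape as y_true.

    Raises:
        ValueError: if the two inputs have different shapes.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))
```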

Useful resources:

Libraries:

General purpose:
- NumPy: https://numpy.org/
- Pandas: https://pandas.pydata.org/
ML:
- PyTorch: https://pytorch.org/
- PyTorch Lightning: https://www.pytorchlightning.ai/
- scikit-learn: https://scikit-learn.org/stable/
Visualization:
- matplotlib: https://matplotlib.org/
- seaborn: https://seaborn.pydata.org/

Code and collaboration:

Code editors / IDEs:
- JupyterLab (for notebooks)
- Visual Studio Code: https://code.visualstudio.com/
  - Use the Python extension (https://code.visualstudio.com/docs/python/data-science-tutorial)
  - Supports notebooks too
Collaboration:
- Deepnote (real-time collaboration, like Google Docs but for notebooks): https://deepnote.com/
Free GPUs:
- Google Colab: colab.research.google.com/

Experiment logging

If you want to log and visualize experiments, we recommend using TensorBoard, which lets you keep track of metrics such as the loss and accuracy.

For more information on how to use TensorBoard with PyTorch, check out the documentation.
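A minimal logging sketch with `torch.utils.tensorboard.SummaryWriter` could look as follows. It assumes the `torch` and `tensorboard` packages are installed; the logged values are dummy numbers standing in for real training metrics, and the tag names are arbitrary.

```python
import os
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Temporary folder for the example; in a project this could be e.g. "runs/experiment_1".
log_dir = tempfile.mkdtemp(prefix="runs_")
writer = SummaryWriter(log_dir=log_dir)

# Dummy values standing in for the metrics computed in a real training loop.
for epoch in range(5):
    fake_loss = 1.0 / (epoch + 1)
    fake_accuracy = 0.5 + 0.1 * epoch
    writer.add_scalar("Loss/train", fake_loss, epoch)
    writer.add_scalar("Accuracy/train", fake_accuracy, epoch)

# Flush pending events and close the event file.
writer.close()

print(os.listdir(log_dir))
```

The curves can then be viewed by running `tensorboard --logdir` on that folder from a terminal.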

Google Colab for GPUs

If you are in need of GPUs, you can run your notebook in Colab.

To use a GPU on Colab, make sure to switch to a GPU runtime (Runtime -> Change runtime type -> GPU)

To use GPUs with PyTorch, you will first need to move your model and data to the GPU. See this tutorial for more information: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#training-on-gpu
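In short, the pattern is to pick a device once and move both the model and each batch to it. The sketch below uses a toy linear model and random inputs as placeholders, and falls back to the CPU when no GPU is available.

```python
import torch
from torch import nn

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model and batch, standing in for your own network and data.
model = nn.Linear(10, 2).to(device)
inputs = torch.randn(4, 10).to(device)
targets = torch.tensor([0, 1, 1, 0]).to(device)

# Model and data must live on the same device for the forward pass to work.
logits = model(inputs)
loss = nn.functional.cross_entropy(logits, targets)

print(device, loss.item())
```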