1.6. Organisation
1.6.1. The rest of the course
The rest of this course is organised into nine chapters, with appendices at the end. Each chapter will focus on either the system or the process.
We have seven chapters on ML systems: Chapter 2 on linear regression, Chapter 3 on logistic regression and linear discriminant analysis, Chapter 6 on feature selection and regularisation, Chapter 7 on trees and ensembles, Chapter 8 on generalised linear models and support vector machines, Chapter 9 on principal component analysis and \(K\)-means/hierarchical clustering, and Chapter 10 on neural networks and deep learning, including convolutional and recurrent neural networks (CNNs and RNNs). These chapters cover the system side of ML, using real-world applications to illustrate the concepts and techniques.
We have two chapters on ML processes: Chapter 4 on hypothesis testing and software development, and Chapter 5 on cross validation and bootstrap. These chapters cover the process side of ML, including leave-one-out/k-fold cross validation, bootstrap, types of errors, significance of results, and the software development life cycle on GitHub.
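As a preview of the resampling ideas named above, the following is a minimal sketch in plain Python of how k-fold cross validation and the bootstrap choose which observations to use. The function names `k_fold_splits` and `bootstrap_sample` are illustrative assumptions, not part of the course materials or of any library, and the folds here are not shuffled, which a practical implementation would normally do.

```python
import random


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Hypothetical helper for illustration; indices are kept in order (no shuffling).
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def bootstrap_sample(n_samples, seed=None):
    """Draw n_samples indices with replacement (one bootstrap resample)."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


# 5-fold cross validation on 10 observations: each observation is tested once.
for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)

# Leave-one-out cross validation is the special case k = n_samples.
print(sum(1 for _ in k_fold_splits(10, 10)), "leave-one-out splits")

# A bootstrap resample may repeat some indices and omit others.
print(bootstrap_sample(10, seed=0))
```

Setting `k` equal to the number of observations gives leave-one-out cross validation, so the two schemes mentioned above share one code path in this sketch.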
1.6.2. Real-world datasets used
In this course, we will use real-world datasets to introduce machine learning from the perspective of AI transparency. We will use the following datasets from the textbook. You can click on the name of the dataset to see the actual data.
Name | Data provided | Machine learning problem |
---|---|---|
Advertising | Sales, TV, radio, newspaper. | Predict sales based on TV, radio, and newspaper advertising. |
Auto | Gas mileage, horsepower, and other information for cars. | Predict gas mileage for a car. |
Bikeshare | Hourly usage of a bike sharing program in Washington, DC. | Predict the number of bikes rented per hour. |
Boston | Housing values and other information about Boston census tracts. | Predict the median value of a house. |
BrainCancer | Survival times for patients diagnosed with brain cancer. | Predict the survival time for a patient. |
Caravan | Information about individuals offered caravan insurance. | Predict whether an individual will buy caravan insurance. |
Carseats | Information about car seat sales in 400 stores. | Predict the sales of a car seat. |
College | Demographic characteristics, tuition, and more for USA colleges. | Predict the number of applications received by a college. |
Credit | Information about credit card debt for 10,000 customers. | Predict the amount of credit card debt for a customer. |
Default | Customer default records for a credit card company. | Predict whether a customer will default on a credit card payment. |
Fund | Returns of 2,000 hedge fund managers over 50 months. | Predict the returns of a hedge fund manager. |
Heart | Information about heart disease for 303 patients. | Predict whether a patient has heart disease. |
Hitters | Records and salaries for baseball players. | Predict the salary of a baseball player. |
Iris | Measurements of 150 iris flowers. | Predict the species of an iris flower. |
Khan | Gene expression measurements for four cancer types. | Predict the cancer type for a patient. |
NCI60 | Gene expression measurements for 64 cancer cell lines. | Find clusters or groups among the cell lines for personalised treatment. |
OJ | Sales information for Citrus Hill and Minute Maid orange juice. | Predict the sales of orange juice. |
Portfolio | Past values of financial assets, for use in portfolio allocation. | Predict the value of a financial asset. |
Publication | Time to publication for 244 clinical trials. | Predict the time to publication for a clinical trial. |
Smarket | Daily percentage returns for S&P 500 over a 5-year period. | Predict whether the stock index will increase or decrease. |
USArrests | Crime statistics per 100,000 residents in 50 states of USA. | Predict the crime rate in a state. |
Wage | Income survey data for men in central Atlantic region of USA. | Predict the income of men. |
Weekly | 1,089 weekly stock market returns for 21 years. | Predict the stock market return in a week. |
The datasets above show the diverse range of problems that machine learning can solve, and they are only the tip of the iceberg. Applications of machine learning are everywhere: healthcare, finance, manufacturing, agriculture, transportation, education, and more. The datasets used in this course come from the textbook, which is a good starting point for learning about machine learning. You can also find many other datasets online, for example on Kaggle, the UCI Machine Learning Repository, OpenML, and Google Dataset Search.
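The "Machine learning problem" column above follows a simple pattern: a numeric target suggests regression, a categorical target suggests classification, and no target at all suggests an unsupervised problem such as clustering. The sketch below encodes that rule of thumb. The function `problem_type` and the sample values are made up for illustration, and the heuristic is deliberately crude: class labels stored as numbers, for instance, would be misread as a regression target.

```python
def problem_type(target_values):
    """Guess the kind of learning problem from the target column.

    Hypothetical helper implementing a rough rule of thumb, not a definitive test.
    """
    if target_values is None:
        # No target variable: an unsupervised problem such as clustering.
        return "clustering"
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in target_values
    )
    return "regression" if numeric else "classification"


# Made-up example targets, loosely modelled on the problems listed above.
print(problem_type([12.5, 8.0, 15.3]))    # a sales-like numeric target
print(problem_type(["Yes", "No", "No"]))  # a default-like categorical target
print(problem_type(None))                 # no target, as when grouping cell lines
```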
1.6.3. Machine learning models
This course focuses on the machine learning models (or methods) that are most widely used in practice; it does not aim to cover all models exhaustively. The following table shows the machine learning models that we will cover in this course.
Method | Description | Example |
---|---|---|
Linear regression | A linear model for regression. | Predicting the price of a house. |
Logistic regression | A linear model for classification. | Predicting whether a customer will default on a credit card payment. |
Support vector machine | A kernel-based model for classification. | Predicting whether a customer will default on a credit card payment. |
Decision tree | A nonlinear model for classification and regression. | Predicting whether a customer will default on a credit card payment. |
Random forest | An ensemble of decision trees for classification and regression. | Predicting whether a customer will default on a credit card payment. |
Neural network | A nonlinear model for classification and regression. | Predicting whether a customer will default on a credit card payment. |
\(K\)-means | A clustering model. | Finding groups of similar customers. |
Principal component analysis | A dimensionality reduction model. | Finding the most important features of a dataset. |
No single model will perform well in all possible scenarios. Therefore, it is important to understand the assumptions and trade-offs of each model so that you can choose the right model for a given problem.
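To make this concrete, here is a small sketch on synthetic, noise-free data that compares a least-squares line with a 1-nearest-neighbour predictor. The nearest-neighbour method is not one of the models in the table above; it is used here only as a simple nonlinear comparator, and all function names and data are assumptions made for illustration. On data generated from a straight line the linear model has the lower test error, while on data generated from a curve the nearest-neighbour predictor does.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns a predictor."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x


def fit_nearest_neighbour(xs, ys):
    """1-nearest-neighbour regression: copy the target of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a predictor on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic inputs: training points at 0, 1, ..., 9; test points in between.
train_x = [float(x) for x in range(10)]
test_x = [x + 0.25 for x in range(9)]

results = {}
for name, f in [("line", lambda x: 2 * x + 1), ("curve", lambda x: x ** 2)]:
    train_y = [f(x) for x in train_x]
    test_y = [f(x) for x in test_x]
    results[name] = {
        "linear": mse(fit_line(train_x, train_y), test_x, test_y),
        "nearest": mse(fit_nearest_neighbour(train_x, train_y), test_x, test_y),
    }

for name, scores in results.items():
    best = min(scores, key=scores.get)
    print(f"{name}: {scores} -> lower test error: {best}")
```

The linear model wins when its assumption (a straight-line relationship) holds and loses when it does not, which is the trade-off the paragraph above describes.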
1.6.4. Exercises
1. Choose three or more datasets that interest you from Table 1.2. Click on the name of each chosen dataset to explore it and get a sense of the data. The larger datasets may not display well, or may not display at all. Using the terminology in Table 1.1, write down the machine learning problems that could be solved with each of your chosen datasets.
Compare your answer with the solution below
Dataset | Machine learning problems |
---|---|
Advertising | Regression |
Auto | Regression |
Bikeshare | Regression |
Boston | Regression |
BrainCancer | Regression |
Caravan | Classification |
Carseats | Regression |
College | Regression |
Credit | Regression |
Default | Classification |
Fund | Regression |
Hitters | Regression |
Iris | Classification |
Khan | Classification |
NCI60 | Clustering |
OJ | Regression |
Portfolio | Regression |
Publication | Regression |
Smarket | Classification |
USArrests | Clustering |
Wage | Regression |
Weekly | Classification |