
Math 6380J: A Mathematical Introduction to Data Analysis
Spring 2017


Course Information

Synopsis (摘要)

This course is open to graduate and senior undergraduate students in applied mathematics and statistics who are interested in learning from data. Students from other backgrounds, such as engineering and biology, are also welcome, provided they have a certain level of mathematical maturity. The course starts with two manifestations of the curse of dimensionality: Stein's phenomenon and random matrix theory for PCA. It then covers fundamental topics in high-dimensional statistics, manifold learning, diffusion geometry, random walks on graphs, concentration of measure, random matrix theory, and geometric and topological methods.
Prerequisites: linear algebra, basic probability and multivariate statistics, basic stochastic processes (Markov chains), and convex optimization; familiarity with Matlab, R, and/or Python.

Reference (参考教材)

[pdf download]

Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. By Efron and Hastie. A new monograph on computational statistics and 'learning'.

The Elements of Statistical Learning. 2nd Ed. By Hastie, Tibshirani, and Friedman. A classic textbook on statistical learning for graduate students interested in the statistical aspects of machine learning.

An Introduction to Statistical Learning, with Applications in R. By James, Witten, Hastie, and Tibshirani. A simplified version of the textbook above, aimed at undergraduates, with extensive lab sessions in R.

Instructor:

Yuan YAO

Time and Place:

Monday 6:30pm-9:20pm, Rm 5510 (Lift 25-26)
This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates and the instructor. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.
Find our class page at: https://piazza.com/ust.hk/spring2017/math6380/home

Homework and Projects:

Monthly mini-projects and a final major project. No final exam.

[Project Reports]

Schedule (时间表)

Date Topic Instructor Scribe
02/06/2017, Mon Lecture 01: Introduction, Geometry of PCA (Chap 1 Sec 1), MLE (Chap 2 Sec 1)
Y.Y.
02/13/2017, Mon Lecture 02: Stein's Estimate of Mean and Parallel Analysis for PCA (Chap 2 Sec 2, Efron-Hastie Chap 7.)
    [Reference]: the following code examples are in R; please let me know if you find good sources in other languages. A short from-scratch sketch also follows this entry.
  • Stein's Estimate vs. MLE [ james_stein.R ]
  • Horn's Parallel Analysis for PCA [ paran.R ] with S&P500 data in class: [ snp500.Rda ]
Y.Y.
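    [Code sketch]: a minimal from-scratch illustration of both topics above, written for this page rather than taken from james_stein.R or paran.R; the simulation sizes and the planted factor are arbitrary choices.

    # James-Stein vs. MLE in the Gaussian sequence model X ~ N(theta, I_p)
    set.seed(1)
    p <- 50; n_rep <- 1000
    theta <- rnorm(p)                            # true mean vector
    risk_mle <- risk_js <- numeric(n_rep)
    for (r in seq_len(n_rep)) {
      x  <- theta + rnorm(p)                     # MLE of theta is x itself
      js <- (1 - (p - 2) / sum(x^2)) * x         # shrink toward the origin
      risk_mle[r] <- sum((x  - theta)^2)
      risk_js[r]  <- sum((js - theta)^2)
    }
    c(MLE = mean(risk_mle), JS = mean(risk_js))  # JS dominates the MLE for p >= 3

    # Horn's parallel analysis: retain components whose eigenvalues exceed
    # a high quantile of eigenvalues from pure-noise data of the same shape.
    n <- 200
    X <- matrix(rnorm(n * p), n, p)
    f <- rnorm(n)
    X[, 1:5] <- X[, 1:5] + 2 * f                 # plant one shared factor
    obs  <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
    null <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)),
                                 symmetric = TRUE, only.values = TRUE)$values)
    which(obs > apply(null, 1, quantile, probs = 0.95))  # keeps the planted component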
02/20/2017, Mon Lecture 03: MLE, Linear, JS, LASSO, Hard Thresholding, Nonconvex Regularization, LBI (ISS): Risk and Consistency [Lecture Note]
Y.Y. Jiacheng XIA
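    [Code sketch]: a hedged illustration (not the lecture note's code) of why thresholding estimators beat the MLE in risk under sparsity; the universal threshold sqrt(2 log p) is the classical choice.

    # Sparse normal means: x = theta + noise, with only k of p means nonzero
    set.seed(2)
    p <- 1000; k <- 20
    theta <- c(rep(5, k), rep(0, p - k))
    x <- theta + rnorm(p)
    lam <- sqrt(2 * log(p))                      # universal threshold
    soft <- sign(x) * pmax(abs(x) - lam, 0)      # soft thresholding (LASSO-type)
    hard <- x * (abs(x) > lam)                   # hard thresholding
    c(MLE  = sum((x    - theta)^2),
      soft = sum((soft - theta)^2),
      hard = sum((hard - theta)^2))              # both thresholds beat the MLE here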
02/27/2017, Mon Lecture 04: Mini-Project 1 and some catch-up on Random Matrix Theory for PCA [ Lecture04.pptx ]
Y.Y.
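    [Code sketch]: a small simulation of the Marchenko-Pastur law behind the random matrix theory discussion; the dimensions are arbitrary.

    # Eigenvalues of a pure-noise sample covariance stay below the MP edge
    set.seed(3)
    n <- 1000; p <- 200; gamma <- p / n
    X <- matrix(rnorm(n * p), n, p)
    ev <- eigen(crossprod(X) / n, symmetric = TRUE, only.values = TRUE)$values
    hist(ev, breaks = 50, freq = FALSE, main = "Marchenko-Pastur bulk")
    abline(v = (1 + sqrt(gamma))^2, col = "red")   # upper edge, about 2.09 here
    max(ev)   # a genuine signal eigenvalue would separate beyond this edge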
03/06/2017, Mon Lecture 05: SDP relaxations, RPCA, and SPCA (Chap 4: 1-4)
    [Seminar]:
  • Speaker: Bowei YAN, U Texas-Austin
  • Title: Semidefinite relaxations for clustering [ slides ]
  • Abstract: In recent years, a number of works have studied community detection in stochastic block models (SBM) via semidefinite relaxations. Among the various proposed semidefinite programming approaches, conditions are usually required on the sparsity of the graph, the separation of the clusters, and the number and sizes of the clusters. In this talk I will introduce an SDP that uses the projection matrix instead of the indicator clustering matrix. We prove that this formulation recovers the ground-truth structure under weaker conditions on each of the aforementioned aspects. The proposed relaxation can also be used for kernel clustering and is shown to be robust to arbitrary outliers, in contrast to existing spectral methods. This is joint work with Purnamrita Sarkar.
Y.Y.
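    [Formulation sketch]: for orientation, a Peng-Wei-type SDP relaxation of k-way clustering with affinity matrix A, written with a projection-matrix variable Z in the spirit of the talk; the exact constraints used by Yan and Sarkar may differ.

    \begin{aligned}
    \max_{Z \in \mathbb{R}^{n \times n}} \quad & \langle A, Z \rangle \\
    \text{s.t.} \quad & Z \succeq 0, \quad Z \ge 0 \ \text{(entrywise)}, \\
    & Z \mathbf{1} = \mathbf{1}, \quad \operatorname{tr}(Z) = k,
    \end{aligned}

    where Z plays the role of the normalized cluster projection matrix H (H^T H)^{-1} H^T for an indicator matrix H.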
03/13/2017, Mon Lecture 06: Supervised PCA, Dual PCA-MDS, and Reproducing Kernel [ lecture06.pdf ]
Y.Y. Yuqi ZHAO
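    [Code sketch]: a short check of the PCA-MDS duality from the lecture, using base R only.

    # Classical MDS on Euclidean distances recovers PCA scores up to sign
    X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
    pca <- prcomp(X, center = FALSE)$x[, 1:2]    # PCA scores
    mds <- cmdscale(dist(X), k = 2)              # classical MDS embedding
    max(abs(abs(pca) - abs(mds)))                # essentially zero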
03/20/2017, Mon Lecture 07: RKHS, SVM, and MDS with incomplete information (last part of lecture06.pdf and Chap 4.5-4.6)
Y.Y.
03/27/2017, Mon Lecture 08: Tree methods: CART, Bagging, Random Forests, and Boosting [ slides ] [ ISLR: Chap 8 ]
Y.Y.
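    [Code sketch]: a toy random forest in the style of the ISLR Chap 8 labs; assumes the randomForest package is installed.

    library(randomForest)
    set.seed(8)
    train <- sample(nrow(iris), 100)
    rf <- randomForest(Species ~ ., data = iris[train, ], ntree = 500)
    mean(predict(rf, iris[-train, ]) == iris$Species[-train])   # test accuracy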
04/03/2017, Mon Lecture 09: Manifold Learning: ISOMAP, LLE and extended LLEs [ lecture09.1.pdf ] [ lecture09.2.pdf ]
Y.Y.
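    [Code sketch]: a from-scratch ISOMAP on a Swiss-roll-like surface (k-NN graph, graph geodesics, then classical MDS); assumes the igraph package is installed, and is not the lecture's implementation.

    library(igraph)
    set.seed(9)
    n <- 500; t <- runif(n, 1.5 * pi, 4.5 * pi)
    X <- cbind(t * cos(t), runif(n, 0, 10), t * sin(t))   # Swiss roll
    D <- as.matrix(dist(X))
    k <- 8                                    # increase k if the graph disconnects
    A <- matrix(0, n, n)
    for (i in 1:n) {                          # symmetric k-nearest-neighbor graph
      nb <- order(D[i, ])[2:(k + 1)]
      A[i, nb] <- D[i, nb]; A[nb, i] <- D[nb, i]
    }
    g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
    G <- distances(g)                         # geodesic (shortest-path) distances
    emb <- cmdscale(G, k = 2)                 # classical MDS on geodesics
    plot(emb, col = rainbow(n)[rank(t)], pch = 16)   # the roll is unrolled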
04/10/2017, Mon Lecture 10: Topological Data Analysis [ lecture10.pdf ]
    [Reference]:
  • Project reports are at [GitHub Math6380 web]
  • In particular, all reports with source code are placed in the folder [ Project2 reports ] for peer review.
  • Submit your top 5 favorite reports (id and authors) on or before April 25, 2017, to the datascience.hw email address. No self-votes; they will not be counted.
  • Crowdsourced World College Ranking at allourideas: [ allourideas.org/worldcollege ]
Y.Y.
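    [Code sketch]: persistent homology of a noisy circle, assuming the CRAN package TDA and its ripsDiag function are available; only meant to illustrate the idea of the lecture.

    library(TDA)
    set.seed(10)
    ang <- runif(200, 0, 2 * pi)
    X <- cbind(cos(ang), sin(ang)) + matrix(rnorm(400, sd = 0.05), 200, 2)
    d <- ripsDiag(X, maxdimension = 1, maxscale = 2)   # Vietoris-Rips filtration
    plot(d$diagram)          # one long-lived H1 feature reveals the circle's loop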
04/17/2017, Mon Spring break
Y.Y.
04/24/2017, Mon Lecture 11: Applied Hodge Theory: Social Choice and Game Theory etc. [ lecture11.pdf ]
Y.Y.
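    [Code sketch]: the global-rating (gradient) step of HodgeRank as a least-squares problem on the comparison graph; a from-scratch toy, not the lecture's code.

    # Pairwise comparisons y_ij ~ s_j - s_i; recover scores s by least squares
    set.seed(11)
    n <- 10
    s_true <- rnorm(n)                        # latent scores
    E <- t(combn(n, 2))                       # complete comparison graph
    y <- s_true[E[, 2]] - s_true[E[, 1]] + rnorm(nrow(E), sd = 0.3)
    d <- matrix(0, nrow(E), n)                # incidence (gradient) operator
    d[cbind(seq_len(nrow(E)), E[, 1])] <- -1
    d[cbind(seq_len(nrow(E)), E[, 2])] <-  1
    s_hat <- c(0, qr.solve(d[, -1], y))       # pin s_1 = 0 (constants are in ker d)
    cor(s_true, s_hat)                        # near 1; the residual is the cyclic part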
04/28/2017, Fri, 3-6pm, Room 2405 (lift 17-18) Lecture 12: An Odyssey on Representation Learning: A Brief Introduction to Neural Networks [ lecture12.pdf ] and [ Final Project Description ]
Y.Y.
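    [Code sketch]: a one-hidden-layer network on a toy dataset via the nnet package (one of R's recommended packages), just to make the representation-learning theme concrete; not the lecture's material.

    library(nnet)
    set.seed(12)
    train <- sample(nrow(iris), 100)
    fit <- nnet(Species ~ ., data = iris[train, ], size = 5, decay = 1e-3,
                maxit = 500, trace = FALSE)   # 5 hidden units, weight decay
    mean(predict(fit, iris[-train, ], type = "class") == iris$Species[-train])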

Datasets (to be updated)


by YAO, Yuan.