
Math 6380J: A Mathematical Introduction to Data Analysis
Spring 2017


Course Information

Synopsis (摘要)

This course is open to graduate and senior undergraduate students in applied mathematics and statistics who are interested in learning from data. Students from other backgrounds, such as engineering and biology, are also welcome, provided they have a certain level of mathematical maturity. The course starts with two manifestations of the curse of dimensionality: Stein's phenomenon and random matrix theory for PCA. It then covers fundamental topics in high-dimensional statistics, manifold learning, diffusion geometry, random walks on graphs, concentration of measure, random matrix theory, and geometric and topological methods.
Prerequisites: linear algebra, basic probability and multivariate statistics, basic stochastic processes (Markov chains), and convex optimization; familiarity with Matlab, R, and/or Python.

Reference (参考教材)

[pdf download]

Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. By Efron and Hastie. A new monograph on computational statistics and 'learning'.

The Elements of Statistical Learning. 2nd Ed. By Hastie, Tibshirani, and Friedman. A classic textbook on statistical learning for graduate students interested in the statistical aspects of machine learning.

An Introduction to Statistical Learning, with Applications in R. By James, Witten, Hastie, and Tibshirani. A simplified version of the textbook above, aimed at undergraduates, with extensive lab sessions in R.

Instructor:

Yuan YAO

Time and Place:

Monday 6:30pm-9:20pm, Rm 5510 (Lift 25-26)
This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates and the instructor. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza. If you have any problems or feedback for the developers, email team@piazza.com.
Find our class page at: https://piazza.com/ust.hk/spring2017/math6380/home

Homework and Projects:

Monthly mini-projects and a final major project. No final exam.

[Project Reports]

Schedule (时间表)

Date Topic Instructor Scribe
02/06/2017, Mon Lecture 01: Introduction, Geometry of PCA (Chap 1 Sec 1), MLE (Chap 2 Sec 1)
Y.Y.
02/13/2017, Mon Lecture 02: Stein's Estimate of Mean and Parallel Analysis for PCA (Chap 2 Sec 2, Efron-Hastie Chap 7.)
    [Reference]: the following code examples are in R; please let me know if you find good sources in other languages. A short from-scratch sketch also follows this entry.
  • Stein's Estimate vs. MLE [ james_stein.R ]
  • Horn's Parallel Analysis for PCA [ paran.R ] with S&P500 data in class: [ snp500.Rda ]
Y.Y.
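    [Code sketch]: a minimal from-scratch illustration of both topics above, written for this page rather than taken from james_stein.R or paran.R; the simulation sizes and the planted factor are arbitrary choices.

    # James-Stein vs. MLE in the Gaussian sequence model X ~ N(theta, I_p)
    set.seed(1)
    p <- 50; n_rep <- 1000
    theta <- rnorm(p)                            # true mean vector
    risk_mle <- risk_js <- numeric(n_rep)
    for (r in seq_len(n_rep)) {
      x  <- theta + rnorm(p)                     # MLE of theta is x itself
      js <- (1 - (p - 2) / sum(x^2)) * x         # shrink toward the origin
      risk_mle[r] <- sum((x  - theta)^2)
      risk_js[r]  <- sum((js - theta)^2)
    }
    c(MLE = mean(risk_mle), JS = mean(risk_js))  # JS dominates the MLE for p >= 3

    # Horn's parallel analysis: retain components whose eigenvalues exceed
    # a high quantile of eigenvalues from pure-noise data of the same shape.
    n <- 200
    X <- matrix(rnorm(n * p), n, p)
    f <- rnorm(n)
    X[, 1:5] <- X[, 1:5] + 2 * f                 # plant one shared factor
    obs  <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
    null <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)),
                                 symmetric = TRUE, only.values = TRUE)$values)
    which(obs > apply(null, 1, quantile, probs = 0.95))  # keeps the planted component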
02/20/2017, Mon Lecture 03: MLE, Linear, JS, LASSO, Hard Thresholding, Nonconvex Regularization, LBI (ISS): Risk and Consistency [Lecture Note]
Y.Y. Jiacheng XIA
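    [Code sketch]: a hedged illustration (not the lecture note's code) of why thresholding estimators beat the MLE in risk under sparsity; the universal threshold sqrt(2 log p) is the classical choice.

    # Sparse normal means: x = theta + noise, with only k of p means nonzero
    set.seed(2)
    p <- 1000; k <- 20
    theta <- c(rep(5, k), rep(0, p - k))
    x <- theta + rnorm(p)
    lam <- sqrt(2 * log(p))                      # universal threshold
    soft <- sign(x) * pmax(abs(x) - lam, 0)      # soft thresholding (LASSO-type)
    hard <- x * (abs(x) > lam)                   # hard thresholding
    c(MLE  = sum((x    - theta)^2),
      soft = sum((soft - theta)^2),
      hard = sum((hard - theta)^2))              # both thresholds beat the MLE here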
02/27/2017, Mon Lecture 04: Mini-Project 1 and some catch-up on Random Matrix Theory for PCA [ Lecture04.pptx ]
Y.Y.
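    [Code sketch]: a small simulation of the Marchenko-Pastur law behind the random matrix theory discussion; the dimensions are arbitrary.

    # Eigenvalues of a pure-noise sample covariance stay below the MP edge
    set.seed(3)
    n <- 1000; p <- 200; gamma <- p / n
    X <- matrix(rnorm(n * p), n, p)
    ev <- eigen(crossprod(X) / n, symmetric = TRUE, only.values = TRUE)$values
    hist(ev, breaks = 50, freq = FALSE, main = "Marchenko-Pastur bulk")
    abline(v = (1 + sqrt(gamma))^2, col = "red")   # upper edge, about 2.09 here
    max(ev)   # a genuine signal eigenvalue would separate beyond this edge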
03/06/2017, Mon Lecture 05: SDP relaxations, RPCA, and SPCA (Chap 4: 1-4)
    [Seminar]:
  • Speaker: Bowei YAN, U Texas-Austin
  • Title: Semidefinite relaxations for clustering [ slides ]
  • Abstract: In recent years, a number of works have studied community detection in stochastic block models (SBM) via semidefinite relaxations. Among the various proposed semidefinite programming approaches, conditions are usually required on the sparsity of the graph, the separation of the clusters, and the number and sizes of the clusters. In this talk I will introduce an SDP that uses the projection matrix instead of the indicator clustering matrix. We prove that this formulation recovers the ground-truth structure under weaker conditions on each of the aforementioned aspects. The proposed relaxation can also be used for kernel clustering and is shown to be robust to arbitrary outliers, in contrast to existing spectral methods. This is joint work with Purnamrita Sarkar.
Y.Y.
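    [Formulation sketch]: for orientation, a Peng-Wei-type SDP relaxation of k-way clustering with affinity matrix A, written with a projection-matrix variable Z in the spirit of the talk; the exact constraints used by Yan and Sarkar may differ.

    \begin{aligned}
    \max_{Z \in \mathbb{R}^{n \times n}} \quad & \langle A, Z \rangle \\
    \text{s.t.} \quad & Z \succeq 0, \quad Z \ge 0 \ \text{(entrywise)}, \\
    & Z \mathbf{1} = \mathbf{1}, \quad \operatorname{tr}(Z) = k,
    \end{aligned}

    where Z plays the role of the normalized cluster projection matrix H (H^T H)^{-1} H^T for an indicator matrix H.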
03/13/2017, Mon Lecture 06: Supervised PCA, Dual PCA-MDS, and Reproducing Kernel [ lecture06.pdf ]
Y.Y. Yuqi ZHAO
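    [Code sketch]: a short check of the PCA-MDS duality from the lecture, using base R only.

    # Classical MDS on Euclidean distances recovers PCA scores up to sign
    X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
    pca <- prcomp(X, center = FALSE)$x[, 1:2]    # PCA scores
    mds <- cmdscale(dist(X), k = 2)              # classical MDS embedding
    max(abs(abs(pca) - abs(mds)))                # essentially zero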
03/20/2017, Mon Lecture 07: RKHS, SVM, and MDS with incomplete information (last part of lecture06.pdf and Chap 4.5-4.6)
Y.Y.
03/27/2017, Mon Lecture 08: Tree methods: CART, Bagging, Random Forests, and Boosting [ slides ] [ ISLR: Chap 8 ]
Y.Y.
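    [Code sketch]: a toy random forest in the style of the ISLR Chap 8 labs; assumes the randomForest package is installed.

    library(randomForest)
    set.seed(8)
    train <- sample(nrow(iris), 100)
    rf <- randomForest(Species ~ ., data = iris[train, ], ntree = 500)
    mean(predict(rf, iris[-train, ]) == iris$Species[-train])   # test accuracy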
04/03/2017, Mon Lecture 09: Manifold Learning: ISOMAP, LLE and extended LLEs [ lecture09.1.pdf ] [ lecture09.2.pdf ]
Y.Y.
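    [Code sketch]: a from-scratch ISOMAP on a Swiss-roll-like surface (k-NN graph, graph geodesics, then classical MDS); assumes the igraph package is installed, and is not the lecture's implementation.

    library(igraph)
    set.seed(9)
    n <- 500; t <- runif(n, 1.5 * pi, 4.5 * pi)
    X <- cbind(t * cos(t), runif(n, 0, 10), t * sin(t))   # Swiss roll
    D <- as.matrix(dist(X))
    k <- 8                                    # increase k if the graph disconnects
    A <- matrix(0, n, n)
    for (i in 1:n) {                          # symmetric k-nearest-neighbor graph
      nb <- order(D[i, ])[2:(k + 1)]
      A[i, nb] <- D[i, nb]; A[nb, i] <- D[nb, i]
    }
    g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
    G <- distances(g)                         # geodesic (shortest-path) distances
    emb <- cmdscale(G, k = 2)                 # classical MDS on geodesics
    plot(emb, col = rainbow(n)[rank(t)], pch = 16)   # the roll is unrolled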
04/10/2017, Mon Lecture 10: Topological Data Analysis [ lecture10.pdf ]
    [Reference]:
  • Project reports are at [GitHub Math6380 web]
  • In particular, all reports with source code are placed in the folder [ Project2 reports ] for peer review.
  • Submit your top 5 favorite reports (id and authors) on or before April 25, 2017, to the datascience.hw email address. No self-votes; they will not be counted.
  • Crowdsourced World College Ranking at allourideas: [ allourideas.org/worldcollege ]
Y.Y.
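    [Code sketch]: persistent homology of a noisy circle, assuming the CRAN package TDA and its ripsDiag function are available; only meant to illustrate the idea of the lecture.

    library(TDA)
    set.seed(10)
    ang <- runif(200, 0, 2 * pi)
    X <- cbind(cos(ang), sin(ang)) + matrix(rnorm(400, sd = 0.05), 200, 2)
    d <- ripsDiag(X, maxdimension = 1, maxscale = 2)   # Vietoris-Rips filtration
    plot(d$diagram)          # one long-lived H1 feature reveals the circle's loop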
04/17/2017, Mon Spring break
Y.Y.
04/24/2017, Mon Lecture 11: Applied Hodge Theory: Social Choice and Game Theory etc. [ lecture11.pdf ]
Y.Y.
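    [Code sketch]: the global-rating (gradient) step of HodgeRank as a least-squares problem on the comparison graph; a from-scratch toy, not the lecture's code.

    # Pairwise comparisons y_ij ~ s_j - s_i; recover scores s by least squares
    set.seed(11)
    n <- 10
    s_true <- rnorm(n)                        # latent scores
    E <- t(combn(n, 2))                       # complete comparison graph
    y <- s_true[E[, 2]] - s_true[E[, 1]] + rnorm(nrow(E), sd = 0.3)
    d <- matrix(0, nrow(E), n)                # incidence (gradient) operator
    d[cbind(seq_len(nrow(E)), E[, 1])] <- -1
    d[cbind(seq_len(nrow(E)), E[, 2])] <-  1
    s_hat <- c(0, qr.solve(d[, -1], y))       # pin s_1 = 0 (constants are in ker d)
    cor(s_true, s_hat)                        # near 1; the residual is the cyclic part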
04/28/2017, Fri, 3-6pm, Room 2405 (lift 17-18) Lecture 12: An Odyssey on Representation Learning: A Brief Introduction to Neural Networks [ lecture12.pdf ] and [ Final Project Description ]
Y.Y.
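    [Code sketch]: a one-hidden-layer network on a toy dataset via the nnet package (one of R's recommended packages), just to make the representation-learning theme concrete; not the lecture's material.

    library(nnet)
    set.seed(12)
    train <- sample(nrow(iris), 100)
    fit <- nnet(Species ~ ., data = iris[train, ], size = 5, decay = 1e-3,
                maxit = 500, trace = FALSE)   # 5 hidden units, weight decay
    mean(predict(fit, iris[-train, ], type = "class") == iris$Species[-train])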

Datasets (to be updated)


by YAO, Yuan.