Data Science with Pandas

Data science is the practice of learning from data to gain useful predictions and insights. It consists of the steps below¹:

Ask an interesting question:
1. What is the scientific goal?
2. What would you do if you had all the data?
3. What do you want to predict or estimate?
Get the data:
1. How were the data sampled?
2. Which data are relevant?
3. Are there privacy or copyright issues?
Explore the data:
1. Plot the data.
2. Are there anomalies?
3. Are there patterns?
Model the data:
1. Build a model.
2. Fit the model.
3. Validate the model.
Communicate and visualize the results:
1. What did we learn?
2. Do the results make sense?
3. Can we tell a story?

Data science can roughly be split into data engineering and data analysis. Data engineering consists of gathering and preparing data for analysis by scraping, cleaning, correcting, integrating, re-ordering, scaling, converting, etc. In other words, data engineers transform data into formats that data scientists can analyze. For a good introduction to data analysis, sign up for the free Udacity course.
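
A few of the preparation steps named above (cleaning, converting, correcting, re-ordering, scaling) can be sketched in pandas. The records and column names below are invented for illustration, and this is only one of many ways to arrange such a pipeline:

```python
import pandas as pd

# Hypothetical raw records with typical problems: stray whitespace,
# inconsistent case, numbers stored as text, a missing value, duplicates.
raw = pd.DataFrame({
    "city": [" Amsterdam", "amsterdam ", "Berlin", "Berlin", "Paris"],
    "temp_c": ["12.5", "12.5", "9", "9", None],
})

clean = (
    raw
    .assign(
        city=lambda df: df["city"].str.strip().str.title(),  # cleaning
        temp_c=lambda df: pd.to_numeric(df["temp_c"]),       # converting
    )
    .drop_duplicates()             # correcting: drop repeated records
    .sort_values("city")           # re-ordering
    .reset_index(drop=True)
)

# Scaling: map temp_c onto the range [0, 1]; missing values stay missing.
t = clean["temp_c"]
clean["temp_scaled"] = (t - t.min()) / (t.max() - t.min())

print(clean)
```

The duplicates only become visible after the text has been cleaned, which is why the cleaning step comes before drop_duplicates().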

Python packages

The PyData Python Open Data Science Stack:

numpy as np
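- axis=0 runs down the rows, so a reduction such as sum(axis=0) returns one value per column; axis=1 runs across the columns and returns one value per row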
scipy
sklearn
- preprocessing
- linear_model
- model_selection (called cross_validation in older releases)
- metrics (e.g. confusion_matrix)
- svm
- multiclass
pandas as pd
(The framework for data engineering, although others exist, like Bubbles.)
bonobo for ETL
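
The axis convention in NumPy is easy to mix up, so here is a minimal sketch with a made-up 2x3 array. The axis you pass is the one that gets collapsed:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3): 2 rows, 3 columns

col_sums = a.sum(axis=0)    # collapse axis 0 (the rows): one sum per column
row_sums = a.sum(axis=1)    # collapse axis 1 (the columns): one sum per row

print(col_sums.tolist())    # [5, 7, 9]
print(row_sums.tolist())    # [6, 15]
```

The result of a.sum(axis=0) has shape (3,), i.e. the original shape with axis 0 removed.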

Preprocessing

binarization
mean removal
scaling
normalization
label encoding
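
Each of the techniques listed above has a counterpart in sklearn.preprocessing. The sketch below uses a made-up matrix; the threshold, the [0, 1] range and the L1 norm are arbitrary choices for illustration:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, -2.0,  2.0],
              [3.0,  0.0, -1.0],
              [5.0,  2.0,  5.0]])

# Binarization: values above the threshold become 1, the rest 0.
binarized = preprocessing.Binarizer(threshold=1.5).fit_transform(X)

# Mean removal: centre each column on 0 (scale() also sets unit variance).
standardized = preprocessing.scale(X)

# Scaling: squeeze each column into the range [0, 1].
scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Normalization: rescale each row so its absolute values sum to 1 (L1 norm).
normalized = preprocessing.normalize(X, norm="l1")

# Label encoding: map string labels to integers, classes in sorted order.
encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(["red", "green", "blue", "green"])

print(binarized)
print(codes.tolist(), encoder.classes_.tolist())
```

Note that mean removal and scaling work per column (feature), while normalization works per row (sample).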

Machine learning

Wikipedia Portal

Applications of AI

Computer Vision (CV)
Natural Language Processing (NLP)
Speech Recognition
Expert Systems (rule based)
Games
Robotics (all of the above)

Branches of AI

Machine learning and pattern recognition
Logic-based AI
Search
Knowledge representation
Planning
Heuristics
Genetic Programming

Types of models

analytical
learned
- supervised: uses labeled training data
- unsupervised: without labeled training data
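
The difference between the two kinds of learned model shows up in what fit() receives. A minimal sketch on made-up one-dimensional data, with LogisticRegression and KMeans picked only as familiar representatives of each kind:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated groups of points (invented data).
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])   # labels, used only by the supervised model

# Supervised: learns the mapping from X to the given labels y.
clf = LogisticRegression().fit(X, y)
predicted = clf.predict([[1.1], [4.9]])

# Unsupervised: sees only X and has to discover the two groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = km.labels_

print(predicted.tolist())
print(clusters.tolist())
```

The cluster numbers that KMeans assigns are arbitrary: it can tell that there are two groups, but not which one should be called 0 or 1.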

Techniques

classification: arrange data into a fixed number of distinct categories
- if the number of samples is insufficient, the algorithm tends to overfit the training data
Classifiers:
- logistic regression: strictly a regression model that estimates class probabilities, but often used as a classifier by thresholding those probabilities
- naïve Bayes: applies Bayes' theorem, which describes the probability of an event based on conditions related to that event (naïve Bayes assumes these conditions are independent of each other, given the class)
- Support Vector Machine (SVM): defines a separating hyperplane between classes (the best hyperplane maximizes the margin, i.e. the distance to the nearest points of each class)
regression: explain the relationship between independent / input / predictor variables and dependent / output variables
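
The three classifiers and the regression technique above can be tried side by side in scikit-learn. This is a sketch on invented toy data; the Gaussian variant of naïve Bayes and the linear SVM kernel are assumptions made for the example, not the only options:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Made-up 2-D points in two linearly separable classes.
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
new_points = np.array([[1.5, 1.5], [6.5, 6.5]])

classifiers = {
    "logistic regression": LogisticRegression(),
    "naive Bayes": GaussianNB(),
    "SVM (linear kernel)": SVC(kernel="linear"),
}
predictions = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    predictions[name] = clf.predict(new_points).tolist()
    print(name, predictions[name])

# Logistic regression estimates probabilities; predict() thresholds them.
probabilities = classifiers["logistic regression"].predict_proba(new_points)

# Regression: recover the relationship y = 2x + 1 from noise-free samples.
x = np.array([[0.0], [1.0], [2.0], [3.0]])
reg = LinearRegression().fit(x, 2 * x.ravel() + 1)
slope, intercept = reg.coef_[0], reg.intercept_
print(round(slope, 6), round(intercept, 6))
```

On data this cleanly separated all three classifiers agree; they start to differ once the classes overlap or the features are correlated.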

Metrics

Confusion matrix: shows the performance of a classifier in terms of true/false positives/negatives
F1 score: harmonic mean of…
- precision: #true positives / #predicted positives (true positives + false positives)
- recall: #true positives / #actual positives (true positives + false negatives)
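
To make the metrics concrete, the sketch below derives them by hand from a confusion matrix and compares the results with the scikit-learn helpers. The labels and predictions are made up:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Invented ground truth and predictions for a binary classifier (1 = positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels [0, 1], ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # 3 / 5 = 0.6
recall = tp / (tp + fn)      # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, f1)

# The library functions should give the same numbers.
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values, so a classifier cannot score well by maximizing only precision or only recall.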

Concepts

Cognitive modeling: simulating the human thinking process
Deep learning: feature extraction and transformation using a cascade of multiple layers (hence deep) of nonlinear processing units (e.g. neural nets, belief networks), each using the output from the previous layer as input.
Rational agent: does the 'right' thing in a given context, using sensors, actuators and an inference engine
General Problem Solver (GPS)
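Cross validation: divide your data set into training and test subsets several times (e.g. k folds), fitting on each training subset and scoring on the matching test subset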
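
Cross validation can be sketched with sklearn.model_selection. The data, the choice of 5 folds and the use of LogisticRegression are all assumptions made for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data: 20 one-dimensional samples, class 1 for values of 10 and up.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() >= 10).astype(int)

# 5 folds: each sample lands in a test subset exactly once.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X):
    print("train:", len(train_idx), "test:", len(test_idx))

# One accuracy score per fold; their mean estimates out-of-sample accuracy.
scores = cross_val_score(LogisticRegression(), X, y, cv=folds)
print(scores, scores.mean())
```

A single train/test split gives one score that depends heavily on which samples happened to end up in the test subset; averaging over folds reduces that dependence.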

Footnotes:

¹ https://cs109.github.io