Data Science with Pandas
Data science is the practice of learning from data to gain useful predictions and insights. It consists of the steps below [1]:
- Ask an interesting question:
  - What is the scientific goal?
  - What would you do if you had all the data?
  - What do you want to predict or estimate?
- Get the data:
  - How were the data sampled?
  - Which data are relevant?
  - Are there privacy or copyright issues?
- Explore the data:
  - Plot the data.
  - Are there anomalies?
  - Are there patterns?
- Model the data:
  - Build a model.
  - Fit the model.
  - Validate the model.
- Communicate and visualize the results:
  - What did we learn?
  - Do the results make sense?
  - Can we tell a story?
Data science can roughly be split into data engineering and data analysis. Data engineering consists of gathering and preparing data for analysis by scraping, cleaning, correcting, integrating, re-ordering, scaling, converting, etc. In other words, data engineers transform data into formats that data scientists can analyze. For a good introduction to data analysis, sign up for the free Udacity course.
Python packages
Preprocessing
- binarization
- mean removal
- scaling
- normalization
- label encoding
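As a rough illustration of the preprocessing steps listed above, the sketch below implements each one in plain Python on a single feature. The function names and sample values are made up for this example. Two interpretations are assumptions: "mean removal" is taken to mean centering on zero and scaling to unit variance, and "normalization" is taken to mean L1 normalization (absolute values sum to 1). Libraries such as scikit-learn offer their own versions of these operations in `sklearn.preprocessing`.

```python
from statistics import mean, pstdev


def binarize(values, threshold):
    """Map each value to 1 if it exceeds the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


def standardize(values):
    """Mean removal: center on zero and scale to unit (population) variance.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, lo=0.0, hi=1.0):
    """Scaling: map the smallest value to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def l1_normalize(values):
    """Normalization (L1 variant): make the absolute values sum to 1."""
    total = sum(abs(v) for v in values)
    return [v / total for v in values]


def label_encode(labels):
    """Label encoding: replace each distinct label by an integer code.

    Returns the codes and the sorted list of classes (code -> label).
    """
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes


# Made-up sample data, for illustration only.
feature = [2.0, 4.0, 6.0, 8.0]
colors = ["red", "green", "blue", "green"]

print(binarize(feature, threshold=5.0))
print(standardize(feature))
print(min_max_scale(feature))
print(l1_normalize(feature))
print(label_encode(colors))
```

Returning the class list from `label_encode` keeps the mapping reversible, so predicted codes can be turned back into labels later.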
Machine learning
Applications of AI
- Computer Vision (CV)
- Natural Language Processing (NLP)
- Speech Recognition
- Expert Systems (rule based)
- Games
- Robotics (all of the above)
Branches of AI
- Machine learning and pattern recognition
- Logic-based AI
- Search
- Knowledge representation
- Planning
- Heuristics
- Genetic Programming
Types of models
- analytical
- learned
- supervised: uses labeled training data
- unsupervised: without labeled training data
Techniques
classification: arrange data into a fixed number of distinct categories
- if the number of samples is insufficient, the algorithm will overfit the training data
Classifiers:
- logistic regression: strictly a regression model that estimates class probabilities, but often used as a classifier by thresholding the predicted probability
- naïve Bayes: applies Bayes' theorem, which describes the probability of an event occurring based on different conditions related to this event (naïve Bayes assumes these conditions are independent of each other, given the class)
- Support Vector Machine (SVM): defines a separating hyperplane between classes (the best hyperplane maximizes the margin, i.e. the distance to the nearest points of each class)
- regression: explain the relationship between independent / input / predictor variables and dependent / output variables
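To make the regression idea above concrete, here is a minimal sketch of ordinary least squares for one input variable, written in plain Python. The helper names `fit_line` and `predict` and the sample values are invented for this illustration; real projects would more likely rely on a library implementation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Assumes xs and ys have the same length and xs are not all identical.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the dependent (output) variable for a new input x."""
    return slope * x + intercept


# Made-up observations that roughly follow y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"prediction for x=6: {predict(6.0, slope, intercept):.2f}")
```

Here `xs` plays the role of the independent (input, predictor) variable and `ys` the dependent (output) variable.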
Metrics
- Confusion matrix: shows the performance of a classifier in terms of true/false positives/negatives
- F1 score: harmonic mean of…
- precision: #true positives / #predicted positives (true positives + false positives)
- recall: #true positives / #actual positives (true positives + false negatives)
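The metrics above can be computed directly from the four cells of a binary confusion matrix. The following plain-Python sketch shows one way to do so; the function names and the example labels are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

print(confusion_counts(y_true, y_pred))     # (3, 1, 1, 3)
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```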
Concepts
- Cognitive modeling: simulating the human thinking process
- Deep learning: feature extraction and transformation using a cascade of multiple layers (hence deep) of nonlinear processing units (e.g. neural nets, belief networks), each using the output from the previous layer as input.
- Rational agent: does the 'right' thing in a given context, using sensors, actuators and an inference engine
- General Problem Solver (GPS)
- Cross validation: repeatedly divide your data set into training and test subsets, so that the model is always evaluated on data it was not trained on
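One common form of cross validation is k-fold: the data set is cut into k parts, and each part serves as the test subset once while the rest is used for training. The sketch below only generates the index splits. The name `k_fold_indices` is hypothetical, and the sketch does not shuffle, which is an assumption that only suits data with no meaningful ordering.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous and unshuffled; the first n_samples % k folds
    get one extra sample so that every sample is tested exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Illustration with 10 samples and 3 folds.
for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` subset and scored on the matching `test` subset, with the k scores averaged at the end.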