Data Science Bookmarks
Bookmarks of the links and references I have used for Data Science development and learning.
Basic Statistics - My Video references:
- Correlation Coefficient
- The Correlation Coefficient - Common Misconception
- Spearman's Correlation Coefficient
- RMSE Stackoverflow
- Hypothetic Statistics
- Useful Python libraries for statistics: numpy and scipy.stats (for example, its mode and rankdata functions)
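As a quick reminder of what the topics above look like in code, here is a minimal sketch using numpy and scipy.stats. All the numbers are made up for illustration, and the variable names are my own, not taken from the linked references.

```python
import numpy as np
from scipy import stats

# Made-up sample data, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # increases with x, but not linearly

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson_r = np.corrcoef(x, y)[0, 1]
spearman_rho, _ = stats.spearmanr(x, y)

# RMSE between observed values and hypothetical predictions.
observed = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])
rmse = np.sqrt(np.mean((observed - predicted) ** 2))

# rankdata gives tied values the average of their ranks by default.
ranks = stats.rankdata([10, 20, 20, 30])

# mode returns the most frequent value and its count.
# np.ravel copes with both scalar and array results across scipy versions.
mode_result = stats.mode(np.array([1, 2, 2, 3, 2]))
most_common = int(np.ravel(mode_result.mode)[0])

print(pearson_r, spearman_rho, rmse, ranks, most_common)
```

Because y is a monotonic but non-linear function of x here, the Spearman coefficient is 1 while the Pearson coefficient stays below 1, which is the kind of difference the correlation references discuss.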
Statistics Jargon:
- Outlier - an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
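One common convention for flagging outliers (my own choice for this sketch, not something the definition above prescribes) is the 1.5 x IQR rule: anything further than 1.5 interquartile ranges outside the quartiles gets flagged. The data below is made up.

```python
import numpy as np

# Made-up measurements; 95.0 is deliberately far from the rest.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (an assumed, conventional threshold).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
print(outliers)
```

Whether a flagged point should be dropped still depends on whether it reflects measurement error or genuine variability.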
Pandas
- Very useful for playing around with datasets/CSV files
- Data frames and Series
- Finding unique values
- DataFrame.apply is a very useful function for performing row/column operations on a dataframe. Its other uses are worth checking too.
- Pandas Dataframe apply usage
- Data Cleaning is an important application of pandas
- Matplotlib is intended to be a plotting library and pandas a data analysis library. However, in some cases their functionality overlaps.
- Best Pandas functions
- Understanding Described dataframe
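A small sketch of the pandas pieces listed above: Series versus DataFrame, unique values, apply along rows and columns, and describe. The table and column names are invented for illustration.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "temp_c": [30.0, 35.0, 28.0, 32.0],
    "humidity": [60, 40, 65, 70],
})

# A single column is a Series; unique() lists distinct values in order of appearance.
cities = df["city"].unique()

# Series.apply works element by element on one column.
df["temp_f"] = df["temp_c"].apply(lambda c: c * 9 / 5 + 32)

# DataFrame.apply with axis=1 passes each row to the function.
df["hot_and_humid"] = df.apply(
    lambda row: row["temp_c"] > 29 and row["humidity"] > 50, axis=1
)

# With the default axis=0 the function receives each column instead.
col_ranges = df[["temp_c", "humidity"]].apply(lambda col: col.max() - col.min())

# describe() summarises a numeric column: count, mean, std, min, quartiles, max.
summary = df["temp_c"].describe()
print(cities, col_ranges, summary, sep="\n")
```

For simple arithmetic such as the Celsius conversion, plain vectorised expressions (`df["temp_c"] * 9 / 5 + 32`) are usually faster than apply; apply is handy when the logic does not vectorise easily.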
Numpy
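- numpy's random.rand(n) returns n random floats drawn uniformly from [0, 1). random.randint(n) returns a single random integer from 0 up to (but not including) n; pass size= to get an array of integers.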
- Range_Arange_Diff
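A short sketch of the two points above, assuming numpy is imported as np. The seed is only there so that reruns give the same draws.

```python
import numpy as np

np.random.seed(0)  # fixed seed so that reruns give the same draws

floats = np.random.rand(5)               # 5 floats drawn uniformly from [0, 1)
single_int = np.random.randint(10)       # one integer from 0 to 9
ints = np.random.randint(0, 10, size=5)  # 5 integers from 0 to 9

# range is a lazy Python sequence of ints.
py_range = list(range(0, 10, 2))

# np.arange builds an ndarray, accepts float steps,
# and supports vectorised arithmetic on the result.
np_range = np.arange(0, 1, 0.25)
doubled = np.arange(5) * 2

print(floats, single_int, ints, py_range, np_range, doubled, sep="\n")
```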
Data Visualization
- Datasets with nearly identical summary statistics can look very different when plotted. Francis Anscombe demonstrated this with four constructed datasets (Anscombe's quartet).
- Dataviz
- Data Storytelling
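The Anscombe point can be checked numerically. The sketch below uses the four datasets of Anscombe's quartet and prints the mean of y and the Pearson correlation for each; the variable names are my own.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean of y and Pearson correlation for each dataset.
stats_by_set = {}
for name, (x, y) in quartet.items():
    mean_y = float(np.mean(y))
    corr = float(np.corrcoef(x, y)[0, 1])
    stats_by_set[name] = (mean_y, corr)
    print(f"{name}: mean(y)={mean_y:.2f}, corr={corr:.3f}")
```

The printed numbers are almost the same for all four sets, yet scatter plots of them show a roughly linear cloud, a curve, a line with one outlier, and a vertical stack with one far-off point.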
Pandas Long and Wide data format
- Wide format is mostly used in pandas when operations are done on values in a column
- Wide format is the data format for statistical modelling
- .merge, .pivot, flatten, aggregate functions
- Understanding Described dataframe
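A minimal sketch of moving between the two layouts with the functions named above. I use melt for the wide-to-long step, which the list does not mention, and the table is made up.

```python
import pandas as pd

# Made-up wide table: one row per student, one column per subject.
wide = pd.DataFrame({
    "student": ["A", "B"],
    "math": [90, 70],
    "physics": [80, 60],
})

# Wide -> long: one row per (student, subject) observation.
long = wide.melt(id_vars="student", var_name="subject", value_name="score")

# Long -> wide again with pivot.
back_to_wide = long.pivot(index="student", columns="subject", values="score")

# "Flatten" the pivoted result back into ordinary columns.
flat = back_to_wide.reset_index()
flat.columns.name = None

# Aggregate on the long table.
mean_by_subject = long.groupby("subject")["score"].mean()

# merge joins two tables on a shared key.
houses = pd.DataFrame({"student": ["A", "B"], "house": ["red", "blue"]})
merged = long.merge(houses, on="student", how="left")

print(long, flat, mean_by_subject, merged, sep="\n\n")
```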
Plotting
- pyplot
- [grouped-unstack-histogram](http://themrmax.github.io/2015/11/13/grouped-histograms-for-categorical-data-in-pandas.html)
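The general groupby-then-unstack idea for grouped bar charts of categorical data can be sketched as below. This is my own minimal version with made-up data, not the code from the linked post; `plot_grouped` is a hypothetical helper name.

```python
import pandas as pd

# Made-up categorical data.
df = pd.DataFrame({
    "species": ["cat", "cat", "dog", "dog", "dog", "cat"],
    "coat": ["short", "long", "short", "short", "long", "short"],
})

# Count each (species, coat) pair, then unstack coat into columns
# so that every species row has one count per coat type.
counts = df.groupby(["species", "coat"]).size().unstack(fill_value=0)
print(counts)


def plot_grouped(table):
    """Draw one group of bars per row, one bar per column (needs matplotlib)."""
    import matplotlib.pyplot as plt

    ax = table.plot(kind="bar")
    ax.set_ylabel("count")
    plt.show()
```

Calling `plot_grouped(counts)` draws the chart through pandas' own plot wrapper, which is one of the places where pandas and matplotlib overlap.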
Machine learning
Linear Regression
Classification
Feature Selection
Gradient Descent
Clustering
Natural Language Processing
Written on March 28, 2017