EDA stands for Exploratory Data Analysis. It is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. The key aspects of EDA include:
- Univariate analysis - analyzing the distribution, central tendency, and dispersion of individual variables. This includes visualizing each variable on its own using histograms, boxplots, density plots etc.
- Bivariate analysis - analyzing the relationship between two variables using scatter plots, correlation coefficients etc. This can identify positive/negative correlations.
- Multivariate analysis - analyzing the relationships between multiple variables through visualizations like scatter plot matrices, pair plots, parallel coordinate plots etc.
- Data transformations - log transforms, binning, Winsorization etc. to handle skewness and outliers.
- Dimensionality reduction - using techniques like PCA to simplify large datasets.
- Cluster analysis - clustering observations into groups based on similarity.
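As a rough illustration of the univariate, bivariate, and transformation items above, here is a minimal sketch using pandas and numpy. The dataset is synthetic and the column names (`age`, `income`, `spend`) are invented for this example; the 1st/99th percentile cut-offs for Winsorization are an arbitrary choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
n = 500
income = rng.lognormal(mean=10.0, sigma=0.8, size=n)    # right-skewed variable
age = rng.normal(loc=40.0, scale=10.0, size=n)
spend = 0.1 * income + rng.normal(scale=500.0, size=n)  # depends on income
df = pd.DataFrame({"age": age, "income": income, "spend": spend})

# Univariate: central tendency, dispersion, and skewness per column.
summary = df.describe().T
summary["skew"] = df.skew()

# Bivariate: Pearson correlation between every pair of columns.
corr = df.corr()

# Transformation: a log transform to reduce right skew.
df["log_income"] = np.log1p(df["income"])

# Winsorization: clip extreme values to the 1st and 99th percentiles.
low, high = df["income"].quantile([0.01, 0.99])
df["income_wins"] = df["income"].clip(lower=low, upper=high)

print(summary[["mean", "std", "skew"]])
print(corr.round(2))
```

Comparing `df["income"].skew()` with `df["log_income"].skew()` gives a quick check on whether the transform actually reduced the skew.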
The goals of EDA are to detect anomalies, test assumptions, generate hypotheses and ultimately gain insights for further focused analysis. It gives a deeper understanding of the patterns in the data before predictive models are applied. EDA is an iterative cycle and lays the groundwork for more advanced techniques.
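The dimensionality reduction item in the list above can be sketched with plain numpy. This is a minimal PCA via singular value decomposition on synthetic data; the `mixing` matrix, the noise level, and the choice of `k = 2` components are assumptions made only so the example has a known low-dimensional structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 200 observations of 5 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = np.array([
    [1.0, 0.8, -0.6, 0.3, 0.5],
    [0.2, -0.5, 0.7, 1.0, -0.9],
])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Standardize each feature to zero mean and unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S**2 / np.sum(S**2)  # share of variance per component

# Project onto the first k components.
k = 2
scores = Xs @ Vt[:k].T

print(explained.round(3))
print(scores.shape)
```

The `explained` array is usually the first thing to inspect: it indicates how many components are worth keeping before plotting or clustering the projected `scores`.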
General EDA
Some libraries to use:
- pandas - for data manipulation and analysis.
- numpy - for mathematical and statistical operations.
- matplotlib - for basic visualizations like histograms, scatter plots.
- seaborn - for statistical data visualizations built on top of matplotlib.
- scipy - for statistical tests and methods.
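To show how these libraries might fit together in a first pass, here is a small sketch. The dataset and its column names (`height_cm`, `wait_min`) are hypothetical and generated in place; the Shapiro-Wilk test is just one example of a scipy statistical test, and the `Agg` backend is set only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical dataset, generated here only so the example is runnable.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height_cm": rng.normal(170.0, 8.0, size=300),
    "wait_min": rng.exponential(5.0, size=300),
})

# pandas: column types, missing values, summary statistics.
print(df.dtypes)
missing = df.isna().sum()
print(df.describe())

# numpy: interquartile range as a robust measure of spread.
q1, q3 = np.percentile(df["wait_min"], [25, 75])
iqr = q3 - q1

# scipy: Shapiro-Wilk normality test for each column.
pvalues = {}
for col in df.columns:
    _, p = stats.shapiro(df[col])
    pvalues[col] = p

# matplotlib: one histogram per column.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, col in zip(axes, df.columns):
    ax.hist(df[col], bins=30)
    ax.set_title(col)
fig.tight_layout()

print(pvalues)
```

In an interactive session the figure would be displayed with `plt.show()` or written to disk with `fig.savefig(...)`; seaborn can be layered on top of the same DataFrame for more elaborate statistical plots.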