Exploratory Data Analysis (EDA) is where meaningful analysis truly begins. Before writing a single line of machine learning code, data scientists must first ask: What does the data say? EDA answers that question. By examining structure, distribution, relationships, and irregularities in a dataset, analysts can ensure models are built on solid ground. In a recent study on accounting fraud, EDA helped expose key red flags (such as financial strain, poor internal controls, and organizational complexity) long before any model was trained.
EDA isn’t a one-size-fits-all process. Four core types of EDA are typically used, and each plays a specific role:
1. Univariate Non-Graphical EDA
Univariate Non-Graphical EDA focuses on one variable at a time using summary statistics.
Measures like mean, median, mode, variance, skewness, and kurtosis offer insight into the shape
and spread of numerical variables. For categorical features, frequency tables and proportions
show how often certain values appear. In the fraud dataset, over 85% of entries were fraud
cases—a major imbalance that statistical summaries made immediately apparent.
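As a rough illustration of these summaries, here is a minimal pandas sketch. The column names (fraud_amount, is_fraud) and all values are hypothetical and are not taken from the study's dataset.

```python
import pandas as pd

# Hypothetical toy data; these values are NOT from the study.
df = pd.DataFrame({
    "fraud_amount": [1.2, 0.8, 2.5, 1.1, 0.9, 150.0, 1.4, 3.0],
    "is_fraud": [1, 1, 1, 0, 1, 1, 1, 1],
})

amount = df["fraud_amount"]

# Summary statistics for a numerical variable.
summary = {
    "mean": amount.mean(),
    "median": amount.median(),
    "variance": amount.var(),   # sample variance (ddof=1)
    "skewness": amount.skew(),  # > 0 suggests a long right tail
    "kurtosis": amount.kurt(),  # excess kurtosis (normal distribution = 0)
}

# Frequency table, proportions, and mode for a categorical variable.
counts = df["is_fraud"].value_counts()
proportions = df["is_fraud"].value_counts(normalize=True)
most_common_class = df["is_fraud"].mode().iloc[0]

print(summary)
print(proportions)
```

A mean far above the median, together with positive skewness, points to a right-skewed variable, while the proportions table surfaces class imbalance of the kind described above.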
2. Univariate Graphical EDA
Visualizing individual variables can reveal patterns that tables miss. In the study, histograms, box plots, line charts, and stem-and-leaf plots showed that features like employee count and fraud amount were skewed and contained many outliers. Box plots, for instance, exposed extreme values: some fraud cases involved losses exceeding $100 billion. Visual analysis made the data’s skewness, variability, and range easy to interpret at a glance.
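A histogram and a box plot of a single skewed variable could be drawn along these lines with matplotlib. This is only a sketch: the data is randomly generated (log-normal) to mimic a right-skewed amount and has nothing to do with the study's actual values.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed "fraud amount" values; NOT study data.
fraud_amount = rng.lognormal(mean=2.0, sigma=1.0, size=500)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the shape of the distribution and its long right tail.
ax_hist.hist(fraud_amount, bins=30)
ax_hist.set_title("Histogram of fraud amount")
ax_hist.set_xlabel("fraud amount")
ax_hist.set_ylabel("count")

# Box plot: points beyond the whiskers (1.5 * IQR by default) are drawn
# as individual "fliers", which makes outliers easy to spot.
box = ax_box.boxplot(fraud_amount)
ax_box.set_title("Box plot of fraud amount")

outliers = box["fliers"][0].get_ydata()
print(f"{len(outliers)} points fall outside the whiskers")
```

Calling fig.savefig with a file name of your choice writes the figure to disk.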
3. Multivariate Non-Graphical EDA
To explore relationships between multiple variables without visuals, techniques like cross-tabulations and correlation matrices are used. In the study, cross-tabulations showed that fraud cases most often involved companies with fewer employees and smaller fraud amounts. Correlation matrices highlighted a strong link between the size of a fraud and the associated drop in share price, providing early clues about relationships worth investigating, although correlation alone does not establish cause and effect.
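Both techniques are short calls in pandas. The sketch below uses synthetic data with hypothetical column names and arbitrary band cut-offs, and it deliberately builds in a relationship between fraud amount and share price drop so the correlation matrix has something to show. None of it reproduces the study's figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic, hypothetical data; NOT from the study.
fraud_amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)
df = pd.DataFrame({
    "employees": rng.integers(10, 50_000, size=n),
    "fraud_amount": fraud_amount,
    # Built-in dependence on fraud amount, plus noise.
    "share_price_drop": 0.5 * np.log(fraud_amount) + rng.normal(0, 0.3, size=n),
})

# Bin the numeric variables so they can be cross-tabulated.
# The cut-offs here are arbitrary choices for illustration.
df["size_band"] = pd.cut(
    df["employees"],
    bins=[0, 1_000, 10_000, 50_000],
    labels=["small", "medium", "large"],
)
df["amount_band"] = pd.qcut(df["fraud_amount"], q=3, labels=["low", "mid", "high"])

# Cross-tabulation: counts of cases per (company size, fraud amount) cell.
crosstab = pd.crosstab(df["size_band"], df["amount_band"])

# Correlation matrix; Spearman is rank-based, so it is less sensitive
# to the heavy right tail of fraud_amount than Pearson.
corr = df[["employees", "fraud_amount", "share_price_drop"]].corr(method="spearman")

print(crosstab)
print(corr.round(2))
```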
4. Multivariate Graphical EDA
When analyzing multiple variables together, graphical techniques bring complex relationships to life. Scatter plots, bubble charts, grouped bar plots, and heatmaps showed how features like share price decline, employee count, and fraud amount interacted. A bubble chart, for example, revealed that larger companies with significant losses also faced steep drops in share price. These visualizations made it easier to spot patterns and communicate insights.
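A bubble chart and a correlation heatmap could be sketched as follows with matplotlib. The data is synthetic, the column names are hypothetical, and the choice of which variable goes on which axis is an assumption made for illustration, not the study's layout.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 150

# Synthetic, hypothetical data; NOT from the study.
df = pd.DataFrame({
    "employees": rng.integers(50, 100_000, size=n),
    "fraud_amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),
})
df["share_price_drop"] = 5 * np.log1p(df["fraud_amount"]) + rng.normal(0, 2, size=n)

fig, (ax_bubble, ax_heat) = plt.subplots(1, 2, figsize=(11, 4))

# Bubble chart: x and y show two variables, marker area encodes a third.
sizes = 300 * df["fraud_amount"] / df["fraud_amount"].max()
bubbles = ax_bubble.scatter(df["employees"], df["share_price_drop"], s=sizes, alpha=0.5)
ax_bubble.set_xscale("log")
ax_bubble.set_xlabel("employees (log scale)")
ax_bubble.set_ylabel("share price drop")
ax_bubble.set_title("Bubble size = fraud amount")

# Heatmap of the correlation matrix.
corr = df.corr()
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_heat.set_yticks(range(len(corr.columns)))
ax_heat.set_yticklabels(corr.columns)
ax_heat.set_title("Correlation heatmap")
fig.colorbar(image, ax=ax_heat, label="Pearson correlation")

fig.tight_layout()
```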
The study didn’t stop at EDA. Findings from the exploratory phase guided the selection of
features for predictive modeling. Random Forest and CART (Classification and Regression Trees) were tested, with CART producing the strongest results. Variables like audit firm type, prior offenses, and financial strain were identified as top predictors of fraud. Visualizing the decision tree offered a clear path from raw data to fraud classification.
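To make the modeling step concrete, the sketch below compares a single decision tree with a random forest in scikit-learn, whose DecisionTreeClassifier is documented as an optimized version of CART. Everything else is an assumption: the feature names (big4_auditor, prior_offenses, financial_strain) are hypothetical stand-ins for the predictors named above, the labels are generated by an invented rule, and the resulting scores say nothing about the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 600

# Hypothetical features and synthetic values; NOT the study's data.
feature_names = ["big4_auditor", "prior_offenses", "financial_strain"]
X = np.column_stack([
    rng.integers(0, 2, size=n),   # 1 if audited by a large firm (assumed coding)
    rng.integers(0, 4, size=n),   # number of prior offenses
    rng.normal(0, 1, size=n),     # standardized financial strain score
])

# Invented labeling rule plus noise, purely so the models have a signal to learn.
signal = 1.5 * X[:, 2] + 0.8 * X[:, 1] - 1.0 * X[:, 0]
y = (signal + rng.normal(0, 0.5, size=n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree keeps the printed rules readable.
cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

cart_accuracy = cart.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)

# Text rendering of the fitted tree and its feature importances.
tree_rules = export_text(cart, feature_names=feature_names)
importances = dict(zip(feature_names, cart.feature_importances_))

print(f"CART accuracy: {cart_accuracy:.2f}, forest accuracy: {forest_accuracy:.2f}")
print(tree_rules)
```

The printed rules are the text equivalent of a decision tree diagram: each indented branch is one threshold test on a feature, ending in a predicted class.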
EDA plays a far more critical role than many beginners realize. Without exploring the data first, both model performance and interpretability tend to suffer. Mastering the four types of EDA helps keep analysis thorough, reliable, and insightful. Anyone learning data science will gain a powerful edge by treating EDA not as a quick step, but as a core part of the journey.
For full details, refer to the original article available at: https://doi.org/10.21203/rs.3.rs-5635767/v1
Lokanan, M. E. (2024). Harnessing exploratory data analysis (EDA) for robust financial fraud
detection and model enhancement. Research Square. https://doi.org/10.21203/rs.3.rs-5635767/v1