Abstract:
Detecting financial fraud in listed companies with quantitative models is a long-standing research topic. Early empirical models relied mainly on logistic regression, with indicators gradually expanding from financial measures to those covering corporate governance, external auditing, and financing needs; this line of work emphasized causal inference and the identification of fraud characteristics. The approach performed well on early samples of fraud cases. As fraudulent tactics have grown more sophisticated in recent years, however, simple logistic regression models no longer meet practical needs, and shallow and deep learning models have gradually emerged. Shallow machine learning models mainly use decision trees, support vector machines, and ensemble learning algorithms. Their inputs include not only financial, corporate governance, and internal control indicators but also raw financial data and textual information. In particular, ensemble learning models based on decision trees offer relatively good interpretability, with recall rates between 70% and 80% in experimental settings. Deep learning models mainly use convolutional neural networks, recurrent neural networks, and long short-term memory networks. These models excel at handling unstructured data: their inputs include not only structured and textual data but also unstructured information such as audio and images. Deep learning models usually show stronger detection capability than shallow machine learning models, with recall rates generally exceeding 80%. However, they have limited interpretability and depend heavily on large training datasets, so the scarcity of sample data in China can easily lead to overfitting.
Overall, quantitative models, especially deep learning models, show considerable effectiveness in detecting financial fraud and can already be used for preliminary screening. Nevertheless, problems remain: few models are tailored to specific industries or fraud types, valuable non-financial and unstructured data are underused, and model interpretability is weak. Substantive improvements in model performance will require efforts on several fronts, including sample accumulation, data governance, human-machine interaction, and the integration of large language models.