A dataset of venture capitalist types in China (1978–2021): A machine-human hybrid approach

Chen, Jin; Cao, Ruining; Song, Yifei; Hu, Anan; Ding, Ying

doi:10.1038/s41597-024-04108-z

Data Descriptor
Open access
Published: 20 November 2024

Scientific Data volume 11, Article number: 1255 (2024)

Abstract

Despite escalating interest in distinguishing among various types of venture capitalists (VCs) and their roles in shaping entrepreneurship and innovation, such research remains sparse in the world’s second-largest VC market, i.e., China. To address this important gap, we have devised a machine-human hybrid approach to perform the classification task for VC types. Specifically, we have compiled a list of 49,187 VCs that made investments in China before 2021 from the CVSource database, collected VC ownership information from other public sources, developed machine-learning algorithms to predict VC types, and used human coders when machine learning failed to produce a prediction. Utilizing this hybrid approach, we have classified VCs into one of the following types: GVC (public agency-affiliated, state-owned enterprise-affiliated), CVC (corporate VC), IVC (independent VC), BVC (bank-affiliated VC), FVC (financial/non-bank-affiliated VC), UVC (university-affiliated VC), and PenVC (pension-fund-affiliated VC). We not only provide the most up-to-date database for VC types in the Chinese setting but also demonstrate how to leverage machine-learning algorithms to devise a transparent coding approach for VC-type classifications.

Background & Summary

Scholars have increasingly emphasized the importance of differentiating among various types of venture capitalists (VCs) such as independent VCs, corporate VCs, and governmental VCs, each with unique interests in the selection, investment, and fostering of ventures. For example, unlike independent VCs (IVCs) that focus on financial returns on investments, corporate VCs (CVCs) are mainly set up to capture technology windows or harness industry growth, thus prioritizing the strategic alignment between their parent firms and their investee ventures¹. Governmental VCs (GVCs), demonstrating remarkable growth over the past two decades, are deemed an important policy tool for bridging the equity gap and supporting high-risk start-ups that would otherwise struggle to attract private capital². Additionally, VCs might also have affiliations with other entities such as banks, financial institutions, universities, or pension funds, each with divergent objectives. Due to this heterogeneity in ownership, different VCs offer distinct value-added services, resources, and network access to their investee ventures and influence the ventures towards different exit channels³. It is thus of great importance to develop and continually update datasets that can facilitate research in this area, enhancing our understanding of the nuanced roles of various VCs in nurturing entrepreneurship and innovation.

Despite the global proliferation of research on VC types⁴, there is an extreme dearth of such research in the context of China, the world’s second-largest VC market, partly because of the scarcity of high-quality datasets differentiating among types of VCs. For example, as Dushnitsky and Yu (2022, p.1, p.4) argued, “most of what we know about CVC investors … is based on data from the so-called developed world, and especially the United States and Western Europe,” while “the Chinese setting can offer insights into different evolutionary paths” of CVC activities¹. While the leading databases in China (e.g., CVSource, PEdata, and ITJuzi) have provided tags to annotate a few types of VCs (e.g., GVC, CVC), the tags are either incomplete or not up to the standard of research rigor. In an effort to mitigate this data limitation, Dushnitsky and Yu¹ have compiled a large dataset of CVCs in China by integrating multiple data sources; however, this proprietary dataset is not publicly accessible¹. Similarly, prior studies on GVCs in China have manually collected and consolidated data regarding VCs’ government ownership, but these datasets are typically reserved for private usage⁵,⁶. The first and sole study to have publicly disclosed data on VC types in China’s context is Chen et al.⁷. They manually developed a dataset categorizing 6,553 VCs according to their affiliations and made it freely available. However, this dataset only encompasses VCs active in China’s market between 2000 and 2016, rendering it inadequate for analyzing the dramatic changes in VC activities in recent years. For instance, the influx of new VCs into China’s market continued after 2016, resulting in a population of over 49,000 VCs as of 2021. Consequently, there remains a conspicuous absence of up-to-date, high-quality datasets delineating the types of VCs that have invested in China’s market.

More importantly, previous research on China’s datasets has leaned heavily on manual coding, a laborious and time-consuming method¹,⁵,⁶. These studies use public sources (e.g., WIND, China’s State Administration for Industry and Commerce, and VCs’ websites) to discern the ownership of each VC. Such a manual approach suffers from low transparency and does not allow triangulation of data or results. As Dushnitsky and Yu (2022, p.18) commented, “We point to an important issue associated with data availability in the Chinese setting. We find that CVC patterns based on data from one database are often not replicated when testing a similar specification using a different database.” In other words, it is difficult to triangulate VC-type data from manually coded databases because of the inconsistency of their classification standards and low transparency.

Machine-learning-based technologies allow the development of a more reliable, consistent, and transparent dataset. Recent advancements in machine-learning algorithms have showcased their superior capabilities in performing classification tasks⁸. Developing a customized algorithm for VC-type classification and making it open to the public would largely help researchers increase the efficiency of data processing and the reliability of the findings. This is particularly pertinent given the exponential growth of VC numbers from thousands to tens of thousands in recent years, rendering manual coding nearly infeasible for completing the necessary tasks.

Using a hybrid approach that combines machine-learning-based and manual coding techniques, we have developed a comprehensive dataset, termed ChinaVCtype, that classifies VC types based on their affiliations as of 2021. Specifically, we assembled a list of VCs that had invested in China before the end of 2021, sourced from the CVSource database, a leading database on VC investment in China⁹. We then obtained their ownership details, such as shareholders, shareholder equity, and shareholder business scope, from public sources like Qichacha, a leading database on business administration information in China. In alignment with previous studies⁷, we designed a multi-step process to categorize each VC into one of the following types: GVC, which includes VCs affiliated with public agencies or state-owned enterprises, CVC, independent VC (IVC), bank-affiliated VC (BVC), financial/non-bank-affiliated VC (FVC), university-affiliated VC (UVC), and pension-fund-affiliated VC (PenVC). The majority of this classification task is undertaken by a machine-learning algorithm. If the algorithm can generate predictions, human coders are invited to double-check its coding quality. Whenever the algorithm is indecisive in its predictions, we resort to human coders. A series of validation tests have been conducted by comparing the results of this newly developed dataset with prior studies, affirming its high quality.

This study makes three important contributions. First, the dataset we have developed can foster a wide range of topics concerning VC types and entrepreneurship in the context of China. This includes concentrating on a specific VC type (e.g., GVC, CVC), comparing different VC types (e.g., GVC vs. CVC vs. IVC), and assessing the generalizability of VC analyses from developed countries to the Chinese setting. Second and more importantly, going beyond prior studies that struggle to triangulate VC-type data from multiple Chinese VC databases due to low transparency and low consistency of those databases’ classification criteria, we have devised a machine-human hybrid approach that makes the VC-type classification process open, transparent, and consistent. By doing so, our study offers a unified VC-type data platform for future research to replicate, compare, and extend prior findings. Third, leveraging the machine-human hybrid approach, we not only provide a methodological foundation for future scholars to apply the data from various platforms on a large scale but also underscore the potential of machine-learning algorithms to assist scholars in performing extensive classification tasks.

Methods

Following prior literature, we develop our ChinaVCtype dataset by classifying VCs into seven types: GVC (public agency-affiliated or SOE-affiliated VC), CVC, IVC, BVC, FVC, UVC, and PenVC⁴,⁷. We adopted prior studies’ coding schemes whenever possible and made adaptations when necessary. We have developed machine learning algorithms to make predictions and invited human coders to conduct the classification tasks when machine learning algorithms failed to generate a prediction for a VC’s type. The whole coding process is elaborated in the section “Data coding scheme.”

Data collection

To generate a list of VCs, we started with the population of VCs that have invested in China’s market as of 2021. In this study, we have adopted a broad definition of VC, referring to entities that make risky equity investments in entrepreneurial ventures. Consequently, the scope of VC in our study encompasses not only the narrowly defined VC but also private equity (PE), and even corporations that directly invest venture capital in ventures. This is because “a clear distinction between venture capital and private equity (PE) is lacking in China”⁹, and we have observed that many corporations may not yet have established their CVC entities to serve as an extension for their corporate venturing activities. Following this broad definition, in the CVSource database, there were 49,187 VCs that have made investments in China as of 2021, and the first deal appeared to be in 1978. Thus, we develop VC types for these 49,187 VCs in China’s market between 1978 and 2021.

VCs’ basic information (e.g., VC name, age, location) was collected from CVSource. Information on VCs’ first-level shareholders (e.g., shareholder name, equity ratio) was collated from Qichacha’s professional version (pro.qcc.com). For those VCs that were not captured by Qichacha, we checked other public sources (e.g., Tianyancha, Bloomberg, Crunchbase, VC official websites, press releases, media reports) to gather information. If any inconsistency across sources was detected, we followed Bertoni and Martí¹⁰ to triangulate between multiple sources.

Data coding scheme

Figure 1 elucidates the coding process utilized to categorize various types of VCs. This process is bifurcated into two sections. Part I pertains to VCs possessing shareholder information such as first-level shareholder names and their respective equity in VCs. For these VCs, we formulated a method called “shareholder-based judgement.” This method initially employs machine learning or human coders to classify each shareholder’s type. Subsequently, the shareholder’s type and equity data are used to categorize the VC’s type. Part II addresses VCs that lack shareholder information. For these VCs, we employed a method referred to as “direct judgement.” In this approach, machine learning or human coders are used to directly determine the specific category to which a VC belongs.

Part I. Shareholder-based judgement

Based on the VC list from CVSource, we consolidated their full names and searched in Qichacha for their shareholder names, which resulted in 119,760 unique shareholders for 36,860 VCs. We collected the basic information of shareholders (e.g., business scope, equity ratio) and developed a four-step (Steps 1.1–1.4) approach to determine VC types, as described below.

Step 1.1 Consolidate shareholder information from Qichacha

We gathered basic information on VCs’ shareholders from Qichacha, including shareholder names, registered capital, ranges of the number of employees, business categories, locations, business scope, introduction, and industries. These data are numerical, ordinal, discrete, or textual, and each kind requires a different pre-processing technique before it can be used as input to a machine learning model.

First, regarding numerical data (e.g., registered capital), we transformed them into ordinal data by binning the values at every tenth quantile (i.e., into deciles).
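As an illustration only, the quantile-based binning described above might look like the following standard-library sketch. It assumes that "tenth quantile" means decile cut points; the function name `decile_bins`, the handling of ties, and the toy values are hypothetical and are not taken from the authors' code.

```python
from bisect import bisect_right


def decile_bins(values):
    """Map each numeric value to an ordinal bin 0..9 using decile cut points.

    Assumption: bins are delimited by the values found at the 10%, 20%, ...,
    90% positions of the sorted sample. The paper does not state how ties or
    bin edges were handled.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)
    # Nine cut points split the sample into ten ranges.
    cuts = [ordered[min(n - 1, (k * n) // 10)] for k in range(1, 10)]
    # A value's bin is the number of cut points that are <= the value.
    return [bisect_right(cuts, v) for v in values]


# Toy registered-capital figures (hypothetical, arbitrary units).
capital = [float(v) for v in range(1, 21)]
bins = decile_bins(capital)
```

With twenty evenly spread values, each of the ten bins receives two observations; real registered-capital data are typically skewed, which is one reason quantile-based bins can be preferable to equal-width bins.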

Second, for ordinal and discrete data (e.g., ranges of the number of employees, business categories, locations, industries, and the binned registered capital), we used one-hot encoding to transform them into dummy variables¹¹.
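A minimal sketch of one-hot encoding in plain Python is given here to make the transformation concrete. The helper `one_hot`, the column name `employees`, and the category labels are hypothetical examples, not the authors' variables.

```python
def one_hot(records, column):
    """Replace one categorical column with 0/1 dummy variables.

    records: list of dicts, each holding a value for `column`.
    Returns one dict of dummy variables per record.
    """
    categories = sorted({r[column] for r in records})
    return [
        {f"{column}={c}": int(r[column] == c) for c in categories}
        for r in records
    ]


# Hypothetical employee-count ranges for three shareholders.
rows = [{"employees": "<50"}, {"employees": "50-99"}, {"employees": "<50"}]
encoded = one_hot(rows, "employees")
```

Each record ends up with exactly one dummy set to 1, so no artificial ordering is imposed on the categories.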

Third, given that the text is in Chinese, we pre-processed textual data in multiple steps. We began by removing special symbols and unnecessary characters (such as punctuation and garbled characters), keeping only Chinese characters, English characters, and numbers. Next, we adopted the enterprise information dictionary and finance dictionary from Jieba (https://github.com/fxsjy/jieba), a widely used word segmentation tool for Chinese, to segment words in each sentence, and applied stopword lists from Harbin Institute of Technology to remove unimportant words or connectives¹². After that, each shareholder’s name, business scope, and introduction were processed by TF-IDF (Term Frequency-Inverse Document Frequency), a well-known weighting technique in information retrieval and text mining that can effectively measure the similarity between sentences¹³. Following the application of TF-IDF, each sentence was transformed into a vector. However, the high dimension of each vector poses a challenge for subsequent machine learning. To address this issue, we applied the low variance filtering method, which deletes low-variance variables that have little influence on the prediction target. This method allows us to reduce the number of dimensions of the vectors and focus on key differences between sentences. After data pre-processing, we obtained the representation of textual data.
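The text pipeline described above (cleaning, TF-IDF weighting, low variance filtering) can be sketched with the standard library as follows. This is a simplified illustration under several assumptions: the tokens are hand-segmented stand-ins for Jieba output, stopword removal is omitted, the smoothed IDF formula is one common variant rather than necessarily the one the authors used, and the function names and variance threshold are hypothetical.

```python
import math
import re
from collections import Counter


def clean(text):
    """Keep only Chinese characters, English letters, and digits."""
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]+", " ", text).strip()


def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows of TF-IDF weights)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, rows


def drop_low_variance(vocab, rows, threshold=1e-12):
    """Remove columns whose variance across documents is at most `threshold`."""
    keep = []
    for j in range(len(vocab)):
        col = [row[j] for row in rows]
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        if var > threshold:
            keep.append(j)
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in rows]


# Hand-segmented toy shareholder names (stand-ins for Jieba output).
docs = [
    ["投资", "管理", "有限公司"],
    ["银行", "股份", "有限公司"],
    ["大学", "资产", "有限公司"],
]
vocab, rows = tfidf_matrix(docs)
kept_vocab, kept_rows = drop_low_variance(vocab, rows)
```

In this toy example the token shared by all three names receives the same weight in every document, so its variance is zero and the filter drops it, leaving only the tokens that distinguish one shareholder from another.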

Step 1.2 Use supervised machine learning to predict shareholder types

According to previous studies, there are eight mutually exclusive types of shareholders: “public agency, corporate, bank, financial, university, professional, person, and pension.”⁷ Machine learning is applied to differentiate among the first six types because they have plenty of textual information for algorithms to learn from. For person and pension, we mainly relied on human coders because it is faster for humans to filter out individual names and because China has very few pension funds.

In addition, a shareholder categorized as corporate, bank, financial, or professional may carry a second-level label denoting whether it is a state-owned enterprise (SOE)⁷. We also used machine learning to learn this feature and predict the SOE label.

Step 1.2.1 Sample split

The type of machine learning employed in this study is supervised machine learning, in which the algorithm first learns from a pre-labelled training dataset, predicts the labels of a test dataset, and is only applied to new data prediction if the prediction performance on the test dataset is deemed acceptable. In this sense, the training dataset is for a machine learning model to train its parameters, and the test dataset is to test the generalization ability of the model.

To train the machine learning model, a sample with correctly labelled shareholder types (e.g., public agency, corporate, bank, financial, university, professional, SOE) and feature information (e.g., shareholder names, registered capital, ranges of the number of employees, locations, business scope, introduction, and industries) is needed. For this purpose, we resorted to Chen et al.’s VCAC dataset⁷, which provided us with the manually coded types of VCs’ shareholders between 2000 and 2016 but not the shareholders’ feature information. To obtain the features for prediction, we searched Qichacha using these shareholders’ names and obtained 6,404 unique shareholders with complete information. We divided these 6,404 shareholders into a training dataset (80%, 5,123 shareholders) and a test dataset (20%, 1,281 shareholders).
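The 80/20 division can be reproduced arithmetically with a short sketch such as the one below. Whether the authors shuffled, stratified by shareholder type, or fixed a random seed is not stated, so the shuffling, the seed, and the function name `split_train_test` are assumptions made purely for illustration.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer IDs stand in for the 6,404 labelled shareholders.
train, test = split_train_test(range(6404))
```

Truncating 6,404 × 0.8 = 5,123.2 gives 5,123 training cases and leaves 1,281 test cases, which matches the counts reported in the text.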

Step 1.2.2 Model training

In total, we adopted six classical prediction models, namely Decision Tree, Random Forest, SVM, KNN, MLPClassifier, and XGBoost, implemented in Python. The metrics of prediction performance include precision, recall, and F1-score. Precision is the share of cases predicted as positive that are truly positive in the test dataset¹⁴. In contrast, recall is the share of truly positive cases in the test dataset that the model recognizes as positive¹⁴. F1-score, the harmonic mean of precision and recall, synthesizes these two metrics and reflects the comprehensive performance of model prediction¹⁵. Thus, we used F1-score to compare prediction performance. As shown in Table 1, XGBoost demonstrated superior predictive performance, with the highest F1-scores consistently across all types. Thus, the XGBoost model was used to predict shareholder types.
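To make the three metrics concrete, the following sketch computes them for one class at a time from true and predicted labels. The function name `per_class_scores` and the five toy labels are hypothetical; the numbers are not drawn from Table 1.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy test-set labels (hypothetical).
y_true = ["bank", "bank", "corporate", "corporate", "university"]
y_pred = ["bank", "corporate", "corporate", "corporate", "university"]
corporate_scores = per_class_scores(y_true, y_pred, "corporate")
bank_scores = per_class_scores(y_true, y_pred, "bank")
```

Here one bank is mislabelled as corporate, which lowers precision for the corporate class and recall for the bank class. F1 penalizes both kinds of error, which is why it suits a per-type comparison of models.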

Table 1 F1-scores of six models predicting shareholder type based on the results of test dataset.