Register-based and routinely collected data are important sources for disease surveillance and epidemiological studies [21, 22]. These data sources have been widely used in the Nordic countries, Scotland, the United Kingdom, the United States, Canada and Australia [5, 21]. For rare conditions and procedures such as birth defects, spinal surgery and arthroplasty, the data provide an effective means of monitoring rates [23, 24]. Compiling similar data sources from different countries or regions substantially increases the study population for rare diseases and makes the study results more reliable [24, 25]. Data linkage, in turn, provides a way to extend the field of study to broader areas and to improve the quality of the linked data [2, 10, 26]. By linking different data sources according to the study objectives, the number of study variables can be increased and the completeness of variable values can be improved significantly [10, 26].
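How linkage can raise the completeness of a variable may be illustrated with a minimal sketch. The record layout, the linkage key `person_id`, the variable names and the toy values below are all hypothetical and are not taken from the study data; the sketch assumes that records have already been matched on a common key and only shows missing values being filled from a second source.

```python
def completeness(records, variable):
    """Proportion of records with a non-missing value for `variable`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(variable) is not None)
    return filled / len(records)


def link_and_fill(primary, secondary, key, variables):
    """Return copies of `primary` records, with missing values of
    `variables` filled from the `secondary` record sharing the same key."""
    lookup = {r[key]: r for r in secondary}
    linked = []
    for rec in primary:
        merged = dict(rec)  # copy, so the source data stay unchanged
        match = lookup.get(rec[key])
        if match is not None:
            for var in variables:
                if merged.get(var) is None and match.get(var) is not None:
                    merged[var] = match[var]
        linked.append(merged)
    return linked


# Hypothetical toy records (None marks a missing value).
birth_records = [
    {"person_id": 1, "mother_age": 29, "country_of_birth": None},
    {"person_id": 2, "mother_age": None, "country_of_birth": "Australia"},
    {"person_id": 3, "mother_age": 35, "country_of_birth": None},
]
hospital_records = [
    {"person_id": 1, "mother_age": 29, "country_of_birth": "Australia"},
    {"person_id": 2, "mother_age": 31, "country_of_birth": None},
]

linked = link_and_fill(
    birth_records, hospital_records, "person_id",
    ["mother_age", "country_of_birth"],
)

for var in ("mother_age", "country_of_birth"):
    print(var, completeness(birth_records, var), completeness(linked, var))
```

In this toy example the completeness of each variable rises after linkage, while records without a match in the second source (here `person_id` 3) keep their missing values.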
A common limitation of register-based or routinely collected data is non-registration or under-reporting. Favourable health was generally more frequent among the registered than the non-registered, and non-registration may lead to bias in analyses of health inequalities [10, 27]. In NSW, Aboriginal mothers were less likely to register their births [10]. The magnitude of under-estimation can be estimated by the capture-recapture method [23]. The master dataset provides a platform for creating an SV, which can reduce under-estimation [10][25]. The SV is created from linked data and adds value to the data.
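As an illustration of the capture-recapture idea, the following sketch implements the two-source Lincoln-Petersen estimator and Chapman's nearly unbiased variant. These are common textbook forms and are not necessarily the estimators used in [23]; the function names and the counts are hypothetical. Here `n1` and `n2` are the numbers of cases found in each source and `m` is the number of cases found in both after linkage.

```python
def lincoln_petersen(n1, n2, m):
    """Two-source Lincoln-Petersen estimate of the total number of cases."""
    if m <= 0:
        raise ValueError("at least one case must appear in both sources")
    return n1 * n2 / m


def chapman(n1, n2, m):
    """Chapman's variant, which remains defined when m is zero."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def source_completeness(n_source, n_total):
    """Estimated proportion of all cases captured by one source."""
    return n_source / n_total


# Hypothetical counts: 80 cases in source A, 60 in source B, 48 in both.
n1, n2, m = 80, 60, 48
total_lp = lincoln_petersen(n1, n2, m)
total_ch = chapman(n1, n2, m)

print(total_lp, round(total_ch, 1))
print(source_completeness(n1, total_lp))  # estimated ascertainment of source A
```

The estimators assume, among other things, that the two sources capture cases independently and that linkage between them is accurate; violations of these assumptions bias the estimated under-reporting.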
Building a master dataset is essential for linked data analysis, and the master dataset is useful for improving data quality. The more datasets that are linked when building the master dataset, both internal and external, the greater the chance of improving data quality (including consistency and completeness) and the larger the amount of information made available for research. For a perinatal psychiatric study of mothers, the master dataset will be more useful if it includes, in addition to mothers’ and babies’ information, fathers’ data and other data collections such as the national health insurance data (Medicare), the RBDM and the hospital-based Emergency Department Data Collection (EDDC). An optimal linked master dataset should cover all life events and health conditions of a study population over the long term.
The techniques for building a master dataset were derived from the current study data, which were relatively simple. For very complicated data, such as the Western Australian Data Linkage System (WADLS), which was instigated in 1995 to link up to 40 years of data from over 30 collections for an historical population of 3.7 million, more linking methods were applied [5]. For example, the study records and variables for specific topics were first selected from different data sources and then the data were linked [26].
It should be borne in mind when describing rates or risks before birth that gestational age in the MDC refers to the interval from the first day of the woman’s last menstrual period (LMP) to her baby’s date of birth, rather than the interval between conception and date of birth. Conception occurs about 14 days after the first day of the LMP. Furthermore, gestational age is recorded in completed weeks rather than days. As a result, the shortest time interval before birth should be expressed in weeks.
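The distinction between LMP-based and conception-based intervals can be made concrete with a short sketch. The function names and the dates are hypothetical, and the 14-day offset is the approximation stated above, not a measured value.

```python
from datetime import date

CONCEPTION_OFFSET_DAYS = 14  # approximate interval from LMP to conception


def gestational_age_weeks(lmp, birth):
    """Gestational age in completed weeks, counted from the first day of the LMP."""
    days = (birth - lmp).days
    if days < 0:
        raise ValueError("date of birth precedes LMP")
    return days // 7  # partial weeks are dropped, not rounded


def days_since_conception(lmp, birth):
    """Approximate interval from conception to birth, in days."""
    return (birth - lmp).days - CONCEPTION_OFFSET_DAYS


# Hypothetical dates for illustration only.
lmp = date(2005, 1, 1)
birth = date(2005, 10, 5)

print(gestational_age_weeks(lmp, birth))   # 39 (277 days, 4 days short of 40 weeks)
print(days_since_conception(lmp, birth))   # 263
```

Because only completed weeks are recorded, births at 39 weeks 0 days and 39 weeks 6 days are indistinguishable in the data, which is why intervals before birth cannot be resolved more finely than one week.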
For more complicated data linkage, building a variable dictionary, including the SV, is very helpful when checking and analysing data.
To provide a comprehensive and representative picture of maternal mental illness before and after birth, some other limitations of linked data should also be considered when interpreting and disseminating the results [28, 29]. Patients’ access to health care affects hospital admission rates. For psychiatric disorders, severity needs to be considered because only mild and severe patients are admitted to hospital.