Search tips
Search criteria 


Logo of sensorsMDPI Open Access JournalsMDPI Open Access JournalsThis articleThis JournalInstructions for authorssubscribe
Sensors (Basel). 2017 February; 17(2): 342.
Published online 2017 February 10. doi:  10.3390/s17020342
PMCID: PMC5335977

A Framework for Bus Trajectory Extraction and Missing Data Recovery for Data Sampled from the Internet

Simon X. Yang, Academic Editor


This paper presents a novel framework for trajectories’ extraction and missing data recovery for bus traveling data sampled from the Internet. The trajectory extraction procedure is composed of three main parts: trajectory clustering, trajectory cleaning and trajectory connecting. In the clustering procedure, we focus on feature construction and parameter selection for the fuzzy C-means clustering method. Following the clustering procedure, the trajectory cleaning algorithm is implemented based on a new introduced fuzzy connecting matrix, which evaluates the possibility of data belonging to the same trajectory and helps detect the anomalies in a ranked context-related order. Finally, the trajectory connecting algorithm is proposed to solve the issue that occurs in some cases when a route trajectory is incorrectly partitioned into several clusters. In the missing data recovery procedure, we developed the contextual linear interpolation for the cases of missing data occurring inside the trajectory and the median value interpolation for the cases of missing data outside the trajectory. Extensive experiments are conducted to demonstrate that the proposed framework offers a powerful ability to extract and recovery bus trajectories sampled from the Internet.

Keywords: trajectory processing, anomaly detection, missing data recovery, clustering

1. Introduction

Urban transportation is being increasingly studied due to growing congestion and its significant impact on people’s daily activities. Currently, the sampling and mining of large-scale traveling data from the real world in order to explore the dynamic properties of traffic systems is becoming a popular research topic. For instance, Zheng et al. [1,2,3,4] sampled trajectories generated by more than 33,000 taxis over a six-month period in Beijing, which contributed to the theory and application of sparse GPS data sensing. By analyzing traffic conditions and driver behaviors, a recommendation system was designed for both taxi drivers and passengers to improve drivers’ pick-up efficiency and reduce passengers’ average wait time. Pan et al. [5,6,7,8,9] investigated human mobility patterns and city dynamics in an urban taxi transportation system by mining the GPS of taxis in Hangzhou, China. Moreover, Liu et al. [10] inferred road maps from the GPS of Shanghai taxis with low resolution and sampling frequency.

In fact, taxis are not the only type of floating probe that can be used to sense urban traffic; rather, the GPS data of urban buses can also be used to detect urban traffic statuses and passengers’ mobility in a city. It is important to note that buses provide a good coverage of cities and run in a fixed road net with better spatial-temporal regularity than do taxis. Bejan et al. [11,12] proposed a statistical model based on the analysis of the sparse GPS data of over 100 buses in the city of Cambridge, and they further evaluated the velocity fields and congestions of the urban traffic. Amaral et al. [13] introduced a tool, called BusesInRioto collect and manage bus GPS data in the City of Rio de Janeiro. However, studies on urban bus trajectory processing are reported much less than those on taxi trajectories, especially for large-scale data that take into consideration hundreds or even thousands of buses. The theories and methods developed in the area of data mining for taxi trajectories may not be suitable for urban bus GPS data because buses exhibit more regular trajectories in the road network and are dispatched in a repeated pattern with a repeated cycle. However, buses are often associated with uncertain boarding and alighting times. Due to the severe traffic congestion and the relatively large spatial-temporal deviation of passenger mobility behavior in many cities in China, most current research on the dynamic models of urban bus trajectories is not able to predict bus arrival time with high precision, nor can they optimize a dispatching schedule in a good manner.

In recent years, the Chinese government has upgraded the infrastructure of the urban bus systems in many cities. Buses equipped with GPS devices report real-time location information back to a service center. Several cities, such as Suzhou, Wuxi, Qingdao, Hangzhou, Wenzhou, and so on, have built corresponding websites to better serve passengers, i.e., these websites provide real-time bus arrival information so that passengers can make more flexible travel plans. For researchers, these websites also provide a very good opportunity to collect large-scale bus trajectory data and then to further investigate urban traffic dynamics by data mining methods. One challenge is that there may be a number of erroneous and missing data in the collected data as a result of an unstable GPS signal influenced by high buildings in the city. Additionally, the message packet may be lost as a result of poor mobile communication, which increases the difficulty of trajectory extraction.

Our project concerning urban bus traveling data acquisition from the Internet was started in 2011, and it primarily focused on bus data published from the Suzhou traffic management department [14]. Previous work was published in [15] for the early dataset. However, the method presented in [15] ignored the arrival data of the bus identification, which resulted in incorrect extraction when the trajectories had a short headway or were bunched. Furthermore, the work in [15] did not address the missing data recovery issue. In this paper, a new framework for bus trajectory extraction is developed in a generalized form, which is more efficient in erroneous data detection and more robust in extracting a whole route trajectory than the previous method presented in [15]. The work of this paper consists of two parts: bus trajectory extraction with noise removal and missing data recovery for the extracted trajectories.

In our work, noise removal is based on the anomaly detection method. Outliers are patterns in data that do not conform to a well-defined notion of normal behavior. Anomaly detection refers to the outlier detection problem of identifying the data that are inconsistent with the rest of the set. Anomaly detection is an important problem that has been researched in many diverse research areas and application domains [16]. Numerous anomaly detection (AD) approaches have been presented in relevant literature, including a stochastic model-based approach, such as Gaussian mixture [17], artificial neural networks (ANN) [18], distance-based outlier detection [19], decision function-based approaches, such as one-class support vector machine (SVM) classification [20], the tensor-based method [21], and so on. The types of anomalies can be classified into two categories: point anomalies and contextual anomalies. Contextual anomaly detection seeks to find relationships within datasets where variations in external behavioral attributes well describe the anomalous results in the data [22]. Compared to point anomalies, contextual anomalies detection has been further explored for time series data. The bus trajectories are sequential data that reveal spatial-temporal features. In this sense, we prefer to adopt the contextual anomalies detection for bus trajectories noise removal. In contrast to the traditional AD method for bus trajectories denoising and extracting, this work focuses on erroneous data elimination in addition to the extraction of the trajectories in the form of one trajectory for one route. In this paper, we first apply the fuzzy C-means clustering (FCM) method to split the trajectories into small clusters. Then, we propose a pentagonal fuzzy membership function (MF) based on historical traveling data to measure the probability of two arrival data in one trajectory. The point-to-point MF is extended to the fuzzy connecting MF matrix for measuring the connecting probability of a whole trajectory cluster (also called a trajectory fragment in this paper), and then, AD detection is given based on the MF matrix with prioritized noise removal. When the trajectories are cleaned, a trajectory connecting algorithm is given based on our proposed fuzzy connecting matrix, which helps to connect the trajectory fragments into a whole one-route trajectory.

After the one-route trajectories are extracted, another important work of this paper is to recover the missing data of the trajectories. The missing data recovery is realized by path interpolation. To date, linear interpolation is the easiest, most conservative and most common interpolation method, which assumes that movement follows a straight-line path between two known points [23]. However, straight lines are not consistent with many types of path dynamics, especially for bus trajectories that include consecutive congested and non-congested roads. Other interpolation methods, such as kinematic interpolation [23], cubic Bezier curves [24] and Catmull-Rom curves [25], are also applied in missing data recovery for moving objects. Kinematic interpolation [23] requires speed information and dynamic traveling models. However, the GPS data, such as speed, latitude and longitude, are not provided on the official website in our data acquisition system. Moreover, there are hundreds of lines and thousands of road sections in a city, and the traveling dynamics change at different times of day and for different road sections. Furthermore, it is also cumbersome to determine how to model the traveling dynamics. In this point, kinematic interpolation is not appropriate for our system. In a real system, we have applied the cubic Bezier curves and Catmull-Rom curves in order to test the recovering performance for the sampled data. Unfortunately, the mean absolute error (MAE) of those cubic interpolation methods exhibits a worse performance than the straight linear interpolation in most cases for the sampled station-level sparse arrival data. In this work, we extend the linear interpolation method by considering the historical data and contextual arrival information, which achieves a better performance than that of traditional methods.

The main contributions of this article are summarized as follows:

  1. A novel framework of urban bus trajectory extraction is developed, which provides a generalized preprocessing of the raw data sampled from the real-time bus arrival information querying website offered by the Suzhou Transportation Bureau. The trajectory extraction framework consists of three sub-procedures: data clustering to aggregate trajectory fragments, data cleaning to eliminate noisy data in the trajectories and data connecting to merge the fragments into complete one-route trajectories. The proposed framework enables researchers from different regions to sample the urban bus traveling data from the Internet and to extract the trajectories with noise removal, thus providing a new method for urban computing via the traffic data of buses.
  2. The fuzzy membership function matrix is first introduced to deal with the problems of contextual anomalies detection, which is applied for noisy data removal in the trajectories. Furthermore, the MF matrix is extended to solve the problem of trajectory fragments connecting with erroneous data tolerance. The proposed MF matrix can be evaluated by the given empirical parameters and can be refined by the sampled data, which is easy to implement and is robust in the trajectory cleaning capability.
  3. An efficient scheme that takes into consideration historical data, as well as contextual arrival information is presented for the missing data recovery of the traveling data. Numerous experiments demonstrate that the proposed method outperforms other traditional interpolating methods, such as straight linear interpolation and Catmull-Rom curves.

The remainder of this paper is organized as follows. Section 2 describes how to obtain bus arrival data from the Internet. Section 3 provides some definitions to assist with the problem statement. Section 4 contains major contributions, giving the solutions to infer bus trajectory with outlier elimination and missing data recovery. The experimental results and discussion are presented in Section 5. Finally, conclusions and future work are summarized in Section 6.

2. The Bus Arrival Data Acquisition Project

We collected bus arrival data in Suzhou from its official website [14]. Figure 1 is a snapshot of the web page (translated from Chinese to English by Chrome Browser), and brief explanations are provided via the red squares and circles. In the system, each bus line is sampled approximately every 30 s and is stored in the Sqlite3 database.

Figure 1
Snapshot of the web page for the bus arrival information on the official website in Suzhou.

For the original arrival information shown in Figure 1, we designed four types of tables, i.e., “LineInfo”, “StationInfo”, “ArrivalData” and “BusInfo”, to maintain the arrival data in the database as shown in Figure 2. In the table “LineInfo”, when a line is collected, it first queries the table by LineUrl. If the querying result returns null, then the LineName and LineUrl are inserted, and LineCode is automatically increased by one for the primary key property. Bus lines with the same line name, but opposite traveling directions are considered to be different lines. Because LineUrl is the web page URL of the collected line, it is unique and can be used as the one-to-one mapping of LineCode. The remaining three types of tables, “StationInfo”, “ArrivalData” and “BusInfo”, were created and stored in a database for each sampling day. In the database, the field StationIndex is a sequenced station index starting at one at the first station along the line route. During the operation days, the public traffic company may change some line route, either temporarily or permanently. Therefore, the stations of a focused line with the same station index may represent different stations on different sampling days, and the table “StationInfo” can help keep track of these changes. The mappings between database fields and web page information are illustrated in Figure 1, where BusIndex starts at one for every bus line and is produced by our program according to the appearance order of BusCode with respect to the sampling line. ArrivalTime is an eight byte float number that is recorded in minutes from the reference at midnight “00:00:00”, e.g., ArrivalTime=9×60+44+12/60=584.2 at arrival time “9:44:12”. The table “ArrivalData” is the main table for arrival data records, which is utilized for investigating bus traffic in our project.

Figure 2
Database design for the urban bus arrival information.

The flowchart of our project is illustrated in Figure 3. Almost all of the buses in Suzhou are equipped with GPS devices, and they report their positions to the remote server when they are in service. The server processes the GPS data and finally offers the real-time bus arrival information in the form of a web page, as shown in Figure 1. Users can query the bus arrival data by line name (line number), station name or station code in the website [14]. In the data sampling client, we designed four types of database tables, as shown in Figure 2. Table “LineInfo” was built before the sampling task, which is generated according to the bus lines and the corresponding URLs offered by the website. The task scheduling center schedules which line is to be sampled for the present task. The flowchart is depicted as follows:

  1. Query the LineUrl by LineCode in the “LineInfo” table by the task scheduling center.
  2. Query the bus arrival information by http request with the URL bundled to LineCode.
  3. Parse the HTML from the remote server and extract the arrival data StationName, StationIndex (processed by program), StationCode, BusCode and ArrivalTime (parsed from text to float number by program) from the web page, as shown in Figure 1.
  4. Write the arrival data to the corresponding three tables.
  5. Sleep until the next sampling time occurs, and then, update the LineCode to the next line by the task scheduling center.
  6. Go back to Step 1 and perform the sampling until the task finishing time hits.

Figure 3
Flowchart of the bus arrival data acquisition project.

In the database, “ArrivalData” is the main table, which is utilized for bus trajectories extraction and traveling time inspection. Assume the LineCode for the line shown in Figure 1 is 158 (generated in the database). The web page shown in Figure 1 consists of three arrival events; thus, three records are added to “ArrivalData”, as illustrated in Table 1. In the table, all data are numerical, which is more convenient than text for data mining and scientific calculation.

Table 1
Illustration of arrival data stored in the “ArrivalData” table.

3. Problem Statements

Definition 1.

The bus arrival information matrix extracted from the table “ArrivalData” for line l in one day is denoted by G(l)={Gj(l)=[Tj,Ij,Bj]|j=1,,|G(l)|}, where l is the line code, |G(l)| is the size of G(l) and Gj(l) is the j-th record of the arrival information that consists of ArrivalTime Tj, StationIndex Ij and BusIndexBj. For example, the bus arrival information matrix in Table 1 is:

Definition 2.

The bus arrival information matrix generated by bus m for line l is denoted by G(m,l)={Gj(m,l)=[Tj,Ij]|[Tj,Ij,Bj]G(l),Bj=m}. G(m,l) is a two-dimensional matrix extracted from G(l). If there are M buses in G(l), then G(l) can be reconstructed by m=1MG(m,l).

Definition 3.

The station-level arrival information data generated by bus m for line l at station k are denoted by G(k,m,l)={Gj(k,m,l)=Tj|[Tj,Ij]G(m,l),Ij=k}. G(k,m,l) is a one-dimensional vector extracted from G(m,l). If there are N stations in the line, then G(m,l) can be reconstructed by k=1NG(k,m,l).

Definition 4.

The consumed time of a bus from station i to station j in the driving direction for line l is denoted by tij(l). tij(l) includes the traveling time on the corresponding road, the waiting time caused by traffic signals at a crossroad and the dwell time at stations i to j1 caused by passengers’ boarding and alighting behaviors. tij(l) is an important feature for urban buses’ traveling behaviors, because it provides a scratch view of the crowdedness for the corresponding road. tij(l) can be utilized to investigate the urban traffic, the spatial-temporal property of the urban bus, the arrival time prediction, etc. Unfortunately, tij(l) cannot be directly derived from either G(l) or G(m,l) because the records in G(l) lack the route information. Thus, before calculating tij(l), it is necessary to extract the one-route trajectories in G(l).

Definition 5.

The one-route trajectories set generated by bus m for line l is denoted by S(m,l), where one-route trajectory represents the trajectory generated by a bus from the first station to the terminal station in one round. The k-th one-route trajectory in S(m,l) is denoted by S(k,m,l)S(m,l), and the one-route trajectories set generated from all M buses in line l is denoted by S(l)=m=1MS(m,l). Once S(l) is derived from G(l), it is easy to obtain tij(l) by Tj1Tj2 with Ij1=j and Ij2=i in S(k,m,l)S(l), where Ij1 and Ij2 are the station indices.

The process of generating S(l) from G(l) is referred to as trajectories extraction throughout this paper. As pointed out in [1], spatial trajectories are never perfectly accurate due to sensor noise and other factors. In reality, there will be an amount of missing data and abnormal data in G(l) caused by erroneous GPS signal, communication failure and data sampling fault, which causes trajectories extraction to be a difficult task. Figure 4 shows the bus arrival events collected in G(l) for l=130 (Line 10 southward) from the Suzhou transportation website [14] with a sampling period of 30 s. Case A in the figure shows an abnormal trajectory, which was caused by the bus traveling in the opposite direction. A reasonable explanation for Case A is that the bus driver forgot to switch the GPS reporting device to the right direction when the bus changed route direction. The wrong direction trajectories need to be distinguished and eliminated from G(l) in the trajectory extraction process. Case B shows a severe data missing problem for the designated route. The incomplete data will impede the inference of S(l). Case C shows the GPS over-reporting at the terminal station. Case C is caused by the fact that the GPS device continues to report more than one GPS position to the service after the bus reaches its destination. The situation at the first station or some other stations that have a long dwell time may also result in more than one position being reported to the service, which will cause the trajectory extraction to be more complicated. When over-reporting and wrong direction reporting occur in adjacent time slots, it is difficult to extract the trajectories by the direct search method.

Figure 4
Bus arrival events collected in G(l) for l=130 in Suzhou. Sampling date: 2 August 2012.

In this paper, we focus on two problems associated with arrival data processing, where the data were collected from the Internet. The first concerns trajectories extracted from G(l) with outlier elimination, and the second one concerns missing data recovery for the extracted trajectories. The framework of the data processing is shown in Figure 5. Firstly, the input arrival data G(l) are split by different buses m to produce G(m,l). Secondly, a fuzzy C-means clustering (FCM) method is applied to cluster the data in G(m,l), and the trajectory fragment set is generated. Then, the outliers in the trajectory fragments are eliminated by the proposed cleaning algorithm. Next, a trajectory fragment connecting method is applied to connect the cleaned trajectory fragments into one-route trajectories set S(m,l). Finally, the missing data in S(m,l) are recovered, and the complete one-route trajectories set S(l) is merged by the interpolated S(m,l).

Figure 5
Framework of the data processing for trajectories extraction and missing data recovery.

4. Trajectories Extraction and Missing Data Recovery

4.1. Fuzzy C-Means Clustering

Clustering and anomaly detection are extensively studied for trajectory mining applications [1,26]. Furthermore, the two techniques are often used in conjunction [27]. In our project, we first cluster the arrival data into a trajectory fragment if those data should be grouped into the same route for a bus. Then, the anomaly detection method is applied to check whether there exist outliers in the trajectory fragment and removes the erroneous data according to the contextual relationship of the data. We adopt fuzzy C-means clustering (FCM) as the trajectory clustering method. FCM is a soft clustering method and was proposed by Bezdek [28]. In the FCM method, the data X={X1,X2,Xn} are partitioned into c fuzzy sets by minimizing the objective function as follows:


where a>1 is the fuzzification coefficient, V is the prototype matrix with Vj denoting the j-th cluster prototype, U is the partition matrix with uij denoting the membership value for Xi classified to cluster j and ||XiVj|| is a distance function. Usually, the Euclidean distance is considered for ||XiVj||. The membership uij is constrained by:


The minimization of Ja(U,V) with Constraint (2) gives [28]:



In practice, the partition matrix U and prototype matrix V are updated in an iterative way by Formulas (3) and (4) until the change of objective function ΔJa(U,V) between two consecutive iteration is less than a predefined tolerance or the maximum iterating times reached.

In this paper, we utilize the FCM method to cluster the arrival data with the objective that the data in the same cluster should belong to the same route. Of course, the best result is one cluster for one-route trajectory. However, due to the disturbance of outliers and data missing problems, the actual clusters are not always equal to the route trajectories.

4.2. Feature Construction for FCM

Feature construction is an important work in FCM. The arrival data G(l) contains three-dimensional information generated by line l. Usually, one line consists of several buses that can be distinguished by Bj in G(l). In the sense of trajectory clustering, we wish to partition the arrival data generated by one route in G(l) into one cluster. Hence, the arrival data generated by different buses should be naturally partitioned into different trajectory clusters. However, if we take Gj(l)G(l) as the input feature of X for FCM, the distance ||Gi(l)Gj(l)|| for ij cannot effectively measure the difference between different trajectories. On the other hand, the two-dimensional information G(m,l) extracted from G(l) is better for trajectory clustering, since all arrival data in G(m,l) are generated by the same bus m, and arrival data generated by different buses are naturally separated by G(m,l) with different values of m. In practice, there are more abnormal arrival data at the first and final stations than other stations; we eliminate the FF-data (arrival data of the first and final stations) before trajectories extraction for the sake of reducing the negative effect introduced by abnormal data.

Figure 6 shows the arrival events grouped by G(m,l) after eliminating the FF-data for the same dataset shown in Figure 4. As illustrated in Figure 6, Trajectories A and B can be efficiently distinguished by G(m,l); however, quite differently, they are difficult to partition from G(l) without the bus-code information when the trajectories are in a short headway or bunched, as shown in Figure 4. If we adopt the method presented in [15], Trajectories A and B will not be extracted correctly. Hence, grouping arrival data by G(m,l) instead of G(l) is a sensible choice for the purpose of trajectory extraction, especially for the trajectories in a short headway. Though G(m,l) can separate trajectories from different buses, the arrival data in G(m,l) still consist of several traveling routes and need to be further partitioned from each other.

Figure 6
Bus arrival events grouped by G(m,l) for the same data G(l) as shown in Figure 4.

The FCM clustering of G(m,l) is aimed to separate the arrival data by one cluster for each one-route trajectory. Figure 7a shows the arrival events by the bus with m=1 for the same line and sampling date as in Figure 4. From the figure, one can see that there are total of four traveling routes for bus m=1, which means that the work of trajectory extraction is to extract the four trajectories and add them into S(m,l). The first problem for FCM is how to measure the distance of trajectory data in order to get a large distance between different trajectories and a small distance in the same trajectory. Though traditional Euclidean distance is widely used in clustering, it is not a good metric for our FCM clustering. We take three points in Figure 7a as an example, as shown in the figure, Points A and B are in the same trajectory; however, the Euclidean distance between A and C (points from different trajectories) is smaller than the distance between A and B, which will result in the wrong clustering for FCM.

Figure 7
The constructed input feature of G(m,l) without abnormal data. (a) Arrival events; (b) constructed X.

For trajectory clustering by FCM, we construct an enhanced distance metric by considering the traveling time along the route direction. Let the j-th record in G(m,l) be Gj(m,l)=[Tj,Ij]; then, the input feature Xj for Gj(m,l) is:


where Tj is the arrival time for record j, Ij is the arrival station index in line l and R(Ij) is a linear function of the traveling time from the first station to station Ij. For most stations in the studied city, the uncongested traveling time between two consecutive stations is about 2 min, and R(Ij) is selected as half of the traveling time; hence, a robust and convenient choice for the initial value of R(Ij) is:


Once the trajectories of a studied line are extracted, R(Ij) can be refined according to the historical traveling time derived from the trajectories. Figure 7b shows the constructed input feature X of G(m,l). As shown in the figure, the constructed X is a one-dimensional datum, which can split the traveling data of G(m,l) into four clusters well, with each cluster mapped to a one-route trajectory. The illustrated case in Figure 7 is suitable for the well-conditioned traveling data without noise. However, there may exist some abnormal data as depicted in Figure 4; the input feature constructed by Equations (5) and (6) thus may reduce the separable performance when the arrival data in G(m,l) contain more trajectories in the opposite direction than those in the correct direction. In order to enhance the separability for the case of the wrong direction trajectories, the R(Ij) in Equation (6) is modified by:


and Xj in Equation (5) is modified as:


The X constructed by Equation (5) is called the forward clustering feature, and the one constructed by Equation (8) is called the backward clustering feature in this paper. Accordingly, we call the fuzzy C-means clustering with the forward clustering feature as forward FCM (F-FCM) and the one with backward clustering feature as backward FCM (B-FCM).

4.3. Parameters Selection for FCM

The most important parameters for our trajectory extraction in the procedure of FCM are the number of fuzzy sets c and initial prototypes. In practice, the counts of trajectories always change by different buses in different sampling days and need to be determined automatically. We count the trajectories based on station scanning, which is depicted as follows:


where c0 is the count of trajectories for G(m,l), N is the number of stations in line l, |·| is the size of the set and G(k,m,l) is defined in Definition 3. Since there are more abnormal arrival events at the first and final stations, the scanning process is started from the second station and terminated before the final station to reduce the effect of outliers. Actually, the scanned c0 may be less than the actual count (denoted by c0*) in some cases caused by data missing or be greater than c0* in other cases caused by over-reporting. We prefer to set c>c0* to avoid partitioning two different trajectories into one cluster. In our extraction procedure, c is set by:


where αR+ is the gain of c over c0 and is chosen by α[1.3,2.0] in most cases. In Equation (10), [αc0] is an integer operation rounding αc0 to the nearest integer less than or equal to αc0.

The iterative algorithm for FCM clustering is a local optimization procedure; hence, different initial prototypes may result in different partitions. One property of the constructed feature X is that both X and trajectories are ordered in arrival time. Inspired by the time-ordered property, we proposed an initial prototype selection method as follows:


where Vi(0) is the i-th prototype in the initial prototype vector V(0), Xk is the k-th element in X and c is the number of clusters. If all clusters are of the same size, the prototypes selection approach proposed in Equation (11) will choose the center of the cluster as the initial prototype, which will reduce the clustering iterations. In this sense, the proposed initial prototypes selection method is better than random selection with the advantage that it costs less iterations in most cases when the sizes of the sampled trajectories are nearly equal.

4.4. The FCM Clustering Algorithm

The F-FCM clustering algorithm is shown in Algorithm 1, which is an iterative algorithm. By default, we set α=1.8, Kmax=100, δ=0.01 in the algorithm. The B-FCM clustering algorithm is similar to F-FCM, except for the feature generating in Step (a3).

Let Gc(m,l)=j=1cGc(j,m,l) be the output of the aggregated cluster set by FCM, where Gc(j,m,l) is the j-th cluster in Gc(m,l). The final FCM clustering algorithm is an optimal partition selection between F-FCM and B-FCM according to the objective function, which is given by Algorithm 2. In Algorithm 2, Gc(m,l) is generated according to the partition matrix U.

Algorithm 1 F-FCM clustering algorithm.
Input: G(m,l);
Output: UFFCM (the partition matrix), VFFCM (the prototypes set), JaFFCM (the objective function);
  • (a1) get c0 by Equation (9);
  • (a2) set α and get c by Equation (10);
  • (a3) generate the clustering feature X by Equation (5);
  • (a4) initiate the prototypes set V(0) by Equation (11), get U(0) by Equation (4), get Ja(0)Ja(U(0),V(0)) with Equation (1);
  • (a5) set the maximum FCM iteration Kmax, error tolerance δ and execute the following iterative loop:
  •     for i← 1 to Kmax do
  •         get V(i) by U(i1) with Equation (3);
  •         get U(i) by V(i) with Equation (4);
  •         get Ja(i) by Equation (1);
  •         if |Ja(i)Ja(i1)|<δ then
  •            break;
  •         end if
  •     end for
  • (a6) set UFFCM=U(i), VFFCM=V(i), JaFFCM=Ja(i) for the last iteration i in Step (a5).
Algorithm 2 FCM clustering algorithm.
Input: G(m,l), N (the number of data in G(m,l), i.e., N=|G(m,l)|);
Output: U (the partition matrix), Gc(m,l) (the aggregated clusters);
  • (a1) get UFFCM, VFFCM, JaFFCM by F-FCM clustering;
  • (a2) get UBFCM, VBFCM, JaBFCM by B-FCM clustering;
  • (a3) clustering method determination:
  •     if JaFFCMJaBFCM then
  •         set U=UFFCM;
  •     else
  •         set U=UBFCM;
  •     end if
  • (a4) partition loop:
  •     for i← 1 to N do
  •         if uij=maxk=1,2,,cuik then
  •            partition Gi(m,l) (the i-th data in G(m,l)) into Gc(j,m,l);
  •         end if
  •     end for
  • (a5) generate Gc(m,l) by Gc(m,l)=j=1cGc(j,m,l).

4.5. Trajectory Fragment Cleaning

The aggregated Gc(m,l) from G(m,l) by FCM is called the trajectory fragments set in this paper. When the trajectory fragments set Gc(m,l) is generated by Algorithm 2, the following procedure is about outlier elimination, which is also called trajectory fragment cleaning in this paper.

We utilize fuzzy membership function (MF) to describe the relationship of arrival data in a trajectory, i.e., the introduced MF in our AD procedure describes the connecting possibility between two arrival events Gi(m,l) and Gk(m,l) to be classified into the same trajectory. In fuzzy set theory, most common membership functions are triangular, trapezoidal, Gaussian and bell-shaped [29,30]. In a more generalized viewpoint, the triangular and trapezoidal functions are in the forms of polygons, which are described by three parameters for the triangular and four parameters for the trapezoidal. In this paper, we extend the triangular to pentagonal MF, which is depicted as follows:


where u(t) is the fuzzy MF of two arrival events to be connected into one trajectory and c1,c2,,c5 are the five parameters describing the shape of pentagonal MF as shown in Figure 8. For connecting possibility evaluation, the t in Equation (12) is the directional station-unit traveling time between Gi(m,l) and Gk(m,l), which is calculated as follows:


where Gi(m,l)=[Ti,Ii] and Gk(m,l)=[Tk,Ik].

Figure 8
The pentagonal fuzzy membership function.

If the trajectories for the studied lines have already been extracted, we can utilize the historical traveling time t to determinate the parameters of the pentagonal MF by:


where ηmin<1 is the lower bound coefficient and ηmax>1 is the upper bound coefficient, which control the extended range of t. Compared with the triangular MF, the pentagonal MF has the benefits of more flexible slope control. For data-based knowledge, u(t) should be in high connecting possibility (u(t)0.5) for all t in the historical data, and u(t)=1 (the highest connecting possibility) when t is the median of the historical data. Moreover, the historical data cannot reveal all cases in the future, so the data range should be extended from historical data to make a better coverage for the future. Furthermore, the extensions are not symmetric for the lower bound and upper bound. The shortest traveling time is restricted by the highest vehicle velocity and distance between two stations, which can be expected once the shortest distance between two adjacent stations is given. Hence, there is a slight difference between min(t) derived by historical data and the actual one in the future. However, the longest traveling time will be affected by traffic jam, traffic accident and other factors, which should be evaluated in a more conservative way. Actually, the t in Equation (13) is dependent on bus lines, station indices, date types and time of day. For the trajectory extraction work, it is hard to consider all of the factors when extracting hundreds of lines with millions of arrival events generated in one day. We prefer to evaluate u(t) for all arrival events with a set of empirical parameters, which is given as follows:


where the n-th row of CM refers to the parametrical set of u(t) with n=|IiIk| in Equation (13) and the j-th column of CM is the parameter cj in Equation (12). When |IiIk| exceeds the maximum row index of CM, the parameters for evaluating u(t) are selected by the last row of CM. By using CM given in Equation (15), the u(t) between any two arrival events in a trajectory can be evaluated.

Let UcRN×N be the fuzzy connecting MF matrix evaluated for a trajectory fragment Gc(j,m,l) with N arrival events, where Uikc is the u(t) between the i-th and k-th arrival data when ik or Uikc=1 when i=k.

Let G^c(j,m,l)Gc(j,m,l) be the cleaned trajectory fragment from Gc(j,m,l) and U^c the cleaned fuzzy connecting MF matrix generated from G^c(j,m,l). We take the trajectory fragment cleaning as an optimization problem as follows:


where |G^c(j,m,l)| is the size of G^c(j,m,l), avg(U^c) is the average of the elements in U^c and umin is the connecting possibility threshold, which is given by empirical knowledge. We set umin=0.3 for all arrival data in the project. For a cleaned trajectory fragment G^c(j,m,l), every arrival datum in G^c(j,m,l) should be connected, so all of the elements in U^c should be greater than the predefined threshold umin. On the other hand, any data in G^c(j,m,l) should not be misclassified as noisy data, which is guaranteed by the maximization of |G^c(j,m,l)|.

Let N¯ic be the anomalous indexing of the i-th data in Gc(j,m,l), which is the count of uik satisfying uikumin for k=1,2,N, and N¯c the anomalous indexing vector generated by N¯c=[N¯1c,N¯2c,N¯Nc]T.

Let G^c(m,l)=j=1cG^c(j,m,l) be the set of the cleaned trajectory fragments.

Let R^c(j,m,l) be the noisy dataset generated from Gc(j,m,l) by the trajectory fragment cleaning algorithm, and R^c(m,l)=j=1cR^c(j,m,l). By the definition, one gets Gc(j,m,l)=G^c(j,m,l)R^c(j,m,l), and Gc(m,l)=G^c(m,l)R^c(m,l).

The solution of Problem (16) is started by checking the noisy data with maximum anomalous indexing and deleting it from Gc(j,m,l) recursively. The detailed solution of Problem (16) is given by Algorithm 3.

Algorithm 3 Trajectory fragment cleaning algorithm.
Input: Gc(j,m,l), CM, umin;
Output: G^c(j,m,l), U^c, R^c(j,m,l);
  • (a1) set a temporary noisy set R^c=;
  • (a2) get Uc of Gc(j,m,l) with CM;
  • (a3) get N¯c from Uc with umin;
  •     if N¯c=0 then
  •         go to (a5);
  •     else
  •         go to (a4);
  •     end if
  • (a4) find the index k in N¯c with the maximum anomalous indexing;
  •     if k is not unique then
  •         choose k* be the one in k with the minimum row summation;
  •     else
  •         set k*=k;
  •     end if
  •     update R^c by adding the k*-th data in Gc(j,m,l) into R^c;
  •     delete the k*-th data in Gc(j,m,l), delete the k*-th row and k*-th column in Uc;
  •     go to (a3);
  • (a5) output the cleaned trajectories:
  •     if |Gc(j,m,l)|>1 then
  •         set G^c(j,m,l)=Gc(j,m,l), and U^c=Uc;
  •     else
  •         set G^c(j,m,l)=, and U^c=;
  •     end if
  •     set R^c(j,m,l)=R^c.

4.6. Trajectory Fragments Connecting

In some cases, the one-route trajectory may be broken into two or more trajectory fragments. Hence, when generating S(m,l) from G^c(m,l), some trajectories need to be connected to form a one-route trajectory. The process of generating S(m,l) from G^c(m,l) is named trajectory fragments connecting (TFC) in this paper. We introduce the fuzzy connecting matrix UF for the processing of TFC. We denote uijF as the i-th row and j-th column element in UF, which is defined as follows:


where G(ij)=G^c(i,m,l)G^c(j,m,l), U^(ij) is the fuzzy connecting MF matrix generated from G^(ij), G^(ij) is the cleaned trajectory from G(ij) and avg(U^(ij)) is the average value of the elements in U^(ij). U^(ij) and G^(ij) are generated by Algorithm 3 for the input G(ij). Rule 1 is described as follows:


where |R^(ij)| is the size of noisy dataset R^(ij) generated from G(ij) by Algorithm 3, nτ is the predefined parameter to control the maximum number of noisy data when considering G^c(i,m,l) and G^c(j,m,l) might be connected and umin is as defined in Problem (16). We set nτ=3 in most cases for the TFC processing. In Constraint (18), G^c(i,m,l) is ahead of G^c(j,m,l) in G^(ij), which means that all of the arrival events belonging to G^c(i,m,l) in G^(ij) happen before the other events in G^(ij). From Equation (17) and Constraint (18), one can see that UF is not a symmetric matrix; moreover, if uijF>0, we will definitely have ujiF=0.

In the processing of TFC, it is better to select the pair of fragments in G^c(m,l) with the maximum connecting possibility for connection, which is easily conducted by finding umaxF (the maximum element in UF). Once G^c(i,m,l) and G^c(j,m,l) can be connected, we set G^c(i,m,l)=G^(ij) and delete G^c(j,m,l) in G^c(m,l). In this way, UF is partly updated in each connection round. The detail of TFC is given by Algorithm 4.

Algorithm 4 Trajectory fragments connecting.
Input: G^c(m,l), nτ, umin;
Output: S(m,l);
  • (a1) get UF of G^c(m,l) by Equation (17);
  • (a2) if UF=0, go to (a4);
  • (a3) find uijF=umaxF;
  •     update G^c(m,l) by setting G^c(i,m,l)=G^(ij);
  •     delete G^c(j,m,l);
  •     update UF by:
  •         recalculating the i-th row of UF;
  •         removing both the j-th row and j-th column of UF;
  •     goto (a2);
  • (a4) set S(m,l)=G^c(m,l).

4.7. Missing Data Recovery

Data missing is very common in many real systems and can have a significant effect on the analysis of data. In our data acquisition project, almost all of the extracted trajectories are incomplete due to the arrival data missing randomly occurring along the stations. For the urban pubic transportation system, data missing may be caused by GPS positioning failure of the floating bus or networking problems when transmitting data. For data sampling, data missing will also happen in an unstable network. It is important that the missing data should be recovered firstly before mining the spatio-temporal properties of the trajectories.

There are tens thousands of trajectories recorded in one day for the city of Suzhou with more than 500 bus lines. Time consumed algorithms for missing data recovery are not recommended for such a system. In this work, we propose two methods of data recovery corresponding to different types of missing data, both of which are easily implemented. The first one is contextual linear interpolation for missing data recovery (CLIMDR) for the cases of missing data inside the trajectory, as shown with the magenta dots in Figure 9. The other one is median value interpolation for missing data recovery (MIMDR) for the cases of missing data outside the trajectory, as shown with the red dots in Figure 9.

Figure 9
The illustration of different types of missing data.

The CLIMDR method is context related and statistical. Taking the points A1-A2-A3 shown in Figure 9 as an example, the missing data of A2 is recovered by:


where TAi is the arrival time of Ai for i=1,2,3, tAi,Aj is the traveling time from Ai to Aj and DA1,A2,A3 is the historical non-missing dataset selected from S(l) by the station indices corresponding to A1, A2 and A3. For data operation convenience, S(l) (the one-route trajectories set of line l) can be transformed into a two-dimensional data matrix M(l) with Mi,j(l)M(l) the arrival time at station j of the i-th trajectory. DA1,A2,A3 is constructed from M(l) by two steps:

  1. set DA1,A2,A3=M(l)(:,[I(A1),I(A2),I(A3)]), where the first symbol “:” means selecting all rows of M(l), and the second parameter set “[I(A1),I(A2),I(A3)]” means selecting the corresponding columns of M(l) with I(Ai) the station index of Ai;
  2. delete the rows of DA1,A2,A3 when any elements in the row equal zero (missing data).

The CLIMDR method is implemented based on the assumption that tA1,A2 and tA1,A3 are approximately linear with:


In Equation (20), the parameters k1 and k0 are obtained with least square estimation from DA1,A2,A3 as follows:


where T1,2, T1,3 and T¯1,3 are constructed from DA1,A2,A3 as follows:


The CLIMDR method for the case of A1-A2-A3 shown in Figure 9 is called CLIMDR-1, which means recovering one depth missing datum, i.e., the number of successive missing data is one. Similarly, we call CLIMDR-n for the n-depth missing data recovery by the CLIMDR method. CLIMDR-n is implemented by an iterative method. Taking the trajectory points B1-B6 shown in Figure 9 as an example, there are four missing data in B1-B6; hence, the missing depth is four, and the corresponding recovery method is CLIMDR-4. In the interpolating procedure, we first recover B2 by TB1 and DB1,B2,B6; then, B3 can be recovered by TB2 and DB2,B3,B6. The interpolation procedure continues until B5 is recovered.

The MIMDR method is applied for the cases of missing data outside the extracted trajectory. In those cases, most of interpolation methods, including linear interpolation, kinematic interpolation [23], Bezier curves [24], Catmull-Rom curves [25] and our proposed CLIMDR, are not available, since the head part or tail part of the data is also lost. In the MIMDR method, we firstly partition the data into the weekday set and weekend set. Then, we split the time of day into time slots with equal duration of 20 min and calculate the median value of traveling time in every time slot from the historical non-missing dataset. Finally, the missing data are recovered by the median value according to the time slot and weekday/weekend attributes. In practice, we only recover the starting and terminal stations of the trajectory by MIMDR to avoid accumulative errors. When the missing head or tail of the trajectory is recovered, the CLIMDR method can be applied again to recover the middle part of the trajectory.

5. Results and Discussion

5.1. Comparing Trajectory Clustering between F-FCM and B-FCM

The F-FCM method is suitable for the trajectories that have the right traveling direction, whereas the B-FCM method is suitable for those that have the wrong traveling direction. In practice, it is difficult to determine whether the bus arrival dataset G(m,l) contains wrong direction trajectories. An alternate solution is to partition G(m,l) by both F-FCM and B-FCM and then select the better partition according to the objective function in Equation (1). Figure 10 shows the comparison of the separability for the constructed forward feature and backward feature of G(m,l), which contains wrong traveling direction trajectories. As shown in Figure 10b,c, when G(m,l) contains wrong direction trajectories, the separable performance of X constructed by the forward feature is worse than that by the backward feature.

Figure 10
Comparison of the separable performance of X constructed by forward and backward clustering of G(m,l) with wrong traveling direction trajectories. (a) Arrival events; (b) constructed X for forward clustering; (c) constructed X for backward clustering. ...

The dataset G(m,l), shown in Figure 10, is used as an example. Both F-FCM and B-FCM with a=2 and c=3 are applied, and the aggregated clusters are shown in Figure 11. For F-FCM, the constructed X does not have much of a partitioning margin between the different clusters, thus resulting in wrong partitions for the data W1, W2 and W3, which are pointed out in Figure 11a. In contrast, the B-FCM separates the three clusters correctly. The objective functions calculated by Equation (1) for the above approaches are JaFFCM=3.2×104 and JaBFCM=6.2×103. Since JaBFCM<JaFFCM, the clustering results by B-FCM should be chosen rather than F-FCM. We consider the clustering approach selection criterion ruled by Ja to be more natural and feasible than the straight way of directly checking whether there are wrong direction trajectories in the dataset.

Figure 11
The comparison of the aggregated clusters by both forward FCM (F-FCM) and backward FCM (B-FCM). (a) Clustering by F-FCM; (b) clustering by B-FCM.

5.2. Discussing the Setting of Cluster Number for FCM

The parameter c, which is also called the cluster number, is an important setting for the FCM method. Figure 12 shows the comparison of FCM clustering with different settings of c. In the figure, the scanned c0 by Equation (9) is the same as the true count of the trajectories, i.e., c0=c0*=9. By setting c=c0, the F-FCM clustering in Figure 12a reveals that A1–A2 and A3–A4 are incorrectly partitioned. To fix this wrong clustering, A2 should be split out from the A1–A2 cluster, and A3 should be split out from the A3–A4 cluster. Then, A3 should be connected to A2. In comparison, when c is set as c=11 with α=1.3 in Equation (10), Figure 12b reveals that the wrong partitions for A1–A2 and A3–A4 in Figure 12a are rectified. The drawback of enlarging the gain parameter α is that one trajectory may be split into several different clusters, e.g., A5–A6 and A7–A8 in Figure 12b. When c is small, every cluster must be checked, and the over-aggregated cluster must be split into two or more trajectories. Then, some of the trajectories must be connected into one. In contrast, when c is large, it is only necessary to perform the connecting work for some small clusters. We prefer to set c as greater than the actual count of trajectories in order to avoid the problem of over-clustering.

Figure 12
The comparison of FCM with different settings of c. (a) a=2, c=9; (b) a=2, c=11.

The proposed FCM method solves the problem of deciding which data should be classified into a cluster as a trajectory; however, it cannot distinguish which data in the cluster are abnormal and should be eliminated. Thus, anomaly detection and erroneous data elimination are necessary for the aggregated clusters.

5.3. Testing of the Trajectory Fragment Cleaning Algorithm

To illustrate the cleaning algorithm, we give a real trajectory that was sampled on 29 September 2012 for the line l=130, as shown in Figure 13. The sampling date is the last working day before a week-long holiday in China; thus, it is also a traffic jam day. The bus company scheduled more buses for some busy lines on that day in order to satisfy the travel demand. However, the position reporting devices may not work well for some temporarily-scheduled buses, which results in a greater number of outliers on that day. Table 2 gives the detailed arrival data, which are shown in Figure 13. In the figure, the arrival data labeled as P2P5, P8 and P14 are easily distinguishable as outliers by our empirical knowledge. Furthermore, P16 is reasoned to be a candidate outlier when compared to P15. By applying the proposed AD method, all of the erroneous data are detected and eliminated from the trajectory.

Figure 13
The aggregated trajectory fragment with noisy data.
Table 2
The arrival data shown in Figure 13.

To better illustrate the process of Algorithm 3, the data shown in Figure 13 are used as an example. Figure 14 shows the colorized Uc for the data given in Table 2, with the parameters given in Equation (15). As shown in the figure, the abnormal data P2P5, P8 and P14 show a lower connecting MF than others.

Figure 14
The evaluated fuzzy connecting membership function (MF) matrix for the trajectory fragment shown in Figure 13.

In the first round, k* = 14 with N¯14c=13; thus, P14 is chosen as the noisy data and is deleted in this round. In Algorithm 3, one noisy datum is selected for deletion each round until N¯c=0. In Figure 14, both P15 and P16 satisfy the connecting possibility threshold to other non-noisy data; however, P15 and P16 are in the same station index, which means that at least one of them should be classified as noisy data. We prefer to delete the data with the lower average connecting possibility, which results in maximizing avg(U^c). In the eighth round, both P15 and P16 have the same anomalous indexing, and after comparing the row summation of Uc, P16 is discriminated as the noisy data. Finally, in the ninth round, N¯c=0 is obtained, which means the noise detection process is complete. The outputted U^c is illustrated in Figure 15. As shown in the figure, all of the data are with u(t)>umin to each other, and the corresponding G^c(j,m,l) is a cleaned trajectory fragment. Additionally, if any one of the deleted data is added to G^c(j,m,l), the corresponding N¯c0. Thus, the size of G^c(j,m,l) is maximized by Algorithm 3. Figure 16 shows the final results according to the data given in Figure 13, where the red labels indicate the cleaned data G^c(j,m,l) and the blue ones indicate noisy data R^c(j,m,l).

Figure 15
The fuzzy connecting MF matrix for the trajectory fragment after outlier elimination.
Figure 16
The cleaned trajectory fragment.

5.4. Testing of the Trajectory Fragments Connecting Algorithm

When the traveling data are aggregated by FCM and cleaned by the proposed cleaning algorithm, the following procedure is to connect the over split clusters into a one-route trajectory, which is implemented by the trajectory fragments connecting (TFC) algorithm. The dataset shown in Figure 6 is utilized for testing. Figure 17 shows the cleaned and connected trajectories after TFC, where the data labeled with “*” are the noisy data, and the connected dot data are one-route trajectories (also denoted by S(m,l)). In the figure, the data in the first and final stations (FF-data) are omitted. Furthermore, the trajectories labeled by A1, A2 and A3 are filtered as noisy data because their respective sizes violate the connecting rule described by Constraint (18). When all of the one-route trajectories are extracted by Algorithm 4, it is easy to add the available FF-data to S(m,l) by u(t) between the FF-data and the data in S(m,l). Figure 18 shows the final extracted trajectories from the data shown in Figure 4. As illustrated in the figure, all of the one-route trajectories are extracted and cleaned.

Figure 17
The cleaned trajectories after trajectory fragments connecting (TFC).
Figure 18
The final extracted trajectories after adding the arrival data of the first and final stations (FF).

5.5. Performance of the Missing Data Recovery Methods

A dataset with a size of 8288 one-route trajectories for the line l=130 was used to verify the performance of our proposed missing data recovery methods. The dataset was sampled from 1 August–10 December in 2012 and was extracted by the addressed extracting algorithm. There are 118,917 (29.89%) data lost in the trajectories. For a real system, it is difficult to conduct the studies such as dynamic headway mining [31,32], service reliability evaluation [33], scheduling analysis [34,35,36], and so on, before missing data are interpolated, especially for the trajectories in the high missing rate. Thus, missing data recovery is important for the trajectory mining of urban buses. The proposed CLIMDR method is easily implemented and outperforms when the dataset is large enough (thousands of trajectories or more).

Figure 19 shows the mean absolute error (MAE) of the recovered data by CLIMDR-1 from Station 3–Station 46 for the tested dataset. For comparison, the traditional straight linear interpolation for missing data recovery with one depth missing data (SLIMDR-1) and Catmull–Rom cubic interpolation [25] are also conducted. As compared in the figure, the performance of our method is better than those of the other two methods for almost allstations, especially for the stations with congested roads and a large MAE.

Figure 19
Comparisons of the MAE performances of the recovered data by contextual linear interpolation for missing data recovery (CLIMDR)-1.

Figure 20 shows the MAE performance of the recovered data by MIMDR for the same dataset. In Figure 20, there are a total of 42 stations (from Station 3–Station 44) for the test. For the MIMDR method, there are 10 stations with MAE0.2 and 21 stations with MAE0.4. In contrast, the CLIMDR-1 method outperforms MIMDR by 17 stations with MAE0.2 and 36 stations with MAE0.4. For the real system, we only recover the first and final stations for the trajectories by the MIMDR method, and then, we apply the CLIMDR method to interpolate the middle parts for the trajectories.

Figure 20
The MAE performance of the recovered data by median value interpolation for missing data recovery (MIMDR).

Missing consecutive data also commonly occur in the real system. The CLIMDR-n method is suitable for the cases where n consecutive data are missing in the trajectories, and it outperforms comparable methods in most cases. Figure 21 shows the comparisons of the MAE performances for the cases of six consecutive missing data, i.e., six-depth missing data. We selected eight cases, for which the starting station is from 21–28, to test the performance of CLIMDR-6. In the figure, the MAE performances of the three methods are grouped, where the left bar represents the proposed CLIMDR method, the middle bar represents the SLIMDR method and the right bar represents the Catmull–Rom method. For six-depth missing data recovery, there are six groups of comparisons in total for each case. As shown in the figure, group i (i=1,2,,6) also indicated the depth index, representing the performance of the corresponding i-th missing data in the six-depth missing data. As illustrated in the figure, the CLIMDR method outperformed the other two methods in almost all of the cases.

Figure 21
Comparisons of the MAE performances for six-depth missing data recovery.

Figure 22 shows the final recovered trajectories for the same dataset as illustrated in Figure 17, where the red dots represent the recovered data. When the cleaned bus trajectories are extracted and recovered for the missing data, the data are stored in the form of one trajectory for one record, and all of the data from different sampling days are merged into one database. In the final database, the trajectories are sorted by dispatching time and can be retrieved in the form of a two-dimensional matrix, where one row represents one trajectory. One merit of the trajectory matrix is that the traveling time set between any two stations can be easily derived by subtracting the corresponding two columns from the matrix. Furthermore, the headway dynamics can also be calculated by subtracting any two adjacent row data in the matrix.

Figure 22
The final recovered trajectories.

6. Conclusions

In this paper, we have presented a novel framework for urban bus traveling data acquisition and data processing for data sampled from the Internet. The work consisted of two parts: trajectory extraction with noisy data removal and missing data recovery. We introduced the pentagonal fuzzy membership function for the purpose of describing the connecting possibility of arrival events. The fuzzy membership function was extended to be a matrix form and was applied to detect the outliers in a cluster. Furthermore, the connecting rule for clusters was designed with the help of fuzzy MF. The trajectory extraction work was implemented by three sub-procedures: data clustering by FCM, trajectory fragments cleaning and trajectory fragments connecting.

In real systems, the extracted trajectories have a high rate of missing data for data sampled from the Internet. We classified the missing data into two types: missing inside the trajectory and missing outside the trajectory. We introduced the CLIMDR method for data interpolation for data missing inside the trajectory and the MIMDR method for recovering the head or tail of the trajectory. We showed that the proposed CLIMDR method outperformed the straight linear interpolation method and Catmull–Rom interpolation method in most cases. The CLIMDR method is data driven and parameter free, which is easily implemented and robust.

In the future, we plan to investigate headway dynamics, scheduling efficiency and traveling time prediction of urban buses with the data sampled from the Internet and preprocessed by the methods given in this paper.


The work has been partially supported by grants from the Zhejiang Provincial Natural Science Foundation (LQ13G010007, LY17F020012, LY17F030016) and the National Natural Science Foundation of China (No. 61303113, 61572439, 61273212, 61374152).

Supplementary Materials

The following are available online at

Appendix A

The raw bus arrival data formed like Table 1 for l=130 are available as a supplementary file, which sampled during 1 August to 9 December 2012. The extracted and interpolated trajectories formed as S(l) for l=130 are also available. All of the data are stored in the form of one file for one sampling day.

Author Contributions

Author Contributions

C.F. Tong designed the system. X.H. Yang and C.F. Tong conceived of the algorithms. H.L. Chen analyzed the data. C.F. Tong and Q. Xuan wrote the paper.

Conflicts of Interest

Conflicts of Interest

The authors declare no conflict of interest.


1. Zheng Y. Trajectory Data Mining: An Overview. ACM Trans. Intell. Syst. Technol. 2015;6:1–45. doi: 10.1145/2743025. [Cross Ref]
2. Zhang F., Yuan N.J., Wilkie D., Zheng Y., Xie X. Sensing the Pulse of Urban Refueling Behavior: A Perspective from Taxi Mobility. ACM Trans. Intell. Syst. Technol. 2015;6:1–24. doi: 10.1145/2644828. [Cross Ref]
3. Ma S., Zheng Y., Wolfson O. Real-Time City-Scale Taxi Ridesharing. IEEE Trans. Knowl. Data Eng. 2015;27:1782–1795. doi: 10.1109/TKDE.2014.2334313. [Cross Ref]
4. Yuan N.J., Zheng Y., Xie X., Wang Y., Zheng K., Xiong H. Discovering Urban Functional Zones Using Latent Activity Trajectories. IEEE Trans. Knowl. Data Eng. 2015;27:712–725. doi: 10.1109/TKDE.2014.2345405. [Cross Ref]
5. Zhang D., Sun L., Li B., Chen C., Pan G., Li S., Wu Z. Understanding Taxi Service Strategies From Taxi GPS Traces. IEEE Trans. Intell. Transp. Syst. 2015;16:123–135. doi: 10.1109/TITS.2014.2328231. [Cross Ref]
6. Castro P.S., Zhang D., Chen C., Li S., Pan G. From Taxi GPS Traces to Social and Community Dynamics: A Survey. ACM Comput. Surv. 2013;46:1–34. doi: 10.1145/2543581.2543584. [Cross Ref]
7. Pan G., Qi G.D., Zhang W.S., Li S.J., Wu Z.H., Yang L.T. Trace Analysis and Mining for Smart Cities: Issues, Methods, and Applications. IEEE Commun. Mag. 2013;51:120–126. doi: 10.1109/MCOM.2013.6525604. [Cross Ref]
8. Pan G., Qi G., Wu Z., Zhang D., Li S. Land-Use Classification Using Taxi GPS Traces. IEEE Trans. Intell. Transp. Syst. 2013;14:113–123. doi: 10.1109/TITS.2012.2209201. [Cross Ref]
9. Li X., Pan G., Wu Z., Qi G., Li S., Zhang D., Zhang W., Wang Z. Prediction of urban human mobility using large-scale taxi traces and its applications. Front. Comput. Sci. China. 2012;6:111–121.
10. Liu X., Biagioni J., Eriksson J., Wang Y., Forman G., Zhu Y. Mining Large-scale, Sparse GPS Traces for Map Inference: Comparison of Approaches; Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’12); Beijing, China. 12–16 August 2012; New York, NY, USA: ACM; 2012. pp. 669–677.
11. Bejan A., Gibbens R., Evans D., Beresford A., Bacon J., Friday A. Statistical modelling and analysis of sparse bus probe data in urban areas; Proceedings of the 13th International IEEE Conference on Intelligent Transportation Systems; Funchal, Portugal. 19–22 September 2010; pp. 1256–1263.
12. Bejan A., Gibbens R. Evaluation of Velocity Fields via Sparse Bus Probe Data in Urban Areas; Proceedings of the 14th Internation IEEE Conference on Intelligent Transportation Systems; Washington, DC, USA. 5–7 October 2011; pp. 746–753.
13. Do Amaral B.G., Nasser R., Casanova M.A., Lopes H. BusesinRio: Buses as mobile traffic sensors: Managing the bus GPS data in the city of Rio de Janeiro; Proceedings of the 17th IEEE International Conference on Mobile Data Management; Porto, Portugal. 13–16 June 2016; pp. 369–372.
14. Suzhou Transportation Bureau The Official Website of Bus Arrival Information Quering System for Suzhou. [EB/OL] [(accessed on 18 January 2017)];2017 Available online:
15. Dai D., Mu D. An Algorithm for Bus Trajectory Extraction Based on Incomplete Data Source. Chin. J. Electron. 2012;21:599–603.
16. Chandola V., Banerjee A., Kumar V. Anomaly detection: A survey. ACM Comput. Surv. 2009;41:1–58. doi: 10.1145/1541880.1541882. [Cross Ref]
17. Yamanishi K., Takeuchi J.I., Williams G., Milne P. On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Min. Knowl. Discov. 2004;8:275–300. doi: 10.1023/B:DAMI.0000023676.72185.7c. [Cross Ref]
18. Markou M., Singh S. A Neural Network-Based Novelty Detector for Image Sequence Analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2006;28:1664–1677. doi: 10.1109/TPAMI.2006.196. [PubMed] [Cross Ref]
19. Ghoting A., Parthasarathy S., Otey M.E. Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov. 2008;16:349–364. doi: 10.1007/s10618-008-0093-2. [Cross Ref]
20. Piciarelli C., Micheloni C., Foresti G.L. Trajectory-Based Anomalous Event Detection. IEEE Trans. Circ. Syst. Video Technol. 2008;18:1544–1554. doi: 10.1109/TCSVT.2008.2005599. [Cross Ref]
21. Fanaee-T H., Gama J. Tensor-based anomaly detection: An interdisciplinary survey. Knowl. Based Syst. 2016;98:130–147. doi: 10.1016/j.knosys.2016.01.027. [Cross Ref]
22. Hayes M.A., Capretz M.A. Contextual anomaly detection framework for big sensor data. J. Big Data. 2015;2:1–22. doi: 10.1186/s40537-014-0011-y. [Cross Ref]
23. Long J.A. Kinematic interpolation of movement data. Int. J. Geogr. Inf. Sci. 2016;30:854–868. doi: 10.1080/13658816.2015.1081909. [Cross Ref]
24. Winkel R. On a generalization of Bernstein polynomials and Bezier curves based on umbral calculus (II): De Casteljau algorithm. Comput. Aided Geom. Des. 2015;39:1–16. doi: 10.1016/j.cagd.2015.04.002. [Cross Ref]
25. Yuksel C., Schaefer S., Keyser J. Parameterization and applications of Catmull-Rom curves. Comput. Aided Des. 2011;43:747–755. doi: 10.1016/j.cad.2010.08.008. [Cross Ref]
26. Lei P.R. A framework for anomaly detection in maritime trajectory behavior. Knowl. Inf. Syst. 2016;47:189–214. doi: 10.1007/s10115-015-0845-4. [Cross Ref]
27. Ando S., Thanomphongphan T., Seki Y., Suzuki E. Ensemble anomaly detection from multi-resolution trajectory features. Data Min. Knowl. Discov. 2015;29:39–83. doi: 10.1007/s10618-013-0334-x. [Cross Ref]
28. Bezdek J. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press; New York, NY, USA: 1981.
29. Ang K.K., Quek C. Supervised pseudo self-evolving cerebellar algorithm for generating fuzzy membership functions. Expert Syst. Appl. 2012;39:2279–2287. doi: 10.1016/j.eswa.2011.08.001. [Cross Ref]
30. Zenebe A., Norcio A.F. Representation, similarity measures and aggregation methods using fuzzy sets for content-based recommender systems. Fuzzy Sets Syst. 2009;160:76–94. doi: 10.1016/j.fss.2008.03.017. [Cross Ref]
31. Bartholdi J.J., III, Eisenstein D.D. A self-coordinating bus route to resist bus bunching. Transp. Res. Part B Methodol. 2012;46:481–491. doi: 10.1016/j.trb.2011.11.001. [Cross Ref]
32. Li Y., Xu W., He S. Expected value model for optimizing the multiple bus headways. Appl. Math. Comput. 2013;219:5849–5861. doi: 10.1016/j.amc.2012.11.098. [Cross Ref]
33. Chen X., Yu L., Zhang Y., Guo J. Analyzing urban bus service reliability at the stop, route, and network levels. Transp. Res. Part A Policy Pract. 2009;43:722–734. doi: 10.1016/j.tra.2009.07.006. [Cross Ref]
34. Lin J., Wang P., Barnum D.T. A quality control framework for bus schedule reliability. Transp. Res. Part E Logist. Transp. Rev. 2008;44:1086–1098. doi: 10.1016/j.tre.2007.10.002. [Cross Ref]
35. He S.X. An anti-bunching strategy to improve bus schedule and headway reliability by making use of the available accurate information. Comput. Ind. Eng. 2015;85:17–32. doi: 10.1016/j.cie.2015.03.004. [Cross Ref]
36. Mendes-Moreira J., Moreira-Matias L., Gama J., Freire de Sousa J. Validating the coverage of bus schedules: A Machine Learning approach. Inf. Sci. 2015;293:299–313. doi: 10.1016/j.ins.2014.09.005. [Cross Ref]

Articles from Sensors (Basel, Switzerland) are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)