A neural network model which combines unsupervised and supervised learning

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 4, NO. 2, MARCH 1993, p. 357

Fig. 3. (a) The velocity reference signal and its effect on the error. (b) The control signals generated by the neural network.

region. The corresponding response of the system is shown in Fig. 3(a). The system rise time is improved to 32 s and the overshoot is reduced to 42%. Also, the settling time of the system is about 150 s, which is much less than that without the velocity reference. Fig. 3(b) shows the control signals generated by the neural network for the three cases referred to in Fig. 2(a). It can be seen that the network is able to perform better with the velocity reference.

IV. CONCLUSION

A neural network controller designed solely on the basis of the desired response of the dynamics, so as to realize a maximum acceptable steady-state error, may not be the best one in terms of dynamic response. We presented a velocity reference feedback scheme for neural network controllers to improve their dynamic responses. We have shown that the scheme gives specific improvements in the transient and steady-state responses of the system.

REFERENCES

[1] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4-27, Mar. 1990.
[2] V. C. Chen and Y. H. Pao, "Learning control using neural networks," in Proc. IEEE Int. Conf. Robotics and Automation, vol. 3, 1989, pp. 1448-1453.
[3] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing, vol. 1: Foundations. Cambridge, MA: MIT Press, 1986.
[5] J. Feldman, "DTNSRDC revised standard submarine equations of motion," David W. Taylor Naval Ship Research and Development Center, Tech. Rep. SPD-0393-09, 1979.

A Neural Network Model which Combines Unsupervised and Supervised Learning

Keun-Rong Hsieh and Wen-Tsuen Chen

Abstract—In this letter, we propose a neural network which combines unsupervised and supervised learning for pattern recognition. The network is a hierarchical self-organization map, which is trained by unsupervised learning at first. When the network fails to recognize similar patterns, supervised learning is applied to teach the network to give different scaling factors to different features so as to discriminate similar patterns. Simulation results show that the model obtains good generalization capability as well as sharp discrimination between similar patterns.

I. INTRODUCTION

Artificial neural network models have been widely used in pattern recognition because of their parallel nature and simple processing elements. Some models are used as feature extractors, some are used as statistical classifiers, and some implement both to constitute a complete recognition system. In either case, each model is characterized by its connection topology and learning rule. Learning is used to adapt the network so that it captures the features in its training data and responds as desired. There are two classes of learning, namely supervised and unsupervised. In supervised learning, training patterns as well as their associated desired output patterns are fed into the network during the training phase. In unsupervised learning, only training patterns are given, and the network usually categorizes them according to a prespecified measure of similarity between patterns. A multilayer perceptron (MLP) with the back-propagation (BP) learning rule is a representative model of supervised learning [1]. It is theoretically a universal function approximator, and can be trained as a feature extractor, as a classifier, or as both.
But when it is applied to pattern recognition, it exhibits poor generalization capability for untrained patterns [2]. Kohonen's unsupervised learning rule implements a k-nearest-neighbor classifier, which usually works in feature space [3]. Its performance depends heavily on the results of feature extraction. The neocognitron [4] is a complete classifier which can be trained by a supervised or an unsupervised rule. In this network, the activation of each neuron represents the existence of a specific feature at a specific position in the input bitmap. It is capable of recognizing alphanumeric characters with some deformations. But it demands very careful selection of training patterns, especially the feature patterns for the second layer. Furthermore, its performance is very sensitive to the parameters used during training.

Manuscript received January 8, 1992; revised May 28, 1992. This work was supported by the National Science Council of the Republic of China under Grant NSC-81-0408-E-007-03. The authors are with the Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 30043, Republic of China. IEEE Log Number 9203723. 1045-9227/93$03.00 © 1993 IEEE

In this letter, we propose a neural network model which is trained by unsupervised and supervised learning rules for pattern recognition. The network has a multilayer architecture with self-organizing capability at each layer. It is first trained by unsupervised learning as an initial network. At the beginning, the network is able to recognize oriented lines. After self-organization on the training patterns, the network is able to discriminate some classes of patterns which have relatively great differences in shape. But it cannot discriminate similar patterns. In this case, the network is further trained by a supervised rule.
The supervised algorithm finds the minor differences between similar patterns and amplifies them so as to obtain a correct response. We believe this treatment to be a better learning scheme.

A. Network Topology

The network is composed of four layers, including the input layer. The layers are labeled Layer 0, Layer 1, Layer 2, and Layer 3, with Layer 0 being the input layer and Layer 3 being the output layer. Each layer consists of nodes organized as a two-dimensional plane. The sizes of the layers are 16x16, 52x52, 60x60, and 10x10, respectively, as shown in Fig. 1. There are two types of connections in the network, namely forward connections between successive layers and lateral connections within the same layer. Nodes in each layer are partitioned into clusters, and the cluster sizes of Layer 0 to Layer 3 are 4x4, 4x4, 10x10, and 10x10, respectively. All nodes in the same cluster have connections from exactly the same set of nodes in their preceding layer. That is, they have the same receptive fields, and there are full connections between a cluster and its receptive field. The receptive fields of Layer 1, Layer 2, and Layer 3 are 4x4, 32x32, and 60x60, respectively. The lateral connections within a layer are included so as to retain self-organization capability. The connection weights of the lateral connections follow a "Mexican-hat" function. The function of each layer is similar to that of the neocognitron. In the neocognitron, each node extracts some feature at some specific location in the input patterns. The hidden layers of our model, Layer 1 and Layer 2, also extract features from input patterns. Each cluster checks some area of the input image, and the output of each cluster represents the type of feature in that area. The advantage of our model is that it is not necessary to elaborately choose the training patterns for each layer.
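The letter does not give the "Mexican-hat" profile explicitly; a common realization is a difference of Gaussians (short-range excitation, longer-range inhibition). The widths and amplitudes below are illustrative placeholders, not the authors' settings:

```python
import numpy as np

def mexican_hat(d, sigma_e=1.0, sigma_i=3.0, a_e=1.0, a_i=0.5):
    """Difference-of-Gaussians lateral weight: excitatory near d = 0,
    inhibitory at intermediate distances, decaying toward zero far away."""
    d = np.asarray(d, dtype=float)
    return a_e * np.exp(-d**2 / (2 * sigma_e**2)) - a_i * np.exp(-d**2 / (2 * sigma_i**2))

def lateral_weight(p, q, **kw):
    """Lateral weight between two nodes of a 2-D layer, indexed by (row, col)."""
    d = np.hypot(p[0] - q[0], p[1] - q[1])  # Euclidean distance on the grid
    return mexican_hat(d, **kw)
```

With these parameters, the weight is positive for nearby nodes, turns negative at intermediate grid distances, and is negligible far away, which is the qualitative shape needed for winner-take-most self-organization.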
The hidden layers successively extract salient features automatically, and when this procedure fails, supervised learning takes over to correct the errors.

B. Unsupervised Training of the Network

Previous studies [5], [6] show that in the early stage of visual processing, line segments are primarily recognized at different orientations. Many pattern recognition systems have such an effect in their first stage. The first layer of our network also achieves this effect, by an unsupervised learning rule similar to Kohonen's self-organizing feature map. Kohonen's unsupervised learning rule is as follows:

Step 1: Select a training pattern X = (x_1, x_2, ..., x_N) and feed it into the input layer.

Fig. 1. The architecture of the network (Layer 0: 16x16; Layer 1: 52x52; Layer 2: 60x60; Layer 3: 10x10).

Step 2: Compute the distance d_i between the input and each output node i using

d_i = \sum_{j=1}^{N} (x_j(t) - w_{ij}(t))^2    (1)

where x_j(t) is the jth input at time t and w_{ij}(t) is the weight from input neuron j to output neuron i at time t.

Step 3: Select an output node i* with minimum distance among all output nodes, and update the weight vectors W of i* and its neighbors by the following rule:

w_{ij}(t+1) = w_{ij}(t) + \eta(t)(x_j(t) - w_{ij}(t)),  for i \in N_{i*}, j = 1, 2, ..., N    (2)

where N_{i*} is the set consisting of i* and its neighbors, and \eta(t) is the learning rate, which is usually less than 1. The procedure continues until the weights converge. In Step 3, the updating rule makes the weight vectors of i* and its neighbors more similar to X(t). Generally, i* and its neighbors will then be responsive to future input vectors which are close to X(t).
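Steps 1-3 of Kohonen's rule above can be sketched directly. This is a minimal illustration only; the toy 1-D map, the neighborhood function, and the learning rate are placeholder choices, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def kohonen_step(W, x, neighbors, eta=0.2):
    """One step of the self-organization rule of (1)-(2).
    W: (n_out, n_in) weight matrix; x: input vector;
    neighbors(i): returns the set N_i* (the winner plus its neighbors)."""
    d = np.sum((x - W) ** 2, axis=1)   # squared distances, eq. (1)
    i_star = int(np.argmin(d))         # winning node i*
    for i in neighbors(i_star):
        W[i] += eta * (x - W[i])       # move weights toward x, eq. (2)
    return i_star

# Toy 1-D map of 5 nodes over 3-D inputs; neighbors are the adjacent indices.
W = rng.random((5, 3))
nbrs = lambda i: {j for j in (i - 1, i, i + 1) if 0 <= j < 5}
for _ in range(100):
    kohonen_step(W, np.array([1.0, 0.0, 0.0]), nbrs)
```

After repeated presentations of the same pattern, the winner's weight vector converges geometrically toward the input, exactly the behavior the text describes for i* and its neighbors.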
If the input vectors and weight vectors are normalized to unit length, and each output node i computes the inner product of W_i and X as its output, then finding the node i* with minimum distance is equivalent to finding the node i* with maximum output value. In our network model, layers are not fully connected. Thus, Step 3 is modified to choose a node with minimum distance within each cluster. Since our model is a multilayer self-organizing network, it is not practical to normalize vectors at each layer. For simplicity, we assume the output function o_i of each node to be reciprocally related to the distance between X and W_i, such as

o_i = k_1 \left(1 + \sum_{j=1}^{N} (x_j(t) - w_{ij}(t))^2\right)^{-1/2}    (3)

where k_1 is a constant. Therefore, a node i has a larger output value if its weight vector W_i is more similar to X. All nodes in our network operate according to the above output function and learn by our modified learning rule.

We first choose line segments in four directions as our initial training patterns: horizontal, vertical, 45-degree, and 135-degree lines. These lines are of width 2 so that they can tolerate more distortion in Layer 1. The purpose of this training is to make different nodes in Layer 1 responsive to line segments of different orientations. Each layer of the network self-organizes according to these training patterns. As a result, groups of nodes become responsive to different orientations of lines at different locations. After this training, we can consider the network to have the capability to extract line information from the input plane.

Following the above training, we use handwritten numeric patterns as training patterns and let the network learn by the same learning rule. Again, this is unsupervised learning; the network self-organizes according to these numeric patterns. Following this self-organization,
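The output function (3) and the per-cluster modification of Step 3 can be sketched together. A cluster is assumed here to be simply a list of node indices into the layer's weight matrix; the names and shapes are illustrative:

```python
import numpy as np

def node_outputs(W, x, k1=1.0):
    """Output function (3): larger output for weight vectors closer to x."""
    sq = np.sum((x - W) ** 2, axis=1)
    return k1 / np.sqrt(1.0 + sq)

def cluster_winners(W, x, clusters):
    """Modified Step 3: pick the maximum-output (minimum-distance) node
    within each cluster, since layers are not fully connected."""
    o = node_outputs(W, x)
    return [cluster[int(np.argmax(o[cluster]))] for cluster in clusters]

# A node whose weight vector equals x attains the maximum output k1.
W = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
x = np.array([1.0, 1.0])
winners = cluster_winners(W, x, [[0, 1], [2]])
```

Because (3) is a strictly decreasing function of the squared distance, maximizing the output within a cluster is the same as minimizing the distance, which is why no per-layer normalization is needed.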
the nodes of the output layer are divided into different areas, with each area responsive to one or more classes of patterns. Classes with similar shapes are forced into the same area, which means the network cannot discriminate these classes at this point. Thus, we need further learning, as described in the following section.

C. Supervised Training of the Network

Unsupervised learning as described in the above section is not good enough to discriminate all classes of patterns. Each node i responds according to the distance between X and W_i, and the measure of distance in Kohonen's network is Euclidean. Thus, if the input is not sparse enough, the same set of nodes will respond to many similar patterns in different classes as long as their Euclidean distances are close enough. Besides, in multilayer networks such as the neocognitron or the perceptron, the output layer receives input only from its preceding layer. Detailed information in the input image can be lost in the feature extraction process of the previous layers. Therefore, these networks are unable to discriminate similar patterns whose features are mostly the same.

There are two main ideas in our supervised learning. First, different features must have different scaling factors in discriminating similar patterns. For instance, in the letters "O" and "Q," the extra stroke near the bottom of "Q" is the dominating factor in discriminating these two characters. Second, the output layer, whose response represents the classification result, should receive input not only from its preceding layer but also from the lower layer near the input layer.

Since some parts of the feature vectors should contribute more than others in discriminating similar patterns, we incorporate a scaling factor \beta to determine the degree of contribution of different parts of the feature vectors. We propose the following equation as the output function of each node:

o_i = k_1 \left(1 + \sum_{j=1}^{N} \beta_{ij}(x_j(t) - w_{ij}(t))^2\right)^{-1/2}    (4)

In general, the more important a feature w_{ij} is, the greater its \beta_{ij} is. Our supervised learning rule mainly examines the small differences between similar patterns at Layer 1 and amplifies those differences, that is, it increases the scaling factors of the differing features. There are several possible cases; for clarity, we use the following case to demonstrate our learning procedure.

Step 1: Suppose classes A and B have the same set R of active nodes in the output layer. Select patterns P_A and P_B from classes A and B, respectively.

Step 2: Examine the responses at Layer 1 for patterns P_A and P_B. If the responses to P_A and P_B are the same, the network is unable to discriminate these two patterns; try another pair of patterns.

Step 3: Suppose nodes C_A and C_B in Layer 1 respond only to patterns P_A and P_B, respectively. The goal is to divide R into two subsets R_A and R_B such that R_A is ON only when C_A is ON, and R_B is ON only when C_B is ON. We can increase the sizes of the weight vectors of the nodes in R using the indicator

r_{A,i} = 1 for i \in R_A, and 0 otherwise    (5)

with r_{B,i} defined analogously. Furthermore, the scaling factors corresponding to C_A and C_B are updated accordingly (6), where k_2 is the learning rate.

Fig. 2. Some of the training patterns.

Fig. 3. (a) Feature map before supervised learning. (b) Feature map after supervised learning.

The supervised learning is applied again and again until all misclassified classes have been examined. If two patterns have very similar responses at Layer 1, they will produce exactly the same response at Layer 3. After the introduction of the scaling factors, the responses at Layer 1 are unchanged, but they are no longer similar from the viewpoint of Layer 3, because each entry contributes differently to the measure of similarity. Therefore, the patterns will produce quite different responses in Layer 3, and we are able to discriminate different classes of patterns after the supervised learning.
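The effect of the scaling factors in (4) can be illustrated with a toy example. The feature vectors and beta values below are invented for illustration only, with the extra stroke of "Q" versus "O" reduced to a single differing feature entry:

```python
import numpy as np

def scaled_output(w, x, beta, k1=1.0):
    """Output function (4): beta amplifies discriminative features."""
    return k1 / np.sqrt(1.0 + np.sum(beta * (x - w) ** 2))

# A node tuned to "O", and two inputs differing only in the last entry
# (standing in for the extra stroke that distinguishes "Q" from "O"):
w  = np.array([1.0, 1.0, 1.0, 0.0])
xO = np.array([1.0, 1.0, 1.0, 0.0])
xQ = np.array([1.0, 1.0, 1.0, 0.4])

beta_flat = np.ones(4)                        # before supervised learning
beta_amp  = np.array([1.0, 1.0, 1.0, 25.0])   # stroke feature amplified

# With flat scaling the two outputs are nearly equal; amplifying the
# discriminative feature drives them apart.
gap_before = scaled_output(w, xO, beta_flat) - scaled_output(w, xQ, beta_flat)
gap_after  = scaled_output(w, xO, beta_amp)  - scaled_output(w, xQ, beta_amp)
```

This mirrors the argument in the text: the Layer 1 responses themselves do not change, but the weighted distance seen downstream separates the two patterns.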
III. SIMULATION RESULTS

We have simulated the network model on a PC 486 for handwritten numeric pattern recognition. The network is first trained with line segments of width 2 in four directions, as described in the previous section. After this first training stage, the layers of the network self-organize to extract different line segments at different locations. Then handwritten numeric characters, which are shifted to the center and scaled to maximum size, are fed into the network, and the network self-organizes by unsupervised learning again. There are 10 training patterns for each digit, chosen so as to contain as many variations as possible. Some of the training patterns are shown in Fig. 2. Both phases of self-organization take about 1000 iterations to converge. After the training, the output layer self-organizes into areas, with each area responsive to some class of patterns, similar to Kohonen's phonetic map [3]. The map is shown in Fig. 3(a). As can be seen from the figure, some classes are mapped into the same area, which means the network cannot discriminate these classes. In fact, these classes have similar shapes. The network is then trained by our supervised training rule. The supervised training keeps examining misclassified pairs of patterns until all such patterns have been examined. After the supervised training, the feature map changes to that of Fig. 3(b). In this feature map, all classes have their own responsive areas, and the network classifies all the training patterns correctly. We have also tested untrained patterns, and the network successfully recognizes all of them. Some of the test patterns are shown in Fig. 4.

Fig. 4. Some correctly recognized patterns.

A three-layer perceptron was constructed to compare its generalization capability with that of our model. The number of nodes in each layer is 12x12, 10x10, and 10x1, respectively. The perceptron is first trained by back-propagation with the same set of training patterns.
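The input preprocessing mentioned above, shifting each character to the center and scaling it to maximum size, can be sketched as follows. This is an assumed nearest-neighbor implementation, not the authors' code, and the output size is a placeholder:

```python
import numpy as np

def normalize_bitmap(img, out_size=16):
    """Crop a binary character image to its bounding box and resample the
    box to fill the whole input plane (nearest-neighbor), which both
    centers the character and scales it to maximum size."""
    rows = np.any(img, axis=1)
    cols = np.any(img, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]   # first/last nonempty row
    c0, c1 = np.where(cols)[0][[0, -1]]   # first/last nonempty column
    crop = img[r0:r1 + 1, c0:c1 + 1]
    ri = (np.arange(out_size) * crop.shape[0]) // out_size
    ci = (np.arange(out_size) * crop.shape[1]) // out_size
    return crop[np.ix_(ri, ci)]

# A small off-center blob is stretched to fill the whole output plane.
img = np.zeros((8, 8), dtype=int)
img[1:3, 5:7] = 1
out = normalize_bitmap(img, out_size=4)
```

Stretching the bounding box to the full plane removes translation and size variation, so the self-organizing layers only have to cope with shape deformations.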
For the training patterns, both models achieve 95% recognition rates. We then test the recognition rates of both models on 200 untrained patterns. These untrained patterns were written by different people and exhibit different deformations. The recognition rates of the perceptron and of our model are 82% and 93%, respectively. The performance of our model does not degrade as much as that of the perceptron, which implies that our model generalizes better.

IV. DISCUSSION

Hybrid learning rules have been studied previously [7], [8]. They share a common idea: unsupervised learning is used to train the hidden layer, and supervised learning is used to train the output layer. In our work, unsupervised learning is first applied to the whole network so that the network acquires the ability of rough clustering; then supervised learning is applied to some layers for fine classification. Our approach is closer to the idea of multiresolution [9]: features at lower resolution are used for rough clustering, and features at higher resolution are used for fine classification. Our learning rule is especially applicable to character recognition. As with other models, it is very difficult to estimate its performance meaningfully for such problems. The performance of hybrid learning schemes such as the counterpropagation network is similar to that of the multilayer perceptron, except for faster learning. Our simulations show that our learning model has better generalization ability. The neocognitron's good performance comes from careful selection of training patterns and the "blur" operation at successive layers; these ideas can also be applied to our model, in which case its performance can be improved further. The adaptive nature of neural networks makes them attractive for use in pattern recognition. But errors always occur in discriminating untrained patterns. The problem is how to deal with these errors.
An effective relearning rule has to memorize the misclassified patterns in one iteration, or in as few as possible. Meanwhile, the relearning must not disturb the recognition of other patterns, even if auxiliary resources such as spare nodes and connections have to be used. These requirements suggest another learning rule for on-line relearning. From a practical standpoint, a pattern recognition system with a fair recognition rate and no repeated errors is better than one with a high recognition rate and repeated errors. Thus, a fast relearning rule is indispensable for a practical recognition system. We consider a proper hybrid learning scheme desirable for such an environment.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments on the manuscript.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press, 1986.
[2] G. G. Lendaris and I. A. Harb, "Improved generalization in ANN's via use of conceptual graphs: A character recognition task as an example case," in Proc. IJCNN-90, San Diego, CA, 1990, pp. I-551-I-556.
[3] T. Kohonen, "The neural phonetic typewriter," IEEE Computer, vol. 21, no. 3, pp. 11-22, 1988.
[4] K. Fukushima and N. Wake, "Handwritten alphanumeric character recognition by the neocognitron," IEEE Trans. Neural Networks, vol. 2, pp. 355-365, 1991.
[5] E. W. Kent, The Brains of Men and Machines. Peterborough, NH: McGraw-Hill, 1981.
[6] R. Linsker, "Self-organization in a perceptual network," IEEE Computer, vol. 21, no. 3, pp. 105-117, 1988.
[7] R. Hecht-Nielsen, Neurocomputing. Reading, MA: Addison-Wesley, 1990.
[8] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, pp. 281-294, 1989.
[9] A. Rosenfeld, Multiresolution Image Processing and Analysis. New York: Springer-Verlag, 1984.
Generalization in Probabilistic RAM Nets

T. G. Clarkson, Y. Guan, J. G. Taylor, and D. Gorse

Abstract—The probabilistic RAM (PRAM) is a hardware-realisable neural device which is stochastic in operation and highly nonlinear. Even small nets of PRAMs offer high levels of functionality. The means by which a PRAM network generalizes when trained in noise is shown, and the results of this behavior are described.

I. INTRODUCTION

Two important properties exist for artificial neurons used in classification systems: nonlinearity and generalization. Current models of the neuron normally possess only one of these properties, not both. The addition of further neurons is therefore required in order for these properties to be found in a network of artificial neurons, for example, through the use of hidden layers. The PRAM neuron described below is claimed to possess both of these properties, which leads to a reduction in the number of neurons required to perform a given task. Being RAM-based, nonlinearity is an intrinsic feature of the PRAM. We show below how the PRAM also exhibits generalization when trained in noise.

II. THE PRAM

The model of the PRAM neuron has been described previously [1], [2]. This model has been realized in VLSI hardware, and arrays of over

Manuscript received May 26, 1992. T. G. Clarkson and Y. Guan are with the Department of Electrical Engineering, King's College, London. J. G. Taylor is with the Department of Mathematics, King's College, London. D. Gorse is with the Department of Computer Science, University College, London. IEEE Log Number 9206546.

