Using local outlier factors, we can detect outliers that lie far away from most of the samples. Before outlining the algorithm, a few concepts are needed.

Reachability Distance
$$\mathrm{RD}_k(x,x') = \max\left(\|x - x^{(k)}\|,\ \|x - x'\|\right)$$

where $x^{(k)}$ stands for the $k$-th nearest point to $x$ in the training set $\{x_i\}_{i=1}^{n}$. Note that $k$ is chosen manually.

Local Reachability Density

$$\mathrm{LRD}_k(x) = \left(\frac{1}{k}\sum_{i=1}^{k}\mathrm{RD}_k(x^{(i)}, x)\right)^{-1}$$

Local Outlier Factor

$$\mathrm{LOF}_k(x) = \frac{\frac{1}{k}\sum_{i=1}^{k}\mathrm{LRD}_k(x^{(i)})}{\mathrm{LRD}_k(x)}$$

Evidently, the larger $\mathrm{LOF}_k(x)$ is, the more likely it is that $x$ is an outlier. The algorithm is conceptually simple and its principle is intuitive, but when $n$ is very large it requires a substantial amount of computation. Here is a simple example:
n=100; x=[(rand(n/2,2)-0.5)*20; randn(n/2,2)]; x(n,1)=14;
k=3; x2=sum(x.^2,2);
[s,t]=sort(sqrt(repmat(x2,1,n)+repmat(x2',n,1)-2*x*x'),2);

for i=1:k+1
  for j=1:k
    RD(:,j)=max(s(t(t(:,i),j+1),k), s(t(:,i),j+1));
  end
  LRD(:,i)=1./mean(RD,2);
end
LOF=mean(LRD(:,2:k+1),2)./LRD(:,1);

figure(1); clf; hold on
plot(x(:,1),x(:,2),'rx');
for i=1:n
  plot(x(i,1),x(i,2),'bo','MarkerSize',LOF(i)*10);
end

In unsupervised learning problems, there is usually little information about the outliers. However, when a set of known normal samples $\{x'_{i'}\}_{i'=1}^{n'}$ is given, we can identify outliers in the test set $\{x_i\}_{i=1}^{n}$ with some confidence. The Kullback-Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the probability density ratio of normal samples to test samples,

$$w(x) = \frac{p'(x)}{p(x)},$$

where $p'(x)$ is the probability density of the normal samples and $p(x)$ is that of the test samples, while avoiding direct estimation of the two densities. For a normal sample the ratio is close to 1, while for an outlier it deviates from 1.

To begin with, we model the density ratio with a parameterized linear model:

$$w_{\alpha}(x) = \sum_{j=1}^{b}\alpha_j\psi_j(x) = \alpha^T\psi(x),$$

where $\alpha = (\alpha_1,\cdots,\alpha_b)^T$ is the parameter vector and $\psi(x) = (\psi_1(x),\cdots,\psi_b(x))^T$ is a vector of non-negative basis functions. Then $w_{\alpha}(x)p(x)$ can be seen as an estimate of $p'(x)$. Define the similarity between $w_{\alpha}(x)p(x)$ and $p'(x)$ as the KL divergence, i.e.

$$\mathrm{KL}\left(p' \,\|\, w_{\alpha}p\right) = \int p'(x)\log\frac{p'(x)}{w_{\alpha}(x)p(x)}\,dx.$$

In general, the KL divergence is non-negative and equals zero only if $w_{\alpha}p = p'$. When the KL divergence is sufficiently small, $w_{\alpha}p$ can be regarded as close to $p'$. To guarantee that $w_{\alpha}p$ is a well-defined density, we impose the following constraints:

$$\int w_{\alpha}(x)p(x)\,dx = 1, \qquad w_{\alpha}(x)p(x)\ge 0\ \ \forall x.$$

Since the terms of the KL divergence that do not involve $w_{\alpha}$ are independent of $\alpha$, minimizing it amounts to maximizing $\int p'(x)\log w_{\alpha}(x)\,dx$. Approximating the expectations by sample averages turns the estimation into the following optimization problem:

$$\max_{\alpha}\ \frac{1}{n'}\sum_{i'=1}^{n'}\log w_{\alpha}(x'_{i'}) \quad \text{s.t.} \quad \frac{1}{n}\sum_{i=1}^{n}w_{\alpha}(x_i)=1, \quad \alpha_1,\ldots,\alpha_b\ge 0.$$

We briefly summarize the estimation process:
1. Initialize $\alpha$.
2. Repeat the following updates until $\alpha$ reaches a suitable precision:

$$\alpha \leftarrow \alpha + \epsilon A^T\left(1./(A\alpha)\right)$$
$$\alpha \leftarrow \alpha + \frac{(1 - b^T\alpha)\,b}{b^T b}$$
$$\alpha \leftarrow \max(0,\alpha)$$
$$\alpha \leftarrow \alpha / (b^T\alpha)$$

Here $A$ is the matrix whose $(i',j)$-th element is $\psi_j(x'_{i'})$, $b$ is the vector whose $j$-th element is $\frac{1}{n}\sum_{i=1}^{n}\psi_j(x_i)$, $\epsilon$ is a small step size (0.01 in the code below), and $1./(A\alpha)$ denotes the element-wise reciprocal of $A\alpha$.
Here is an example (Gaussian kernel model):
function [ a ] = KLIEP( k, r )
% Estimate the density-ratio parameters alpha (save as KLIEP.m).
a0=rand(size(k,2),1); b=mean(r)'; c=sum(b.^2);
for o=1:1000
  a=a0+0.01*k'*(1./(k*a0));        % gradient step
  a=a+b*(1-sum(b.*a))/c;           % satisfy the equality constraint
  a=max(0,a);                      % non-negativity
  a=a/sum(b.*a);                   % normalization
  if norm(a-a0)<0.001, break, end
  a0=a;
end
end

n=100; x=randn(n,1); y=randn(n,1); y(n)=5;
hhs=2*[1,5,10].^2; m=5;
x2=x.^2; xx=repmat(x2,1,n)+repmat(x2',n,1)-2*(x*x');
y2=y.^2; yx=repmat(y2,1,n)+repmat(x2',n,1)-2*y*x';
u=floor(m*(0:n-1)/n)+1; u=u(randperm(n));
for hk=1:length(hhs)
  hh=hhs(hk); k=exp(-xx/hh); r=exp(-yx/hh);
  for i=1:m
    g(hk,i)=mean(k(u==i,:)*KLIEP(k(u~=i,:),r));
  end
end
[gh,ggh]=max(mean(g,2)); HH=hhs(ggh);
k=exp(-xx/HH); r=exp(-yx/HH); s=r*KLIEP(k,r);
figure(1); clf; hold on; plot(y,s,'rx');

Furthermore, outlier detection can be done using support vector machine techniques. Due to the limited time, we only outline the main structure of the algorithm. A typical SVM outlier detector finds a hyper-ball that contains nearly all the sample points; a point lying outside the hyper-ball can then be regarded as an outlier. Concretely, we obtain the center $c$ and radius $R$ by solving the following optimization problem:
$$\min_{c,R,\xi}\ \left(R^2 + C\sum_{i=1}^{n}\xi_i\right) \quad \text{s.t.} \quad \|x_i - c\|^2 \le R^2 + \xi_i,\ \ \xi_i\ge 0,\ \ \forall i=1,2,\ldots,n.$$

It can be solved using Lagrange multipliers:

$$L(c,R,\xi,\alpha,\beta) = R^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(R^2 + \xi_i - \|x_i - c\|^2\right) - \sum_{i=1}^{n}\beta_i\xi_i.$$

Its dual problem can be formulated as

$$\max_{\alpha,\beta}\ \inf_{c,R,\xi} L(c,R,\xi,\alpha,\beta) \quad \text{s.t.} \quad \alpha\ge 0,\ \beta\ge 0.$$

The KKT conditions give

$$\frac{\partial L}{\partial c}=0 \ \Rightarrow\ c = \frac{\sum_{i=1}^{n}\alpha_i x_i}{\sum_{i=1}^{n}\alpha_i}, \qquad \frac{\partial L}{\partial R}=0 \ \Rightarrow\ \sum_{i=1}^{n}\alpha_i = 1, \qquad \frac{\partial L}{\partial \xi_i}=0 \ \Rightarrow\ \alpha_i + \beta_i = C,\ \forall i=1,2,\ldots,n.$$

Substituting these back into the Lagrangian, the dual problem reduces to

$$\hat{\alpha} = \arg\max_{\alpha}\ \left(\sum_{i=1}^{n}\alpha_i x_i^T x_i - \sum_{i,j=1}^{n}\alpha_i\alpha_j x_i^T x_j\right) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i = 1,\ \ 0\le\alpha_i\le C,\ \forall i=1,2,\ldots,n.$$

This is a typical quadratic programming problem. After solving it, we can recover $c$ and $R$:

$$\hat{c} = \sum_{i=1}^{n}\hat{\alpha}_i x_i, \qquad \hat{R}^2 = \left\|x_i - \sum_{j=1}^{n}\hat{\alpha}_j x_j\right\|^2,$$

where $x_i$ is any support vector with $0 < \alpha_i < C$, which lies exactly on the boundary $\|x_i - c\|^2 = R^2$. Hence, a sample point $x$ satisfying $\|x - \hat{c}\|^2 > \hat{R}^2$ can be viewed as an outlier.
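As a rough sketch of how this quadratic program might be solved numerically, the MATLAB snippet below uses quadprog from the Optimization Toolbox. The toy data, the choice $C=0.1$, and the numerical tolerances are illustrative assumptions and not part of the derivation above.

% Sketch (illustrative assumptions): solve the dual above with quadprog.
n=100; x=[randn(n-1,2); 8 8];             % toy data with one planted outlier
C=0.1;                                    % trade-off parameter
K=x*x';                                   % Gram matrix, K(i,j)=x_i'*x_j

% Dual: max  sum_i a_i*K(i,i) - a'*K*a   s.t. sum(a)=1, 0<=a_i<=C.
% quadprog minimizes 0.5*a'*H*a + f'*a, so set H=2*K and f=-diag(K).
a=quadprog(2*K, -diag(K), [], [], ones(1,n), 1, zeros(n,1), C*ones(n,1));

c=x'*a;                                   % center c = sum_i a_i x_i
sv=find(a>1e-6 & a<C-1e-6);               % boundary support vectors (0<a_i<C)
if isempty(sv), sv=find(a>1e-6); end      % fallback if none are strictly interior
R2=mean(sum((x(sv,:)-repmat(c',numel(sv),1)).^2,2));   % squared radius

d2=sum((x-repmat(c',n,1)).^2,2);          % squared distance of each sample to c
outliers=find(d2>R2);                     % points outside the hyper-ball

figure(1); clf; hold on
plot(x(:,1),x(:,2),'bo');
plot(x(outliers,1),x(outliers,2),'rx','MarkerSize',12);

The squared radius is estimated from the boundary support vectors, i.e. those with multipliers strictly between 0 and $C$, in line with the expression for $\hat{R}^2$ above.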