Lightweight People Counting and Localizing for Easily Deployable Indoors WSNs
science projects buddy|
21-12-2010, 01:38 AM
This article describes a lightweight method for counting and localizing people using camera sensor networks. The algorithm uses a motion histogram to detect people based on motion and size criteria. The motion histogram is an averaged shifted histogram that estimates the distribution of people in a room from the above-threshold pixels of a frame-differenced "motion" image. The algorithm provides good detection rates at low computational complexity. The article describes the details of the design and evaluates the resulting histogram and counting algorithm. The design is implemented on iMote2 sensor nodes for field testing, using a custom sensor board with an off-the-shelf camera. The motion histogram is designed to adapt easily to ultralow-power address-event motion imagers.
summer project pal|
03-02-2011, 09:09 PM
Lightweight People Counting and Localizing for Easily Deployable Indoors WSNs
ABDUL AZEEM Z
Applied Electronics and Instrumentation
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
COLLEGE OF ENGINEERING
This seminar describes a lightweight method for counting and localizing people using camera
sensor networks. The algorithm makes use of a motion histogram to detect people based on
motion and size criteria. The motion histogram is an averaged shifted histogram that estimates
the distribution of people in a room given the above-threshold pixels in a frame-differenced
"motion" image. The algorithm provides good detection rates at low computational complexity.
In this paper, we describe the details of our design and experimentally determine suitable
parameters for the proposed histogram. The resulting histogram and counting algorithm are
implemented and tested on a network of iMote2 sensor nodes. The implementation on sensor
nodes uses a custom sensor board with a commercial off-the-shelf camera, but the motion
histogram is designed to easily adapt to ultralow-power address-event motion imagers.
TABLE OF CONTENTS
1 INTRODUCTION
2 BACKGROUND AND RELATED WORK
3 NEW APPROACH
4 MOTION HISTOGRAM
4.1 Overview
4.2 Histogram Structure
4.3 Filling the Histogram
4.4 Tuning the Histogram
5 CAMERA-NODE LEVEL COUNTING
6 EXPERIMENTAL RESULTS
6.1 Histogram Parameters
6.2 Histogram Resolution
6.3 Network Implementation
7 CONCLUSION
8 REFERENCES
INTRODUCTION
A system that counts and localizes people is a common requirement in a broad
spectrum of applications, such as assisted living, home care, security, workplace safety, and
entertainment. For such a system to work for prolonged periods of time in an indoor environment
where new people may enter and leave, and where objects may be introduced, replaced, or
moved, it cannot rely on wearable sensors or object tags. Also, for scalability purposes, it should
be low cost and easy to install, requiring little or no on-site calibration. Moreover,
many applications require the system to observe a certain level of privacy
regarding the people in the scene.
In response to these demands, our research pursues the development of lightweight
motion-discriminative sensors that bridge the gap between specialized scalar sensors such as
passive infrared (PIR) and generic array-based ones such as cameras. Our approach to this
challenge is to explore biologically inspired address-event (AE) architectures that operate
asynchronously at the pixel level to provide feature information instead of just images. AE
sensors also have the advantage of being typically ultralow power. This paper describes a
lightweight algorithm for people-counting and localizing given the output of a motion-
discriminating address-event imager.
Our previous work provided an initial evaluation of new address-event imager
architectures and a model for emulating such architectures on wireless sensor nodes to test AE
algorithms. In this paper, we present and evaluate a design for localizing and counting people in
indoor spaces with a set of wide angle camera sensor nodes mounted to the ceiling, facing down.
The main contribution of the work described here is the design and evaluation of the lightweight
histogram-based method for localizing people using motion and size information. By employing
an address-event motion imager, the computational requirements of the histogram can be further
reduced to operate on even smaller processors. The paper concludes with a use case in which
human locations collected from a wireless sensor network (WSN) deployed inside a house are
processed in the context of a building floor plan to recognize the activities of the house inhabitants.
BACKGROUND AND RELATED WORK
There are many systems in the literature that aim to detect, count, localize, and/or track
humans. Lately, most of them utilize cameras as the main sensing modality, but there are others
that rely on more unusual modalities such as pressure sensors and PIR arrays. In previous work,
pressure-sensitive floor tiles are installed and consecutive footsteps are monitored to discern the
tracks of multiple people. However, this approach requires a laborious installation process that
makes it unfit for most existing environments. Meanwhile, PIR arrays have been used to count the
people on a set of stairs. This is similar to approaches where a camera is employed but only for the information
contained in its center scan lines. These types of approaches are used to count people at the
entrance and exit points of closed spaces, which requires very few sensors. On the other hand,
counting errors that occur at detection time end up propagating indefinitely.
As for more traditional camera-based human detectors, there are those that try to segment a
human from an image by comparing it to an empty background frame, and those that directly
employ some type of pattern matching. The patterns to be matched can be the eigenvectors given
by the principal component analysis of a library of human images, or tuples of features such as
SIFT and gradient orientation histograms. Pattern-matching approaches depend on extensive
training, and the feature extraction and matching processes can be computationally intensive.
The more typical approach is to employ background differencing followed by a series of
morphological operations in order to obtain a workable silhouette of a person to be segmented (or
"blobbed"). See Fig. 2(a). This silhouette can be used to confirm that the blobbed object is indeed a
human, or to determine the shape of the bounding box from which other features will be
extracted for that purpose. Since the low-level morphological operations do not guarantee that
each person translates to exactly one blob, a further pass has to be performed where blobs that
are close enough are merged together. The end result is that it is common to merge blobs that do
not belong together, as well as to separate blobs that compose the same object. These algorithms
usually perform additional steps after blobbing is done, in an attempt to correct such anomalies.
Fig. 2. Advantages of employing address-event (AE) imagers: since the computation starts at the pixel
level, initial feature-detection steps can be skipped. Furthermore, the algorithms that operate on AE data
are typically simpler.
Some researchers utilize stereo cameras to assist in the image segmentation process. In
one such paper, the authors describe their tracking system for assisted living. Their
background model takes into consideration only the pixel intensity oscillations, and would fail in a
less controlled environment. More importantly, their system does not handle rooms larger than
the single stereo pair’s field of view. Traditionally, human tracking is achieved by first
detecting the people that are visible in each frame and afterwards tracking them across multiple
frames utilizing either extracted features (such as size, color histograms, edge orientation, etc.),
or motion correspondence using a Kalman filter, a particle filter, or other methods.
For the first part, that is, the problem of detecting people in a video frame, the typical
approach is to employ background differencing followed by a series of morphological operations
in order to obtain a workable silhouette of a person to be "blobbed". Since the low-level
morphological operations don’t guarantee that each person translates to exactly one blob, a further
pass has to be performed where blobs that are close enough are merged together. The end result
is that it is common to merge blobs that do not belong together, as well as to separate blobs that
compose the same object. This has the additional effect of adding uncertainty to the locations
that are extracted from the blob. Some have attempted to obtain more precise locations from
each blob by employing the distance transform rather than center-of-mass or foot
estimation, but that approach fails for fragmented blobs.
Finally, there is the problem of maintaining and updating the background model. This is a
necessary process due to the presence of a series of change factors in a stream of frames, among them:
1) natural oscillations in pixel intensity;
2) gradual changes in lighting, such as those imposed by the movement of the sun;
3) presence of repetitive background motion, such as waving foliage;
4) changes in position of static objects, such as furniture.
The simplest adaptive background-modeling technique is to continuously average all
frames. This has the undesirable effect of generating "ghosts" in the areas where there is most
activity, which ends up making the subtraction in those areas less reliable. Better adaptive
background-modeling approaches are typically computationally expensive, sometimes modeling
each single pixel as a mixture of Gaussians or a Kalman filter. Many of these approaches
require the field of view (FOV) to be empty at initialization, something that may not be
possible in the practical settings we are interested in. Even then, in the presence of scenario
4 above, most approaches either fail or recover slowly. In assisted-living and office situations,
though, these background changes occur very often. Take as an example the presence of office
chairs, which are moved every time someone sits or stands.
In light of the aforementioned problems with background differencing, the work
presented in this paper bypasses many of these issues by making use of frame differencing
instead [Fig. 2(b)]. Frame differencing consists of subtracting the previous frame from the
current one, to detect pixels that changed in intensity. The resulting frame is subsequently
thresholded, resulting in a boolean image. This way, the complex background modeling steps
become unnecessary, freeing system resources. Frame differencing, however, can generate
images that are harder to segment. This is probably one of the main reasons why the computer-
vision community has largely preferred background subtraction approaches. Our solution to this
is to constrain certain aspects of our deployment in order to allow us to make simplifying
assumptions. Namely, we place our cameras on the ceiling, facing straight down, and assume
the ceiling height is known. Another problem associated with frame differencing is that people
can only be detected while moving. In this paper, we make use of additional features to track
people while they are stationary.
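The frame-differencing step described above can be sketched in a few lines. This is only an illustration under stated assumptions: frames are taken to be lists of rows of grayscale intensities, and the function name `frame_difference` and the threshold value are hypothetical, not taken from the paper.

```python
from typing import List

Frame = List[List[int]]  # grayscale intensities, one inner list per image row


def frame_difference(prev: Frame, curr: Frame, threshold: int) -> List[List[bool]]:
    """Return a boolean motion image: True where |curr - prev| > threshold."""
    if len(prev) != len(curr) or any(len(p) != len(c) for p, c in zip(prev, curr)):
        raise ValueError("frames must have the same dimensions")
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


# Two tiny made-up frames: two pixels change by more than the threshold.
prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 90, 12], [10, 10, 80]]
motion = frame_difference(prev_frame, curr_frame, threshold=20)
```

No background model is kept between calls; only the previous frame is needed, which is what keeps the memory and computation requirements low.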
More importantly, our approach is built from the ground up with address-event image
sensors in mind. These sensors are biomimetic cameras that move the feature detection step
into the imager’s pixels, taking inspiration from the retina. Much like neurons in the retina,
each pixel asynchronously emits a pulse (or spike) when an event is captured. By "event" it is
meant anything that can be measured. In the specific case of the imagers for which we
designed our motion histogram, an event is signaled whenever a pixel detects a certain intensity
variation. The address-event representation (AER) protocol is then used for multiplexing all
these spikes into an output data bus. For each incoming spike, the AER encoder
outputs the address of the originating pixel onto the bus. The event magnitude information is not directly encoded.
This differs from typical cameras, where the magnitude is the main unit of information, and
where large arrays are transmitted regardless of whether the scene is of interest. In AE cameras,
magnitude is naturally encoded into the event frequency. That is, in a motion-sensitive AE
imager, a pixel that detects the most motion fires most often. Another distinction between these
two types of cameras is that address-event cameras do not discretize time into frames, which
leads to high-precision time measurements that could otherwise be obtained only with ultrahigh-speed
cameras. Surprisingly, the power consumption of AE imagers is typically on the order of
milliwatts or hundreds of microwatts.
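A rough way to picture the address-event output is to emulate it in software from an ordinary difference image. The sketch below is an assumption-laden toy, not the imager's actual behavior: events are emitted in raster order rather than asynchronously, and the rule that a pixel fires once per `step` units of intensity change is a hypothetical stand-in for "more motion means more events".

```python
from collections import Counter
from typing import List, Tuple

Address = Tuple[int, int]  # (row, column) of the pixel that fired


def emulate_ae_events(diff: List[List[int]], step: int) -> List[Address]:
    """Turn a signed difference image into a stream of pixel addresses.

    Only addresses go on the "bus"; magnitude is implied by how often
    the same address appears.
    """
    events: List[Address] = []
    for r, row in enumerate(diff):
        for c, change in enumerate(row):
            events.extend([(r, c)] * (abs(change) // step))
    return events


# Made-up difference image: pixel (0, 1) changes the most.
diff_image = [[0, 45], [-20, 5]]
bus = emulate_ae_events(diff_image, step=10)
rates = Counter(bus)  # events per address approximates the motion magnitude
```

Counting events per address recovers a coarse magnitude, which mirrors the statement that magnitude is encoded in the event frequency.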
NEW APPROACH
Humans can recognize and count other humans based on shape, size, and movement. The
background differencing approach attempts to extract and operate on mainly the first two types
of information. We choose to focus on the latter two, while at the same time simplifying them by
introducing a set of constraints on the deployment and the environment. First, we assume that
people inside the room are typically in motion. Even though this does not always hold, it is
certainly true for each person at some instant in time. Second, in order to cover a large area
(requiring fewer sensors) and to minimize occlusions, we choose to place the cameras on the
ceiling, facing straight down. In this configuration, and given the ceiling height, it is fair to
assume that human size lies within a certain predefined range. Using these two assumptions, our
goal is to classify as a human each image entity that meets our movement and size criteria and
extract their discrete physical location from our measurements.
To this purpose, we construct a motion histogram from frame-differenced images and
utilize that information to pinpoint each person’s location. The histogram is designed to consider
a typical human size in pixels, given the known characteristics of our camera and the ceiling
height, and use it to compute the discrete human locations (histogram peaks) that best explain
the moving pixels in the frame-differenced image. These locations can then be processed with
higher level algorithms to track each person and recognize their behavior. However, the focus of
this paper is not the unique labeling of each human and the association problems that arise, but
rather our lightweight sensing algorithm for human detection and localization.
MOTION HISTOGRAM
4.1 Overview
For simplicity, consider the 1-D case of detecting a person in the cross section of a
background-differenced thresholded image. As shown in Fig. 3(a), the foreground pixels for each
person are, ideally, all connected. However, this is usually not the case, especially for frame-differenced
images, where above-threshold "motion" pixels are often sparsely distributed. The
approach described in this paper gets around these issues by assuming that the size of a human is
approximately known. This allows a motion histogram to be created, which is in effect an
estimate of the unknown bivariate probability density function of the locations of moving
objects. The locations which have the locally highest probability are then selected as human
locations.
Given that the ceiling height is known (i.e., the camera’s z coordinate), the dimensions of
the bounding box that encloses the image of a human can be approximated by some known
average value. Let w be the average width of a human in the cross section of an image frame.
Then a histogram may be produced by binning foreground pixels within bins of size w. This is
pictured in Fig. 3(b), where w=3. A mode-finding algorithm can then be utilized to discriminate
each moving person in the scene. Although the histogram in the figure detects the first person’s
position very precisely, the position of the second person is ambiguous. This effect is a
consequence of the particular choice of bin origin. Of course, in this case the second person’s
location can be better estimated by cleverly weighting each bin’s coordinates according to the
bin’s values, but that would bring back the connectedness issues seen in typical image blobbing.
A better approach is to simultaneously use multiple different bin origins. Thus, a single high-
resolution histogram can be composed from multiple shifted histograms, as shown in Fig. 3(c)
and (d). This type of histogram is called an averaged shifted histogram (ASH).
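The 1-D construction can be illustrated with a short sketch. It composes several ordinary histograms that share the bin width `w` but start at different origins, and keeps the raw counts (no normalization), in line with the uniform-weight, count-based histogram used in this report. The function name, the bin shift parameter `delta`, and the example row of pixels are illustrative assumptions, not values from the paper.

```python
from typing import Dict, List


def shifted_histograms_1d(pixels: List[int], w: int, delta: int) -> Dict[int, int]:
    """Combine histograms of bin width w whose origins are shifted by delta.

    Returns {bin origin: number of foreground pixels in [origin, origin + w)}.
    """
    if w % delta != 0:
        raise ValueError("this sketch assumes w is a multiple of delta")
    bins: Dict[int, int] = {}
    for shift in range(0, w, delta):                 # one ordinary histogram per origin
        for origin in range(shift, len(pixels) - w + 1, w):
            bins[origin] = sum(pixels[origin:origin + w])
    return dict(sorted(bins.items()))


# Thresholded cross section: person 1 occupies pixels 0-2, person 2 pixels 7-9.
row = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]

single = shifted_histograms_1d(row, w=3, delta=3)    # one origin only
ash = shifted_histograms_1d(row, w=3, delta=1)       # three shifted origins

# Best bin for the second person, and the pixel at the center of that bin.
origin_2 = max((o for o in ash if o >= 4), key=ash.get)
center_2 = origin_2 + (3 - 1) / 2
```

With a single origin the second person is split across two bins (counts 2 and 1), while the combined histogram has a clear maximum of 3 in the bin that covers the person exactly.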
The modes of the ASH in Fig. 3(d) are much better estimates of each person’s location.
Although the shifted histograms in Fig. 3(c) are shifted by δ=1 pixel, larger values of δ are often used.
In the case when δ=1, the ASH becomes a convolution. The smaller the δ, the
higher the achieved resolution, since the worst-case peak location error for a histogram is given
by δ/2. However, when employing frame-differenced images, higher resolutions often produce
histogram peaks that do not represent the modes of the underlying distribution. The reasoning
behind this requires one to consider each thresholded pixel as a Bernoulli variable.
For this "expected" histogram, the δ that gives the smallest person-localization error will
indeed be δ=1. However, most instances of this histogram will display very jagged lines,
which can easily produce false-positive peak detections. This situation can be dealt with by low-pass
filtering the histogram (using a Gaussian kernel, for example, or ASH weights) to smooth out these
false peaks. However, that introduces the problem of choosing the best cutoff frequency so as
not to drop valid peaks. This is where ASHs with δ>1 become attractive: intuitively, the large
bin shifts produce an effect similar to a low-pass filter at no extra cost, with the advantage that the
ASH’s parameters have a physical interpretation.
Another similarity between ASHs and convolutions is that ASHs may also employ
different weights for pixels as a function of their distance to the bin center, in an attempt to
further increase the histogram’s accuracy. This is analogous to convolving the thresholded
image with a given kernel; again, the main difference is that the shift parameter δ is always
1 for convolutions.
To summarize, the motion histogram described in this paper is a bivariate ASH with
uniform weights (which make the ASH comparable to a convolution with a square mask). The
histogram is calculated over the absolute value of the difference between the current image
frame and the previous one. Moreover, our ASH is built from
nonuniform bins, which are modeled after the different shapes people take as they move away from
the optical axis. This effect is greatly accentuated when using wide-angle lenses, which is the
case in our deployment.
Fig. 3. (a) Cross section of thresholded background-differenced image with two people present. (b)
Histogram with three-pixel-wide bins. People are detected at the histogram peaks, but the right-most
person’s position is ambiguous. (c) Superimposition of three histograms with same bin width but different
bin origins. (d) Combining the three histograms from (c), an averaged shifted histogram is formed.
Notice how the histogram peak is now a better estimate of the person’s position. Since the origin shift
was in 1-pixel increments, this last histogram looks like the result of a convolution, but other shifts may be
used.
4.2 Histogram Structure
The primary goal of the motion histogram is to determine the probable location of
each person given the coordinates of the moving pixels in each frame. The value of
each histogram bin corresponds to the number of foreground pixels in a unique area of
the image. In Figure 1, bin b is associated with the set of pixels in the blue square on the
top-left side of the image. It is said that b contains those pixels. Therefore, the relation
$g : H \to \mathcal{P}(I)$ can be defined, mapping each bin in the histogram H to the set of pixels in
the image I that it contains. Conversely, $h : I \to \mathcal{P}(H)$, with $h(x) = \{\, b \in H : x \in g(b) \,\}$,
gives for each pixel the set of bins that contain it.
For each bin b, we define g(b) according to the size of a human and their
possible physical locations. In the figure, adjacent bins overlap with each other, working
as a discretized sliding window across the image in both the vertical and horizontal
directions. The bin size is calculated from the expected image size of a human, so that,
in optimal conditions, a person in the field of view of the overhead camera is entirely
covered by a single bin, and partially covered by neighboring bins. In typical operation,
though, people may span multiple bins (when they extend their arms, for example), but
the algorithm described here still holds. If the bin areas on the left side of Figure 1 are
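square with width w, and if the smallest distance between bin centers is δ, then g can be
defined as
$$g(b) = \{\, x \in I : \delta b_x \le x_x < \delta b_x + w,\ \delta b_y \le x_y < \delta b_y + w \,\}$$
where $b_x$ and $b_y$ are the coordinates of bin b in the histogram and $x_x$ and $x_y$ are the
coordinates of pixel x in the image. Similarly, the h for the histogram described in the
figure is the set of bins whose areas contain the pixel: $h(x) = \{\, b \in H : x \in g(b) \,\}$.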
Fig. 1. (a) Histogram structure: histograms are composed of multiple bins defined from overlapping areas
in the image (left).
The bin size is calculated from human dimensions, and each bin can be uniquely identified by its
top-left corner position. Using these positions, a more traditional representation of the histogram
may be composed (right).
4.3 Filling the Histogram
The histogram is filled using motion information from the difference of two consecutive
frames. The algorithm for filling the histogram at each frame resets all bins to value 0, and then,
for each above-threshold pixel, increments all bins that contain it. That is, given an above-threshold
pixel x, we increment all bins in the set h(x). The end result is that each histogram bin is assigned a
value corresponding to the total number of foreground pixels it encompasses:
$$H(b) = \left|\{\, x \in g(b) : |I_t(x) - I_{t-1}(x)| > T \,\}\right|$$
where the outer vertical bars denote set cardinality, $I_t$ and $I_{t-1}$ are the current and previous
frames, and T is the motion threshold. The location of a
person on the image plane can then be computed by running a peak-finding algorithm on the
histogram.
Fig. 5. Detecting positions from motion images: each pixel in the image is mapped to one or more
histogram bins. Bin values are incremented for each foreground pixel the bin contains. Histogram peaks
detect people’s positions. Note that, for simplicity, this diagram shows each bin connected to a different
four-pixel area. In reality, however, bins encompass many more pixels.
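The bin-to-pixel mapping g, its inverse h, and the filling step can be sketched together as follows. This is a minimal illustration assuming square bins of width `w` whose top-left corners are spaced `delta` pixels apart; the function names and the tiny 4x4 example are hypothetical and chosen only to keep the numbers easy to check by hand.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pixel = Tuple[int, int]  # (x, y) in image coordinates
Bin = Tuple[int, int]    # (bx, by) in histogram coordinates


def build_g(width: int, height: int, w: int, delta: int) -> Dict[Bin, Set[Pixel]]:
    """Map each bin to the w-by-w block of pixels it contains."""
    g: Dict[Bin, Set[Pixel]] = {}
    for by in range((height - w) // delta + 1):
        for bx in range((width - w) // delta + 1):
            g[(bx, by)] = {
                (x, y)
                for y in range(by * delta, by * delta + w)
                for x in range(bx * delta, bx * delta + w)
            }
    return g


def build_h(g: Dict[Bin, Set[Pixel]]) -> Dict[Pixel, Set[Bin]]:
    """Invert g: for each pixel, the set of bins that contain it."""
    h: Dict[Pixel, Set[Bin]] = {}
    for b, pixels in g.items():
        for x in pixels:
            h.setdefault(x, set()).add(b)
    return h


def fill_histogram(motion: List[List[bool]], h: Dict[Pixel, Set[Bin]],
                   bins: Iterable[Bin]) -> Dict[Bin, int]:
    """Reset all bins to 0, then increment every bin containing a moving pixel."""
    hist = {b: 0 for b in bins}
    for y, row in enumerate(motion):
        for x, moving in enumerate(row):
            if moving:
                for b in h.get((x, y), ()):
                    hist[b] += 1
    return hist


# 4x4 motion image with a 2x2 block of moving pixels in the middle.
motion_image = [
    [False, False, False, False],
    [False, True, True, False],
    [False, True, True, False],
    [False, False, False, False],
]
g = build_g(width=4, height=4, w=2, delta=1)
h = build_h(g)
hist = fill_histogram(motion_image, h, g.keys())
peak = max(hist, key=hist.get)
```

Precomputing h once per deployment means the per-frame work is proportional to the number of moving pixels, not to the image size times the number of bins.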
The histogram filling so far produces new bin values for each new video frame without
taking into consideration the histogram for the previous frame, producing noise-prone centroids.
For increased robustness, a modified algorithm (Figure 4) takes care of this by incorporating
the composition variable, α. Each new histogram is superimposed on the previous histograms,
with transparency 0 < α <= 1. Hence, the instance where the past histogram values are not
considered is a special case of this, with α = 1. The figure below shows a histogram produced
by the superimposed-histogram algorithm.
Example histogram frame. (a) Moving pixels. (b) Resulting histogram.
People are detected at histogram peaks.
For additional robustness against noise, the histogram filling algorithm may be modified to
incorporate information from previous frames. This is accomplished through the use of a
blending variable, as shown in the pseudocode. This way, each new histogram is superimposed on the
previous histograms, and the variable acts as the blending factor, here the weight given to the
previous histograms (the complement of the new histogram's transparency). Hence, the instance where the
past histogram values are discarded is a special case of this, with a blending factor of 0. The figure above shows a
histogram produced by this algorithm. An undesirable effect of employing a
large blending factor is that a person’s peak will tend to lag behind the person’s actual location, but we found that
values smaller than 0.5 typically produce acceptable results in this regard.
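One plausible reading of the superimposition step is a per-bin weighted blend of the previous and the new histogram. The sketch below assumes exactly that; the parameter name `past_weight` (the weight given to the previous histogram, so 0 discards the past) and the example values are illustrative, and the pseudocode referenced in the text may differ in detail.

```python
from typing import Dict, Tuple

Bin = Tuple[int, int]


def blend_histograms(previous: Dict[Bin, float], current: Dict[Bin, float],
                     past_weight: float) -> Dict[Bin, float]:
    """Superimpose the new histogram on the previous one, bin by bin."""
    if not 0.0 <= past_weight < 1.0:
        raise ValueError("past_weight must be in [0, 1)")
    return {
        b: past_weight * previous.get(b, 0.0) + (1.0 - past_weight) * value
        for b, value in current.items()
    }


# A person moves from bin (0, 0) to bin (1, 0) between two frames.
previous_hist = {(0, 0): 4.0, (1, 0): 0.0}
current_hist = {(0, 0): 0.0, (1, 0): 4.0}

blended = blend_histograms(previous_hist, current_hist, past_weight=0.25)
no_memory = blend_histograms(previous_hist, current_hist, past_weight=0.0)
```

The residue left in the old bin is what makes peaks lag behind the person when the past is weighted heavily, which is consistent with the observation that factors below 0.5 behave acceptably.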
Another measure against histogram noise is to incorporate a threshold after the peak-finding
algorithm, such that only the peaks that are tall enough are utilized in people-counting and
localizing. This value can be estimated by calculating the motion histogram for an empty scene,
and choosing the mean bin height plus two standard deviations. Although this threshold is an
empirical value, we find that it does not have to be a large number (a small percentage of the number of pixels in
the bin) and this value should rarely require further adjustments.
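The threshold estimate and the thresholded peak search can be sketched as below. Several choices here are assumptions made for illustration: the population standard deviation (`statistics.pstdev`) is used, a peak is defined as a bin strictly greater than its eight neighbors, and the empty-scene and example histograms are made up.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Bin = Tuple[int, int]


def noise_threshold(empty_scene_hist: Dict[Bin, float]) -> float:
    """Mean bin height of an empty scene plus two standard deviations."""
    values = list(empty_scene_hist.values())
    return mean(values) + 2 * pstdev(values)


def find_peaks(hist: Dict[Bin, float], threshold: float) -> List[Bin]:
    """Bins above the threshold that are strictly taller than all 8 neighbors."""
    peaks = []
    for (bx, by), value in hist.items():
        if value <= threshold:
            continue
        neighbors = [
            hist.get((bx + dx, by + dy), 0)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        ]
        if all(value > n for n in neighbors):
            peaks.append((bx, by))
    return peaks


# Made-up histogram of an empty scene (sensor noise only).
empty_hist = {(0, 0): 0, (1, 0): 1, (0, 1): 0, (1, 1): 1}
threshold = noise_threshold(empty_hist)

# Made-up histogram with one person centered on bin (1, 1).
scene_hist = {
    (0, 0): 1, (1, 0): 2, (2, 0): 1,
    (0, 1): 2, (1, 1): 4, (2, 1): 2,
    (0, 2): 1, (1, 2): 2, (2, 2): 1,
}
people = find_peaks(scene_hist, threshold)
```

Because of the strict comparison, a flat plateau of equal bins yields no peak in this sketch; a real implementation would need a tie-breaking rule.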
4.4 Tuning the Histogram
Wide-Angle Lens Considerations: In the case where each bin maps to equal, but shifted
areas in the image, the histogram can be seen as the result of the cross correlation of the image
with a human model. In the simplest case, this model is a square, as in our discussion so far.
Another possibility is to utilize a more complex function as a kernel, such as a multivariate
Gaussian distribution. For the other types of models considered later in this paper, the histogram-producing
operation will no longer be a cross correlation, since the kernel shape will vary with
bin position.
The type of model utilized has an immense effect on the performance of the histogram.
This is an extension of the effects that are seen in a cross correlation: the breadth and height of
the correlation peaks are the best when the kernel perfectly matches the image. In the case of the
motion histogram, if the model is too small, multiple histogram peaks may appear for each
person. If, on the other hand, it is too large, then the chance that two people incorrectly produce
only one peak increases. Similar considerations must be made when picking the window-shift
step size, δ: if the bins are too close, multiple bins may enclose the same person; if too far, the
person may be missed entirely. These parameters are initially picked to match the average human
dimensions in the described setup, then fine-tuned empirically.
There are two additional effects that have not yet been accounted for, but which must
be considered when building the histogram: perspective and lens distortion. Their effect is
especially accentuated for wide-angle lenses and situations where the object distance is fairly
small compared to its length (in the direction of the optical axis). Since a person is relatively
large compared to the typical ceiling height, this must be taken into account for our setup. The
top-view camera in the figure produces very distinct images for each of the people depending on their
distance from the center axis of the camera: people near the center of the image appear as seen
from the top, while those at the edges are seen diagonally from the side. Hence, the square
histogram bins yield good results for subjects near the center of the image, where there is an
approximate top-view, but not so much as people wander toward the image edges.
Fig. Effect of perspective and lens distortion on histogram.
(a) Ground-truth positions.
(b) Image from top-view camera.
(c) 3-D bin model in the two locations that best match the ground-truth position.
(d) Bins projected using camera and deployment parameters. These are the bin-to-image-area mappings of the two peak
bins of the motion histogram.
Accounting for this, a human model is derived from a 3-D object projected into the
image plane using the camera’s intrinsic calibration parameters. We take a rectangular cuboid as
the 3-D model, with width and height taken from average human measurements. The model’s
image is calculated by applying geometric optics equations in conjunction with the Brown-Conrady
distortion equations to the coordinates of each of the cuboid’s corners. The projections
of the models onto the image plane are saved as boolean bitmaps, which together make up the
bin-to-image-area mapping g. The mapping h is computed in a similar fashion. The resulting bins
provide a more accurate model, as can be seen in the figure. The motion histogram can then be
constructed as follows: for each bin b in the histogram, the 3-D model is shifted by the amount
corresponding to that bin, and the value of g at b is mapped to the area within the projected image of the cuboid.
Using this model, the histogram parameters (the bin width w and the bin shift δ) can be measured in real-world
units (such as cm) rather than pixels, making them more intuitive.
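The projection of the cuboid's corners can be sketched with a pinhole model followed by the usual radial and tangential Brown-Conrady terms. Everything numeric here is hypothetical (focal lengths, principal point, camera height in meters, distortion coefficients), the camera is assumed to sit directly above the world origin looking straight down, and the step that fills the projected outline into a boolean bitmap is left out.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class Camera:
    """Ceiling camera facing straight down; all values are made up."""
    fx: float        # focal length in pixels (x)
    fy: float        # focal length in pixels (y)
    cx: float        # principal point (x)
    cy: float        # principal point (y)
    height: float    # camera height above the floor
    k1: float = 0.0  # radial distortion coefficients
    k2: float = 0.0
    p1: float = 0.0  # tangential distortion coefficients
    p2: float = 0.0


def project(cam: Camera, X: float, Y: float, Z: float) -> Tuple[float, float]:
    """Project a world point (Z = height above the floor) to pixel coordinates."""
    depth = cam.height - Z                  # distance along the optical axis
    if depth <= 0:
        raise ValueError("point is at or above the camera")
    x, y = X / depth, Y / depth             # ideal pinhole coordinates
    r2 = x * x + y * y
    radial = 1.0 + cam.k1 * r2 + cam.k2 * r2 * r2
    xd = x * radial + 2 * cam.p1 * x * y + cam.p2 * (r2 + 2 * x * x)
    yd = y * radial + cam.p1 * (r2 + 2 * y * y) + 2 * cam.p2 * x * y
    return cam.fx * xd + cam.cx, cam.fy * yd + cam.cy


def cuboid_corners_in_image(cam: Camera, x0: float, y0: float,
                            width: float, height: float) -> List[Tuple[float, float]]:
    """Image positions of the 8 corners of a cuboid standing at (x0, y0)."""
    half = width / 2.0
    return [
        project(cam, x0 + sx * half, y0 + sy * half, z)
        for sx, sy, z in product((-1, 1), (-1, 1), (0.0, height))
    ]


cam = Camera(fx=100.0, fy=100.0, cx=160.0, cy=120.0, height=2.5)
corners = cuboid_corners_in_image(cam, 0.0, 0.0, width=0.5, height=1.7)
foot_u, foot_v = corners[6]   # corner (+, +) on the floor
head_u, head_v = corners[7]   # corner (+, +) at head height
```

Even for a person standing on the optical axis, the head-level corners land farther from the image center than the floor-level ones, which is the perspective effect that square bins ignore.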
CAMERA-NODE LEVEL COUNTING
The motion histogram is designed to detect and locate moving objects. Thus, so long as
none of the objects in the camera’s FOV stop moving, the number of peaks in the histogram
should correspond to the number of people in the room. However, when people stop, something
must be done to keep their location consistent. For this reason, we further process the locations
detected with the motion histogram using a standard tracking algorithm.
The tracking problem consists of labeling each detected object with the same unique ID
in all frames where the object is present. However, for the purposes of this discussion, we relax
the required length of the unique ID, since our tracker is employed mainly as a "stopping"
detector. For this reason, if an ambiguous situation takes place (two people crossing paths, for
example), the tracking algorithm is allowed to assign new IDs to each person involved in the
ambiguity.
For these reasons, and given the power and computational limitations of sensor node
platforms, the tracker we employ does not filter the sensor data with a Kalman or particle filter.
Instead, the approach presented in the remainder of this section is based on bipartite graph
matching. Compared to the lightweight tracker used in our previous work, our current tracker
replaces the use of a histogram feature (peak height) with an image feature (color histogram).
Our tracker works as follows: at each time instant, the algorithm takes as input the set
of all peaks from the motion histogram and the set P′ of all people
detected at the previous time step. Note that we denote variables from the previous time instant
with a prime. A complete bipartite graph can be generated with one vertex for each peak at the
current frame and one for each person detected at the previous frame,
where the weight of each edge is given by a function
w. Then, the purpose of the tracker is to select a maximum weighted
matching of this graph, that is, to find the combination of peak-to-person assignments that globally
maximizes the given similarity function w. The matching can be computed using the Hungarian
method, with a complexity upper bound that depends on the number of vertices n and
the number of edges. The method described in an earlier paper lowers this complexity further.
Each peak is represented by a vector of features, and each detected person by
a similar vector. These are the person’s ID, their x and y location (in motion-histogram
coordinates), and the color histogram of the area in the image where the corresponding bin lies. This
area is given by the bin-to-image-plane mapping g. The color histogram utilized assigns each
pixel in that area into one of 32 bins. These features (location and color histogram) were chosen
given their positive contribution to detection rates in similar trackers.
The weight function w is then defined in terms of the Euclidean distance between a peak and a
person, normalized by the maximum possible distance (the image diagonal), the Bhattacharyya
coefficient of their color histograms, and an empirical constant in the interval [0, 1]. At
this point, an additional constraint is used: for a peak and a person to be properly matched, the
person must not have moved more than a maximum distance from one frame to the next. Given the
bin width and bin shift of the histogram in our deployment, this allows people to move at a speed
of at most 6.7 m/s. The output of the tracker is the set P of detected people.
That is, each newly matched person is represented by their new position and color histogram
along with their previous ID.
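The matching step can be sketched as follows. This is only an illustration under stated assumptions: the weight is taken to be `alpha * (1 - d) + (1 - alpha) * b`, where d is the normalized Euclidean distance and b the Bhattacharyya coefficient (the exact formula is not reproduced in this report), and the matching is found by exhaustive search instead of the Hungarian method, which is only practical for a handful of people. The names `weight`, `match`, `alpha`, `diag` and `max_dist` are hypothetical.

```python
import math
from itertools import permutations


def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient of two normalized histograms (1.0 = identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2))


def distance(a, b):
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])


def weight(peak, person, diag, alpha=0.5):
    """Assumed similarity: closer and more similarly colored means larger."""
    d = distance(peak, person) / diag  # normalized by the image diagonal
    return alpha * (1.0 - d) + (1.0 - alpha) * bhattacharyya(peak["hist"], person["hist"])


def match(peaks, people, diag, max_dist, alpha=0.5):
    """Exhaustive maximum-weight matching of current peaks to previous people."""
    if len(peaks) <= len(people):
        candidates = (list(zip(range(len(peaks)), perm))
                      for perm in permutations(range(len(people)), len(peaks)))
    else:
        candidates = (list(zip(perm, range(len(people))))
                      for perm in permutations(range(len(peaks)), len(people)))
    best_total, best_pairs = -1.0, []
    for pairs in candidates:
        kept, total = [], 0.0
        for i, j in pairs:
            # Movement constraint: pairs that are too far apart are not matched.
            if distance(peaks[i], people[j]) > max_dist:
                continue
            kept.append((i, j))
            total += weight(peaks[i], people[j], diag, alpha)
        if total > best_total:
            best_total, best_pairs = total, kept
    # Each matched person keeps their previous ID with the new position and histogram.
    return [dict(peaks[i], id=people[j]["id"]) for i, j in best_pairs]


if __name__ == "__main__":
    people = [{"id": 1, "x": 0, "y": 0, "hist": [1.0, 0.0]},
              {"id": 2, "x": 10, "y": 0, "hist": [0.0, 1.0]}]
    peaks = [{"x": 9, "y": 0, "hist": [0.0, 1.0]},
             {"x": 1, "y": 0, "hist": [1.0, 0.0]}]
    print(match(peaks, people, diag=20.0, max_dist=5.0))
```

Peaks that remain unmatched are simply left out of the result here; how the tracker creates new IDs for them is outside the scope of this sketch.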
The accuracy of a histogram is typically measured with a criterion such as the mean
integrated squared error or the expected value of the L1 norm. By minimizing this criterion with
respect to the histogram parameters, one is able to find the optimal parameters for a given
distribution. However, it is not clear what the distribution of above-threshold pixels for a moving
human is. Without this type of information, we optimize the histogram parameters empirically.
6.1 Histogram Parameters
In order to find the best bin width and bin shift, the histogram was calculated for a series of videos
taken from a ceiling-mounted USB camera. Each video shows a constant number of people, all
of whom are always in motion. The number of people in each video ranges from 0 to 5. For each
frame where the number of histogram peaks does not match the number of people in the room,
an error counter is incremented. The best histogram structure is chosen as the one that produces
the fewest errors (lowest-valued counter). Fig. a shows the effect of varying the bin width for a fixed bin shift.
Meanwhile, Fig. b is the result of varying the bin shift while holding the bin width at 50 cm. Both plots were generated
using the 3-D bin model, with the cuboid’s height set to 170 cm. Since there were no false
positives when nobody was in the camera’s FOV, the plot for zero people has been omitted from
the figure.
Fig. a. Effect of varying the bin width w in the 3-D bin model. The bin shift was kept at 15 cm. The y-axis
shows the number of detection errors, normalized for easier comparison between experiments with
different numbers of people.
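The error-counting procedure used to pick the histogram parameters can be sketched as below. The data layout (a dictionary keyed by a hypothetical `(bin_width_cm, bin_shift_cm)` pair, with per-video lists of peak counts) and the function names are assumptions for illustration, and the peak counts in the usage example are made up, not measured.

```python
def count_errors(peak_counts, true_count):
    """Number of frames whose peak count differs from the known number of people."""
    return sum(1 for n in peak_counts if n != true_count)


def best_parameters(results):
    """Pick the parameter setting with the fewest total errors.

    `results` maps (bin_width_cm, bin_shift_cm) to a list of
    (peak_counts_per_frame, true_people_count) tuples, one per video.
    """
    totals = {
        params: sum(count_errors(peaks, truth) for peaks, truth in videos)
        for params, videos in results.items()
    }
    return min(totals, key=totals.get), totals


if __name__ == "__main__":
    # Illustrative numbers only.
    results = {
        (50, 15): [([1, 1, 1, 2], 1), ([2, 2, 1, 2], 2)],
        (30, 5): [([1, 2, 2, 3], 1), ([2, 3, 3, 2], 2)],
    }
    print(best_parameters(results))
```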
The plots in Figs. a and b show the detection capability of the raw histogram output,
before using any tracking or employing any other features. In these conditions, the histogram
peaks correctly detected the number of people over 60% of the time for suitable values of the
bin width and bin shift and up to four people.
The optimal bin shift is found in the midrange of the tested values. The room where
these experiments were performed has dimensions 9 m × 5 m, with a ceiling height of 3 m. The
entire floor was covered by a single camera node with a 162° wide-angle lens mounted on the
ceiling. If the usable field-of-view is defined as the one where a person is seen in their entirety,
then the dimensions get reduced to around 3.2 m × 2.4 m. The people in the room were asked to
stay within those bounds, but, in the experiments with five people, they often moved outside due
to space constraints. This is reflected in the plots, where the five-person detection results show
much higher error rates than the others. Again, it should be emphasized that this is the raw
output of the motion histogram. Normally, this is coupled with additional data modalities to
further improve the results. This is done in the tracker explained in Section V by incorporating
the color histogram information.
Fig. b. Effect of varying the bin shift in the 3-D bin model. The bin width was kept at 50 cm. The y-axis
shows the number of detection errors, normalized for easier comparison between experiments with
different numbers of people.
The shape of the plot in Fig. b reinforces the decision to use a histogram instead of a
convolution. It is clear that small bin shifts are desirable given that they produce histogram
modes with smaller localization error. However, those histograms are also very prone to false
positives. This effect is shown in the figure, where error rates are high for small bins, then
sharply decrease for midrange bins and increase again for larger ones.
6.2 Histogram Resolution
We tested the histogram’s positional accuracy by having two people walk toward one
another and meet at the center of the image. This was captured by a single camera, in five
different runs. The histogram was able to differentiate distances down to 15 cm 100% of the
time. This resolution greatly suits the assisted-living scenario, where the main interest is in the
logical spatial location (such as “on the sofa,” or “by the stove”), instead of more precise
coordinates. The same test was performed for locations increasingly farther from the center. The
result was the same for distances up to 1 m from the image center (66% of the usable area). At
that distance, although the histogram at times produced a single peak for both people, the
tracking/counting algorithm was able to disambiguate them. At the farthest position where one
is fully covered by the camera (1.5 m), the algorithm missed around 42% of all detections. We
believe there is room for improvement in those conditions, by utilizing a better tracker. The
histogram achieves its best precision in the center two-thirds of the image, since when people
walk closely and side-by-side near the image edges, occlusions often occur. Near the center, the
maximal accuracy (15 cm, given the 15 cm bin shift) was achieved in five runs of parallel-walking tests
(where two people walked side-by-side in different locations). Additionally, in the experiments
where the two people crossed paths from different angles, the tracking algorithm was able to
keep the correct count and locations regardless of distance from the image center. This is
probably thanks to the short duration of the occlusions in this type of experiment. All
experiments in this section were performed offline on a desktop computer, using a Python
implementation of the algorithms.
6.3 Network Implementation
We implemented the motion histogram and counting algorithm in a sensor network
composed of multiple Intel iMote2 sensor nodes. Each node is fitted with a custom-built
camera board that contains an OmniVision OV7649 imager. The nodes acquire images at 320 ×
240 resolution, downsample them to 80 × 60, then run the algorithm described in Fig.
Our custom camera board with wide-angle lenses mounted onto the iMote2 sensor node.
Fig. c. Snapshot of the real-time visualization GUI: a composite of pretransferred images from all six
nodes is used as the background, on which each detected person’s location is overlaid (blue circles). In the
snapshot above there are three people in the scene.
The network was time-synchronized at boot time using a broadcast from the base node.
The required mappings are precomputed and kept in the node’s memory for fast operation. For each
frame, the number of detected peaks along with a timestamp is recorded into a small buffer. For
visualization purposes, this buffer can be wirelessly transmitted to the base when full or after a
timeout, whichever happens first. This allows people’s positions to be verified in real time with
a GUI on the client desktop (Fig. c). This entire process repeats at a frame rate of Hz when the
processor is configured to run at 208 MHz. Local counts for each node are transmitted whenever
they change. The base station aggregates the counts and reports the total count to the gateway.
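The buffering policy described above (send when full or after a timeout, whichever comes first) might look like the following sketch. The class name `DetectionBuffer`, its parameters and the `send` callback are hypothetical, and the timeout is only checked when a new frame is recorded, which is a simplification; the node firmware itself is not shown in this report.

```python
class DetectionBuffer:
    """Collects (timestamp, peak_count) entries and flushes them in batches."""

    def __init__(self, capacity, timeout_s, send):
        self.capacity = capacity    # flush once this many entries are stored
        self.timeout_s = timeout_s  # or once this much time has passed
        self.send = send            # callback standing in for the radio link
        self.entries = []
        self.last_flush = None

    def record(self, peak_count, timestamp):
        if self.last_flush is None:
            self.last_flush = timestamp
        self.entries.append((timestamp, peak_count))
        full = len(self.entries) >= self.capacity
        timed_out = timestamp - self.last_flush >= self.timeout_s
        if full or timed_out:
            self.send(list(self.entries))
            self.entries.clear()
            self.last_flush = timestamp


if __name__ == "__main__":
    sent = []
    buf = DetectionBuffer(capacity=3, timeout_s=10.0, send=sent.append)
    for count, t in [(1, 0.0), (2, 1.0), (2, 2.0), (3, 5.0), (3, 12.5)]:
        buf.record(count, t)
    print(sent)
```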
The nodes are placed on the testbed structure on the ceiling of our lab, where they are a
single hop away from their base, in a star topology. Given the ceiling height at the lab (240 cm)
and the presence of cubicle walls, six nodes are required to cover the entire area. In this
configuration, each node has a usable FOV of approximately 3 m × 2 m. The node positions are
chosen to minimize field-of-view overlaps, and the images they acquire are cropped until the
overlap is virtually zero. This way, we avoid most of the correspondence issues to focus on
histogram and counting performance. Moreover, given the nonoverlapping FOVs, the system
described here can be effortlessly mapped onto a tree topology. In this configuration, each node
sums the counts of its children, adds its local count and reports the value to its parent. This can
repeat periodically or only when the new local count differs from the previous one.
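The tree aggregation described above amounts to a simple recursive sum. The sketch below assumes nonoverlapping FOVs, so that nobody is counted twice; the node names and the dictionary layout are made up for the example.

```python
def total_count(node, children, local_counts):
    """Count reported by `node`: its own local count plus its children's reports."""
    return local_counts[node] + sum(
        total_count(child, children, local_counts)
        for child in children.get(node, [])
    )


if __name__ == "__main__":
    # Hypothetical four-node tree: base -> a, b and a -> c.
    children = {"base": ["a", "b"], "a": ["c"]}
    local_counts = {"base": 0, "a": 1, "b": 2, "c": 1}
    print(total_count("base", children, local_counts))
```

In a deployment each node would compute only its own sum from the values its children report, and could suppress the report when that sum has not changed since the last one.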
For the experiment, people walked around the testbed through every node’s FOV. Five
runs were performed. The people were allowed to stop at will, as well as to sit down and stand
up. Experimental runs were recorded with one to five people in the covered area. The results
range from 92.9% correct detections when a single person was in the room, to 64.3% for five
people, as shown in Table I.
We developed an algorithm that uses a motion histogram to detect, count, and localize
people. The algorithm is lightweight and operates in real time on wireless sensor nodes. Through
the use of low-power AER motion cameras, the overall computational complexity can be further
reduced, which in turn allows simpler processors to be employed, lowering the system’s power
consumption. As there is no calibration step that must be done on-site at the time of deployment,
the system can quickly scale by just adding more nodes. The only calibration requirement is that
the intrinsic parameters of the camera must be approximately known in the case where wide-
angle lenses are used. When all lenses used are the same model, the intrinsic calibration only
needs to be performed for one camera, and the parameters can be reused for all nodes.
We implemented the motion histogram and the tracking algorithm described in Section
V on sensor nodes, which transmitted their detected person-count to a base for aggregation. The
network requires nonoverlapping cameras, but a possible direction for future work is to exploit
FOV overlaps for increased robustness to occlusions. A more extensive evaluation of the
network and the sources of errors in the algorithm is currently taking place.
T. Teixeira, E. Culurciello, E. Park, D. Lymberopoulos, and A. Savvides, “Address-event
imagers for sensor networks: Evaluation and modeling,” in Proc. Inf. Process. Sens. Netw. IPSN,
Apr. 2006, pp. 458–466.
T. Teixeira and A. Savvides, “Lightweight people counting and localizing in indoor spaces
using camera sensor nodes,” in Proc. ACM/IEEE Int. Conf. Distributed Smart Cameras, Sep.
2007, pp. 36–43.
T. Murakita, T. Ikeda, and H. Ishiguro, “Human tracking using floor sensors based on the
Markov chain Monte Carlo method,” in Proc. 17th Int. Conf. Pattern Recognition, Aug. 2004, pp.
K. Hashimoto, K. Morinaka, N. Yoshiike, C. Kawaguchi, and S. Matsueda, “People count
system using multi-sensing application,” in Proc. IEEE Int. Conf. Solid-State Sens. Actuators,
Jun. 1997, pp. 1291–1294.
N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc.
IEEE Conf. Comput. Vision and Pattern Recognition, 2005, pp. 886–893.
summer project pal|
Active In SP
Joined: Jan 2011
03-02-2011, 09:33 PM
Lightweight People Counting and Localizing for Easily Deployable Indoors WSNs
Roll No: 1
Describes a method for counting and localizing people in an area
Using camera sensor networks
Is a lightweight method
Using a motion histogram
Low-cost system
Easy to install
Requires little or no on-site calibration
Previously used systems
PIR (passive infrared) sensors: measure infrared (IR) light radiating from objects in their field of view
Pressure-sensitive floor tiles
Camera-based systems: use pattern matching
Requires a laborious installation process
May require precise on-site calibration
Feature extraction and matching processes can be computationally expensive
Background differencing problems
Background differencing & frame differencing
Problem with frame differencing
Example: histogram frame.
Tuning the Histogram: Wide-Angle Lens Considerations
CAMERA-NODE LEVEL COUNTING
Standard Tracking Algorithm