Abstract
Recent years have seen remarkable progress in 3D video technology, which can capture the three-dimensional surface of a moving person without special markers or costumes. This article introduces 3D human-sensing algorithms based on 3D video. These algorithms extract essential information, including body motion and viewing direction, without the sensing system itself disturbing the subject.
Introduction
The rapid evolution of 3D video technology has revolutionized the capture of moving objects, including human subjects, without attaching special markers or wearing dedicated costumes [3] [1] [2] [4]. This is a major advantage over other motion capture technologies and makes 3D video well suited to the three-dimensional digital preservation of human motion, including intangible cultural assets. It is important to note, however, that 3D video by itself provides only unstructured three-dimensional surface data, akin to the pixel streams of conventional 2D video. The objective of this article is to demonstrate how raw 3D video can be used to sense human activity.
3D Video
The term "3D video" or "free viewpoint video" encompasses two distinct approaches documented in the existing literature. The first approach, known as "model-based" methods, entails reconstructing the three-dimensional shape of the object and rendering it, much like computer-generated imagery (CG) [2] [4]. The second approach, referred to as "image-based" methods, involves interpolating a two-dimensional image at a virtual camera position directly from two-dimensional multi-viewpoint images. When it comes to 3D human sensing, model-based approaches prove more appropriate since image-based methods do not provide three-dimensional information. The estimation of the three-dimensional shape within model-based approaches remains a classic yet unresolved problem in computer vision.
Figure 1: Workflow of 3D video capture. Multi-viewpoint silhouettes derived from the object's images give an initial estimate of the object's shape, termed the "visual hull." This estimate is then refined by optimizing photo-consistency among the input images. The texture-mapped three-dimensional surface enables the generation of virtual images from arbitrary viewpoints.
Estimating the original three-dimensional shape from its two-dimensional projections is an inherently challenging problem. In recent years, however, numerous practical algorithms have been proposed that address it by integrating traditional stereo matching with shape-from-silhouette techniques, producing a comprehensive three-dimensional shape termed the "photo hull." Figure 1 illustrates our 3D video capture scheme. The top and second rows show an example of multi-viewpoint input images and their corresponding object regions, respectively. The visual hull of the object is then computed from the multi-viewpoint silhouettes, as depicted in the third row. Through photo-consistency optimization, we refine the visual hull to obtain the optimal three-dimensional surface of the object, as shown in the fourth row. Finally, textures are mapped onto the three-dimensional surface; the bottom row presents a sample rendering of the final three-dimensional surface estimated from the multi-viewpoint images.
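To make the shape-from-silhouette step concrete, here is a minimal sketch of voxel carving: a voxel is kept only if it projects inside the silhouette in every view, which yields the visual hull. The function name, the choice of a flat voxel list, and the input layout (binary masks plus 3x4 projection matrices) are assumptions for illustration; the actual pipeline additionally refines this hull by photo-consistency optimization.

```python
import numpy as np

def visual_hull(silhouettes, projections, grid_points):
    """Shape-from-silhouette by voxel carving (illustrative sketch).

    silhouettes : list of HxW binary masks, one per camera
    projections : list of 3x4 camera projection matrices
    grid_points : Nx3 array of candidate voxel centres
    Returns a boolean mask over grid_points: True = inside the visual hull.
    """
    # Homogeneous coordinates for all voxel centres.
    pts_h = np.hstack([grid_points, np.ones((len(grid_points), 1))])
    inside = np.ones(len(grid_points), dtype=bool)
    for mask, P in zip(silhouettes, projections):
        h, w = mask.shape
        # Project all voxels into this view and dehomogenize.
        uvw = pts_h @ P.T
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        # A voxel survives only if it lands on the silhouette in every view.
        valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        on_sil = np.zeros(len(grid_points), dtype=bool)
        on_sil[valid] = mask[v[valid], u[valid]] > 0
        inside &= on_sil
    return inside
```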
Estimating Kinematic Structure from 3D Video
This section introduces an algorithm for estimating the kinematic structure of an articulated object captured as 3D video. The input is a time series of three-dimensional surfaces. Let Mt denote the input three-dimensional surface at time t (Figure 2(a)). We first construct the Reeb graph [6] of Mt, as depicted in Figure 2(b). The Reeb graph is computed from the integral of geodesic distances over Mt, yielding a graph structure that resembles the kinematic structure. Figure 2(a) shows the surface segmentation induced by the integral of geodesic distances.
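As a rough illustration of the function underlying this Reeb graph, the sketch below approximates the integral of geodesic distances on a mesh by shortest paths over its edge graph, which is a standard discrete approximation. The function name and the all-pairs computation are illustrative assumptions; a practical implementation would sample source vertices rather than compute every pair.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_integral(vertices, edges):
    """Approximate mu(v) = sum over u of g(v, u), the integral of geodesic
    distances, with shortest-path distances over the mesh edge graph.

    vertices : Nx3 array of vertex positions
    edges    : Mx2 array of vertex index pairs
    """
    n = len(vertices)
    lengths = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)
    # Symmetric sparse adjacency matrix weighted by edge length.
    i = np.concatenate([edges[:, 0], edges[:, 1]])
    j = np.concatenate([edges[:, 1], edges[:, 0]])
    w = np.concatenate([lengths, lengths])
    graph = csr_matrix((w, (i, j)), shape=(n, n))
    # All-pairs shortest paths; large meshes would use sampled sources instead.
    dist = dijkstra(graph, directed=False)
    return dist.sum(axis=1)  # mu(v), low at the torso, high at extremities
```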
Note, however, that the definition of the Reeb graph does not guarantee that all graph edges remain inside Mt; some edges may extend beyond it. To address this, we modify the Reeb graph so that it stays enclosed within Mt. The modified graph is called the "pseudo Endoskeleton Reeb Graph" (pERG), as illustrated in Figure 2(c). pERGs are first constructed for each frame, and from these we select "seed" pERGs that exhibit no degeneration in their body parts. Identifying seed pERGs relies on a simple assumption: since we focus on human behavior, a seed must possess five branches.
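This seed test reduces to a simple graph property. A minimal sketch, assuming each pERG is held as a networkx graph (the representation is an assumption here): a frame qualifies as a seed when its pERG is a single connected structure with exactly five terminal branches, corresponding to the head, two arms, and two legs.

```python
import networkx as nx

def is_seed_perg(graph: nx.Graph, n_branches: int = 5) -> bool:
    """Seed test sketch: a non-degenerate human pERG should be one connected
    component with exactly five leaf nodes (head, two arms, two legs)."""
    if graph.number_of_nodes() == 0 or not nx.is_connected(graph):
        return False
    leaves = [v for v in graph.nodes if graph.degree(v) == 1]
    return len(leaves) == n_branches

# Frames whose pERG passes this test serve as seeds for the fitting stage:
# seeds = [t for t, g in enumerate(pergs) if is_seed_perg(g)]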
We then perform pERG-to-pERG fitting, starting from the seed frames and progressing to their neighboring frames. The seed frame is deformed to match its neighbors, and this process continues until the fitting error exceeds a given threshold. This yields topologically isomorphic intervals for each seed frame, as depicted at the top of Figure 3. Within each interval, node clustering is used to identify the articulated structure (Figure 4). Finally, the articulated structures estimated across all intervals are integrated into a unified kinematic structure. Figures 2(d) and 5 present the final unified kinematic structure estimated purely from the input 3D surface sequence.
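The interval-growing step can be sketched as below, assuming a hypothetical fit_error callable that deforms the seed pERG onto a target frame and returns the residual error; the deformation itself is beyond the scope of this sketch.

```python
def grow_interval(pergs, seed_idx, fit_error, threshold):
    """Grow a topologically isomorphic interval around a seed frame.

    pergs     : list of pERGs, one per time step
    seed_idx  : index of the seed frame
    fit_error : callable (seed_perg, target_perg) -> scalar residual
    threshold : maximum fitting error tolerated inside the interval
    """
    seed = pergs[seed_idx]
    lo = hi = seed_idx
    # Extend backwards while the deformed seed still fits its neighbour.
    while lo > 0 and fit_error(seed, pergs[lo - 1]) <= threshold:
        lo -= 1
    # Extend forwards symmetrically.
    while hi < len(pergs) - 1 and fit_error(seed, pergs[hi + 1]) <= threshold:
        hi += 1
    return lo, hi  # frames lo..hi share the seed's topology
```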
Visibility of the Model Surface
First, we introduce a visibility definition for the model M(p) based on collision detection between body parts. Regions where collisions occur generally cannot be observed by any camera, as exemplified in Figure 8. The color depicts the distance between a point and the nearest surface point of other body parts. Leveraging this distance and visibility, we define the reliability of M(p): the closer a point lies to another body part, the lower the reliability of its observation.
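The article's exact reliability formula is not reproduced here; the sketch below is one illustrative choice consistent with the description, in which reliability vanishes inside collision regions and saturates for points far from any other body part. The linear ramp and the radius parameter are assumptions.

```python
import numpy as np

def model_reliability(dist_to_other_parts, radius):
    """Illustrative reliability weight for model points M(p) (assumed form,
    not the paper's formula): 0 at a collision, 1 beyond `radius`.

    dist_to_other_parts : per-point distance to the nearest surface point
                          of any other body part
    radius              : distance beyond which a point counts as fully visible
    """
    return np.clip(dist_to_other_parts / radius, 0.0, 1.0)
```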
Visibility of the Observed Surface
Next, we introduce the concept of visibility for the observed surface, Mt. Since Mt is estimated from multi-viewpoint images, its vertices can be categorized by the number of cameras that observe them. If a vertex v is observed by at most one camera, it is deemed non-photo-consistent, and its position is interpolated from neighboring vertices. If two or more cameras observe v, it is photo-consistent, and its three-dimensional position is explicitly estimated by stereo matching. The number of cameras observing a vertex therefore reflects the reliability of its three-dimensional position.
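This rule translates directly into a per-vertex reliability weight. The sketch below counts observing cameras with a front-facing test only; this deliberately ignores occlusion, which a full implementation would resolve by depth buffering against the reconstructed surface. The function name and inputs are assumptions for illustration.

```python
import numpy as np

def visibility_count(vertices, normals, cam_centers):
    """Count, for each vertex, the cameras that can plausibly observe it
    (simplified: front-facing test only, occlusion not handled).

    vertices    : Nx3 vertex positions
    normals     : Nx3 outward unit normals
    cam_centers : list of 3-vectors, camera optical centres
    """
    counts = np.zeros(len(vertices), dtype=int)
    for c in cam_centers:
        view = c - vertices                         # vertex-to-camera rays
        facing = (normals * view).sum(axis=1) > 0   # front-facing wrt camera
        counts += facing.astype(int)
    return counts

# Vertices observed by two or more cameras are photo-consistent; the rest
# were interpolated and are therefore less reliable:
# photo_consistent = visibility_count(V, N, cams) >= 2
```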
Conclusion
This article has introduced human activity sensing algorithms based on 3D video. The algorithms encompass three fundamental aspects: (1) global kinematic structure, (2) intricate motion estimation, and (3) detailed estimation of face and eye directions. Importantly, these algorithms facilitate non-contact sensing, eliminating the need for special markers or costumes on the subject. This represents a significant advantage of our 3D video-based sensing approach.
Hashtags/Keywords/Labels: #3DHumanSensing #3DVideo #KinematicStructure #MotionCapture #ComputerVision
References/Resources:
1. T. Kanade, P. Rander, and P. J. Narayanan. "Virtualized reality: Constructing virtual worlds from real scenes." IEEE Multimedia, 1997.
2. T. Matsuyama, X. Wu, T. Takai, and S. Nobuhara. "Real-time 3D shape reconstruction, dynamic 3D mesh deformation and high fidelity visualization for 3D video." CVIU, 2004.
3. S. Moezzi, L.-C. Tai, and P. Gerard. "Virtual view generation for 3D digital video." IEEE Multimedia, 1997.
4. J. Starck and A. Hilton. "Surface capture for performance-based animation." IEEE Computer Graphics and Applications, 2007.
…till next post, bye-bye and take care.