Saturday, July 1, 2023

3D Human Sensing

Abstract

Recent years have seen remarkable progress in 3D video technology, which can now capture the three-dimensional surface of a moving person without special markers or costumes. This article introduces 3D human sensing algorithms based on 3D video. They extract essential information, including body motion and viewing direction, while the sensing system itself remains completely non-intrusive to the subject.


Introduction

The rapid evolution of 3D video technology has revolutionized the capture of objects in motion, including human subjects, without requiring them to wear special markers or costumes [3] [1] [2] [4]. This is a decisive advantage over other motion capture technologies and makes 3D video well suited to the three-dimensional digital preservation of human motion, including intangible cultural assets. It is important to note, however, that 3D video by itself provides only unstructured three-dimensional surface data, akin to the pixel streams of conventional 2D video. The goal of this article is to show how raw 3D video can be used to sense human activity.

 

3D Video

The term "3D video" or "free viewpoint video" encompasses two distinct approaches documented in the existing literature. The first approach, known as "model-based" methods, entails reconstructing the three-dimensional shape of the object and rendering it, much like computer-generated imagery (CG) [2] [4]. The second approach, referred to as "image-based" methods, involves interpolating a two-dimensional image at a virtual camera position directly from two-dimensional multi-viewpoint images. When it comes to 3D human sensing, model-based approaches prove more appropriate since image-based methods do not provide three-dimensional information. The estimation of the three-dimensional shape within model-based approaches remains a classic yet unresolved problem in computer vision.

 

Figure 1: Workflow of 3D video capture. Multiple-viewpoint silhouettes derived from the object's images provide an initial estimate of the object's shape, termed the "visual hull." This estimate is then refined by optimizing photo-consistency among the input images. The texture-mapped three-dimensional surface enables the generation of virtual images from arbitrary viewpoints.

 

Estimating the original three-dimensional shape from its two-dimensional projections is an inherently ill-posed problem. In recent years, however, many practical algorithms have been proposed that integrate traditional stereo matching with shape-from-silhouette techniques to produce a complete three-dimensional shape termed the "photo hull." Figure 1 illustrates our 3D video capture scheme. The top and second rows show an example of multi-viewpoint input images and their corresponding object regions, respectively. The visual hull of the object is then computed from the multi-viewpoint silhouettes, as depicted in the third row. Through photo-consistency optimization, we refine the visual hull to obtain the optimal three-dimensional surface of the object, shown in the fourth row. Finally, textures are mapped onto the three-dimensional surface; the bottom row presents a sample rendering of the final three-dimensional surface estimated from the multi-viewpoint images.
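
To make the shape-from-silhouette stage concrete, here is a minimal voxel-carving sketch. It is not the authors' implementation: the voxel grid, mask format, and projection-matrix layout are assumptions, and the photo-consistency refinement that turns the visual hull into the photo hull is omitted.

```python
import numpy as np

def visual_hull(voxels, silhouettes, projections):
    """Shape-from-silhouette: keep the voxels whose projections fall
    inside every camera's silhouette mask.

    voxels      : (N, 3) array of candidate voxel centres (world coords).
    silhouettes : list of (H, W) boolean masks, one per camera.
    projections : list of (3, 4) camera projection matrices.
    """
    inside = np.ones(len(voxels), dtype=bool)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])   # (N, 4)
    for mask, P in zip(silhouettes, projections):
        uvw = homog @ P.T                                    # project to image plane
        uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
        h, w = mask.shape
        valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[valid] = mask[uv[valid, 1], uv[valid, 0]]
        inside &= hit              # carve away voxels outside this silhouette
    return voxels[inside]
```

Carving with every silhouette yields a conservative superset of the true shape, which is exactly why the photo-consistency optimization described above is needed as a refinement step.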


Estimating Kinematic Structure from 3D Video

This section introduces an algorithm that estimates the kinematic structure of an articulated object captured as 3D video. The input is a time series of three-dimensional surfaces, from which the kinematic structure is built. Let Mt denote the input three-dimensional surface at time t (Figure 2(a)). We first construct the Reeb graph [6] of Mt, as depicted in Figure 2(b). The Reeb graph is computed from the integral of geodesic distances on Mt, yielding a graph structure that resembles the kinematic structure. Figure 2(a) shows the surface segmentation based on the integral of geodesic distances.
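
The key scalar function here is the integral of geodesic distances. The sketch below approximates it on a triangle mesh by running Dijkstra's algorithm over the edge graph from every vertex; the all-pairs strategy and a connected mesh are simplifying assumptions for clarity (practical systems sample source vertices), and the names are ours.

```python
import heapq
from collections import defaultdict

import numpy as np

def geodesic_integral(vertices, edges):
    """Approximate, at every vertex v, the integral of geodesic distance
    from v over the surface, using shortest paths on the edge graph.

    vertices : (N, 3) array of vertex positions.
    edges    : iterable of (i, j) vertex-index pairs.
    """
    graph = defaultdict(list)
    for i, j in edges:
        w = float(np.linalg.norm(vertices[i] - vertices[j]))
        graph[i].append((j, w))
        graph[j].append((i, w))

    n = len(vertices)
    integral = np.zeros(n)
    for src in range(n):                      # practical code samples sources
        dist = np.full(n, np.inf)
        dist[src] = 0.0
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            for v, w in graph[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        integral += dist                      # accumulate per-source distances
    return integral   # level sets of this function induce the Reeb graph
```

Because articulated motion is a near-isometry of the surface, this function is largely pose-invariant, and its level sets split the surface into bands that follow the limbs; that is what makes the resulting Reeb graph resemble the kinematic structure.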

It is worth noting, however, that the definition of the Reeb graph does not guarantee that all graph edges stay inside Mt; some edges may extend beyond it. To tackle this issue, we modify the Reeb graph so that it remains enclosed within Mt. The modified graph is called the "pseudo Endoskeleton Reeb Graph" (pERG), illustrated in Figure 2(c). A pERG is first constructed for each frame, and from these we select "seed" pERGs that exhibit no degeneration of their body parts. Because our focus is human behavior, seed pERGs are identified by a simple test: they must have exactly five branches, one for the head and one for each limb.
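
As a toy illustration of this seed test, assuming the five branches correspond to the degree-one terminal nodes of the pERG (the function and data layout below are ours, not the article's):

```python
def is_seed_perg(adjacency, expected_branches=5):
    """Treat a frame's pERG as a seed when it has exactly five terminal
    branches (head plus four limbs), i.e. no degenerated body parts.

    adjacency : dict mapping node id -> set of neighbouring node ids.
    """
    terminals = [n for n, nbrs in adjacency.items() if len(nbrs) == 1]
    return len(terminals) == expected_branches
```

A frame where, say, an arm rests against the torso merges that branch into the body, drops the terminal count below five, and is therefore rejected as a seed.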

 

Subsequently, we perform pERG-to-pERG fitting, starting from the seed frames and propagating to their neighboring frames. The seed frame is deformed to match each neighbor, and the process continues until the fitting error exceeds a given threshold. This yields a topologically isomorphic interval for each seed frame, as depicted at the top of Figure 3. Within each interval, node clustering identifies the articulated structure (Figure 4). Finally, the articulated structures estimated across all intervals are integrated into a unified kinematic structure. Figures 2(d) and 5 present the final unified kinematic structure, estimated purely from the input 3D surface sequence.
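
A minimal sketch of the interval-growing loop is given below. The `fit_and_deform` callable stands in for the actual pERG-to-pERG deformation step, which the article does not spell out; the names and the bidirectional sweep are our reading of the text, not the authors' code.

```python
def grow_interval(pergs, seed, fit_and_deform, max_error):
    """Grow the topologically isomorphic interval around a seed frame.

    pergs          : list of per-frame pERGs.
    seed           : index of the seed frame.
    fit_and_deform : callable(model, target) -> (deformed_model, error).
    max_error      : fitting-error threshold that ends the interval.
    """
    lo = hi = seed

    model = pergs[seed]
    for t in range(seed + 1, len(pergs)):     # sweep forward in time
        model, err = fit_and_deform(model, pergs[t])
        if err > max_error:
            break
        hi = t

    model = pergs[seed]
    for t in range(seed - 1, -1, -1):         # sweep backward in time
        model, err = fit_and_deform(model, pergs[t])
        if err > max_error:
            break
        lo = t

    return lo, hi          # frames lo..hi share the seed's topology
```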


Visibility of the Model Surface

First, we introduce a visibility definition for the model M(p) using collision detection between body parts. Regions where body parts collide generally cannot be observed by any camera, as exemplified in Figure 8, where color encodes the distance from each point to the nearest surface point of other body parts. Leveraging this distance together with visibility, we define a reliability measure for M(p).
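
The article's original reliability formula is not reproduced here, so the following sketch is only one plausible form, stated under explicit assumptions: reliability is zero where a collision occludes the point and grows smoothly with the clearance distance otherwise, with the saturation scale `d0` chosen arbitrarily.

```python
import numpy as np

def model_reliability(clearance, visible, d0=0.02):
    """Hypothetical reliability for points on the model surface M(p).

    clearance : per-point distance to the nearest surface point of any
                other body part (the quantity color-coded in Figure 8).
    visible   : per-point boolean, False where a collision hides the point.
    d0        : saturation scale in metres (an assumed constant).
    """
    d = np.asarray(clearance, dtype=float)
    r = 1.0 - np.exp(-d / d0)         # 0 at contact, approaches 1 with clearance
    return np.where(visible, r, 0.0)  # occluded regions get zero reliability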


Visibility of the Observed Surface

Next, we introduce the concept of visibility for the observed surface, Mt. Since Mt is estimated from multi-viewpoint images, the vertices on Mt can be categorized by the number of cameras that observe them. If a vertex v is observed by at most one camera, it is deemed non-photo-consistent, and its position is interpolated from neighboring vertices. Conversely, if two or more cameras observe vertex v, it is photo-consistent, and its three-dimensional position is explicitly estimated by stereo matching. Consequently, the number of cameras observing a vertex reflects the reliability of its three-dimensional position.
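
A minimal sketch of this classification follows; the linear normalization of the camera count into a [0, 1] reliability weight is our own assumption, not a formula from the article.

```python
import numpy as np

def observed_surface_reliability(camera_counts, num_cameras):
    """Per-vertex reliability for the observed surface Mt.

    Vertices seen by fewer than two cameras are non-photo-consistent
    (their positions are interpolated from neighbours), so they get
    zero reliability; photo-consistent vertices are weighted by how
    many of the num_cameras cameras observe them.
    """
    counts = np.asarray(camera_counts, dtype=float)
    return np.where(counts >= 2, counts / num_cameras, 0.0)
```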

 

Conclusion

This article has introduced human activity sensing algorithms based on 3D video. The algorithms cover three fundamental aspects: (1) global kinematic structure estimation, (2) detailed motion estimation, and (3) estimation of face and eye directions. Importantly, they enable non-contact sensing, requiring no special markers or costumes on the subject; this is the key advantage of our 3D video-based sensing approach.

 

Hashtags/Keywords/Labels:

#3DHumanSensing #3DVideo #KinematicStructure #MotionCapture #ComputerVision

 

References/Resources:

1. T. Kanade, P. Rander, and P. J. Narayanan. "Virtualized reality: Constructing virtual worlds from real scenes." IEEE Multimedia, 1997.

2. T. Matsuyama, X. Wu, T. Takai, and S. Nobuhara. "Real-time 3D shape reconstruction, dynamic 3D mesh deformation and high fidelity visualization for 3D video." CVIU, 2004.

3. S. Moezzi, L.-C. Tai, and P. Gerard. "Virtual view generation for 3D digital video." IEEE Multimedia, 1997.

4. J. Starck and A. Hilton. "Surface capture for performance-based animation." IEEE Computer Graphics and Applications, 2007.

 

For more such Seminar articles click index – Computer Science Seminar Articles list-2023.

[All images are taken from Google Search or respective reference sites.]

…till next post, bye-bye and take care.
