LAPC, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China

Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China

Beijing Information Science and Technology University, Beijing 100101, China

Abstract

Background

How it is possible to “faithfully” represent a three-dimensional stereoscopic scene using Cartesian coordinates on a plane, and how three-dimensional perceptions differ between an actual scene and an image of the same scene are questions that have not yet been explored in depth. They seem like commonplace phenomena, but in fact, they are important and difficult issues for visual information processing, neural computation, physics, psychology, cognitive psychology, and neuroscience.

Results

The results of this study show that the use of plenoptic (or all-optical) functions and their dual plane parameterizations can not only explain the nature of information processing from the retina to the primary visual cortex and, in particular, the characteristics of the visual pathway’s optical system and its affine transformation, but they can also clarify the reason why the vanishing point and line exist in a visual image. In addition, they can better explain the reasons why a three-dimensional Cartesian coordinate system can be introduced into the two-dimensional plane to express a real three-dimensional scene.

Conclusions

1. We introduce two different mathematical expressions of the plenoptic function, I_w and I_v, that can describe the objective world. We also analyze the differences between these two functions when describing visual depth perception, that is, the difference between how these two functions obtain the depth information of an external scene.

2. The main results include a basic method for introducing a three-dimensional Cartesian coordinate system into a two-dimensional plane to express the depth of a scene, its constraints, and algorithmic implementation. In particular, we include a method to separate the plenoptic function and proceed with the corresponding transformation in the retina and visual cortex.

3. We propose that size constancy, the vanishing point, and vanishing line form the basis of visual perception of the outside world, and that the introduction of a three-dimensional Cartesian coordinate system into a two dimensional plane reveals a corresponding mapping between a retinal image and the vanishing point and line.

Background

How a three-dimensional scene can be “faithfully” expressed in a (two-dimensional) plane (e.g., TV), that is to say, how it can be “faithfully” represented using a planar Cartesian coordinate system, and what the differences are between the stereoscopic perception of an actual scene and its two-dimensional image are important issues in visual information processing research, neural computation, psychophysics, and neuroscience.

At the cellular level, previous studies have shown that in the V1 cortex, only complex cells are able to respond to absolute parallax

A three-dimensional scene “faithfully” represented in a plane seems to be commonplace phenomenon, yet the mechanism for this has never been explored. It is, however, a basic theoretical problem and is worthy of study in depth, not only because it concerns the geometric and physical properties of planes and space and is closely related to the three-dimensional perception of human vision, but also because it is closely related to the problem of stereoscopic perception in computer vision, robotics navigation, and visual cognitive psychology.

In fact, there are many similar phenomena, such as optical illusions generated using optics, geometry, physiology, psychology, and other means. Optical illusions are largely due to the uncertainty caused by bimodal graphics in a two-dimensional plane and uncertainty during visual information processing in the brain. Such illusions include bimodal images (vase and face, girl and grandmother, Escher’s “waterfall” picture, and so on) and Additional file

**A straight iron rod passes through two mutually perpendicular nuts in a way impossible in a real scene.**


**Visual depth perception in an image of a truss structure.**


**Visual depth perception in a landscape image.**


**Three-dimensional scene with stereoscopic visual perception indicating a range of depth at the Metropolitan Museum of Art, New York.**


**Vivid effect of three-dimensional perception in a picture painted on the pavement.**


Marr pointed out that the essence of visual information processing is to discover what and where objects are in space

As is known, any point in space can be represented by a Cartesian coordinate system (V_x, V_y, V_z) and the color-related wavelength λ. In this way, one can define a function I_w, I_w = I_w(V_x, V_y, V_z; θ, φ, λ, t), which leaves only the seven variables that form the plenoptic function proposed by Adelson and Bergen in the study of human primary visual information processing

The intensity of each ray can be described as a function of the spatial viewing angle together with the wavelength, time, and light intensity at the observation position (in spherical coordinates, the expression is I_v = I_v(V_ox, V_oy, V_oz; θ, φ, λ, t)).

We should note that the plenoptic function not only reveals how humans “see” the external world, but also intuitively and concisely describes the information processing that occurs between the retina and the primary visual cortex. Marr pointed out that the true nature of information processing in “seeing” is to discover where and what is in space. “Where” in space can be located by a Cartesian rectangular coordinate system (i.e., x, y, and z). “What” is in this position may be perceived through the structure of the light rays emitted or reflected from the “object” to the viewer’s eyes. These correspond to the intensity V_x, V_y, V_z and wavelength λ of light at that location, which carry information about the contour, shape, and color of the object. Thus, it can be seen that the plenoptic function is a good description of the external world. When Adelson and Bergen proposed the plenoptic function, their intention was to solve the problem related to corresponding points in computer vision. It was not expected that the study would promote the birth and development of the new discipline of computational photography. Here I_w = I_w(V_x, V_y, V_z; θ, φ, λ, t) and I_v = I_v(V_x, V_y, V_z, V_ox, V_oy, V_oz; θ, φ, λ, t) both represent the light intensity information of the object itself. The intensity of light is related to the number of excited photosensitive cells in the retina and their activity levels. As long as the angles of the incident light are given, I_v is determined.

An interesting and important question concerns the difference between the functions I_w and I_v. It is generally considered that I_w differs from I_v in the number of dimensions; i.e., the coordinates are reduced from (V_x, V_y, V_z) to the viewing angles (θ, φ).

and the focal length

Schematic diagram of depth of field and depth of focus

**Schematic diagram of depth of field and depth of focus.**

where d_o is the object distance and d_i is the image distance, as shown in Figure

In formula (3),
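The conjugate object/image relation can be sketched with the standard thin-lens equation, 1/f = 1/d_o + 1/d_i. This is a stand-in illustration only; the paper's own formula (3) is not reproduced, and the helper names `image_distance` and `magnification` are assumptions for this sketch.

```python
import math

# Thin-lens relation: 1/f = 1/d_o + 1/d_i, with d_o the object distance,
# d_i the image distance, and f the focal length.

def image_distance(f, d_o):
    """Image distance d_i for focal length f and object distance d_o."""
    if d_o == f:
        return math.inf  # object at the focal plane: rays emerge parallel
    return 1.0 / (1.0 / f - 1.0 / d_o)

def magnification(f, d_o):
    """Lateral magnification m = -d_i / d_o (negative means inverted)."""
    return -image_distance(f, d_o) / d_o

# As d_o grows, d_i approaches f: distant objects all focus near the focal
# plane, which is why depth of field extends toward infinity when gazing far.
for d_o in (0.1, 1.0, 10.0, 1000.0):   # metres; f = 17 mm, roughly an eye
    print(d_o, image_distance(0.017, d_o))
```

The last loop shows the convergence numerically: past a few metres, the image distance is essentially the focal length, so everything far away is simultaneously in acceptable focus.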

As one gazes into the distance, the depth of field may extend to infinity. One familiar phenomenon occurs when we look at a distant railway or highway and the tracks or road edges gradually converge to a single point in the distance (called the vanishing point), as shown in the Figure. This can be described by relating a projective coordinate system P^n to an affine coordinate system A^n, where the infinity point (x_1, x_2, ⋯, x_n, 0)^T is just the limit of (x_1/x_{n+1}, x_2/x_{n+1}, ⋯, x_n/x_{n+1}, 1)^T as x_{n+1} → 0.
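The convergence of parallels to an infinity point can be illustrated numerically in homogeneous coordinates. This is the standard construction, not notation from the text: a 2D line ax + by + c = 0 is the vector (a, b, c), and two lines intersect at their cross product.

```python
import numpy as np

# Two parallel "rails" written as homogeneous line vectors (a, b, c) for
# ax + by + c = 0. Their intersection, computed as a cross product, has a
# last coordinate of 0: an ideal (infinity) point, the algebraic counterpart
# of the vanishing point described in the text.

def intersect(l1, l2):
    """Homogeneous intersection point of two homogeneous lines."""
    return np.cross(l1, l2)

rail1 = np.array([1.0, 0.0, -1.0])   # the line x = 1
rail2 = np.array([1.0, 0.0, -3.0])   # the line x = 3, parallel to rail1
p = intersect(rail1, rail2)
print(p)   # last coordinate is 0: an ideal point in the common direction
```

For non-parallel lines the same formula returns an ordinary point with a nonzero last coordinate, which can be normalized back to affine coordinates by dividing through by it.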

Optics model of the affine transformation of parallel lines implemented by vision

**Optics model of the affine transformation of parallel lines implemented by vision.** The optical axis of vision points to a distant focus, the fixation point. Straight parallel lines converge at the focus. The focus and its vanishing line are projected onto the retina or imaging plane through the vanishing point and line in the retina. The visual system then perceives a distant intersection in the scene of the external world. Figure

When human eyes look into the distance, the fixation point can change in position, and this forms a horizontal vanishing line, as shown in Figure

Results

Mapping between the scene and visual image

The above brief description of previous research aims to introduce the problem of how a three-dimensional Cartesian coordinate system converted into a two-dimensional plane is able to express a real three-dimensional scene. This also explains why visual images in the retina can provide three-dimensional scene information to an observer. However, how the Cartesian coordinate system in a two-dimensional plane can “faithfully” represent a three-dimensional scene is not known, even though the problem seems trivial. The difference between the stereoscopic perception of actual scenes and a scene in a two-dimensional plane is an important issue in visual information processing, neural computation, psychophysics, and neuroscience, and is also a main research topic in image processing, three-dimensional display methods, and computer vision.

Figure

Cartesian coordinate system on the plane, α = (θ − 90°)

**Cartesian coordinate system on the plane, α = (θ − 90°).**

When the angle θ between the x-axis and the oblique z-axis is within the range 90° < θ < 180°, the oblique axis carries a depth component z_p, which is equivalent to the value along the z axis in real three-dimensional space. For example, if

The actual loss of depth information along the z-axis, or the information loss of visual depth perception, is I_loss = z − z_p.
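A sketch of how an oblique axis encodes depth in the plane: the projection below and its angle convention are assumptions for illustration, not the paper's exact formula. The z-axis is drawn at angle theta from the x-axis (90° < theta < 180°, as in the text), and a point's depth z is laid out along that oblique direction.

```python
import math

# Oblique drawing of a 3D Cartesian frame in the 2D plane. The receding
# z-axis makes angle theta_deg with the x-axis; k scales the receding axis
# (both are illustrative choices).

def project_oblique(x, y, z, theta_deg, k=0.5):
    """Plane coordinates (u, v) of the 3D point (x, y, z)."""
    a = math.radians(theta_deg)
    return x + k * z * math.cos(a), y + k * z * math.sin(a)

# At theta = 90 deg the drawn z-axis coincides with the y-axis: depth and
# height become indistinguishable, and stereoscopic perception disappears,
# matching the degenerate cases shown in the figure panels.
p_oblique = project_oblique(1.0, 1.0, 2.0, 135.0)
p_degenerate = project_oblique(1.0, 1.0, 2.0, 90.0)
print(p_oblique, p_degenerate)
```

In the degenerate case the depth z = 2 is added entirely to the vertical coordinate, so it cannot be separated from height; any intermediate angle preserves a distinguishable depth component, at the cost of the information loss discussed above.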

As already pointed out, there is a conjugate relation (or causality) between the object point and its image point. When an observer sees a three-dimensional scene I_wr = I_w(V_x, V_y, V_z; θ, φ, λ, t), the visual system forms the retinal image of I_wr according to

That is, the actual scene is I_w(V_x, V_y, V_z; θ, φ, λ, t);

That is, through the plenoptic function I_w(V_x, V_y, V_z; θ, φ, λ, t), the actual scene I_wr forms a visual image I_v(V_x, V_y, V_z; θ, φ, λ, t), establishing the correspondence between I_wr and the visual image.

Of course, this is largely a proof of principle, but this discussion demonstrates that it can be used for studies in visual information processing.

It has been confirmed in many eye-tracker tests, including psychophysical experiments, that the visual system can adjust with eye movements to find a suitable viewing angle and orientation so that the loss of information is minimal

Loss of information due to the introduction of a three-dimensional Cartesian coordinate system in the plane

Figure

When the angle

**When the angle between the x-axis and z-axis is not the same as in the Cartesian coordinate system, the spatial relationships among these axes and visual perception are also different. (A)** The included angle varies from **(a)** to 90° in **(e)**. In case **(e)**, there is no stereoscopic perception. **(B)** The included angle varies from **(a)** to 120° in **(e)**. In case **(a)**, there is no stereoscopic perception. **(C)** The included angle varies from **(e)** to 120° in **(a)**, obtained by rotating **(A)** 90° in the vertical direction and turning it 30° in the horizontal direction. In case **(e)**, there is no three-dimensional perception.

For Figures

Role of the vanishing point in stereoscopic visual perception

Figure

Vanishing points: (a) one vanishing point, (b) two vanishing points, and (c) three vanishing points

**Vanishing points: (a) one vanishing point, (b) two vanishing points, and (c) three vanishing points.** Each blue cube face marks the three-dimensional Cartesian rectangular coordinate system of the x, y, and z axes and the visual perception of the mutually perpendicular structure between them. In various modern city buildings and green landscapes, photographs and actual scenes from different perspectives can exhibit these three vanishing-point types.

The existence of the vanishing point is the fundamental reason why a Cartesian three-dimensional rectangular coordinate system can be drawn in a two-dimensional plane. As mentioned above, it can be easily seen that the formation of vanishing points underlies the optical system of human vision (in principle, see Figure

Dual-plane parameterization of the plenoptic function for neural computation of early vision

We know that each pixel of a two-dimensional digital image is a record of the intensity of all light that reaches this point, but does not distinguish between the directions of the light rays. It is just a projection of the light field of the three-dimensional structure, with lost information about phase and direction. Unlike this, the light field refers to the collection of light from any point in space in an arbitrary direction. It comprises all light from different angles that makes a contribution to each pixel. If it takes into account how the angle of light changes with time (

Studies by Zeki, Livingstone et al. have indicated that in the human visual system, color information is transmitted in a separate channel in the cerebral cortex. Thus I_v = I_v(V_x, V_y, V_z; θ, φ, λ, t) can describe and reconstruct plenoptic images, or visual information of the objective world, with different combinations of variables.

When the viewer’s eyes are looking at a point in any scene, emitted or reflected light rays from this point will enter the eye. The intensity information of the incident light carried in V_x, V_y, V_z is received by the eye. Since the optical axis and the coordinate z-axis coincide, the light intensity of the stimulus is converted into the strength of photosensitive cell activity. Therefore, only the angles

Nested representation of the dual-plane

**Nested representation of the dual-plane p(u, v) and q(θ, φ) parameterization for the retina and the primary visual cortex.**

Three-dimensional visual perceptions of images in a two-dimensional plane

We know that if an image of a scene on a plane does not contain depth information, the human visual system has no way of perceiving the scene three-dimensionally. When observing the external world, human vision has characteristics of perceptual constancy (e.g., size, color, and shape constancy). This constancy is the basis of an affine transformation, which depends on vanishing points and vanishing lines in visual perception and is determined by the characteristics of the optical system of the visual pathway. As Rock pointed out, the height of an object in the base plane is an important depth cue. It can be calculated according to

where
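The height-in-the-plane depth cue can be illustrated with a simple pinhole similar-triangles sketch. Formulae (8) and (9) from the text are not reproduced here; the function name and the numbers (focal length, tree height) are illustrative assumptions.

```python
# Pinhole model: an object of known physical height H subtends image height
# h at focal length f, and similar triangles give its depth. Under size
# constancy, image height shrinks linearly with distance, which is the
# linear property used for the trees in the Figure.

def depth_from_height(f_mm, H_m, h_mm):
    """Depth (m) of an object of real height H_m whose image height is h_mm."""
    return f_mm * H_m / h_mm

# A 4 m tree imaged at 2 mm with a 35 mm focal length lies about 70 m away;
# halving the image height doubles the estimated depth.
print(depth_from_height(35.0, 4.0, 2.0))   # 70.0
print(depth_from_height(35.0, 4.0, 1.0))   # 140.0
```

The inverse-linear relation between image height and depth is exactly why the proportionately reduced insets in the dormitory photograph allow the tree distances to be read off.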

For the sake of simplicity, we analyze only the example (taken from the literature

The image plane was tilted and the camera height was 0.87 m

**The image plane was tilted and the camera height was 0.87 m.** The location of the picture is the front of the No. 8 student dormitory building at Beijing Jiaotong University [37]. The insets **a**, **b**, **c**, **d** in the Figure are reduced almost proportionately, which reflects the linear property of size constancy; from this linear property we can calculate the depth distances of these trees in the Figure

The main purpose of the calculation example is to show that we can use the vanishing point, size constancy and affine transformation model in Figure

The example focuses on the absolute depth perception of white markers, edges on the ground and nine trees (see Figure

Computation results of depth distance of trees in Figure

**Computation results of depth distance of trees in the Figure according to the size constancy of visual perception.**

Specific calculations are carried out employing two methods. The first method employs psychological methods based on formulae (8) and (9), and the second method employs an affine transformation based on an optical model of vision (Figure

The results of both calculation methods are consistent with actual measurement results, showing that the calculation methods are reasonable and reflect the consistency between visual psychology and the optical system of visual pathways in the depth perception of an actual scene. More importantly, the results show that a two-dimensional image can contain rich three-dimensional information that is perceived by the visual system itself.

We know that when looking at an image or a scene from different angles, the perceived depth of field changes. To show depth information provided by constancy and the affine transformation in a two-dimensional image plane (see the model in Figure

where θ (in degrees) is the included angle between the z-axis and

The proposed method is completely different from three-dimensional image reconstruction that uses binocular disparity and corresponding points in the field of visual computational theory, or three-dimensional reconstruction using corresponding points in two images taken by two cameras in the field of computer vision. The processing method of visual perception has advantages

In Appendix 1, according to Figure

Discussion

This article explores how the human visual system extracts depth information from an image of a scene in a Cartesian rectangular coordinate system on a two-dimensional plane. We introduced the concepts of a plenoptic function in the optical system of the visual pathway. In the Methods section “Computational approach in visual cortex V1”, we proposed a coincidence-test algorithm, in which an image primitive [p_{u,v}]_{U × V} is compared with the receptive-field matrix [q_{θ,φ}]_{Θ × Φ} in cortical columns.

Note that all of the neurons in the columns carry out the coincidence-test operations simultaneously and in parallel; the neuron in [q_{θ,φ}]_{Θ × Φ} that is most consistent with the image primitive [p_{u,v}]_{U × V} is selected.

Based on the biological function and structure of the visual pathway and the primary visual cortex, we proposed the dual-parameterized method, which can be expressed as I_v = [p_{u,v}]_{U × V} ⊗ [q_{θ,φ}]_{Θ × Φ}, corresponding to formula (12), as described below.

In this paper, we have raised the issue: why, in a two-dimensional plane, can the three-dimensional structure of a picture be expressed by adopting a Cartesian coordinate system? Its importance lies in studying the information processing from a 2D retinal image to three-dimensional visual perception. Our treatment is based on the neural computation of visual cortex V1, takes into account the affine transformation of visual image information and the size constancy of visual perception, and also considers the findings of psychophysics. However, formula (8) and Figure

We know that the reconstruction of a visual image is a hard inverse problem and a major topic of research in computer vision, where the concern is how to use binocular disparity information (i.e., corresponding points in dual camera images) to find stable and efficient reconstruction algorithms. It is also an issue for current 3D display technology, where the focus is on providing an effective method for better 3D displays. It is, of course, also a hard problem for research on biological vision, which mainly starts from a unified basic viewpoint of the biological function and structure of vision and then explores how the human visual system achieves the following information processing: from retinal images of three-dimensional scenes → 2D visual image → 3D visual perception. In the section “Mapping between the scene and visual image” of this paper, this issue was discussed in more detail; formulas (6) and (7) showed that there is no specific reconstruction algorithm from 2D retinal images to a three-dimensional scene. At present, the processing time of the brain for an image has been determined using rapid serial visual presentation of image series and cognitive psychological methods; it is just 13 ms

According to Figures

1. A picture in which there is no vanishing point;

2. The alternating process of the Cartesian coordinate system and the affine coordinate system;

3. The Moon illusion (see Appendix 1 for details

We have reason to believe that the rough outline of a theory of three-dimensional visual perception in the visual pathway is now generally clear.

Conclusion

We know that there are many monocular depth cues (e.g., perspective scaling, linear perspective, texture gradient, atmospheric perspective, occlusion, light and shade, color, and image hierarchy structure) that can also form depth perception. However, in this paper, we study how to express stereoscopic visual perception in a two-dimensional plane and only use the parameterized method of a dual plane of the plenoptic function to process the visual information of an image.

According to the principle of graceful degradation proposed by Marr

We have studied this issue, and to answer Marr’s question, this paper presents a preliminary explanation. The main results are as follows:

1. Two different plenoptic functions to describe the objective world were introduced. The difference between these two functions, I_w and I_v, regarding the external scene obtained by visual perception was analyzed, and their specific applications in visual perception were discussed.

2. The main results were how the processing of visual depth information perceived in stereoscopic scenes can be displayed in a two-dimensional plane. Constraints for the coordinates and an algorithm implementation were also provided, in particular, a method used to separate the plenoptic function and a transformation from the retina to the visual cortex. A dual-plane parameterized method and its features in neural computing from the visual pathway to visual cortex V1 were discussed. Numerical experiments showed that the advantages of this method are efficiency, simplicity, and robustness.

3. Size constancy, a vanishing point, and vanishing line form the psychophysiological basis for visual perception of the external world, as well as the introduction of the three-dimensional Cartesian rectangular coordinate system into a two-dimensional plane. This study revealed the corresponding relationship between perceptual constancy, the optical system of vision, and the mapping of the vanishing point and line in the visual image on the retina.

The main results of this paper are a preliminary explanation as to why and how the Cartesian rectangular coordinate system can be introduced into a two-dimensional plane, and how a three-dimensional scene can be perceived in a two-dimensional plane. The results of this study are of significance in visual depth perception and possibly in applications of computational vision.

Methods

Computational approach in visual cortex V1

The adopted dual-plane parameterized representation makes the mathematical form of the visual pathway and primary visual cortex neural computation more concise and intuitive. More specifically, the retina may be represented by the plane p(u, v), divided into patches whose number is about 10^{6}; hence, M × N ≈ 10^{6}. As pointed out in

The retina p(u, v) plane is divided into M × N = 10^{6} patches according to the receptive field sizes of ganglion cells

**The retina p(u, v) plane is divided into M × N = 10**^{6 }**patches according to the receptive field sizes of ganglion cells.** A simpler approach is based on the complexity of the image, that is, according to the distribution of basic features (lines, corners, and curves, for example) in an image. An A4-sized image can be divided into 128 or 64 patches in the first stage; at the second stage, each patch of the image can be divided into 8, 16, and so on. It is worth noting that when the total number of first-stage divisions is large, the total number of second-stage divisions should be small

The entire image in the retina can be represented using the following matrix:

Ganglion cells transmit neural firing spike trains to the LGN. Then, similarly, magnocellular and parvocellular cells in the LGN transmit information about the image patches into layers 4C_{α} (magnocellular) and 4C_{β} (parvocellular) of the fourth layer of the V1 cortex. Naturally, these coded neural firing spike trains need to be decoded, and the information about their image primitives needs to be restored. A neural decoding circuit with 40 Hz synchronous oscillation accomplishes this task
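The patch representation described above can be sketched numerically. Tiny dimensions stand in for the roughly 10^6 retinal patches; the function name and sizes are assumptions for illustration.

```python
import numpy as np

# Tile the retinal plane p(u, v) into an M x N grid of patches, so the whole
# image becomes a matrix whose entries are the patches themselves.

def tile_image(img, patch_h, patch_w):
    """Split img (H x W) into an (H//patch_h) x (W//patch_w) grid of patches."""
    H, W = img.shape
    M, N = H // patch_h, W // patch_w
    return img[: M * patch_h, : N * patch_w].reshape(
        M, patch_h, N, patch_w).swapaxes(1, 2)   # shape (M, N, ph, pw)

img = np.arange(64, dtype=float).reshape(8, 8)   # stand-in retinal image
patches = tile_image(img, 2, 2)
print(patches.shape)   # (4, 4, 2, 2): a 4 x 4 matrix of 2 x 2 patches
```

Each entry `patches[i, j]` is one receptive-field-sized patch; in the text's terms, it is the image primitive handed downstream for decoding.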

In cortex V1, the receptive fields of simple and complex cells are bar-shaped patterns with orientation and bandwidth selectivity. The sizes of these receptive fields are about 20–50 μm. Their orientation and maximum resolutions are about 10° and 0.25°, respectively. Hence, their line resolution is between 5.0 and 100 μm

Accordingly, the V1 cortex is represented by the plane

V1 cortical columns as the basic components of the information processing unit

**V1 cortical columns as the basic components of the information processing unit. (a)** Neuron receptive fields with eight typical shapes. **(b)** The functional column from 0° to 180°, divided into 18 different orientations at 10° intervals

Similarly, the functional column shown in Figure

where [p_{u,v}]_{U × V} and [q_{θ,φ}]_{Θ × Φ} are given by

The neurobiological significance of the Kronecker product ⊗ between the two matrices [p_{u,v}]_{U × V} and [q_{θ,φ}]_{Θ × Φ} lies in the assumption that these functional columns have the same information-processing function and each functional column consists of many receptive fields with different directions and frequencies
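A minimal numerical illustration of this Kronecker-product structure (the matrices are small stand-ins, not data from the paper): each entry of the patch matrix is paired with a full copy of the receptive-field matrix, mirroring the assumption that every functional column applies the same bank of fields to its patch.

```python
import numpy as np

P = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # stand-in for the patch matrix [p_{u,v}]
Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # stand-in for the field matrix [q_{theta,phi}]

# np.kron builds the block matrix whose (i, j) block is P[i, j] * Q:
# every patch entry scales its own copy of the shared receptive-field bank.
K = np.kron(P, Q)
print(K.shape)   # (4, 4)
```

The block structure is the point: the field matrix Q appears once per patch entry, which is exactly the "identical columns, many oriented fields each" assumption stated above.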

According to Figure

Appendix

Appendix 1 of the section 5 of text

From Figures

Image without a vanishing point

In a typical case, there are one, two or three vanishing points in a scene graph, as shown in Figure

Graphs with no vanishing points, depth cues or stereoscopic information

**Graphs with no vanishing points, depth cues or stereoscopic information.**

Alternating use of a Cartesian coordinate system and affine coordinate system

According to Equation (13),

The mapping from a Cartesian coordinate system to an affine coordinate system is a gradual process in which

Relationship among the Cartesian coordinate system, projective coordinate system and affine coordinate system

**Relationship among the Cartesian coordinate system, projective coordinate system and affine coordinate system.**

The inversion of a Necker cube, a well-known problem of stereoscopic perception, can be explained by the alternation of a Cartesian coordinate system and an affine coordinate system. The Necker cube has a constant perspective angle; i.e., each of the four sides of a Necker cube (see Figure

Reversion phenomenon and three-dimensional visual perception of a Necker cube

**Reversion phenomenon and three-dimensional visual perception of a Necker cube.**

There seems to be no vanishing point in Figure

Moon Illusion

The Moon and Sun appear larger on the horizon than at zenith, which is a phenomenon known as the Moon illusion. There are many research findings and interpretations for this problem. However, we believe that the Moon and Sun on the horizon are simply on the lower part of the vanishing line in Figure

Schematic diagram of the Moon illusion

**Schematic diagram of the Moon illusion.**

Because of the combined effects of the constancy of visual perception and the optical property of vision that far objects appear smaller and near objects larger, the Moon (or Sun) high in the sky is perceived to be further from the observer, and its area is thus perceived to be smaller. Existing experimental and calculated results indicate that the Moon on the horizon is visually perceived to be 1.5 to 1.7 times as large as that in the sky

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ZS proposed and conceived the study and wrote a first draft. QJ, ZQ, and LC took part in designing the study and contributed to the comparative analysis. LX and SS took part in the numerical calculations, verification and analysis of the data and drew all the illustrations. All authors discussed and modified the revised manuscript and all authors have accepted the final version.

Acknowledgements

This research was supported by the Natural Science Foundation of China (No.: 61271425). The authors would like to thank Dr. Wu Aimin for citing his research work from Ref