If you're interested in this but have no background, the best place to start is "Fully Convolutional Networks for Semantic Segmentation" – https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn...
This is a very active field of research. Another thread worth pulling on is Mask R-CNN: https://arxiv.org/abs/1703.06870
It's not quite as simple as "this one has highest mAP, let's use it"; the tradeoffs are complex. In particular, as you can see in the image here, one thing DeepLab doesn't do is segment instances – so you get a mask of "people", not a mask per person. Mask R-CNN does a better job on that by design, because it predicts both bounding boxes and a mask per bounding box.
Link to Arxiv (DeepLabv2): https://arxiv.org/abs/1606.00915
Link to Arxiv (DeepLabv3): https://arxiv.org/abs/1706.05587
Link to GitHub: https://github.com/tensorflow/models/tree/master/research/de...
The README there has a very neat TL;DR of each version:
"DeepLabv1 : We use atrous convolution [a shorthand for 'convolution with upsampled filters'] to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks.
DeepLabv2 : We use atrous spatial pyramid pooling (ASPP) ['a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution'] to robustly segment objects at multiple scales with filters at multiple sampling rates and effective fields-of-views.
DeepLabv3 : We augment the ASPP module with image-level features [5, 6] to capture longer range information. We also include batch normalization parameters to facilitate the training. In particular, we apply atrous convolution to extract output features at different output strides during training and evaluation, which efficiently enables training BN at output stride = 16 and attains a high performance at output stride = 8 during evaluation.
DeepLabv3+ : We extend DeepLabv3 to include a simple yet effective decoder module to refine the segmentation results especially along object boundaries. Furthermore, in this encoder-decoder structure one can arbitrarily control the resolution of extracted encoder features by atrous convolution to trade-off precision and runtime."
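Since "atrous convolution" and ASPP are the core trick behind every version, a tiny sketch may help make the README concrete. This is a minimal 1-D NumPy illustration ('valid' padding; `atrous_conv1d` and `aspp_1d` are my own toy helpers, not anything from the DeepLab code), showing how a dilation rate widens the field of view of a fixed 3-tap filter, and how ASPP runs the same filter at several rates in parallel:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.

    Equivalent to convolving with a filter whose taps are spaced
    `rate` samples apart -- the 'upsampled filter' of the README.
    """
    k = len(w)
    span = (k - 1) * rate + 1          # effective field of view
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * rate] for j in range(k))
        for i in range(out_len)
    ])

def aspp_1d(x, w, rates=(1, 2, 4)):
    """Toy ASPP: apply the same filter at several rates in parallel
    and stack the (cropped) outputs, one row per rate."""
    outs = [atrous_conv1d(x, w, r) for r in rates]
    n = min(len(o) for o in outs)
    return np.stack([o[:n] for o in outs])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])

# rate=1 is an ordinary convolution over 3 consecutive samples;
# rate=2 covers a 5-sample window with the same 3 weights.
print(atrous_conv1d(x, w, rate=1))  # [ 3.  6.  9. 12. 15. 18. 21. 24.]
print(atrous_conv1d(x, w, rate=2))  # [ 6.  9. 12. 15. 18. 21.]
print(aspp_1d(x, w).shape)          # (3, 2): 3 rates, cropped to shortest
```

The real thing is 2-D and the ASPP branches are concatenated along the channel axis and fused by a 1x1 convolution, but the resampling-at-multiple-rates idea is exactly this.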
Congratulations, DeepLabv3+ finally discovered that the U-Net architecture, first proposed 3 years ago, is more efficient than the flat architecture they used before.
DeepLabv3+ is still a wildly inefficient network structure, but it undeniably works, if you can afford the computational resources. Just keep in mind you can achieve similar results (within 1% mIOU) with much leaner structures.
Is this fast enough to be used as a background removal in live streams?