Architectures & Encoders
SemanticSeg4EO supports 20+ segmentation architectures from three categories: built-in, SMP-based, and modern Transformer models. This page documents each architecture, its requirements, and recommended use cases.
Architecture Overview
| Architecture | Category | Requires | Notes |
|---|---|---|---|
| unet-dropout | Built-in | (none) | Simple U-Net with configurable dropout. Works without any optional dependencies. Good for quick experiments. |
| unet | SMP | smp | Classic U-Net with skip connections. Reliable baseline. |
|  | SMP | smp | Nested U-Net with dense skip connections. Often outperforms plain U-Net. |
|  | SMP | smp | Multi-scale Attention Network. Good for multi-scale features. |
| linknet | SMP | smp | Lightweight encoder-decoder. Faster than U-Net. |
| fpn | SMP | smp | Feature Pyramid Network. Effective for multi-scale objects. |
|  | SMP | smp | Pyramid Scene Parsing Network. Good for scene-level context. |
|  | SMP | smp | Path Aggregation Network. Strong multi-scale aggregation. |
|  | SMP | smp | DeepLab v3 with ASPP. Strong contextual reasoning. |
| deeplabv3+ | SMP | smp | DeepLab v3+ with improved decoder. Often the best accuracy for dense prediction. |
|  | Modern | transformers | Smallest SegFormer. Fast, good accuracy-speed trade-off. |
|  | Modern | transformers | SegFormer B1. |
| segformer-b2 | Modern | transformers | SegFormer B2. Recommended Transformer baseline. |
|  | Modern | transformers | SegFormer B3. |
|  | Modern | transformers | SegFormer B4. |
|  | Modern | transformers | Largest SegFormer. Highest accuracy, most memory. |
|  | Modern | smp | U-Net decoder with Transformer encoder. Good hybrid approach. |
|  | Modern | timm | HRNet-W18. High-resolution feature maps, width 18. |
| hrnet-w32 | Modern | timm | HRNet-W32. Balanced accuracy/speed. |
| hrnet-w48 | Modern | timm | HRNet-W48. Highest accuracy in the HRNet family. |
|  | Modern | timm | Swin Transformer U-Net. Strong results reported for medical and EO segmentation. |
Note
smp = segmentation-models-pytorch, transformers = HuggingFace Transformers,
timm = PyTorch Image Models. Install these in your external environment
(see Environment Setup).
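Before selecting an SMP or Transformer architecture, it can help to confirm that the optional packages are importable in the external environment. The following is a minimal sketch using only the standard library; the helper name missing_packages and the OPTIONAL_PACKAGES mapping are illustrative and are not part of the plugin. It has to be run with the same Python interpreter that the plugin uses for training.

```python
import importlib.util

# Short names used in the Requires column, mapped to their import names.
OPTIONAL_PACKAGES = {
    "smp": "segmentation_models_pytorch",
    "transformers": "transformers",
    "timm": "timm",
}


def missing_packages(required):
    """Return the short names in `required` whose package cannot be found.

    Hypothetical helper for illustration; it only checks that the package
    is discoverable, not that its version is compatible.
    """
    return [
        name
        for name in required
        if importlib.util.find_spec(OPTIONAL_PACKAGES[name]) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(["smp", "transformers", "timm"])
    print("Missing optional packages:", missing or "none")
```

An empty result means every requested package was found; any name returned still needs to be installed with pip.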
Built-in Architecture
unet-dropout
A simple U-Net implementation built directly into the plugin, requiring no optional dependencies. It includes:
Configurable dropout at each decoder level (set via Dropout Rate)
4 encoder levels with max-pooling
4 decoder levels with skip connections and bilinear upsampling
Works with any number of input channels
When to use: quick experiments, CPU-only setups, or when you cannot install
segmentation-models-pytorch.
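Four max-pooling levels halve the spatial size of a tile four times. Assuming the usual 2x2 pooling with stride 2 (a standard U-Net choice that is not confirmed here for the plugin), tile sizes that are multiples of 16 pass through the encoder without rounding. The sketch below shows the arithmetic; both function names are hypothetical.

```python
def unet_level_sizes(tile_size, levels=4):
    """Spatial size of the feature map at the input and after each pooling step.

    Assumes 2x2 max-pooling with stride 2, which floors odd sizes.
    """
    sizes = [tile_size]
    for _ in range(levels):
        sizes.append(sizes[-1] // 2)
    return sizes


def is_safe_tile_size(tile_size, levels=4):
    """True if the tile can be halved `levels` times without rounding."""
    return tile_size % (2 ** levels) == 0


if __name__ == "__main__":
    for size in (256, 250):
        print(size, unet_level_sizes(size), is_safe_tile_size(size))
```

A 250-pixel tile shrinks to 125, 62, 31, and 15, so the upsampled decoder maps no longer match the skip connections exactly unless the implementation pads or resizes them.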
SMP Architectures
All SMP architectures use the segmentation-models-pytorch library and support:
Dozens of encoder backbones (see Encoders below)
ImageNet pretrained weights
Configurable input channels (encoder first layer is adapted automatically)
Install SMP in your environment:
pip install segmentation-models-pytorch
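The automatic first-layer adaptation matters for EO imagery, which often has more or fewer than three bands. The sketch below illustrates one common way to reuse RGB-pretrained weights, modeled on how SMP is understood to patch its first convolution; the exact behaviour may differ between SMP versions, and adapt_first_layer is a hypothetical name. Each float stands in for the kernel slice belonging to one input channel.

```python
def adapt_first_layer(rgb_weights, in_channels):
    """Map per-channel pretrained weights onto `in_channels` input bands.

    Illustrative only: real encoders hold a 4-D weight tensor, and each
    entry of `rgb_weights` stands in for one input-channel slice of it.
    """
    n = len(rgb_weights)  # 3 for ImageNet-pretrained encoders
    if in_channels == n:
        return list(rgb_weights)
    if in_channels == 1:
        # A single band receives the sum of the pretrained channels.
        return [sum(rgb_weights)]
    # Otherwise reuse the pretrained channels cyclically and rescale so the
    # overall response magnitude stays roughly comparable.
    scale = n / in_channels
    return [rgb_weights[i % n] * scale for i in range(in_channels)]


if __name__ == "__main__":
    print(adapt_first_layer([1.0, 2.0, 4.0], 4))
```

With four bands, the fourth band starts from a scaled copy of the first pretrained channel, so all bands begin with informative weights instead of random ones.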
Modern / Transformer Architectures
Except for UNetFormer, which takes an SMP encoder, these architectures are self-contained and do not use a separate encoder. The Encoder dropdown is hidden automatically when a self-contained architecture is selected.
SegFormer (b0–b5)
SegFormer is a Transformer-based segmentation model from NVIDIA, characterized by:
A hierarchical Transformer encoder (Mix Vision Transformer)
A lightweight MLP decoder
Competitive performance on standard benchmarks
All six variants (b0–b5) are available, corresponding to increasing model sizes.
Requires: pip install transformers
HRNet (w18, w32, w48)
HRNet (High-Resolution Network) maintains high-resolution feature representations throughout the network, making it excellent for segmentation tasks requiring fine spatial detail.
w18/w32/w48 refer to the width (number of channels) of the highest-resolution stream
A larger width means more parameters and generally higher accuracy, at the cost of memory and speed
Requires: pip install timm
SwinUNet
SwinUNet is a pure Transformer architecture that uses Swin Transformer blocks in both the encoder and the decoder, with U-Net-style skip connections between them.
Requires: pip install timm
UNetFormer
UNetFormer combines a Transformer encoder with the U-Net decoder structure. It uses a standard SMP encoder (see Encoders) as the backbone.
Requires: pip install segmentation-models-pytorch
Encoders
For SMP-based architectures and UNetFormer, you select an encoder backbone from the Encoder dropdown. The encoder is the feature extraction network; the decoder learns to combine those features into a segmentation mask.
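The rule for when an encoder applies can be written as a small predicate. This is a sketch of the rule as described on this page, not the plugin's actual code: uses_encoder is a hypothetical name, and the "swin" prefix is an assumption about how the SwinUNet identifier is spelled.

```python
# Name prefixes of architectures documented as self-contained.
# "segformer" and "hrnet" match identifiers used on this page;
# "swin" is an assumed prefix for the SwinUNet identifier.
SELF_CONTAINED_PREFIXES = ("segformer", "hrnet", "swin")


def uses_encoder(architecture):
    """Return True if the architecture takes a separate encoder backbone."""
    name = architecture.strip().lower()
    if name == "unet-dropout":
        return False  # built-in model, no SMP encoder
    return not name.startswith(SELF_CONTAINED_PREFIXES)


if __name__ == "__main__":
    for arch in ("unet", "deeplabv3+", "segformer-b2", "hrnet-w48", "unet-dropout"):
        print(arch, "->", "encoder" if uses_encoder(arch) else "no encoder")
```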
Encoder families available:
ResNet family (most common baseline):
resnet18, resnet34, resnet50, resnet101, resnet152
ResNeXt family:
resnext50_32x4d, resnext101_32x4d, resnext101_32x8d
SE-Net family (squeeze-and-excitation):
se_resnet50, se_resnet101, se_resnet152
se_resnext50_32x4d, se_resnext101_32x4d
senet154
EfficientNet family:
efficientnet-b0 → efficientnet-b7
timm-efficientnet-b0 → timm-efficientnet-l2
ResNeSt family (via timm):
timm-resnest14d, timm-resnest26d, timm-resnest50d
timm-resnest101e, timm-resnest200e, timm-resnest269e
DenseNet family:
densenet121, densenet169, densenet201, densenet161
Inception family:
inceptionresnetv2, inceptionv4
MobileNet:
mobilenet_v2
DPN family:
dpn68, dpn68b, dpn92, dpn98, dpn107, dpn131
VGG family:
vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn
Mix Vision Transformer (SegFormer backbone via SMP):
mit_b0, mit_b1, mit_b2, mit_b3, mit_b4, mit_b5
MobileOne:
mobileone_s0, mobileone_s1, mobileone_s2, mobileone_s3, mobileone_s4
ConvNeXt (U-Net only, requires timm):
convnext_tiny, convnext_small, convnext_base
convnext_large, convnext_xlarge
Encoder Selection Guide
| Use case | Recommended encoder |
|---|---|
| Quick experiments / CPU |  |
| Balanced (recommended default) |  |
| Best accuracy (GPU required) |  |
| Very high accuracy, needs timm |  |
Which Architecture Should I Choose?
As a starting point:
First experiment → unet-dropout (no dependencies, fast)
Standard baseline → unet or deeplabv3+ with resnet34
Best accuracy → segformer-b2 or hrnet-w32 (requires transformers/timm)
Limited GPU memory → fpn or linknet with efficientnet-b0
Fine spatial detail → hrnet-w32 or hrnet-w48
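For scripted experiments, the starting points above can be kept as a small lookup table. The sketch below only restates the recommendations from this section; the STARTING_POINTS name, the suggest helper, and the dictionary layout are illustrative choices, not a plugin API.

```python
# Starting points from this section. "encoder" is None where the
# recommendation does not name one.
STARTING_POINTS = {
    "first experiment": {"architectures": ["unet-dropout"], "encoder": None},
    "standard baseline": {"architectures": ["unet", "deeplabv3+"], "encoder": "resnet34"},
    "best accuracy": {"architectures": ["segformer-b2", "hrnet-w32"], "encoder": None},
    "limited gpu memory": {"architectures": ["fpn", "linknet"], "encoder": "efficientnet-b0"},
    "fine spatial detail": {"architectures": ["hrnet-w32", "hrnet-w48"], "encoder": None},
}


def suggest(goal):
    """Look up the suggested architectures (and encoder, if any) for a goal."""
    key = goal.strip().lower()
    if key not in STARTING_POINTS:
        raise KeyError(f"Unknown goal {goal!r}; choose from {sorted(STARTING_POINTS)}")
    return STARTING_POINTS[key]


if __name__ == "__main__":
    print(suggest("Standard baseline"))
```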