I would like to understand how the MMIO works on ARM architecture.
I realized that ARM provides 1:1 mapping from physical address to specific peripheral. For example, to manage the GPIOX on arm, for example in Raspberry Pi, the processor accesses the specific physical addresses (seems that preconfigured by the manufacturer?) without configuring some registers beforehand.
I thought that there are some specific BAR register that arbitrates the read/write request to specific physical addresses to peripherals. However, when I check the spec for BCM2835 (Raspberry pi 3), the physical addresses are translated by another MMU called VC/ARM MMU. Is it a common design to have another MMU that translates the physical addresses to bus addresses in ARM architecture?

Also, I was wondering how the SMMU (IOMMU in x86) is utilized in this concept. I found one article mentioning that the VC/ARM MMU is an example of SMMU but I think that is not true? When I check some monitor code & kernel driver code implementation (not raspberry pi), it seems that the SMMU is also mapped to specific physical addresses and the monitor/kernel uses those addresses to initialize and communicate with SMMU. If the arm architecture utilize VC/ARM MMU as an SMMU, how that physical address mapping for SMMU itself can be accessed to initialize the SMMU..?
Lastly, I thought that all peripherals are managed by the SMMU. If some peripherals are always mapped to fixed physical addresses, what is the role of SMMU? Why some peripherals are communicated with PE (CPU) through fixed physical addresses, and some are communicated with SMMU..? How exactly the peripherals and PE (Processors) can communicate in ARM architecture..?
I am just answering your questions. There are many things wrong with some assumptions you have, which Old Timer tries to clarify.
No. Broadcom has an architecture license. They designed there own HDL code for the ARM CPU and system details can be different. In fact, it is quite common for even direct license (using ARM HDL) that the systems differ from vendor to vendor.
The typical solution to this is 'TrustZone'. Here, the access is defined as having a 'secure' or 'normal' access. The master (CPU) tags the access and either the peripheral or a bus access controller permits or prohibits access to the peripheral. These are best statically mapped at boot time and locked.
By defining permitted use cases, the peripheral/master access patterns can be defined for the system. The complication is 'dynamic' peripherals and masters. The ARM TrustZone CPU is dynamic. A 'world switch' changes the CPU access and there are attacks on the communication interface between 'normal' and 'secure' worlds.
The ARM AXI/AHB buses were originally designed for embedded devices. PCs in contrast have dynamic buses ISA->EISA->PCMIA->PCI (etc.) These addresses are typically dynamic. Note However, this bus structure has an expense. So some of your question is like asking why isn't ARM just like an x86 PC. They had different goals and they are different. You can't put a new grahpics card into your cell phone.
Reference:
Note: It is from the modular nature of the PC system which was part of why the PC dominated Apple, Commodore, Atari, etc. in my opinion (the other aspect was piracy), contrary to others who think the answer is Microsoft. The hardware matters.