I was reading about the pros and cons of the split vs. unified cache design in this thread.
Based on my understanding, the primary advantage of the split design is that it lets us place the instruction cache close to the instruction-fetch unit and the data cache close to the memory unit, thereby reducing the latencies of both at the same time. The primary disadvantage is that the combined space of the instruction and data caches may not be utilized efficiently: simulations have shown that a unified cache of the same total size has a higher hit rate.
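To make the hit-rate comparison concrete, here's a toy simulator sketch (my own, not from the linked thread). It models a split 32 KiB L1i + 32 KiB L1d against a single 64 KiB unified cache, both 8-way set-associative with LRU, on a made-up trace where a 4 KiB loop body streams over a 48 KiB array. The sizes, associativity, and trace are all assumptions chosen for illustration; real results depend on the workload, and this models hit rate only, not the latency/port issues discussed in the answer below.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define LINE 64   /* cache line size in bytes */

/* One set-associative cache with LRU replacement (hit/miss counting only). */
typedef struct {
    int sets, ways;
    uint64_t *tag;   /* sets*ways entries; 0 means "empty" (tags are stored +1) */
    uint64_t *last;  /* sets*ways last-use timestamps for LRU */
    uint64_t tick, hits, accesses;
} Cache;

static Cache cache_new(int size_bytes, int ways) {
    Cache c = {0};
    c.sets = size_bytes / (LINE * ways);
    c.ways = ways;
    c.tag  = calloc((size_t)c.sets * ways, sizeof *c.tag);
    c.last = calloc((size_t)c.sets * ways, sizeof *c.last);
    return c;
}

static void cache_access(Cache *c, uint64_t addr) {
    uint64_t block = addr / LINE;
    int set = (int)(block % (uint64_t)c->sets);
    uint64_t tag = block / (uint64_t)c->sets + 1;
    uint64_t *tags = c->tag  + (size_t)set * c->ways;
    uint64_t *last = c->last + (size_t)set * c->ways;
    c->accesses++;
    c->tick++;
    int victim = 0;
    for (int w = 0; w < c->ways; w++) {
        if (tags[w] == tag) { c->hits++; last[w] = c->tick; return; }  /* hit */
        if (last[w] < last[victim]) victim = w;                        /* track LRU way */
    }
    tags[victim] = tag;    /* miss: fill into the least-recently-used way */
    last[victim] = c->tick;
}

int main(void) {
    /* Made-up workload: a 4 KiB loop body streaming over a 48 KiB array.
     * Each "iteration" is one 16-byte instruction fetch plus one 8-byte load. */
    const uint64_t CODE = 4 << 10, DATA = 48 << 10, ITERS = 4000000;
    Cache l1i = cache_new(32 << 10, 8), l1d = cache_new(32 << 10, 8);  /* split   */
    Cache uni = cache_new(64 << 10, 8);                                /* unified */
    for (uint64_t n = 0; n < ITERS; n++) {
        uint64_t iaddr = (n * 16) % CODE;              /* code loops in a small region   */
        uint64_t daddr = 0x100000 + (n * 8) % DATA;    /* data streams over a larger one */
        cache_access(&l1i, iaddr); cache_access(&uni, iaddr);
        cache_access(&l1d, daddr); cache_access(&uni, daddr);
    }
    printf("split   : %.2f%% hits\n",
           100.0 * (l1i.hits + l1d.hits) / (l1i.accesses + l1d.accesses));
    printf("unified : %.2f%% hits\n", 100.0 * uni.hits / uni.accesses);
    return 0;
}
```

On this particular trace the split design wastes most of the instruction-side capacity on a tiny code footprint while the 48 KiB array thrashes the 32 KiB data side, so the unified cache of the same total size comes out ahead; a workload with a much larger code footprint would narrow the gap.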
However, I couldn't find an intuitive answer to the question: why do L1 caches follow the split design in most modern processors, while L2/L3 caches follow the unified design?
Most of the reason for a split L1 is to distribute the necessary read/write ports (and thus bandwidth) across two caches, and to place them physically close to the load/store and instruction-fetch parts of the pipeline, respectively.
Another reason is that L1d has to handle byte loads/stores (and, on some ISAs, unaligned wider loads/stores). On x86 CPUs, which want to handle those with maximum efficiency (not as an RMW of the containing word(s)), Intel's L1d may use only parity, not ECC. L1i only has to handle fixed-width fetches, often something simple like an aligned 16-byte chunk, and it's always "clean" because it's read-only, so it only needs to detect errors (not correct them) and just re-fetch. So it can have less overhead per line of data, e.g. only a couple of parity bits per 8 or 16 bytes.
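As a rough worked example of that overhead difference (my arithmetic, not figures from any datasheet): SECDED ECC needs the smallest r with 2^r >= data_bits + r + 1, plus one extra bit for double-error detection, which works out to 8 check bits per 64-bit word; parity is 1 bit per whatever granule you choose. A small C sketch, assuming a 64-byte line, ECC on 8-byte granules, and parity on 8- or 16-byte granules:

```c
#include <stdio.h>

/* Smallest number of Hamming check bits r such that 2^r >= m + r + 1,
 * plus 1 for the extra bit that upgrades SEC to SECDED. */
static int secded_bits(int data_bits) {
    int r = 0;
    while ((1 << r) < data_bits + r + 1) r++;
    return r + 1;
}

int main(void) {
    const int line_bits = 64 * 8;                        /* 64-byte cache line */
    /* Assumed granularities: ECC per 8-byte word, parity per 8 or 16 bytes. */
    int ecc_per_line   = (64 / 8) * secded_bits(64);     /* 8 granules * 8 check bits */
    int par8_per_line  = 64 / 8;                         /* 1 parity bit per 8 bytes  */
    int par16_per_line = 64 / 16;                        /* 1 parity bit per 16 bytes */
    printf("SECDED per 8B word : %2d bits/line (%.1f%% overhead)\n",
           ecc_per_line,   100.0 * ecc_per_line   / line_bits);
    printf("parity per 8B      : %2d bits/line (%.1f%% overhead)\n",
           par8_per_line,  100.0 * par8_per_line  / line_bits);
    printf("parity per 16B     : %2d bits/line (%.1f%% overhead)\n",
           par16_per_line, 100.0 * par16_per_line / line_bits);
    return 0;
}
```

That's the difference between roughly 12.5% extra storage per line for SECDED and roughly 1-2% for parity.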
See Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? regarding why it's impossible to build one large unified L1 cache with twice the capacity, the same latency, and the combined bandwidth of a split L1i/L1d. (At the very least it would be prohibitively more expensive in power due to the size and number of read/write ports, but potentially actually impossible for latency because of physical-distance reasons.)
None of those factors matter for L2 (and in the case of unaligned / byte stores they don't exist at all, since L2 is accessed at whole-line granularity on L1 misses and write-backs). What's most useful there is total capacity that can be used for either code or data, competitively shared based on demand.
It would be very rare for any workload to have lots of L1i and L1d misses in the same clock cycle, because frequent code misses mean the front end stalls and the back end runs out of load/store instructions to execute. (Frequent L1i misses are rare, but frequent L1d misses do happen in some normal workloads, e.g. looping over an array that doesn't fit in L1d, or a large hash table or other more scattered access pattern.) Either way, this means data can get most of the total L2 bandwidth budget under normal conditions, and a unified L2 still only needs one read port.
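To make the "array that doesn't fit in L1d" case concrete, here's a minimal C sketch (buffer size and pass count are arbitrary choices): the loop body is a handful of instructions that stay resident in L1i, while the loads stream through a buffer much larger than a typical 32-48 KiB L1d, so essentially all of the L1-miss traffic hitting L2 is on the data side:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* Buffer much larger than a typical 32-48 KiB L1d (size is an arbitrary choice). */
#define BUF_BYTES (4u << 20)   /* 4 MiB */

/* Tiny loop: its code footprint stays hot in L1i, but the sequential loads need
 * a new 64-byte line from beyond L1d on every line, every pass (hardware prefetch
 * may hide the latency, but the lines still have to come from L2 or further out). */
static uint64_t sum_passes(const uint64_t *buf, size_t n, int passes) {
    uint64_t sum = 0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    return sum;
}

int main(void) {
    size_t n = BUF_BYTES / sizeof(uint64_t);
    uint64_t *buf = malloc(BUF_BYTES);
    if (!buf) return 1;
    for (size_t i = 0; i < n; i++) buf[i] = i;   /* initialize (and first-touch) the buffer */
    printf("%llu\n", (unsigned long long)sum_passes(buf, n, 10));
    free(buf);
    return 0;
}
```

On Linux, running it under something like perf stat -e L1-dcache-load-misses,L1-icache-load-misses (exact event names vary by CPU) should show the asymmetry: lots of data-side misses, almost no instruction-side misses.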
@Hadi's answer that you linked does cover most of these reasons, but I guess it doesn't hurt to write a simplified / summary answer.