What I plan on doing:
I want to develop the English accent (without professional training).
Set of axioms behind my reasoning with executive summary:
The following is knowingly oversimplified, sorry for that. I tried to keep the question short.
Part 1 : Understanding how learning works.
At the moment I assume that Broca's area and Wernicke's area must be aware of the language, and that muscle memory together with the existing phonetic alphabet builds the speech. Accents simply form naturally over time through assimilation of the phonetic alphabet.

Using Google, I found that speech shadowing can potentially be used for phonetic symbol assimilation. Muscle memory, on the other hand, can easily be trained by repetitive action. This is most effective if the person is 23-24 years of age and has lots of uninterrupted time on his/her hands, as losing focus can dramatically decrease the gradient of the learning curve. This kind of procedural memory can probably be optimized so that it is flushed into memory with a designed sleep pattern.
Part 2 : Designing behavioral pattern
- Finding a fluent speaker whose accent I want to sound like.
- Distinguishing target accent phonemes and phones.
- Training muscle memory to produce target accent.
Part 3 : Finding a fluent speaker whose accent I want to sound like.
YouTube is a powerful free resource. Sample audio that I thought about picking:
 
Someone Like You - Adele (Cover) in HD.
It does not bother me that it is a high-pitched female voice.
Part 4 : Distinguishing target accent phonemes and phones.
Identifying and judging whether a spoken phone is correct, and how correctly a given text is spoken by a human, is not a trivial task. In fact it seems so complex that I won't bother automating it and will just use IPA as a baseline.
Here is the first verse of the sample audio above, with word stress, in American IPA:

No copyright infringement intended. The image was created with upodn (alternative: photransedit).
Part 5 : Training muscle memory to produce target accent.
Although it is fun to just try to mimic and achieve synchronization, I would prefer building a tool that extracts words as audio files, so I can use Winamp or an iPod to loop and shuffle the words I want.
I imagine that I can use MS Expression Encoder for this.
Question
Given an audio file (e.g. in WAV format, size < 32 MB) and its text equivalent (a finite number of words, e.g. 2000), how can I split it into multiple files that each contain one word? Each word file may contain some excess silence, and the boundary checks can be user-approved. If this cannot be done accurately, what is the best way to get a good estimate of the word boundaries?
The main intention is to reduce the work I would otherwise be doing manually.
 
                        
First of all I would convert the signal from the time domain into the frequency domain by running an FFT over it, in short overlapping windows so that the timing of each sound is preserved. That might allow you to match certain consonant sounds in your text to broadband noise in the FFT. The thing here is that you're not trying to do full speech recognition, just find the best match of signal to text. (I did something similar for document image highlighting back when I was at uni; I didn't need to resort to OCR because I already had the text.) My guess is that looking for dips in amplitude won't help you that much, because some words run into each other.
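To make the FFT idea a bit more concrete, here is a minimal sketch in pure Python. It cuts the signal into short overlapping frames, runs an FFT on each, and reports what fraction of the spectral energy sits in the upper half of the band. Frames where that fraction is high are candidates for noisy consonants such as "s" or "sh". The function names, the frame length, the hop size and the "upper half of the band" split are all my own assumptions for illustration and would need tuning on real speech.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def high_band_ratio(samples, frame_len=256, hop=128):
    """Per-frame fraction of spectral energy in the upper half of the band.

    samples is a list of floats (mono). frame_len must be a power of two.
    The split point (frame_len // 4) is an assumed threshold, not a standard.
    """
    # Hann window to reduce leakage between frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    ratios = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        # Keep only the non-negative frequencies (first half of the spectrum).
        spectrum = fft(frame)[:frame_len // 2]
        power = [abs(c) ** 2 for c in spectrum]
        total = sum(power)
        high = sum(power[frame_len // 4:])
        ratios.append(high / total if total > 0 else 0.0)
    return ratios
```

Frame `i` starts at sample `i * hop`, so a run of high ratios can be mapped back to a time in the recording and compared against where the text says a noisy consonant should be. For a real recording a library FFT would be much faster; the hand-written one is only there to keep the sketch self-contained.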
Here's how I'd approach it for a first attempt:
I'm sure it could be generalized, but that's how I'd attempt it.
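Once you have estimated (and manually approved) the word boundaries, the actual cutting is the easy part. A rough sketch using Python's standard `wave` module, assuming the boundaries arrive as a list of `(word, start_seconds, end_seconds)` tuples; that format, the function name `split_wav` and the padding default are hypothetical choices of mine, not something taken from an existing tool:

```python
import os
import wave


def split_wav(src_path, boundaries, out_dir, pad=0.05):
    """Write one WAV file per (word, start_sec, end_sec) entry.

    pad adds a little extra audio on both sides of each word.
    Returns the list of written file paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for index, (word, start, end) in enumerate(boundaries):
            # Convert seconds to frame indices and clamp to the file length.
            first = min(total, max(0, round((start - pad) * rate)))
            last = min(total, max(first, round((end + pad) * rate)))
            src.setpos(first)
            frames = src.readframes(last - first)
            # Assumes the word is safe to use in a file name.
            path = os.path.join(out_dir, f"{index:04d}_{word}.wav")
            with wave.open(path, "wb") as dst:
                dst.setparams(params)  # frame count in the header is fixed on close
                dst.writeframes(frames)
            paths.append(path)
    return paths
```

The numeric prefix keeps repeated words from overwriting each other and preserves the original order, which is handy if you later want to shuffle or loop the files in a player.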