Voice generation model for conversational scenarios