pythonを使ったpop gen simulation: 第二弾¶

simulationの2回目。前回は乱数を使うことが主な目的だった。今回は改良版です

# standard library
import time

# additional library
import numpy as np
import matplotlib.pyplot as plt
import seaborn

Intro：個体単位のシミュレーションから集団単位のシミュレーションへ¶

これまでのシミュレーションでは、全ての個体のalleleを、ひとつひとつ決めていた。しかし欲しい情報が遺伝子頻度だけで良いのであれば、親集団中の頻度から、直接子孫集団の遺伝子頻度を計算すれば良い。その方が計算速度は速い。

このためにはまず次世代の遺伝子頻度を計算しなければならない

今、集団中の状態をAとaで表そう。Aが祖先型を表し、aはAから派生したものとする。まずはこの遺伝子にはこれ以上の突然変異がおこらないものと仮定して、次世代のallele aの頻度を求めよう。いまaの頻度を$p$で表す。Aおよびaという遺伝子のタイプの違いが生存や生殖に影響しないものとすると、次世代のaタイプの期待頻度は$p$のまま、変わらない。

各個体は確率$p$でタイプaになり、確率$1-p$でタイプAになる。集団全体で考えると平均で$2N\times p$の個体がタイプaを受け継ぐ。

実際の次世代中のタイプaの個体数は二項分布に従う。

二項乱数を用いたシミュレーションの実装¶

size=100
derived=40

derived = np.random.binomial(n=size, p=derived/size)
derived

39

derived = np.random.binomial(n=size, p=derived/size)
derived

37

derived = np.random.binomial(n=size, p=derived/size)
derived

34

# initialization

derived = 40
size = 100

# trajectory
tr = [derived/size]


# 50 generations
for i in range(10):
    
    # make next generation
    derived = np.random.binomial(n=size, p=derived/size)
    tr.append( derived/size )

print(tr)

[0.4, 0.41, 0.37, 0.36, 0.39, 0.36, 0.38, 0.44, 0.46, 0.46, 0.54]

これを何度も走らせてみよう。例えば1000回のシミュレーションを行なう。

sim = []

derived = 40
size = 100

howmany = 1000
end_sim = 1000

start = time.time()

for j in range(howmany):
    
    derived = 40
    tr = [derived/size]
    
    for i in range(end_sim):
        # make the next generation
        derived = np.random.binomial(n=size, p=derived/size)
        tr.append( derived/size )
    
    sim.append(tr)

    
elapsed = time.time()-start
print("elapsed time: {0:.3f} sec".format(elapsed) )

elapsed time: 2.946 sec

なお、ここでは実行時間を計っている。

start = time.time()

でシミュレーションを開始する直前の時間を取得する。

elapsed = time.time()-start

の所でシミュレーション終了時の時間との差分を計算している。time.time()が終了時の時刻を返し、そこから開始時の時刻startを引くことで、実際の実行にかかった時間を計算している。

以前のバージョンを比較すればわかるが、実行速度が速い。以前は集団内の一個体ごとに乱数を発生させて遺伝子を決定していたので、計算量が多かった。今回は集団全体で一度だけ乱数を発生させる。だから速い。

プロットしてみよう

fig=plt.figure()
ax=fig.add_subplot(1,1,1)
for i in range(10):
    ax.plot(sim[i])
ax.set_xlim(0,50)
plt.show()

自然選択の影響¶

自然選択の影響を考えよう。１遺伝子座２対立遺伝子モデルの場合を考える。それぞれの遺伝子型の適応度を以下のように仮定する

祖先型ホモ	ヘテロ	変異型ホモ
AA	Aa	aa
$p^2$	$2pq$	$q^2$
$1$	$1+2hs$	$1+2s$

このときaの増分、$\Delta q$は、 $$ \Delta q = \frac{2 s p q [ q + h (p-q) ]}{1+2 s q (2 h p + q)} $$ であり、次世代の頻度の期待値は $$ q+\Delta q= \frac{q [1+2 s (hp+q)]}{1+2 sq (2hp+q)} $$ で得られる。

以下では近似して、 $$ \Delta q = 2 s p q [ q + h (p-q) ] $$ で、計算する。

このように自然選択が働いた場合でも、二項乱数の頻度として次世代の期待頻度を入れることで二項乱数を利用したシミュレーションを使うことができる。

# initialization

derived = 40
size = 100
# selection coefficient
s = 0.1
h = 0.5

# trajectory
tr = [derived/size]


# 50 generations
for i in range(10):
    
    # calc the effect of selection
    q = derived/size
    p = 1-q
    delta_q = 2*s*p*q*(q+h*(p-q))

    # number of derived alleles in the next generation
    derived = np.random.binomial(n=size, p=q+delta_q)
    
    tr.append( derived/size )

print(tr)

[0.4, 0.52, 0.45, 0.51, 0.6, 0.68, 0.69, 0.71, 0.66, 0.75, 0.76]

s=0.01のとき¶

まず始めにallele a がAに対して1%有利であった場合について、固定確率を計算してみよう。初期頻度は1, すなわち新しい変異が生まれたばかりの状態です。

# generate empty list
sim = []

# patameters
derived = 1
size = 100
s = 0.01
h =0.5

# simulation condition
end_sim = 1000
howmany = 10000

# simulation starts
start = time.time()

for j in range(howmany):
    
    # initialize trajectory data
    derived = 1
    tr = [derived/size]
        
    for i in range(end_sim):

        # calc the effect of selection
        q = derived/size
        p = 1-q
        delta_q = 2*s*p*q*(q+h*(p-q))

        # number of derived alleles in the next generation
        derived = np.random.binomial(n=size, p=q+delta_q)
        tr.append( derived/size )    
    
    # append simulation result
    sim.append(tr)

# simulation ends
elapsed = time.time()-start

# print run-time
print("elapsed time: {0:.3f} sec".format(elapsed) )

elapsed time: 37.789 sec

まずはデータを確認しよう。

data = np.array(sim)

print(data.shape)
print(data.dtype)
print("\n")

print(data[0])
print(data[0].shape)
print(data[0].dtype)

(10000, 1001)
float64


[ 0.01  0.01  0.   ...,  0.    0.    0.  ]
(1001,)
float64

まず始めにnp.array型に変換しその大きさを確認しました。ここでは10000行1001列のリストだということがわかります。多次元になってもリストです。リストのリスト。

次に、含まれているデータの型を確認すると浮動小数点型（float）であることがわかった。ここでfloat64となっているのは、同じ小数点型でも時代によって含まれる情報量が違ったから。最近は64ビットマシンが普通に使えるようになったが、昔は32とか、もっと小さかった。

さて、最初の要素を取り出してみる。その大きさと型を表示させると1001個の小数点型の数字からできているリストだということがわかります。

このデータから固定した変異（集団全体に広まった変異）の割合を求めよう。

last_gen = data.T[-1]

print("Fate of all mutants")
print( last_gen )
print( len(last_gen) )
print("\n")

print("Fixed mutants")
print( last_gen[ last_gen>0 ] )
print( len(last_gen[ last_gen>0 ] ) )
print("\n")

prop = len(last_gen[last_gen>0])/len(last_gen)
print( "proportion of fixed mutants: {0:.3f}".format( prop ) )

Fate of all mutants
[ 0.  0.  0. ...,  0.  0.  0.]
10000


Fixed mutants
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
252


proportion of fixed mutants: 0.025

まず、dataはnp.array型のデータです。このデータの行と列を入れ替えます。

data.T

これで(10000x1001)の行列だったものが(1001x10000)の行列になりました。

元々は行が各シミュレーションで列が世代を表していました。1番左が最初の頻度、右に行くほど時間が経った時の頻度を表していました。入れ替えたあとは1行目が0世代目の若いデータ、1番下すなわち1001行目が1000世代経った時の頻度を表しています。

このデータの1番下の部分をとってくると1000世代経ったあとの各突然変異の運命がわかります。

data.T[-1]

は1番最後のデータ、すなわち1番下のデータをとってくることを意味しています。

1000世代経ったあとに0になっていれば集団から消失し、1になっていれば固定したとみなします。ここでは固定も消失もせずにふらふらしている変異はないものとかんがえています。この固定した割合のことを固定確率といいます。

s=0.05の場合¶

自然選択をもっと強くしてみる。5%有利であった場合はどうだろう？

sim = []

# parameters
derived = 1
size = 100
s = 0.05
h =0.5

# condition
end_sim = 1000
howmany = 10000

# simulation starts
start = time.time()

for j in range(howmany):
    
    # initialize trajectory data
    derived = 1
    tr = [derived/size]
        
    for i in range(end_sim):

        # calc the effect of selection
        q = derived/size
        p = 1-q
        delta_q = 2*s*p*q*(q+h*(p-q))

        # number of derived alleles in the next generation
        derived = np.random.binomial(n=size, p=q+delta_q)
        tr.append( derived/size )    

    # append simulation result
    sim.append(tr)

# simulation ends
elapsed = time.time()-start

# print run-time
print("elapsed time: {0:.3f} sec".format(elapsed) ) 

#
# fixation probability
#
last_gen = np.array(sim).T[-1]
prop = len(last_gen[last_gen>0])/len(last_gen)
print( "proportion of fixed mutants when s={0:.3f}: {1:.5f}".format( s, prop ) )

elapsed time: 33.888 sec
proportion of fixed mutants when s=0.050: 0.09170

シミュレーションの時間は大体同じで、固定確率が変化した。これは有利さの度合いが変わったから。

最後のprint文で.formatに二つ変数を入れていることに注意。一つ目が{0}で二つ目が{1}に代入される仕組み。{0:.3f}だと0番目の変数の値を代入して、小数点以下3桁まで表示してね、という意味。

s=0¶

もし自然選択の影響がなかったとすると、

sim = []

# parameters
derived = 1
size = 100
s = 0.
h =0.5

# condition
end_sim = 1000
howmany = 10000

# simulation starts
start = time.time()

for j in range(howmany):
    
    # initialize trajectory data
    derived = 1
    tr = [derived/size]
        
    for i in range(end_sim):

        # calc the effect of selection
        q = derived/size
        p = 1-q
        delta_q = 2*s*p*q*(q+h*(p-q))

        # number of derived alleles in the next generation
        derived = np.random.binomial(n=size, p=q+delta_q)

        tr.append( derived/size )    

        
    # append simulation result
    sim.append(tr)

# simulation ends
elapsed = time.time()-start

# print run-time
print("elapsed time: {0:.3f} sec".format(elapsed) ) 

#
# fixation probability
#
last_gen = np.array(sim).T[-1]
prop = len(last_gen[last_gen>0])/len(last_gen)
print( "proportion of fixed mutants when s={0:.3f}: {1:.5f}".format( s, prop ) )

elapsed time: 32.544 sec
proportion of fixed mutants when s=0.000: 0.01040

期待値とシミュレーションの比較¶

ここで求めた集団中に広まった突然変異の割合のことを「固定確率」。この固定確率と自然選択の強さとの関係はどうなっているのだろう？　シミュレーション結果と理論的に求められた値を比較しよう。

まず始めに自然選択の強さ$s$、ヘテロ個体のdoinanceの程度を表す係数$h$、が与えられた時にその固定確率を返すシミュレートする関数を作ろう。

def calc_fix_p(s, h):
    sim = []
    derived = 1
    size = 100
    end_sim = 1000
    howmany = 10000
    
    for j in range(howmany):
        derived = 1
        tr = [derived/size]
        #pop = Drift_with_selection(size, init, s, h)
        #tr.extend([ pop.next()/size for i in range(end_sim)])
        
        for i in range(end_sim):
            # calc the effect of selection
            q = derived/size
            p = 1-q
            delta_q = 2*s*p*q*(q+h*(p-q))
            # number of derived alleles in the next generation
            derived = np.random.binomial(n=size, p=q+delta_q)
            tr.append( derived/size )    

        sim.append(tr)
        
    last_gen = np.array(sim).T[-1]
    
    return len(last_gen[last_gen>0])/len(last_gen)

この関数に$s$と$h$を与えると、シミュレーション中で固定した変異の割合、すなわち固定確率を返す。

例えば、

print(calc_fix_p(0.01,0.5))

0.0249

$s$の大きさを次々に変えながらシミュレーションしよう。少し時間がかかるよ

selection = [0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1]

start = time.time()

fixp_data = [ calc_fix_p(s,0.5) for s in selection ]

elapsed = time.time()-start
print("elapsed time: {0:.3f} sec".format(elapsed) )

elapsed time: 302.552 sec

結果を理論値とともにプロットしてみよう。固定確率は理論的に求めることができ、理論的な期待値は$2s$である。

expectation_s = [2*i for i in selection]

ではいよいよプロットしよう。

今回は理論値とシミュレーションの比較なのでわかりやすくするために理論値を実線で、シミュレーション結果を点でプロットする。

fig=plt.figure()
ax=fig.add_subplot(1,1,1)

ax.scatter(selection, fixp_data)
ax.plot(selection, expectation_s)
ax.set_xlim(0,.11)

plt.show()

ん？　何かずれてる？

いや、そんなはずはない

たぶん...

Figure¶

では確認のため、もう少し違う「期待値」も計算してみよう

expectation_n = [.01 for i in selection]
expectation_s1 = [2*i/(1-np.exp(-2*size*i)) for i in selection]
expectation_s2 = [(1-np.exp(-2*i))/(1-np.exp(-2*size*i)) for i in selection]

あわせてプロットすると

fig=plt.figure()
ax=fig.add_subplot(1,1,1)

plt.xscale("log")
plt.yscale("log")
plt.grid(which="both")

ax.scatter(selection, fixp_data)
ax.plot(selection, expectation_n)
ax.plot(selection, expectation_s)
ax.plot(selection, expectation_s1, linestyle='dashdot')
ax.plot(selection, expectation_s2, linestyle='dashed')


ax.set_xlim(0.00009,.11)
ax.set_ylim(0.005,.3)
ax.set_xlabel("Selective advantage, $s$")
ax.set_ylabel("Fixation probability")

ax.annotate("$Ns=1$", xy=(0.02,.009), xytext=(.015,.0065),
            arrowprops=dict(arrowstyle="->"))
ax.annotate("$Ns=0.1$", xy=(0.002,.009), xytext=(.00135,.0065),
            arrowprops=dict(arrowstyle="->"))
ax.annotate("$1/2N$", xy=(0.00035,.01), xytext=(.0002,.006),
            arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
ax.annotate("$2s$", xy=(0.06,.13), xytext=(.04,.2),
            arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

plt.show()