# Author Archives: Don Tan

## Spark Maven build problem (in mainland China, solved)

mvn install:install-file -Dfile=akka-zeromq_2.10-2.3.11.jar -DgroupId=com.typesafe.akka -DartifactId=akka-zeromq_2.10 -Dversion=2.3.11 -Dpackaging=jar

mvn install:install-file -Dfile=datanucleus-core-3.2.10.jar -DgroupId=org.datanucleus -DartifactId=datanucleus-core -Dversion=3.2.10 -Dpackaging=jar

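## The Hong Kong I Know

The "Individual Visit Scheme" is a keyword in this round of controversy. The scheme has given rise to a "parallel trader and Hong Kong goods shop" ecosystem, and whether it really benefits Hong Kong or the mainland is still open to debate. According to Hong Kong government statistics, the benefit the scheme brings to Hong Kong amounts to only a few percentage points of GDP. Its gains show up more directly in retail: in 2013 mainland visitors spent 170 billion in Hong Kong, a third of all retail sales, but this also pushed up rents in the core commercial districts, and many century-old shops were forced to relocate. Under these circumstances, it is no exaggeration to say that Hong Kong people are caught in a squeeze.

When I lived in Hung Hom, there was a small Chinese medicine clinic downstairs, run by a Mr. Chan. Before 1990 he was an attending physician at an orthopedic hospital in Guangzhou. He first emigrated to Australia for more than ten years, and then, hoping his children could receive more Chinese-language education, moved back to Hong Kong in 2009. He understands the impact of the Individual Visit Scheme, and the hardship of Hong Kong's new immigrants, better than anyone: "In recent years Hong Kong's economy has not been doing well. Many industries don't dare to hire, and wages can't go up. Companies have cut staff but the workload is the same, so people just work overtime. When the economy goes down, prices should go down too. But once the Individual Visit Scheme arrived, there was a market for basic goods again, and prices that can't come down amount to extra pressure. Under the British colonial government there was a policy restricting real estate, which said rent could not rise by more than 10% a year. After the handover that policy was abolished. My little clinic here isn't even in a commercial district, and its rent went up 50% for several years in a row. There's nothing to be done, so I just work more overtime." But some things cannot be solved simply by working overtime: the broader economic downturn makes the already difficult lives of new immigrants even harder.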

## Setting up a Hadoop 2.3.0 single-node pseudo-distributed development environment on Linux Mint

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

Installing Linux Mint itself is skipped here. You can run it in a VM or install it on a hard disk, whichever you prefer.

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

• Create the hadoop group. In a terminal, enter:
 sudo addgroup hadoop
• Create the hadoop user. In a terminal, enter:
 sudo adduser --ingroup hadoop hadoop

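• Give the hadoop user sudo privileges. For simplicity you can add it directly in /etc/sudoers. A more elaborate setup would be more work and more secure, but on an experimental machine I take the lazy route. In a terminal, open:
 sudo vim /etc/sudoers

Below the line that starts with "root ALL=(ALL:ALL)", add:

 hadoop ALL=(ALL:ALL) ALL

If you are not comfortable with vim, you can use gedit, which is simpler. Add the line above and save:

 sudo gedit /etc/sudoers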

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

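## How to set up a personal WordPress blog

• A personal server
• A domain name

Once the A record takes effect, it is like having a street number on your front door: you can reach your own site easily. A side note: big websites run on the LAMP stack, Linux + Apache + MySQL + PHP/Python. To make things easy for customers, hosting providers have already configured the server and set up the environment, so at this point you have completed the first two and a half of those components. The remaining one and a half are handled by WordPress.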

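## Some software for Linux Mint

1. StarDict (星际译王):
   1. Install: sudo apt-get install stardict
   3. In a terminal, unpack a dictionary archive ("name" stands for the archive file): sudo tar -xjvf name -C /usr/share/stardict/dic
   4. That's all. As a dictionary, StarDict is very capable; I have Espeak TTS pronunciation configured by default. For more, see: http://wiki.ubuntu.com.cn/index.php?title=Stardict&variant=zh-hans
2. The fcitx input method:
   1. Install: sudo apt-get install im-switch libapt-pkg-perl fcitx fcitx-table-wbpy
   2. Set fcitx as the default input method: im-switch -s fcitx
   3. Log out and log back in and it works. Fcitx (the "little penguin") is quite good, but its pinyin segmentation is still not as good as Sogou's (maybe I just have not found the setting). For example, when entering "企鹅", I can only type it as "qie".
   4. Original source: http://www.pocketdigi.com/20120109/598.html
   5. You could also consider extending it with the Sogou word library, see: http://www.mintos.org/utility/fcitx-sougou.html
3. Synergy, for sharing one keyboard and mouse across several devices on a LAN:
   1. Download Synergy: http://synergy-foss.org/?hl=zh
   2. Configure the server and the clients:
      1. The server is the machine whose keyboard and mouse are shared. It needs little configuration: set the names of the clients that will connect and whether each screen sits to the left or right.
      2. The client is just as easy to set up: enter the server's IP address.
      3. Sharing can also be encrypted, with several options to choose from. If you do not want it, select "disable encryption".
   3. The clipboard is shared as well.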

## Hotelling-Williams T-test (2)

Last week I finished reading some of the references from the previous post, Hotelling-Williams T-test (1). They are very hard to read, and only curiosity gave me the energy to get through them. Even so, I was disappointed once I had finished. Statistics is applied widely, but I suspect the reason is not its precision but its tolerance for imprecision. Some of its results lack a fully rigorous mathematical proof and are still used very broadly, as long as they lead to useful judgments. It really is an "empirical discipline".

## Moment Generating Function and Probability Generating Function

The moment generating function (mgf) and the probability generating function (pgf) are useful techniques in probability theory. Loss models rely heavily on probability, so the mgf and pgf are necessary tools there, and I am posting some notes about them.

The definition of the moment generating function (univariate case) is

$$M_{X}(t) = E\left[e^{tX}\right], \quad t \in \mathbb{R}.$$

More generally, if $X=(X_{1}, X_{2}, \dots, X_{n})^{T}$, we use $t^{T}X$ instead of $tX$:

$$M_{X}(t) = E\left[e^{t^{T}X}\right].$$

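The definition of the mgf looks complicated, so why define it this way? According to Wikipedia, this definition can be used to find all the moments of the distribution. Expanding $e^{tX}$ as a Taylor series, we have

$$e^{tX} = 1 + tX + \frac{t^{2}X^{2}}{2!} + \frac{t^{3}X^{3}}{3!} + \cdots$$

so that

$$M_{X}(t) = E\left[e^{tX}\right] = 1 + tE[X] + \frac{t^{2}E[X^{2}]}{2!} + \cdots = \sum\limits_{i=0}^{\infty}\frac{t^{i}E[X^{i}]}{i!}.$$

It is then straightforward to differentiate $M_{X}(t)$ $n$ times with respect to $t$ and set $t = 0$ to get $E[X^{n}]$.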
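As a quick numerical check of this property, here is a minimal Python sketch. It is an illustration rather than anything from the source material: the fair-die distribution, the helper names `mgf` and `moment_from_mgf`, and the step size `h` are all assumptions, and the derivatives at $t = 0$ are approximated with central finite differences instead of being computed symbolically.

```python
import math

def mgf(pmf, t):
    """M_X(t) = E[exp(t X)] for a finite discrete pmf given as {value: probability}."""
    return sum(p * math.exp(t * x) for x, p in pmf.items())

def moment_from_mgf(pmf, n, h=1e-3):
    """Approximate E[X^n] as the n-th derivative of the mgf at t = 0.

    Uses central finite differences with step h (an assumed value);
    only n = 1 and n = 2 are implemented in this sketch.
    """
    if n == 1:
        return (mgf(pmf, h) - mgf(pmf, -h)) / (2 * h)
    if n == 2:
        return (mgf(pmf, h) - 2 * mgf(pmf, 0.0) + mgf(pmf, -h)) / h**2
    raise ValueError("only n = 1 and n = 2 are implemented in this sketch")

# Hypothetical example distribution: a fair six-sided die.
die = {k: 1 / 6 for k in range(1, 7)}

first = moment_from_mgf(die, 1)   # should be close to E[X]   = 3.5
second = moment_from_mgf(die, 2)  # should be close to E[X^2] = 91/6
print(first, second)
```

The finite-difference results only approximate the exact moments, so compare them with a tolerance rather than for equality.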

If $X_{1}$, $X_{2}$, $\dots$, $X_{n}$ is a sequence of independent random variables and $S_{n} = \sum\limits_{i=1}^{n}a_{i}X_{i}$, then the mgf of $S_{n}$ is

$$M_{S_{n}}(t) = \prod\limits_{i=1}^{n} M_{X_{i}}(a_{i}t).$$

Note that some distributions have no mgf, because in some cases $\lim\limits_{n\rightarrow\infty}\sum\limits_{i=0}^{n}\frac{t^{i}E[X^{i}]}{i!}$ does not exist. The lognormal distribution is an example.

For the pgf, the definition is

$$G(z) = E\left[z^{X}\right].$$

With a small substitution, $z = e^{t}$, we arrive back at the mgf:

$$G(e^{t}) = E\left[e^{tX}\right] = M_{X}(t).$$

From the description of the pgf on Wikipedia, it sounds like the pgf is more appropriate for discrete random variables, but I don't have any evidence for that.

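For the univariate case, a more detailed definition of the pgf is

$$G(z) = E\left[z^{X}\right] = \sum\limits_{x=0}^{\infty}p(x)z^{x},$$

where $p$ is the probability mass function of $X$.

For the multivariate case, with $X=(X_{1}, X_{2}, \dots, X_{n})^{T}$, the definition is

$$G(z_{1}, \dots, z_{n}) = E\left[z_{1}^{X_{1}}\cdots z_{n}^{X_{n}}\right] = \sum\limits_{x_{1},\dots,x_{n}=0}^{\infty}p(x_{1},\dots,x_{n})\,z_{1}^{x_{1}}\cdots z_{n}^{x_{n}}.$$

From its definition, the pgf is clearly a power series whose coefficients are probabilities summing to 1, which guarantees that it converges at least for $|z|\leq 1$. Setting $z = 1^{-}$, we get

$$G(1^{-}) = \sum\limits_{x=0}^{\infty}p(x) = 1, \qquad G'(1^{-}) = E[X].$$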

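If $X_{1}$, $X_{2}$, $\dots$, $X_{n}$ is a sequence of independent random variables and $S_{n} = \sum\limits_{i=1}^{n}a_{i}X_{i}$, then the pgf of $S_{n}$ is

$$G_{S_{n}}(z) = E\left[z^{\sum_{i=1}^{n}a_{i}X_{i}}\right] = \prod\limits_{i=1}^{n} G_{X_{i}}(z^{a_{i}}).$$

In particular, if $S = X_{1}-X_{2}$, we have

$$G_{S}(z) = G_{X_{1}}(z)\,G_{X_{2}}(1/z).$$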
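The rule for the difference of two independent variables can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the two small distributions `x1` and `x2`, the evaluation point `z`, and the helper names are made up for the example. It builds the distribution of $S = X_{1} - X_{2}$ by direct enumeration and compares its pgf with $G_{X_{1}}(z)\,G_{X_{2}}(1/z)$.

```python
from itertools import product

def pgf(pmf, z):
    """G_X(z) = E[z^X] for a finite pmf on the integers, given as {value: probability}."""
    return sum(p * z**x for x, p in pmf.items())

def pmf_of_difference(pmf1, pmf2):
    """pmf of S = X1 - X2 for independent X1 and X2, by enumerating all pairs."""
    out = {}
    for (a, pa), (b, pb) in product(pmf1.items(), pmf2.items()):
        out[a - b] = out.get(a - b, 0.0) + pa * pb
    return out

# Hypothetical example distributions (not from the original post).
x1 = {0: 0.2, 1: 0.5, 2: 0.3}
x2 = {0: 0.6, 1: 0.4}
z = 0.7  # any nonzero point works for finite pmfs

lhs = pgf(pmf_of_difference(x1, x2), z)   # pgf of S computed directly
rhs = pgf(x1, z) * pgf(x2, 1 / z)         # product form G_X1(z) * G_X2(1/z)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding. Note that $S$ can be negative here, which is why the check is done at a nonzero $z$.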

Note: All the material in this post comes from wikipedia.org. You can check it out if you want more detail.

## Hotelling-Williams T-test (1)

Recently I have been trying to compare the performance of two measures. This turns out to be a problem of comparing two correlation coefficients $\rho_{12}$ and $\rho_{13}$, where subscript 1 denotes the observation group and subscripts 2 and 3 denote the measures. To be honest, I had no idea how to do this at the very beginning. Many thanks to my supervisor Dr. Dennis Cheung, who sent me a PPT about correlation coefficients that also covers the Hotelling-Williams T test [Steiger].

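The formula of the Hotelling-Williams T test, as given in [Steiger], is:

$$T = (r_{12} - r_{13})\sqrt{\frac{(N-1)(1+r_{23})}{2\left(\frac{N-1}{N-3}\right)|R| + \bar{r}^{2}(1-r_{23})^{3}}}$$

Under the null hypothesis $\rho_{12} = \rho_{13}$, $T$ follows a t distribution with $N-3$ degrees of freedom. The symbols are:

• $N =$ number of observations
• $r_{12} =$ sample correlation between the observation and measure 2
• $r_{13} =$ sample correlation between the observation and measure 3
• $r_{23} =$ sample correlation between the two measures
• $|R| = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2(r_{12})(r_{13})(r_{23})$
• $\bar{r} = (r_{12} + r_{13})/2$
• $\rho$ denotes a population correlation and $r$ denotes a sample correlation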
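To make the quantities in this list concrete, here is a minimal Python sketch of the statistic. It assumes the form of Williams' statistic given in Steiger (1980), written out in the code comments. The function name `williams_t` and the example correlations are hypothetical, and the sketch returns only the statistic and its degrees of freedom ($N-3$), leaving the p-value lookup to a t table or a statistics library.

```python
import math

def williams_t(r12, r13, r23, n):
    """Hotelling-Williams statistic for H0: rho_12 = rho_13.

    Assumed form (Steiger, 1980):
        T = (r12 - r13) * sqrt((N-1)(1+r23) / (2*(N-1)/(N-3)*|R| + rbar^2*(1-r23)^3))
    Returns (T, degrees of freedom), where the degrees of freedom are N - 3.
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Determinant of the 3x3 correlation matrix, as defined in the list above.
    det_r = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det_r + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    return t, n - 3

# Hypothetical sample correlations, for illustration only.
t_stat, df = williams_t(r12=0.5, r13=0.3, r23=0.4, n=103)
print(t_stat, df)
```

Swapping `r12` and `r13` flips the sign of the statistic, and equal correlations give exactly zero, which is a quick sanity check on any implementation.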

The Hotelling-Williams T test performs well in my hypothesis testing. It shows that there is a significant difference between the two measures, which explains the phenomena I have observed. The relationship is linear in my case, but I doubt whether the Hotelling-Williams T test is appropriate for a non-linear case, such as a logarithmic one. On the [crr] blog there is a post about a similar problem: the correlations between frequency measures and word processing time. The post is very detailed and introduces two more testing techniques. One is the Vuong test [Vuong, 1989], which is suggested for non-linear problems, for example relating word processing time to log frequency. That requires a non-linear regression model, and the Vuong test suits this case because it is based on a comparison of log-likelihoods. The other method was developed by Clarke (2007) [Clarke], who considered the Vuong test conservative for small N. However, after running a simulation, the [crr] bloggers concluded that the Hotelling-Williams T test performed best, followed by the Vuong test. The Vuong test is suggested unless the correlation between the variables is very small.

The core idea behind the Hotelling-Williams T test is not yet clear to me. I will cover it in the next post.

1. [crr] http://crr.ugent.be/archives/546
2. [Vuong] Vuong, Q.H. (1989): Likelihood Ratio Tests for Model Selection and non-nested Hypotheses. Econometrica, 57, 307-333.
3. [Clarke] Clarke, K.A. (2007). A Simple Distribution-Free Test for Nonnested Model Selection. Political Analysis, 15, 347-363.
4. [Steiger] Steiger, J.H. (1980), Tests for comparing elements of a correlation matrix, Psychological Bulletin, 87, 245-251.
