To understand which factors affect how much Cython can optimize a piece of code, I decided to run some experiments. I took a basic function that receives a large array as its argument and analysed the effect of various changes.

Also, can we beat the speed of numpy with explicit C looping? We will see 🙂

The basic function I started with was:

import numpy as np

def fun(a):
    # np.sin is applied to the whole array in one call
    x = np.sin(a)
    return x

def trial(a):
    return fun(a)

Runtime: 195 ms
Note: Using cdef instead of def and declaring the types of all variables made no difference to the runtime.

Then, instead of passing the whole array to the function, I looped over it explicitly and used the sin function from libc.math.

The code was:

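cimport numpy as np
from libc.math cimport sin

cdef fun(a):
    # a is untyped here, so it is handled as a Python object
    return sin(a)

def trial(np.ndarray[double] a):
    for i in xrange(a.shape[0]):
        a[i] = fun(a[i])
    return a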

The runtime was now 340 ms.
There are some more points to note about this function:
1) If we use def instead of cdef when declaring fun(), the runtime rises to 500 ms. This is the difference a C-level function call without Python call overhead can make.
2) If np.sin is used in place of the libc sin, the runtime is 16 s. np.sin is a Python-level function, so every call carries some Python overhead. When it is called once per element, that overhead is paid on every call and slows the code heavily. When we pass whole arrays, however, np.sin works quite well, as was seen in case 1.
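The per-call overhead described in point 2 can also be seen from plain Python, without compiling anything. The following is only a rough sketch: the function names are made up for illustration, the array is much smaller than the one used in this post so the slow loop finishes quickly, and the timings it prints will differ from machine to machine.

```python
import math
import timeit

import numpy as np

# Assumption: a smaller array than the post's 1,000,000 elements,
# chosen only to keep the slow per-element loops short.
a = np.linspace(1, 1000, 10_000)

def vectorized(arr):
    # One call; the loop over elements runs inside NumPy's compiled code.
    return np.sin(arr)

def per_element_np(arr):
    # One np.sin call per element: the call overhead is paid every time.
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = np.sin(arr[i])
    return out

def per_element_math(arr):
    # Same loop with math.sin, the scalar sin from the standard library.
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = math.sin(arr[i])
    return out

# Best of 3 repeats, 3 calls per repeat, reported per call.
timings = {
    f.__name__: min(timeit.repeat(lambda: f(a), number=3, repeat=3)) / 3
    for f in (vectorized, per_element_np, per_element_math)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

All three functions compute the same values; only the number of Python-level calls differs.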

If we now only declare the type of fun()'s argument a as double, the runtime comes down to 252 ms.
Note: If we do not declare the type of a in "def trial(np.ndarray[double] a):", the runtime is 1.525 s.

Next, I removed the function fun() and did the computation in the trial function itself.

cimport numpy as np
from libc.math cimport sin

def trial(np.ndarray[double] a):
    for i in xrange(a.shape[0]):
        # libc sin is called directly on each element
        a[i] = sin(a[i])
    return a

This time the runtime was 169 ms. We have finally beaten the first version.
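One thing worth keeping in mind about the loop versions: they write the results back into a, whereas the first version returns a new array. Under repeated timing runs, each call therefore works on the output of the previous one. The sketch below shows this in plain Python with a small array; trial_in_place and trial_copy are hypothetical names, and math.sin stands in for the libc sin.

```python
import math

import numpy as np

def trial_in_place(a):
    # Plain-Python analogue of the loop above: results overwrite the input.
    for i in range(a.shape[0]):
        a[i] = math.sin(a[i])
    return a

def trial_copy(a):
    # Variant that leaves the input untouched, like np.sin(a) does.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = math.sin(a[i])
    return out

data = np.linspace(1, 1000, 1000)
original = data.copy()

copied = trial_copy(data)
unchanged_after_copy = np.array_equal(data, original)

returned = trial_in_place(data)
same_object = returned is data
changed_after_in_place = not np.array_equal(data, original)

print(unchanged_after_copy, same_object, changed_after_in_place)
```

Whether this matters for the timings is not something I have measured; it mostly matters if the original data is needed afterwards.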

I have not yet explained the variation in runtime in many of these cases. I will try to do so soon.

NB:
1. The data for all the calculations was generated by a = np.linspace(1,1000,1000000)
2. Using a typed memoryview instead of np.ndarray made no difference.
3. Timings were measured with the IPython magic function %timeit.
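For anyone who wants to repeat the first measurement outside IPython, here is a minimal sketch using the standard timeit module on the same data. The number and repeat values are arbitrary choices, and taking the minimum over repeats is just one common convention, so the figure will not match %timeit exactly.

```python
import timeit

import numpy as np

# Same data as in note 1.
a = np.linspace(1, 1000, 1000000)

def fun(a):
    x = np.sin(a)
    return x

def trial(a):
    return fun(a)

# Assumption: 5 calls per repeat and 3 repeats, picked to keep the run short.
number = 5
runs = timeit.repeat(lambda: trial(a), number=number, repeat=3)
best_ms = min(runs) / number * 1e3
print(f"best of {len(runs)} repeats: {best_ms:.1f} ms per call")
```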