Seven Cats' Scripture Library (七猫的藏经阁)

Really just a trash bin


This time I got results with out-of-cache data. To eliminate cache
effects, I used 1MB of source data and 1MB of destination data, and repeated
each memcpy*() 1MB / datasize times in the inner loop, and 1024 times in the
outer loop. The total data size was the same 1GB as before, but the results
were quite different from those with in-cache data.
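
The loop arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the original benchmark; the names BUFFER_BYTES, OUTER_LOOPS, inner_loops and total_bytes_copied are made up here, and the block size is assumed to divide 1MB evenly.

```c
#include <stddef.h>

#define BUFFER_BYTES (1024UL * 1024UL)  /* 1MB source and destination buffers */
#define OUTER_LOOPS  1024UL             /* passes over the whole buffer */

/* Inner-loop count for one pass over the 1MB buffer. */
static unsigned long inner_loops(size_t block_size)
{
    return BUFFER_BYTES / block_size;
}

/* Total bytes copied for one benchmark row (block_size must divide 1MB). */
static unsigned long long total_bytes_copied(size_t block_size)
{
    return (unsigned long long)inner_loops(block_size) * block_size * OUTER_LOOPS;
}
```

For a 64-byte block this gives 16384 inner loops, matching the "memcpy 64B -- 16384 loops" header in the attached results, and the total is 1GB regardless of block size.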

In this test, the non-temporal movntq instruction was clearly a big win.
Since it doesn't pollute the cache, copying data that is not in cache ran
about 1.75x faster than libc memcpy (roughly 1.58 s versus 2.77 s in the
aligned runs below).
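
As a rough illustration of what a non-temporal copy looks like from C, here is a sketch using SSE2 intrinsics. It is not the code that was benchmarked: the attached routines use inline assembly with movntq (MMX) and movntps, while this sketch swaps in _mm_stream_si128 and falls back to plain memcpy when SSE2 is not available. The function name copy_nontemporal is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes using non-temporal (streaming) stores where available.
 * Returns dst, like memcpy. The buffers must not overlap. */
static void *copy_nontemporal(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    /* Streaming stores need a 16-byte aligned destination, so copy the
     * unaligned head with plain memcpy first. */
    size_t head = (size_t)(-(uintptr_t)d & 15u);
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s); /* unaligned load */
        _mm_stream_si128((__m128i *)d, v);               /* store bypassing the cache */
        d += 16;
        s += 16;
        n -= 16;
    }
    _mm_sfence(); /* order the streaming stores before later stores */
#endif
    memcpy(d, s, n); /* tail, or the whole copy when SSE2 is unavailable */
    return dst;
}
```

Whether this beats a regular copy depends on the data being out of cache and not needed again soon; for small, hot buffers the in-cache results from the earlier message suggest the opposite.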

Also, I found that my MMX-optimized i686_copyin() is faster than plain
old memcpy for copies larger than about 2-3 KB. Saving the FP state to the
stack and restoring it seems to be quite expensive for small copies (fnsave
writes 108 bytes from the processor to memory, plus some overhead).

I'll come up with a finalized i686_copyin/out() soon.

Jun-Young

--
Bang Jun-Young <junyoung@mogua.com>

Attachment: memcpy_bench.uncached.txt

addr1=0x804c000 addr2=0x814c000

memcpy 64B -- 16384 loops
  aligned blocks
      libc memcpy                                        2.893993 s
      rep movsw                                          2.859771 s
      asm loop                                           2.669005 s
      i686_copyin                                        2.910439 s
      i686_copyin2                                       2.885610 s
      MMX memcpy using MOVQ                              2.675665 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.949940 s
      with simple MOVUSB (no prefetch)                   2.719580 s
      arjanv's MOVQ (with prefetch)                      2.938366 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.552954 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.545507 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.723010 s
      MMX memcpy using MOVQ                              2.893861 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.093558 s
      with simple MOVUSB (no prefetch)                   2.973506 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        3.125790 s
      MMX memcpy using MOVQ                              2.661766 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.740727 s
      with simple MOVUSB (no prefetch)                   2.715262 s

addr1=0x804c000 addr2=0x814c000
memcpy 1024B -- 1024 loops
  aligned blocks
      libc memcpy                                        2.761827 s
      rep movsw                                          2.764354 s
      asm loop                                           2.820187 s
      i686_copyin                                        2.647857 s
      i686_copyin2                                       2.647648 s
      MMX memcpy using MOVQ                              2.574933 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.870815 s
      with simple MOVUSB (no prefetch)                   2.684049 s
      arjanv's MOVQ (with prefetch)                      2.518789 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.588186 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.698439 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.800100 s
      MMX memcpy using MOVQ                              2.588999 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.852392 s
      with simple MOVUSB (no prefetch)                   2.723908 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.749374 s
      MMX memcpy using MOVQ                              2.683349 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.203756 s
      with simple MOVUSB (no prefetch)                   2.750306 s

addr1=0x804c000 addr2=0x814c000
memcpy 4kB -- 256 loops
  aligned blocks
      libc memcpy                                        2.758545 s
      rep movsw                                          2.759825 s
      asm loop                                           2.818919 s
      i686_copyin                                        2.633134 s
      i686_copyin2                                       2.641534 s
      MMX memcpy using MOVQ                              2.571201 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.795929 s
      with simple MOVUSB (no prefetch)                   2.681924 s
      arjanv's MOVQ (with prefetch)                      2.512153 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.577637 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.688840 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.828267 s
      MMX memcpy using MOVQ                              2.584795 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.773777 s
      with simple MOVUSB (no prefetch)                   2.691957 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.711029 s
      MMX memcpy using MOVQ                              2.690554 s
      with mingo's MOVUSB (prefetch, non-temporal)       2.047554 s
      with simple MOVUSB (no prefetch)                   2.782641 s

addr1=0x804c000 addr2=0x814c000
memcpy 64kB -- 16 loops
  aligned blocks
      libc memcpy                                        2.764299 s
      rep movsw                                          2.767497 s
      asm loop                                           2.826478 s
      i686_copyin                                        2.626365 s
      i686_copyin2                                       2.625997 s
      MMX memcpy using MOVQ                              2.570352 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767928 s
      with simple MOVUSB (no prefetch)                   2.685339 s
      arjanv's MOVQ (with prefetch)                      2.521904 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.575878 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.682403 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.823552 s
      MMX memcpy using MOVQ                              2.580810 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767096 s
      with simple MOVUSB (no prefetch)                   2.707592 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.713003 s
      MMX memcpy using MOVQ                              2.668149 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.975933 s
      with simple MOVUSB (no prefetch)                   2.779886 s

addr1=0x804c000 addr2=0x814c000
memcpy 128kB -- 8 loops
  aligned blocks
      libc memcpy                                        2.766495 s
      rep movsw                                          2.767812 s
      asm loop                                           2.827207 s
      i686_copyin                                        2.626962 s
      i686_copyin2                                       2.618238 s
      MMX memcpy using MOVQ                              2.570613 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.775084 s
      with simple MOVUSB (no prefetch)                   2.684980 s
      arjanv's MOVQ (with prefetch)                      2.521927 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.575982 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.682593 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.817080 s
      MMX memcpy using MOVQ                              2.588906 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.766316 s
      with simple MOVUSB (no prefetch)                   2.706869 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.711935 s
      MMX memcpy using MOVQ                              2.674179 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.963451 s
      with simple MOVUSB (no prefetch)                   2.780192 s

addr1=0x804c000 addr2=0x814c000
memcpy 256kB -- 4 loops
  aligned blocks
      libc memcpy                                        2.766599 s
      rep movsw                                          2.767784 s
      asm loop                                           2.828783 s
      i686_copyin                                        2.619552 s
      i686_copyin2                                       2.627876 s
      MMX memcpy using MOVQ                              2.571837 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.776927 s
      with simple MOVUSB (no prefetch)                   2.686435 s
      arjanv's MOVQ (with prefetch)                      2.523016 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.577187 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.675317 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.827427 s
      MMX memcpy using MOVQ                              2.590171 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.769825 s
      with simple MOVUSB (no prefetch)                   2.708104 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.710984 s
      MMX memcpy using MOVQ                              2.674800 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.972209 s
      with simple MOVUSB (no prefetch)                   2.787717 s

addr1=0x804c000 addr2=0x814c000
memcpy 512kB -- 2 loops
  aligned blocks
      libc memcpy                                        2.766847 s
      rep movsw                                          2.767707 s
      asm loop                                           2.811354 s
      i686_copyin                                        2.626655 s
      i686_copyin2                                       2.626876 s
      MMX memcpy using MOVQ                              2.571146 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.775052 s
      with simple MOVUSB (no prefetch)                   2.684812 s
      arjanv's MOVQ (with prefetch)                      2.513970 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.576279 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.683077 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        2.827907 s
      MMX memcpy using MOVQ                              2.589284 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.767601 s
      with simple MOVUSB (no prefetch)                   2.706929 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        2.702820 s
      MMX memcpy using MOVQ                              2.675799 s
      with mingo's MOVUSB (prefetch, non-temporal)       1.969484 s
      with simple MOVUSB (no prefetch)                   2.785175 s
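
To compare rows more easily, the times above can be converted to throughput. The helper below is a hypothetical post-processing sketch (the name throughput_mib_per_s is not from the benchmark); it assumes each row copied 1024 passes over a 1MB buffer, as described in the message. It is meant to be run separately on the printed numbers, not added to the benchmark itself, whose source warns against using floating-point code.

```c
/* Total data copied per row of the results: 1024 passes over a 1MB buffer. */
#define TOTAL_MIB 1024.0

/* Convert an elapsed time in microseconds (the unit end_timer() returns)
 * into throughput in MiB/s. */
static double throughput_mib_per_s(long elapsed_usec)
{
    return TOTAL_MIB * 1000000.0 / (double)elapsed_usec;
}
```

For example, the 1.575878 s MOVNTQ row works out to a little under 650 MiB/s, against about 370 MiB/s for the 2.764299 s libc memcpy row at the same block size.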


Attachment: memcpy_bench.c

/* -*- c-file-style: "linux" -*- */

/* memcpy speed benchmark using different i86-specific routines.
*
* Framework (C) 2001 by Martin Pool <mbp@samba.org>, based on speed.c
* by tridge.
*
* Routines lifted from all kinds of places.
*
* You must not use floating-point code anywhere in this application
* because it scribbles on the FP state and does not reset it.  */


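#include <stdio.h>
#include <stdlib.h>
#include <string.h>	/* memcpy, memset */
#include <sys/time.h>

/* Defined outside this file.  Declared as returning void * so that they
 * match the function-pointer type wrap() expects. */
void *memcpy_rep_movsl(void *to, const void *from, size_t len);
void *memcpy_words(void *to, const void *from, size_t len);
void *i686_copyin(void *to, const void *from, size_t len);
void *i686_copyin2(void *to, const void *from, size_t len);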

#define MAX(a,b) ((a)>(b)?(a):(b))
#define MIN(a,b) ((a)<(b)?(a):(b))

#include <sys/resource.h>
struct rusage tp1,tp2;

static void start_timer()
{
getrusage(RUSAGE_SELF,&tp1);
}


static long end_timer()
{
getrusage(RUSAGE_SELF,&tp2);
#if 0
printf ("tp1 = %ld.%05ld, tp2 = %ld.%05ld\n",
(long) tp1.ru_utime.tv_sec, (long) tp1.ru_utime.tv_usec,
(long) tp2.ru_utime.tv_sec, (long) tp2.ru_utime.tv_usec);
#endif

return ((tp2.ru_utime.tv_sec - tp1.ru_utime.tv_sec) * 1000000 +
(tp2.ru_utime.tv_usec - tp1.ru_utime.tv_usec));
}




/*
* By Ingo Molnar and Doug Ledford; hacked up to remove
* kernel-specific stuff like saving/restoring float registers.
*
* http://people.redhat.com/mingo/mmx-patches/mmx-2.3.99-A0 */
void *
memcpy_movusb (void *to, const void *from, size_t n)
{
size_t size;

#define STEP 0x20
#define ALIGN 0x10
if ((unsigned long)to & (ALIGN-1)) {
size = ALIGN - ((unsigned long)to & (ALIGN-1));
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from),
     "r" (to));
n -= size;
from += size;
to += size;
}
/*
* If the copy would have tailings, take care of them
* now instead of later
*/
if (n & (ALIGN-1)) {
size = n - ALIGN;
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from + size),
     "r" (to + size));
n &= ~(ALIGN-1);
}
/*
* Prefetch the first two cachelines now.
*/
__asm__ __volatile__("prefetchnta 0x00(%0)\n\t"
     "prefetchnta 0x20(%0)\n\t"
     :
     : "r" (from));
 
while (n >= STEP) {
__asm__ __volatile__(
"movups 0x00(%0),%%xmm0\n\t"
"movups 0x10(%0),%%xmm1\n\t"
"movntps %%xmm0,0x00(%1)\n\t"
"movntps %%xmm1,0x10(%1)\n\t"
:
: "r" (from), "r" (to)
: "memory");
from += STEP;
/*
* Note: Intermixing the prefetch at *exactly* this point
* in time has been shown to be the fastest possible.
* Timing these prefetch instructions is a complete black
* art with nothing but trial and error showing the way.
* To that extent, this optimum version was found by using
* a userland version of this routine that we clocked for
* lots of runs.  We then fiddled with ordering until we
* settled on our highest speed routines.  So, the long
* and short of this is, don't mess with instruction ordering
* here or suffer performance penalties you will.
*/
__asm__ __volatile__(
"prefetchnta 0x20(%0)\n\t"
:
: "r" (from));
to += STEP;
n -= STEP;
}

return to;
}

void *
memcpy_simple_movusb (void *to, const void *from, size_t n)
{
size_t size;

#define STEP 0x20
#define ALIGN 0x10
if ((unsigned long)to & (ALIGN-1)) {
size = ALIGN - ((unsigned long)to & (ALIGN-1));
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from),
     "r" (to));
n -= size;
from += size;
to += size;
}
/*
* If the copy would have tailings, take care of them
* now instead of later
*/
if (n & (ALIGN-1)) {
size = n - ALIGN;
__asm__ __volatile__("movups (%0),%%xmm0\n\t"
     "movups %%xmm0,(%1)\n\t"
     :
     : "r" (from + size),
     "r" (to + size));
n &= ~(ALIGN-1);
}

while (n >= STEP) {
__asm__ __volatile__(
"movups 0x00(%0),%%xmm0\n\t"
"movups 0x10(%0),%%xmm1\n\t"
"movups %%xmm0,0x00(%1)\n\t"
"movups %%xmm1,0x10(%1)\n\t"
:
: "r" (from), "r" (to)
: "memory");
from += STEP;
to += STEP;
n -= STEP;
}

return to;
}


/* From Linux 2.4.8.  I think this must be aligned. */
void *
memcpy_mmx (void *to, const void *from, size_t len)
{
int i;

for(i = 0; i < len / 64; i++) {
      __asm__ __volatile__ (
   "movq (%0), %%mm0\n"
   "\tmovq 8(%0), %%mm1\n"
   "\tmovq 16(%0), %%mm2\n"
   "\tmovq 24(%0), %%mm3\n"
   "\tmovq %%mm0, (%1)\n"
   "\tmovq %%mm1, 8(%1)\n"
   "\tmovq %%mm2, 16(%1)\n"
   "\tmovq %%mm3, 24(%1)\n"
   "\tmovq 32(%0), %%mm0\n"
   "\tmovq 40(%0), %%mm1\n"
   "\tmovq 48(%0), %%mm2\n"
   "\tmovq 56(%0), %%mm3\n"
   "\tmovq %%mm0, 32(%1)\n"
   "\tmovq %%mm1, 40(%1)\n"
   "\tmovq %%mm2, 48(%1)\n"
   "\tmovq %%mm3, 56(%1)\n"
   : : "r" (from), "r" (to) : "memory");
from += 64;
to += 64;
}

if (len & 63)
memcpy(to, from, len & 63);

return to;
}

static void print_time (char const *msg,
long long loops,
long t)
{
printf("      %-50s %ld.%06ld s\n", msg, t/1000000,
       t % 1000000);
}

void *
memcpy_arjanv (void *to, const void *from, size_t len)
{
int i;

__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
"   prefetchnta 64(%0)\n"
"   prefetchnta 128(%0)\n"
"   prefetchnta 192(%0)\n"
"   prefetchnta 256(%0)\n"
: : "r" (from) );

for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"1: prefetchnta 320(%0)\n"
"2: movq (%0), %%mm0\n"
"   movq 8(%0), %%mm1\n"
"   movq 16(%0), %%mm2\n"
"   movq 24(%0), %%mm3\n"
"   movq %%mm0, (%1)\n"
"   movq %%mm1, 8(%1)\n"
"   movq %%mm2, 16(%1)\n"
"   movq %%mm3, 24(%1)\n"
"   movq 32(%0), %%mm0\n"
"   movq 40(%0), %%mm1\n"
"   movq 48(%0), %%mm2\n"
"   movq 56(%0), %%mm3\n"
"   movq %%mm0, 32(%1)\n"
"   movq %%mm1, 40(%1)\n"
"   movq %%mm2, 48(%1)\n"
"   movq %%mm3, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}

/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);

return to;
}

void *
memcpy_arjanv_movntq (void *to, const void *from, size_t len)
{
int i;

__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
"   prefetchnta 64(%0)\n"
"   prefetchnta 128(%0)\n"
"   prefetchnta 192(%0)\n"
: : "r" (from) );

for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"   prefetchnta 200(%0)\n"
"   movq (%0), %%mm0\n"
"   movq 8(%0), %%mm1\n"
"   movq 16(%0), %%mm2\n"
"   movq 24(%0), %%mm3\n"
"   movq 32(%0), %%mm4\n"
"   movq 40(%0), %%mm5\n"
"   movq 48(%0), %%mm6\n"
"   movq 56(%0), %%mm7\n"
"   movntq %%mm0, (%1)\n"
"   movntq %%mm1, 8(%1)\n"
"   movntq %%mm2, 16(%1)\n"
"   movntq %%mm3, 24(%1)\n"
"   movntq %%mm4, 32(%1)\n"
"   movntq %%mm5, 40(%1)\n"
"   movntq %%mm6, 48(%1)\n"
"   movntq %%mm7, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);

return to;
}

void *
memcpy_arjanv_interleave (void *to, const void *from, size_t len)
{
int i;

__asm__ __volatile__ (
"1: prefetchnta (%0)\n"
"   prefetchnta 64(%0)\n"
"   prefetchnta 128(%0)\n"
"   prefetchnta 192(%0)\n"
: : "r" (from) );


for(i=0; i<len/64; i++) {
__asm__ __volatile__ (
"   prefetchnta 168(%0)\n"
"   movq (%0), %%mm0\n"
"   movntq %%mm0, (%1)\n"
"   movq 8(%0), %%mm1\n"
"   movntq %%mm1, 8(%1)\n"
"   movq 16(%0), %%mm2\n"
"   movntq %%mm2, 16(%1)\n"
"   movq 24(%0), %%mm3\n"
"   movntq %%mm3, 24(%1)\n"
"   movq 32(%0), %%mm4\n"
"   movntq %%mm4, 32(%1)\n"
"   movq 40(%0), %%mm5\n"
"   movntq %%mm5, 40(%1)\n"
"   movq 48(%0), %%mm6\n"
"   movntq %%mm6, 48(%1)\n"
"   movq 56(%0), %%mm7\n"
"   movntq %%mm7, 56(%1)\n"
: : "r" (from), "r" (to) : "memory");
from+=64;
to+=64;
}
/*
*Now do the tail of the block
*/
if (len&63)
memcpy(to, from, len&63);

return to;
}

static void wrap (char *p1,
  char *p2,
  size_t size,
  long loops,
  void *(*bfn) (void *, const void *, size_t),
  const char *msg)
{
long t;
int i, j;
char *tmp1, *tmp2;


memset(p2,42,size);

tmp1 = p1;
tmp2 = p2;

start_timer();

for (j = 0; j < 1024; j++) {
for (i=0; i<loops; i++) {
bfn (tmp1, tmp2, size);
tmp1 += size;
tmp2 += size;
}
tmp1 = p1;
tmp2 = p2;
}

t = end_timer();

print_time (msg, loops, t);
}

static void memcpy_test(size_t size)
{
long loops = 1024*1024 / size;

/* We need to make sure the blocks are *VERY* aligned, because
   MMX is potentially pretty fussy. */

char *p1 = (char *) malloc (1024 * 1024);
char *p2 = (char *) malloc (1024 * 1024);

printf("addr1=%p addr2=%p\n", p1, p2);

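/* size is a size_t, so cast it to match the printf conversion. */
if (size > 2048)
printf ("memcpy %lukB -- %ld loops\n", (unsigned long)(size>>10), loops);
else
printf ("memcpy %luB -- %ld loops\n", (unsigned long)size, loops);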


printf ("  aligned blocks\n");

wrap (p1, p2, size, loops, memcpy, "libc memcpy");
wrap (p1, p2, size, loops, memcpy_rep_movsl, "rep movsw");
wrap (p1, p2, size, loops, memcpy_words, "asm loop");
wrap (p1, p2, size, loops, i686_copyin, "i686_copyin");
wrap (p1, p2, size, loops, i686_copyin2, "i686_copyin2");
wrap (p1, p2, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1, p2, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1, p2, size, loops, memcpy_simple_movusb,
      "with simple MOVUSB (no prefetch)");
wrap (p1, p2, size, loops, memcpy_arjanv,
      "arjanv's MOVQ (with prefetch)");
wrap (p1, p2, size, loops, memcpy_arjanv_movntq,
      "arjanv's MOVNTQ (with prefetch, for Athlon)");
wrap (p1, p2, size, loops, memcpy_arjanv_interleave,
      "arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA");

printf ("  +0/+4 moderately unaligned blocks\n");

wrap (p1, p2+4, size, loops, memcpy, "libc memcpy");
wrap (p1, p2+4, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1, p2+4, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1, p2+4, size, loops, memcpy_simple_movusb,
      "with simple MOVUSB (no prefetch)");

printf ("  +10/+13 cruelly unaligned blocks\n");

wrap (p1+10, p2+13, size, loops, memcpy, "libc memcpy");
wrap (p1+10, p2+13, size, loops, memcpy_mmx,
"MMX memcpy using MOVQ");
wrap(p1+10, p2+13, size, loops, memcpy_movusb,
"with mingo's MOVUSB (prefetch, non-temporal)");
wrap (p1+10, p2+13, size, loops, memcpy_simple_movusb,
      "with simple MOVUSB (no prefetch)");

puts("");

free(p1); free(p2);
}


int main (void)
{
memcpy_test(64);
#if 0
memcpy_test(1<<7);
memcpy_test(1<<8);
memcpy_test(1<<9);
#endif
memcpy_test(1024);
#if 0
memcpy_test(1<<11);
#endif
memcpy_test(4096);
#if 0
memcpy_test(1<<13);
memcpy_test(1<<14);
memcpy_test(1<<15);
#endif
memcpy_test(1<<16);
memcpy_test(1<<17);
memcpy_test(1<<18);
memcpy_test(1<<19);
#if 0
memcpy_test(1<<20);
#endif
return 0;
}
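
The head-alignment step in memcpy_movusb() above computes how many bytes separate the destination from the next 16-byte boundary. A standalone sketch of that arithmetic is shown here; the helper name bytes_to_alignment is hypothetical, and align is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes from p up to the next multiple of align (a power of two).
 * Returns 0 when p is already aligned. */
static size_t bytes_to_alignment(const void *p, size_t align)
{
    uintptr_t misalign = (uintptr_t)p & (align - 1);
    return misalign ? align - misalign : 0;
}
```

With align set to 16, an address ending in 0x3 needs 13 more bytes to reach the boundary, which is the value the benchmark routine stores in size before advancing from and to.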

posted on 2005-08-26 11:30 Diviner  Views (13807)  Comments (11)

Feedback

# re: On the efficiency of memcpy 2005-08-26 11:34 SevenCat
On Wed, Oct 16, 2002 at 04:18:30AM +0900, Bang Jun-Young wrote:
> Another attached patch is i686 version of copyin(9) that makes use
> of MMX insns. It works well with intops-only programs, but doesn't
> with ones like XFree86 that uses FP ops. In this case, it would be
> helpful if NPX handling code was imported from FreeBSD (they have
> i586 optimized version of copyin/out(9)). Can anybody give me some
> comments wrt this?

Yup, there's a lot to be had by using SSE(2) instructions; copying
in 128-bit quantities is quite a useful thing to do. It's been
on my todo list for a while.

I've been playing with a few SSE memcpy functions myself, but
did not get around to adding the extra checks to the FP
save/restore code yet. There are some checks that need to
be done. It comes down to:

* Don't mess up the current process' FP state, so save it if necessary.
* Don't bother if there aren't enough bytes to copy, since you're
  paying the price of an entire FP save if someone was using the FPU.
* If you're going all the way, and are using memcpy with SSE in
  the kernel too, be careful about interrupts. If you come in
  during the FP save path, it will mess things up. And maybe
  you don't want to use FP in an interrupt at all; it'll
  cause a ton of FP save/restore actions.

It's not overly complicated to do, but it's important to take all
scenarios into account. copyin/out is the simplest case, since
you should be in a process context when doing those.

I'll probably have some time to spend on this soon (next month).
If you're going to work on it before then, please let me review
the changes.

- Frank
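
The three checks above could be folded into a single policy function along these lines. This is only a sketch under stated assumptions: use_simd_copy, its parameters, and SIMD_COPY_MIN_BYTES are hypothetical names, the 512-byte cut-off is borrowed from the value tried elsewhere in this thread, and a real kernel would obtain the FPU-ownership and interrupt-context information from its own data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cut-off: below this, a possible FP save/restore is assumed
 * to cost more than the wider copy gains. */
#define SIMD_COPY_MIN_BYTES 512

/* Decide whether a copy should take the MMX/SSE path.
 *   in_interrupt:   the caller is running in interrupt context
 *   fpu_state_live: some process owns live FP state that would have to be
 *                   saved and restored around the copy */
static bool use_simd_copy(size_t len, bool in_interrupt, bool fpu_state_live)
{
    if (in_interrupt)
        return false;   /* avoid FP use in interrupts entirely */
    if (fpu_state_live && len < SIMD_COPY_MIN_BYTES)
        return false;   /* too small to amortize a full FP save */
    return true;
}
```

The sketch leaves out the hardest part of the list, namely protecting the save path itself against interrupts, which cannot be expressed as a simple predicate.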

# re: On the efficiency of memcpy 2005-08-26 11:34 周星星
A foreign cat, eh <no content> - [usr_root] 2005-8-26 11:30:00 (0 bytes, 2 clicks)

# re: On the efficiency of memcpy 2005-08-26 11:35 SevenCat
Here is a new version of i686_copyin(). By saving the FPU state on the stack,
I could make it work with programs that use FP operations, including
XFree86, xmms, mozilla, etc.

In this version, I set the minimum length for using the MMX bcopy to 512
bytes. Since I don't know of a kernel profiling tool or a method to measure
copyin performance at kernel level, the number may be too small or
too large.

Possible todo:
- i686_copyout(), i686_kcopy(), i686_memcpy(), ...
- use prefetch and movntq instructions for PIII/4 or Athlon.
- use npxproc to eliminate overhead in saving FPU state, as
  FreeBSD does.

Index: locore.s
===================================================================
RCS file: /cvsroot/syssrc/sys/arch/i386/i386/locore.s,v
retrieving revision 1.265
diff -u -r1.265 locore.s
--- locore.s 2002/10/05 21:20:00 1.265
+++ locore.s 2002/10/22 16:42:17
@@ -951,7 +951,7 @@
#define DEFAULT_COPYIN _C_LABEL(i386_copyin) /* XXX */
#elif defined(I686_CPU)
#define DEFAULT_COPYOUT _C_LABEL(i486_copyout) /* XXX */
-#define DEFAULT_COPYIN _C_LABEL(i386_copyin) /* XXX */
+#define DEFAULT_COPYIN _C_LABEL(i686_copyin) /* XXX */
#endif

.data
@@ -1159,6 +1159,114 @@
xorl %eax,%eax
ret
#endif /* I386_CPU || I486_CPU || I586_CPU || I686_CPU */
+
+#if defined(I686_CPU)
+/* LINTSTUB: Func: int i686_copyin(const void *uaddr, void *kaddr, size_t len) */
+ENTRY(i686_copyin)
+ pushl %esi
+ pushl %edi
+ pushl %ebx
+ GET_CURPCB(%eax)
+ movl $_C_LABEL(i686_copy_fault),PCB_ONFAULT(%eax)
+
+ movl 16(%esp),%esi
+ movl 20(%esp),%edi
+ movl 24(%esp),%eax
+
+ /*
+ * We check that the end of the source buffer is not past the end
+ * of the user's address space. If it's not, then we only need to
+ * check that each page is readable, and the CPU will do that for us.
+ */
+ movl %esi,%edx
+ addl %eax,%edx
+ jc _C_LABEL(i686_copy_efault)
+ cmpl $VM_MAXUSER_ADDRESS,%edx
+ ja _C_LABEL(i686_copy_efault)
+
+ cmpl $512,%eax
+ jb 2f
+
+ xorl %ebx,%ebx
+ movl %eax,%edx
+ shrl $6,%edx
+
+ /*
+ * Save FPU state in stack.
+ */
+ smsw %cx
+ clts
+ subl $108,%esp
+ fnsave 0(%esp)
+
+1:
+ movq (%esi),%mm0
+ movq 8(%esi),%mm1
+ movq 16(%esi),%mm2
+ movq 24(%esi),%mm3
+ movq 32(%esi),%mm4
+ movq 40(%esi),%mm5
+ movq 48(%esi),%mm6
+ movq 56(%esi),%mm7
+ movq %mm0,(%edi)
+ movq %mm1,8(%edi)
+ movq %mm2,16(%edi)
+ movq %mm3,24(%edi)
+ movq %mm4,32(%edi)
+ movq %mm5,40(%edi)
+ movq %mm6,48(%edi)
+ movq %mm7,56(%edi)
+
+ addl $64,%esi
+ addl $64,%edi
+ incl %ebx
+ cmpl %edx,%ebx
+ jb 1b
+
+ /*
+ * Restore FPU state.
+ */
+ frstor 0(%esp)
+ addl $108,%esp
+ lmsw %cx
+
+ andl $63,%eax
+ je 3f
+
+2:
+ /* bcopy(%esi, %edi, %eax); */
+ cld
+ movl %eax,%ecx
+ shrl $2,%ecx
+ rep
+ movsl
+ movb %al,%cl
+ andb $3,%cl
+ rep
+ movsb
+
+3:
+ GET_CURPCB(%edx)
+ xorl %eax,%eax
+ popl %ebx
+ popl %edi
+ popl %esi
+ movl %eax,PCB_ONFAULT(%edx)
+ ret
+
+/* LINTSTUB: Ignore */
+NENTRY(i686_copy_efault)
+ movl $EFAULT,%eax
+
+/* LINTSTUB: Ignore */
+NENTRY(i686_copy_fault)
+ GET_CURPCB(%edx)
+ movl %eax,PCB_ONFAULT(%edx)
+ popl %ebx
+ popl %edi
+ popl %esi
+ ret
+#endif /* I686_CPU */

/* LINTSTUB: Ignore */
NENTRY(copy_efault)
Jun-Young
--
Bang Jun-Young

# re: On the efficiency of memcpy 2005-08-26 11:35 SevenCat
A few things:

* i686_copyout() is actually pretty important, because e.g.
  we don't have zero-copy socket reads yet (only writes), so
  a fast copy routine is important there.
* Same for i686_kcopy() - it's used in the NFS path, at least,
  and could significantly improve performance there.
* i686_memcpy() - be careful, because you have the whole
  "memcpy() is allowed in interrupts" thing. It's probably
  not worth bothering with this one, because there's a
  potential to spend a LOT of time saving/restoring FPU
  context.
* Yes, only save/restore the FP state if npxproc != NULL.
  In the MULTIPROCESSOR case, you also need to be careful
  because you could get an IPI from another CPU requesting
  the FP state, so you'll need to make sure to provide the
  correct one!
  In fact, it's probably best to save to the npxproc's PCB,
  and restore it back from there, rather than the stack.
  (Cuts down on potentially large stack usage, too.)
* You have to handle the fxsave/fxrstor case, i.e. if the CPU
  has SSE/SSE2.
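
The stack-usage concern in the last two points comes from the size of the save area: fnsave writes 108 bytes, while fxsave uses a 512-byte area that must be 16-byte aligned. A small sketch of choosing the layout is given below; the struct, the function name fp_save_layout_for, and the cpu_has_fxsr flag are hypothetical and stand in for whatever feature test the kernel actually uses.

```c
#include <stddef.h>

#define FNSAVE_AREA_BYTES 108  /* x87/MMX state written by fnsave */
#define FXSAVE_AREA_BYTES 512  /* x87/MMX/SSE state written by fxsave */
#define FXSAVE_AREA_ALIGN 16   /* fxsave requires a 16-byte aligned area */

struct fp_save_layout {
    size_t bytes;  /* size of the save area */
    size_t align;  /* required alignment of the save area */
};

/* Pick the save-area layout depending on whether the CPU supports
 * fxsave/fxrstor (cpu_has_fxsr is an assumed, caller-supplied flag). */
static struct fp_save_layout fp_save_layout_for(int cpu_has_fxsr)
{
    struct fp_save_layout l;

    if (cpu_has_fxsr) {
        l.bytes = FXSAVE_AREA_BYTES;
        l.align = FXSAVE_AREA_ALIGN;
    } else {
        l.bytes = FNSAVE_AREA_BYTES;
        l.align = 1;  /* fnsave has no special alignment requirement */
    }
    return l;
}
```

Reserving 512 aligned bytes on a kernel stack for every large copy is the kind of cost that makes saving into the owning process's PCB attractive.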

# re: On the efficiency of memcpy 2005-08-26 11:35 SevenCat
Well, the P4 (I don't have one) claims to execute 'rep movsl'
in the cache controller for suitably long and aligned transfers....
So whether SSE2 copies (which must also be aligned) are faster
is anybody's guess.

The other question is how many copies are actually long?
Otherwise the red tape starts becoming significant.
As does the code size itself - unless it is part of the
permanent working set of the application.

Oh - why not write assembler in assembler?
The 'asm' statements are epically hard to read :-(
(I might try decoding them tomorrow.)


# re: On the efficiency of memcpy 2005-08-26 11:36 SevenCat
I've done some experiments on my slot-A Athlon 700.

The libc memcpy is slow (on modern CPUs) because of the setup cost of
executing 'rep movs' instructions. The one used to
copy the remaining (0-3) bytes is particularly expensive.
'rep movsl' only starts to win for copies over (about) 200 bytes
(at which point the MMX copy is still 50% faster).

For small blocks (probably the commonest?) I get:

addr1=0x804c000 addr2=0x804c080
memcpy 64B -- 16777216 loops
  aligned blocks
      libc memcpy                                        1.721654 s
      rep movsw                                          1.310823 s
      asm loop                                           1.000972 s
      MMX memcpy using MOVQ                              0.762467 s
      arjanv's MOVQ (with prefetch)                      0.905702 s
      arjanv's MOVNTQ (with prefetch, for Athlon)        1.559139 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.556865 s
  +0/+4 moderately unaligned blocks
      libc memcpy                                        1.715516 s
      rep movsw                                          1.310894 s
      asm loop                                           1.000683 s
      MMX memcpy using MOVQ                              0.881484 s
  +10/+13 cruelly unaligned blocks
      libc memcpy                                        1.996214 s
      rep movsw                                          1.619813 s
      asm loop                                           1.190194 s
      MMX memcpy using MOVQ                              1.024688 s

where the 'rep movsl' and 'asm loop' are:

#include
ENTRY(memcpy_rep_movsl)
pushl %esi
pushl %edi
movl 20(%esp),%ecx
movl 12(%esp),%edi
movl 16(%esp),%esi
movl %edi,%eax /* return value */
movl %ecx,%edx
cld /* copy forwards. */
shrl $2,%ecx /* copy by words */
rep
movsl
testl $3,%edx
jne 1f
2: popl %edi
popl %esi
ret
1:
movl %edx,%ecx
andl $3,%ecx /* only the 0-3 leftover bytes, not the full length */
rep
movsb
jmp 2b
ENTRY(memcpy_words)
pushl %esi
pushl %edi
movl 12(%esp),%edi
movl 16(%esp),%esi
movl 20(%esp),%ecx
pushl %ebp
pushl %ebx
shrl $4,%ecx
1:
movl 0(%esi),%eax
movl 4(%esi),%edx
movl 8(%esi),%ebx
movl 12(%esi),%ebp
addl $16,%esi
subl $1,%ecx
movl %eax,0(%edi)
movl %edx,4(%edi)
movl %ebx,8(%edi)
movl %ebp,12(%edi)
leal 16(%edi),%edi
jne 1b
/* We ought to do the remainder here... */
popl %ebx
popl %ebp
movl 12(%esp),%eax
popl %edi
popl %esi
ret
David
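
For comparison, the 'asm loop' idea with the remainder handled can be written in portable C roughly as follows. This is a sketch, not a translation of the benchmarked routine: the name copy_words_with_tail is made up, and the fixed-size memcpy calls stand in for the four 32-bit loads and stores (compilers typically lower them to plain moves) while avoiding alignment and aliasing problems. It also copes with lengths under 16 bytes, which the assembly loop as written appears to assume never occur.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes in 16-byte blocks of four 32-bit words, then finish
 * the 0-15 leftover bytes one at a time. Returns to, like memcpy. */
static void *copy_words_with_tail(void *to, const void *from, size_t len)
{
    unsigned char *d = to;
    const unsigned char *s = from;
    size_t blocks = len / 16;

    while (blocks-- > 0) {
        uint32_t w[4];
        memcpy(w, s, sizeof w);  /* four word loads */
        memcpy(d, w, sizeof w);  /* four word stores */
        s += 16;
        d += 16;
    }
    for (len &= 15; len > 0; len--)
        *d++ = *s++;             /* the remainder the assembly version skips */
    return to;
}
```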


# re: On the efficiency of memcpy 2005-08-26 18:50 pAnic
--b

# re: On the efficiency of memcpy 2005-11-03 18:17 疯子阿虹
This is preliminary documentation and subject to change.

memcpy
Copies characters between buffers.

Routine    Required Header
memcpy     <memory.h> or <string.h>

void *memcpy( void *dest, const void *src, size_t count );

Parameters

dest
    New buffer
src
    Buffer to copy from
count
    Number of characters to copy

Libraries

All versions of the C run-time libraries.

Return Values

memcpy returns the value of dest.

Remarks

The memcpy function copies count bytes of src to dest. If the source and destination overlap, this function does not ensure that the original source bytes in the overlapping region are copied before being overwritten. Use memmove to handle overlapping regions.
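
A short illustration of the overlap rule in the remarks above: shifting data within one buffer makes source and destination overlap, so memmove is the right call. The helper name shift_right is hypothetical, and the caller is assumed to provide a buffer of at least len + k bytes.

```c
#include <stddef.h>
#include <string.h>

/* Shift the first len bytes of buf right by k positions, in place.
 * Source and destination overlap, so memmove is required; memcpy is not
 * guaranteed to read the overlapping bytes before overwriting them.
 * The caller must ensure buf holds at least len + k bytes. */
static void shift_right(char *buf, size_t len, size_t k)
{
    memmove(buf + k, buf, len);
}
```

Starting from the six characters "abcdef", shifting the first four bytes right by two leaves "ababcd" in the buffer.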

