pnggccrd.c@ 209

Last change on this file since 209 was 95, checked in by Rick van der Zwet, 15 years ago
Bad boy, improper move of directory
File size: 235.1 KB

1
2	/* pnggccrd.c - mixed C/assembler version of utilities to read a PNG file
3	*
4	* For Intel x86 CPU (Pentium-MMX or later) and GNU C compiler.
5	*
6	* See http://www.intel.com/drg/pentiumII/appnotes/916/916.htm
7	* and http://www.intel.com/drg/pentiumII/appnotes/923/923.htm
8	* for Intel's performance analysis of the MMX vs. non-MMX code.
9	*
10	* Last changed in libpng 1.2.15 January 5, 2007
11	* For conditions of distribution and use, see copyright notice in png.h
12	* Copyright (c) 1998-2007 Glenn Randers-Pehrson
13	* Copyright (c) 1998, Intel Corporation
14	*
15	* Based on MSVC code contributed by Nirav Chhatrapati, Intel Corp., 1998.
16	* Interface to libpng contributed by Gilles Vollant, 1999.
17	* GNU C port by Greg Roelofs, 1999-2001.
18	*
19	* Lines 2350-4300 converted in place with intel2gas 1.3.1:
20	*
21	* intel2gas -mdI pnggccrd.c.partially-msvc -o pnggccrd.c
22	*
23	* and then cleaned up by hand. See http://hermes.terminal.at/intel2gas/ .
24	*
25	* NOTE: A sufficiently recent version of GNU as (or as.exe under DOS/Windows)
26	* is required to assemble the newer MMX instructions such as movq.
27	* For djgpp, see
28	*
29	* ftp://ftp.simtel.net/pub/simtelnet/gnu/djgpp/v2gnu/bnu281b.zip
30	*
31	* (or a later version in the same directory). For Linux, check your
32	* distribution's web site(s) or try these links:
33	*
34	* http://rufus.w3.org/linux/RPM/binutils.html
35	* http://www.debian.org/Packages/stable/devel/binutils.html
36	* ftp://ftp.slackware.com/pub/linux/slackware/slackware/slakware/d1/
37	* binutils.tgz
38	*
39	* For other platforms, see the main GNU site:
40	*
41	* ftp://ftp.gnu.org/pub/gnu/binutils/
42	*
43	* Version 2.5.2l.15 is definitely too old...
44	*/
45
46	/*
47	* TEMPORARY PORTING NOTES AND CHANGELOG (mostly by Greg Roelofs)
48	* =====================================
49	*
50	* 19991006:
51	* - fixed sign error in post-MMX cleanup code (16- & 32-bit cases)
52	*
53	* 19991007:
54	* - additional optimizations (possible or definite):
55	* x [DONE] write MMX code for 64-bit case (pixel_bytes == 8) [not tested]
56	* - write MMX code for 48-bit case (pixel_bytes == 6)
57	* - figure out what's up with 24-bit case (pixel_bytes == 3):
58	* why subtract 8 from width_mmx in the pass 4/5 case?
59	* (only width_mmx case) (near line 1606)
60	* x [DONE] replace pixel_bytes within each block with the true
61	* constant value (or are compilers smart enough to do that?)
62	* - rewrite all MMX interlacing code so it's aligned with
63	* the beginning of the row buffer, not the end. This
64	* would not only allow one to eliminate half of the memory
65	* writes for odd passes (that is, pass == odd), it may also
66	* eliminate some unaligned-data-access exceptions (assuming
67	* there's a penalty for not aligning 64-bit accesses on
68	* 64-bit boundaries). The only catch is that the "leftover"
69	* pixel(s) at the end of the row would have to be saved,
70	* but there are enough unused MMX registers in every case,
71	* so this is not a problem. A further benefit is that the
72	* post-MMX cleanup code (C code) in at least some of the
73	* cases could be done within the assembler block.
74	* x [DONE] the "v3 v2 v1 v0 v7 v6 v5 v4" comments are confusing,
75	* inconsistent, and don't match the MMX Programmer's Reference
76	* Manual conventions anyway. They should be changed to
77	* "b7 b6 b5 b4 b3 b2 b1 b0," where b0 indicates the byte that
78	* was lowest in memory (e.g., corresponding to a left pixel)
79	* and b7 is the byte that was highest (e.g., a right pixel).
80	*
81	* 19991016:
82	* - Brennan's Guide notwithstanding, gcc under Linux does not
83	* want globals prefixed by underscores when referencing them--
84	* i.e., if the variable is const4, then refer to it as const4,
85	* not _const4. This seems to be a djgpp-specific requirement.
86	* Also, such variables apparently must be declared outside
87	* of functions; neither static nor automatic variables work if
88	* defined within the scope of a single function, but both
89	* static and truly global (multi-module) variables work fine.
90	*
91	* 19991023:
92	* - fixed png_combine_row() non-MMX replication bug (odd passes only?)
93	* - switched from string-concatenation-with-macros to cleaner method of
94	* renaming global variables for djgpp--i.e., always use prefixes in
95	* inlined assembler code (== strings) and conditionally rename the
96	* variables, not the other way around. Hence _const4, _mask8_0, etc.
97	*
98	* 19991024:
99	* - fixed mmxsupport()/png_do_read_interlace() first-row bug
100	* This one was severely weird: even though mmxsupport() doesn't touch
101	* ebx (where "row" pointer was stored), it nevertheless managed to zero
102	* the register (even in static/non-fPIC code--see below), which in turn
103	* caused png_do_read_interlace() to return prematurely on the first row of
104	* interlaced images (i.e., without expanding the interlaced pixels).
105	* Inspection of the generated assembly code didn't turn up any clues,
106	* although it did point at a minor optimization (i.e., get rid of
107	* mmx_supported_local variable and just use eax). Possibly the CPUID
108	* instruction is more destructive than it looks? (Not yet checked.)
109	* - "info gcc" was next to useless, so compared fPIC and non-fPIC assembly
110	* listings... Apparently register spillage has to do with ebx, since
111	* it's used to index the global offset table. Commenting it out of the
112	* input-reg lists in png_combine_row() eliminated compiler barfage, so
113	* ifdef'd with __PIC__ macro: if defined, use a global for unmask
114	*
115	* 19991107:
116	* - verified CPUID clobberage: 12-char string constant ("GenuineIntel",
117	* "AuthenticAMD", etc.) placed in ebx:ecx:edx. Still need to polish.
118	*
119	* 19991120:
120	* - made "diff" variable (now "_dif") global to simplify conversion of
121	* filtering routines (running out of regs, sigh). "diff" is still used
122	* in interlacing routines, however.
123	* - fixed up both versions of mmxsupport() (ORIG_THAT_USED_TO_CLOBBER_EBX
124	* macro determines which is used); original not yet tested.
125	*
126	* 20000213:
127	* - when compiling with gcc, be sure to use -fomit-frame-pointer
128	*
129	* 20000319:
130	* - fixed a register-name typo in png_do_read_interlace(), default (MMX) case,
131	* pass == 4 or 5, that caused visible corruption of interlaced images
132	*
133	* 20000623:
134	* - Various problems were reported with gcc 2.95.2 in the Cygwin environment,
135	* many of the form "forbidden register 0 (ax) was spilled for class AREG."
136	* This is explained at http://gcc.gnu.org/fom_serv/cache/23.html, and
137	* Chuck Wilson supplied a patch involving dummy output registers. See
138	* http://sourceforge.net/bugs/?func=detailbug&bug_id=108741&group_id=5624
139	* for the original (anonymous) SourceForge bug report.
140	*
141	* 20000706:
142	* - Chuck Wilson passed along these remaining gcc 2.95.2 errors:
143	* pnggccrd.c: In function `png_combine_row':
144	* pnggccrd.c:525: more than 10 operands in `asm'
145	* pnggccrd.c:669: more than 10 operands in `asm'
146	* pnggccrd.c:828: more than 10 operands in `asm'
147	* pnggccrd.c:994: more than 10 operands in `asm'
148	* pnggccrd.c:1177: more than 10 operands in `asm'
149	* They are all the same problem and can be worked around by using the
150	* global _unmask variable unconditionally, not just in the -fPIC case.
151	* Reportedly earlier versions of gcc also have the problem with more than
152	* 10 operands; they just don't report it. Much strangeness ensues, etc.
153	*
154	* 20000729:
155	* - enabled png_read_filter_row_mmx_up() (shortest remaining unconverted
156	* MMX routine); began converting png_read_filter_row_mmx_sub()
157	* - to finish remaining sections:
158	* - clean up indentation and comments
159	* - preload local variables
160	* - add output and input regs (order of former determines numerical
161	* mapping of latter)
162	* - avoid all usage of ebx (including bx, bh, bl) register [20000823]
163	* - remove "$" from addressing of Shift and Mask variables [20000823]
164	*
165	* 20000731:
166	* - global union vars causing segfaults in png_read_filter_row_mmx_sub()?
167	*
168	* 20000822:
169	* - ARGH, stupid png_read_filter_row_mmx_sub() segfault only happens with
170	* shared-library (-fPIC) version! Code works just fine as part of static
171	* library. Damn damn damn damn damn, should have tested that sooner.
172	* ebx is getting clobbered again (explicitly this time); need to save it
173	* on stack or rewrite asm code to avoid using it altogether. Blargh!
174	*
175	* 20000823:
176	* - first section was trickiest; all remaining sections have ebx -> edx now.
177	* (-fPIC works again.) Also added missing underscores to various Shift*
178	* and Mask globals and got rid of leading "$" signs.
179	*
180	* 20000826:
181	* - added visual separators to help navigate microscopic printed copies
182	* (http://pobox.com/~newt/code/gpr-latest.zip, mode 10); started working
183	* on png_read_filter_row_mmx_avg()
184	*
185	* 20000828:
186	* - finished png_read_filter_row_mmx_avg(): only Paeth left! (930 lines...)
187	* What the hell, did png_read_filter_row_mmx_paeth(), too. Comments not
188	* cleaned up/shortened in either routine, but functionality is complete
189	* and seems to be working fine.
190	*
191	* 20000829:
192	* - ahhh, figured out last(?) bit of gcc/gas asm-fu: if register is listed
193	* as an input reg (with dummy output variables, etc.), then it cannot
194	* also appear in the clobber list or gcc 2.95.2 will barf. The solution
195	* is simple enough...
196	*
197	* 20000914:
198	* - bug in png_read_filter_row_mmx_avg(): 16-bit grayscale not handled
199	* correctly (but 48-bit RGB just fine)
200	*
201	* 20000916:
202	* - fixed bug in png_read_filter_row_mmx_avg(), bpp == 2 case; three errors:
203	* - "_ShiftBpp.use = 24;" should have been "_ShiftBpp.use = 16;"
204	* - "_ShiftRem.use = 40;" should have been "_ShiftRem.use = 48;"
205	* - "psllq _ShiftRem, %%mm2" should have been "psrlq _ShiftRem, %%mm2"
206	*
207	* 20010101:
208	* - added new png_init_mmx_flags() function (here only because it needs to
209	* call mmxsupport(), which should probably become global png_mmxsupport());
210	* modified other MMX routines to run conditionally (png_ptr->asm_flags)
211	*
212	* 20010103:
213	* - renamed mmxsupport() to png_mmx_support(), with auto-set of mmx_supported,
214	* and made it public; moved png_init_mmx_flags() to png.c as internal func
215	*
216	* 20010104:
217	* - removed dependency on png_read_filter_row_c() (C code already duplicated
218	* within MMX version of png_read_filter_row()) so no longer necessary to
219	* compile it into pngrutil.o
220	*
221	* 20010310:
222	* - fixed buffer-overrun bug in png_combine_row() C code (non-MMX)
223	*
224	* 20020304:
225	* - eliminated incorrect use of width_mmx in pixel_bytes == 8 case
226	*
227	* 20040724:
228	* - more tinkering with clobber list at lines 4529 and 5033, to get
229	* it to compile on gcc-3.4.
230	*
231	* STILL TO DO:
232	* - test png_do_read_interlace() 64-bit case (pixel_bytes == 8)
233	* - write MMX code for 48-bit case (pixel_bytes == 6)
234	* - figure out what's up with 24-bit case (pixel_bytes == 3):
235	* why subtract 8 from width_mmx in the pass 4/5 case?
236	* (only width_mmx case) (near line 1606)
237	* - rewrite all MMX interlacing code so it's aligned with beginning
238	* of the row buffer, not the end (see 19991007 for details)
239	* x pick one version of mmxsupport() and get rid of the other
240	* - add error messages to any remaining bogus default cases
241	* - enable pixel_depth == 8 cases in png_read_filter_row()? (test speed)
242	* x add support for runtime enable/disable/query of various MMX routines
243	*/
244
245	#define PNG_INTERNAL
246	#include "png.h"
247
248	#if defined(PNG_ASSEMBLER_CODE_SUPPORTED) && defined(PNG_USE_PNGGCCRD)
249
250	int PNGAPI png_mmx_support(void);
251
252	#ifdef PNG_USE_LOCAL_ARRAYS
253	const static int FARDATA png_pass_start[7] = {0, 4, 0, 2, 0, 1, 0};
254	const static int FARDATA png_pass_inc[7] = {8, 8, 4, 4, 2, 2, 1};
255	const static int FARDATA png_pass_width[7] = {8, 4, 4, 2, 2, 1, 1};
256	#endif
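
The three tables above drive the non-MMX fallback in png_combine_row(): for pass p, copying starts at column png_pass_start[p], advances by png_pass_inc[p], and replicates png_pass_width[p] pixels at each step. A minimal sketch for 1-byte pixels follows. The names `combine_row_pass8` and `pass_*` are hypothetical, and the single clamped loop is an assumption that stands in for the file's separate main and leftover loops.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same values as the png_pass_* tables in the listing. */
static const int pass_start[7] = {0, 4, 0, 2, 0, 1, 0};
static const int pass_inc[7]   = {8, 8, 4, 4, 2, 2, 1};
static const int pass_width[7] = {8, 4, 4, 2, 2, 1, 1};

/* Hypothetical helper: copy the columns that belong to `pass`
 * from src to dst, assuming 1 byte per pixel. */
static void combine_row_pass8(unsigned char *dst, const unsigned char *src,
                              size_t width, int pass)
{
    assert(pass >= 0 && pass < 7);
    size_t stride = (size_t)pass_inc[pass];
    for (size_t i = (size_t)pass_start[pass]; i < width; i += stride) {
        size_t n = (size_t)pass_width[pass];
        if (n > width - i)          /* clamp the last run to the row end */
            n = width - i;
        memcpy(dst + i, src + i, n);
    }
}
```

For wider pixels the file scales all three quantities by the bytes-per-pixel constant (BPP2, BPP3, and so on) before entering the loop.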
257
258	#if defined(PNG_MMX_CODE_SUPPORTED)
259	/* djgpp, Win32, Cygwin, and OS2 add their own underscores to global variables,
260	* so define them without: */
261	#if defined(__DJGPP__) || defined(WIN32) || defined(__CYGWIN__) || \
262	defined(__OS2__)
263	# define _mmx_supported mmx_supported
264	# define _const4 const4
265	# define _const6 const6
266	# define _mask8_0 mask8_0
267	# define _mask16_1 mask16_1
268	# define _mask16_0 mask16_0
269	# define _mask24_2 mask24_2
270	# define _mask24_1 mask24_1
271	# define _mask24_0 mask24_0
272	# define _mask32_3 mask32_3
273	# define _mask32_2 mask32_2
274	# define _mask32_1 mask32_1
275	# define _mask32_0 mask32_0
276	# define _mask48_5 mask48_5
277	# define _mask48_4 mask48_4
278	# define _mask48_3 mask48_3
279	# define _mask48_2 mask48_2
280	# define _mask48_1 mask48_1
281	# define _mask48_0 mask48_0
282	# define _LBCarryMask LBCarryMask
283	# define _HBClearMask HBClearMask
284	# define _ActiveMask ActiveMask
285	# define _ActiveMask2 ActiveMask2
286	# define _ActiveMaskEnd ActiveMaskEnd
287	# define _ShiftBpp ShiftBpp
288	# define _ShiftRem ShiftRem
289	#ifdef PNG_THREAD_UNSAFE_OK
290	# define _unmask unmask
291	# define _FullLength FullLength
292	# define _MMXLength MMXLength
293	# define _dif dif
294	# define _patemp patemp
295	# define _pbtemp pbtemp
296	# define _pctemp pctemp
297	#endif
298	#endif
299
300
301	/* These constants are used in the inlined MMX assembly code.
302	Ignore gcc's "At top level: defined but not used" warnings. */
303
304	/* GRR 20000706: originally _unmask was needed only when compiling with -fPIC,
305	* since that case uses the %ebx register for indexing the Global Offset Table
306	* and there were no other registers available. But gcc 2.95 and later emit
307	* "more than 10 operands in `asm'" errors when %ebx is used to preload unmask
308	* in the non-PIC case, so we'll just use the global unconditionally now.
309	*/
310	#ifdef PNG_THREAD_UNSAFE_OK
311	static int _unmask;
312	#endif
313
314	const static unsigned long long _mask8_0 = 0x0102040810204080LL;
315
316	const static unsigned long long _mask16_1 = 0x0101020204040808LL;
317	const static unsigned long long _mask16_0 = 0x1010202040408080LL;
318
319	const static unsigned long long _mask24_2 = 0x0101010202020404LL;
320	const static unsigned long long _mask24_1 = 0x0408080810101020LL;
321	const static unsigned long long _mask24_0 = 0x2020404040808080LL;
322
323	const static unsigned long long _mask32_3 = 0x0101010102020202LL;
324	const static unsigned long long _mask32_2 = 0x0404040408080808LL;
325	const static unsigned long long _mask32_1 = 0x1010101020202020LL;
326	const static unsigned long long _mask32_0 = 0x4040404080808080LL;
327
328	const static unsigned long long _mask48_5 = 0x0101010101010202LL;
329	const static unsigned long long _mask48_4 = 0x0202020204040404LL;
330	const static unsigned long long _mask48_3 = 0x0404080808080808LL;
331	const static unsigned long long _mask48_2 = 0x1010101010102020LL;
332	const static unsigned long long _mask48_1 = 0x2020202040404040LL;
333	const static unsigned long long _mask48_0 = 0x4040808080808080LL;
334
335	const static unsigned long long _const4 = 0x0000000000FFFFFFLL;
336	//const static unsigned long long _const5 = 0x000000FFFFFF0000LL; // NOT USED
337	const static unsigned long long _const6 = 0x00000000000000FFLL;
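
The _maskNN_M constants above are easier to follow with a scalar model of what the 8-bit MMX prologue does with them: replicate the low byte of ~mask into all eight lanes, AND with _mask8_0, then compare each lane against zero to get a per-byte select mask. The sketch below is an illustration in portable C, not the code the file runs. `load_le64`, `store_le64`, `select_mask8` and `combine8_swar` are hypothetical names, and the explicit little-endian loads are an assumption that mimics x86 byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read 8 bytes so that p[0] lands in the lowest-order byte (x86 order). */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int k = 7; k >= 0; k--)
        v = (v << 8) | p[k];
    return v;
}

static void store_le64(unsigned char *p, uint64_t v)
{
    for (int k = 0; k < 8; k++)
        p[k] = (unsigned char)(v >> (8 * k));
}

/* Hypothetical helper: byte k of the result is 0xFF when
 * bit (0x80 >> k) of mask is set, else 0x00. */
static uint64_t select_mask8(int mask)
{
    const uint64_t mask8_0 = 0x0102040810204080ULL; /* same value as _mask8_0 */
    uint64_t unmask = (uint64_t)(~mask & 0xff);
    uint64_t rep = unmask * 0x0101010101010101ULL;  /* punpck*: low byte in all lanes */
    uint64_t t = mask8_0 & rep;                     /* pand */
    uint64_t sel = 0;
    for (int k = 0; k < 8; k++)                     /* pcmpeqb against zero */
        if (((t >> (8 * k)) & 0xff) == 0)
            sel |= (uint64_t)0xff << (8 * k);
    return sel;
}

/* Hypothetical helper: merge n bytes of 8-bit pixels, n a multiple of 8. */
static void combine8_swar(unsigned char *dst, const unsigned char *src,
                          size_t n, int mask)
{
    assert(n % 8 == 0);
    uint64_t sel = select_mask8(mask);
    for (size_t i = 0; i < n; i += 8) {
        uint64_t s = load_le64(src + i);
        uint64_t d = load_le64(dst + i);
        store_le64(dst + i, (s & sel) | (d & ~sel)); /* pand / pandn / por */
    }
}
```

The wider pixel depths follow the same pattern with more than one constant per group of 8 pixels (two for 16-bit, three for 24-bit), because 8 pixels then span several 64-bit registers.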
338
339	// These are used in the row-filter routines and should/would be local
340	// variables if not for gcc addressing limitations.
341	// WARNING: Their presence probably defeats the thread safety of libpng.
342
343	#ifdef PNG_THREAD_UNSAFE_OK
344	static png_uint_32 _FullLength;
345	static png_uint_32 _MMXLength;
346	static int _dif;
347	static int _patemp; // temp variables for Paeth routine
348	static int _pbtemp;
349	static int _pctemp;
350	#endif
351
352	void /* PRIVATE */
353	png_squelch_warnings(void)
354	{
355	#ifdef PNG_THREAD_UNSAFE_OK
356	_dif = _dif;
357	_patemp = _patemp;
358	_pbtemp = _pbtemp;
359	_pctemp = _pctemp;
360	_MMXLength = _MMXLength;
361	#endif
362	_const4 = _const4;
363	_const6 = _const6;
364	_mask8_0 = _mask8_0;
365	_mask16_1 = _mask16_1;
366	_mask16_0 = _mask16_0;
367	_mask24_2 = _mask24_2;
368	_mask24_1 = _mask24_1;
369	_mask24_0 = _mask24_0;
370	_mask32_3 = _mask32_3;
371	_mask32_2 = _mask32_2;
372	_mask32_1 = _mask32_1;
373	_mask32_0 = _mask32_0;
374	_mask48_5 = _mask48_5;
375	_mask48_4 = _mask48_4;
376	_mask48_3 = _mask48_3;
377	_mask48_2 = _mask48_2;
378	_mask48_1 = _mask48_1;
379	_mask48_0 = _mask48_0;
380	}
381	#endif /* PNG_MMX_CODE_SUPPORTED */
382
383
384	static int _mmx_supported = 2;
385
386	/*===========================================================================*/
387	/* */
388	/* P N G _ C O M B I N E _ R O W */
389	/* */
390	/*===========================================================================*/
391
392	#if defined(PNG_HAVE_MMX_COMBINE_ROW)
393
394	#define BPP2 2
395	#define BPP3 3 /* bytes per pixel (a.k.a. pixel_bytes) */
396	#define BPP4 4
397	#define BPP6 6 /* (defined only to help avoid cut-and-paste errors) */
398	#define BPP8 8
399
400	/* Combines the row recently read in with the previous row.
401	This routine takes care of alpha and transparency if requested.
402	This routine also handles the two methods of progressive display
403	of interlaced images, depending on the mask value.
404	The mask value describes which pixels are to be combined with
405	the row. The pattern always repeats every 8 pixels, so just 8
406	bits are needed. A one indicates the pixel is to be combined; a
407	zero indicates the pixel is to be skipped. This is in addition
408	to any alpha or transparency value associated with the pixel.
409	If you want all pixels to be combined, pass 0xff (255) in mask. */
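
The mask rule described in the comment above can be written out for whole-byte pixel depths as a short loop: pixel i is copied when bit (0x80 >> (i % 8)) of the mask is set, so the most significant bit governs the leftmost pixel of each group of 8. This is a sketch under that reading, with a hypothetical function name and signature. It leaves out the packed 1-, 2- and 4-bit depths, which the routine handles with shifts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy pixel i from src to dst when
 * bit (0x80 >> (i % 8)) of mask is set; bpp is bytes per pixel. */
static void combine_row_bytes(unsigned char *dst, const unsigned char *src,
                              size_t width, size_t bpp, int mask)
{
    assert(bpp >= 1);
    if (mask == 0xff) {             /* every pixel selected: one bulk copy */
        memcpy(dst, src, width * bpp);
        return;
    }
    int m = 0x80;                   /* walks MSB -> LSB, then wraps */
    for (size_t i = 0; i < width; i++) {
        if (m & mask)
            memcpy(dst + i * bpp, src + i * bpp, bpp);
        m = (m == 1) ? 0x80 : (m >> 1);
    }
}
```

With mask 0x88 and 2-byte pixels, for example, pixels 0, 4, 8, ... are copied and all others in dst are left untouched.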
410
411	/* Use this routine for the x86 platform - it uses a faster MMX routine
412	if the machine supports MMX. */
413
414	void /* PRIVATE */
415	png_combine_row(png_structp png_ptr, png_bytep row, int mask)
416	{
417	png_debug(1, "in png_combine_row (pnggccrd.c)\n");
418
419	#if defined(PNG_MMX_CODE_SUPPORTED)
420	if (_mmx_supported == 2) {
421	#if !defined(PNG_1_0_X)
422	/* this should have happened in png_init_mmx_flags() already */
423	png_warning(png_ptr, "asm_flags may not have been initialized");
424	#endif
425	png_mmx_support();
426	}
427	#endif
428
429	if (mask == 0xff)
430	{
431	png_debug(2,"mask == 0xff: doing single png_memcpy()\n");
432	png_memcpy(row, png_ptr->row_buf + 1,
433	(png_size_t)PNG_ROWBYTES(png_ptr->row_info.pixel_depth,png_ptr->width));
434	}
435	else /* (png_combine_row() is never called with mask == 0) */
436	{
437	switch (png_ptr->row_info.pixel_depth)
438	{
439	case 1: /* png_ptr->row_info.pixel_depth */
440	{
441	png_bytep sp;
442	png_bytep dp;
443	int s_inc, s_start, s_end;
444	int m;
445	int shift;
446	png_uint_32 i;
447
448	sp = png_ptr->row_buf + 1;
449	dp = row;
450	m = 0x80;
451	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
452	if (png_ptr->transformations & PNG_PACKSWAP)
453	{
454	s_start = 0;
455	s_end = 7;
456	s_inc = 1;
457	}
458	else
459	#endif
460	{
461	s_start = 7;
462	s_end = 0;
463	s_inc = -1;
464	}
465
466	shift = s_start;
467
468	for (i = 0; i < png_ptr->width; i++)
469	{
470	if (m & mask)
471	{
472	int value;
473
474	value = (*sp >> shift) & 0x1;
475	*dp &= (png_byte)((0x7f7f >> (7 - shift)) & 0xff);
476	*dp |= (png_byte)(value << shift);
477	}
478
479	if (shift == s_end)
480	{
481	shift = s_start;
482	sp++;
483	dp++;
484	}
485	else
486	shift += s_inc;
487
488	if (m == 1)
489	m = 0x80;
490	else
491	m >>= 1;
492	}
493	break;
494	}
495
496	case 2: /* png_ptr->row_info.pixel_depth */
497	{
498	png_bytep sp;
499	png_bytep dp;
500	int s_start, s_end, s_inc;
501	int m;
502	int shift;
503	png_uint_32 i;
504	int value;
505
506	sp = png_ptr->row_buf + 1;
507	dp = row;
508	m = 0x80;
509	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
510	if (png_ptr->transformations & PNG_PACKSWAP)
511	{
512	s_start = 0;
513	s_end = 6;
514	s_inc = 2;
515	}
516	else
517	#endif
518	{
519	s_start = 6;
520	s_end = 0;
521	s_inc = -2;
522	}
523
524	shift = s_start;
525
526	for (i = 0; i < png_ptr->width; i++)
527	{
528	if (m & mask)
529	{
530	value = (*sp >> shift) & 0x3;
531	*dp &= (png_byte)((0x3f3f >> (6 - shift)) & 0xff);
532	*dp |= (png_byte)(value << shift);
533	}
534
535	if (shift == s_end)
536	{
537	shift = s_start;
538	sp++;
539	dp++;
540	}
541	else
542	shift += s_inc;
543	if (m == 1)
544	m = 0x80;
545	else
546	m >>= 1;
547	}
548	break;
549	}
550
551	case 4: /* png_ptr->row_info.pixel_depth */
552	{
553	png_bytep sp;
554	png_bytep dp;
555	int s_start, s_end, s_inc;
556	int m;
557	int shift;
558	png_uint_32 i;
559	int value;
560
561	sp = png_ptr->row_buf + 1;
562	dp = row;
563	m = 0x80;
564	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
565	if (png_ptr->transformations & PNG_PACKSWAP)
566	{
567	s_start = 0;
568	s_end = 4;
569	s_inc = 4;
570	}
571	else
572	#endif
573	{
574	s_start = 4;
575	s_end = 0;
576	s_inc = -4;
577	}
578	shift = s_start;
579
580	for (i = 0; i < png_ptr->width; i++)
581	{
582	if (m & mask)
583	{
584	value = (*sp >> shift) & 0xf;
585	*dp &= (png_byte)((0xf0f >> (4 - shift)) & 0xff);
586	*dp |= (png_byte)(value << shift);
587	}
588
589	if (shift == s_end)
590	{
591	shift = s_start;
592	sp++;
593	dp++;
594	}
595	else
596	shift += s_inc;
597	if (m == 1)
598	m = 0x80;
599	else
600	m >>= 1;
601	}
602	break;
603	}
604
605	case 8: /* png_ptr->row_info.pixel_depth */
606	{
607	png_bytep srcptr;
608	png_bytep dstptr;
609
610	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
611	#if !defined(PNG_1_0_X)
612	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
613	/* && _mmx_supported */ )
614	#else
615	if (_mmx_supported)
616	#endif
617	{
618	png_uint_32 len;
619	int diff;
620	int dummy_value_a; // fix 'forbidden register spilled' error
621	int dummy_value_d;
622	int dummy_value_c;
623	int dummy_value_S;
624	int dummy_value_D;
625	_unmask = ~mask; // global variable for -fPIC version
626	srcptr = png_ptr->row_buf + 1;
627	dstptr = row;
628	len = png_ptr->width &~7; // reduce to multiple of 8
629	diff = (int) (png_ptr->width & 7); // amount lost
630
631	__asm__ __volatile__ (
632	"movd _unmask, %%mm7 \n\t" // load bit pattern
633	"psubb %%mm6, %%mm6 \n\t" // zero mm6
634	"punpcklbw %%mm7, %%mm7 \n\t"
635	"punpcklwd %%mm7, %%mm7 \n\t"
636	"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
637
638	"movq _mask8_0, %%mm0 \n\t"
639	"pand %%mm7, %%mm0 \n\t" // nonzero if keep byte
640	"pcmpeqb %%mm6, %%mm0 \n\t" // zeros->1s, v versa
641
642	// preload "movl len, %%ecx \n\t" // load length of line
643	// preload "movl srcptr, %%esi \n\t" // load source
644	// preload "movl dstptr, %%edi \n\t" // load dest
645
646	"cmpl $0, %%ecx \n\t" // len == 0 ?
647	"je mainloop8end \n\t"
648
649	"mainloop8: \n\t"
650	"movq (%%esi), %%mm4 \n\t" // *srcptr
651	"pand %%mm0, %%mm4 \n\t"
652	"movq %%mm0, %%mm6 \n\t"
653	"pandn (%%edi), %%mm6 \n\t" // *dstptr
654	"por %%mm6, %%mm4 \n\t"
655	"movq %%mm4, (%%edi) \n\t"
656	"addl $8, %%esi \n\t" // inc by 8 bytes processed
657	"addl $8, %%edi \n\t"
658	"subl $8, %%ecx \n\t" // dec by 8 pixels processed
659	"ja mainloop8 \n\t"
660
661	"mainloop8end: \n\t"
662	// preload "movl diff, %%ecx \n\t" // (diff is in eax)
663	"movl %%eax, %%ecx \n\t"
664	"cmpl $0, %%ecx \n\t"
665	"jz end8 \n\t"
666	// preload "movl mask, %%edx \n\t"
667	"sall $24, %%edx \n\t" // make low byte, high byte
668
669	"secondloop8: \n\t"
670	"sall %%edx \n\t" // move high bit to CF
671	"jnc skip8 \n\t" // if CF = 0
672	"movb (%%esi), %%al \n\t"
673	"movb %%al, (%%edi) \n\t"
674
675	"skip8: \n\t"
676	"incl %%esi \n\t"
677	"incl %%edi \n\t"
678	"decl %%ecx \n\t"
679	"jnz secondloop8 \n\t"
680
681	"end8: \n\t"
682	"EMMS \n\t" // DONE
683
684	: "=a" (dummy_value_a), // output regs (dummy)
685	"=d" (dummy_value_d),
686	"=c" (dummy_value_c),
687	"=S" (dummy_value_S),
688	"=D" (dummy_value_D)
689
690	: "3" (srcptr), // esi // input regs
691	"4" (dstptr), // edi
692	"0" (diff), // eax
693	// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
694	"2" (len), // ecx
695	"1" (mask) // edx
696
697	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
698	: "%mm0", "%mm4", "%mm6", "%mm7" // clobber list
699	#endif
700	);
701	}
702	else /* mmx _not supported - Use modified C routine */
703	#endif /* PNG_MMX_CODE_SUPPORTED */
704	{
705	register png_uint_32 i;
706	png_uint_32 initial_val = png_pass_start[png_ptr->pass];
707	/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
708	register int stride = png_pass_inc[png_ptr->pass];
709	/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
710	register int rep_bytes = png_pass_width[png_ptr->pass];
711	/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
712	png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
713	int diff = (int) (png_ptr->width & 7); /* amount lost */
714	register png_uint_32 final_val = len; /* GRR bugfix */
715
716	srcptr = png_ptr->row_buf + 1 + initial_val;
717	dstptr = row + initial_val;
718
719	for (i = initial_val; i < final_val; i += stride)
720	{
721	png_memcpy(dstptr, srcptr, rep_bytes);
722	srcptr += stride;
723	dstptr += stride;
724	}
725	if (diff) /* number of leftover pixels: 3 for pngtest */
726	{
727	final_val+=diff /* *BPP1 */ ;
728	for (; i < final_val; i += stride)
729	{
730	if (rep_bytes > (int)(final_val-i))
731	rep_bytes = (int)(final_val-i);
732	png_memcpy(dstptr, srcptr, rep_bytes);
733	srcptr += stride;
734	dstptr += stride;
735	}
736	}
737
738	} /* end of else (_mmx_supported) */
739
740	break;
741	} /* end 8 bpp */
742
743	case 16: /* png_ptr->row_info.pixel_depth */
744	{
745	png_bytep srcptr;
746	png_bytep dstptr;
747
748	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
749	#if !defined(PNG_1_0_X)
750	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
751	/* && _mmx_supported */ )
752	#else
753	if (_mmx_supported)
754	#endif
755	{
756	png_uint_32 len;
757	int diff;
758	int dummy_value_a; // fix 'forbidden register spilled' error
759	int dummy_value_d;
760	int dummy_value_c;
761	int dummy_value_S;
762	int dummy_value_D;
763	_unmask = ~mask; // global variable for -fPIC version
764	srcptr = png_ptr->row_buf + 1;
765	dstptr = row;
766	len = png_ptr->width &~7; // reduce to multiple of 8
767	diff = (int) (png_ptr->width & 7); // amount lost //
768
769	__asm__ __volatile__ (
770	"movd _unmask, %%mm7 \n\t" // load bit pattern
771	"psubb %%mm6, %%mm6 \n\t" // zero mm6
772	"punpcklbw %%mm7, %%mm7 \n\t"
773	"punpcklwd %%mm7, %%mm7 \n\t"
774	"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
775
776	"movq _mask16_0, %%mm0 \n\t"
777	"movq _mask16_1, %%mm1 \n\t"
778
779	"pand %%mm7, %%mm0 \n\t"
780	"pand %%mm7, %%mm1 \n\t"
781
782	"pcmpeqb %%mm6, %%mm0 \n\t"
783	"pcmpeqb %%mm6, %%mm1 \n\t"
784
785	// preload "movl len, %%ecx \n\t" // load length of line
786	// preload "movl srcptr, %%esi \n\t" // load source
787	// preload "movl dstptr, %%edi \n\t" // load dest
788
789	"cmpl $0, %%ecx \n\t"
790	"jz mainloop16end \n\t"
791
792	"mainloop16: \n\t"
793	"movq (%%esi), %%mm4 \n\t"
794	"pand %%mm0, %%mm4 \n\t"
795	"movq %%mm0, %%mm6 \n\t"
796	"movq (%%edi), %%mm7 \n\t"
797	"pandn %%mm7, %%mm6 \n\t"
798	"por %%mm6, %%mm4 \n\t"
799	"movq %%mm4, (%%edi) \n\t"
800
801	"movq 8(%%esi), %%mm5 \n\t"
802	"pand %%mm1, %%mm5 \n\t"
803	"movq %%mm1, %%mm7 \n\t"
804	"movq 8(%%edi), %%mm6 \n\t"
805	"pandn %%mm6, %%mm7 \n\t"
806	"por %%mm7, %%mm5 \n\t"
807	"movq %%mm5, 8(%%edi) \n\t"
808
809	"addl $16, %%esi \n\t" // inc by 16 bytes processed
810	"addl $16, %%edi \n\t"
811	"subl $8, %%ecx \n\t" // dec by 8 pixels processed
812	"ja mainloop16 \n\t"
813
814	"mainloop16end: \n\t"
815	// preload "movl diff, %%ecx \n\t" // (diff is in eax)
816	"movl %%eax, %%ecx \n\t"
817	"cmpl $0, %%ecx \n\t"
818	"jz end16 \n\t"
819	// preload "movl mask, %%edx \n\t"
820	"sall $24, %%edx \n\t" // make low byte, high byte
821
822	"secondloop16: \n\t"
823	"sall %%edx \n\t" // move high bit to CF
824	"jnc skip16 \n\t" // if CF = 0
825	"movw (%%esi), %%ax \n\t"
826	"movw %%ax, (%%edi) \n\t"
827
828	"skip16: \n\t"
829	"addl $2, %%esi \n\t"
830	"addl $2, %%edi \n\t"
831	"decl %%ecx \n\t"
832	"jnz secondloop16 \n\t"
833
834	"end16: \n\t"
835	"EMMS \n\t" // DONE
836
837	: "=a" (dummy_value_a), // output regs (dummy)
838	"=c" (dummy_value_c),
839	"=d" (dummy_value_d),
840	"=S" (dummy_value_S),
841	"=D" (dummy_value_D)
842
843	: "0" (diff), // eax // input regs
844	// was (unmask) " " RESERVED // ebx // Global Offset Table idx
845	"1" (len), // ecx
846	"2" (mask), // edx
847	"3" (srcptr), // esi
848	"4" (dstptr) // edi
849
850	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
851	: "%mm0", "%mm1", "%mm4" // clobber list
852	, "%mm5", "%mm6", "%mm7"
853	#endif
854	);
855	}
856	else /* mmx _not supported - Use modified C routine */
857	#endif /* PNG_MMX_CODE_SUPPORTED */
858	{
859	register png_uint_32 i;
860	png_uint_32 initial_val = BPP2 * png_pass_start[png_ptr->pass];
861	/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
862	register int stride = BPP2 * png_pass_inc[png_ptr->pass];
863	/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
864	register int rep_bytes = BPP2 * png_pass_width[png_ptr->pass];
865	/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
866	png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
867	int diff = (int) (png_ptr->width & 7); /* amount lost */
868	register png_uint_32 final_val = BPP2 * len; /* GRR bugfix */
869
870	srcptr = png_ptr->row_buf + 1 + initial_val;
871	dstptr = row + initial_val;
872
873	for (i = initial_val; i < final_val; i += stride)
874	{
875	png_memcpy(dstptr, srcptr, rep_bytes);
876	srcptr += stride;
877	dstptr += stride;
878	}
879	if (diff) /* number of leftover pixels: 3 for pngtest */
880	{
881	final_val+=diff*BPP2;
882	for (; i < final_val; i += stride)
883	{
884	if (rep_bytes > (int)(final_val-i))
885	rep_bytes = (int)(final_val-i);
886	png_memcpy(dstptr, srcptr, rep_bytes);
887	srcptr += stride;
888	dstptr += stride;
889	}
890	}
891	} /* end of else (_mmx_supported) */
892
893	break;
894	} /* end 16 bpp */
895
896	case 24: /* png_ptr->row_info.pixel_depth */
897	{
898	png_bytep srcptr;
899	png_bytep dstptr;
900
901	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
902	#if !defined(PNG_1_0_X)
903	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
904	/* && _mmx_supported */ )
905	#else
906	if (_mmx_supported)
907	#endif
908	{
909	png_uint_32 len;
910	int diff;
911	int dummy_value_a; // fix 'forbidden register spilled' error
912	int dummy_value_d;
913	int dummy_value_c;
914	int dummy_value_S;
915	int dummy_value_D;
916	_unmask = ~mask; // global variable for -fPIC version
917	srcptr = png_ptr->row_buf + 1;
918	dstptr = row;
919	len = png_ptr->width &~7; // reduce to multiple of 8
920	diff = (int) (png_ptr->width & 7); // amount lost //
921
922	__asm__ __volatile__ (
923	"movd _unmask, %%mm7 \n\t" // load bit pattern
924	"psubb %%mm6, %%mm6 \n\t" // zero mm6
925	"punpcklbw %%mm7, %%mm7 \n\t"
926	"punpcklwd %%mm7, %%mm7 \n\t"
927	"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
928
929	"movq _mask24_0, %%mm0 \n\t"
930	"movq _mask24_1, %%mm1 \n\t"
931	"movq _mask24_2, %%mm2 \n\t"
932
933	"pand %%mm7, %%mm0 \n\t"
934	"pand %%mm7, %%mm1 \n\t"
935	"pand %%mm7, %%mm2 \n\t"
936
937	"pcmpeqb %%mm6, %%mm0 \n\t"
938	"pcmpeqb %%mm6, %%mm1 \n\t"
939	"pcmpeqb %%mm6, %%mm2 \n\t"
940
941	// preload "movl len, %%ecx \n\t" // load length of line
942	// preload "movl srcptr, %%esi \n\t" // load source
943	// preload "movl dstptr, %%edi \n\t" // load dest
944
945	"cmpl $0, %%ecx \n\t"
946	"jz mainloop24end \n\t"
947
948	"mainloop24: \n\t"
949	"movq (%%esi), %%mm4 \n\t"
950	"pand %%mm0, %%mm4 \n\t"
951	"movq %%mm0, %%mm6 \n\t"
952	"movq (%%edi), %%mm7 \n\t"
953	"pandn %%mm7, %%mm6 \n\t"
954	"por %%mm6, %%mm4 \n\t"
955	"movq %%mm4, (%%edi) \n\t"
956
957	"movq 8(%%esi), %%mm5 \n\t"
958	"pand %%mm1, %%mm5 \n\t"
959	"movq %%mm1, %%mm7 \n\t"
960	"movq 8(%%edi), %%mm6 \n\t"
961	"pandn %%mm6, %%mm7 \n\t"
962	"por %%mm7, %%mm5 \n\t"
963	"movq %%mm5, 8(%%edi) \n\t"
964
965	"movq 16(%%esi), %%mm6 \n\t"
966	"pand %%mm2, %%mm6 \n\t"
967	"movq %%mm2, %%mm4 \n\t"
968	"movq 16(%%edi), %%mm7 \n\t"
969	"pandn %%mm7, %%mm4 \n\t"
970	"por %%mm4, %%mm6 \n\t"
971	"movq %%mm6, 16(%%edi) \n\t"
972
973	"addl $24, %%esi \n\t" // inc by 24 bytes processed
974	"addl $24, %%edi \n\t"
975	"subl $8, %%ecx \n\t" // dec by 8 pixels processed
976
977	"ja mainloop24 \n\t"
978
979	"mainloop24end: \n\t"
980	// preload "movl diff, %%ecx \n\t" // (diff is in eax)
981	"movl %%eax, %%ecx \n\t"
982	"cmpl $0, %%ecx \n\t"
983	"jz end24 \n\t"
984	// preload "movl mask, %%edx \n\t"
985	"sall $24, %%edx \n\t" // low byte => high byte
986
987	"secondloop24: \n\t"
988	"sall %%edx \n\t" // move high bit to CF
989	"jnc skip24 \n\t" // if CF = 0
990	"movw (%%esi), %%ax \n\t"
991	"movw %%ax, (%%edi) \n\t"
992	"xorl %%eax, %%eax \n\t"
993	"movb 2(%%esi), %%al \n\t"
994	"movb %%al, 2(%%edi) \n\t"
995
996	"skip24: \n\t"
997	"addl $3, %%esi \n\t"
998	"addl $3, %%edi \n\t"
999	"decl %%ecx \n\t"
1000	"jnz secondloop24 \n\t"
1001
1002	"end24: \n\t"
1003	"EMMS \n\t" // DONE
1004
1005	: "=a" (dummy_value_a), // output regs (dummy)
1006	"=d" (dummy_value_d),
1007	"=c" (dummy_value_c),
1008	"=S" (dummy_value_S),
1009	"=D" (dummy_value_D)
1010
1011	: "3" (srcptr), // esi // input regs
1012	"4" (dstptr), // edi
1013	"0" (diff), // eax
1014	// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
1015	"2" (len), // ecx
1016	"1" (mask) // edx
1017
1018	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
1019	: "%mm0", "%mm1", "%mm2" // clobber list
1020	, "%mm4", "%mm5", "%mm6", "%mm7"
1021	#endif
1022	);
1023	}
1024	else /* mmx _not supported - Use modified C routine */
1025	#endif /* PNG_MMX_CODE_SUPPORTED */
1026	{
1027	register png_uint_32 i;
1028	png_uint_32 initial_val = BPP3 * png_pass_start[png_ptr->pass];
1029	/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
1030	register int stride = BPP3 * png_pass_inc[png_ptr->pass];
1031	/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
1032	register int rep_bytes = BPP3 * png_pass_width[png_ptr->pass];
1033	/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
1034	png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
1035	int diff = (int) (png_ptr->width & 7); /* amount lost */
1036	register png_uint_32 final_val = BPP3 * len; /* GRR bugfix */
1037
1038	srcptr = png_ptr->row_buf + 1 + initial_val;
1039	dstptr = row + initial_val;
1040
1041	for (i = initial_val; i < final_val; i += stride)
1042	{
1043	png_memcpy(dstptr, srcptr, rep_bytes);
1044	srcptr += stride;
1045	dstptr += stride;
1046	}
1047	if (diff) /* number of leftover pixels: 3 for pngtest */
1048	{
1049	final_val+=diff*BPP3;
1050	for (; i < final_val; i += stride)
1051	{
1052	if (rep_bytes > (int)(final_val-i))
1053	rep_bytes = (int)(final_val-i);
1054	png_memcpy(dstptr, srcptr, rep_bytes);
1055	srcptr += stride;
1056	dstptr += stride;
1057	}
1058	}
1059	} /* end of else (_mmx_supported) */
1060
1061	break;
1062	} /* end 24 bpp */
1063
1064	case 32: /* png_ptr->row_info.pixel_depth */
1065	{
1066	png_bytep srcptr;
1067	png_bytep dstptr;
1068
1069	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
1070	#if !defined(PNG_1_0_X)
1071	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
1072	/* && _mmx_supported */ )
1073	#else
1074	if (_mmx_supported)
1075	#endif
1076	{
1077	png_uint_32 len;
1078	int diff;
1079	int dummy_value_a; // fix 'forbidden register spilled' error
1080	int dummy_value_d;
1081	int dummy_value_c;
1082	int dummy_value_S;
1083	int dummy_value_D;
1084	_unmask = ~mask; // global variable for -fPIC version
1085	srcptr = png_ptr->row_buf + 1;
1086	dstptr = row;
1087	len = png_ptr->width &~7; // reduce to multiple of 8
1088	diff = (int) (png_ptr->width & 7); // amount lost //
1089
1090	__asm__ __volatile__ (
1091	"movd _unmask, %%mm7 \n\t" // load bit pattern
1092	"psubb %%mm6, %%mm6 \n\t" // zero mm6
1093	"punpcklbw %%mm7, %%mm7 \n\t"
1094	"punpcklwd %%mm7, %%mm7 \n\t"
1095	"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
1096
1097	"movq _mask32_0, %%mm0 \n\t"
1098	"movq _mask32_1, %%mm1 \n\t"
1099	"movq _mask32_2, %%mm2 \n\t"
1100	"movq _mask32_3, %%mm3 \n\t"
1101
1102	"pand %%mm7, %%mm0 \n\t"
1103	"pand %%mm7, %%mm1 \n\t"
1104	"pand %%mm7, %%mm2 \n\t"
1105	"pand %%mm7, %%mm3 \n\t"
1106
1107	"pcmpeqb %%mm6, %%mm0 \n\t"
1108	"pcmpeqb %%mm6, %%mm1 \n\t"
1109	"pcmpeqb %%mm6, %%mm2 \n\t"
1110	"pcmpeqb %%mm6, %%mm3 \n\t"
1111
1112	// preload "movl len, %%ecx \n\t" // load length of line
1113	// preload "movl srcptr, %%esi \n\t" // load source
1114	// preload "movl dstptr, %%edi \n\t" // load dest
1115
1116	"cmpl $0, %%ecx \n\t" // lcr
1117	"jz mainloop32end \n\t"
1118
1119	"mainloop32: \n\t"
1120	"movq (%%esi), %%mm4 \n\t"
1121	"pand %%mm0, %%mm4 \n\t"
1122	"movq %%mm0, %%mm6 \n\t"
1123	"movq (%%edi), %%mm7 \n\t"
1124	"pandn %%mm7, %%mm6 \n\t"
1125	"por %%mm6, %%mm4 \n\t"
1126	"movq %%mm4, (%%edi) \n\t"
1127
1128	"movq 8(%%esi), %%mm5 \n\t"
1129	"pand %%mm1, %%mm5 \n\t"
1130	"movq %%mm1, %%mm7 \n\t"
1131	"movq 8(%%edi), %%mm6 \n\t"
1132	"pandn %%mm6, %%mm7 \n\t"
1133	"por %%mm7, %%mm5 \n\t"
1134	"movq %%mm5, 8(%%edi) \n\t"
1135
1136	"movq 16(%%esi), %%mm6 \n\t"
1137	"pand %%mm2, %%mm6 \n\t"
1138	"movq %%mm2, %%mm4 \n\t"
1139	"movq 16(%%edi), %%mm7 \n\t"
1140	"pandn %%mm7, %%mm4 \n\t"
1141	"por %%mm4, %%mm6 \n\t"
1142	"movq %%mm6, 16(%%edi) \n\t"
1143
1144	"movq 24(%%esi), %%mm7 \n\t"
1145	"pand %%mm3, %%mm7 \n\t"
1146	"movq %%mm3, %%mm5 \n\t"
1147	"movq 24(%%edi), %%mm4 \n\t"
1148	"pandn %%mm4, %%mm5 \n\t"
1149	"por %%mm5, %%mm7 \n\t"
1150	"movq %%mm7, 24(%%edi) \n\t"
1151
1152	"addl $32, %%esi \n\t" // inc by 32 bytes processed
1153	"addl $32, %%edi \n\t"
1154	"subl $8, %%ecx \n\t" // dec by 8 pixels processed
1155	"ja mainloop32 \n\t"
1156
1157	"mainloop32end: \n\t"
1158	// preload "movl diff, %%ecx \n\t" // (diff is in eax)
1159	"movl %%eax, %%ecx \n\t"
1160	"cmpl $0, %%ecx \n\t"
1161	"jz end32 \n\t"
1162	// preload "movl mask, %%edx \n\t"
1163	"sall $24, %%edx \n\t" // low byte => high byte
1164
1165	"secondloop32: \n\t"
1166	"sall %%edx \n\t" // move high bit to CF
1167	"jnc skip32 \n\t" // if CF = 0
1168	"movl (%%esi), %%eax \n\t"
1169	"movl %%eax, (%%edi) \n\t"
1170
1171	"skip32: \n\t"
1172	"addl $4, %%esi \n\t"
1173	"addl $4, %%edi \n\t"
1174	"decl %%ecx \n\t"
1175	"jnz secondloop32 \n\t"
1176
1177	"end32: \n\t"
1178	"EMMS \n\t" // DONE
1179
1180	: "=a" (dummy_value_a), // output regs (dummy)
1181	"=d" (dummy_value_d),
1182	"=c" (dummy_value_c),
1183	"=S" (dummy_value_S),
1184	"=D" (dummy_value_D)
1185
1186	: "3" (srcptr), // esi // input regs
1187	"4" (dstptr), // edi
1188	"0" (diff), // eax
1189	// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
1190	"2" (len), // ecx
1191	"1" (mask) // edx
1192
1193	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
1194	: "%mm0", "%mm1", "%mm2", "%mm3" // clobber list
1195	, "%mm4", "%mm5", "%mm6", "%mm7"
1196	#endif
1197	);
1198	}
1199	else /* mmx _not supported - Use modified C routine */
1200	#endif /* PNG_MMX_CODE_SUPPORTED */
1201	{
1202	register png_uint_32 i;
1203	png_uint_32 initial_val = BPP4 * png_pass_start[png_ptr->pass];
1204	/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
1205	register int stride = BPP4 * png_pass_inc[png_ptr->pass];
1206	/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
1207	register int rep_bytes = BPP4 * png_pass_width[png_ptr->pass];
1208	/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
1209	png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
1210	int diff = (int) (png_ptr->width & 7); /* amount lost */
1211	register png_uint_32 final_val = BPP4 * len; /* GRR bugfix */
1212
1213	srcptr = png_ptr->row_buf + 1 + initial_val;
1214	dstptr = row + initial_val;
1215
1216	for (i = initial_val; i < final_val; i += stride)
1217	{
1218	png_memcpy(dstptr, srcptr, rep_bytes);
1219	srcptr += stride;
1220	dstptr += stride;
1221	}
1222	if (diff) /* number of leftover pixels: 3 for pngtest */
1223	{
1224	final_val+=diff*BPP4;
1225	for (; i < final_val; i += stride)
1226	{
1227	if (rep_bytes > (int)(final_val-i))
1228	rep_bytes = (int)(final_val-i);
1229	png_memcpy(dstptr, srcptr, rep_bytes);
1230	srcptr += stride;
1231	dstptr += stride;
1232	}
1233	}
1234	} /* end of else (_mmx_supported) */
1235
1236	break;
1237	} /* end 32 bpp */
1238
1239	case 48: /* png_ptr->row_info.pixel_depth */
1240	{
1241	png_bytep srcptr;
1242	png_bytep dstptr;
1243
1244	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
1245	#if !defined(PNG_1_0_X)
1246	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
1247	/* && _mmx_supported */ )
1248	#else
1249	if (_mmx_supported)
1250	#endif
1251	{
1252	png_uint_32 len;
1253	int diff;
1254	int dummy_value_a; // fix 'forbidden register spilled' error
1255	int dummy_value_d;
1256	int dummy_value_c;
1257	int dummy_value_S;
1258	int dummy_value_D;
1259	_unmask = ~mask; // global variable for -fPIC version
1260	srcptr = png_ptr->row_buf + 1;
1261	dstptr = row;
1262	len = png_ptr->width &~7; // reduce to multiple of 8
1263	diff = (int) (png_ptr->width & 7); // amount lost //
1264
1265	__asm__ __volatile__ (
1266	"movd _unmask, %%mm7 \n\t" // load bit pattern
1267	"psubb %%mm6, %%mm6 \n\t" // zero mm6
1268	"punpcklbw %%mm7, %%mm7 \n\t"
1269	"punpcklwd %%mm7, %%mm7 \n\t"
1270	"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
1271
1272	"movq _mask48_0, %%mm0 \n\t"
1273	"movq _mask48_1, %%mm1 \n\t"
1274	"movq _mask48_2, %%mm2 \n\t"
1275	"movq _mask48_3, %%mm3 \n\t"
1276	"movq _mask48_4, %%mm4 \n\t"
1277	"movq _mask48_5, %%mm5 \n\t"
1278
1279	"pand %%mm7, %%mm0 \n\t"
1280	"pand %%mm7, %%mm1 \n\t"
1281	"pand %%mm7, %%mm2 \n\t"
1282	"pand %%mm7, %%mm3 \n\t"
1283	"pand %%mm7, %%mm4 \n\t"
1284	"pand %%mm7, %%mm5 \n\t"
1285
1286	"pcmpeqb %%mm6, %%mm0 \n\t"
1287	"pcmpeqb %%mm6, %%mm1 \n\t"
1288	"pcmpeqb %%mm6, %%mm2 \n\t"
1289	"pcmpeqb %%mm6, %%mm3 \n\t"
1290	"pcmpeqb %%mm6, %%mm4 \n\t"
1291	"pcmpeqb %%mm6, %%mm5 \n\t"
1292
1293	// preload "movl len, %%ecx \n\t" // load length of line
1294	// preload "movl srcptr, %%esi \n\t" // load source
1295	// preload "movl dstptr, %%edi \n\t" // load dest
1296
1297	"cmpl $0, %%ecx \n\t"
1298	"jz mainloop48end \n\t"
1299
1300	"mainloop48: \n\t"
1301	"movq (%%esi), %%mm7 \n\t"
1302	"pand %%mm0, %%mm7 \n\t"
1303	"movq %%mm0, %%mm6 \n\t"
1304	"pandn (%%edi), %%mm6 \n\t"
1305	"por %%mm6, %%mm7 \n\t"
1306	"movq %%mm7, (%%edi) \n\t"
1307
1308	"movq 8(%%esi), %%mm6 \n\t"
1309	"pand %%mm1, %%mm6 \n\t"
1310	"movq %%mm1, %%mm7 \n\t"
1311	"pandn 8(%%edi), %%mm7 \n\t"
1312	"por %%mm7, %%mm6 \n\t"
1313	"movq %%mm6, 8(%%edi) \n\t"
1314
1315	"movq 16(%%esi), %%mm6 \n\t"
1316	"pand %%mm2, %%mm6 \n\t"
1317	"movq %%mm2, %%mm7 \n\t"
1318	"pandn 16(%%edi), %%mm7 \n\t"
1319	"por %%mm7, %%mm6 \n\t"
1320	"movq %%mm6, 16(%%edi) \n\t"
1321
1322	"movq 24(%%esi), %%mm7 \n\t"
1323	"pand %%mm3, %%mm7 \n\t"
1324	"movq %%mm3, %%mm6 \n\t"
1325	"pandn 24(%%edi), %%mm6 \n\t"
1326	"por %%mm6, %%mm7 \n\t"
1327	"movq %%mm7, 24(%%edi) \n\t"
1328
1329	"movq 32(%%esi), %%mm6 \n\t"
1330	"pand %%mm4, %%mm6 \n\t"
1331	"movq %%mm4, %%mm7 \n\t"
1332	"pandn 32(%%edi), %%mm7 \n\t"
1333	"por %%mm7, %%mm6 \n\t"
1334	"movq %%mm6, 32(%%edi) \n\t"
1335
1336	"movq 40(%%esi), %%mm7 \n\t"
1337	"pand %%mm5, %%mm7 \n\t"
1338	"movq %%mm5, %%mm6 \n\t"
1339	"pandn 40(%%edi), %%mm6 \n\t"
1340	"por %%mm6, %%mm7 \n\t"
1341	"movq %%mm7, 40(%%edi) \n\t"
1342
1343	"addl $48, %%esi \n\t" // inc by 48 bytes processed
1344	"addl $48, %%edi \n\t"
1345	"subl $8, %%ecx \n\t" // dec by 8 pixels processed
1346
1347	"ja mainloop48 \n\t"
1348
1349	"mainloop48end: \n\t"
1350	// preload "movl diff, %%ecx \n\t" // (diff is in eax)
1351	"movl %%eax, %%ecx \n\t"
1352	"cmpl $0, %%ecx \n\t"
1353	"jz end48 \n\t"
1354	// preload "movl mask, %%edx \n\t"
1355	"sall $24, %%edx \n\t" // low byte => high byte
1356
1357	"secondloop48: \n\t"
1358	"sall %%edx \n\t" // move high bit to CF
1359	"jnc skip48 \n\t" // if CF = 0
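1360	"movl (%%esi), %%eax \n\t" // copy all 6 bytes of the pixel
1361	"movl %%eax, (%%edi) \n\t"
	"movw 4(%%esi), %%ax \n\t"
	"movw %%ax, 4(%%edi) \n\t"
1362
1363	"skip48: \n\t"
1364	"addl $6, %%esi \n\t" // advance by one 6-byte pixel
1365	"addl $6, %%edi \n\t"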
1366	"decl %%ecx \n\t"
1367	"jnz secondloop48 \n\t"
1368
1369	"end48: \n\t"
1370	"EMMS \n\t" // DONE
1371
1372	: "=a" (dummy_value_a), // output regs (dummy)
1373	"=d" (dummy_value_d),
1374	"=c" (dummy_value_c),
1375	"=S" (dummy_value_S),
1376	"=D" (dummy_value_D)
1377
1378	: "3" (srcptr), // esi // input regs
1379	"4" (dstptr), // edi
1380	"0" (diff), // eax
1381	// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
1382	"2" (len), // ecx
1383	"1" (mask) // edx
1384
1385	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
1386	: "%mm0", "%mm1", "%mm2", "%mm3" // clobber list
1387	, "%mm4", "%mm5", "%mm6", "%mm7"
1388	#endif
1389	);
1390	}
1391	else /* mmx _not supported - Use modified C routine */
1392	#endif /* PNG_MMX_CODE_SUPPORTED */
1393	{
1394	register png_uint_32 i;
1395	png_uint_32 initial_val = BPP6 * png_pass_start[png_ptr->pass];
1396	/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
1397	register int stride = BPP6 * png_pass_inc[png_ptr->pass];
1398	/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
1399	register int rep_bytes = BPP6 * png_pass_width[png_ptr->pass];
1400	/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
1401	png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
1402	int diff = (int) (png_ptr->width & 7); /* amount lost */
1403	register png_uint_32 final_val = BPP6 * len; /* GRR bugfix */
1404
1405	srcptr = png_ptr->row_buf + 1 + initial_val;
1406	dstptr = row + initial_val;
1407
1408	for (i = initial_val; i < final_val; i += stride)
1409	{
1410	png_memcpy(dstptr, srcptr, rep_bytes);
1411	srcptr += stride;
1412	dstptr += stride;
1413	}
1414	if (diff) /* number of leftover pixels: 3 for pngtest */
1415	{
1416	final_val+=diff*BPP6;
1417	for (; i < final_val; i += stride)
1418	{
1419	if (rep_bytes > (int)(final_val-i))
1420	rep_bytes = (int)(final_val-i);
1421	png_memcpy(dstptr, srcptr, rep_bytes);
1422	srcptr += stride;
1423	dstptr += stride;
1424	}
1425	}
1426	} /* end of else (_mmx_supported) */
1427
1428	break;
1429	} /* end 48 bpp */
1430
1431	case 64: /* png_ptr->row_info.pixel_depth */
1432	{
1433	png_bytep srcptr;
1434	png_bytep dstptr;
1435	register png_uint_32 i;
1436	png_uint_32 initial_val = BPP8 * png_pass_start[png_ptr->pass];
1437	/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
1438	register int stride = BPP8 * png_pass_inc[png_ptr->pass];
1439	/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
1440	register int rep_bytes = BPP8 * png_pass_width[png_ptr->pass];
1441	/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
1442	png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
1443	int diff = (int) (png_ptr->width & 7); /* amount lost */
1444	register png_uint_32 final_val = BPP8 * len; /* GRR bugfix */
1445
1446	srcptr = png_ptr->row_buf + 1 + initial_val;
1447	dstptr = row + initial_val;
1448
1449	for (i = initial_val; i < final_val; i += stride)
1450	{
1451	png_memcpy(dstptr, srcptr, rep_bytes);
1452	srcptr += stride;
1453	dstptr += stride;
1454	}
1455	if (diff) /* number of leftover pixels: 3 for pngtest */
1456	{
1457	final_val+=diff*BPP8;
1458	for (; i < final_val; i += stride)
1459	{
1460	if (rep_bytes > (int)(final_val-i))
1461	rep_bytes = (int)(final_val-i);
1462	png_memcpy(dstptr, srcptr, rep_bytes);
1463	srcptr += stride;
1464	dstptr += stride;
1465	}
1466	}
1467
1468	break;
1469	} /* end 64 bpp */
1470
1471	default: /* png_ptr->row_info.pixel_depth != 1,2,4,8,16,24,32,48,64 */
1472	{
1473	/* this should never happen */
1474	png_warning(png_ptr, "Invalid row_info.pixel_depth in pnggccrd");
1475	break;
1476	}
1477	} /* end switch (png_ptr->row_info.pixel_depth) */
1478
1479	} /* end if (non-trivial mask) */
1480
1481	} /* end png_combine_row() */
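/* Illustrative sketch only, not part of libpng: a portable C rendering of the
 * masked blend that the MMX loops above perform for 24-bit pixels. The name
 * combine_row_24() is hypothetical. It assumes that bit 7 of the mask governs
 * the first pixel of each group of eight, which is the order the
 * leftover-pixel loop (secondloop24) uses; the _maskNN_N constants are
 * defined outside this section and are assumed to follow the same order.
 */

```c
#include <stddef.h>
#include <stdint.h>

/* Blend one row of 3-byte pixels: a set mask bit copies the source pixel,
 * a clear bit keeps the destination pixel. */
static void combine_row_24(uint8_t *dst, const uint8_t *src,
                           size_t width, uint8_t mask)
{
    size_t x;

    for (x = 0; x < width; x++)
    {
        /* 0xFF when this pixel is selected, 0x00 otherwise; the assembler
         * builds the same per-byte selector with pand + pcmpeqb */
        uint8_t sel = (uint8_t)((mask & (0x80u >> (x & 7))) ? 0xFF : 0x00);
        size_t b;

        for (b = 0; b < 3; b++)
        {
            size_t k = 3 * x + b;

            /* pand (source), pandn (destination), por (merge) */
            dst[k] = (uint8_t)((src[k] & sel) | (dst[k] & (uint8_t)~sel));
        }
    }
}
```

/* With mask 0xA0 (bits 7 and 5 set) and a 3-pixel row, pixels 0 and 2 are
 * taken from src and pixel 1 is left untouched in dst.
 */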
1482
1483	#endif /* PNG_HAVE_MMX_COMBINE_ROW */
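/* Illustrative sketch only, not part of libpng: the non-MMX fallback used for
 * each byte-aligned depth above, folded into one loop. The table values are
 * the ones quoted in the comments above; combine_row_c() and its parameters
 * are hypothetical names. The original runs a first loop over the part of
 * the row that is a multiple of 8 pixels and a second, clamped loop over the
 * leftover pixels; clamping on every iteration should give the same result.
 */

```c
#include <stddef.h>
#include <string.h>

static const int pass_start[7] = {0, 4, 0, 2, 0, 1, 0};
static const int pass_inc[7]   = {8, 8, 4, 4, 2, 2, 1};
static const int pass_width[7] = {8, 4, 4, 2, 2, 1, 1};

/* Copy the pixel runs that interlace pass `pass` contributes to a row of
 * `width` pixels, each `bpp` bytes wide. */
static void combine_row_c(unsigned char *dst, const unsigned char *src,
                          size_t width, size_t bpp, int pass)
{
    size_t row_bytes = width * bpp;
    size_t stride    = bpp * (size_t)pass_inc[pass];
    size_t rep_bytes = bpp * (size_t)pass_width[pass];
    size_t i;

    for (i = bpp * (size_t)pass_start[pass]; i < row_bytes; i += stride)
    {
        size_t n = rep_bytes;

        if (n > row_bytes - i)      /* do not run past the end of the row */
            n = row_bytes - i;
        memcpy(dst + i, src + i, n);
    }
}
```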
1484
1485
1486
1487
1488	/*===========================================================================*/
1489	/* */
1490	/* P N G _ D O _ R E A D _ I N T E R L A C E */
1491	/* */
1492	/*===========================================================================*/
1493
1494	#if defined(PNG_READ_INTERLACING_SUPPORTED)
1495	#if defined(PNG_HAVE_MMX_READ_INTERLACE)
1496
1497	/* png_do_read_interlace() is called after any 16-bit to 8-bit conversion
1498	* has taken place. [GRR: what other steps come before and/or after?]
1499	*/
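/* Illustrative sketch only, not part of libpng: the in-place expansion that
 * png_do_read_interlace() performs for pixels of one or more whole bytes.
 * Each source pixel is replicated `inc` times (png_pass_inc[pass] in the
 * real routine), working from the last pixel back to the first so that no
 * source pixel is overwritten before it has been read. The name
 * expand_row_in_place() is hypothetical; the sketch assumes pixel_bytes <= 8
 * and a row buffer large enough for width * inc pixels.
 */

```c
#include <stddef.h>
#include <string.h>

static void expand_row_in_place(unsigned char *row, size_t width,
                                size_t pixel_bytes, size_t inc)
{
    size_t i;

    for (i = width; i > 0; i--)
    {
        unsigned char v[8];
        size_t first = (i - 1) * inc;   /* first output slot of this pixel */
        size_t j;

        /* save the pixel first: its output slots may overlap its source */
        memcpy(v, row + (i - 1) * pixel_bytes, pixel_bytes);
        for (j = inc; j > 0; j--)
            memcpy(row + (first + j - 1) * pixel_bytes, v, pixel_bytes);
    }
}
```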
1500
1501	void /* PRIVATE */
1502	png_do_read_interlace(png_structp png_ptr)
1503	{
1504	png_row_infop row_info = &(png_ptr->row_info);
1505	png_bytep row = png_ptr->row_buf + 1;
1506	int pass = png_ptr->pass;
1507	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
1508	png_uint_32 transformations = png_ptr->transformations;
1509	#endif
1510
1511	png_debug(1, "in png_do_read_interlace (pnggccrd.c)\n");
1512
1513	#if defined(PNG_MMX_CODE_SUPPORTED)
1514	if (_mmx_supported == 2) {
1515	#if !defined(PNG_1_0_X)
1516	/* this should have happened in png_init_mmx_flags() already */
1517	png_warning(png_ptr, "asm_flags may not have been initialized");
1518	#endif
1519	png_mmx_support();
1520	}
1521	#endif
1522
1523	if (row != NULL && row_info != NULL)
1524	{
1525	png_uint_32 final_width;
1526
1527	final_width = row_info->width * png_pass_inc[pass];
1528
1529	switch (row_info->pixel_depth)
1530	{
1531	case 1:
1532	{
1533	png_bytep sp, dp;
1534	int sshift, dshift;
1535	int s_start, s_end, s_inc;
1536	png_byte v;
1537	png_uint_32 i;
1538	int j;
1539
1540	sp = row + (png_size_t)((row_info->width - 1) >> 3);
1541	dp = row + (png_size_t)((final_width - 1) >> 3);
1542	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
1543	if (transformations & PNG_PACKSWAP)
1544	{
1545	sshift = (int)((row_info->width + 7) & 7);
1546	dshift = (int)((final_width + 7) & 7);
1547	s_start = 7;
1548	s_end = 0;
1549	s_inc = -1;
1550	}
1551	else
1552	#endif
1553	{
1554	sshift = 7 - (int)((row_info->width + 7) & 7);
1555	dshift = 7 - (int)((final_width + 7) & 7);
1556	s_start = 0;
1557	s_end = 7;
1558	s_inc = 1;
1559	}
1560
1561	for (i = row_info->width; i; i--)
1562	{
1563	v = (png_byte)((*sp >> sshift) & 0x1);
1564	for (j = 0; j < png_pass_inc[pass]; j++)
1565	{
1566	*dp &= (png_byte)((0x7f7f >> (7 - dshift)) & 0xff);
1567	*dp |= (png_byte)(v << dshift);
1568	if (dshift == s_end)
1569	{
1570	dshift = s_start;
1571	dp--;
1572	}
1573	else
1574	dshift += s_inc;
1575	}
1576	if (sshift == s_end)
1577	{
1578	sshift = s_start;
1579	sp--;
1580	}
1581	else
1582	sshift += s_inc;
1583	}
1584	break;
1585	}
1586
1587	case 2:
1588	{
1589	png_bytep sp, dp;
1590	int sshift, dshift;
1591	int s_start, s_end, s_inc;
1592	png_uint_32 i;
1593
1594	sp = row + (png_size_t)((row_info->width - 1) >> 2);
1595	dp = row + (png_size_t)((final_width - 1) >> 2);
1596	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
1597	if (transformations & PNG_PACKSWAP)
1598	{
1599	sshift = (png_size_t)(((row_info->width + 3) & 3) << 1);
1600	dshift = (png_size_t)(((final_width + 3) & 3) << 1);
1601	s_start = 6;
1602	s_end = 0;
1603	s_inc = -2;
1604	}
1605	else
1606	#endif
1607	{
1608	sshift = (png_size_t)((3 - ((row_info->width + 3) & 3)) << 1);
1609	dshift = (png_size_t)((3 - ((final_width + 3) & 3)) << 1);
1610	s_start = 0;
1611	s_end = 6;
1612	s_inc = 2;
1613	}
1614
1615	for (i = row_info->width; i; i--)
1616	{
1617	png_byte v;
1618	int j;
1619
1620	v = (png_byte)((*sp >> sshift) & 0x3);
1621	for (j = 0; j < png_pass_inc[pass]; j++)
1622	{
1623	*dp &= (png_byte)((0x3f3f >> (6 - dshift)) & 0xff);
1624	*dp |= (png_byte)(v << dshift);
1625	if (dshift == s_end)
1626	{
1627	dshift = s_start;
1628	dp--;
1629	}
1630	else
1631	dshift += s_inc;
1632	}
1633	if (sshift == s_end)
1634	{
1635	sshift = s_start;
1636	sp--;
1637	}
1638	else
1639	sshift += s_inc;
1640	}
1641	break;
1642	}
1643
1644	case 4:
1645	{
1646	png_bytep sp, dp;
1647	int sshift, dshift;
1648	int s_start, s_end, s_inc;
1649	png_uint_32 i;
1650
1651	sp = row + (png_size_t)((row_info->width - 1) >> 1);
1652	dp = row + (png_size_t)((final_width - 1) >> 1);
1653	#if defined(PNG_READ_PACKSWAP_SUPPORTED)
1654	if (transformations & PNG_PACKSWAP)
1655	{
1656	sshift = (png_size_t)(((row_info->width + 1) & 1) << 2);
1657	dshift = (png_size_t)(((final_width + 1) & 1) << 2);
1658	s_start = 4;
1659	s_end = 0;
1660	s_inc = -4;
1661	}
1662	else
1663	#endif
1664	{
1665	sshift = (png_size_t)((1 - ((row_info->width + 1) & 1)) << 2);
1666	dshift = (png_size_t)((1 - ((final_width + 1) & 1)) << 2);
1667	s_start = 0;
1668	s_end = 4;
1669	s_inc = 4;
1670	}
1671
1672	for (i = row_info->width; i; i--)
1673	{
1674	png_byte v;
1675	int j;
1676
1677	v = (png_byte)((*sp >> sshift) & 0xf);
1678	for (j = 0; j < png_pass_inc[pass]; j++)
1679	{
1680	*dp &= (png_byte)((0xf0f >> (4 - dshift)) & 0xff);
1681	*dp |= (png_byte)(v << dshift);
1682	if (dshift == s_end)
1683	{
1684	dshift = s_start;
1685	dp--;
1686	}
1687	else
1688	dshift += s_inc;
1689	}
1690	if (sshift == s_end)
1691	{
1692	sshift = s_start;
1693	sp--;
1694	}
1695	else
1696	sshift += s_inc;
1697	}
1698	break;
1699	}
1700
1701	/*====================================================================*/
1702
1703	default: /* 8-bit or larger (this is where the routine is modified) */
1704	{
1705	#if 0
1706	// static unsigned long long _const4 = 0x0000000000FFFFFFLL; no good
1707	// static unsigned long long const4 = 0x0000000000FFFFFFLL; no good
1708	// unsigned long long _const4 = 0x0000000000FFFFFFLL; no good
1709	// unsigned long long const4 = 0x0000000000FFFFFFLL; no good
1710	#endif
1711	png_bytep sptr, dp;
1712	png_uint_32 i;
1713	png_size_t pixel_bytes;
1714	int width = (int)row_info->width;
1715
1716	pixel_bytes = (row_info->pixel_depth >> 3);
1717
1718	/* point sptr at the last pixel in the pre-expanded row: */
1719	sptr = row + (width - 1) * pixel_bytes;
1720
1721	/* point dp at the last pixel position in the expanded row: */
1722	dp = row + (final_width - 1) * pixel_bytes;
1723
1724	/* New code by Nirav Chhatrapati - Intel Corporation */
1725
1726	#if defined(PNG_MMX_CODE_SUPPORTED)
1727	#if !defined(PNG_1_0_X)
1728	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_INTERLACE)
1729	/* && _mmx_supported */ )
1730	#else
1731	if (_mmx_supported)
1732	#endif
1733	{
1734	//--------------------------------------------------------------
1735	if (pixel_bytes == 3)
1736	{
1737	if (((pass == 0) || (pass == 1)) && width)
1738	{
1739	int dummy_value_c; // fix 'forbidden register spilled'
1740	int dummy_value_S;
1741	int dummy_value_D;
1742	int dummy_value_a;
1743
1744	__asm__ __volatile__ (
1745	"subl $21, %%edi \n\t"
1746	// (png_pass_inc[pass] - 1)*pixel_bytes
1747
1748	".loop3_pass0: \n\t"
1749	"movd (%%esi), %%mm0 \n\t" // x x x x x 2 1 0
1750	"pand (%3), %%mm0 \n\t" // z z z z z 2 1 0
1751	"movq %%mm0, %%mm1 \n\t" // z z z z z 2 1 0
1752	"psllq $16, %%mm0 \n\t" // z z z 2 1 0 z z
1753	"movq %%mm0, %%mm2 \n\t" // z z z 2 1 0 z z
1754	"psllq $24, %%mm0 \n\t" // 2 1 0 z z z z z
1755	"psrlq $8, %%mm1 \n\t" // z z z z z z 2 1
1756	"por %%mm2, %%mm0 \n\t" // 2 1 0 2 1 0 z z
1757	"por %%mm1, %%mm0 \n\t" // 2 1 0 2 1 0 2 1
1758	"movq %%mm0, %%mm3 \n\t" // 2 1 0 2 1 0 2 1
1759	"psllq $16, %%mm0 \n\t" // 0 2 1 0 2 1 z z
1760	"movq %%mm3, %%mm4 \n\t" // 2 1 0 2 1 0 2 1
1761	"punpckhdq %%mm0, %%mm3 \n\t" // 0 2 1 0 2 1 0 2
1762	"movq %%mm4, 16(%%edi) \n\t"
1763	"psrlq $32, %%mm0 \n\t" // z z z z 0 2 1 0
1764	"movq %%mm3, 8(%%edi) \n\t"
1765	"punpckldq %%mm4, %%mm0 \n\t" // 1 0 2 1 0 2 1 0
1766	"subl $3, %%esi \n\t"
1767	"movq %%mm0, (%%edi) \n\t"
1768	"subl $24, %%edi \n\t"
1769	"decl %%ecx \n\t"
1770	"jnz .loop3_pass0 \n\t"
1771	"EMMS \n\t" // DONE
1772
1773	: "=c" (dummy_value_c), // output regs (dummy)
1774	"=S" (dummy_value_S),
1775	"=D" (dummy_value_D),
1776	"=a" (dummy_value_a)
1777
1778
1779	: "1" (sptr), // esi // input regs
1780	"2" (dp), // edi
1781	"0" (width), // ecx
1782	"3" (&_const4) // %1(?) (0x0000000000FFFFFFLL)
1783
1784	#if 0 /* %mm0, ..., %mm4 not supported by gcc 2.7.2.3 or egcs 1.1 */
1785	: "%mm0", "%mm1", "%mm2" // clobber list
1786	, "%mm3", "%mm4"
1787	#endif
1788	);
1789	}
1790	else if (((pass == 2) || (pass == 3)) && width)
1791	{
1792	int dummy_value_c; // fix 'forbidden register spilled'
1793	int dummy_value_S;
1794	int dummy_value_D;
1795	int dummy_value_a;
1796
1797	__asm__ __volatile__ (
1798	"subl $9, %%edi \n\t"
1799	// (png_pass_inc[pass] - 1)*pixel_bytes
1800
1801	".loop3_pass2: \n\t"
1802	"movd (%%esi), %%mm0 \n\t" // x x x x x 2 1 0
1803	"pand (%3), %%mm0 \n\t" // z z z z z 2 1 0
1804	"movq %%mm0, %%mm1 \n\t" // z z z z z 2 1 0
1805	"psllq $16, %%mm0 \n\t" // z z z 2 1 0 z z
1806	"movq %%mm0, %%mm2 \n\t" // z z z 2 1 0 z z
1807	"psllq $24, %%mm0 \n\t" // 2 1 0 z z z z z
1808	"psrlq $8, %%mm1 \n\t" // z z z z z z 2 1
1809	"por %%mm2, %%mm0 \n\t" // 2 1 0 2 1 0 z z
1810	"por %%mm1, %%mm0 \n\t" // 2 1 0 2 1 0 2 1
1811	"movq %%mm0, 4(%%edi) \n\t"
1812	"psrlq $16, %%mm0 \n\t" // z z 2 1 0 2 1 0
1813	"subl $3, %%esi \n\t"
1814	"movd %%mm0, (%%edi) \n\t"
1815	"subl $12, %%edi \n\t"
1816	"decl %%ecx \n\t"
1817	"jnz .loop3_pass2 \n\t"
1818	"EMMS \n\t" // DONE
1819
1820	: "=c" (dummy_value_c), // output regs (dummy)
1821	"=S" (dummy_value_S),
1822	"=D" (dummy_value_D),
1823	"=a" (dummy_value_a)
1824
1825	: "1" (sptr), // esi // input regs
1826	"2" (dp), // edi
1827	"0" (width), // ecx
1828	"3" (&_const4) // (0x0000000000FFFFFFLL)
1829
1830	#if 0 /* %mm0, ..., %mm2 not supported by gcc 2.7.2.3 or egcs 1.1 */
1831	: "%mm0", "%mm1", "%mm2" // clobber list
1832	#endif
1833	);
1834	}
1835	else if (width) /* && ((pass == 4) || (pass == 5)) */
1836	{
1837	int width_mmx = ((width >> 1) << 1) - 8; // GRR: huh?
1838	if (width_mmx < 0)
1839	width_mmx = 0;
1840	width -= width_mmx; // 8 or 9 pix, 24 or 27 bytes
1841	if (width_mmx)
1842	{
1843	// png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1};
1844	// sptr points at last pixel in pre-expanded row
1845	// dp points at last pixel position in expanded row
1846	int dummy_value_c; // fix 'forbidden register spilled'
1847	int dummy_value_S;
1848	int dummy_value_D;
1849	int dummy_value_a;
1850	int dummy_value_d;
1851
1852	__asm__ __volatile__ (
1853	"subl $3, %%esi \n\t"
1854	"subl $9, %%edi \n\t"
1855	// (png_pass_inc[pass] + 1)*pixel_bytes
1856
1857	".loop3_pass4: \n\t"
1858	"movq (%%esi), %%mm0 \n\t" // x x 5 4 3 2 1 0
1859	"movq %%mm0, %%mm1 \n\t" // x x 5 4 3 2 1 0
1860	"movq %%mm0, %%mm2 \n\t" // x x 5 4 3 2 1 0
1861	"psllq $24, %%mm0 \n\t" // 4 3 2 1 0 z z z
1862	"pand (%3), %%mm1 \n\t" // z z z z z 2 1 0
1863	"psrlq $24, %%mm2 \n\t" // z z z x x 5 4 3
1864	"por %%mm1, %%mm0 \n\t" // 4 3 2 1 0 2 1 0
1865	"movq %%mm2, %%mm3 \n\t" // z z z x x 5 4 3
1866	"psllq $8, %%mm2 \n\t" // z z x x 5 4 3 z
1867	"movq %%mm0, (%%edi) \n\t"
1868	"psrlq $16, %%mm3 \n\t" // z z z z z x x 5
1869	"pand (%4), %%mm3 \n\t" // z z z z z z z 5
1870	"por %%mm3, %%mm2 \n\t" // z z x x 5 4 3 5
1871	"subl $6, %%esi \n\t"
1872	"movd %%mm2, 8(%%edi) \n\t"
1873	"subl $12, %%edi \n\t"
1874	"subl $2, %%ecx \n\t"
1875	"jnz .loop3_pass4 \n\t"
1876	"EMMS \n\t" // DONE
1877
1878	: "=c" (dummy_value_c), // output regs (dummy)
1879	"=S" (dummy_value_S),
1880	"=D" (dummy_value_D),
1881	"=a" (dummy_value_a),
1882	"=d" (dummy_value_d)
1883
1884	: "1" (sptr), // esi // input regs
1885	"2" (dp), // edi
1886	"0" (width_mmx), // ecx
1887	"3" (&_const4), // 0x0000000000FFFFFFLL
1888	"4" (&_const6) // 0x00000000000000FFLL
1889
1890	#if 0 /* %mm0, ..., %mm3 not supported by gcc 2.7.2.3 or egcs 1.1 */
1891	: "%mm0", "%mm1" // clobber list
1892	, "%mm2", "%mm3"
1893	#endif
1894	);
1895	}
1896
1897	sptr -= width_mmx*3;
1898	dp -= width_mmx*6;
1899	for (i = width; i; i--)
1900	{
1901	png_byte v[8];
1902	int j;
1903
1904	png_memcpy(v, sptr, 3);
1905	for (j = 0; j < png_pass_inc[pass]; j++)
1906	{
1907	png_memcpy(dp, v, 3);
1908	dp -= 3;
1909	}
1910	sptr -= 3;
1911	}
1912	}
1913	} /* end of pixel_bytes == 3 */
1914
1915	//--------------------------------------------------------------
1916	else if (pixel_bytes == 1)
1917	{
1918	if (((pass == 0) || (pass == 1)) && width)
1919	{
1920	int width_mmx = ((width >> 2) << 2);
1921	width -= width_mmx; // 0-3 pixels => 0-3 bytes
1922	if (width_mmx)
1923	{
1924	int dummy_value_c; // fix 'forbidden register spilled'
1925	int dummy_value_S;
1926	int dummy_value_D;
1927
1928	__asm__ __volatile__ (
1929	"subl $3, %%esi \n\t"
1930	"subl $31, %%edi \n\t"
1931
1932	".loop1_pass0: \n\t"
1933	"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
1934	"movq %%mm0, %%mm1 \n\t" // x x x x 3 2 1 0
1935	"punpcklbw %%mm0, %%mm0 \n\t" // 3 3 2 2 1 1 0 0
1936	"movq %%mm0, %%mm2 \n\t" // 3 3 2 2 1 1 0 0
1937	"punpcklwd %%mm0, %%mm0 \n\t" // 1 1 1 1 0 0 0 0
1938	"movq %%mm0, %%mm3 \n\t" // 1 1 1 1 0 0 0 0
1939	"punpckldq %%mm0, %%mm0 \n\t" // 0 0 0 0 0 0 0 0
1940	"punpckhdq %%mm3, %%mm3 \n\t" // 1 1 1 1 1 1 1 1
1941	"movq %%mm0, (%%edi) \n\t"
1942	"punpckhwd %%mm2, %%mm2 \n\t" // 3 3 3 3 2 2 2 2
1943	"movq %%mm3, 8(%%edi) \n\t"
1944	"movq %%mm2, %%mm4 \n\t" // 3 3 3 3 2 2 2 2
1945	"punpckldq %%mm2, %%mm2 \n\t" // 2 2 2 2 2 2 2 2
1946	"punpckhdq %%mm4, %%mm4 \n\t" // 3 3 3 3 3 3 3 3
1947	"movq %%mm2, 16(%%edi) \n\t"
1948	"subl $4, %%esi \n\t"
1949	"movq %%mm4, 24(%%edi) \n\t"
1950	"subl $32, %%edi \n\t"
1951	"subl $4, %%ecx \n\t"
1952	"jnz .loop1_pass0 \n\t"
1953	"EMMS \n\t" // DONE
1954
1955	: "=c" (dummy_value_c), // output regs (dummy)
1956	"=S" (dummy_value_S),
1957	"=D" (dummy_value_D)
1958
1959	: "1" (sptr), // esi // input regs
1960	"2" (dp), // edi
1961	"0" (width_mmx) // ecx
1962
1963	#if 0 /* %mm0, ..., %mm4 not supported by gcc 2.7.2.3 or egcs 1.1 */
1964	: "%mm0", "%mm1", "%mm2" // clobber list
1965	, "%mm3", "%mm4"
1966	#endif
1967	);
1968	}
1969
1970	sptr -= width_mmx;
1971	dp -= width_mmx*8;
1972	for (i = width; i; i--)
1973	{
1974	int j;
1975
1976	/* I simplified this part in version 1.0.4e
1977	* here and in several other instances where
1978	* pixel_bytes == 1 -- GR-P
1979	*
1980	* Original code:
1981	*
1982	* png_byte v[8];
1983	* png_memcpy(v, sptr, pixel_bytes);
1984	* for (j = 0; j < png_pass_inc[pass]; j++)
1985	* {
1986	* png_memcpy(dp, v, pixel_bytes);
1987	* dp -= pixel_bytes;
1988	* }
1989	* sptr -= pixel_bytes;
1990	*
1991	* Replacement code is in the next three lines:
1992	*/
1993
1994	for (j = 0; j < png_pass_inc[pass]; j++)
1995	{
1996	*dp-- = *sptr;
1997	}
1998	--sptr;
1999	}
2000	}
2001	else if (((pass == 2) || (pass == 3)) && width)
2002	{
2003	int width_mmx = ((width >> 2) << 2);
2004	width -= width_mmx; // 0-3 pixels => 0-3 bytes
2005	if (width_mmx)
2006	{
2007	int dummy_value_c; // fix 'forbidden register spilled'
2008	int dummy_value_S;
2009	int dummy_value_D;
2010
2011	__asm__ __volatile__ (
2012	"subl $3, %%esi \n\t"
2013	"subl $15, %%edi \n\t"
2014
2015	".loop1_pass2: \n\t"
2016	"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
2017	"punpcklbw %%mm0, %%mm0 \n\t" // 3 3 2 2 1 1 0 0
2018	"movq %%mm0, %%mm1 \n\t" // 3 3 2 2 1 1 0 0
2019	"punpcklwd %%mm0, %%mm0 \n\t" // 1 1 1 1 0 0 0 0
2020	"punpckhwd %%mm1, %%mm1 \n\t" // 3 3 3 3 2 2 2 2
2021	"movq %%mm0, (%%edi) \n\t"
2022	"subl $4, %%esi \n\t"
2023	"movq %%mm1, 8(%%edi) \n\t"
2024	"subl $16, %%edi \n\t"
2025	"subl $4, %%ecx \n\t"
2026	"jnz .loop1_pass2 \n\t"
2027	"EMMS \n\t" // DONE
2028
2029	: "=c" (dummy_value_c), // output regs (dummy)
2030	"=S" (dummy_value_S),
2031	"=D" (dummy_value_D)
2032
2033	: "1" (sptr), // esi // input regs
2034	"2" (dp), // edi
2035	"0" (width_mmx) // ecx
2036
2037	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2038	: "%mm0", "%mm1" // clobber list
2039	#endif
2040	);
2041	}
2042
2043	sptr -= width_mmx;
2044	dp -= width_mmx*4;
2045	for (i = width; i; i--)
2046	{
2047	int j;
2048
2049	for (j = 0; j < png_pass_inc[pass]; j++)
2050	{
2051	*dp-- = *sptr;
2052	}
2053	--sptr;
2054	}
2055	}
2056	else if (width) /* && ((pass == 4) || (pass == 5)) */
2057	{
2058	int width_mmx = ((width >> 3) << 3);
2059	width -= width_mmx; // 0-7 pixels => 0-7 bytes
2060	if (width_mmx)
2061	{
2062	int dummy_value_c; // fix 'forbidden register spilled'
2063	int dummy_value_S;
2064	int dummy_value_D;
2065
2066	__asm__ __volatile__ (
2067	"subl $7, %%esi \n\t"
2068	"subl $15, %%edi \n\t"
2069
2070	".loop1_pass4: \n\t"
2071	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2072	"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
2073	"punpcklbw %%mm0, %%mm0 \n\t" // 3 3 2 2 1 1 0 0
2074	"punpckhbw %%mm1, %%mm1 \n\t" // 7 7 6 6 5 5 4 4
2075	"movq %%mm1, 8(%%edi) \n\t"
2076	"subl $8, %%esi \n\t"
2077	"movq %%mm0, (%%edi) \n\t"
2078	"subl $16, %%edi \n\t"
2079	"subl $8, %%ecx \n\t"
2080	"jnz .loop1_pass4 \n\t"
2081	"EMMS \n\t" // DONE
2082
2083	: "=c" (dummy_value_c), // output regs (dummy)
2084	"=S" (dummy_value_S),
2085	"=D" (dummy_value_D)
2086
2087	: "1" (sptr), // esi // input regs
2088	"2" (dp), // edi
2089	"0" (width_mmx) // ecx
2090
2091	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2092	: "%mm0", "%mm1" // clobber list
2093	#endif
2094	);
2095	}
2096
2097	sptr -= width_mmx;
2098	dp -= width_mmx*2;
2099	for (i = width; i; i--)
2100	{
2101	int j;
2102
2103	for (j = 0; j < png_pass_inc[pass]; j++)
2104	{
2105	*dp-- = *sptr;
2106	}
2107	--sptr;
2108	}
2109	}
2110	} /* end of pixel_bytes == 1 */
2111
2112	//--------------------------------------------------------------
2113	else if (pixel_bytes == 2)
2114	{
2115	if (((pass == 0) || (pass == 1)) && width)
2116	{
2117	int width_mmx = ((width >> 1) << 1);
2118	width -= width_mmx; // 0,1 pixels => 0,2 bytes
2119	if (width_mmx)
2120	{
2121	int dummy_value_c; // fix 'forbidden register spilled'
2122	int dummy_value_S;
2123	int dummy_value_D;
2124
2125	__asm__ __volatile__ (
2126	"subl $2, %%esi \n\t"
2127	"subl $30, %%edi \n\t"
2128
2129	".loop2_pass0: \n\t"
2130	"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
2131	"punpcklwd %%mm0, %%mm0 \n\t" // 3 2 3 2 1 0 1 0
2132	"movq %%mm0, %%mm1 \n\t" // 3 2 3 2 1 0 1 0
2133	"punpckldq %%mm0, %%mm0 \n\t" // 1 0 1 0 1 0 1 0
2134	"punpckhdq %%mm1, %%mm1 \n\t" // 3 2 3 2 3 2 3 2
2135	"movq %%mm0, (%%edi) \n\t"
2136	"movq %%mm0, 8(%%edi) \n\t"
2137	"movq %%mm1, 16(%%edi) \n\t"
2138	"subl $4, %%esi \n\t"
2139	"movq %%mm1, 24(%%edi) \n\t"
2140	"subl $32, %%edi \n\t"
2141	"subl $2, %%ecx \n\t"
2142	"jnz .loop2_pass0 \n\t"
2143	"EMMS \n\t" // DONE
2144
2145	: "=c" (dummy_value_c), // output regs (dummy)
2146	"=S" (dummy_value_S),
2147	"=D" (dummy_value_D)
2148
2149	: "1" (sptr), // esi // input regs
2150	"2" (dp), // edi
2151	"0" (width_mmx) // ecx
2152
2153	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2154	: "%mm0", "%mm1" // clobber list
2155	#endif
2156	);
2157	}
2158
2159	sptr -= (width_mmx*2 - 2); // sign fixed
2160	dp -= (width_mmx*16 - 2); // sign fixed
2161	for (i = width; i; i--)
2162	{
2163	png_byte v[8];
2164	int j;
2165	sptr -= 2;
2166	png_memcpy(v, sptr, 2);
2167	for (j = 0; j < png_pass_inc[pass]; j++)
2168	{
2169	dp -= 2;
2170	png_memcpy(dp, v, 2);
2171	}
2172	}
2173	}
2174	else if (((pass == 2) || (pass == 3)) && width)
2175	{
2176	int width_mmx = ((width >> 1) << 1) ;
2177	width -= width_mmx; // 0,1 pixels => 0,2 bytes
2178	if (width_mmx)
2179	{
2180	int dummy_value_c; // fix 'forbidden register spilled'
2181	int dummy_value_S;
2182	int dummy_value_D;
2183
2184	__asm__ __volatile__ (
2185	"subl $2, %%esi \n\t"
2186	"subl $14, %%edi \n\t"
2187
2188	".loop2_pass2: \n\t"
2189	"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
2190	"punpcklwd %%mm0, %%mm0 \n\t" // 3 2 3 2 1 0 1 0
2191	"movq %%mm0, %%mm1 \n\t" // 3 2 3 2 1 0 1 0
2192	"punpckldq %%mm0, %%mm0 \n\t" // 1 0 1 0 1 0 1 0
2193	"punpckhdq %%mm1, %%mm1 \n\t" // 3 2 3 2 3 2 3 2
2194	"movq %%mm0, (%%edi) \n\t"
2195	"subl $4, %%esi \n\t"
2196	"movq %%mm1, 8(%%edi) \n\t"
2197	"subl $16, %%edi \n\t"
2198	"subl $2, %%ecx \n\t"
2199	"jnz .loop2_pass2 \n\t"
2200	"EMMS \n\t" // DONE
2201
2202	: "=c" (dummy_value_c), // output regs (dummy)
2203	"=S" (dummy_value_S),
2204	"=D" (dummy_value_D)
2205
2206	: "1" (sptr), // esi // input regs
2207	"2" (dp), // edi
2208	"0" (width_mmx) // ecx
2209
2210	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2211	: "%mm0", "%mm1" // clobber list
2212	#endif
2213	);
2214	}
2215
2216	sptr -= (width_mmx*2 - 2); // sign fixed
2217	dp -= (width_mmx*8 - 2); // sign fixed
2218	for (i = width; i; i--)
2219	{
2220	png_byte v[8];
2221	int j;
2222	sptr -= 2;
2223	png_memcpy(v, sptr, 2);
2224	for (j = 0; j < png_pass_inc[pass]; j++)
2225	{
2226	dp -= 2;
2227	png_memcpy(dp, v, 2);
2228	}
2229	}
2230	}
2231	else if (width) // pass == 4 or 5
2232	{
2233	int width_mmx = ((width >> 1) << 1) ;
2234	width -= width_mmx; // 0,1 pixels => 0,2 bytes
2235	if (width_mmx)
2236	{
2237	int dummy_value_c; // fix 'forbidden register spilled'
2238	int dummy_value_S;
2239	int dummy_value_D;
2240
2241	__asm__ __volatile__ (
2242	"subl $2, %%esi \n\t"
2243	"subl $6, %%edi \n\t"
2244
2245	".loop2_pass4: \n\t"
2246	"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
2247	"punpcklwd %%mm0, %%mm0 \n\t" // 3 2 3 2 1 0 1 0
2248	"subl $4, %%esi \n\t"
2249	"movq %%mm0, (%%edi) \n\t"
2250	"subl $8, %%edi \n\t"
2251	"subl $2, %%ecx \n\t"
2252	"jnz .loop2_pass4 \n\t"
2253	"EMMS \n\t" // DONE
2254
2255	: "=c" (dummy_value_c), // output regs (dummy)
2256	"=S" (dummy_value_S),
2257	"=D" (dummy_value_D)
2258
2259	: "1" (sptr), // esi // input regs
2260	"2" (dp), // edi
2261	"0" (width_mmx) // ecx
2262
2263	#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
2264	: "%mm0" // clobber list
2265	#endif
2266	);
2267	}
2268
2269	sptr -= (width_mmx*2 - 2); // sign fixed
2270	dp -= (width_mmx*4 - 2); // sign fixed
2271	for (i = width; i; i--)
2272	{
2273	png_byte v[8];
2274	int j;
2275	sptr -= 2;
2276	png_memcpy(v, sptr, 2);
2277	for (j = 0; j < png_pass_inc[pass]; j++)
2278	{
2279	dp -= 2;
2280	png_memcpy(dp, v, 2);
2281	}
2282	}
2283	}
2284	} /* end of pixel_bytes == 2 */
2285
2286	//--------------------------------------------------------------
2287	else if (pixel_bytes == 4)
2288	{
2289	if (((pass == 0) || (pass == 1)) && width)
2290	{
2291	int width_mmx = ((width >> 1) << 1);
2292	width -= width_mmx; // 0,1 pixels => 0,4 bytes
2293	if (width_mmx)
2294	{
2295	int dummy_value_c; // fix 'forbidden register spilled'
2296	int dummy_value_S;
2297	int dummy_value_D;
2298
2299	__asm__ __volatile__ (
2300	"subl $4, %%esi \n\t"
2301	"subl $60, %%edi \n\t"
2302
2303	".loop4_pass0: \n\t"
2304	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2305	"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
2306	"punpckldq %%mm0, %%mm0 \n\t" // 3 2 1 0 3 2 1 0
2307	"punpckhdq %%mm1, %%mm1 \n\t" // 7 6 5 4 7 6 5 4
2308	"movq %%mm0, (%%edi) \n\t"
2309	"movq %%mm0, 8(%%edi) \n\t"
2310	"movq %%mm0, 16(%%edi) \n\t"
2311	"movq %%mm0, 24(%%edi) \n\t"
2312	"movq %%mm1, 32(%%edi) \n\t"
2313	"movq %%mm1, 40(%%edi) \n\t"
2314	"movq %%mm1, 48(%%edi) \n\t"
2315	"subl $8, %%esi \n\t"
2316	"movq %%mm1, 56(%%edi) \n\t"
2317	"subl $64, %%edi \n\t"
2318	"subl $2, %%ecx \n\t"
2319	"jnz .loop4_pass0 \n\t"
2320	"EMMS \n\t" // DONE
2321
2322	: "=c" (dummy_value_c), // output regs (dummy)
2323	"=S" (dummy_value_S),
2324	"=D" (dummy_value_D)
2325
2326	: "1" (sptr), // esi // input regs
2327	"2" (dp), // edi
2328	"0" (width_mmx) // ecx
2329
2330	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2331	: "%mm0", "%mm1" // clobber list
2332	#endif
2333	);
2334	}
2335
2336	sptr -= (width_mmx*4 - 4); // sign fixed
2337	dp -= (width_mmx*32 - 4); // sign fixed
2338	for (i = width; i; i--)
2339	{
2340	png_byte v[8];
2341	int j;
2342	sptr -= 4;
2343	png_memcpy(v, sptr, 4);
2344	for (j = 0; j < png_pass_inc[pass]; j++)
2345	{
2346	dp -= 4;
2347	png_memcpy(dp, v, 4);
2348	}
2349	}
2350	}
2351	else if (((pass == 2) || (pass == 3)) && width)
2352	{
2353	int width_mmx = ((width >> 1) << 1);
2354	width -= width_mmx; // 0,1 pixels => 0,4 bytes
2355	if (width_mmx)
2356	{
2357	int dummy_value_c; // fix 'forbidden register spilled'
2358	int dummy_value_S;
2359	int dummy_value_D;
2360
2361	__asm__ __volatile__ (
2362	"subl $4, %%esi \n\t"
2363	"subl $28, %%edi \n\t"
2364
2365	".loop4_pass2: \n\t"
2366	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2367	"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
2368	"punpckldq %%mm0, %%mm0 \n\t" // 3 2 1 0 3 2 1 0
2369	"punpckhdq %%mm1, %%mm1 \n\t" // 7 6 5 4 7 6 5 4
2370	"movq %%mm0, (%%edi) \n\t"
2371	"movq %%mm0, 8(%%edi) \n\t"
2372	"movq %%mm1, 16(%%edi) \n\t"
2373	"movq %%mm1, 24(%%edi) \n\t"
2374	"subl $8, %%esi \n\t"
2375	"subl $32, %%edi \n\t"
2376	"subl $2, %%ecx \n\t"
2377	"jnz .loop4_pass2 \n\t"
2378	"EMMS \n\t" // DONE
2379
2380	: "=c" (dummy_value_c), // output regs (dummy)
2381	"=S" (dummy_value_S),
2382	"=D" (dummy_value_D)
2383
2384	: "1" (sptr), // esi // input regs
2385	"2" (dp), // edi
2386	"0" (width_mmx) // ecx
2387
2388	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2389	: "%mm0", "%mm1" // clobber list
2390	#endif
2391	);
2392	}
2393
2394	sptr -= (width_mmx*4 - 4); // sign fixed
2395	dp -= (width_mmx*16 - 4); // sign fixed
2396	for (i = width; i; i--)
2397	{
2398	png_byte v[8];
2399	int j;
2400	sptr -= 4;
2401	png_memcpy(v, sptr, 4);
2402	for (j = 0; j < png_pass_inc[pass]; j++)
2403	{
2404	dp -= 4;
2405	png_memcpy(dp, v, 4);
2406	}
2407	}
2408	}
2409	else if (width) // pass == 4 or 5
2410	{
2411	int width_mmx = ((width >> 1) << 1) ;
2412	width -= width_mmx; // 0,1 pixels => 0,4 bytes
2413	if (width_mmx)
2414	{
2415	int dummy_value_c; // fix 'forbidden register spilled'
2416	int dummy_value_S;
2417	int dummy_value_D;
2418
2419	__asm__ __volatile__ (
2420	"subl $4, %%esi \n\t"
2421	"subl $12, %%edi \n\t"
2422
2423	".loop4_pass4: \n\t"
2424	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2425	"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
2426	"punpckldq %%mm0, %%mm0 \n\t" // 3 2 1 0 3 2 1 0
2427	"punpckhdq %%mm1, %%mm1 \n\t" // 7 6 5 4 7 6 5 4
2428	"movq %%mm0, (%%edi) \n\t"
2429	"subl $8, %%esi \n\t"
2430	"movq %%mm1, 8(%%edi) \n\t"
2431	"subl $16, %%edi \n\t"
2432	"subl $2, %%ecx \n\t"
2433	"jnz .loop4_pass4 \n\t"
2434	"EMMS \n\t" // DONE
2435
2436	: "=c" (dummy_value_c), // output regs (dummy)
2437	"=S" (dummy_value_S),
2438	"=D" (dummy_value_D)
2439
2440	: "1" (sptr), // esi // input regs
2441	"2" (dp), // edi
2442	"0" (width_mmx) // ecx
2443
2444	#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
2445	: "%mm0", "%mm1" // clobber list
2446	#endif
2447	);
2448	}
2449
2450	sptr -= (width_mmx*4 - 4); // sign fixed
2451	dp -= (width_mmx*8 - 4); // sign fixed
2452	for (i = width; i; i--)
2453	{
2454	png_byte v[8];
2455	int j;
2456	sptr -= 4;
2457	png_memcpy(v, sptr, 4);
2458	for (j = 0; j < png_pass_inc[pass]; j++)
2459	{
2460	dp -= 4;
2461	png_memcpy(dp, v, 4);
2462	}
2463	}
2464	}
2465	} /* end of pixel_bytes == 4 */
2466
2467	//--------------------------------------------------------------
2468	else if (pixel_bytes == 8)
2469	{
2470	// GRR TEST: should work, but needs testing (special 64-bit version of rpng2?)
2471	// GRR NOTE: no need to combine passes here!
2472	if (((pass == 0) || (pass == 1)) && width)
2473	{
2474	int dummy_value_c; // fix 'forbidden register spilled'
2475	int dummy_value_S;
2476	int dummy_value_D;
2477
2478	// source is 8-byte RRGGBBAA
2479	// dest is 64-byte RRGGBBAA RRGGBBAA RRGGBBAA RRGGBBAA ...
2480	__asm__ __volatile__ (
2481	"subl $56, %%edi \n\t" // start of last block
2482
2483	".loop8_pass0: \n\t"
2484	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2485	"movq %%mm0, (%%edi) \n\t"
2486	"movq %%mm0, 8(%%edi) \n\t"
2487	"movq %%mm0, 16(%%edi) \n\t"
2488	"movq %%mm0, 24(%%edi) \n\t"
2489	"movq %%mm0, 32(%%edi) \n\t"
2490	"movq %%mm0, 40(%%edi) \n\t"
2491	"movq %%mm0, 48(%%edi) \n\t"
2492	"subl $8, %%esi \n\t"
2493	"movq %%mm0, 56(%%edi) \n\t"
2494	"subl $64, %%edi \n\t"
2495	"decl %%ecx \n\t"
2496	"jnz .loop8_pass0 \n\t"
2497	"EMMS \n\t" // DONE
2498
2499	: "=c" (dummy_value_c), // output regs (dummy)
2500	"=S" (dummy_value_S),
2501	"=D" (dummy_value_D)
2502
2503	: "1" (sptr), // esi // input regs
2504	"2" (dp), // edi
2505	"0" (width) // ecx
2506
2507	#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
2508	: "%mm0" // clobber list
2509	#endif
2510	);
2511	}
2512	else if (((pass == 2) || (pass == 3)) && width)
2513	{
2514	// source is 8-byte RRGGBBAA
2515	// dest is 32-byte RRGGBBAA RRGGBBAA RRGGBBAA RRGGBBAA
2516	// (recall that expansion is _in place_: sptr and dp
2517	// both point at locations within same row buffer)
2518	{
2519	int dummy_value_c; // fix 'forbidden register spilled'
2520	int dummy_value_S;
2521	int dummy_value_D;
2522
2523	__asm__ __volatile__ (
2524	"subl $24, %%edi \n\t" // start of last block
2525
2526	".loop8_pass2: \n\t"
2527	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2528	"movq %%mm0, (%%edi) \n\t"
2529	"movq %%mm0, 8(%%edi) \n\t"
2530	"movq %%mm0, 16(%%edi) \n\t"
2531	"subl $8, %%esi \n\t"
2532	"movq %%mm0, 24(%%edi) \n\t"
2533	"subl $32, %%edi \n\t"
2534	"decl %%ecx \n\t"
2535	"jnz .loop8_pass2 \n\t"
2536	"EMMS \n\t" // DONE
2537
2538	: "=c" (dummy_value_c), // output regs (dummy)
2539	"=S" (dummy_value_S),
2540	"=D" (dummy_value_D)
2541
2542	: "1" (sptr), // esi // input regs
2543	"2" (dp), // edi
2544	"0" (width) // ecx
2545
2546	#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
2547	: "%mm0" // clobber list
2548	#endif
2549	);
2550	}
2551	}
2552	else if (width) // pass == 4 or 5
2553	{
2554	// source is 8-byte RRGGBBAA
2555	// dest is 16-byte RRGGBBAA RRGGBBAA
2556	{
2557	int dummy_value_c; // fix 'forbidden register spilled'
2558	int dummy_value_S;
2559	int dummy_value_D;
2560
2561	__asm__ __volatile__ (
2562	"subl $8, %%edi \n\t" // start of last block
2563
2564	".loop8_pass4: \n\t"
2565	"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
2566	"movq %%mm0, (%%edi) \n\t"
2567	"subl $8, %%esi \n\t"
2568	"movq %%mm0, 8(%%edi) \n\t"
2569	"subl $16, %%edi \n\t"
2570	"decl %%ecx \n\t"
2571	"jnz .loop8_pass4 \n\t"
2572	"EMMS \n\t" // DONE
2573
2574	: "=c" (dummy_value_c), // output regs (dummy)
2575	"=S" (dummy_value_S),
2576	"=D" (dummy_value_D)
2577
2578	: "1" (sptr), // esi // input regs
2579	"2" (dp), // edi
2580	"0" (width) // ecx
2581
2582	#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
2583	: "%mm0" // clobber list
2584	#endif
2585	);
2586	}
2587	}
2588
2589	} /* end of pixel_bytes == 8 */
2590
2591	//--------------------------------------------------------------
2592	else if (pixel_bytes == 6)
2593	{
2594	for (i = width; i; i--)
2595	{
2596	png_byte v[8];
2597	int j;
2598	png_memcpy(v, sptr, 6);
2599	for (j = 0; j < png_pass_inc[pass]; j++)
2600	{
2601	png_memcpy(dp, v, 6);
2602	dp -= 6;
2603	}
2604	sptr -= 6;
2605	}
2606	} /* end of pixel_bytes == 6 */
2607
2608	//--------------------------------------------------------------
2609	else
2610	{
2611	for (i = width; i; i--)
2612	{
2613	png_byte v[8];
2614	int j;
2615	png_memcpy(v, sptr, pixel_bytes);
2616	for (j = 0; j < png_pass_inc[pass]; j++)
2617	{
2618	png_memcpy(dp, v, pixel_bytes);
2619	dp -= pixel_bytes;
2620	}
2621	sptr-= pixel_bytes;
2622	}
2623	}
2624	} // end of _mmx_supported ========================================
2625
2626	else /* MMX not supported: use modified C code - takes advantage
2627	* of inlining of png_memcpy for a constant */
2628	/* GRR 19991007: does it? or should pixel_bytes in each
2629	* block be replaced with immediate value (e.g., 1)? */
2630	/* GRR 19991017: replaced with constants in each case */
2631	#endif /* PNG_MMX_CODE_SUPPORTED */
2632	{
2633	if (pixel_bytes == 1)
2634	{
2635	for (i = width; i; i--)
2636	{
2637	int j;
2638	for (j = 0; j < png_pass_inc[pass]; j++)
2639	{
2640	*dp-- = *sptr;
2641	}
2642	--sptr;
2643	}
2644	}
2645	else if (pixel_bytes == 3)
2646	{
2647	for (i = width; i; i--)
2648	{
2649	png_byte v[8];
2650	int j;
2651	png_memcpy(v, sptr, 3);
2652	for (j = 0; j < png_pass_inc[pass]; j++)
2653	{
2654	png_memcpy(dp, v, 3);
2655	dp -= 3;
2656	}
2657	sptr -= 3;
2658	}
2659	}
2660	else if (pixel_bytes == 2)
2661	{
2662	for (i = width; i; i--)
2663	{
2664	png_byte v[8];
2665	int j;
2666	png_memcpy(v, sptr, 2);
2667	for (j = 0; j < png_pass_inc[pass]; j++)
2668	{
2669	png_memcpy(dp, v, 2);
2670	dp -= 2;
2671	}
2672	sptr -= 2;
2673	}
2674	}
2675	else if (pixel_bytes == 4)
2676	{
2677	for (i = width; i; i--)
2678	{
2679	png_byte v[8];
2680	int j;
2681	png_memcpy(v, sptr, 4);
2682	for (j = 0; j < png_pass_inc[pass]; j++)
2683	{
2684	#ifdef PNG_DEBUG
2685	if (dp < row || dp+3 > row+png_ptr->row_buf_size)
2686	{
2687	printf("dp out of bounds: row=%p, dp=%p, rp=%p\n",
2688	(void *)row, (void *)dp, (void *)(row+png_ptr->row_buf_size));
2689	printf("row_buf=%lu\n",(unsigned long)png_ptr->row_buf_size);
2690	}
2691	#endif
2692	png_memcpy(dp, v, 4);
2693	dp -= 4;
2694	}
2695	sptr -= 4;
2696	}
2697	}
2698	else if (pixel_bytes == 6)
2699	{
2700	for (i = width; i; i--)
2701	{
2702	png_byte v[8];
2703	int j;
2704	png_memcpy(v, sptr, 6);
2705	for (j = 0; j < png_pass_inc[pass]; j++)
2706	{
2707	png_memcpy(dp, v, 6);
2708	dp -= 6;
2709	}
2710	sptr -= 6;
2711	}
2712	}
2713	else if (pixel_bytes == 8)
2714	{
2715	for (i = width; i; i--)
2716	{
2717	png_byte v[8];
2718	int j;
2719	png_memcpy(v, sptr, 8);
2720	for (j = 0; j < png_pass_inc[pass]; j++)
2721	{
2722	png_memcpy(dp, v, 8);
2723	dp -= 8;
2724	}
2725	sptr -= 8;
2726	}
2727	}
2728	else /* GRR: should never be reached */
2729	{
2730	for (i = width; i; i--)
2731	{
2732	png_byte v[8];
2733	int j;
2734	png_memcpy(v, sptr, pixel_bytes);
2735	for (j = 0; j < png_pass_inc[pass]; j++)
2736	{
2737	png_memcpy(dp, v, pixel_bytes);
2738	dp -= pixel_bytes;
2739	}
2740	sptr -= pixel_bytes;
2741	}
2742	}
2743
2744	} /* end if (MMX not supported) */
2745	break;
2746	}
2747	} /* end switch (row_info->pixel_depth) */
2748
2749	row_info->width = final_width;
2750
2751	row_info->rowbytes = PNG_ROWBYTES(row_info->pixel_depth,final_width);
2752	}
2753
2754	} /* end png_do_read_interlace() */
2755
2756	#endif /* PNG_HAVE_MMX_READ_INTERLACE */
2757	#endif /* PNG_READ_INTERLACING_SUPPORTED */
2758
2759
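/* Illustration only, not part of libpng: a minimal portable sketch of the
 * in-place pixel replication that png_do_read_interlace() performs for
 * byte-aligned pixels. The helper name expand_row_in_place and its
 * parameters are hypothetical; reps plays the role of png_pass_inc[pass].
 * It assumes 1 <= pixel_bytes <= 8 and that the row buffer already has room
 * for width * reps * pixel_bytes bytes. Unlike the code above, it uses array
 * indices instead of decrementing sptr and dp. */

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replicate each pixel `reps` times, in place. */
static void
expand_row_in_place(unsigned char *row, size_t width,
                    size_t pixel_bytes, size_t reps)
{
   size_t i, j;

   /* Walk from the last pixel to the first so that no source pixel is
    * overwritten before it has been read. */
   for (i = width; i-- > 0; )
   {
      unsigned char v[8];                        /* copy of source pixel */
      unsigned char *dst = row + i * reps * pixel_bytes;

      memcpy(v, row + i * pixel_bytes, pixel_bytes);
      for (j = 0; j < reps; j++)
         memcpy(dst + j * pixel_bytes, v, pixel_bytes);
   }
}
```

/* Copying each pixel into v first matters for pixel 0, whose destination
 * block starts at its own source address. */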
2760
2761	#if defined(PNG_HAVE_MMX_READ_FILTER_ROW)
2762	#if defined(PNG_MMX_CODE_SUPPORTED)
2763
2764	// These variables are utilized in the functions below. They are declared
2765	// globally here to ensure alignment on 8-byte boundaries.
2766
2767	union uAll {
2768	long long use;
2769	double align;
2770	} _LBCarryMask = {0x0101010101010101LL},
2771	_HBClearMask = {0x7f7f7f7f7f7f7f7fLL},
2772	_ActiveMask, _ActiveMask2, _ActiveMaskEnd, _ShiftBpp, _ShiftRem;
2773
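/* Illustration only, not part of libpng: a sketch in plain C of what the two
 * initialized masks above are for. It averages eight packed bytes at once,
 * rounding down, without letting bits leak between neighbouring bytes. The
 * macro and function names (LB_CARRY_MASK, HB_CLEAR_MASK, avg_floor_bytes)
 * are hypothetical; only the constant values come from the source. The MMX
 * routines below then add such a result to Avg(x) with paddb, which this
 * sketch does not model. */

```c
#include <stdint.h>

/* Same values as _LBCarryMask and _HBClearMask in the source. */
#define LB_CARRY_MASK UINT64_C(0x0101010101010101)
#define HB_CLEAR_MASK UINT64_C(0x7f7f7f7f7f7f7f7f)

/* Per-byte floor((a + b) / 2) on eight packed bytes. */
static uint64_t
avg_floor_bytes(uint64_t a, uint64_t b)
{
   /* A 64-bit shift moves bit 0 of each byte into bit 7 of the byte
    * below it; the high-bit-clear mask discards that stray bit. */
   uint64_t half_a = (a >> 1) & HB_CLEAR_MASK;
   uint64_t half_b = (b >> 1) & HB_CLEAR_MASK;

   /* When both low bits are set, the two halves together lose 1. */
   uint64_t carry = a & b & LB_CARRY_MASK;

   /* Each byte sums to at most 127 + 127 + 1 = 255, so nothing overflows
    * into the next byte. */
   return half_a + half_b + carry;
}
```

/* For example, averaging bytes 0x05 and 0x04 gives 0x04, and averaging
 * 0xff with 0x01 gives 0x80. */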
2774	#ifdef PNG_THREAD_UNSAFE_OK
2775	//===========================================================================//
2776	// //
2777	// P N G _ R E A D _ F I L T E R _ R O W _ M M X _ A V G //
2778	// //
2779	//===========================================================================//
2780
2781	// Optimized code for PNG Average filter decoder
2782
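/* Illustration only, not part of libpng: a scalar reference sketch of the
 * Average defilter that the MMX routine below is optimizing, written from
 * the formulas quoted in its comments:
 *   Raw(x) = Avg(x) + Prior(x)/2                       for x <  bpp
 *   Raw(x) = Avg(x) + (Raw(x-bpp) + Prior(x))/2        for x >= bpp
 * The function name avg_defilter_reference is hypothetical. Additions wrap
 * modulo 256, as they do with byte-sized registers. */

```c
#include <stddef.h>

/* Hypothetical reference: undo the Average filter on one row, in place. */
static void
avg_defilter_reference(unsigned char *row, const unsigned char *prev_row,
                       size_t rowbytes, size_t bpp)
{
   size_t x;

   /* First pixel: there is no Raw(x-bpp), so only Prior(x)/2 is added. */
   for (x = 0; x < rowbytes && x < bpp; x++)
      row[x] = (unsigned char)(row[x] + (prev_row[x] >> 1));

   /* Remaining bytes: the sum is formed in int, so the 9-bit intermediate
    * is not truncated before the division by 2. */
   for (; x < rowbytes; x++)
      row[x] = (unsigned char)(row[x] + ((row[x - bpp] + prev_row[x]) >> 1));
}
```

/* A routine like this can serve as a cross-check when testing the
 * assembler version on the same input rows. */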
2783	static void /* PRIVATE */
2784	png_read_filter_row_mmx_avg(png_row_infop row_info, png_bytep row,
2785	png_bytep prev_row)
2786	{
2787	int bpp;
2788	int dummy_value_c; // fix 'forbidden register 2 (cx) was spilled' error
2789	int dummy_value_S;
2790	int dummy_value_D;
2791
2792	bpp = (row_info->pixel_depth + 7) >> 3; // get # bytes per pixel
2793	_FullLength = row_info->rowbytes; // # of bytes to filter
2794
2795	__asm__ __volatile__ (
2796	// initialize address pointers and offset
2797	#ifdef __PIC__
2798	"pushl %%ebx \n\t" // save index to Global Offset Table
2799	#endif
2800	//pre "movl row, %%edi \n\t" // edi: Avg(x)
2801	"xorl %%ebx, %%ebx \n\t" // ebx: x
2802	"movl %%edi, %%edx \n\t"
2803	//pre "movl prev_row, %%esi \n\t" // esi: Prior(x)
2804	//pre "subl bpp, %%edx \n\t" // (bpp is preloaded into ecx)
2805	"subl %%ecx, %%edx \n\t" // edx: Raw(x-bpp)
2806
2807	"xorl %%eax,%%eax \n\t"
2808
2809	// Compute the Raw value for the first bpp bytes
2810	// Raw(x) = Avg(x) + (Prior(x)/2)
2811	"avg_rlp: \n\t"
2812	"movb (%%esi,%%ebx,),%%al \n\t" // load al with Prior(x)
2813	"incl %%ebx \n\t"
2814	"shrb %%al \n\t" // divide by 2
2815	"addb -1(%%edi,%%ebx,),%%al \n\t" // add Avg(x); -1 to offset inc ebx
2816	//pre "cmpl bpp, %%ebx \n\t" // (bpp is preloaded into ecx)
2817	"cmpl %%ecx, %%ebx \n\t"
2818	"movb %%al,-1(%%edi,%%ebx,) \n\t" // write Raw(x); -1 to offset inc ebx
2819	"jb avg_rlp \n\t" // mov does not affect flags
2820
2821	// get # of bytes to alignment
2822	"movl %%edi, _dif \n\t" // take start of row
2823	"addl %%ebx, _dif \n\t" // add bpp
2824	"addl $0xf, _dif \n\t" // add 7+8 to incr past alignment bdry
2825	"andl $0xfffffff8, _dif \n\t" // mask to alignment boundary
2826	"subl %%edi, _dif \n\t" // subtract from start => value ebx at
2827	"jz avg_go \n\t" // alignment
2828
2829	// fix alignment
2830	// Compute the Raw value for the bytes up to the alignment boundary
2831	// Raw(x) = Avg(x) + ((Raw(x-bpp) + Prior(x))/2)
2832	"xorl %%ecx, %%ecx \n\t"
2833
2834	"avg_lp1: \n\t"
2835	"xorl %%eax, %%eax \n\t"
2836	"movb (%%esi,%%ebx,), %%cl \n\t" // load cl with Prior(x)
2837	"movb (%%edx,%%ebx,), %%al \n\t" // load al with Raw(x-bpp)
2838	"addw %%cx, %%ax \n\t"
2839	"incl %%ebx \n\t"
2840	"shrw %%ax \n\t" // divide by 2
2841	"addb -1(%%edi,%%ebx,), %%al \n\t" // add Avg(x); -1 to offset inc ebx
2842	"cmpl _dif, %%ebx \n\t" // check if at alignment boundary
2843	"movb %%al, -1(%%edi,%%ebx,) \n\t" // write Raw(x); -1 to offset inc ebx
2844	"jb avg_lp1 \n\t" // repeat until at alignment boundary
2845
2846	"avg_go: \n\t"
2847	"movl _FullLength, %%eax \n\t"
2848	"movl %%eax, %%ecx \n\t"
2849	"subl %%ebx, %%eax \n\t" // subtract alignment fix
2850	"andl $0x00000007, %%eax \n\t" // calc bytes over mult of 8
2851	"subl %%eax, %%ecx \n\t" // drop over bytes from original length
2852	"movl %%ecx, _MMXLength \n\t"
2853	#ifdef __PIC__
2854	"popl %%ebx \n\t" // restore index to Global Offset Table
2855	#endif
2856
2857	: "=c" (dummy_value_c), // output regs (dummy)
2858	"=S" (dummy_value_S),
2859	"=D" (dummy_value_D)
2860
2861	: "0" (bpp), // ecx // input regs
2862	"1" (prev_row), // esi
2863	"2" (row) // edi
2864
2865	: "%eax", "%edx" // clobber list
2866	#ifndef __PIC__
2867	, "%ebx"
2868	#endif
2869	// GRR: INCLUDE "memory" as clobbered? (_dif, _MMXLength)
2870	// (seems to work fine without...)
2871	);
2872
2873	// now do the math for the rest of the row
2874	switch (bpp)
2875	{
2876	case 3:
2877	{
2878	_ActiveMask.use = 0x0000000000ffffffLL;
2879	_ShiftBpp.use = 24; // == 3 * 8
2880	_ShiftRem.use = 40; // == 64 - 24
2881
2882	__asm__ __volatile__ (
2883	// re-init address pointers and offset
2884	"movq _ActiveMask, %%mm7 \n\t"
2885	"movl _dif, %%ecx \n\t" // ecx: x = offset to
2886	"movq _LBCarryMask, %%mm5 \n\t" // alignment boundary
2887	// preload "movl row, %%edi \n\t" // edi: Avg(x)
2888	"movq _HBClearMask, %%mm4 \n\t"
2889	// preload "movl prev_row, %%esi \n\t" // esi: Prior(x)
2890
2891	// prime the pump: load the first Raw(x-bpp) data set
2892	"movq -8(%%edi,%%ecx,), %%mm2 \n\t" // load previous aligned 8 bytes
2893	// (correct pos. in loop below)
2894	"avg_3lp: \n\t"
2895	"movq (%%edi,%%ecx,), %%mm0 \n\t" // load mm0 with Avg(x)
2896	"movq %%mm5, %%mm3 \n\t"
2897	"psrlq _ShiftRem, %%mm2 \n\t" // correct position Raw(x-bpp)
2898	// data
2899	"movq (%%esi,%%ecx,), %%mm1 \n\t" // load mm1 with Prior(x)
2900	"movq %%mm7, %%mm6 \n\t"
2901	"pand %%mm1, %%mm3 \n\t" // get lsb for each prev_row byte
2902	"psrlq $1, %%mm1 \n\t" // divide prev_row bytes by 2
2903	"pand %%mm4, %%mm1 \n\t" // clear invalid bit 7 of each
2904	// byte
2905	"paddb %%mm1, %%mm0 \n\t" // add (Prev_row/2) to Avg for
2906	// each byte
2907	// add 1st active group (Raw(x-bpp)/2) to average with LBCarry
2908	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
2909	// LBCarrys
2910	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
2911	// where both
2912	// lsb's were == 1 (only valid for active group)
2913	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
2914	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
2915	// byte
2916	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
2917	// for each byte
2918	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 1
2919	// bytes to add to Avg
2920	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
2921	// Avg for each Active
2922	// byte
2923	// add 2nd active group (Raw(x-bpp)/2) to average with _LBCarry
2924	"psllq _ShiftBpp, %%mm6 \n\t" // shift the mm6 mask to cover
2925	// bytes 3-5
2926	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
2927	"psllq _ShiftBpp, %%mm2 \n\t" // shift data to pos. correctly
2928	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
2929	// LBCarrys
2930	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
2931	// where both
2932	// lsb's were == 1 (only valid for active group)
2933	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
2934	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
2935	// byte
2936	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
2937	// for each byte
2938	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 2
2939	// bytes to add to Avg
2940	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
2941	// Avg for each Active
2942	// byte
2943
2944	// add 3rd active group (Raw(x-bpp)/2) to average with _LBCarry
2945	"psllq _ShiftBpp, %%mm6 \n\t" // shift mm6 mask to cover last
2946	// two
2947	// bytes
2948	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
2949	"psllq _ShiftBpp, %%mm2 \n\t" // shift data to pos. correctly
2950	// Data only needs to be shifted once here to
2951	// get the correct x-bpp offset.
2952	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
2953	// LBCarrys
2954	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
2955	// where both
2956	// lsb's were == 1 (only valid for active group)
2957	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
2958	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
2959	// byte
2960	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
2961	// for each byte
2962	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 3
2963	// bytes to add to Avg
2964	"addl $8, %%ecx \n\t"
2965	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
2966	// Avg for each Active
2967	// byte
2968	// now ready to write back to memory
2969	"movq %%mm0, -8(%%edi,%%ecx,) \n\t"
2970	// move updated Raw(x) to use as Raw(x-bpp) for next loop
2971	"cmpl _MMXLength, %%ecx \n\t"
2972	"movq %%mm0, %%mm2 \n\t" // mov updated Raw(x) to mm2
2973	"jb avg_3lp \n\t"
2974
2975	: "=S" (dummy_value_S), // output regs (dummy)
2976	"=D" (dummy_value_D)
2977
2978	: "0" (prev_row), // esi // input regs
2979	"1" (row) // edi
2980
2981	: "%ecx" // clobber list
2982	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
2983	, "%mm0", "%mm1", "%mm2", "%mm3"
2984	, "%mm4", "%mm5", "%mm6", "%mm7"
2985	#endif
2986	);
2987	}
2988	break; // end 3 bpp
2989
2990	case 6:
2991	case 4:
2992	//case 7: // who wrote this? PNG doesn't support 5 or 7 bytes/pixel
2993	//case 5: // GRR BOGUS
2994	{
2995	_ActiveMask.use = 0xffffffffffffffffLL; // use shift below to clear
2996	// appropriate inactive bytes
2997	_ShiftBpp.use = bpp << 3;
2998	_ShiftRem.use = 64 - _ShiftBpp.use;
2999
3000	__asm__ __volatile__ (
3001	"movq _HBClearMask, %%mm4 \n\t"
3002
3003	// re-init address pointers and offset
3004	"movl _dif, %%ecx \n\t" // ecx: x = offset to
3005	// alignment boundary
3006
3007	// load _ActiveMask and clear all bytes except for 1st active group
3008	"movq _ActiveMask, %%mm7 \n\t"
3009	// preload "movl row, %%edi \n\t" // edi: Avg(x)
3010	"psrlq _ShiftRem, %%mm7 \n\t"
3011	// preload "movl prev_row, %%esi \n\t" // esi: Prior(x)
3012	"movq %%mm7, %%mm6 \n\t"
3013	"movq _LBCarryMask, %%mm5 \n\t"
3014	"psllq _ShiftBpp, %%mm6 \n\t" // create mask for 2nd active
3015	// group
3016
3017	// prime the pump: load the first Raw(x-bpp) data set
3018	"movq -8(%%edi,%%ecx,), %%mm2 \n\t" // load previous aligned 8 bytes
3019	// (we correct pos. in loop below)
3020	"avg_4lp: \n\t"
3021	"movq (%%edi,%%ecx,), %%mm0 \n\t"
3022	"psrlq _ShiftRem, %%mm2 \n\t" // shift data to pos. correctly
3023	"movq (%%esi,%%ecx,), %%mm1 \n\t"
3024	// add (Prev_row/2) to average
3025	"movq %%mm5, %%mm3 \n\t"
3026	"pand %%mm1, %%mm3 \n\t" // get lsb for each prev_row byte
3027	"psrlq $1, %%mm1 \n\t" // divide prev_row bytes by 2
3028	"pand %%mm4, %%mm1 \n\t" // clear invalid bit 7 of each
3029	// byte
3030	"paddb %%mm1, %%mm0 \n\t" // add (Prev_row/2) to Avg for
3031	// each byte
3032	// add 1st active group (Raw(x-bpp)/2) to average with _LBCarry
3033	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
3034	// LBCarrys
3035	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
3036	// where both
3037	// lsb's were == 1 (only valid for active group)
3038	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3039	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3040	// byte
3041	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
3042	// for each byte
3043	"pand %%mm7, %%mm2 \n\t" // leave only Active Group 1
3044	// bytes to add to Avg
3045	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to Avg
3046	// for each Active
3047	// byte
3048	// add 2nd active group (Raw(x-bpp)/2) to average with _LBCarry
3049	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
3050	"psllq _ShiftBpp, %%mm2 \n\t" // shift data to pos. correctly
3051	"addl $8, %%ecx \n\t"
3052	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
3053	// LBCarrys
3054	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
3055	// where both
3056	// lsb's were == 1 (only valid for active group)
3057	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3058	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3059	// byte
3060	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
3061	// for each byte
3062	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 2
3063	// bytes to add to Avg
3064	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
3065	// Avg for each Active
3066	// byte
3067	"cmpl _MMXLength, %%ecx \n\t"
3068	// now ready to write back to memory
3069	"movq %%mm0, -8(%%edi,%%ecx,) \n\t"
3070	// prep Raw(x-bpp) for next loop
3071	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
3072	"jb avg_4lp \n\t"
3073
3074	: "=S" (dummy_value_S), // output regs (dummy)
3075	"=D" (dummy_value_D)
3076
3077	: "0" (prev_row), // esi // input regs
3078	"1" (row) // edi
3079
3080	: "%ecx" // clobber list
3081	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
3082	, "%mm0", "%mm1", "%mm2", "%mm3"
3083	, "%mm4", "%mm5", "%mm6", "%mm7"
3084	#endif
3085	);
3086	}
3087	break; // end 4,6 bpp
3088
3089	case 2:
3090	{
3091	_ActiveMask.use = 0x000000000000ffffLL;
3092	_ShiftBpp.use = 16; // == 2 * 8
3093	_ShiftRem.use = 48; // == 64 - 16
3094
3095	__asm__ __volatile__ (
3096	// load _ActiveMask
3097	"movq _ActiveMask, %%mm7 \n\t"
3098	// re-init address pointers and offset
3099	"movl _dif, %%ecx \n\t" // ecx: x = offset to alignment
3100	// boundary
3101	"movq _LBCarryMask, %%mm5 \n\t"
3102	// preload "movl row, %%edi \n\t" // edi: Avg(x)
3103	"movq _HBClearMask, %%mm4 \n\t"
3104	// preload "movl prev_row, %%esi \n\t" // esi: Prior(x)
3105
3106	// prime the pump: load the first Raw(x-bpp) data set
3107	"movq -8(%%edi,%%ecx,), %%mm2 \n\t" // load previous aligned 8 bytes
3108	// (we correct pos. in loop below)
3109	"avg_2lp: \n\t"
3110	"movq (%%edi,%%ecx,), %%mm0 \n\t"
3111	"psrlq _ShiftRem, %%mm2 \n\t" // shift data to pos. correctly
3112	"movq (%%esi,%%ecx,), %%mm1 \n\t" // (GRR BUGFIX: was psllq)
3113	// add (Prev_row/2) to average
3114	"movq %%mm5, %%mm3 \n\t"
3115	"pand %%mm1, %%mm3 \n\t" // get lsb for each prev_row byte
3116	"psrlq $1, %%mm1 \n\t" // divide prev_row bytes by 2
3117	"pand %%mm4, %%mm1 \n\t" // clear invalid bit 7 of each
3118	// byte
3119	"movq %%mm7, %%mm6 \n\t"
3120	"paddb %%mm1, %%mm0 \n\t" // add (Prev_row/2) to Avg for
3121	// each byte
3122
3123	// add 1st active group (Raw(x-bpp)/2) to average with _LBCarry
3124	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
3125	// LBCarrys
3126	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
3127	// where both
3128	// lsb's were == 1 (only valid
3129	// for active group)
3130	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3131	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3132	// byte
3133	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
3134	// for each byte
3135	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 1
3136	// bytes to add to Avg
3137	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to Avg
3138	// for each Active byte
3139
3140	// add 2nd active group (Raw(x-bpp)/2) to average with _LBCarry
3141	"psllq _ShiftBpp, %%mm6 \n\t" // shift the mm6 mask to cover
3142	// bytes 2 & 3
3143	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
3144	"psllq _ShiftBpp, %%mm2 \n\t" // shift data to pos. correctly
3145	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
3146	// LBCarrys
3147	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
3148	// where both
3149	// lsb's were == 1 (only valid
3150	// for active group)
3151	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3152	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3153	// byte
3154	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
3155	// for each byte
3156	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 2
3157	// bytes to add to Avg
3158	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
3159	// Avg for each Active byte
3160
3161	// add 3rd active group (Raw(x-bpp)/2) to average with _LBCarry
3162	"psllq _ShiftBpp, %%mm6 \n\t" // shift the mm6 mask to cover
3163	// bytes 4 & 5
3164	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
3165	"psllq _ShiftBpp, %%mm2 \n\t" // shift data to pos. correctly
3166	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
3167	// LBCarrys
3168	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
3169	// where both lsb's were == 1
3170	// (only valid for active group)
3171	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3172	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3173	// byte
3174	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
3175	// for each byte
3176	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 3
3177	// bytes to add to Avg
3178	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
3179	// Avg for each Active byte
3180
3181	// add 4th active group (Raw(x-bpp)/2) to average with _LBCarry
3182	"psllq _ShiftBpp, %%mm6 \n\t" // shift the mm6 mask to cover
3183	// bytes 6 & 7
3184	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
3185	"psllq _ShiftBpp, %%mm2 \n\t" // shift data to pos. correctly
3186	"addl $8, %%ecx \n\t"
3187	"movq %%mm3, %%mm1 \n\t" // now use mm1 for getting
3188	// LBCarrys
3189	"pand %%mm2, %%mm1 \n\t" // get LBCarrys for each byte
3190	// where both
3191	// lsb's were == 1 (only valid
3192	// for active group)
3193	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3194	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3195	// byte
3196	"paddb %%mm1, %%mm2 \n\t" // add LBCarrys to (Raw(x-bpp)/2)
3197	// for each byte
3198	"pand %%mm6, %%mm2 \n\t" // leave only Active Group 4
3199	// bytes to add to Avg
3200	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) + LBCarrys to
3201	// Avg for each Active byte
3202
3203	"cmpl _MMXLength, %%ecx \n\t"
3204	// now ready to write back to memory
3205	"movq %%mm0, -8(%%edi,%%ecx,) \n\t"
3206	// prep Raw(x-bpp) for next loop
3207	"movq %%mm0, %%mm2 \n\t" // mov updated Raws to mm2
3208	"jb avg_2lp \n\t"
3209
3210	: "=S" (dummy_value_S), // output regs (dummy)
3211	"=D" (dummy_value_D)
3212
3213	: "0" (prev_row), // esi // input regs
3214	"1" (row) // edi
3215
3216	: "%ecx" // clobber list
3217	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
3218	, "%mm0", "%mm1", "%mm2", "%mm3"
3219	, "%mm4", "%mm5", "%mm6", "%mm7"
3220	#endif
3221	);
3222	}
3223	break; // end 2 bpp
3224
3225	case 1:
3226	{
3227	__asm__ __volatile__ (
3228	// re-init address pointers and offset
3229	#ifdef __PIC__
3230	"pushl %%ebx \n\t" // save Global Offset Table index
3231	#endif
3232	"movl _dif, %%ebx \n\t" // ebx: x = offset to alignment
3233	// boundary
3234	// preload "movl row, %%edi \n\t" // edi: Avg(x)
3235	"cmpl _FullLength, %%ebx \n\t" // test if offset at end of array
3236	"jnb avg_1end \n\t"
3237	// do Avg decode for remaining bytes
3238	// preload "movl prev_row, %%esi \n\t" // esi: Prior(x)
3239	"movl %%edi, %%edx \n\t"
3240	// preload "subl bpp, %%edx \n\t" // (bpp is preloaded into ecx)
3241	"subl %%ecx, %%edx \n\t" // edx: Raw(x-bpp)
3242	"xorl %%ecx, %%ecx \n\t" // zero ecx before using cl & cx
3243	// in loop below
3244	"avg_1lp: \n\t"
3245	// Raw(x) = Avg(x) + ((Raw(x-bpp) + Prior(x))/2)
3246	"xorl %%eax, %%eax \n\t"
3247	"movb (%%esi,%%ebx,), %%cl \n\t" // load cl with Prior(x)
3248	"movb (%%edx,%%ebx,), %%al \n\t" // load al with Raw(x-bpp)
3249	"addw %%cx, %%ax \n\t"
3250	"incl %%ebx \n\t"
3251	"shrw %%ax \n\t" // divide by 2
3252	"addb -1(%%edi,%%ebx,), %%al \n\t" // add Avg(x); -1 to offset
3253	// inc ebx
3254	"cmpl _FullLength, %%ebx \n\t" // check if at end of array
3255	"movb %%al, -1(%%edi,%%ebx,) \n\t" // write back Raw(x);
3256	// mov does not affect flags; -1 to offset inc ebx
3257	"jb avg_1lp \n\t"
3258
3259	"avg_1end: \n\t"
3260	#ifdef __PIC__
3261	"popl %%ebx \n\t" // Global Offset Table index
3262	#endif
3263
3264	: "=c" (dummy_value_c), // output regs (dummy)
3265	"=S" (dummy_value_S),
3266	"=D" (dummy_value_D)
3267
3268	: "0" (bpp), // ecx // input regs
3269	"1" (prev_row), // esi
3270	"2" (row) // edi
3271
3272	: "%eax", "%edx" // clobber list
3273	#ifndef __PIC__
3274	, "%ebx"
3275	#endif
3276	);
3277	}
3278	return; // end 1 bpp
3279
3280	case 8:
3281	{
3282	__asm__ __volatile__ (
3283	// re-init address pointers and offset
3284	"movl _dif, %%ecx \n\t" // ecx: x == offset to alignment boundary
3285	"movq _LBCarryMask, %%mm5 \n\t"
3286	// preload "movl row, %%edi \n\t" // edi: Avg(x)
3287	"movq _HBClearMask, %%mm4 \n\t"
3288	// preload "movl prev_row, %%esi \n\t" // esi: Prior(x)
3289
3290	// prime the pump: load the first Raw(x-bpp) data set
3291	"movq -8(%%edi,%%ecx,), %%mm2 \n\t" // load previous aligned 8 bytes
3292	// (NO NEED to correct pos. in loop below)
3293
3294	"avg_8lp: \n\t"
3295	"movq (%%edi,%%ecx,), %%mm0 \n\t"
3296	"movq %%mm5, %%mm3 \n\t"
3297	"movq (%%esi,%%ecx,), %%mm1 \n\t"
3298	"addl $8, %%ecx \n\t"
3299	"pand %%mm1, %%mm3 \n\t" // get lsb for each prev_row byte
3300	"psrlq $1, %%mm1 \n\t" // divide prev_row bytes by 2
3301	"pand %%mm2, %%mm3 \n\t" // get LBCarrys for each byte
3302	// where both lsb's were == 1
3303	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3304	"pand %%mm4, %%mm1 \n\t" // clear invalid bit 7, each byte
3305	"paddb %%mm3, %%mm0 \n\t" // add LBCarrys to Avg, each byte
3306	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7, each byte
3307	"paddb %%mm1, %%mm0 \n\t" // add (Prev_row/2) to Avg, each byte
3308	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) to Avg, each byte
3309	"cmpl _MMXLength, %%ecx \n\t"
3310	"movq %%mm0, -8(%%edi,%%ecx,) \n\t"
3311	"movq %%mm0, %%mm2 \n\t" // reuse as Raw(x-bpp)
3312	"jb avg_8lp \n\t"
3313
3314	: "=S" (dummy_value_S), // output regs (dummy)
3315	"=D" (dummy_value_D)
3316
3317	: "0" (prev_row), // esi // input regs
3318	"1" (row) // edi
3319
3320	: "%ecx" // clobber list
3321	#if 0 /* %mm0, ..., %mm5 not supported by gcc 2.7.2.3 or egcs 1.1 */
3322	, "%mm0", "%mm1", "%mm2"
3323	, "%mm3", "%mm4", "%mm5"
3324	#endif
3325	);
3326	}
3327	break; // end 8 bpp
3328
3329	default: // bpp greater than 8 (!= 1,2,3,4,[5],6,[7],8)
3330	{
3331
3332	#ifdef PNG_DEBUG
3333	// GRR: PRINT ERROR HERE: SHOULD NEVER BE REACHED
3334	png_debug(1,
3335	"Internal logic error in pnggccrd (png_read_filter_row_mmx_avg())\n");
3336	#endif
3337
3338	#if 0
3339	__asm__ __volatile__ (
3340	"movq _LBCarryMask, %%mm5 \n\t"
3341	// re-init address pointers and offset
3342	"movl _dif, %%ebx \n\t" // ebx: x = offset to
3343	// alignment boundary
3344	"movl row, %%edi \n\t" // edi: Avg(x)
3345	"movq _HBClearMask, %%mm4 \n\t"
3346	"movl %%edi, %%edx \n\t"
3347	"movl prev_row, %%esi \n\t" // esi: Prior(x)
3348	"subl bpp, %%edx \n\t" // edx: Raw(x-bpp)
3349	"avg_Alp: \n\t"
3350	"movq (%%edi,%%ebx,), %%mm0 \n\t"
3351	"movq %%mm5, %%mm3 \n\t"
3352	"movq (%%esi,%%ebx,), %%mm1 \n\t"
3353	"pand %%mm1, %%mm3 \n\t" // get lsb for each prev_row byte
3354	"movq (%%edx,%%ebx,), %%mm2 \n\t"
3355	"psrlq $1, %%mm1 \n\t" // divide prev_row bytes by 2
3356	"pand %%mm2, %%mm3 \n\t" // get LBCarrys for each byte
3357	// where both lsb's were == 1
3358	"psrlq $1, %%mm2 \n\t" // divide raw bytes by 2
3359	"pand %%mm4, %%mm1 \n\t" // clear invalid bit 7 of each
3360	// byte
3361	"paddb %%mm3, %%mm0 \n\t" // add LBCarrys to Avg for each
3362	// byte
3363	"pand %%mm4, %%mm2 \n\t" // clear invalid bit 7 of each
3364	// byte
3365	"paddb %%mm1, %%mm0 \n\t" // add (Prev_row/2) to Avg for
3366	// each byte
3367	"addl $8, %%ebx \n\t"
3368	"paddb %%mm2, %%mm0 \n\t" // add (Raw/2) to Avg for each
3369	// byte
3370	"cmpl _MMXLength, %%ebx \n\t"
3371	"movq %%mm0, -8(%%edi,%%ebx,) \n\t"
3372	"jb avg_Alp \n\t"
3373
3374	: // FIXASM: output regs/vars go here, e.g.: "=m" (memory_var)
3375
3376	: // FIXASM: input regs, e.g.: "c" (count), "S" (src), "D" (dest)
3377
3378	: "%ebx", "%edx", "%edi", "%esi" // CHECKASM: clobber list
3379	);
3380	#endif /* 0 - NEVER REACHED */
3381	}
3382	break;
3383
3384	} // end switch (bpp)
3385
3386	__asm__ __volatile__ (
3387	// MMX acceleration complete; now do clean-up
3388	// check if any remaining bytes left to decode
3389	#ifdef __PIC__
3390	"pushl %%ebx \n\t" // save index to Global Offset Table
3391	#endif
3392	"movl _MMXLength, %%ebx \n\t" // ebx: x == offset bytes after MMX
3393	//pre "movl row, %%edi \n\t" // edi: Avg(x)
3394	"cmpl _FullLength, %%ebx \n\t" // test if offset at end of array
3395	"jnb avg_end \n\t"
3396
3397	// do Avg decode for remaining bytes
3398	//pre "movl prev_row, %%esi \n\t" // esi: Prior(x)
3399	"movl %%edi, %%edx \n\t"
3400	//pre "subl bpp, %%edx \n\t" // (bpp is preloaded into ecx)
3401	"subl %%ecx, %%edx \n\t" // edx: Raw(x-bpp)
3402	"xorl %%ecx, %%ecx \n\t" // zero ecx before using cl & cx below
3403
3404	"avg_lp2: \n\t"
3405	// Raw(x) = Avg(x) + ((Raw(x-bpp) + Prior(x))/2)
3406	"xorl %%eax, %%eax \n\t"
3407	"movb (%%esi,%%ebx,), %%cl \n\t" // load cl with Prior(x)
3408	"movb (%%edx,%%ebx,), %%al \n\t" // load al with Raw(x-bpp)
3409	"addw %%cx, %%ax \n\t"
3410	"incl %%ebx \n\t"
3411	"shrw %%ax \n\t" // divide by 2
3412	"addb -1(%%edi,%%ebx,), %%al \n\t" // add Avg(x); -1 to offset inc ebx
3413	"cmpl _FullLength, %%ebx \n\t" // check if at end of array
3414	"movb %%al, -1(%%edi,%%ebx,) \n\t" // write back Raw(x) [mov does not
3415	"jb avg_lp2 \n\t" // affect flags; -1 to offset inc ebx]
3416
3417	"avg_end: \n\t"
3418	"EMMS \n\t" // end MMX; prep for poss. FP instrs.
3419	#ifdef __PIC__
3420	"popl %%ebx \n\t" // restore index to Global Offset Table
3421	#endif
3422
3423	: "=c" (dummy_value_c), // output regs (dummy)
3424	"=S" (dummy_value_S),
3425	"=D" (dummy_value_D)
3426
3427	: "0" (bpp), // ecx // input regs
3428	"1" (prev_row), // esi
3429	"2" (row) // edi
3430
3431	: "%eax", "%edx" // clobber list
3432	#ifndef __PIC__
3433	, "%ebx"
3434	#endif
3435	);
3436
3437	} /* end png_read_filter_row_mmx_avg() */
3438	#endif
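// A scalar sketch of what the Avg decoder above computes, added for
// reference only. The names avg_floor_u8 and avg_unfilter_row are
// hypothetical and not part of libpng. The byte average is written the way
// the MMX loops appear to do it: halve each operand, then add back the
// carry that is lost when both low bits are set (the roles that
// _HBClearMask and _LBCarryMask seem to play in the assembler).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor((a + b) / 2) without widening past 8 bits:
 * (a >> 1) + (b >> 1) drops one unit exactly when both lsb's are 1,
 * so that carry is added back explicitly. */
static uint8_t avg_floor_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1) + (a & b & 1));
}

/* Hypothetical helper: undo the PNG Avg filter on one row, in place.
 * prev_row is the already-decoded previous row; bpp is bytes per pixel.
 * Raw(x) = Avg(x) + floor((Raw(x-bpp) + Prior(x)) / 2), modulo 256. */
static void avg_unfilter_row(uint8_t *row, const uint8_t *prev_row,
                             size_t rowbytes, size_t bpp)
{
    size_t x;

    assert(bpp > 0);
    for (x = 0; x < rowbytes; x++) {
        /* Raw(x-bpp) is taken as 0 for the first pixel of the row */
        uint8_t left = (x >= bpp) ? row[x - bpp] : 0;
        row[x] = (uint8_t)(row[x] + avg_floor_u8(left, prev_row[x]));
    }
}
```

// A row decoded this way should match the MMX path byte for byte, which
// makes the helper usable as a test oracle when porting the assembler.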
3439
3440
3441
3442	#ifdef PNG_THREAD_UNSAFE_OK
3443	//===========================================================================//
3444	// //
3445	// P N G _ R E A D _ F I L T E R _ R O W _ M M X _ P A E T H //
3446	// //
3447	//===========================================================================//
3448
3449	// Optimized code for PNG Paeth filter decoder
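// For orientation, a plain-C sketch of the Paeth decode that the routine
// below accelerates. The helper names are hypothetical and not part of
// libpng; a, b and c follow the naming used in the assembler comments
// (a = Raw(x-bpp), b = Prior(x), c = Prior(x-bpp)).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor: pick whichever of a, b, c is closest to p = a + b - c,
 * preferring a, then b, then c on ties. */
static uint8_t paeth_predictor(uint8_t a, uint8_t b, uint8_t c)
{
    int pav = (int)b - (int)c;      /* p - a */
    int pbv = (int)a - (int)c;      /* p - b */
    int pa = abs(pav);
    int pb = abs(pbv);
    int pc = abs(pav + pbv);        /* p - c */

    if (pa <= pb && pa <= pc)
        return a;
    if (pb <= pc)
        return b;
    return c;
}

/* Hypothetical helper: undo the PNG Paeth filter on one row, in place. */
static void paeth_unfilter_row(uint8_t *row, const uint8_t *prev_row,
                               size_t rowbytes, size_t bpp)
{
    size_t x;

    assert(bpp > 0);
    for (x = 0; x < rowbytes; x++) {
        /* left and upper-left neighbours are 0 for the first pixel */
        uint8_t a = (x >= bpp) ? row[x - bpp] : 0;
        uint8_t c = (x >= bpp) ? prev_row[x - bpp] : 0;
        row[x] = (uint8_t)(row[x] + paeth_predictor(a, prev_row[x], c));
    }
}
```

// With a == c == 0 the predictor always returns b, which is why the first
// bpp bytes reduce to Raw(x) = Paeth(x) + Prior(x) in the code below.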
3450
3451	static void /* PRIVATE */
3452	png_read_filter_row_mmx_paeth(png_row_infop row_info, png_bytep row,
3453	png_bytep prev_row)
3454	{
3455	int bpp;
3456	int dummy_value_c; // fix 'forbidden register 2 (cx) was spilled' error
3457	int dummy_value_S;
3458	int dummy_value_D;
3459
3460	bpp = (row_info->pixel_depth + 7) >> 3; // Get # bytes per pixel
3461	_FullLength = row_info->rowbytes; // # of bytes to filter
3462
3463	__asm__ __volatile__ (
3464	#ifdef __PIC__
3465	"pushl %%ebx \n\t" // save index to Global Offset Table
3466	#endif
3467	"xorl %%ebx, %%ebx \n\t" // ebx: x offset
3468	//pre "movl row, %%edi \n\t"
3469	"xorl %%edx, %%edx \n\t" // edx: x-bpp offset
3470	//pre "movl prev_row, %%esi \n\t"
3471	"xorl %%eax, %%eax \n\t"
3472
3473	// Compute the Raw value for the first bpp bytes
3474	// Note: the formula works out to be always
3475	// Paeth(x) = Raw(x) + Prior(x) where x < bpp
3476	"paeth_rlp: \n\t"
3477	"movb (%%edi,%%ebx,), %%al \n\t"
3478	"addb (%%esi,%%ebx,), %%al \n\t"
3479	"incl %%ebx \n\t"
3480	//pre "cmpl bpp, %%ebx \n\t" (bpp is preloaded into ecx)
3481	"cmpl %%ecx, %%ebx \n\t"
3482	"movb %%al, -1(%%edi,%%ebx,) \n\t"
3483	"jb paeth_rlp \n\t"
3484	// get # of bytes to alignment
3485	"movl %%edi, _dif \n\t" // take start of row
3486	"addl %%ebx, _dif \n\t" // add bpp
3487	"xorl %%ecx, %%ecx \n\t"
3488	"addl $0xf, _dif \n\t" // add 7 + 8 to incr past alignment
3489	// boundary
3490	"andl $0xfffffff8, _dif \n\t" // mask to alignment boundary
3491	"subl %%edi, _dif \n\t" // subtract from start ==> value ebx
3492	// at alignment
3493	"jz paeth_go \n\t"
3494	// fix alignment
3495
3496	"paeth_lp1: \n\t"
3497	"xorl %%eax, %%eax \n\t"
3498	// pav = p - a = (a + b - c) - a = b - c
3499	"movb (%%esi,%%ebx,), %%al \n\t" // load Prior(x) into al
3500	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
3501	"subl %%ecx, %%eax \n\t" // subtract Prior(x-bpp)
3502	"movl %%eax, _patemp \n\t" // Save pav for later use
3503	"xorl %%eax, %%eax \n\t"
3504	// pbv = p - b = (a + b - c) - b = a - c
3505	"movb (%%edi,%%edx,), %%al \n\t" // load Raw(x-bpp) into al
3506	"subl %%ecx, %%eax \n\t" // subtract Prior(x-bpp)
3507	"movl %%eax, %%ecx \n\t"
3508	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
3509	"addl _patemp, %%eax \n\t" // pcv = pav + pbv
3510	// pc = abs(pcv)
3511	"testl $0x80000000, %%eax \n\t"
3512	"jz paeth_pca \n\t"
3513	"negl %%eax \n\t" // reverse sign of neg values
3514
3515	"paeth_pca: \n\t"
3516	"movl %%eax, _pctemp \n\t" // save pc for later use
3517	// pb = abs(pbv)
3518	"testl $0x80000000, %%ecx \n\t"
3519	"jz paeth_pba \n\t"
3520	"negl %%ecx \n\t" // reverse sign of neg values
3521
3522	"paeth_pba: \n\t"
3523	"movl %%ecx, _pbtemp \n\t" // save pb for later use
3524	// pa = abs(pav)
3525	"movl _patemp, %%eax \n\t"
3526	"testl $0x80000000, %%eax \n\t"
3527	"jz paeth_paa \n\t"
3528	"negl %%eax \n\t" // reverse sign of neg values
3529
3530	"paeth_paa: \n\t"
3531	"movl %%eax, _patemp \n\t" // save pa for later use
3532	// test if pa <= pb
3533	"cmpl %%ecx, %%eax \n\t"
3534	"jna paeth_abb \n\t"
3535	// pa > pb; now test if pb <= pc
3536	"cmpl _pctemp, %%ecx \n\t"
3537	"jna paeth_bbc \n\t"
3538	// pb > pc; Raw(x) = Paeth(x) + Prior(x-bpp)
3539	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
3540	"jmp paeth_paeth \n\t"
3541
3542	"paeth_bbc: \n\t"
3543	// pb <= pc; Raw(x) = Paeth(x) + Prior(x)
3544	"movb (%%esi,%%ebx,), %%cl \n\t" // load Prior(x) into cl
3545	"jmp paeth_paeth \n\t"
3546
3547	"paeth_abb: \n\t"
3548	// pa <= pb; now test if pa <= pc
3549	"cmpl _pctemp, %%eax \n\t"
3550	"jna paeth_abc \n\t"
3551	// pa > pc; Raw(x) = Paeth(x) + Prior(x-bpp)
3552	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
3553	"jmp paeth_paeth \n\t"
3554
3555	"paeth_abc: \n\t"
3556	// pa <= pc; Raw(x) = Paeth(x) + Raw(x-bpp)
3557	"movb (%%edi,%%edx,), %%cl \n\t" // load Raw(x-bpp) into cl
3558
3559	"paeth_paeth: \n\t"
3560	"incl %%ebx \n\t"
3561	"incl %%edx \n\t"
3562	// Raw(x) = (Paeth(x) + Paeth_Predictor( a, b, c )) mod 256
3563	"addb %%cl, -1(%%edi,%%ebx,) \n\t"
3564	"cmpl _dif, %%ebx \n\t"
3565	"jb paeth_lp1 \n\t"
3566
3567	"paeth_go: \n\t"
3568	"movl _FullLength, %%ecx \n\t"
3569	"movl %%ecx, %%eax \n\t"
3570	"subl %%ebx, %%eax \n\t" // subtract alignment fix
3571	"andl $0x00000007, %%eax \n\t" // calc bytes over mult of 8
3572	"subl %%eax, %%ecx \n\t" // drop over bytes from original length
3573	"movl %%ecx, _MMXLength \n\t"
3574	#ifdef __PIC__
3575	"popl %%ebx \n\t" // restore index to Global Offset Table
3576	#endif
3577
3578	: "=c" (dummy_value_c), // output regs (dummy)
3579	"=S" (dummy_value_S),
3580	"=D" (dummy_value_D)
3581
3582	: "0" (bpp), // ecx // input regs
3583	"1" (prev_row), // esi
3584	"2" (row) // edi
3585
3586	: "%eax", "%edx" // clobber list
3587	#ifndef __PIC__
3588	, "%ebx"
3589	#endif
3590	);
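// The alignment arithmetic at the end of the block above (_dif and
// _MMXLength) can be sketched in C as follows. This is an illustration
// under the assumption that it mirrors the addl/andl/subl sequence; the
// type mmx_span and the function mmx_span_for_row are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two offsets the assembler stores. */
typedef struct {
    size_t dif;         /* offset of the first 8-byte-aligned MMX load */
    size_t mmx_length;  /* offset at which the MMX loop stops */
} mmx_span;

static mmx_span mmx_span_for_row(uintptr_t row_addr, size_t bpp,
                                 size_t full_length)
{
    mmx_span s;

    assert(bpp >= 1 && bpp <= 8);
    /* (row + bpp + 15) rounded down to a multiple of 8, relative to row:
     * always at least bpp + 8 bytes in, so the qword at offset dif - 8
     * still lies inside the row. */
    s.dif = (size_t)(((row_addr + bpp + 0xf) & ~(uintptr_t)7) - row_addr);
    /* drop the bytes that do not fill a whole qword after dif */
    s.mmx_length = full_length - ((full_length - s.dif) & 7);
    return s;
}
```

// Bytes in [bpp, dif) are handled by the scalar loop above, bytes in
// [dif, mmx_length) by the MMX loops, and the remainder by a scalar
// clean-up loop.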
3591
3592	// now do the math for the rest of the row
3593	switch (bpp)
3594	{
3595	case 3:
3596	{
3597	_ActiveMask.use = 0x0000000000ffffffLL;
3598	_ActiveMaskEnd.use = 0xffff000000000000LL;
3599	_ShiftBpp.use = 24; // == bpp(3) * 8
3600	_ShiftRem.use = 40; // == 64 - 24
3601
3602	__asm__ __volatile__ (
3603	"movl _dif, %%ecx \n\t"
3604	// preload "movl row, %%edi \n\t"
3605	// preload "movl prev_row, %%esi \n\t"
3606	"pxor %%mm0, %%mm0 \n\t"
3607	// prime the pump: load the first Raw(x-bpp) data set
3608	"movq -8(%%edi,%%ecx,), %%mm1 \n\t"
3609	"paeth_3lp: \n\t"
3610	"psrlq _ShiftRem, %%mm1 \n\t" // shift last 3 bytes to 1st
3611	// 3 bytes
3612	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x)
3613	"punpcklbw %%mm0, %%mm1 \n\t" // unpack Low bytes of a
3614	"movq -8(%%esi,%%ecx,), %%mm3 \n\t" // prep c=Prior(x-bpp) bytes
3615	"punpcklbw %%mm0, %%mm2 \n\t" // unpack Low bytes of b
3616	"psrlq _ShiftRem, %%mm3 \n\t" // shift last 3 bytes to 1st
3617	// 3 bytes
3618	// pav = p - a = (a + b - c) - a = b - c
3619	"movq %%mm2, %%mm4 \n\t"
3620	"punpcklbw %%mm0, %%mm3 \n\t" // unpack Low bytes of c
3621	// pbv = p - b = (a + b - c) - b = a - c
3622	"movq %%mm1, %%mm5 \n\t"
3623	"psubw %%mm3, %%mm4 \n\t"
3624	"pxor %%mm7, %%mm7 \n\t"
3625	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
3626	"movq %%mm4, %%mm6 \n\t"
3627	"psubw %%mm3, %%mm5 \n\t"
3628
3629	// pa = abs(p-a) = abs(pav)
3630	// pb = abs(p-b) = abs(pbv)
3631	// pc = abs(p-c) = abs(pcv)
3632	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
3633	"paddw %%mm5, %%mm6 \n\t"
3634	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
3635	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
3636	"psubw %%mm0, %%mm4 \n\t"
3637	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
3638	"psubw %%mm0, %%mm4 \n\t"
3639	"psubw %%mm7, %%mm5 \n\t"
3640	"pxor %%mm0, %%mm0 \n\t"
3641	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
3642	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
3643	"psubw %%mm7, %%mm5 \n\t"
3644	"psubw %%mm0, %%mm6 \n\t"
3645	// test pa <= pb
3646	"movq %%mm4, %%mm7 \n\t"
3647	"psubw %%mm0, %%mm6 \n\t"
3648	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
3649	"movq %%mm7, %%mm0 \n\t"
3650	// use mm7 mask to merge pa & pb
3651	"pand %%mm7, %%mm5 \n\t"
3652	// use mm0 mask copy to merge a & b
3653	"pand %%mm0, %%mm2 \n\t"
3654	"pandn %%mm4, %%mm7 \n\t"
3655	"pandn %%mm1, %%mm0 \n\t"
3656	"paddw %%mm5, %%mm7 \n\t"
3657	"paddw %%mm2, %%mm0 \n\t"
3658	// test ((pa <= pb)? pa:pb) <= pc
3659	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
3660	"pxor %%mm1, %%mm1 \n\t"
3661	"pand %%mm7, %%mm3 \n\t"
3662	"pandn %%mm0, %%mm7 \n\t"
3663	"paddw %%mm3, %%mm7 \n\t"
3664	"pxor %%mm0, %%mm0 \n\t"
3665	"packuswb %%mm1, %%mm7 \n\t"
3666	"movq (%%esi,%%ecx,), %%mm3 \n\t" // load c=Prior(x-bpp)
3667	"pand _ActiveMask, %%mm7 \n\t"
3668	"movq %%mm3, %%mm2 \n\t" // load b=Prior(x) step 1
3669	"paddb (%%edi,%%ecx,), %%mm7 \n\t" // add Paeth predictor with Raw(x)
3670	"punpcklbw %%mm0, %%mm3 \n\t" // unpack Low bytes of c
3671	"movq %%mm7, (%%edi,%%ecx,) \n\t" // write back updated value
3672	"movq %%mm7, %%mm1 \n\t" // now mm1 will be used as
3673	// Raw(x-bpp)
3674	// now do Paeth for 2nd set of bytes (3-5)
3675	"psrlq _ShiftBpp, %%mm2 \n\t" // load b=Prior(x) step 2
3676	"punpcklbw %%mm0, %%mm1 \n\t" // unpack Low bytes of a
3677	"pxor %%mm7, %%mm7 \n\t"
3678	"punpcklbw %%mm0, %%mm2 \n\t" // unpack Low bytes of b
3679	// pbv = p - b = (a + b - c) - b = a - c
3680	"movq %%mm1, %%mm5 \n\t"
3681	// pav = p - a = (a + b - c) - a = b - c
3682	"movq %%mm2, %%mm4 \n\t"
3683	"psubw %%mm3, %%mm5 \n\t"
3684	"psubw %%mm3, %%mm4 \n\t"
3685	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) =
3686	// pav + pbv = pbv + pav
3687	"movq %%mm5, %%mm6 \n\t"
3688	"paddw %%mm4, %%mm6 \n\t"
3689
3690	// pa = abs(p-a) = abs(pav)
3691	// pb = abs(p-b) = abs(pbv)
3692	// pc = abs(p-c) = abs(pcv)
3693	"pcmpgtw %%mm5, %%mm0 \n\t" // create mask pbv bytes < 0
3694	"pcmpgtw %%mm4, %%mm7 \n\t" // create mask pav bytes < 0
3695	"pand %%mm5, %%mm0 \n\t" // only pbv bytes < 0 in mm0
3696	"pand %%mm4, %%mm7 \n\t" // only pav bytes < 0 in mm7
3697	"psubw %%mm0, %%mm5 \n\t"
3698	"psubw %%mm7, %%mm4 \n\t"
3699	"psubw %%mm0, %%mm5 \n\t"
3700	"psubw %%mm7, %%mm4 \n\t"
3701	"pxor %%mm0, %%mm0 \n\t"
3702	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
3703	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
3704	"psubw %%mm0, %%mm6 \n\t"
3705	// test pa <= pb
3706	"movq %%mm4, %%mm7 \n\t"
3707	"psubw %%mm0, %%mm6 \n\t"
3708	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
3709	"movq %%mm7, %%mm0 \n\t"
3710	// use mm7 mask to merge pa & pb
3711	"pand %%mm7, %%mm5 \n\t"
3712	// use mm0 mask copy to merge a & b
3713	"pand %%mm0, %%mm2 \n\t"
3714	"pandn %%mm4, %%mm7 \n\t"
3715	"pandn %%mm1, %%mm0 \n\t"
3716	"paddw %%mm5, %%mm7 \n\t"
3717	"paddw %%mm2, %%mm0 \n\t"
3718	// test ((pa <= pb)? pa:pb) <= pc
3719	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
3720	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x)
3721	"pand %%mm7, %%mm3 \n\t"
3722	"pandn %%mm0, %%mm7 \n\t"
3723	"pxor %%mm1, %%mm1 \n\t"
3724	"paddw %%mm3, %%mm7 \n\t"
3725	"pxor %%mm0, %%mm0 \n\t"
3726	"packuswb %%mm1, %%mm7 \n\t"
3727	"movq %%mm2, %%mm3 \n\t" // load c=Prior(x-bpp) step 1
3728	"pand _ActiveMask, %%mm7 \n\t"
3729	"punpckhbw %%mm0, %%mm2 \n\t" // unpack High bytes of b
3730	"psllq _ShiftBpp, %%mm7 \n\t" // shift bytes to 2nd group of
3731	// 3 bytes
3732	// pav = p - a = (a + b - c) - a = b - c
3733	"movq %%mm2, %%mm4 \n\t"
3734	"paddb (%%edi,%%ecx,), %%mm7 \n\t" // add Paeth predictor with Raw(x)
3735	"psllq _ShiftBpp, %%mm3 \n\t" // load c=Prior(x-bpp) step 2
3736	"movq %%mm7, (%%edi,%%ecx,) \n\t" // write back updated value
3737	"movq %%mm7, %%mm1 \n\t"
3738	"punpckhbw %%mm0, %%mm3 \n\t" // unpack High bytes of c
3739	"psllq _ShiftBpp, %%mm1 \n\t" // shift bytes
3740	// now mm1 will be used as Raw(x-bpp)
3741	// now do Paeth for 3rd, and final, set of bytes (6-7)
3742	"pxor %%mm7, %%mm7 \n\t"
3743	"punpckhbw %%mm0, %%mm1 \n\t" // unpack High bytes of a
3744	"psubw %%mm3, %%mm4 \n\t"
3745	// pbv = p - b = (a + b - c) - b = a - c
3746	"movq %%mm1, %%mm5 \n\t"
3747	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
3748	"movq %%mm4, %%mm6 \n\t"
3749	"psubw %%mm3, %%mm5 \n\t"
3750	"pxor %%mm0, %%mm0 \n\t"
3751	"paddw %%mm5, %%mm6 \n\t"
3752
3753	// pa = abs(p-a) = abs(pav)
3754	// pb = abs(p-b) = abs(pbv)
3755	// pc = abs(p-c) = abs(pcv)
3756	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
3757	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
3758	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
3759	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
3760	"psubw %%mm0, %%mm4 \n\t"
3761	"psubw %%mm7, %%mm5 \n\t"
3762	"psubw %%mm0, %%mm4 \n\t"
3763	"psubw %%mm7, %%mm5 \n\t"
3764	"pxor %%mm0, %%mm0 \n\t"
3765	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
3766	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
3767	"psubw %%mm0, %%mm6 \n\t"
3768	// test pa <= pb
3769	"movq %%mm4, %%mm7 \n\t"
3770	"psubw %%mm0, %%mm6 \n\t"
3771	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
3772	"movq %%mm7, %%mm0 \n\t"
3773	// use mm0 mask copy to merge a & b
3774	"pand %%mm0, %%mm2 \n\t"
3775	// use mm7 mask to merge pa & pb
3776	"pand %%mm7, %%mm5 \n\t"
3777	"pandn %%mm1, %%mm0 \n\t"
3778	"pandn %%mm4, %%mm7 \n\t"
3779	"paddw %%mm2, %%mm0 \n\t"
3780	"paddw %%mm5, %%mm7 \n\t"
3781	// test ((pa <= pb)? pa:pb) <= pc
3782	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
3783	"pand %%mm7, %%mm3 \n\t"
3784	"pandn %%mm0, %%mm7 \n\t"
3785	"paddw %%mm3, %%mm7 \n\t"
3786	"pxor %%mm1, %%mm1 \n\t"
3787	"packuswb %%mm7, %%mm1 \n\t"
3788	// step ecx to next set of 8 bytes and repeat loop til done
3789	"addl $8, %%ecx \n\t"
3790	"pand _ActiveMaskEnd, %%mm1 \n\t"
3791	"paddb -8(%%edi,%%ecx,), %%mm1 \n\t" // add Paeth predictor with
3792	// Raw(x)
3793
3794	"cmpl _MMXLength, %%ecx \n\t"
3795	"pxor %%mm0, %%mm0 \n\t" // pxor does not affect flags
3796	"movq %%mm1, -8(%%edi,%%ecx,) \n\t" // write back updated value
3797	// mm1 will be used as Raw(x-bpp) next loop
3798	// mm3 ready to be used as Prior(x-bpp) next loop
3799	"jb paeth_3lp \n\t"
3800
3801	: "=S" (dummy_value_S), // output regs (dummy)
3802	"=D" (dummy_value_D)
3803
3804	: "0" (prev_row), // esi // input regs
3805	"1" (row) // edi
3806
3807	: "%ecx" // clobber list
3808	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
3809	, "%mm0", "%mm1", "%mm2", "%mm3"
3810	, "%mm4", "%mm5", "%mm6", "%mm7"
3811	#endif
3812	);
3813	}
3814	break; // end 3 bpp
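// The MMX loop above evaluates the Paeth predictor without branches, on
// 16-bit lanes. A one-lane C sketch of the same idea follows; it is an
// illustration only, the function names are hypothetical, and the mapping
// to individual instructions in the comments is approximate.

```c
#include <assert.h>
#include <stdint.h>

/* abs() via a sign mask: mask = (0 > v) per lane (pcmpgtw), keep v only
 * where negative (pand), then subtract that twice (two psubw). */
static int16_t lane_abs(int16_t v)
{
    int16_t mask = (int16_t)((0 > v) ? -1 : 0);
    int16_t neg = (int16_t)(v & mask);
    return (int16_t)(v - neg - neg);
}

/* Merge two values under an all-ones / all-zeros mask
 * (pand, pandn, then paddw to combine the disjoint halves). */
static int16_t lane_select(int16_t mask, int16_t if_set, int16_t if_clear)
{
    return (int16_t)((if_set & mask) + (if_clear & ~mask));
}

/* Branch-free Paeth predictor for one lane:
 * a = Raw(x-bpp), b = Prior(x), c = Prior(x-bpp). */
static uint8_t paeth_lane(uint8_t a, uint8_t b, uint8_t c)
{
    int16_t pav = (int16_t)(b - c);
    int16_t pbv = (int16_t)(a - c);
    int16_t pa = lane_abs(pav);
    int16_t pb = lane_abs(pbv);
    int16_t pc = lane_abs((int16_t)(pav + pbv));

    int16_t m_ab = (int16_t)((pa > pb) ? -1 : 0);   /* pa > pb? */
    int16_t pab = lane_select(m_ab, pb, pa);        /* min(pa, pb) */
    int16_t ab = lane_select(m_ab, b, a);           /* matching a or b */
    int16_t m_c = (int16_t)((pab > pc) ? -1 : 0);   /* pab > pc? */

    return (uint8_t)lane_select(m_c, c, ab);
}
```

// The two compares reproduce the usual tie-breaking order (a, then b,
// then c), so the result should agree with a branching implementation.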
3815
3816	case 6:
3817	//case 7: // GRR BOGUS
3818	//case 5: // GRR BOGUS
3819	{
3820	_ActiveMask.use = 0x00000000ffffffffLL;
3821	_ActiveMask2.use = 0xffffffff00000000LL;
3822	_ShiftBpp.use = bpp << 3; // == bpp * 8
3823	_ShiftRem.use = 64 - _ShiftBpp.use;
3824
3825	__asm__ __volatile__ (
3826	"movl _dif, %%ecx \n\t"
3827	// preload "movl row, %%edi \n\t"
3828	// preload "movl prev_row, %%esi \n\t"
3829	// prime the pump: load the first Raw(x-bpp) data set
3830	"movq -8(%%edi,%%ecx,), %%mm1 \n\t"
3831	"pxor %%mm0, %%mm0 \n\t"
3832
3833	"paeth_6lp: \n\t"
3834	// must shift to position Raw(x-bpp) data
3835	"psrlq _ShiftRem, %%mm1 \n\t"
3836	// do first set of 4 bytes
3837	"movq -8(%%esi,%%ecx,), %%mm3 \n\t" // read c=Prior(x-bpp) bytes
3838	"punpcklbw %%mm0, %%mm1 \n\t" // unpack Low bytes of a
3839	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x)
3840	"punpcklbw %%mm0, %%mm2 \n\t" // unpack Low bytes of b
3841	// must shift to position Prior(x-bpp) data
3842	"psrlq _ShiftRem, %%mm3 \n\t"
3843	// pav = p - a = (a + b - c) - a = b - c
3844	"movq %%mm2, %%mm4 \n\t"
3845	"punpcklbw %%mm0, %%mm3 \n\t" // unpack Low bytes of c
3846	// pbv = p - b = (a + b - c) - b = a - c
3847	"movq %%mm1, %%mm5 \n\t"
3848	"psubw %%mm3, %%mm4 \n\t"
3849	"pxor %%mm7, %%mm7 \n\t"
3850	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
3851	"movq %%mm4, %%mm6 \n\t"
3852	"psubw %%mm3, %%mm5 \n\t"
3853	// pa = abs(p-a) = abs(pav)
3854	// pb = abs(p-b) = abs(pbv)
3855	// pc = abs(p-c) = abs(pcv)
3856	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
3857	"paddw %%mm5, %%mm6 \n\t"
3858	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
3859	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
3860	"psubw %%mm0, %%mm4 \n\t"
3861	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
3862	"psubw %%mm0, %%mm4 \n\t"
3863	"psubw %%mm7, %%mm5 \n\t"
3864	"pxor %%mm0, %%mm0 \n\t"
3865	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
3866	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
3867	"psubw %%mm7, %%mm5 \n\t"
3868	"psubw %%mm0, %%mm6 \n\t"
3869	// test pa <= pb
3870	"movq %%mm4, %%mm7 \n\t"
3871	"psubw %%mm0, %%mm6 \n\t"
3872	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
3873	"movq %%mm7, %%mm0 \n\t"
3874	// use mm7 mask to merge pa & pb
3875	"pand %%mm7, %%mm5 \n\t"
3876	// use mm0 mask copy to merge a & b
3877	"pand %%mm0, %%mm2 \n\t"
3878	"pandn %%mm4, %%mm7 \n\t"
3879	"pandn %%mm1, %%mm0 \n\t"
3880	"paddw %%mm5, %%mm7 \n\t"
3881	"paddw %%mm2, %%mm0 \n\t"
3882	// test ((pa <= pb)? pa:pb) <= pc
3883	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
3884	"pxor %%mm1, %%mm1 \n\t"
3885	"pand %%mm7, %%mm3 \n\t"
3886	"pandn %%mm0, %%mm7 \n\t"
3887	"paddw %%mm3, %%mm7 \n\t"
3888	"pxor %%mm0, %%mm0 \n\t"
3889	"packuswb %%mm1, %%mm7 \n\t"
3890	"movq -8(%%esi,%%ecx,), %%mm3 \n\t" // load c=Prior(x-bpp)
3891	"pand _ActiveMask, %%mm7 \n\t"
3892	"psrlq _ShiftRem, %%mm3 \n\t"
3893	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x) step 1
3894	"paddb (%%edi,%%ecx,), %%mm7 \n\t" // add Paeth predictor and Raw(x)
3895	"movq %%mm2, %%mm6 \n\t"
3896	"movq %%mm7, (%%edi,%%ecx,) \n\t" // write back updated value
3897	"movq -8(%%edi,%%ecx,), %%mm1 \n\t"
3898	"psllq _ShiftBpp, %%mm6 \n\t"
3899	"movq %%mm7, %%mm5 \n\t"
3900	"psrlq _ShiftRem, %%mm1 \n\t"
3901	"por %%mm6, %%mm3 \n\t"
3902	"psllq _ShiftBpp, %%mm5 \n\t"
3903	"punpckhbw %%mm0, %%mm3 \n\t" // unpack High bytes of c
3904	"por %%mm5, %%mm1 \n\t"
3905	// do second set of 4 bytes
3906	"punpckhbw %%mm0, %%mm2 \n\t" // unpack High bytes of b
3907	"punpckhbw %%mm0, %%mm1 \n\t" // unpack High bytes of a
3908	// pav = p - a = (a + b - c) - a = b - c
3909	"movq %%mm2, %%mm4 \n\t"
3910	// pbv = p - b = (a + b - c) - b = a - c
3911	"movq %%mm1, %%mm5 \n\t"
3912	"psubw %%mm3, %%mm4 \n\t"
3913	"pxor %%mm7, %%mm7 \n\t"
3914	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
3915	"movq %%mm4, %%mm6 \n\t"
3916	"psubw %%mm3, %%mm5 \n\t"
3917	// pa = abs(p-a) = abs(pav)
3918	// pb = abs(p-b) = abs(pbv)
3919	// pc = abs(p-c) = abs(pcv)
3920	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
3921	"paddw %%mm5, %%mm6 \n\t"
3922	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
3923	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
3924	"psubw %%mm0, %%mm4 \n\t"
3925	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
3926	"psubw %%mm0, %%mm4 \n\t"
3927	"psubw %%mm7, %%mm5 \n\t"
3928	"pxor %%mm0, %%mm0 \n\t"
3929	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
3930	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
3931	"psubw %%mm7, %%mm5 \n\t"
3932	"psubw %%mm0, %%mm6 \n\t"
3933	// test pa <= pb
3934	"movq %%mm4, %%mm7 \n\t"
3935	"psubw %%mm0, %%mm6 \n\t"
3936	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
3937	"movq %%mm7, %%mm0 \n\t"
3938	// use mm7 mask to merge pa & pb
3939	"pand %%mm7, %%mm5 \n\t"
3940	// use mm0 mask copy to merge a & b
3941	"pand %%mm0, %%mm2 \n\t"
3942	"pandn %%mm4, %%mm7 \n\t"
3943	"pandn %%mm1, %%mm0 \n\t"
3944	"paddw %%mm5, %%mm7 \n\t"
3945	"paddw %%mm2, %%mm0 \n\t"
3946	// test ((pa <= pb)? pa:pb) <= pc
3947	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
3948	"pxor %%mm1, %%mm1 \n\t"
3949	"pand %%mm7, %%mm3 \n\t"
3950	"pandn %%mm0, %%mm7 \n\t"
3951	"pxor %%mm1, %%mm1 \n\t"
3952	"paddw %%mm3, %%mm7 \n\t"
3953	"pxor %%mm0, %%mm0 \n\t"
3954	// step ecx to next set of 8 bytes and repeat loop til done
3955	"addl $8, %%ecx \n\t"
3956	"packuswb %%mm7, %%mm1 \n\t"
3957	"paddb -8(%%edi,%%ecx,), %%mm1 \n\t" // add Paeth predictor with Raw(x)
3958	"cmpl _MMXLength, %%ecx \n\t"
3959	"movq %%mm1, -8(%%edi,%%ecx,) \n\t" // write back updated value
3960	// mm1 will be used as Raw(x-bpp) next loop
3961	"jb paeth_6lp \n\t"
3962
3963	: "=S" (dummy_value_S), // output regs (dummy)
3964	"=D" (dummy_value_D)
3965
3966	: "0" (prev_row), // esi // input regs
3967	"1" (row) // edi
3968
3969	: "%ecx" // clobber list
3970	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
3971	, "%mm0", "%mm1", "%mm2", "%mm3"
3972	, "%mm4", "%mm5", "%mm6", "%mm7"
3973	#endif
3974	);
3975	}
3976	break; // end 6 bpp
3977
3978	case 4:
3979	{
3980	_ActiveMask.use = 0x00000000ffffffffLL;
3981
3982	__asm__ __volatile__ (
3983	"movl _dif, %%ecx \n\t"
3984	// preload "movl row, %%edi \n\t"
3985	// preload "movl prev_row, %%esi \n\t"
3986	"pxor %%mm0, %%mm0 \n\t"
3987	// prime the pump: load the first Raw(x-bpp) data set
3988	"movq -8(%%edi,%%ecx,), %%mm1 \n\t" // only time should need to read
3989	// a=Raw(x-bpp) bytes
3990	"paeth_4lp: \n\t"
3991	// do first set of 4 bytes
3992	"movq -8(%%esi,%%ecx,), %%mm3 \n\t" // read c=Prior(x-bpp) bytes
3993	"punpckhbw %%mm0, %%mm1 \n\t" // unpack High bytes of a
3994	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x)
3995	"punpcklbw %%mm0, %%mm2 \n\t" // unpack Low bytes of b
3996	// pav = p - a = (a + b - c) - a = b - c
3997	"movq %%mm2, %%mm4 \n\t"
3998	"punpckhbw %%mm0, %%mm3 \n\t" // unpack High bytes of c
3999	// pbv = p - b = (a + b - c) - b = a - c
4000	"movq %%mm1, %%mm5 \n\t"
4001	"psubw %%mm3, %%mm4 \n\t"
4002	"pxor %%mm7, %%mm7 \n\t"
4003	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
4004	"movq %%mm4, %%mm6 \n\t"
4005	"psubw %%mm3, %%mm5 \n\t"
4006	// pa = abs(p-a) = abs(pav)
4007	// pb = abs(p-b) = abs(pbv)
4008	// pc = abs(p-c) = abs(pcv)
4009	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
4010	"paddw %%mm5, %%mm6 \n\t"
4011	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
4012	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
4013	"psubw %%mm0, %%mm4 \n\t"
4014	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
4015	"psubw %%mm0, %%mm4 \n\t"
4016	"psubw %%mm7, %%mm5 \n\t"
4017	"pxor %%mm0, %%mm0 \n\t"
4018	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
4019	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
4020	"psubw %%mm7, %%mm5 \n\t"
4021	"psubw %%mm0, %%mm6 \n\t"
4022	// test pa <= pb
4023	"movq %%mm4, %%mm7 \n\t"
4024	"psubw %%mm0, %%mm6 \n\t"
4025	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
4026	"movq %%mm7, %%mm0 \n\t"
4027	// use mm7 mask to merge pa & pb
4028	"pand %%mm7, %%mm5 \n\t"
4029	// use mm0 mask copy to merge a & b
4030	"pand %%mm0, %%mm2 \n\t"
4031	"pandn %%mm4, %%mm7 \n\t"
4032	"pandn %%mm1, %%mm0 \n\t"
4033	"paddw %%mm5, %%mm7 \n\t"
4034	"paddw %%mm2, %%mm0 \n\t"
4035	// test ((pa <= pb)? pa:pb) <= pc
4036	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
4037	"pxor %%mm1, %%mm1 \n\t"
4038	"pand %%mm7, %%mm3 \n\t"
4039	"pandn %%mm0, %%mm7 \n\t"
4040	"paddw %%mm3, %%mm7 \n\t"
4041	"pxor %%mm0, %%mm0 \n\t"
4042	"packuswb %%mm1, %%mm7 \n\t"
4043	"movq (%%esi,%%ecx,), %%mm3 \n\t" // load c=Prior(x-bpp)
4044	"pand _ActiveMask, %%mm7 \n\t"
4045	"movq %%mm3, %%mm2 \n\t" // load b=Prior(x) step 1
4046	"paddb (%%edi,%%ecx,), %%mm7 \n\t" // add Paeth predictor with Raw(x)
4047	"punpcklbw %%mm0, %%mm3 \n\t" // unpack Low bytes of c
4048	"movq %%mm7, (%%edi,%%ecx,) \n\t" // write back updated value
4049	"movq %%mm7, %%mm1 \n\t" // now mm1 will be used as Raw(x-bpp)
4050	// do second set of 4 bytes
4051	"punpckhbw %%mm0, %%mm2 \n\t" // unpack High bytes of b
4052	"punpcklbw %%mm0, %%mm1 \n\t" // unpack Low bytes of a
4053	// pav = p - a = (a + b - c) - a = b - c
4054	"movq %%mm2, %%mm4 \n\t"
4055	// pbv = p - b = (a + b - c) - b = a - c
4056	"movq %%mm1, %%mm5 \n\t"
4057	"psubw %%mm3, %%mm4 \n\t"
4058	"pxor %%mm7, %%mm7 \n\t"
4059	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
4060	"movq %%mm4, %%mm6 \n\t"
4061	"psubw %%mm3, %%mm5 \n\t"
4062	// pa = abs(p-a) = abs(pav)
4063	// pb = abs(p-b) = abs(pbv)
4064	// pc = abs(p-c) = abs(pcv)
4065	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
4066	"paddw %%mm5, %%mm6 \n\t"
4067	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
4068	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
4069	"psubw %%mm0, %%mm4 \n\t"
4070	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
4071	"psubw %%mm0, %%mm4 \n\t"
4072	"psubw %%mm7, %%mm5 \n\t"
4073	"pxor %%mm0, %%mm0 \n\t"
4074	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
4075	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
4076	"psubw %%mm7, %%mm5 \n\t"
4077	"psubw %%mm0, %%mm6 \n\t"
4078	// test pa <= pb
4079	"movq %%mm4, %%mm7 \n\t"
4080	"psubw %%mm0, %%mm6 \n\t"
4081	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
4082	"movq %%mm7, %%mm0 \n\t"
4083	// use mm7 mask to merge pa & pb
4084	"pand %%mm7, %%mm5 \n\t"
4085	// use mm0 mask copy to merge a & b
4086	"pand %%mm0, %%mm2 \n\t"
4087	"pandn %%mm4, %%mm7 \n\t"
4088	"pandn %%mm1, %%mm0 \n\t"
4089	"paddw %%mm5, %%mm7 \n\t"
4090	"paddw %%mm2, %%mm0 \n\t"
4091	// test ((pa <= pb)? pa:pb) <= pc
4092	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
4093	"pxor %%mm1, %%mm1 \n\t"
4094	"pand %%mm7, %%mm3 \n\t"
4095	"pandn %%mm0, %%mm7 \n\t"
4096	"pxor %%mm1, %%mm1 \n\t"
4097	"paddw %%mm3, %%mm7 \n\t"
4098	"pxor %%mm0, %%mm0 \n\t"
4099	// step ecx to next set of 8 bytes and repeat loop til done
4100	"addl $8, %%ecx \n\t"
4101	"packuswb %%mm7, %%mm1 \n\t"
4102	"paddb -8(%%edi,%%ecx,), %%mm1 \n\t" // add predictor with Raw(x)
4103	"cmpl _MMXLength, %%ecx \n\t"
4104	"movq %%mm1, -8(%%edi,%%ecx,) \n\t" // write back updated value
4105	// mm1 will be used as Raw(x-bpp) next loop
4106	"jb paeth_4lp \n\t"
4107
4108	: "=S" (dummy_value_S), // output regs (dummy)
4109	"=D" (dummy_value_D)
4110
4111	: "0" (prev_row), // esi // input regs
4112	"1" (row) // edi
4113
4114	: "%ecx" // clobber list
4115	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
4116	, "%mm0", "%mm1", "%mm2", "%mm3"
4117	, "%mm4", "%mm5", "%mm6", "%mm7"
4118	#endif
4119	);
4120	}
4121	break; // end 4 bpp
4122
4123	case 8: // bpp == 8
4124	{
4125	_ActiveMask.use = 0x00000000ffffffffLL;
4126
4127	__asm__ __volatile__ (
4128	"movl _dif, %%ecx \n\t"
4129	// preload "movl row, %%edi \n\t"
4130	// preload "movl prev_row, %%esi \n\t"
4131	"pxor %%mm0, %%mm0 \n\t"
4132	// prime the pump: load the first Raw(x-bpp) data set
4133	"movq -8(%%edi,%%ecx,), %%mm1 \n\t" // only time should need to read
4134	// a=Raw(x-bpp) bytes
4135	"paeth_8lp: \n\t"
4136	// do first set of 4 bytes
4137	"movq -8(%%esi,%%ecx,), %%mm3 \n\t" // read c=Prior(x-bpp) bytes
4138	"punpcklbw %%mm0, %%mm1 \n\t" // unpack Low bytes of a
4139	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x)
4140	"punpcklbw %%mm0, %%mm2 \n\t" // unpack Low bytes of b
4141	// pav = p - a = (a + b - c) - a = b - c
4142	"movq %%mm2, %%mm4 \n\t"
4143	"punpcklbw %%mm0, %%mm3 \n\t" // unpack Low bytes of c
4144	// pbv = p - b = (a + b - c) - b = a - c
4145	"movq %%mm1, %%mm5 \n\t"
4146	"psubw %%mm3, %%mm4 \n\t"
4147	"pxor %%mm7, %%mm7 \n\t"
4148	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
4149	"movq %%mm4, %%mm6 \n\t"
4150	"psubw %%mm3, %%mm5 \n\t"
4151	// pa = abs(p-a) = abs(pav)
4152	// pb = abs(p-b) = abs(pbv)
4153	// pc = abs(p-c) = abs(pcv)
4154	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
4155	"paddw %%mm5, %%mm6 \n\t"
4156	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
4157	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
4158	"psubw %%mm0, %%mm4 \n\t"
4159	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
4160	"psubw %%mm0, %%mm4 \n\t"
4161	"psubw %%mm7, %%mm5 \n\t"
4162	"pxor %%mm0, %%mm0 \n\t"
4163	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
4164	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
4165	"psubw %%mm7, %%mm5 \n\t"
4166	"psubw %%mm0, %%mm6 \n\t"
4167	// test pa <= pb
4168	"movq %%mm4, %%mm7 \n\t"
4169	"psubw %%mm0, %%mm6 \n\t"
4170	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
4171	"movq %%mm7, %%mm0 \n\t"
4172	// use mm7 mask to merge pa & pb
4173	"pand %%mm7, %%mm5 \n\t"
4174	// use mm0 mask copy to merge a & b
4175	"pand %%mm0, %%mm2 \n\t"
4176	"pandn %%mm4, %%mm7 \n\t"
4177	"pandn %%mm1, %%mm0 \n\t"
4178	"paddw %%mm5, %%mm7 \n\t"
4179	"paddw %%mm2, %%mm0 \n\t"
4180	// test ((pa <= pb)? pa:pb) <= pc
4181	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
4182	"pxor %%mm1, %%mm1 \n\t"
4183	"pand %%mm7, %%mm3 \n\t"
4184	"pandn %%mm0, %%mm7 \n\t"
4185	"paddw %%mm3, %%mm7 \n\t"
4186	"pxor %%mm0, %%mm0 \n\t"
4187	"packuswb %%mm1, %%mm7 \n\t"
4188	"movq -8(%%esi,%%ecx,), %%mm3 \n\t" // read c=Prior(x-bpp) bytes
4189	"pand _ActiveMask, %%mm7 \n\t"
4190	"movq (%%esi,%%ecx,), %%mm2 \n\t" // load b=Prior(x)
4191	"paddb (%%edi,%%ecx,), %%mm7 \n\t" // add Paeth predictor with Raw(x)
4192	"punpckhbw %%mm0, %%mm3 \n\t" // unpack High bytes of c
4193	"movq %%mm7, (%%edi,%%ecx,) \n\t" // write back updated value
4194	"movq -8(%%edi,%%ecx,), %%mm1 \n\t" // read a=Raw(x-bpp) bytes
4195
4196	// do second set of 4 bytes
4197	"punpckhbw %%mm0, %%mm2 \n\t" // unpack High bytes of b
4198	"punpckhbw %%mm0, %%mm1 \n\t" // unpack High bytes of a
4199	// pav = p - a = (a + b - c) - a = b - c
4200	"movq %%mm2, %%mm4 \n\t"
4201	// pbv = p - b = (a + b - c) - b = a - c
4202	"movq %%mm1, %%mm5 \n\t"
4203	"psubw %%mm3, %%mm4 \n\t"
4204	"pxor %%mm7, %%mm7 \n\t"
4205	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
4206	"movq %%mm4, %%mm6 \n\t"
4207	"psubw %%mm3, %%mm5 \n\t"
4208	// pa = abs(p-a) = abs(pav)
4209	// pb = abs(p-b) = abs(pbv)
4210	// pc = abs(p-c) = abs(pcv)
4211	"pcmpgtw %%mm4, %%mm0 \n\t" // create mask pav bytes < 0
4212	"paddw %%mm5, %%mm6 \n\t"
4213	"pand %%mm4, %%mm0 \n\t" // only pav bytes < 0 in mm0
4214	"pcmpgtw %%mm5, %%mm7 \n\t" // create mask pbv bytes < 0
4215	"psubw %%mm0, %%mm4 \n\t"
4216	"pand %%mm5, %%mm7 \n\t" // only pbv bytes < 0 in mm7
4217	"psubw %%mm0, %%mm4 \n\t"
4218	"psubw %%mm7, %%mm5 \n\t"
4219	"pxor %%mm0, %%mm0 \n\t"
4220	"pcmpgtw %%mm6, %%mm0 \n\t" // create mask pcv bytes < 0
4221	"pand %%mm6, %%mm0 \n\t" // only pcv bytes < 0 in mm0
4222	"psubw %%mm7, %%mm5 \n\t"
4223	"psubw %%mm0, %%mm6 \n\t"
4224	// test pa <= pb
4225	"movq %%mm4, %%mm7 \n\t"
4226	"psubw %%mm0, %%mm6 \n\t"
4227	"pcmpgtw %%mm5, %%mm7 \n\t" // pa > pb?
4228	"movq %%mm7, %%mm0 \n\t"
4229	// use mm7 mask to merge pa & pb
4230	"pand %%mm7, %%mm5 \n\t"
4231	// use mm0 mask copy to merge a & b
4232	"pand %%mm0, %%mm2 \n\t"
4233	"pandn %%mm4, %%mm7 \n\t"
4234	"pandn %%mm1, %%mm0 \n\t"
4235	"paddw %%mm5, %%mm7 \n\t"
4236	"paddw %%mm2, %%mm0 \n\t"
4237	// test ((pa <= pb)? pa:pb) <= pc
4238	"pcmpgtw %%mm6, %%mm7 \n\t" // pab > pc?
4239	"pxor %%mm1, %%mm1 \n\t"
4240	"pand %%mm7, %%mm3 \n\t"
4241	"pandn %%mm0, %%mm7 \n\t"
4242	"pxor %%mm1, %%mm1 \n\t"
4243	"paddw %%mm3, %%mm7 \n\t"
4244	"pxor %%mm0, %%mm0 \n\t"
4245	// step ecx to next set of 8 bytes and repeat loop til done
4246	"addl $8, %%ecx \n\t"
4247	"packuswb %%mm7, %%mm1 \n\t"
4248	"paddb -8(%%edi,%%ecx,), %%mm1 \n\t" // add Paeth predictor with Raw(x)
4249	"cmpl _MMXLength, %%ecx \n\t"
4250	"movq %%mm1, -8(%%edi,%%ecx,) \n\t" // write back updated value
4251	// mm1 will be used as Raw(x-bpp) next loop
4252	"jb paeth_8lp \n\t"
4253
4254	: "=S" (dummy_value_S), // output regs (dummy)
4255	"=D" (dummy_value_D)
4256
4257	: "0" (prev_row), // esi // input regs
4258	"1" (row) // edi
4259
4260	: "%ecx" // clobber list
4261	#if 0 /* %mm0, ..., %mm7 not supported by gcc 2.7.2.3 or egcs 1.1 */
4262	, "%mm0", "%mm1", "%mm2", "%mm3"
4263	, "%mm4", "%mm5", "%mm6", "%mm7"
4264	#endif
4265	);
4266	}
4267	break; // end 8 bpp
4268
4269	case 1: // bpp = 1
4270	case 2: // bpp = 2
4271	default: // bpp > 8
4272	{
4273	__asm__ __volatile__ (
4274	#ifdef __PIC__
4275	"pushl %%ebx \n\t" // save Global Offset Table index
4276	#endif
4277	"movl _dif, %%ebx \n\t"
4278	"cmpl _FullLength, %%ebx \n\t"
4279	"jnb paeth_dend \n\t"
4280
4281	// preload "movl row, %%edi \n\t"
4282	// preload "movl prev_row, %%esi \n\t"
4283	// do Paeth decode for remaining bytes
4284	"movl %%ebx, %%edx \n\t"
4285	// preload "subl bpp, %%edx \n\t" // (bpp is preloaded into ecx)
4286	"subl %%ecx, %%edx \n\t" // edx = ebx - bpp
4287	"xorl %%ecx, %%ecx \n\t" // zero ecx before using cl & cx
4288
4289	"paeth_dlp: \n\t"
4290	"xorl %%eax, %%eax \n\t"
4291	// pav = p - a = (a + b - c) - a = b - c
4292	"movb (%%esi,%%ebx,), %%al \n\t" // load Prior(x) into al
4293	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
4294	"subl %%ecx, %%eax \n\t" // subtract Prior(x-bpp)
4295	"movl %%eax, _patemp \n\t" // Save pav for later use
4296	"xorl %%eax, %%eax \n\t"
4297	// pbv = p - b = (a + b - c) - b = a - c
4298	"movb (%%edi,%%edx,), %%al \n\t" // load Raw(x-bpp) into al
4299	"subl %%ecx, %%eax \n\t" // subtract Prior(x-bpp)
4300	"movl %%eax, %%ecx \n\t"
4301	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
4302	"addl _patemp, %%eax \n\t" // pcv = pav + pbv
4303	// pc = abs(pcv)
4304	"testl $0x80000000, %%eax \n\t"
4305	"jz paeth_dpca \n\t"
4306	"negl %%eax \n\t" // reverse sign of neg values
4307
4308	"paeth_dpca: \n\t"
4309	"movl %%eax, _pctemp \n\t" // save pc for later use
4310	// pb = abs(pbv)
4311	"testl $0x80000000, %%ecx \n\t"
4312	"jz paeth_dpba \n\t"
4313	"negl %%ecx \n\t" // reverse sign of neg values
4314
4315	"paeth_dpba: \n\t"
4316	"movl %%ecx, _pbtemp \n\t" // save pb for later use
4317	// pa = abs(pav)
4318	"movl _patemp, %%eax \n\t"
4319	"testl $0x80000000, %%eax \n\t"
4320	"jz paeth_dpaa \n\t"
4321	"negl %%eax \n\t" // reverse sign of neg values
4322
4323	"paeth_dpaa: \n\t"
4324	"movl %%eax, _patemp \n\t" // save pa for later use
4325	// test if pa <= pb
4326	"cmpl %%ecx, %%eax \n\t"
4327	"jna paeth_dabb \n\t"
4328	// pa > pb; now test if pb <= pc
4329	"cmpl _pctemp, %%ecx \n\t"
4330	"jna paeth_dbbc \n\t"
4331	// pb > pc; Raw(x) = Paeth(x) + Prior(x-bpp)
4332	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
4333	"jmp paeth_dpaeth \n\t"
4334
4335	"paeth_dbbc: \n\t"
4336	// pb <= pc; Raw(x) = Paeth(x) + Prior(x)
4337	"movb (%%esi,%%ebx,), %%cl \n\t" // load Prior(x) into cl
4338	"jmp paeth_dpaeth \n\t"
4339
4340	"paeth_dabb: \n\t"
4341	// pa <= pb; now test if pa <= pc
4342	"cmpl _pctemp, %%eax \n\t"
4343	"jna paeth_dabc \n\t"
4344	// pa > pc; Raw(x) = Paeth(x) + Prior(x-bpp)
4345	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
4346	"jmp paeth_dpaeth \n\t"
4347
4348	"paeth_dabc: \n\t"
4349	// pa <= pc; Raw(x) = Paeth(x) + Raw(x-bpp)
4350	"movb (%%edi,%%edx,), %%cl \n\t" // load Raw(x-bpp) into cl
4351
4352	"paeth_dpaeth: \n\t"
4353	"incl %%ebx \n\t"
4354	"incl %%edx \n\t"
4355	// Raw(x) = (Paeth(x) + Paeth_Predictor( a, b, c )) mod 256
4356	"addb %%cl, -1(%%edi,%%ebx,) \n\t"
4357	"cmpl _FullLength, %%ebx \n\t"
4358	"jb paeth_dlp \n\t"
4359
4360	"paeth_dend: \n\t"
4361	#ifdef __PIC__
4362	"popl %%ebx \n\t" // index to Global Offset Table
4363	#endif
4364
4365	: "=c" (dummy_value_c), // output regs (dummy)
4366	"=S" (dummy_value_S),
4367	"=D" (dummy_value_D)
4368
4369	: "0" (bpp), // ecx // input regs
4370	"1" (prev_row), // esi
4371	"2" (row) // edi
4372
4373	: "%eax", "%edx" // clobber list
4374	#ifndef __PIC__
4375	, "%ebx"
4376	#endif
4377	);
4378	}
4379	return; // No need to go further with this one
4380
4381	} // end switch (bpp)
4382
4383	__asm__ __volatile__ (
4384	// MMX acceleration complete; now do clean-up
4385	// check if any remaining bytes left to decode
4386	#ifdef __PIC__
4387	"pushl %%ebx \n\t" // save index to Global Offset Table
4388	#endif
4389	"movl _MMXLength, %%ebx \n\t"
4390	"cmpl _FullLength, %%ebx \n\t"
4391	"jnb paeth_end \n\t"
4392	//pre "movl row, %%edi \n\t"
4393	//pre "movl prev_row, %%esi \n\t"
4394	// do Paeth decode for remaining bytes
4395	"movl %%ebx, %%edx \n\t"
4396	//pre "subl bpp, %%edx \n\t" // (bpp is preloaded into ecx)
4397	"subl %%ecx, %%edx \n\t" // edx = ebx - bpp
4398	"xorl %%ecx, %%ecx \n\t" // zero ecx before using cl & cx below
4399
4400	"paeth_lp2: \n\t"
4401	"xorl %%eax, %%eax \n\t"
4402	// pav = p - a = (a + b - c) - a = b - c
4403	"movb (%%esi,%%ebx,), %%al \n\t" // load Prior(x) into al
4404	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
4405	"subl %%ecx, %%eax \n\t" // subtract Prior(x-bpp)
4406	"movl %%eax, _patemp \n\t" // Save pav for later use
4407	"xorl %%eax, %%eax \n\t"
4408	// pbv = p - b = (a + b - c) - b = a - c
4409	"movb (%%edi,%%edx,), %%al \n\t" // load Raw(x-bpp) into al
4410	"subl %%ecx, %%eax \n\t" // subtract Prior(x-bpp)
4411	"movl %%eax, %%ecx \n\t"
4412	// pcv = p - c = (a + b - c) -c = (a - c) + (b - c) = pav + pbv
4413	"addl _patemp, %%eax \n\t" // pcv = pav + pbv
4414	// pc = abs(pcv)
4415	"testl $0x80000000, %%eax \n\t"
4416	"jz paeth_pca2 \n\t"
4417	"negl %%eax \n\t" // reverse sign of neg values
4418
4419	"paeth_pca2: \n\t"
4420	"movl %%eax, _pctemp \n\t" // save pc for later use
4421	// pb = abs(pbv)
4422	"testl $0x80000000, %%ecx \n\t"
4423	"jz paeth_pba2 \n\t"
4424	"negl %%ecx \n\t" // reverse sign of neg values
4425
4426	"paeth_pba2: \n\t"
4427	"movl %%ecx, _pbtemp \n\t" // save pb for later use
4428	// pa = abs(pav)
4429	"movl _patemp, %%eax \n\t"
4430	"testl $0x80000000, %%eax \n\t"
4431	"jz paeth_paa2 \n\t"
4432	"negl %%eax \n\t" // reverse sign of neg values
4433
4434	"paeth_paa2: \n\t"
4435	"movl %%eax, _patemp \n\t" // save pa for later use
4436	// test if pa <= pb
4437	"cmpl %%ecx, %%eax \n\t"
4438	"jna paeth_abb2 \n\t"
4439	// pa > pb; now test if pb <= pc
4440	"cmpl _pctemp, %%ecx \n\t"
4441	"jna paeth_bbc2 \n\t"
4442	// pb > pc; Raw(x) = Paeth(x) + Prior(x-bpp)
4443	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
4444	"jmp paeth_paeth2 \n\t"
4445
4446	"paeth_bbc2: \n\t"
4447	// pb <= pc; Raw(x) = Paeth(x) + Prior(x)
4448	"movb (%%esi,%%ebx,), %%cl \n\t" // load Prior(x) into cl
4449	"jmp paeth_paeth2 \n\t"
4450
4451	"paeth_abb2: \n\t"
4452	// pa <= pb; now test if pa <= pc
4453	"cmpl _pctemp, %%eax \n\t"
4454	"jna paeth_abc2 \n\t"
4455	// pa > pc; Raw(x) = Paeth(x) + Prior(x-bpp)
4456	"movb (%%esi,%%edx,), %%cl \n\t" // load Prior(x-bpp) into cl
4457	"jmp paeth_paeth2 \n\t"
4458
4459	"paeth_abc2: \n\t"
4460	// pa <= pc; Raw(x) = Paeth(x) + Raw(x-bpp)
4461	"movb (%%edi,%%edx,), %%cl \n\t" // load Raw(x-bpp) into cl
4462
4463	"paeth_paeth2: \n\t"
4464	"incl %%ebx \n\t"
4465	"incl %%edx \n\t"
4466	// Raw(x) = (Paeth(x) + Paeth_Predictor( a, b, c )) mod 256
4467	"addb %%cl, -1(%%edi,%%ebx,) \n\t"
4468	"cmpl _FullLength, %%ebx \n\t"
4469	"jb paeth_lp2 \n\t"
4470
4471	"paeth_end: \n\t"
4472	"EMMS \n\t" // end MMX; prep for poss. FP instrs.
4473	#ifdef __PIC__
4474	"popl %%ebx \n\t" // restore index to Global Offset Table
4475	#endif
4476
4477	: "=c" (dummy_value_c), // output regs (dummy)
4478	"=S" (dummy_value_S),
4479	"=D" (dummy_value_D)
4480
4481	: "0" (bpp), // ecx // input regs
4482	"1" (prev_row), // esi
4483	"2" (row) // edi
4484
4485	: "%eax", "%edx" // clobber list (no input regs!)
4486	#ifndef __PIC__
4487	, "%ebx"
4488	#endif
4489	);
4490
4491	} /* end png_read_filter_row_mmx_paeth() */
4492	#endif
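
// Reader's note, not part of the original libpng source: the MMX paths above
// interleave the Paeth arithmetic across several registers, which makes the
// underlying rule hard to see. The block below is a minimal scalar sketch of
// the same rule, following the comments in the assembler (a = Raw(x-bpp),
// b = Prior(x), c = Prior(x-bpp), pav = b - c, pbv = a - c, pcv = pav + pbv).
// The names example_paeth_predictor and example_unfilter_paeth are
// hypothetical, and treating a and c as 0 for the first bpp bytes is an
// assumption taken from the PNG filter definition, not from this file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not a libpng function): Paeth predictor for one byte.
 * a = Raw(x-bpp), b = Prior(x), c = Prior(x-bpp). */
static unsigned char
example_paeth_predictor(unsigned char a, unsigned char b, unsigned char c)
{
   int pav = (int)b - (int)c;   /* p - a, where p = a + b - c */
   int pbv = (int)a - (int)c;   /* p - b */
   int pcv = pav + pbv;         /* p - c */
   int pa = abs(pav);
   int pb = abs(pbv);
   int pc = abs(pcv);

   /* Ties are broken in the order a, b, c. */
   if (pa <= pb && pa <= pc)
      return a;
   if (pb <= pc)
      return b;
   return c;
}

/* Hypothetical helper: undo the Paeth filter on one row, in place.
 * Assumption: bytes left of the row start count as 0 (a = c = 0). */
static void
example_unfilter_paeth(unsigned char *row, const unsigned char *prev_row,
                       size_t rowbytes, size_t bpp)
{
   size_t i;

   assert(bpp > 0);
   for (i = 0; i < rowbytes; i++)
   {
      unsigned char a = (i >= bpp) ? row[i - bpp] : 0;
      unsigned char c = (i >= bpp) ? prev_row[i - bpp] : 0;

      /* Raw(x) = (Paeth(x) + predictor) mod 256 */
      row[i] = (unsigned char)(row[i] +
                               example_paeth_predictor(a, prev_row[i], c));
   }
}
```

// A sketch like this can serve as a byte-for-byte reference when checking
// the output of the assembler paths on the same input row.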
4493
4494
4495
4496
4497	#ifdef PNG_THREAD_UNSAFE_OK
4498	//===========================================================================//
4499	// //
4500	// P N G _ R E A D _ F I L T E R _ R O W _ M M X _ S U B //
4501	// //
4502	//===========================================================================//
4503
4504	// Optimized code for PNG Sub filter decoder
4505
4506	static void /* PRIVATE */
4507	png_read_filter_row_mmx_sub(png_row_infop row_info, png_bytep row)
4508	{
4509	int bpp;
4510	int dummy_value_a;
4511	int dummy_value_D;
4512
4513	bpp = (row_info->pixel_depth + 7) >> 3; // calc number of bytes per pixel
4514	_FullLength = row_info->rowbytes - bpp; // number of bytes to filter
4515
4516	__asm__ __volatile__ (
4517	//pre "movl row, %%edi \n\t"
4518	"movl %%edi, %%esi \n\t" // lp = row
4519	//pre "movl bpp, %%eax \n\t"
4520	"addl %%eax, %%edi \n\t" // rp = row + bpp
4521	//irr "xorl %%eax, %%eax \n\t"
4522	// get # of bytes to alignment
4523	"movl %%edi, _dif \n\t" // take start of row
4524	"addl $0xf, _dif \n\t" // add 7 + 8 to incr past
4525	// alignment boundary
4526	"xorl %%ecx, %%ecx \n\t"
4527	"andl $0xfffffff8, _dif \n\t" // mask to alignment boundary
4528	"subl %%edi, _dif \n\t" // subtract from start ==> value
4529	"jz sub_go \n\t" // ecx at alignment
4530
4531	"sub_lp1: \n\t" // fix alignment
4532	"movb (%%esi,%%ecx,), %%al \n\t"
4533	"addb %%al, (%%edi,%%ecx,) \n\t"
4534	"incl %%ecx \n\t"
4535	"cmpl _dif, %%ecx \n\t"
4536	"jb sub_lp1 \n\t"
4537
4538	"sub_go: \n\t"
4539	"movl _FullLength, %%eax \n\t"
4540	"movl %%eax, %%edx \n\t"
4541	"subl %%ecx, %%edx \n\t" // subtract alignment fix
4542	"andl $0x00000007, %%edx \n\t" // calc bytes over mult of 8
4543	"subl %%edx, %%eax \n\t" // drop over bytes from length
4544	"movl %%eax, _MMXLength \n\t"
4545
4546	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4547	"=D" (dummy_value_D) // 1
4548
4549	: "0" (bpp), // eax // input regs
4550	"1" (row) // edi
4551
4552	: "%esi", "%ecx", "%edx" // clobber list
4553
4554	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
4555	, "%mm0", "%mm1", "%mm2", "%mm3"
4556	, "%mm4", "%mm5", "%mm6", "%mm7"
4557	#endif
4558	);
4559
4560	// now do the math for the rest of the row
4561	switch (bpp)
4562	{
4563	case 3:
4564	{
4565	_ActiveMask.use = 0x0000ffffff000000LL;
4566	_ShiftBpp.use = 24; // == 3 * 8
4567	_ShiftRem.use = 40; // == 64 - 24
4568
4569	__asm__ __volatile__ (
4570	// preload "movl row, %%edi \n\t"
4571	"movq _ActiveMask, %%mm7 \n\t" // load _ActiveMask for 2nd
4572	// active byte group
4573	"movl %%edi, %%esi \n\t" // lp = row
4574	// preload "movl bpp, %%eax \n\t"
4575	"addl %%eax, %%edi \n\t" // rp = row + bpp
4576	"movq %%mm7, %%mm6 \n\t"
4577	"movl _dif, %%edx \n\t"
4578	"psllq _ShiftBpp, %%mm6 \n\t" // move mask in mm6 to cover
4579	// 3rd active byte group
4580	// prime the pump: load the first Raw(x-bpp) data set
4581	"movq -8(%%edi,%%edx,), %%mm1 \n\t"
4582
4583	"sub_3lp: \n\t" // shift data for adding first
4584	"psrlq _ShiftRem, %%mm1 \n\t" // bpp bytes (no need for mask;
4585	// shift clears inactive bytes)
4586	// add 1st active group
4587	"movq (%%edi,%%edx,), %%mm0 \n\t"
4588	"paddb %%mm1, %%mm0 \n\t"
4589
4590	// add 2nd active group
4591	"movq %%mm0, %%mm1 \n\t" // mov updated Raws to mm1
4592	"psllq _ShiftBpp, %%mm1 \n\t" // shift data to pos. correctly
4593	"pand %%mm7, %%mm1 \n\t" // mask to use 2nd active group
4594	"paddb %%mm1, %%mm0 \n\t"
4595
4596	// add 3rd active group
4597	"movq %%mm0, %%mm1 \n\t" // mov updated Raws to mm1
4598	"psllq _ShiftBpp, %%mm1 \n\t" // shift data to pos. correctly
4599	"pand %%mm6, %%mm1 \n\t" // mask to use 3rd active group
4600	"addl $8, %%edx \n\t"
4601	"paddb %%mm1, %%mm0 \n\t"
4602
4603	"cmpl _MMXLength, %%edx \n\t"
4604	"movq %%mm0, -8(%%edi,%%edx,) \n\t" // write updated Raws to array
4605	"movq %%mm0, %%mm1 \n\t" // prep 1st add at top of loop
4606	"jb sub_3lp \n\t"
4607
4608	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4609	"=D" (dummy_value_D) // 1
4610
4611	: "0" (bpp), // eax // input regs
4612	"1" (row) // edi
4613
4614	: "%edx", "%esi" // clobber list
4615	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
4616	, "%mm0", "%mm1", "%mm6", "%mm7"
4617	#endif
4618	);
4619	}
4620	break;
4621
4622	case 1:
4623	{
4624	__asm__ __volatile__ (
4625	"movl _dif, %%edx \n\t"
4626	// preload "movl row, %%edi \n\t"
4627	"cmpl _FullLength, %%edx \n\t"
4628	"jnb sub_1end \n\t"
4629	"movl %%edi, %%esi \n\t" // lp = row
4630	// preload "movl bpp, %%eax \n\t"
4631	"addl %%eax, %%edi \n\t" // rp = row + bpp
4632	"xorl %%eax, %%eax \n\t" // clear eax only after bpp has been used
4633
4634	"sub_1lp: \n\t"
4635	"movb (%%esi,%%edx,), %%al \n\t"
4636	"addb %%al, (%%edi,%%edx,) \n\t"
4637	"incl %%edx \n\t"
4638	"cmpl _FullLength, %%edx \n\t"
4639	"jb sub_1lp \n\t"
4640
4641	"sub_1end: \n\t"
4642
4643	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4644	"=D" (dummy_value_D) // 1
4645
4646	: "0" (bpp), // eax // input regs
4647	"1" (row) // edi
4648
4649	: "%edx", "%esi" // clobber list
4650	);
4651	}
4652	return;
4653
4654	case 6:
4655	case 4:
4656	//case 7: // GRR BOGUS
4657	//case 5: // GRR BOGUS
4658	{
4659	_ShiftBpp.use = bpp << 3;
4660	_ShiftRem.use = 64 - _ShiftBpp.use;
4661
4662	__asm__ __volatile__ (
4663	// preload "movl row, %%edi \n\t"
4664	"movl _dif, %%edx \n\t"
4665	"movl %%edi, %%esi \n\t" // lp = row
4666	// preload "movl bpp, %%eax \n\t"
4667	"addl %%eax, %%edi \n\t" // rp = row + bpp
4668
4669	// prime the pump: load the first Raw(x-bpp) data set
4670	"movq -8(%%edi,%%edx,), %%mm1 \n\t"
4671
4672	"sub_4lp: \n\t" // shift data for adding first
4673	"psrlq _ShiftRem, %%mm1 \n\t" // bpp bytes (no need for mask;
4674	// shift clears inactive bytes)
4675	"movq (%%edi,%%edx,), %%mm0 \n\t"
4676	"paddb %%mm1, %%mm0 \n\t"
4677
4678	// add 2nd active group
4679	"movq %%mm0, %%mm1 \n\t" // mov updated Raws to mm1
4680	"psllq _ShiftBpp, %%mm1 \n\t" // shift data to pos. correctly
4681	"addl $8, %%edx \n\t"
4682	"paddb %%mm1, %%mm0 \n\t"
4683
4684	"cmpl _MMXLength, %%edx \n\t"
4685	"movq %%mm0, -8(%%edi,%%edx,) \n\t"
4686	"movq %%mm0, %%mm1 \n\t" // prep 1st add at top of loop
4687	"jb sub_4lp \n\t"
4688
4689	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4690	"=D" (dummy_value_D) // 1
4691
4692	: "0" (bpp), // eax // input regs
4693	"1" (row) // edi
4694
4695	: "%edx", "%esi" // clobber list
4696	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
4697	, "%mm0", "%mm1"
4698	#endif
4699	);
4700	}
4701	break;
4702
4703	case 2:
4704	{
4705	_ActiveMask.use = 0x00000000ffff0000LL;
4706	_ShiftBpp.use = 16; // == 2 * 8
4707	_ShiftRem.use = 48; // == 64 - 16
4708
4709	__asm__ __volatile__ (
4710	"movq _ActiveMask, %%mm7 \n\t" // load _ActiveMask for 2nd
4711	// active byte group
4712	"movl _dif, %%edx \n\t"
4713	"movq %%mm7, %%mm6 \n\t"
4714	// preload "movl row, %%edi \n\t"
4715	"psllq _ShiftBpp, %%mm6 \n\t" // move mask in mm6 to cover
4716	// 3rd active byte group
4717	"movl %%edi, %%esi \n\t" // lp = row
4718	"movq %%mm6, %%mm5 \n\t"
4719	// preload "movl bpp, %%eax \n\t"
4720	"addl %%eax, %%edi \n\t" // rp = row + bpp
4721	"psllq _ShiftBpp, %%mm5 \n\t" // move mask in mm5 to cover
4722	// 4th active byte group
4723	// prime the pump: load the first Raw(x-bpp) data set
4724	"movq -8(%%edi,%%edx,), %%mm1 \n\t"
4725
4726	"sub_2lp: \n\t" // shift data for adding first
4727	"psrlq _ShiftRem, %%mm1 \n\t" // bpp bytes (no need for mask;
4728	// shift clears inactive bytes)
4729	// add 1st active group
4730	"movq (%%edi,%%edx,), %%mm0 \n\t"
4731	"paddb %%mm1, %%mm0 \n\t"
4732
4733	// add 2nd active group
4734	"movq %%mm0, %%mm1 \n\t" // mov updated Raws to mm1
4735	"psllq _ShiftBpp, %%mm1 \n\t" // shift data to pos. correctly
4736	"pand %%mm7, %%mm1 \n\t" // mask to use 2nd active group
4737	"paddb %%mm1, %%mm0 \n\t"
4738
4739	// add 3rd active group
4740	"movq %%mm0, %%mm1 \n\t" // mov updated Raws to mm1
4741	"psllq _ShiftBpp, %%mm1 \n\t" // shift data to pos. correctly
4742	"pand %%mm6, %%mm1 \n\t" // mask to use 3rd active group
4743	"paddb %%mm1, %%mm0 \n\t"
4744
4745	// add 4th active group
4746	"movq %%mm0, %%mm1 \n\t" // mov updated Raws to mm1
4747	"psllq _ShiftBpp, %%mm1 \n\t" // shift data to pos. correctly
4748	"pand %%mm5, %%mm1 \n\t" // mask to use 4th active group
4749	"addl $8, %%edx \n\t"
4750	"paddb %%mm1, %%mm0 \n\t"
4751	"cmpl _MMXLength, %%edx \n\t"
4752	"movq %%mm0, -8(%%edi,%%edx,) \n\t" // write updated Raws to array
4753	"movq %%mm0, %%mm1 \n\t" // prep 1st add at top of loop
4754	"jb sub_2lp \n\t"
4755
4756	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4757	"=D" (dummy_value_D) // 1
4758
4759	: "0" (bpp), // eax // input regs
4760	"1" (row) // edi
4761
4762	: "%edx", "%esi" // clobber list
4763	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
4764	, "%mm0", "%mm1", "%mm5", "%mm6", "%mm7"
4765	#endif
4766	);
4767	}
4768	break;
4769
4770	case 8:
4771	{
4772	__asm__ __volatile__ (
4773	// preload "movl row, %%edi \n\t"
4774	"movl _dif, %%edx \n\t"
4775	"movl %%edi, %%esi \n\t" // lp = row
4776	// preload "movl bpp, %%eax \n\t"
4777	"addl %%eax, %%edi \n\t" // rp = row + bpp
4778	"movl _MMXLength, %%ecx \n\t"
4779
4780	// prime the pump: load the first Raw(x-bpp) data set
4781	"movq -8(%%edi,%%edx,), %%mm7 \n\t"
4782	"andl $0x0000003f, %%ecx \n\t" // calc bytes over mult of 64
4783
4784	"sub_8lp: \n\t"
4785	"movq (%%edi,%%edx,), %%mm0 \n\t" // load Sub(x) for 1st 8 bytes
4786	"paddb %%mm7, %%mm0 \n\t"
4787	"movq 8(%%edi,%%edx,), %%mm1 \n\t" // load Sub(x) for 2nd 8 bytes
4788	"movq %%mm0, (%%edi,%%edx,) \n\t" // write Raw(x) for 1st 8 bytes
4789
4790	// Now mm0 will be used as Raw(x-bpp) for the 2nd group of 8 bytes.
4791	// This will be repeated for each group of 8 bytes with the 8th
4792	// group being used as the Raw(x-bpp) for the 1st group of the
4793	// next loop.
4794
4795	"paddb %%mm0, %%mm1 \n\t"
4796	"movq 16(%%edi,%%edx,), %%mm2 \n\t" // load Sub(x) for 3rd 8 bytes
4797	"movq %%mm1, 8(%%edi,%%edx,) \n\t" // write Raw(x) for 2nd 8 bytes
4798	"paddb %%mm1, %%mm2 \n\t"
4799	"movq 24(%%edi,%%edx,), %%mm3 \n\t" // load Sub(x) for 4th 8 bytes
4800	"movq %%mm2, 16(%%edi,%%edx,) \n\t" // write Raw(x) for 3rd 8 bytes
4801	"paddb %%mm2, %%mm3 \n\t"
4802	"movq 32(%%edi,%%edx,), %%mm4 \n\t" // load Sub(x) for 5th 8 bytes
4803	"movq %%mm3, 24(%%edi,%%edx,) \n\t" // write Raw(x) for 4th 8 bytes
4804	"paddb %%mm3, %%mm4 \n\t"
4805	"movq 40(%%edi,%%edx,), %%mm5 \n\t" // load Sub(x) for 6th 8 bytes
4806	"movq %%mm4, 32(%%edi,%%edx,) \n\t" // write Raw(x) for 5th 8 bytes
4807	"paddb %%mm4, %%mm5 \n\t"
4808	"movq 48(%%edi,%%edx,), %%mm6 \n\t" // load Sub(x) for 7th 8 bytes
4809	"movq %%mm5, 40(%%edi,%%edx,) \n\t" // write Raw(x) for 6th 8 bytes
4810	"paddb %%mm5, %%mm6 \n\t"
4811	"movq 56(%%edi,%%edx,), %%mm7 \n\t" // load Sub(x) for 8th 8 bytes
4812	"movq %%mm6, 48(%%edi,%%edx,) \n\t" // write Raw(x) for 7th 8 bytes
4813	"addl $64, %%edx \n\t"
4814	"paddb %%mm6, %%mm7 \n\t"
4815	"cmpl %%ecx, %%edx \n\t"
4816	"movq %%mm7, -8(%%edi,%%edx,) \n\t" // write Raw(x) for 8th 8 bytes
4817	"jb sub_8lp \n\t"
4818
4819	"cmpl _MMXLength, %%edx \n\t"
4820	"jnb sub_8lt8 \n\t"
4821
4822	"sub_8lpA: \n\t"
4823	"movq (%%edi,%%edx,), %%mm0 \n\t"
4824	"addl $8, %%edx \n\t"
4825	"paddb %%mm7, %%mm0 \n\t"
4826	"cmpl _MMXLength, %%edx \n\t"
4827	"movq %%mm0, -8(%%edi,%%edx,) \n\t" // -8 to offset early addl edx
4828	"movq %%mm0, %%mm7 \n\t" // move calculated Raw(x) data
4829	// to mm7 to be new Raw(x-bpp)
4830	// for next loop
4831	"jb sub_8lpA \n\t"
4832
4833	"sub_8lt8: \n\t"
4834
4835	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4836	"=D" (dummy_value_D) // 1
4837
4838	: "0" (bpp), // eax // input regs
4839	"1" (row) // edi
4840
4841	: "%ecx", "%edx", "%esi" // clobber list
4842	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
4843	, "%mm0", "%mm1", "%mm2", "%mm3", "%mm4", "%mm5", "%mm6", "%mm7"
4844	#endif
4845	);
4846	}
4847	break;
4848
4849	default: // bpp greater than 8 bytes GRR BOGUS
4850	{
4851	__asm__ __volatile__ (
4852	"movl _dif, %%edx \n\t"
4853	// preload "movl row, %%edi \n\t"
4854	"movl %%edi, %%esi \n\t" // lp = row
4855	// preload "movl bpp, %%eax \n\t"
4856	"addl %%eax, %%edi \n\t" // rp = row + bpp
4857
4858	"sub_Alp: \n\t"
4859	"movq (%%edi,%%edx,), %%mm0 \n\t"
4860	"movq (%%esi,%%edx,), %%mm1 \n\t"
4861	"addl $8, %%edx \n\t"
4862	"paddb %%mm1, %%mm0 \n\t"
4863	"cmpl _MMXLength, %%edx \n\t"
4864	"movq %%mm0, -8(%%edi,%%edx,) \n\t" // mov does not affect flags;
4865	// -8 to offset addl edx
4866	"jb sub_Alp \n\t"
4867
4868	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4869	"=D" (dummy_value_D) // 1
4870
4871	: "0" (bpp), // eax // input regs
4872	"1" (row) // edi
4873
4874	: "%edx", "%esi" // clobber list
4875	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
4876	, "%mm0", "%mm1"
4877	#endif
4878	);
4879	}
4880	break;
4881
4882	} // end switch (bpp)
4883
4884	__asm__ __volatile__ (
4885	"movl _MMXLength, %%edx \n\t"
4886	//pre "movl row, %%edi \n\t"
4887	"cmpl _FullLength, %%edx \n\t"
4888	"jnb sub_end \n\t"
4889
4890	"movl %%edi, %%esi \n\t" // lp = row
4891	//pre "movl bpp, %%eax \n\t"
4892	"addl %%eax, %%edi \n\t" // rp = row + bpp
4893	"xorl %%eax, %%eax \n\t"
4894
4895	"sub_lp2: \n\t"
4896	"movb (%%esi,%%edx,), %%al \n\t"
4897	"addb %%al, (%%edi,%%edx,) \n\t"
4898	"incl %%edx \n\t"
4899	"cmpl _FullLength, %%edx \n\t"
4900	"jb sub_lp2 \n\t"
4901
4902	"sub_end: \n\t"
4903	"EMMS \n\t" // end MMX instructions
4904
4905	: "=a" (dummy_value_a), // 0 // output regs (dummy)
4906	"=D" (dummy_value_D) // 1
4907
4908	: "0" (bpp), // eax // input regs
4909	"1" (row) // edi
4910
4911	: "%edx", "%esi" // clobber list
4912	);
4913
4914	} // end of png_read_filter_row_mmx_sub()
4915	#endif
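
// Reader's note, not part of the original libpng source: a minimal scalar
// sketch of what png_read_filter_row_mmx_sub() appears to compute, plus the
// arithmetic that splits the row into an alignment prefix, an MMX body and a
// cleanup tail (_dif, _MMXLength, _FullLength in the assembler). All names
// beginning with example_ are hypothetical. Offsets in the split are
// relative to rp = row + bpp, as in the assembler, and the sketch assumes
// the row is long enough that full_length >= dif.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libpng function): undo the Sub filter in place.
 * Raw(x) = (Sub(x) + Raw(x-bpp)) mod 256; the first bpp bytes are unchanged. */
static void
example_unfilter_sub(unsigned char *row, size_t rowbytes, size_t bpp)
{
   size_t i;

   assert(bpp > 0);
   for (i = bpp; i < rowbytes; i++)
      row[i] = (unsigned char)(row[i] + row[i - bpp]);
}

/* Byte ranges relative to rp = row + bpp:
 *   [0, dif)                    handled one byte at a time (sub_lp1)
 *   [dif, mmx_length)           handled 8 bytes at a time (MMX loops)
 *   [mmx_length, full_length)   handled one byte at a time (sub_lp2) */
typedef struct
{
   size_t dif;
   size_t mmx_length;
} example_sub_split;

/* rp_addr is the address of rp; full_length is rowbytes - bpp. */
static example_sub_split
example_sub_split_lengths(uintptr_t rp_addr, size_t full_length)
{
   example_sub_split s;

   /* Add 7 + 8 and mask to an 8-byte boundary: dif is always in 8..15. */
   s.dif = (size_t)(((rp_addr + 0xf) & ~(uintptr_t)7) - rp_addr);

   /* Assumption of this sketch: the row is at least dif bytes long. */
   assert(full_length >= s.dif);

   /* Drop the bytes that do not fill a whole 8-byte group. */
   s.mmx_length = full_length - ((full_length - s.dif) & 7);
   return s;
}
```

// For example, with rp at address 0x1003 and 100 bytes to filter, the sketch
// gives dif = 13 and mmx_length = 93, so the MMX body covers 80 bytes.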
4916
4917
4918
4919
4920	//===========================================================================//
4921	// //
4922	// P N G _ R E A D _ F I L T E R _ R O W _ M M X _ U P //
4923	// //
4924	//===========================================================================//
4925
4926	// Optimized code for PNG Up filter decoder
4927
4928	static void /* PRIVATE */
4929	png_read_filter_row_mmx_up(png_row_infop row_info, png_bytep row,
4930	png_bytep prev_row)
4931	{
4932	png_uint_32 len;
4933	int dummy_value_d; // fix 'forbidden register 3 (dx) was spilled' error
4934	int dummy_value_S;
4935	int dummy_value_D;
4936
4937	len = row_info->rowbytes; // number of bytes to filter
4938
4939	__asm__ __volatile__ (
4940	//pre "movl row, %%edi \n\t"
4941	// get # of bytes to alignment
4942	#ifdef __PIC__
4943	"pushl %%ebx \n\t"
4944	#endif
4945	"movl %%edi, %%ecx \n\t"
4946	"xorl %%ebx, %%ebx \n\t"
4947	"addl $0x7, %%ecx \n\t"
4948	"xorl %%eax, %%eax \n\t"
4949	"andl $0xfffffff8, %%ecx \n\t"
4950	//pre "movl prev_row, %%esi \n\t"
4951	"subl %%edi, %%ecx \n\t"
4952	"jz up_go \n\t"
4953
4954	"up_lp1: \n\t" // fix alignment
4955	"movb (%%edi,%%ebx,), %%al \n\t"
4956	"addb (%%esi,%%ebx,), %%al \n\t"
4957	"incl %%ebx \n\t"
4958	"cmpl %%ecx, %%ebx \n\t"
4959	"movb %%al, -1(%%edi,%%ebx,) \n\t" // mov does not affect flags; -1 to
4960	"jb up_lp1 \n\t" // offset incl ebx
4961
4962	"up_go: \n\t"
4963	//pre "movl len, %%edx \n\t"
4964	"movl %%edx, %%ecx \n\t"
4965	"subl %%ebx, %%edx \n\t" // subtract alignment fix
4966	"andl $0x0000003f, %%edx \n\t" // calc bytes over mult of 64
4967	"subl %%edx, %%ecx \n\t" // drop over bytes from length
4968
4969	// unrolled loop - use all MMX registers and interleave to reduce
4970	// number of branch instructions (loops) and reduce partial stalls
4971	"up_loop: \n\t"
4972	"movq (%%esi,%%ebx,), %%mm1 \n\t"
4973	"movq (%%edi,%%ebx,), %%mm0 \n\t"
4974	"movq 8(%%esi,%%ebx,), %%mm3 \n\t"
4975	"paddb %%mm1, %%mm0 \n\t"
4976	"movq 8(%%edi,%%ebx,), %%mm2 \n\t"
4977	"movq %%mm0, (%%edi,%%ebx,) \n\t"
4978	"paddb %%mm3, %%mm2 \n\t"
4979	"movq 16(%%esi,%%ebx,), %%mm5 \n\t"
4980	"movq %%mm2, 8(%%edi,%%ebx,) \n\t"
4981	"movq 16(%%edi,%%ebx,), %%mm4 \n\t"
4982	"movq 24(%%esi,%%ebx,), %%mm7 \n\t"
4983	"paddb %%mm5, %%mm4 \n\t"
4984	"movq 24(%%edi,%%ebx,), %%mm6 \n\t"
4985	"movq %%mm4, 16(%%edi,%%ebx,) \n\t"
4986	"paddb %%mm7, %%mm6 \n\t"
4987	"movq 32(%%esi,%%ebx,), %%mm1 \n\t"
4988	"movq %%mm6, 24(%%edi,%%ebx,) \n\t"
4989	"movq 32(%%edi,%%ebx,), %%mm0 \n\t"
4990	"movq 40(%%esi,%%ebx,), %%mm3 \n\t"
4991	"paddb %%mm1, %%mm0 \n\t"
4992	"movq 40(%%edi,%%ebx,), %%mm2 \n\t"
4993	"movq %%mm0, 32(%%edi,%%ebx,) \n\t"
4994	"paddb %%mm3, %%mm2 \n\t"
4995	"movq 48(%%esi,%%ebx,), %%mm5 \n\t"
4996	"movq %%mm2, 40(%%edi,%%ebx,) \n\t"
4997	"movq 48(%%edi,%%ebx,), %%mm4 \n\t"
4998	"movq 56(%%esi,%%ebx,), %%mm7 \n\t"
4999	"paddb %%mm5, %%mm4 \n\t"
5000	"movq 56(%%edi,%%ebx,), %%mm6 \n\t"
5001	"movq %%mm4, 48(%%edi,%%ebx,) \n\t"
5002	"addl $64, %%ebx \n\t"
5003	"paddb %%mm7, %%mm6 \n\t"
5004	"cmpl %%ecx, %%ebx \n\t"
5005	"movq %%mm6, -8(%%edi,%%ebx,) \n\t" // (+56)movq does not affect flags;
5006	"jb up_loop \n\t" // -8 to offset addl ebx
5007
5008	"cmpl $0, %%edx \n\t" // test for bytes over mult of 64
5009	"jz up_end \n\t"
5010
5011	"cmpl $8, %%edx \n\t" // test for less than 8 bytes
5012	"jb up_lt8 \n\t" // [added by lcreeve at netins.net]
5013
5014	"addl %%edx, %%ecx \n\t"
5015	"andl $0x00000007, %%edx \n\t" // calc bytes over mult of 8
5016	"subl %%edx, %%ecx \n\t" // drop over bytes from length
5017	"jz up_lt8 \n\t"
5018
5019	"up_lpA: \n\t" // use MMX regs to update 8 bytes sim.
5020	"movq (%%esi,%%ebx,), %%mm1 \n\t"
5021	"movq (%%edi,%%ebx,), %%mm0 \n\t"
5022	"addl $8, %%ebx \n\t"
5023	"paddb %%mm1, %%mm0 \n\t"
5024	"cmpl %%ecx, %%ebx \n\t"
5025	"movq %%mm0, -8(%%edi,%%ebx,) \n\t" // movq does not affect flags; -8 to
5026	"jb up_lpA \n\t" // offset add ebx
5027	"cmpl $0, %%edx \n\t" // test for bytes over mult of 8
5028	"jz up_end \n\t"
5029
5030	"up_lt8: \n\t"
5031	"xorl %%eax, %%eax \n\t"
5032	"addl %%edx, %%ecx \n\t" // move over byte count into counter
5033
5034	"up_lp2: \n\t" // use x86 regs for remaining bytes
5035	"movb (%%edi,%%ebx,), %%al \n\t"
5036	"addb (%%esi,%%ebx,), %%al \n\t"
5037	"incl %%ebx \n\t"
5038	"cmpl %%ecx, %%ebx \n\t"
5039	"movb %%al, -1(%%edi,%%ebx,) \n\t" // mov does not affect flags; -1 to
5040	"jb up_lp2 \n\t" // offset inc ebx
5041
5042	"up_end: \n\t"
5043	"EMMS \n\t" // conversion of filtered row complete
5044	#ifdef __PIC__
5045	"popl %%ebx \n\t"
5046	#endif
5047
5048	: "=d" (dummy_value_d), // 0 // output regs (dummy)
5049	"=S" (dummy_value_S), // 1
5050	"=D" (dummy_value_D) // 2
5051
5052	: "0" (len), // edx // input regs
5053	"1" (prev_row), // esi
5054	"2" (row) // edi
5055
5056	: "%eax", "%ecx" // clobber list (no input regs!)
5057	#ifndef __PIC__
5058	, "%ebx"
5059	#endif
5060
5061	#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
5062	, "%mm0", "%mm1", "%mm2", "%mm3"
5063	, "%mm4", "%mm5", "%mm6", "%mm7"
5064	#endif
5065	);
5066
5067	} // end of png_read_filter_row_mmx_up()
5068
5069	#endif /* PNG_MMX_CODE_SUPPORTED */
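
The assembly in png_read_filter_row_mmx_up() above splits the row into an alignment prologue, unrolled 64-byte blocks, 8-byte blocks, and a byte-at-a-time tail. The following is a minimal portable C sketch of that partitioning only, not the libpng implementation: the names `add_bytes` and `up_filter_sketch` are hypothetical, plain byte loops stand in for `paddb`, and the loops are top-tested so that any row length is safe, which does not reproduce the assembly's exact control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libpng): add n bytes of prev to row,
 * modulo 256.  Stands in for one or more paddb instructions. */
static void add_bytes(unsigned char *row, const unsigned char *prev, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        row[i] = (unsigned char)((row[i] + prev[i]) & 0xff);
}

/* Sketch of the Up-filter partitioning: prologue, 64-byte blocks,
 * 8-byte blocks, tail.  Assumption: same result as the simple C fallback. */
static void up_filter_sketch(unsigned char *row, const unsigned char *prev,
                             size_t len)
{
    size_t pos;
    /* bytes needed to bring row up to an 8-byte boundary (0..7) */
    size_t align = (size_t)((8 - ((uintptr_t)row & 7)) & 7);

    assert(row != NULL && prev != NULL);

    if (align > len)
        align = len;
    add_bytes(row, prev, align);            /* like up_lp1 */
    pos = align;

    while (len - pos >= 64) {               /* like up_loop */
        add_bytes(row + pos, prev + pos, 64);
        pos += 64;
    }
    while (len - pos >= 8) {                /* like up_lpA */
        add_bytes(row + pos, prev + pos, 8);
        pos += 8;
    }
    add_bytes(row + pos, prev + pos, len - pos);   /* like up_lp2 */
}
```

Calling `up_filter_sketch(row, prev_row, row_info->rowbytes)` should leave `row` in the same state as the non-MMX loop in png_read_filter_row().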
5070
5071
5072
5073
5074	/*===========================================================================*/
5075	/* */
5076	/* P N G _ R E A D _ F I L T E R _ R O W */
5077	/* */
5078	/*===========================================================================*/
5079
5080
5081	/* Optimized png_read_filter_row routines */
5082
5083	void /* PRIVATE */
5084	png_read_filter_row(png_structp png_ptr, png_row_infop row_info, png_bytep
5085	row, png_bytep prev_row, int filter)
5086	{
5087	#ifdef PNG_DEBUG
5088	char filnm[10];
5089	#endif
5090
5091	#if defined(PNG_MMX_CODE_SUPPORTED)
5092	/* GRR: these are superseded by png_ptr->asm_flags: */
5093	#define UseMMX_sub 1 // GRR: converted 20000730
5094	#define UseMMX_up 1 // GRR: converted 20000729
5095	#define UseMMX_avg 1 // GRR: converted 20000828 (+ 16-bit bugfix 20000916)
5096	#define UseMMX_paeth 1 // GRR: converted 20000828
5097
5098	if (_mmx_supported == 2) {
5099	/* this should have happened in png_init_mmx_flags() already */
5100	#if !defined(PNG_1_0_X)
5101	png_warning(png_ptr, "asm_flags may not have been initialized");
5102	#endif
5103	png_mmx_support();
5104	}
5105	#endif /* PNG_MMX_CODE_SUPPORTED */
5106
5107	#ifdef PNG_DEBUG
5108	png_debug(1, "in png_read_filter_row (pnggccrd.c)\n");
5109	switch (filter)
5110	{
5111	case 0: sprintf(filnm, "none");
5112	break;
5113	case 1: sprintf(filnm, "sub-%s",
5114	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
5115	#if !defined(PNG_1_0_X)
5116	(png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_SUB)? "MMX" :
5117	#endif
5118	#endif
5119	"x86");
5120	break;
5121	case 2: sprintf(filnm, "up-%s",
5122	#ifdef PNG_MMX_CODE_SUPPORTED
5123	#if !defined(PNG_1_0_X)
5124	(png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_UP)? "MMX" :
5125	#endif
5126	#endif
5127	"x86");
5128	break;
5129	case 3: sprintf(filnm, "avg-%s",
5130	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
5131	#if !defined(PNG_1_0_X)
5132	(png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_AVG)? "MMX" :
5133	#endif
5134	#endif
5135	"x86");
5136	break;
5137	case 4: sprintf(filnm, "Paeth-%s",
5138	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
5139	#if !defined(PNG_1_0_X)
5140	(png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_PAETH)? "MMX":
5141	#endif
5142	#endif
5143	"x86");
5144	break;
5145	default: sprintf(filnm, "unknw");
5146	break;
5147	}
5148	png_debug2(0, "row_number=%5ld, %5s, ", png_ptr->row_number, filnm);
5149	png_debug1(0, "row=0x%08lx, ", (unsigned long)row);
5150	png_debug2(0, "pixdepth=%2d, bytes=%d, ", (int)row_info->pixel_depth,
5151	(int)((row_info->pixel_depth + 7) >> 3));
5152	png_debug1(0,"rowbytes=%8ld\n", row_info->rowbytes);
5153	#endif /* PNG_DEBUG */
5154
5155	switch (filter)
5156	{
5157	case PNG_FILTER_VALUE_NONE:
5158	break;
5159
5160	case PNG_FILTER_VALUE_SUB:
5161	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
5162	#if !defined(PNG_1_0_X)
5163	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_SUB) &&
5164	(row_info->pixel_depth >= png_ptr->mmx_bitdepth_threshold) &&
5165	(row_info->rowbytes >= png_ptr->mmx_rowbytes_threshold))
5166	#else
5167	if (_mmx_supported)
5168	#endif
5169	{
5170	png_read_filter_row_mmx_sub(row_info, row);
5171	}
5172	else
5173	#endif /* PNG_MMX_CODE_SUPPORTED */
5174	{
5175	png_uint_32 i;
5176	png_uint_32 istop = row_info->rowbytes;
5177	png_uint_32 bpp = (row_info->pixel_depth + 7) >> 3;
5178	png_bytep rp = row + bpp;
5179	png_bytep lp = row;
5180
5181	for (i = bpp; i < istop; i++)
5182	{
5183	*rp = (png_byte)(((int)(*rp) + (int)(*lp++)) & 0xff);
5184	rp++;
5185	}
5186	} /* end !UseMMX_sub */
5187	break;
5188
5189	case PNG_FILTER_VALUE_UP:
5190	#if defined(PNG_MMX_CODE_SUPPORTED)
5191	#if !defined(PNG_1_0_X)
5192	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_UP) &&
5193	(row_info->pixel_depth >= png_ptr->mmx_bitdepth_threshold) &&
5194	(row_info->rowbytes >= png_ptr->mmx_rowbytes_threshold))
5195	#else
5196	if (_mmx_supported)
5197	#endif
5198	{
5199	png_read_filter_row_mmx_up(row_info, row, prev_row);
5200	}
5201	else
5202	#endif /* PNG_MMX_CODE_SUPPORTED */
5203	{
5204	png_uint_32 i;
5205	png_uint_32 istop = row_info->rowbytes;
5206	png_bytep rp = row;
5207	png_bytep pp = prev_row;
5208
5209	for (i = 0; i < istop; ++i)
5210	{
5211	*rp = (png_byte)(((int)(*rp) + (int)(*pp++)) & 0xff);
5212	rp++;
5213	}
5214	} /* end !UseMMX_up */
5215	break;
5216
5217	case PNG_FILTER_VALUE_AVG:
5218	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
5219	#if !defined(PNG_1_0_X)
5220	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_AVG) &&
5221	(row_info->pixel_depth >= png_ptr->mmx_bitdepth_threshold) &&
5222	(row_info->rowbytes >= png_ptr->mmx_rowbytes_threshold))
5223	#else
5224	if (_mmx_supported)
5225	#endif
5226	{
5227	png_read_filter_row_mmx_avg(row_info, row, prev_row);
5228	}
5229	else
5230	#endif /* PNG_MMX_CODE_SUPPORTED */
5231	{
5232	png_uint_32 i;
5233	png_bytep rp = row;
5234	png_bytep pp = prev_row;
5235	png_bytep lp = row;
5236	png_uint_32 bpp = (row_info->pixel_depth + 7) >> 3;
5237	png_uint_32 istop = row_info->rowbytes - bpp;
5238
5239	for (i = 0; i < bpp; i++)
5240	{
5241	*rp = (png_byte)(((int)(*rp) +
5242	((int)(*pp++) >> 1)) & 0xff);
5243	rp++;
5244	}
5245
5246	for (i = 0; i < istop; i++)
5247	{
5248	*rp = (png_byte)(((int)(*rp) +
5249	((int)(*pp++ + *lp++) >> 1)) & 0xff);
5250	rp++;
5251	}
5252	} /* end !UseMMX_avg */
5253	break;
5254
5255	case PNG_FILTER_VALUE_PAETH:
5256	#if defined(PNG_MMX_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
5257	#if !defined(PNG_1_0_X)
5258	if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_FILTER_PAETH) &&
5259	(row_info->pixel_depth >= png_ptr->mmx_bitdepth_threshold) &&
5260	(row_info->rowbytes >= png_ptr->mmx_rowbytes_threshold))
5261	#else
5262	if (_mmx_supported)
5263	#endif
5264	{
5265	png_read_filter_row_mmx_paeth(row_info, row, prev_row);
5266	}
5267	else
5268	#endif /* PNG_MMX_CODE_SUPPORTED */
5269	{
5270	png_uint_32 i;
5271	png_bytep rp = row;
5272	png_bytep pp = prev_row;
5273	png_bytep lp = row;
5274	png_bytep cp = prev_row;
5275	png_uint_32 bpp = (row_info->pixel_depth + 7) >> 3;
5276	png_uint_32 istop = row_info->rowbytes - bpp;
5277
5278	for (i = 0; i < bpp; i++)
5279	{
5280	*rp = (png_byte)(((int)(*rp) + (int)(*pp++)) & 0xff);
5281	rp++;
5282	}
5283
5284	for (i = 0; i < istop; i++) /* use leftover rp,pp */
5285	{
5286	int a, b, c, pa, pb, pc, p;
5287
5288	a = *lp++;
5289	b = *pp++;
5290	c = *cp++;
5291
5292	p = b - c;
5293	pc = a - c;
5294
5295	#ifdef PNG_USE_ABS
5296	pa = abs(p);
5297	pb = abs(pc);
5298	pc = abs(p + pc);
5299	#else
5300	pa = p < 0 ? -p : p;
5301	pb = pc < 0 ? -pc : pc;
5302	pc = (p + pc) < 0 ? -(p + pc) : p + pc;
5303	#endif
5304
5305	/*
5306	if (pa <= pb && pa <= pc)
5307	p = a;
5308	else if (pb <= pc)
5309	p = b;
5310	else
5311	p = c;
5312	*/
5313
5314	p = (pa <= pb && pa <= pc) ? a : (pb <= pc) ? b : c;
5315
5316	*rp = (png_byte)(((int)(*rp) + p) & 0xff);
5317	rp++;
5318	}
5319	} /* end !UseMMX_paeth */
5320	break;
5321
5322	default:
5323	png_warning(png_ptr, "Ignoring bad row-filter type");
5324	*row=0;
5325	break;
5326	}
5327	}
5328
5329	#endif /* PNG_HAVE_MMX_READ_FILTER_ROW */
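
The non-MMX Paeth branch of png_read_filter_row() above can be read more easily as a standalone predictor plus a row loop. The following is a minimal sketch under stated assumptions: `paeth_predictor` and `unfilter_paeth_sketch` are hypothetical names, array indexing replaces the rp/pp/lp/cp pointer walk, and `abs()` is used as in the PNG_USE_ABS variant.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of libpng).
 * a = left, b = above, c = upper-left. */
static int paeth_predictor(int a, int b, int c)
{
    int p  = b - c;            /* (a + b - c) - a */
    int pc = a - c;            /* (a + b - c) - b */
    int pa = abs(p);           /* distance of the estimate from a */
    int pb = abs(pc);          /* distance of the estimate from b */

    pc = abs(p + pc);          /* distance of the estimate from c */

    /* nearest of a, b, c; ties are broken in that order */
    return (pa <= pb && pa <= pc) ? a : (pb <= pc) ? b : c;
}

/* Sketch: undo the Paeth filter on one row, in place.
 * bpp is the number of bytes per complete pixel, rounded up to 1. */
static void unfilter_paeth_sketch(unsigned char *row,
                                  const unsigned char *prev,
                                  size_t rowbytes, size_t bpp)
{
    size_t i;

    /* first pixel: no left or upper-left neighbour, predictor is "above" */
    for (i = 0; i < bpp && i < rowbytes; i++)
        row[i] = (unsigned char)((row[i] + prev[i]) & 0xff);

    for (; i < rowbytes; i++) {
        int p = paeth_predictor(row[i - bpp], prev[i], prev[i - bpp]);
        row[i] = (unsigned char)((row[i] + p) & 0xff);
    }
}
```

With an all-zero previous row the predictor always selects the left neighbour, so the row becomes a running sum of its filtered bytes.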
5330
5331
5332	/*===========================================================================*/
5333	/* */
5334	/* P N G _ M M X _ S U P P O R T */
5335	/* */
5336	/*===========================================================================*/
5337
5338	/* GRR NOTES: (1) the following code assumes 386 or better (pushfl/popfl)
5339	* (2) all instructions compile with gcc 2.7.2.3 and later
5340	* (3) the function is moved down here to prevent gcc from
5341	* inlining it in multiple places and then barfing
5342	* because the ".NOT_SUPPORTED" label is multiply defined
5343	* [is there a way to signal that a single function should
5344	* not be inlined? is there a way to modify the label for
5345	* each inlined instance, e.g., by appending _1, _2, etc.?
5346	* maybe if don't use leading "." in label name? (nope...sigh)]
5347	*/
5348
5349	int PNGAPI
5350	png_mmx_support(void)
5351	{
5352	#if defined(PNG_MMX_CODE_SUPPORTED)
5353	int result;
5354	__asm__ __volatile__ (
5355	"pushl %%ebx \n\t" // ebx gets clobbered by CPUID instruction
5356	"pushl %%ecx \n\t" // so does ecx...
5357	"pushl %%edx \n\t" // ...and edx (but ecx & edx safe on Linux)
5358	// ".byte 0x66 \n\t" // convert 16-bit pushf to 32-bit pushfd
5359	// "pushf \n\t" // 16-bit pushf
5360	"pushfl \n\t" // save Eflag to stack
5361	"popl %%eax \n\t" // get Eflag from stack into eax
5362	"movl %%eax, %%ecx \n\t" // make another copy of Eflag in ecx
5363	"xorl $0x200000, %%eax \n\t" // toggle ID bit in Eflag (i.e., bit 21)
5364	"pushl %%eax \n\t" // save modified Eflag back to stack
5365	// ".byte 0x66 \n\t" // convert 16-bit popf to 32-bit popfd
5366	// "popf \n\t" // 16-bit popf
5367	"popfl \n\t" // restore modified value to Eflag reg
5368	"pushfl \n\t" // save Eflag to stack
5369	"popl %%eax \n\t" // get Eflag from stack
5370	"pushl %%ecx \n\t" // save original Eflag to stack
5371	"popfl \n\t" // restore original Eflag
5372	"xorl %%ecx, %%eax \n\t" // compare new Eflag with original Eflag
5373	"jz 0f \n\t" // if same, CPUID instr. is not supported
5374
5375	"xorl %%eax, %%eax \n\t" // set eax to zero
5376	// ".byte 0x0f, 0xa2 \n\t" // CPUID instruction (two-byte opcode)
5377	"cpuid \n\t" // get the CPU identification info
5378	"cmpl $1, %%eax \n\t" // make sure eax return non-zero value
5379	"jl 0f \n\t" // if eax is zero, MMX is not supported
5380
5381	"xorl %%eax, %%eax \n\t" // set eax to zero and...
5382	"incl %%eax \n\t" // ...increment eax to 1. This pair is
5383	// faster than the instruction "mov eax, 1"
5384	"cpuid \n\t" // get the CPU identification info again
5385	"andl $0x800000, %%edx \n\t" // mask out all bits but MMX bit (23)
5386	"cmpl $0, %%edx \n\t" // 0 = MMX not supported
5387	"jz 0f \n\t" // non-zero = yes, MMX IS supported
5388
5389	"movl $1, %%eax \n\t" // set return value to 1
5390	"jmp 1f \n\t" // DONE: have MMX support
5391
5392	"0: \n\t" // .NOT_SUPPORTED: target label for jump instructions
5393	"movl $0, %%eax \n\t" // set return value to 0
5394	"1: \n\t" // .RETURN: target label for jump instructions
5395	"popl %%edx \n\t" // restore edx
5396	"popl %%ecx \n\t" // restore ecx
5397	"popl %%ebx \n\t" // restore ebx
5398
5399	// "ret \n\t" // DONE: no MMX support
5400	// (fall through to standard C "ret")
5401
5402	: "=a" (result) // output list
5403
5404	: // any variables used on input (none)
5405
5406	// no clobber list
5407	// , "%ebx", "%ecx", "%edx" // GRR: we handle these manually
5408	// , "memory" // if write to a variable gcc thought was in a reg
5409	// , "cc" // "condition codes" (flag bits)
5410	);
5411	_mmx_supported = result;
5412	#else
5413	_mmx_supported = 0;
5414	#endif /* PNG_MMX_CODE_SUPPORTED */
5415
5416	return _mmx_supported;
5417	}
5418
5419
5420	#endif /* PNG_USE_PNGGCCRD */
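
png_mmx_support() above toggles the EFLAGS ID bit by hand and then tests bit 23 of EDX from CPUID leaf 1. With a GCC-compatible compiler the same check can be sketched with the `__get_cpuid()` helper from `<cpuid.h>`; this is an alternative illustration, not what this file does. The name `mmx_supported_sketch` and the `SKETCH_HAVE_CPUID` macro are hypothetical, and on non-x86 targets the sketch simply reports no MMX.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>            /* GCC/Clang helper header for CPUID */
#define SKETCH_HAVE_CPUID 1   /* hypothetical macro, local to this sketch */
#endif

/* Returns 1 if CPUID leaf 1 reports MMX (EDX bit 23), else 0. */
static int mmx_supported_sketch(void)
{
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid() returns 0 when the requested leaf is unavailable,
     * which covers the "CPUID not supported" and "max leaf < 1" cases
     * that the assembly above tests by hand. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (edx & (1u << 23)) ? 1 : 0;
#else
    return 0;                 /* not an x86 target or not a GNU-style compiler */
#endif
}
```

Letting the compiler's helper manage the registers avoids the manual push/pop of ebx, ecx, and edx, since the inline assembly inside `<cpuid.h>` declares its own register constraints.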
