Discussion:
SCHED_ULE should not be the default
George Mitchell
2011-12-09 11:05:25 UTC
Permalink
dnetc is an open-source program from http://www.distributed.net/. It
takes a brute-force approach to cracking RC5 puzzles and also computes
optimal Golomb rulers. It starts one process per CPU, runs at nice 20,
and is, for all intents and purposes, 100% compute bound.

Here is what happens on my system, running 9.0-PRERELEASE, with and
without dnetc running, with SCHED_ULE and SCHED_4BSD, when I run the
command:

time make buildkernel KERNCONF=WONDERLAND

(I get similar results on 8.x as well.)

SCHED_4BSD, dnetc not running:
1329.715u 123.739s 24:47.95 97.6% 6310+1987k 11233+11098io 419pf+0w

SCHED_4BSD, dnetc running:
1329.364u 115.158s 26:14.83 91.7% 6325+1987k 10912+11060io 393pf+0w

SCHED_ULE, dnetc not running:
1357.457u 121.526s 25:20.64 97.2% 6326+1990k 11234+11149io 419pf+0w

SCHED_ULE, dnetc running:
Still going after seven and a half hours of clock time, up to
compiling netgraph/bluetooth. (Completed in another five minutes
after stopping dnetc so I could write this message in a reasonable
amount of time.)

Not everybody runs this sort of program, but there are plenty of
similar projects out there, and people who try to participate in
them will be mightily displeased with their FreeBSD systems when
they do. Is there some case where SCHED_ULE exhibits significantly
better performance than SCHED_4BSD? If not, I think SCHED_4BSD
should remain the default GENERIC configuration until this is fixed.
-- George Mitchell
Volodymyr Kostyrko
2011-12-09 15:08:12 UTC
Permalink
dnetc is an open-source program from http://www.distributed.net/. It
tries a brute-force approach to cracking RC4 puzzles and also computes
optimal Golomb rulers. It starts up one process per CPU and runs at
nice 20 and is, for all intents and purposes, 100% compute bound.
nice 20 doesn't mean it will give way to just any other program. Have
you tried setting dnetc_idprio?
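For example (a minimal illustration, not a tested recipe): idprio(1)
can start the client at idle priority by hand, and the rc.conf knob
named above presumably does the same through the port's rc script:

idprio 31 dnetc

or in /etc/rc.conf (assuming that is how the dnetc rc script spells it):

dnetc_idprio="31"

Idle-priority processes only run when nothing else wants the CPU, which
is exactly what a donate-your-spare-cycles client should do.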
Not everybody runs this sort of program, but there are plenty of
similar projects out there, and people who try to participate in
them will be mightily displeased with their FreeBSD systems when
they do. Is there some case where SCHED_ULE exhibits significantly
better performance than SCHED_4BSD? If not, I think SCHED-4BSD
should remain the default GENERIC configuration until this is fixed.
Not fully right: boinc defaults to running at idprio 31, so this isn't
an issue. And yes, there are cases where SCHED_ULE shows much better
performance than SCHED_4BSD. You incidentally found a rare misbehavior
of SCHED_ULE, and I think it will be addressed.
--
Sphinx of black quartz judge my vow.
O. Hartmann
2011-12-12 14:11:06 UTC
Permalink
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned that SCHED_ULE has better performance on boxes with ncpu >
2. But in the end I see contradictory statements here. People complain
about poor performance (especially in scientific environments), and
others counter that this is not the case.

Within our department, we developed highly scalable code for planetary
science work on imagery. It uses GPUs via OpenCL where present;
otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intel's new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing to run some benchmarks on the same
hardware platform, we'll benchmark both FreeBSD 9.0/10.0 and the most
recent SUSE. For FreeBSD I also intend to look at performance with both
of the available schedulers.

O.
Steve Kargl
2011-12-12 15:54:17 UTC
Permalink
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
This comes up every 9 months or so, and must be approaching
FAQ status.

In an HPC environment, I recommend 4BSD. Depending on
the workload, ULE can cause a severe increase in turnaround
time for already-long computations. If you have an MPI
application, simply launching ncpu+1 or more jobs can show
the problem.

PS: search the list archives for "kargl and ULE".
--
Steve
Lars Engels
2011-12-12 16:15:52 UTC
Permalink
Would it be possible to implement a mechanism that lets one change the scheduler on the fly? AFAIK Solaris can do that.

_____________________________________________
From: Steve Kargl <***@troutmask.apl.washington.edu>
Sent: Mon Dec 12 16:51:59 CET 2011
To: "O. Hartmann" <***@mail.zedat.fu-berlin.de>
CC: freebsd-***@freebsd.org, Current FreeBSD <freebsd-***@freebsd.org>, freebsd-***@freebsd.org
Subject: Re: SCHED_ULE should not be the default
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
This comes up every 9 months or so, and must be approaching
FAQ status.

In a HPC environment, I recommend 4BSD. Depending on
the workload, ULE can cause a severe increase in turn
around time when doing already long computations. If
you have an MPI application, simply launching greater
than ncpu+1 jobs can show the problem.

PS: search the list archives for "kargl and ULE".
--
Steve
_____________________________________________

Bruce Cran
2011-12-12 16:20:33 UTC
Permalink
This comes up every 9 months or so, and must be approaching FAQ
status. In a HPC environment, I recommend 4BSD. Depending on the
workload, ULE can cause a severe increase in turn around time when
doing already long computations. If you have an MPI application,
search the list archives for "kargl and ULE".
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224 from
its default. I'm wondering if the installer should ask people what the
typical use will be, and tune the scheduler appropriately.
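For example, to try that at run time and keep it across reboots
(purely illustrative; whether it actually helps is the question at
hand):

sysctl kern.sched.preempt_thresh=224

and the equivalent line in /etc/sysctl.conf:

kern.sched.preempt_thresh=224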
--
Bruce Cran
Ivan Klymenko
2011-12-12 16:45:21 UTC
Permalink
On Mon, 12 Dec 2011 16:18:35 +0000
Post by Bruce Cran
This comes up every 9 months or so, and must be approaching FAQ
status. In a HPC environment, I recommend 4BSD. Depending on the
workload, ULE can cause a severe increase in turn around time when
doing already long computations. If you have an MPI application,
search the list archives for "kargl and ULE".
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224
from its default. I'm wondering if the installer should ask people
what the typical use will be, and tune the scheduler appropriately.
By and large this does not help in certain situations ...
Steve Kargl
2011-12-12 17:07:21 UTC
Permalink
Post by Bruce Cran
This comes up every 9 months or so, and must be approaching FAQ
status. In a HPC environment, I recommend 4BSD. Depending on the
workload, ULE can cause a severe increase in turn around time when
doing already long computations. If you have an MPI application,
search the list archives for "kargl and ULE".
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224 from
its default. I'm wondering if the installer should ask people what the
typical use will be, and tune the scheduler appropriately.
Tuning kern.sched.preempt_thresh did not seem to help for
my workload. My code is a classic master-slave OpenMPI
application where the master runs on one node and all
cpu-bound slaves are sent to a second node. If I send
ncpu+1 jobs to the 2nd node, which has ncpu cpus, then
ncpu-1 jobs are assigned to the first ncpu-1 cpus. The
last two jobs are assigned to the ncpu'th cpu, and
these ping-pong on this cpu. AFAICT, it is a cpu
affinity issue, where ULE is trying to keep each job
associated with its initially assigned cpu.

While one might suggest that starting ncpu+1 jobs
is not prudent, my example is just that: an example,
showing that ULE has performance issues. So I can now
either start only ncpu jobs on each node in the cluster
and send emails to all other users telling them not to
use those nodes, or use 4BSD and not worry about
loading issues.
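For what it's worth, the imbalance is easy to provoke without the full
application. With Open MPI and some stand-in CPU-bound ./slave binary
(a hypothetical name), something along the lines of

mpirun -np 9 -host node2 ./slave

on an 8-core node2 launches nine single-threaded, CPU-bound ranks;
under the behavior described above, the same two ranks end up sharing
one core for the whole run instead of the extra waiting being rotated
across all nine. (Depending on the Open MPI version, oversubscription
may have to be explicitly allowed.)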
--
Steve
John Baldwin
2011-12-12 18:52:56 UTC
Permalink
Post by Steve Kargl
Post by Bruce Cran
This comes up every 9 months or so, and must be approaching FAQ
status. In a HPC environment, I recommend 4BSD. Depending on the
workload, ULE can cause a severe increase in turn around time when
doing already long computations. If you have an MPI application,
search the list archives for "kargl and ULE".
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224 from
its default. I'm wondering if the installer should ask people what the
typical use will be, and tune the scheduler appropriately.
Tuning kern.sched.preempt_thresh did not seem to help for
my workload. My code is a classic master-slave OpenMPI
application where the master runs on one node and all
cpu-bound slaves are sent to a second node. If I send
send ncpu+1 jobs to the 2nd node with ncpu's, then
ncpu-1 jobs are assigned to the 1st ncpu-1 cpus. The
last two jobs are assigned to the ncpu'th cpu, and
these ping-pong on the this cpu. AFAICT, it is a cpu
affinity issue, where ULE is trying to keep each job
associated with its initially assigned cpu.
While one might suggest that starting ncpu+1 jobs
is not prudent, my example is just that. It is an
example showing that ULE has performance issues.
So, I now can start only ncpu jobs on each node
in the cluster and send emails to all other users
to not use those node, or use 4BSD and not worry
about loading issues.
This is a case where 4BSD's naive algorithm will spread out the load more
evenly because all the threads are on a single, shared queue and each CPU
just grabs the head of the queue when it finishes a timeslice. ULE always
assigns threads to a single CPU (even if they aren't pinned to a single
CPU using cpuset, etc.) and then tries to balance the load across cores
later, but I believe in this case its rebalancer won't have anything
useful to do: no matter what it does with the N+1th job, that job is going
to be sharing a CPU with another job.
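To make the difference concrete, here is a toy user-space model of the
two placement policies as just described (this is not scheduler code;
8 CPUs and 9 equal, CPU-bound jobs are assumed):

#include <stdio.h>

#define NCPU 8
#define NJOB (NCPU + 1)

int
main(void)
{
	int runs_4bsd[NJOB] = { 0 };
	int runs_ule[NJOB] = { 0 };
	int next = 0;

	for (int slice = 0; slice < 1000; slice++) {
		/*
		 * 4BSD-style: each CPU pulls the next runnable job from
		 * one shared queue, so the extra job's waiting rotates.
		 */
		for (int cpu = 0; cpu < NCPU; cpu++) {
			runs_4bsd[next]++;
			next = (next + 1) % NJOB;
		}
		/*
		 * ULE-style: each job stays on its assigned CPU, so the
		 * two jobs on the last CPU simply alternate there.
		 */
		for (int cpu = 0; cpu < NCPU - 1; cpu++)
			runs_ule[cpu]++;
		runs_ule[NCPU - 1 + (slice & 1)]++;
	}
	for (int i = 0; i < NJOB; i++)
		printf("job %d: 4BSD %4d slices, ULE %4d slices\n",
		    i, runs_4bsd[i], runs_ule[i]);
	return (0);
}

In the shared-queue model every job ends up with roughly 8/9 of a CPU;
in the static-assignment model seven jobs get a full CPU each while the
two co-located ones get half, which is exactly the "last two jobs
ping-pong on one cpu" symptom reported above.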
--
John Baldwin
Scott Lambert
2011-12-12 19:20:14 UTC
Permalink
Post by Steve Kargl
Tuning kern.sched.preempt_thresh did not seem to help for
my workload. My code is a classic master-slave OpenMPI
application where the master runs on one node and all
cpu-bound slaves are sent to a second node. If I send
send ncpu+1 jobs to the 2nd node with ncpu's, then
ncpu-1 jobs are assigned to the 1st ncpu-1 cpus. The
last two jobs are assigned to the ncpu'th cpu, and
these ping-pong on the this cpu. AFAICT, it is a cpu
affinity issue, where ULE is trying to keep each job
associated with its initially assigned cpu.
While one might suggest that starting ncpu+1 jobs
is not prudent, my example is just that. It is an
example showing that ULE has performance issues.
So, I now can start only ncpu jobs on each node
in the cluster and send emails to all other users
to not use those node, or use 4BSD and not worry
about loading issues.
Does it meet your expectations if you start a number of jobs that is
a multiple of ncpu (j modulo ncpu = 0) on a node?
--
Scott Lambert KC5MLE Unix SysAdmin
***@lambertfam.org
Steve Kargl
2011-12-12 19:28:04 UTC
Permalink
Post by Scott Lambert
Post by Steve Kargl
Tuning kern.sched.preempt_thresh did not seem to help for
my workload. My code is a classic master-slave OpenMPI
application where the master runs on one node and all
cpu-bound slaves are sent to a second node. If I send
send ncpu+1 jobs to the 2nd node with ncpu's, then
ncpu-1 jobs are assigned to the 1st ncpu-1 cpus. The
last two jobs are assigned to the ncpu'th cpu, and
these ping-pong on the this cpu. AFAICT, it is a cpu
affinity issue, where ULE is trying to keep each job
associated with its initially assigned cpu.
While one might suggest that starting ncpu+1 jobs
is not prudent, my example is just that. It is an
example showing that ULE has performance issues.
So, I now can start only ncpu jobs on each node
in the cluster and send emails to all other users
to not use those node, or use 4BSD and not worry
about loading issues.
Does it meet your expectations if you start (j modulo ncpu) = 0
jobs on a node?
I've never tried to launch more than ncpu + 1 (or + 2)
jobs. I suppose at the time I was investigating the issue,
it was determined that 4BSD allowed me to get my work done
in a more timely manner. So, I took the path of least
resistance.
--
Steve
O. Hartmann
2011-12-13 00:05:27 UTC
Permalink
Post by Steve Kargl
Post by Bruce Cran
This comes up every 9 months or so, and must be approaching FAQ
status. In a HPC environment, I recommend 4BSD. Depending on the
workload, ULE can cause a severe increase in turn around time when
doing already long computations. If you have an MPI application,
search the list archives for "kargl and ULE".
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224 from
its default. I'm wondering if the installer should ask people what the
typical use will be, and tune the scheduler appropriately.
Are the tuning of kern.sched.preempt_thresh, and a proper method of
estimating its correct value for the intended workload, documented in
the manpages, maybe tuning(7)?

I find it hard to crawl through the pros and cons on the mailing lists
to evaluate a correct value for this seemingly important tunable.
Post by Steve Kargl
Tuning kern.sched.preempt_thresh did not seem to help for
my workload. My code is a classic master-slave OpenMPI
application where the master runs on one node and all
cpu-bound slaves are sent to a second node. If I send
send ncpu+1 jobs to the 2nd node with ncpu's, then
ncpu-1 jobs are assigned to the 1st ncpu-1 cpus. The
last two jobs are assigned to the ncpu'th cpu, and
these ping-pong on the this cpu. AFAICT, it is a cpu
affinity issue, where ULE is trying to keep each job
associated with its initially assigned cpu.
While one might suggest that starting ncpu+1 jobs
is not prudent, my example is just that. It is an
example showing that ULE has performance issues.
So, I now can start only ncpu jobs on each node
in the cluster and send emails to all other users
to not use those node, or use 4BSD and not worry
about loading issues.
Bruce Cran
2011-12-13 00:16:44 UTC
Permalink
Post by O. Hartmann
Is the tuning of kern.sched.preempt_thresh and a proper method of
estimating its correct value for the intended to use workload
documented in the manpages, maybe tuning()? I find it hard to crawl a
lot of pros and cons of mailing lists for evaluating a correct value
of this, seemingly, important tunable.
Note that I said "for example" :)
I was suggesting that there may be sysctls that can be tweaked to
improve performance.
--
Bruce Cran
George Mitchell
2011-12-13 11:06:08 UTC
Permalink
Post by Bruce Cran
This comes up every 9 months or so, and must be approaching FAQ
status. In a HPC environment, I recommend 4BSD. Depending on the
workload, ULE can cause a severe increase in turn around time when
doing already long computations. If you have an MPI application,
search the list archives for "kargl and ULE".
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224 from
its default. I'm wondering if the installer should ask people what the
typical use will be, and tune the scheduler appropriately.
I tried my "make buildkernel" test with "dnetc" running after setting
kern.sched.preempt_thresh to 224. It did far worse than before,
getting only as far as compiling bxe overnight (compared to getting
to netgraph with the default kern.sched.preempt_thresh setting).
-- George Mitchell
Garrett Wollman
2011-12-12 19:24:04 UTC
Permalink
Post by Bruce Cran
This isn't something that can be fixed by tuning ULE? For example for
desktop applications kern.sched.preempt_thresh should be set to 224 from
its default.
Where do you get that idea? I've never seen any evidence for this
proposition (although the claim is repeated often enough). What are
the specific circumstances that make this useful? Where did the
number come from?

-GAWollman
Bruce Cran
2011-12-12 19:44:20 UTC
Permalink
Post by Garrett Wollman
Where do you get that idea? I've never seen any evidence for this
proposition (although the claim is repeated often enough). What are
the specific circumstances that make this useful? Where did the
number come from?
It's just something I've heard repeated, with people claiming that
setting it improves performance.

This explains how the value 224 was obtained:
http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058686.html
--
Bruce Cran
Garrett Wollman
2011-12-12 20:27:07 UTC
Permalink
Post by Bruce Cran
Post by Garrett Wollman
Where do you get that idea? I've never seen any evidence for this
proposition (although the claim is repeated often enough). What are
the specific circumstances that make this useful? Where did the
number come from?
It's just something I've heard repeated, and people claiming that
setting it improves performance.
http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058686.html
Not so far as I can see.

The message does suggest that it helps if you are running a CPU-hog
GUI, which seems plausible to me, but doesn't justify making it the
default -- particularly when the setting is undocumented. (It appears
to control how CPU-bound a process can be and still preempt another
even more CPU-bound process, so using this as a "desktop performance"
"fix" looks doubly wrong.)

-GAWollman
O. Hartmann
2011-12-13 13:25:05 UTC
Permalink
Post by Steve Kargl
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
This comes up every 9 months or so, and must be approaching
FAQ status.
In a HPC environment, I recommend 4BSD. Depending on
the workload, ULE can cause a severe increase in turn
around time when doing already long computations. If
you have an MPI application, simply launching greater
than ncpu+1 jobs can show the problem.
Well, those recommendations should be based on "WHY". As the mostly
negative experiences with SCHED_ULE in heavily computational workloads
always get contradicted by "...but there are workloads that show the
opposite ...", this should be backed by more recent benchmarks and
explanations rather than legacy benchmarks from years ago.

And, indeed, I would highly recommend having a FAQ or a short note in
tuning(7) or the handbook mentioning that SCHED_4BSD should be used in
HPC environments and SCHED_ULE for other workloads (which would have to
be made more specific).

It is not an easy task to set up an OS for a specific purpose and to
tune it by crawling the mailing lists. Notes and hints in the
documentation are always valuable and highly appreciated by folks not
deep into development.

And by the way, I have the strong impression that most of these
discussions about the poor performance of SCHED_ULE tend to end up
covering up that flaw and the consequent waste of development effort.
But this is only my personal impression.
Steve Kargl
2011-12-13 15:55:50 UTC
Permalink
Post by O. Hartmann
Post by Steve Kargl
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
This comes up every 9 months or so, and must be approaching
FAQ status.
In a HPC environment, I recommend 4BSD. Depending on
the workload, ULE can cause a severe increase in turn
around time when doing already long computations. If
you have an MPI application, simply launching greater
than ncpu+1 jobs can show the problem.
Well, those recommendations should based on "WHY". As the mostly
negative experiences with SCHED_ULE in highly computative workloads get
allways contradicted by "...but there are workloads that show the
opposite ..." this should be shown by more recent benchmarks and
explanations than legacy benchmarks from years ago.
I have given the WHY in previous discussions of ULE, based
on what you call legacy benchmarks. I have not seen any
commit to sched_ule.c that would lead me to believe that
the performance issues with ULE and cpu-bound numerical
codes have been addressed. Repeating the benchmark would
be a waste of time.
--
Steve
Mike Tancsa
2011-12-13 21:29:06 UTC
Permalink
Post by Steve Kargl
I have given the WHY in previous discussions of ULE, based
on what you call legacy benchmarks. I have not seen any
commit to sched_ule.c that would lead me to believe that
the performance issues with ULE and cpu-bound numerical
codes have been addressed. Repeating the benchmark would
be a waste of time.
Trying a simple pbzip2 on a large file, the results are pretty
consistent across iterations. With 4BSD, pbzip2 is barely faster on a
file that's 322MB in size.

after a reboot, I did a
strings bigfile > /dev/null
then ran
pbzip2 -v xaa -c > /dev/null
7 times

If I run a burnP6 (from sysutils/cpuburn) in the background, the two
schedulers perform about the same.

e.g.

pbzip2 -v xaa -c > /dev/null
Parallel BZIP2 v1.1.6 - by: Jeff Gilchrist [http://compression.ca]
[Oct. 30, 2011] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+***@gmail.com>

# CPUs: 4
BWT Block Size: 900 KB
File Block Size: 900 KB
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 1
Input Name: xaa
Output Name: <stdout>

Input Size: 352404831 bytes
Compressing data...
Output Size: 50630745 bytes
-------------------------------------------

Wall Clock: 18.139342 seconds


ULE
18.113204
18.116896
18.123400
18.105894
18.163332
18.139342
18.082888

ULE with burnP6
23.076085
22.003666
21.162987
21.682445
21.935568
23.595781
21.601277


4BSD
17.983395
17.986218
18.009254
18.004312
18.001494
17.997032

4BSD with burnP6
22.215508
21.886459
21.595179
21.361830
21.325351
21.244793



# ministat uleP6 bsdP6
x uleP6
+ bsdP6
[ministat ASCII dot-plot omitted; it was line-wrapped beyond recovery]
    N           Min           Max        Median           Avg        Stddev
x   6     21.162987     23.595781     22.003666     22.242755    0.91175566
+   6     21.244793     22.215508     21.595179     21.604853     0.3792413
No difference proven at 95.0% confidence

x ule
+ bsd
[ministat ASCII dot-plot omitted; it was line-wrapped beyond recovery]
    N           Min           Max        Median           Avg        Stddev
x   7     18.082888     18.163332     18.116896     18.120708   0.025468695
+   6     17.983395     18.009254     18.001494     17.996951   0.010248473
Difference at 95.0% confidence
        -0.123757 +/- 0.024538
        -0.68296% +/- 0.135414%
        (Student's t, pooled s = 0.0200388)





hardware is X3450 with 8G of memory. RELENG8

---Mike
--
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, ***@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/
Malin Randstrom
2011-12-13 21:59:06 UTC
Permalink
stop sending me spam mail ... you never stop despite me having
unsubscribed several times. stop this!
[full quote of earlier messages in the thread trimmed]
Doug Barton
2011-12-13 22:04:52 UTC
Permalink
Post by Malin Randstrom
stop sending me spam mail ... you never stop despite me having unsubscribeb
several times. stop this!
If you had actually unsubscribed, the mail would have stopped. :)

You can see the instructions you need to follow below.
Post by Malin Randstrom
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
--
[^L]

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/
Gary Jennejohn
2011-12-12 16:03:30 UTC
Permalink
On Mon, 12 Dec 2011 15:13:00 +0000
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It's all a little old now, but some of the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too; reading through Jeff's blog is worth it as it has
some interesting stuff on SCHED_ULE.
I thought there were some more benchmarks floating around but can't
find any with a quick google.
Vince
Post by O. Hartmann
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
These observations are not scientific, but I have a CPU from AMD with
6 cores (AMD Phenom(tm) II X6 1090T Processor).

My simple test was ``make buildkernel'' while watching the core usage with
gkrellm.

With SCHED_4BSD all 6 cores are loaded to 97% during the build phase.
I've never seen any value above 97% with gkrellm.

With SCHED_ULE I never saw all 6 cores loaded this heavily. Usually
2 or more cores were at or below 90%. Not really that significant, but
still a noticeable difference in apparent scheduling behavior. Whether
the observed difference is due to some change in data from the kernel to
gkrellm is beyond me.
--
Gary Jennejohn
Lars Engels
2011-12-12 16:13:45 UTC
Permalink
Did you use -jX to build the world?

_____________________________________________
From: Gary Jennejohn <***@googlemail.com>
Sent: Mon Dec 12 16:32:21 CET 2011
To: Vincent Hoffman <***@unsane.co.uk>
CC: "O. Hartmann" <***@mail.zedat.fu-berlin.de>, Current FreeBSD <freebsd-***@freebsd.org>, freebsd-***@freebsd.org, freebsd-***@freebsd.org
Subject: Re: SCHED_ULE should not be the default


On Mon, 12 Dec 2011 15:13:00 +0000
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It all a little old now but some if the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too, reading though Jeffs blog is worth it as it has some
interesting stuff on SHED_ULE.
I thought there were some more benchmarks floating round but cant find
any with a quick google.
Vince
Post by O. Hartmann
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
These observations are not scientific, but I have a CPU from AMD with
6 cores (AMD Phenom(tm) II X6 1090T Processor).

My simple test was ``make buildkernel'' while watching the core usage with
gkrellm.

With SCHED_4BSD all 6 cores are loaded to 97% during the build phase.
I've never seen any value above 97% with gkrellm.

With SCHED_ULE I never saw all 6 cores loaded this heavily. Usually
2 or more cores were at or below 90%. Not really that significant, but
still a noticeable difference in apparent scheduling behavior. Whether
the observed difference is due to some change in data from the kernel to
gkrellm is beyond me.
--
Gary Jennejohn
_____________________________________________

Gary Jennejohn
2011-12-12 16:49:56 UTC
Permalink
On Mon, 12 Dec 2011 17:10:46 +0100
Post by Lars Engels
Did you use -jX to build the world?
I'm top posting since Lars did.

It was buildkernel, not buildworld.

Yes, -j6.
Post by Lars Engels
_____________________________________________
Sent: Mon Dec 12 16:32:21 CET 2011
Subject: Re: SCHED_ULE should not be the default
On Mon, 12 Dec 2011 15:13:00 +0000
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It all a little old now but some if the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too, reading though Jeffs blog is worth it as it has some
interesting stuff on SHED_ULE.
I thought there were some more benchmarks floating round but cant find
any with a quick google.
Vince
Post by O. Hartmann
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
These observations are not scientific, but I have a CPU from AMD with
6 cores (AMD Phenom(tm) II X6 1090T Processor).
My simple test was ``make buildkernel'' while watching the core usage with
gkrellm.
With SCHED_4BSD all 6 cores are loaded to 97% during the build phase.
I've never seen any value above 97% with gkrellm.
With SCHED_ULE I never saw all 6 cores loaded this heavily. Usually
2 or more cores were at or below 90%. Not really that significant, but
still a noticeable difference in apparent scheduling behavior. Whether
the observed difference is due to some change in data from the kernel to
gkrellm is beyond me.
--
Gary Jennejohn
--
Gary Jennejohn
m***@FreeBSD.org
2011-12-12 16:31:37 UTC
Permalink
On Mon, Dec 12, 2011 at 7:32 AM, Gary Jennejohn
Post by Gary Jennejohn
On Mon, 12 Dec 2011 15:13:00 +0000
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It all a little old now but some if the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too, reading though Jeffs blog is worth it as it has some
interesting stuff on SHED_ULE.
I thought there were some more benchmarks floating round but cant find
any with a quick google.
Vince
Post by O. Hartmann
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
These observations are not scientific, but I have a CPU from AMD with
6 cores (AMD Phenom(tm) II X6 1090T Processor).
My simple test was ``make buildkernel'' while watching the core usage with
gkrellm.
With SCHED_4BSD all 6 cores are loaded to 97% during the build phase.
I've never seen any value above 97% with gkrellm.
With SCHED_ULE I never saw all 6 cores loaded this heavily.  Usually
2 or more cores were at or below 90%.  Not really that significant, but
still a noticeable difference in apparent scheduling behavior.  Whether
the observed difference is due to some change in data from the kernel to
gkrellm is beyond me.
SCHED_ULE is much sloppier about calculating which thread used a
timeslice -- unless the timeslice went 100% to a thread, the fraction
it used may get attributed elsewhere. So top's reporting of thread
usage is not a useful metric. Total buildworld time is, potentially.

Thanks,
matthew
Gary Jennejohn
2011-12-12 16:51:35 UTC
Permalink
On Mon, 12 Dec 2011 08:04:37 -0800
Post by m***@FreeBSD.org
On Mon, Dec 12, 2011 at 7:32 AM, Gary Jennejohn
Post by Gary Jennejohn
On Mon, 12 Dec 2011 15:13:00 +0000
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It all a little old now but some if the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too, reading though Jeffs blog is worth it as it has some
interesting stuff on SHED_ULE.
I thought there were some more benchmarks floating round but cant find
any with a quick google.
Vince
Post by O. Hartmann
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
These observations are not scientific, but I have a CPU from AMD with
6 cores (AMD Phenom(tm) II X6 1090T Processor).
My simple test was ``make buildkernel'' while watching the core usage with
gkrellm.
With SCHED_4BSD all 6 cores are loaded to 97% during the build phase.
I've never seen any value above 97% with gkrellm.
With SCHED_ULE I never saw all 6 cores loaded this heavily.  Usually
2 or more cores were at or below 90%.  Not really that significant, but
still a noticeable difference in apparent scheduling behavior.  Whether
the observed difference is due to some change in data from the kernel to
gkrellm is beyond me.
SCHED_ULE is much sloppier about calculating which thread used a
timeslice -- unless the timeslice went 100% to a thread, the fraction
it used may get attributed elsewhere. So top's reporting of thread
usage is not a useful metric. Total buildworld time is, potentially.
I suspect you're right since the buildworld time, a much better test,
was pretty much the same with 4BSD and ULE.
--
Gary Jennejohn
Pieter de Goeje
2011-12-12 16:33:00 UTC
Permalink
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
O.
In my spare time I do some stuff which can be considered "HPC". If I
recall correctly, the loudest supporters of the notion that SCHED_4BSD
is faster than SCHED_ULE are using more threads than there are cores,
causing CPU core contention and, more importantly, unevenly distributed
runtimes among threads, resulting in suboptimal execution times for
their programs. Since I've never actually seen the code in question,
it's hard to say whether this "unfair" distribution actually results in
lower throughput or simply violates an assumption in the code that each
thread takes about as long to finish its task.
Although I haven't actually benchmarked the two schedulers directly, I
have no reason to suspect SCHED_ULE of suboptimal performance because:
1) A program model where there are N threads on N cores which take work
items from a shared queue until it is empty has almost perfect scaling
on SCHED_ULE (I get 398% CPU usage on a quadcore; a minimal sketch of
this model follows after this list)
2) The same program on Linux (dual boot) compiled with exactly the same
compiler and flags runs slightly slower. I think this has to do with VM
differences.
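Here is a minimal sketch of that model; a hypothetical stand-in, not
the actual program: N worker threads pulling items from one shared
queue until it is empty, with the "work" being a pure busy loop.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4		/* assume a quad-core, as in the 398% figure */
#define NITEMS   256

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static int next_item = 0;
static volatile double sink;	/* keeps the busy loop from being optimized out */

static void
do_work(int item)
{
	double x = item;

	for (long i = 0; i < 20000000L; i++)
		x = x * 1.0000001 + 1.0;
	sink = x;
}

static void *
worker(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&qlock);
		int item = next_item < NITEMS ? next_item++ : -1;
		pthread_mutex_unlock(&qlock);
		if (item < 0)
			return (NULL);	/* queue empty, worker exits */
		do_work(item);
	}
}

int
main(void)
{
	pthread_t tid[NWORKERS];

	for (int i = 0; i < NWORKERS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(tid[i], NULL);
	printf("processed %d items with %d workers\n", NITEMS, NWORKERS);
	return (0);
}

Built with cc -O2 -pthread and timed with time(1) under both schedulers
(and against Linux, as in point 2), this is the kind of like-for-like
comparison being asked for; it should stay close to 400% CPU on a
quad-core.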

What I'm trying to say is that until someone actually shows some code
which has demonstrably lower performance on SCHED_ULE, and this is not
caused by (in my opinion improper) timing dependencies between threads,
I'd say there is no cause for concern here. I actually expect
performance differences between the two schedulers to show up in
problems which cause a lot more contention on the CPU cores and use
lots of locks internally, so that threads are frequently waiting on
each other; for instance the MySQL benchmarks done a couple of years
ago by Kris Kennaway.

Aside from algorithmic limitations (SCHED_4BSD doesn't really scale all
that well), there will always be some problems for which SCHED_4BSD is
faster because it happens, by chance, to produce a better execution
order for them... The good thing is people have a choice :-).

I'm looking forward to the results of your benchmark.
--
Pieter de Goeje
Doug Barton
2011-12-13 00:31:09 UTC
Permalink
Post by O. Hartmann
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD?
I complained about poor interactive performance of ULE in a desktop
environment for years. I had numerous people try to help, including
Jeff, with various tunables, dtrace'ing, etc. The cause of the problem
was never found.

I switched to 4BSD, problem gone.

This is on 2 separate systems with core 2 duos.


hth,

Doug
--
[^L]

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price. :) http://SupersetSolutions.com/
George Mitchell
2011-12-13 01:30:18 UTC
Permalink
Post by Doug Barton
[...]
I switched to 4BSD, problem gone.
[...]
Ditto. If there's some common situation where the average user would
have a perceptibly better experience with ULE, let's go for it. But
when there's a plausible usage scenario in which ULE gives OVER AN
ORDER OF MAGNITUDE worse performance[1], making ULE the default seems
like a bad choice. -- George Mitchell

[1]
http://lists.freebsd.org/pipermail/freebsd-stable/2011-December/064773.html
Ivan Klymenko
2011-12-13 08:42:21 UTC
Permalink
Post by Doug Barton
Post by O. Hartmann
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD?
I complained about poor interactive performance of ULE in a desktop
environment for years. I had numerous people try to help, including
Jeff, with various tunables, dtrace'ing, etc. The cause of the problem
was never found.
I switched to 4BSD, problem gone.
This is on 2 separate systems with core 2 duos.
hth,
Doug
If the ULE algorithm itself does not contain problems, then the problem
is in the Core2Duo, or in a piece of code that the ULE scheduler uses.
I already wrote to the mailing list that, specifically in my case
(Core2Duo), the following patch partially helps:
--- sched_ule.c.orig 2011-11-24 18:11:48.000000000 +0200
+++ sched_ule.c 2011-12-10 22:47:08.000000000 +0200
@@ -794,7 +794,8 @@
* 1.5 * balance_interval.
*/
balance_ticks = max(balance_interval / 2, 1);
- balance_ticks += random() % balance_interval;
+// balance_ticks += random() % balance_interval;
+ balance_ticks += ((int)random()) % balance_interval;
if (smp_started == 0 || rebalance == 0)
return;
tdq = TDQ_SELF();
@@ -2118,13 +2119,21 @@
struct td_sched *ts;

THREAD_LOCK_ASSERT(td, MA_OWNED);
+ if (td->td_pri_class & PRI_FIFO_BIT)
+ return;
+ ts = td->td_sched;
+ /*
+ * We used up one time slice.
+ */
+ if (--ts->ts_slice > 0)
+ return;
tdq = TDQ_SELF();
#ifdef SMP
/*
* We run the long term load balancer infrequently on the first cpu.
*/
- if (balance_tdq == tdq) {
- if (balance_ticks && --balance_ticks == 0)
+ if (balance_ticks && --balance_ticks == 0) {
+ if (balance_tdq == tdq)
sched_balance();
}
#endif
@@ -2144,9 +2153,6 @@
if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx;
}
- ts = td->td_sched;
- if (td->td_pri_class & PRI_FIFO_BIT)
- return;
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
/*
* We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
sched_priority(td);
}
/*
- * We used up one time slice.
- */
- if (--ts->ts_slice > 0)
- return;
- /*
* We're out of time, force a requeue at userret().
*/
ts->ts_slice = sched_slice;


and not using options FULL_PREEMPTION.
But no one has replied to my letter saying whether or not my patch helps
in the Core2Duo case...
There is a suspicion that the problems stem from the sections of code
associated with SMP...
Maybe I'm wrong about something, but I want to help solve this problem ...
Andrey Chernov
2011-12-13 09:14:00 UTC
Permalink
Post by Ivan Klymenko
Post by Doug Barton
Post by O. Hartmann
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD?
I complained about poor interactive performance of ULE in a desktop
environment for years. I had numerous people try to help, including
Jeff, with various tunables, dtrace'ing, etc. The cause of the problem
was never found.
I switched to 4BSD, problem gone.
This is on 2 separate systems with core 2 duos.
hth,
Doug
If the algorithm ULE does not contain problems - it means the problem
has Core2Duo, or in a piece of code that uses the ULE scheduler.
I observe ULE interactivity slowness even on a single-core machine
(Pentium 4) in very visible places: for example, 'ps ax' output gets
stuck in the middle for ~1 second. When I switch back to SCHED_4BSD,
all the slowness is gone.
--
http://ache.vniz.net/
Adrian Chadd
2011-12-13 10:23:56 UTC
Permalink
Post by Andrey Chernov
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the problem
has Core2Duo, or in a piece of code that uses the ULE scheduler.
I observe ULE interactivity slowness even on single core machine (Pentium
4) in very visible places, like 'ps ax' output stucks in the middle by ~1
second. When I switch back to SHED_4BSD, all slowness is gone.
Are you able to provide KTR traces of the scheduler results? Something
that can be fed to schedgraph?


Adrian
Andrey Chernov
2011-12-14 17:36:15 UTC
Permalink
Post by Adrian Chadd
Post by Andrey Chernov
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the problem
has Core2Duo, or in a piece of code that uses the ULE scheduler.
I observe ULE interactivity slowness even on single core machine (Pentium
4) in very visible places, like 'ps ax' output stucks in the middle by ~1
second. When I switch back to SHED_4BSD, all slowness is gone.
Are you able to provide KTR traces of the scheduler results? Something
that can be fed to schedgraph?
Sorry, this machine is not mine anymore. I tried SCHED_ULE on a Core 2
Duo instead and don't notice this effect, but that machine is overall
pretty fast compared to the Pentium 4.
--
http://ache.vniz.net/
Ivan Klymenko
2011-12-14 17:57:52 UTC
Permalink
On Wed, 14 Dec 2011 21:34:35 +0400
Post by Andrey Chernov
Post by Adrian Chadd
Post by Andrey Chernov
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler.
I observe ULE interactivity slowness even on single core machine
(Pentium 4) in very visible places, like 'ps ax' output stucks in
the middle by ~1 second. When I switch back to SHED_4BSD, all
slowness is gone.
Are you able to provide KTR traces of the scheduler results?
Something that can be fed to schedgraph?
Sorry, this machine is not mine anymore. I try SCHED_ULE on Core 2
Duo instead and don't notice this effect, but it is overall pretty
fast comparing to that Pentium 4.
Please give me detailed instructions on how to do it and I'll do it ...
It would be a shame if this thread once again ends in nothing but
discussion ... :(
Jilles Tjoelker
2011-12-13 23:05:40 UTC
Permalink
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the problem
has Core2Duo, or in a piece of code that uses the ULE scheduler.
I already wrote in a mailing list that specifically in my case (Core2Duo)
--- sched_ule.c.orig 2011-11-24 18:11:48.000000000 +0200
+++ sched_ule.c 2011-12-10 22:47:08.000000000 +0200
@@ -794,7 +794,8 @@
* 1.5 * balance_interval.
*/
balance_ticks = max(balance_interval / 2, 1);
- balance_ticks += random() % balance_interval;
+// balance_ticks += random() % balance_interval;
+ balance_ticks += ((int)random()) % balance_interval;
if (smp_started == 0 || rebalance == 0)
return;
tdq = TDQ_SELF();
This avoids a 64-bit division on 64-bit platforms but seems to have no
effect otherwise. Because this function is not called very often, the
change seems unlikely to help.
Post by Ivan Klymenko
@@ -2118,13 +2119,21 @@
struct td_sched *ts;
THREAD_LOCK_ASSERT(td, MA_OWNED);
+ if (td->td_pri_class & PRI_FIFO_BIT)
+ return;
+ ts = td->td_sched;
+ /*
+ * We used up one time slice.
+ */
+ if (--ts->ts_slice > 0)
+ return;
This skips most of the periodic functionality (long term load balancer,
saving switch count (?), insert index (?), interactivity score update
for long running thread) if the thread is not going to be rescheduled
right now.

It looks wrong but it is a data point if it helps your workload.
Post by Ivan Klymenko
tdq = TDQ_SELF();
#ifdef SMP
/*
* We run the long term load balancer infrequently on the first cpu.
*/
- if (balance_tdq == tdq) {
- if (balance_ticks && --balance_ticks == 0)
+ if (balance_ticks && --balance_ticks == 0) {
+ if (balance_tdq == tdq)
sched_balance();
}
#endif
The main effect of this appears to be to disable the long term load
balancer completely after some time. At some point, a CPU other than the
first CPU (which uses balance_tdq) will set balance_ticks = 0, and
sched_balance() will never be called again.

It also introduces a hypothetical race condition because the access to
balance_ticks is no longer restricted to one CPU under a spinlock.

If the long term load balancer may be causing trouble, try setting
kern.sched.balance_interval to a higher value with unpatched code.
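Something along these lines should do it at run time (just a sketch; pick
a value a few times larger than whatever the current one is):

# sysctl kern.sched.balance_interval
# sysctl kern.sched.balance_interval=512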
Post by Ivan Klymenko
@@ -2144,9 +2153,6 @@
if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx;
}
- ts = td->td_sched;
- if (td->td_pri_class & PRI_FIFO_BIT)
- return;
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
/*
* We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
sched_priority(td);
}
/*
- * We used up one time slice.
- */
- if (--ts->ts_slice > 0)
- return;
- /*
* We're out of time, force a requeue at userret().
*/
ts->ts_slice = sched_slice;
and refusal to use options FULL_PREEMPTION
But no one has replied to my letter to say whether my patch helps or not in the case of Core2Duo...
There is a suspicion that the problems stem from the sections of code
associated with SMP...
Maybe I'm wrong about something, but I want to help solve this problem ...
--
Jilles Tjoelker
Ivan Klymenko
2011-12-13 23:43:43 UTC
Permalink
On Wed, 14 Dec 2011 00:04:42 +0100
Post by Jilles Tjoelker
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler. I already wrote in a mailing list that specifically in
--- sched_ule.c.orig 2011-11-24 18:11:48.000000000 +0200
+++ sched_ule.c 2011-12-10 22:47:08.000000000 +0200
@@ -794,7 +794,8 @@
* 1.5 * balance_interval.
*/
balance_ticks = max(balance_interval / 2, 1);
- balance_ticks += random() % balance_interval;
+// balance_ticks += random() % balance_interval;
+ balance_ticks += ((int)random()) % balance_interval;
if (smp_started == 0 || rebalance == 0)
return;
tdq = TDQ_SELF();
This avoids a 64-bit division on 64-bit platforms but seems to have no
effect otherwise. Because this function is not called very often, the
change seems unlikely to help.
Yes, this section does not apply to this problem :)
I just posted the latest patch, which I am using now...
Post by Jilles Tjoelker
Post by Ivan Klymenko
@@ -2118,13 +2119,21 @@
struct td_sched *ts;
THREAD_LOCK_ASSERT(td, MA_OWNED);
+ if (td->td_pri_class & PRI_FIFO_BIT)
+ return;
+ ts = td->td_sched;
+ /*
+ * We used up one time slice.
+ */
+ if (--ts->ts_slice > 0)
+ return;
This skips most of the periodic functionality (long term load
balancer, saving switch count (?), insert index (?), interactivity
score update for long running thread) if the thread is not going to
be rescheduled right now.
It looks wrong but it is a data point if it helps your workload.
Yes, I did that to delay, for as long as possible, the execution of the code in this section:
..
#ifdef SMP
/*
* We run the long term load balancer infrequently on the first cpu.
*/
if (balance_tdq == tdq) {
if (balance_ticks && --balance_ticks == 0)
sched_balance();
}
#endif
..
Post by Jilles Tjoelker
Post by Ivan Klymenko
tdq = TDQ_SELF();
#ifdef SMP
/*
* We run the long term load balancer infrequently on the
first cpu. */
- if (balance_tdq == tdq) {
- if (balance_ticks && --balance_ticks == 0)
+ if (balance_ticks && --balance_ticks == 0) {
+ if (balance_tdq == tdq)
sched_balance();
}
#endif
The main effect of this appears to be to disable the long term load
balancer completely after some time. At some point, a CPU other than
the first CPU (which uses balance_tdq) will set balance_ticks = 0, and
sched_balance() will never be called again.
That is, for the same reason as above in the text...
Post by Jilles Tjoelker
It also introduces a hypothetical race condition because the access to
balance_ticks is no longer restricted to one CPU under a spinlock.
If the long term load balancer may be causing trouble, try setting
kern.sched.balance_interval to a higher value with unpatched code.
I checked that first - but it did not help fix the situation...

My impression is that the rebalancing is malfunctioning...
It seems that the thread is handed back to the same core that is already loaded, and so on...
Perhaps this is a consequence of an incorrect detection of the CPU topology?
Post by Jilles Tjoelker
Post by Ivan Klymenko
@@ -2144,9 +2153,6 @@
if
(TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx; }
- ts = td->td_sched;
- if (td->td_pri_class & PRI_FIFO_BIT)
- return;
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
/*
* We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
sched_priority(td);
}
/*
- * We used up one time slice.
- */
- if (--ts->ts_slice > 0)
- return;
- /*
* We're out of time, force a requeue at userret().
*/
ts->ts_slice = sched_slice;
and refusal to use options FULL_PREEMPTION
But no one has unsubscribed to my letter, my patch helps or not in
the case of Core2Duo...
There is a suspicion that the problems stem from the sections of
code associated with the SMP...
Maybe I'm in something wrong, but I want to help in solving this problem ...
Bruce Evans
2011-12-14 03:26:39 UTC
Permalink
On Wed, 14 Dec 2011 00:04:42 +0100
Post by Jilles Tjoelker
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler. I already wrote in a mailing list that specifically in
--- sched_ule.c.orig 2011-11-24 18:11:48.000000000 +0200
+++ sched_ule.c 2011-12-10 22:47:08.000000000 +0200
...
@@ -2118,13 +2119,21 @@
struct td_sched *ts;
THREAD_LOCK_ASSERT(td, MA_OWNED);
+ if (td->td_pri_class & PRI_FIFO_BIT)
+ return;
+ ts = td->td_sched;
+ /*
+ * We used up one time slice.
+ */
+ if (--ts->ts_slice > 0)
+ return;
This skips most of the periodic functionality (long term load
balancer, saving switch count (?), insert index (?), interactivity
score update for long running thread) if the thread is not going to
be rescheduled right now.
It looks wrong but it is a data point if it helps your workload.
I don't understand what you are doing here, but recently noticed that
the timeslicing in SCHED_4BSD is completely broken. This bug may be a
feature. SCHED_4BSD doesn't have its own timeslice counter like ts_slice
above. It uses `switchticks' instead. But switchticks hasn't been usable
for this purpose since long before SCHED_4BSD started using it for this
purpose. switchticks is reset on every context switch, so it is useless
for almost all purposes -- any interrupt activity on a non-fast interrupt
clobbers it.

Removing the check of ts_slice in the above and always returning might
give a similar bug to the SCHED_4BSD one.

I noticed this while looking for bugs in realtime scheduling. In the
above, returning early for PRI_FIFO_BIT also skips most of the periodic
functionality. In SCHED_4BSD, returning early is the usual case, so
the PRI_FIFO_BIT might as well not be checked, and it is the unusual
fifo scheduling case (which is supposed to only apply to realtime
priority threads) which has a chance of working as intended, while the
usual round-robin case degenerates to an impure form of fifo scheduling
(it is impure since priority decay still works, so it is only fifo
among threads of the same priority).
Post by Jilles Tjoelker
...
Post by Ivan Klymenko
@@ -2144,9 +2153,6 @@
if
(TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx; }
- ts = td->td_sched;
- if (td->td_pri_class & PRI_FIFO_BIT)
- return;
if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
/*
* We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
sched_priority(td);
}
/*
- * We used up one time slice.
- */
- if (--ts->ts_slice > 0)
- return;
- /*
* We're out of time, force a requeue at userret().
*/
ts->ts_slice = sched_slice;
With the ts_slice check here before you moved it, removing it might
give buggy behaviour closer to SCHED_4BSD.
Post by Jilles Tjoelker
Post by Ivan Klymenko
and refusal to use options FULL_PREEMPTION
4-5 years ago, I found that any form of PREEMPTION was a pessimization
for at least makeworld (since it caused too many context switches).
PREEMPTION was needed for the !SMP case, at least partly because of
the broken switchticks (switchticks, when it works, gives voluntary
yielding by some CPU hogs in the kernel. PREEMPTION, if it works,
should do this better). So I used PREEMPTION in the !SMP case and
not for the SMP case. I didn't worry about the CPU hogs in the SMP
case since it is rare to have more than 1 of them and 1 will use at
most 1/2 of a multi-CPU system.
Post by Jilles Tjoelker
Post by Ivan Klymenko
But no one has unsubscribed to my letter, my patch helps or not in
the case of Core2Duo...
There is a suspicion that the problems stem from the sections of
code associated with the SMP...
Maybe I'm in something wrong, but I want to help in solving this problem ...
The main point of SCHED_ULE is to give better affinity for multi-CPU
systems. But the `multi' apparently needs to be strictly more than
2 for it to break even.

Bruce
m***@FreeBSD.org
2011-12-14 00:04:51 UTC
Permalink
Post by Ivan Klymenko
On Wed, 14 Dec 2011 00:04:42 +0100
Post by Jilles Tjoelker
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler. I already wrote in a mailing list that specifically in
--- sched_ule.c.orig        2011-11-24 18:11:48.000000000 +0200
+++ sched_ule.c     2011-12-10 22:47:08.000000000 +0200
@@ -794,7 +794,8 @@
     * 1.5 * balance_interval.
     */
    balance_ticks = max(balance_interval / 2, 1);
-   balance_ticks += random() % balance_interval;
+// balance_ticks += random() % balance_interval;
+   balance_ticks += ((int)random()) % balance_interval;
    if (smp_started == 0 || rebalance == 0)
            return;
    tdq = TDQ_SELF();
This avoids a 64-bit division on 64-bit platforms but seems to have no
effect otherwise. Because this function is not called very often, the
change seems unlikely to help.
Yes, this section does not apply to this problem :)
Just I posted the latest patch which i using now...
Post by Jilles Tjoelker
Post by Ivan Klymenko
@@ -2118,13 +2119,21 @@
    struct td_sched *ts;
    THREAD_LOCK_ASSERT(td, MA_OWNED);
+   if (td->td_pri_class & PRI_FIFO_BIT)
+           return;
+   ts = td->td_sched;
+   /*
+    * We used up one time slice.
+    */
+   if (--ts->ts_slice > 0)
+           return;
This skips most of the periodic functionality (long term load
balancer, saving switch count (?), insert index (?), interactivity
score update for long running thread) if the thread is not going to
be rescheduled right now.
It looks wrong but it is a data point if it helps your workload.
...
#ifdef SMP
       /*
        * We run the long term load balancer infrequently on the first cpu.
        */
       if (balance_tdq == tdq) {
               if (balance_ticks && --balance_ticks == 0)
                       sched_balance();
       }
#endif
...
Post by Jilles Tjoelker
Post by Ivan Klymenko
    tdq = TDQ_SELF();
 #ifdef SMP
    /*
     * We run the long term load balancer infrequently on the
first cpu. */
-   if (balance_tdq == tdq) {
-           if (balance_ticks && --balance_ticks == 0)
+   if (balance_ticks && --balance_ticks == 0) {
+           if (balance_tdq == tdq)
                    sched_balance();
    }
 #endif
The main effect of this appears to be to disable the long term load
balancer completely after some time. At some point, a CPU other than
the first CPU (which uses balance_tdq) will set balance_ticks = 0, and
sched_balance() will never be called again.
That is, for the same reason as above in the text...
Post by Jilles Tjoelker
It also introduces a hypothetical race condition because the access to
balance_ticks is no longer restricted to one CPU under a spinlock.
If the long term load balancer may be causing trouble, try setting
kern.sched.balance_interval to a higher value with unpatched code.
I checked it in the first place - but it did not help fix the situation...
The impression of malfunction rebalancing...
It seems that the thread is passed on to the same core that is loaded and so...
Perhaps this is a consequence of an incorrect definition of the topology CPU?
Post by Jilles Tjoelker
Post by Ivan Klymenko
@@ -2144,9 +2153,6 @@
            if
(TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx; }
-   ts = td->td_sched;
-   if (td->td_pri_class & PRI_FIFO_BIT)
-           return;
    if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
            /*
             * We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
            sched_priority(td);
    }
    /*
-    * We used up one time slice.
-    */
-   if (--ts->ts_slice > 0)
-           return;
-   /*
     * We're out of time, force a requeue at userret().
     */
    ts->ts_slice = sched_slice;
and refusal to use options FULL_PREEMPTION
But no one has unsubscribed to my letter, my patch helps or not in
the case of Core2Duo...
There is a suspicion that the problems stem from the sections of
code associated with the SMP...
Maybe I'm in something wrong, but I want to help in solving this problem ...
Has anyone experiencing problems tried to set sysctl kern.sched.steal_thresh=1 ?

I don't remember what our specific problem at $WORK was, perhaps it
was just interrupt threads not getting serviced fast enough, but we've
hard-coded this to 1 and removed the code that sets it in
sched_initticks(). The same effect should be had by setting the
sysctl after a box is up.
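For anyone who wants to try the same thing without patching, something
like this should be equivalent (a sketch; kern.sched.steal_thresh is an
ordinary run-time sysctl, and putting the same line into /etc/sysctl.conf
makes it persist across reboots):

# sysctl kern.sched.steal_thresh=1
# echo 'kern.sched.steal_thresh=1' >> /etc/sysctl.conf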

Thanks,
matthew
Ivan Klymenko
2011-12-14 00:37:57 UTC
Permalink
On Tue, 13 Dec 2011 16:01:56 -0800
Post by m***@FreeBSD.org
Post by Ivan Klymenko
В Wed, 14 Dec 2011 00:04:42 +0100
Post by Jilles Tjoelker
Post by Ivan Klymenko
If the algorithm ULE does not contain problems - it means the
problem has Core2Duo, or in a piece of code that uses the ULE
scheduler. I already wrote in a mailing list that specifically in
--- sched_ule.c.orig        2011-11-24 18:11:48.000000000 +0200
+++ sched_ule.c     2011-12-10 22:47:08.000000000 +0200
@@ -794,7 +794,8 @@
     * 1.5 * balance_interval.
     */
    balance_ticks = max(balance_interval / 2, 1);
-   balance_ticks += random() % balance_interval;
+// balance_ticks += random() % balance_interval;
+   balance_ticks += ((int)random()) % balance_interval;
    if (smp_started == 0 || rebalance == 0)
            return;
    tdq = TDQ_SELF();
This avoids a 64-bit division on 64-bit platforms but seems to
have no effect otherwise. Because this function is not called very
often, the change seems unlikely to help.
Yes, this section does not apply to this problem :)
Just I posted the latest patch which i using now...
Post by Jilles Tjoelker
Post by Ivan Klymenko
@@ -2118,13 +2119,21 @@
    struct td_sched *ts;
    THREAD_LOCK_ASSERT(td, MA_OWNED);
+   if (td->td_pri_class & PRI_FIFO_BIT)
+           return;
+   ts = td->td_sched;
+   /*
+    * We used up one time slice.
+    */
+   if (--ts->ts_slice > 0)
+           return;
This skips most of the periodic functionality (long term load
balancer, saving switch count (?), insert index (?), interactivity
score update for long running thread) if the thread is not going to
be rescheduled right now.
It looks wrong but it is a data point if it helps your workload.
Yes, I did it for as long as possible to delay the execution of the
code in section: ...
#ifdef SMP
       /*
        * We run the long term load balancer infrequently on the
first cpu. */
       if (balance_tdq == tdq) {
               if (balance_ticks && --balance_ticks == 0)
                       sched_balance();
       }
#endif
...
Post by Jilles Tjoelker
Post by Ivan Klymenko
    tdq = TDQ_SELF();
 #ifdef SMP
    /*
     * We run the long term load balancer infrequently on the
first cpu. */
-   if (balance_tdq == tdq) {
-           if (balance_ticks && --balance_ticks == 0)
+   if (balance_ticks && --balance_ticks == 0) {
+           if (balance_tdq == tdq)
                    sched_balance();
    }
 #endif
The main effect of this appears to be to disable the long term load
balancer completely after some time. At some point, a CPU other
than the first CPU (which uses balance_tdq) will set balance_ticks
= 0, and sched_balance() will never be called again.
That is, for the same reason as above in the text...
Post by Jilles Tjoelker
It also introduces a hypothetical race condition because the
access to balance_ticks is no longer restricted to one CPU under a
spinlock.
If the long term load balancer may be causing trouble, try setting
kern.sched.balance_interval to a higher value with unpatched code.
I checked it in the first place - but it did not help fix the situation...
The impression of malfunction rebalancing...
It seems that the thread is passed on to the same core that is
loaded and so... Perhaps this is a consequence of an incorrect
definition of the topology CPU?
Post by Jilles Tjoelker
Post by Ivan Klymenko
@@ -2144,9 +2153,6 @@
            if
(TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
tdq->tdq_ridx = tdq->tdq_idx; }
-   ts = td->td_sched;
-   if (td->td_pri_class & PRI_FIFO_BIT)
-           return;
    if (PRI_BASE(td->td_pri_class) == PRI_TIMESHARE) {
            /*
             * We used a tick; charge it to the thread so
@@ -2157,11 +2163,6 @@
            sched_priority(td);
    }
    /*
-    * We used up one time slice.
-    */
-   if (--ts->ts_slice > 0)
-           return;
-   /*
     * We're out of time, force a requeue at userret().
     */
    ts->ts_slice = sched_slice;
and refusal to use options FULL_PREEMPTION
But no one has unsubscribed to my letter, my patch helps or not
in the case of Core2Duo...
There is a suspicion that the problems stem from the sections of
code associated with the SMP...
Maybe I'm in something wrong, but I want to help in solving this problem ...
Has anyone experiencing problems tried to set sysctl
kern.sched.steal_thresh=1 ?
In my case, the variable kern.sched.steal_thresh already has the value 1.
Post by m***@FreeBSD.org
I don't remember what our specific problem at $WORK was, perhaps it
was just interrupt threads not getting serviced fast enough, but we've
hard-coded this to 1 and removed the code that sets it in
sched_initticks(). The same effect should be had by setting the
sysctl after a box is up.
Thanks,
matthew
Mike Tancsa
2011-12-14 17:01:52 UTC
Permalink
Post by m***@FreeBSD.org
Has anyone experiencing problems tried to set sysctl kern.sched.steal_thresh=1 ?
I don't remember what our specific problem at $WORK was, perhaps it
was just interrupt threads not getting serviced fast enough, but we've
hard-coded this to 1 and removed the code that sets it in
sched_initticks(). The same effect should be had by setting the
sysctl after a box is up.
FWIW, this does impact the performance of pbzip2 on an i7. Using a 1.1G file

pbzip2 -v -c big > /dev/null

with burnP6 running in the background,

sysctl kern.sched.steal_thresh=1
vs
sysctl kern.sched.steal_thresh=3



    N           Min           Max        Median           Avg        Stddev
x  10     38.005022      38.42238     38.194648     38.165052    0.15546188
+   9     38.695417     40.595544     39.392127     39.435384    0.59814114
Difference at 95.0% confidence
1.27033 +/- 0.412636
3.32852% +/- 1.08119%
(Student's t, pooled s = 0.425627)

a value of 1 is *slightly* faster.
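(The comparison above is ministat(1)-style output; if you want to produce
the same kind of table yourself, collect one elapsed-time value per line
into two files -- the names thresh1.times and thresh3.times below are just
made up -- and run them through ministat from the base system:)

# ministat thresh1.times thresh3.times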
--
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, ***@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/
Attilio Rao
2011-12-15 16:29:37 UTC
Permalink
Post by Mike Tancsa
Post by m***@FreeBSD.org
Has anyone experiencing problems tried to set sysctl kern.sched.steal_thresh=1 ?
I don't remember what our specific problem at $WORK was, perhaps it
was just interrupt threads not getting serviced fast enough, but we've
hard-coded this to 1 and removed the code that sets it in
sched_initticks().  The same effect should be had by setting the
sysctl after a box is up.
FWIW, this does impact the performance of pbzip2 on an i7. Using a 1.1G file
pbzip2 -v -c big > /dev/null
with burnP6 running in the background,
sysctl kern.sched.steal_thresh=1
vs
sysctl kern.sched.steal_thresh=3
   N           Min           Max        Median           Avg        Stddev
x  10     38.005022      38.42238     38.194648     38.165052    0.15546188
+   9     38.695417     40.595544     39.392127     39.435384    0.59814114
Difference at 95.0% confidence
       1.27033 +/- 0.412636
       3.32852% +/- 1.08119%
       (Student's t, pooled s = 0.425627)
a value of 1 is *slightly* faster.
Hi Mike,
was that just the same codebase with the switch SCHED_4BSD/SCHED_ULE?

Also, the results here should be in the 3% interval for the avg case,
which is not yet at the 'alarm level' but could still be an
indication.
I still suspect I/O plays a big role here, however, thus it could be
determined by other factors.

Could you retry the bench checking CPU usage and possible thread
migration around for both cases?

Thanks,
Attilio
--
Peace can only be achieved by understanding - A. Einstein
Marcus Reid
2011-12-13 23:22:44 UTC
Permalink
Post by Doug Barton
Post by O. Hartmann
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD?
I complained about poor interactive performance of ULE in a desktop
environment for years. I had numerous people try to help, including
Jeff, with various tunables, dtrace'ing, etc. The cause of the problem
was never found.
The issues that I've seen with ULE on the desktop seem to be caused by X
taking up a steady amount of CPU, and being demoted from being an
"interactive" process. X then becomes the bottleneck for other
processes that would otherwise be "interactive". Try 'renice -20
<pid_of_X>' and see if that makes your problems go away.
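(A minimal sketch, assuming the X server process is named Xorg and shows
up in pgrep(1):)

# renice -20 -p `pgrep Xorg`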

Marcus
Ivan Klymenko
2011-12-13 23:45:44 UTC
Permalink
On Tue, 13 Dec 2011 23:02:15 +0000
Post by Marcus Reid
Post by Doug Barton
Post by O. Hartmann
Do we have any proof at hand for such cases where SCHED_ULE
performs much better than SCHED_4BSD?
I complained about poor interactive performance of ULE in a desktop
environment for years. I had numerous people try to help, including
Jeff, with various tunables, dtrace'ing, etc. The cause of the
problem was never found.
The issues that I've seen with ULE on the desktop seem to be caused
by X taking up a steady amount of CPU, and being demoted from being an
"interactive" process. X then becomes the bottleneck for other
processes that would otherwise be "interactive". Try 'renice -20
<pid_of_X>' and see if that makes your problems go away.
Why, then, is X not a bottleneck when using 4BSD?
Post by Marcus Reid
Marcus
George Mitchell
2011-12-14 01:39:49 UTC
Permalink
Post by Marcus Reid
[...]
The issues that I've seen with ULE on the desktop seem to be caused by X
taking up a steady amount of CPU, and being demoted from being an
"interactive" process. X then becomes the bottleneck for other
processes that would otherwise be "interactive". Try 'renice -20
<pid_of_X>' and see if that makes your problems go away.
Marcus
[...]
renice on X has no effect. Stopping my compute-bound dnetc process
immediately speeds everything up; restarting it slows it back down.
Post by Marcus Reid
[...]
Has anyone experiencing problems tried to set sysctl
kern.sched.steal_thresh=1 ?
Post by Marcus Reid
[...]
1 appears to be the default value for kern.sched.steal_thresh.

-- George Mitchell
Jeremy Chadwick
2011-12-13 07:37:48 UTC
Permalink
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
Within our department, we developed a highly scalable code for planetary
science purposes on imagery. It utilizes present GPUs via OpenCL if
present. Otherwise it grabs as many cores as it can.
By the end of this year I'll get a new desktop box based on Intels new
Sandy Bridge-E architecture with plenty of memory. If the colleague who
developed the code is willing performing some benchmarks on the same
hardware platform, we'll benchmark bot FreeBSD 9.0/10.0 and the most
recent Suse. For FreeBSD I intent also to look for performance with both
different schedulers available.
This is in no way shape or form the same kind of benchmark as what
you're planning to do, but I thought I'd throw it out there for folks to
take in as they see fit.

I know folks were focused mainly on buildworld.

I personally would find it interesting if someone with a higher-end
system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
same test (changing -jX to -j{numofcores} of course).
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |


sched_ule
===========
- time make -j2 buildworld
1689.831u 229.328s 18:46.20 170.4% 6566+2051k 432+4264io 4565pf+0w
- time make -j2 buildkernel
640.542u 87.737s 9:01.38 134.5% 6490+1920k 134+5968io 0pf+0w


sched_4bsd
============
- time make -j2 buildworld
1662.793u 206.908s 17:12.02 181.1% 6578+2054k 23750+4271io 6451pf+0w
- time make -j2 buildkernel
638.717u 76.146s 8:34.90 138.8% 6530+1927k 6415+5903io 0pf+0w


software
==========
* sched_ule test: FreeBSD 8.2-STABLE, Thu Dec 1 04:37:29 PST 2011
* sched_4bsd test: FreeBSD 8.2-STABLE, Mon Dec 12 22:42:54 PST 2011


hardware
==========
* Intel Core 2 Duo E8400, 3GHz
* Supermicro X7SBA
* 8GB ECC RAM (4x2GB), DDR2-800
* Intel 320-series SSD, 80GB: /, swap, /var, /tmp, /usr


tuning adjustments / etc.
===========================
* Before each scheduler test, system was rebooted to ensure I/O cache
and other whatnots were empty
* All filesystems stock UFS2 + SU (root is non-SU)
* All filesystems had tunefs -t enable applied to them
* powerd(8) in use, with two rc.conf variables (per CPU spec):

performance_cx_lowest="C2"
economy_cx_lowest="C2"

* loader.conf

kern.maxdsiz="2560M"
kern.dfldsiz="2560M"
kern.maxssiz="256M"
ahci_load="yes"
hint.p4tcc.0.disabled="1"
hint.acpi_throttle.0.disabled="1"
vfs.zfs.arc_max="5120M"

* make.conf

CPUTYPE?=core2

* src.conf

WITHOUT_INET6=true
WITHOUT_IPFILTER=true
WITHOUT_LIB32=true
WITHOUT_KERBEROS=true
WITHOUT_PAM_SUPPORT=true
WITHOUT_PROFILE=true
WITHOUT_SENDMAIL=true

* kernel configuration
- note: between kernel builds, config was changed to either use
SCHED_4BSD or SCHED_ULE respectively.

cpu HAMMER
ident GENERIC

makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols

options SCHED_4BSD # Classic BSD scheduler
#options SCHED_ULE # ULE scheduler
options PREEMPTION # Enable kernel thread preemption
options INET # InterNETworking
options FFS # Berkeley Fast Filesystem
options SOFTUPDATES # Enable FFS soft updates support
options UFS_ACL # Support for access control lists
options UFS_DIRHASH # Improve performance on big directories
options UFS_GJOURNAL # Enable gjournal-based UFS journaling
options MD_ROOT # MD is a potential root device
options NFSCLIENT # Network Filesystem Client
options NFSSERVER # Network Filesystem Server
options NFSLOCKD # Network Lock Manager
options NFS_ROOT # NFS usable as /, requires NFSCLIENT
options MSDOSFS # MSDOS Filesystem
options CD9660 # ISO 9660 Filesystem
options PROCFS # Process filesystem (requires PSEUDOFS)
options PSEUDOFS # Pseudo-filesystem framework
options GEOM_PART_GPT # GUID Partition Tables.
options GEOM_LABEL # Provides labelization
options COMPAT_43TTY # BSD 4.3 TTY compat (sgtty)
options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
options KTRACE # ktrace(1) support
options STACK # stack(9) support
options SYSVSHM # SYSV-style shared memory
options SYSVMSG # SYSV-style message queues
options SYSVSEM # SYSV-style semaphores
options P1003_1B_SEMAPHORES # POSIX-style semaphores
options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options PRINTF_BUFR_SIZE=128 # Prevent printf output being interspersed.
options KBD_INSTALL_CDEV # install a CDEV entry in /dev
options HWPMC_HOOKS # Necessary kernel hooks for hwpmc(4)
options AUDIT # Security event auditing
options MAC # TrustedBSD MAC Framework
options FLOWTABLE # per-cpu routing cache
#options KDTRACE_FRAME # Ensure frames are compiled in
#options KDTRACE_HOOKS # Kernel DTrace hooks
options INCLUDE_CONFIG_FILE # Include this file in kernel

# Make an SMP-capable kernel by default
options SMP # Symmetric MultiProcessor Kernel

# Debugging options
options BREAK_TO_DEBUGGER # Sending a serial BREAK drops to DDB
options ALT_BREAK_TO_DEBUGGER # Permit <CR>~<Ctrl-b> to drop to DDB
options KDB # Enable kernel debugger support
options KDB_TRACE # Print stack trace automatically on panic
options DDB # Support DDB
options DDB_NUMSYM # Print numeric value of symbols
options GDB # Support remote GDB

# CPU frequency control
device cpufreq

# Bus support.
device acpi
device pci

# Floppy drives
device fdc

# ATA and ATAPI devices
# NOTE: "device ata" is missing because we use the Modular ATA core
# to only include the ATA-related drivers we need (e.g. AHCI).
device atadisk # ATA disk drives
device ataraid # ATA RAID drives
device atapicd # ATAPI CDROM drives
options ATA_STATIC_ID # Static device numbering

# Modular ATA
device atacore # Core ATA functionality
device ataisa # ISA bus support
device atapci # PCI bus support; only generic chipset support
device ataahci # AHCI SATA
device ataintel # Intel

# SCSI peripherals
device scbus # SCSI bus (required for SCSI)
device da # Direct Access (disks)
device cd # CD
device pass # Passthrough device (direct SCSI access)
device ses # SCSI Environmental Services (and SAF-TE)
options CAMDEBUG # CAM debugging (camcontrol debug)

# atkbdc0 controls both the keyboard and the PS/2 mouse
device atkbdc # AT keyboard controller
device atkbd # AT keyboard
device psm # PS/2 mouse

device kbdmux # keyboard multiplexer

device vga # VGA video card driver

device splash # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device sc

device agp # support several AGP chipsets

# Serial (COM) ports
device uart # Generic UART driver

# PCI Ethernet NICs.
device em # Intel PRO/1000 Gigabit Ethernet Family

# Wireless NIC cards
device wlan # 802.11 support
options IEEE80211_DEBUG # enable debug msgs
options IEEE80211_AMPDU_AGE # age frames in AMPDU reorder q's
device wlan_wep # 802.11 WEP support
device wlan_ccmp # 802.11 CCMP support
device wlan_tkip # 802.11 TKIP support
device wlan_amrr # AMRR transmit rate control algorithm
device wlan_acl # MAC Access Control List support

# Pseudo devices.
device loop # Network loopback
device random # Entropy device
device ether # Ethernet support
device pty # BSD-style compatibility pseudo ttys
device md # Memory "disks"
device gif # IPv6 and IPv4 tunneling
device faith # IPv6-to-IPv4 relaying (translation)
device firmware # firmware assist module

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device bpf # Berkeley packet filter

# USB support
device uhci # UHCI PCI->USB interface
device ohci # OHCI PCI->USB interface
device ehci # EHCI PCI->USB interface (USB 2.0)
device usb # USB Bus (required)
#device udbp # USB Double Bulk Pipe devices
device uhid # "Human Interface Devices"
device ukbd # Keyboard
device umass # Disks/Mass storage - Requires scbus and da
device ums # Mouse

# Intel Core/Core2Duo CPU temperature monitoring driver
device coretemp

# SMBus support, needed for bsdhwmon
device smbus
device smb
device ichsmb

# Intel ICH hardware watchdog support
device ichwd

# pf ALTQ support
options ALTQ
options ALTQ_CBQ # Class Based Queueing
options ALTQ_RED # Random Early Detection
options ALTQ_RIO # RED In/Out
options ALTQ_HFSC # Hierarchical Packet Scheduler
options ALTQ_CDNR # Traffic conditioner
options ALTQ_PRIQ # Priority Queueing
options ALTQ_NOPCC # Required for SMP build
Daniel Kalchev
2011-12-13 08:15:09 UTC
Permalink
Post by Jeremy Chadwick
I personally would find it interesting if someone with a higher-end
system (e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the
same test (changing -jX to -j{numofcores} of course).
Is 4 way 8 core Opteron ok? That is 32 cores, 64GB RAM.

Testing with buildworld in my opinion is not adequate, as it involves
way too much I/O. Any advice on proper testing methodology?

These systems run ZFS, but could be booted diskless for the tests.

Daniel
Attilio Rao
2011-12-15 16:31:14 UTC
Permalink
Post by Daniel Kalchev
I personally would find it interesting if someone with a higher-end system
(e.g. 2 physical CPUs, with 6 or 8 cores per CPU) was to do the same test
(changing -jX to -j{numofcores} of course).
Is 4 way 8 core Opteron ok? That is 32 cores, 64GB RAM.
Testing with buildworld in my opinion is not adequate, as it involves way
too much I/O. Any advice on proper testing methodology?
I'm sure that I/O and pmap subsystem contention (because of
buildworld) and TLB shootdown overhead (because of 32 CPUs) will be so
overwhelming that you are not really going to benchmark the scheduler
activity at all.

However I still don't get what you want to verify exactly?

Thanks,
Attilio
--
Peace can only be achieved by understanding - A. Einstein
O. Hartmann
2011-12-13 11:14:48 UTC
Permalink
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It's all a little old now but some of the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too; reading through Jeff's blog is worth it as it has some
interesting stuff on SCHED_ULE.
I thought there were some more benchmarks floating around but can't find
any with a quick google.
Vince
Interesting, there seems to have been a much more performant scheduler in 7.0,
called SCHED_SMP. I have some faint recollection of that ... where has this
beast gone?

Oliver
Jeremy Chadwick
2011-12-13 12:49:31 UTC
Permalink
Post by O. Hartmann
Post by O. Hartmann
Post by Volodymyr Kostyrko
Not fully right, boinc defaults to run on idprio 31 so this isn't an
issue. And yes, there are cases where SCHED_ULE shows much better
performance then SCHED_4BSD. [...]
Do we have any proof at hand for such cases where SCHED_ULE performs
much better than SCHED_4BSD? Whenever the subject comes up, it is
mentioned, that SCHED_ULE has better performance on boxes with a ncpu >
2. But in the end I see here contradictionary statements. People
complain about poor performance (especially in scientific environments),
and other give contra not being the case.
It's all a little old now but some of the stuff in
http://people.freebsd.org/~kris/scaling/
covers improvements that were seen.
http://jeffr-tech.livejournal.com/5705.html
shows a little too; reading through Jeff's blog is worth it as it has some
interesting stuff on SCHED_ULE.
I thought there were some more benchmarks floating around but can't find
any with a quick google.
Vince
Interesting, there seems to be a much more performant scheduler in 7.0,
called SCHED_SMP. I have some faint recalls on that ... where is this
beast gone?
Boy I sure hope I remember this right. I strongly urge others to
correct me where I'm wrong; thanks in advance!

The classic scheduler, SCHED_4BSD, was implemented back before there was
oxygen. sched_4bsd(4) mentions this. No need to discuss it.

Jeff Roberson began working on the "first-generation ULE scheduler"
during the days of FreeBSD 5.x (I believe 5.1), and a paper on it was
presented at USENIX circa 2003:
http://www.usenix.org/event/bsdcon03/tech/full_papers/roberson/roberson.pdf

Over the following years, Jeff (and others I assume -- maybe folks like
George Neville-Neil and/or Kirk McKusick?) adjusted and tinkered with
some of the semantics and models/methods. If I remember right, some of
these quirks/fixes were committed. All of this was happening under the
scheduler that was then called SCHED_ULE, but it was "ULE 1.0" for lack
of better terminology.

This scheduler did not perform well, if I remember right, and Jeff was
quite honest about that. From this point forward, Jeff began idealising
and working on a scheduler which he called SCHED_SMP -- think of it as
"ULE 2.0", again, for lack of better terminology. It was different than
the existing SCHED_ULE scheduler, hence a different name. Jeff blogged
about this in early 2007, using exactly that term ("ULE 2.0"):
http://jeffr-tech.livejournal.com/3729.html

In mid-2007, prior to FreeBSD 7.0-RELEASE, Jeff announced that
effectively he wanted to make SCHED_ULE do what SCHED_SMP did, and
provided a patch to SCHED_ULE to accomplish just that:
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2007-07/msg00755.html

Full thread is here (beware -- many replies):
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2007-07/threads.html#00755

The patch mentioned above was merged into HEAD on 2007/07/19.
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/sched_ule.c#rev1.202

So in effect, as of 2007/07/19, SCHED_ULE became SCHED_SMP.

FreeBSD 7.0-RELEASE was released on 2008/02/27, and the above
commit/changes were available at that time as well (meaning: RELENG_7
and RELENG_7_0 at that moment in time should have included the patch
from the above paragraph).

The document released by Kris Kennaway hinted at those changes and
performance improvements:
http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf

Keep in mind, however, that at that time kernel configuration files
(GENERIC, etc.) still defaulted to SCHED_4BSD.

The default scheduler in kernel config files (GENERIC, etc.) for i386
and amd64 (not sure about others) was changed in 2007/10/19:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/i386/conf/GENERIC#rev1.475
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/conf/GENERIC#rev1.485

This was done *prior* to FreeBSD 7.1-RELEASE. So, it first became
available as the default scheduler "for the masses" when 7.1-RELEASE
came out on 2009/01/05.

"All of the answers", in a roundabout and non-user-friendly way, are
available by examining the commit history for src/sys/kern/sched_ule.c.
It's hard to follow especially given that you have to consider all
the releases/branchpoints that took place over time, but:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/sched_ule.c

Are we having fun yet? :-)
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
Attilio Rao
2011-12-09 15:49:38 UTC
Permalink
dnetc is an open-source program from http://www.distributed.net/.  It
tries a brute-force approach to cracking RC4 puzzles and also computes
optimal Golomb rulers.  It starts up one process per CPU and runs at
nice 20 and is, for all intents and purposes, 100% compute bound.
Here is what happens on my system, running 9.0-PRERELEASE, with and
without dnetc running, with SCHED_ULE and SCHED-4BSD, when I run the
time make buildkernel KERNCONF=WONDERLAND
(I get similar results on 8.x as well.)
1329.715u 123.739s 24:47.95 97.6%       6310+1987k 11233+11098io 419pf+0w
1329.364u 115.158s 26:14.83 91.7%       6325+1987k 10912+11060io 393pf+0w
1357.457u 121.526s 25:20.64 97.2%       6326+1990k 11234+11149io 419pf+0w
Still going after seven and a half hours of clock time, up to
compiling netgraph/bluetooth.  (Completed in another five minutes
after stopping dnetc so I could write this message in a reasonable
amount of time.)
Not everybody runs this sort of program, but there are plenty of
similar projects out there, and people who try to participate in
them will be mightily displeased with their FreeBSD systems when
they do.  Is there some case where SCHED_ULE exhibits significantly
better performance than SCHED_4BSD?  If not, I think SCHED-4BSD
should remain the default GENERIC configuration until this is fixed.
Hi George,
are you interested in exploring more the case with SCHED_ULE and dnetc?

More precisely I'd be interested in KTR traces.
To be even more precise:
With a completely stable GENERIC configuration (or otherwise please
post your kernel config) please add the following:
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

While you are in the middle of the slow-down (so once it is well
established) please do:
# sysctl debug.ktr.cpumask=""

In the end go with:
# ktrdump -ctf > ktr-ule-problem.out

and send the file to this mailing list.

Thanks,
Attilio
--
Peace can only be achieved by understanding - A. Einstein
George Mitchell
2011-12-10 00:58:50 UTC
Permalink
Post by Attilio Rao
[...]
More precisely I'd be interested in KTR traces.
With a completely stable GENERIC configuration (or otherwise please
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)
While you are in the middle of the slow-down (so once it is well
# sysctl debug.ktr.cpumask=""
wonderland# sysctl debug.ktr.cpumask=""
debug.ktr.cpumask: ffffffffffffffff
sysctl: debug.ktr.cpumask: Invalid argument
Post by Attilio Rao
# ktrdump -ctf> ktr-ule-problem.out
It's 44MB, so it's at http://www.m5p.com/~george/ktr-ule-problem.out
Post by Attilio Rao
and send the file to this mailing list.
Thanks,
Attilio
I hope this helps. -- George Mitchell
Attilio Rao
2011-12-10 01:01:06 UTC
Permalink
Post by George Mitchell
Post by Attilio Rao
[...]
More precisely I'd be interested in KTR traces.
With a completely stable GENERIC configuration (or otherwise please
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)
While you are in the middle of the slow-down (so once it is well
# sysctl debug.ktr.cpumask=""
wonderland# sysctl debug.ktr.cpumask=""
debug.ktr.cpumask: ffffffffffffffff
sysctl: debug.ktr.cpumask: Invalid argument
Post by Attilio Rao
# ktrdump -ctf>  ktr-ule-problem.out
It's 44MB, so it's at http://www.m5p.com/~george/ktr-ule-problem.out
What svn revision did you use for it?
What is the CPUs frequencies of machines generating this?

Attilio
--
Peace can only be achieved by understanding - A. Einstein
George Mitchell
2011-12-10 01:17:35 UTC
Permalink
Post by Attilio Rao
[...]
What svn revision did you use for it?
What is the CPUs frequencies of machines generating this?
Attilio
Hope the attached helps. -- George Mitchell
Eitan Adler
2011-12-10 01:23:22 UTC
Permalink
Hope the attached helps.                         -- George Mitchell
You attached dmesg, not a patch.
--
Eitan Adler
Attilio Rao
2011-12-10 01:25:21 UTC
Permalink
Post by Eitan Adler
Hope the attached helps.                         -- George Mitchell
You attached dmesg, not a patch.
This is what is needed for a schedgraph analysis, along with KTR
points collection.
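For reference, the dump itself is then normally fed to the schedgraph
script from the source tree, roughly like this (assuming the ktrdump
output file from the earlier step and a checked-out /usr/src):

# python /usr/src/tools/sched/schedgraph.py ktr-ule-problem.out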

Attilio
--
Peace can only be achieved by understanding - A. Einstein
George Mitchell
2011-12-14 11:08:12 UTC
Permalink
Post by George Mitchell
Post by Attilio Rao
[...]
More precisely I'd be interested in KTR traces.
With a completely stable GENERIC configuration (or otherwise please
options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)
While you are in the middle of the slow-down (so once it is well
# sysctl debug.ktr.cpumask=""
wonderland# sysctl debug.ktr.cpumask=""
debug.ktr.cpumask: ffffffffffffffff
sysctl: debug.ktr.cpumask: Invalid argument
Post by Attilio Rao
# ktrdump -ctf> ktr-ule-problem.out
It's 44MB, so it's at http://www.m5p.com/~george/ktr-ule-problem.out
There have been 22 downloads of this file so far; does anyone who
looked at it have any results to report?

Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?

-- George Mitchell
Post by George Mitchell
Post by Attilio Rao
and send the file to this mailing list.
Thanks,
Attilio
I hope this helps. -- George Mitchell
Tom Evans
2011-12-14 17:55:39 UTC
Permalink
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, and brought forward more complaints about
interactivity in X (I've never noticed this, and use a FreeBSD desktop
daily).

On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my environment
is clearly the best default", but I think there are probably more
users typically running IO bound processes than CPU bound processes.

I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?

Cheers

Tom

References:

http://people.freebsd.org/~kris/scaling/mysql-freebsd.png
http://suckit.blog.hu/2009/10/05/freebsd_8_is_it_worth_to_upgrade
Marcus Reid
2011-12-14 19:54:36 UTC
Permalink
brought forward more complaints about interactivity in X (I've never
noticed this, and use a FreeBSD desktop daily).
... that was me, but I forgot to add that it almost never happens, and it
can only be triggered when there are processes that want to take up 100%
of the CPU running on the system along with X and friends.

Don't want to spread FUD, I've been happily using FreeBSD on the desktop
for a decade and ULE seems to work great.

Marcus
O. Hartmann
2011-12-14 23:41:39 UTC
Permalink
Post by Tom Evans
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, and brought forward more complaints about
interactivity in X (I've never noticed this, and use a FreeBSD desktop
daily).
I would highly appreciate a decision against SCHED_ULE as the default
scheduler! SCHED_4BSD is considered the more mature of the two, and it
seems that SCHED_ULE needs some refinement to achieve a comparable level
of quality.
Post by Tom Evans
On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my environment
is clearly the best default", but I think there are probably more
users typically running IO bound processes than CPU bound processes.
You compare SCHED_ULE on FBSD 8.1 with SCHED_4BSD on FBSD 7.0? Shouldn't
you compare SCHED_ULE and SCHED_4BSD on the very same platform?

Development of SCHED_ULE has been focused very much on DB workloads like
PostgreSQL, so the performance benefit is no wonder. But this is also a very
specific scenario where SCHED_ULE shows a real benefit compared to
SCHED_4BSD.
Post by Tom Evans
I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?
Many people more expert in the issue than myself have revealed issues
in the code of both SCHED_ULE and even SCHED_4BSD. It would be a pity
if all the discussion gets flushed away like a "toilet business", as
has happened all along in the past.


Well, I'd like to see a kind of "standardized" benchmark, like on
openbenchmark.org or at phoronix.com. I know that Phoronix' way of
performing benchmarks is questionable and does not reveal much of the
issues, but it is better than nothing. I'm always surprised by the poor
performance of FreeBSD when it comes to threaded I/O. The differences
between Linux and FreeBSD of the same development maturity are
tremendous and frightening!

It is a long time since I saw a SPEC benchmark on a FreeBSD-driven HPC
box. Most benchmarks around for testing hardware are performed with Linux,
and Linux seems to come out ahead in nearly every scenario. It would be
greatly appreciated and interesting to see how Linux and FreeBSD would
perform in SPEC on the same hardware platform. This is only an idea.
Without a suitable benchmark whose codebase is understood, the discussion
is in many respects pointless - both ways.
Post by Tom Evans
Cheers
Tom
http://people.freebsd.org/~kris/scaling/mysql-freebsd.png
http://suckit.blog.hu/2009/10/05/freebsd_8_is_it_worth_to_upgrade
Jeremy Chadwick
2011-12-15 00:43:28 UTC
Permalink
Post by O. Hartmann
Post by Tom Evans
On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my environment
is clearly the best default", but I think there are probably more
users typically running IO bound processes than CPU bound processes.
You compare SCHED_ULE on FBSD 8.1 with SCHED_4BSD on FBSD 7.0? Shouldn't
you compare SCHED_ULE and SCHED_4BSD on the very same platform?
Agreed -- this is a bad comparison. Again, I'm going to tell people to
do the one thing that's painful and nobody likes to do: *look at
commits* and pay close attention to the branches and any commits that
involve "tagging" for a release (so you can determine what "version" of
the code you might be running).

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/sched_ule.c
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/sched_4bsd.c

I'm a bit busy today, otherwise I would offer to go over the SCHED_4BSD
changes between 7.0-RELEASE and 8.1-RELEASE (I would need Tom to confirm
those are the exact versions being used; I wish people would stop saying
things like "FreeBSD x.y" because it's inaccurate). But the data is
there at the above URLs, including the committers/those involved.
Post by O. Hartmann
Post by Tom Evans
I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?
Replying to Tom's comment here:

It is already easy to switch schedulers. You change the option in your
kernel config, rebuild kernel (world isn't necessary as long as you
haven't csup'd between your last rebuild and now), make installkernel,
shutdown -r now, done.

If what you're proposing is to make the scheduler changeable at run
time, I think that would require a **lot** of work for something
that very few people would benefit from (please stop for a moment and
think about the majority of the userbase, not just niche environments; I
say this politely, not with any condescension BTW). Sure, it'd be
"nice to have", but should be extremely low on the priority list (IMO).
Post by O. Hartmann
Many people more experst in the issue than myself revealed some issues
in the code of both SCHED_ULE and even SCHED_4BSD. It would be a pitty
if all the discussions get flushed away like a "toilette-busisness" as
it has been done all the way in the past.
Gut feeling says this is what will happen, and that's because the people
who are (and have in the past been) touching the scheduler bits are not
involved in this conversation. We're not going to get anywhere unless
those people are involved and are available to make adjustments/etc. I
would love to start CC'ing them all, but I don't think that's
necessarily effective.

I will take the time to point out/remind folks that the number of people
who *truly understand* the schedulers are few and far between. We're
talking single-digit numbers, folks. And those people are already busy
enough as-is. This makes solving this problem difficult.

So, what I think WOULD be effective would be for someone to catalogue a
list of their systems/specifications/benchmarks/software/etc. that show
exactly where the problems are in their workspace when using ULE vs.
4BSD, or vice-versa. That may give the developers some leads as to how
to progress.

Let's also not forget about the compiler ordeal; gcc versions greatly
differ (some folks overwrite the default base gcc with ones in ports),
and then there's the clang stuff... Sigh.
Post by O. Hartmann
Well, I'd like to see a kind of "standardized" benchmark. Like on
openbenchmark.org or at phoronix.com. I know that Phoronix' way of
performing benchmarks is questionable and do not reveal much of the
issues, but it is better than nothing.
I would love to run such benchmarks on all of our systems, but I have no
idea what kind of benchmark suites/etc. would be beneficial for the
developers who maintain/touch the schedulers. You understand what I'm
saying? For example, some folks earlier in the thread said the best
thing to do for this would be buildworld, but then further follow-ups
from others said buildworld is not effective given the I/O demands.

Furthermore, I want whatever benchmark/app suite thing to be minimal as
hell. It should be standalone, no dependencies (or only 1 or 2).

Regarding threading: a colleague of mine, an ex-co-worker who now works at
Apple as a developer, wrote a C program while he was at my current
workplace which -- pardon my French -- "beat the shit out of our Solaris
boxes, thread-wise". It was customisable via command-line. The thing
got some of our Solaris machines up to load averages of nearly 42000
(yes you read that right!), and spit out some benchmark-esque results
when finished. I'll mention this thread to him, let him read it, and
see if he has anything to say. He is *extremely* busy (even more so
with the holiday coming up), so I have little faith he can/will help
here, but he may give the code for it if he still has it. I believe he
did have me run it on FreeBSD, but it was a long time ago.
Post by O. Hartmann
I'm always surprised by the worse
performance of FreeBSD when it comes to threaded I/O. The differences
between Linux and FreeBSD of the same development maturity are
tremendous and scaring!
Agreed. Linux has the upper hand in many areas, and this is one of
them. Please do not think this means "Linux is better, FreeBSD sucks".
It simply means that there are more people active/working on these
things in Linux. People need to use what works best for them!
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
p***@pluto.rain.com
2011-12-15 09:13:18 UTC
Permalink
Post by Jeremy Chadwick
It is already easy to switch schedulers. You change the
option in your kernel config, rebuild kernel (world isn't
necessary as long as you haven't csup'd between your last
rebuild and now), make installkernel, shutdown -r now,
done.
and you have thereby shot freebsd-update in the foot,
because you are no longer using a generic kernel.
Post by Jeremy Chadwick
If what you're proposing is to make the scheduler changeable
in real-time? I think that would require a **lot** of work
for something that very few people would benefit from ...
Switching on the fly sounds frightfully difficult, as long as
4BSD and ULE are separate code bases. (It might not be so bad
if a tunable or 3 could be added to ULE, so that it could be
configured to behave like 4BSD.)

However, the freebsd-update complication could in principle be
relieved by building both schedulers into the generic kernel,
with the choice being configurable in loader.conf. It would
still take a reboot to switch, but not a kernel rebuild. Of
course there may be practical issues, e.g. name collisions.
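For reference, a minimal sketch of both routes: the kernel-rebuild switch that exists today, and a purely hypothetical boot-time knob of the kind proposed above. No such loader.conf tunable exists at the time of writing, and the config name MYSCHED is made up for illustration.

# Today: a kernel config that differs from GENERIC only in the scheduler.
cd /usr/src/sys/amd64/conf
sed -e 's/SCHED_ULE/SCHED_4BSD/' GENERIC > MYSCHED    # MYSCHED is a made-up name
cd /usr/src
make buildkernel KERNCONF=MYSCHED && make installkernel KERNCONF=MYSCHED
shutdown -r now

# Hypothetical boot-time selection, if both schedulers were compiled in
# (does NOT exist today -- only what the proposal above might look like):
# /boot/loader.conf:
#   kern.sched.select="4BSD"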
Steven Hartland
2011-12-15 14:21:37 UTC
Permalink
With all the discussion I thought I'd give a buildworld
benchmark a go here on a spare 24 core machine. ULE
tested fine, but with 4BSD it won't even boot, panicking
with the following:
http://screensnapr.com/v/hwysGV.png

This is on a clean 8.2-RELEASE-p4.

Upgrading to RELENG_9 fixed this, but it's a bit concerning
that just changing the scheduler would cause the machine
to panic on boot.

It's only a single run so variance could be high, but here's
the result of a buildworld on this machine running the
two different schedulers:
4BSD: 24m54.10s real 2h43m12.42s user 56m20.07s sys
ULE: 23m54.68s real 2h34m59.04s user 50m59.91s sys

What really sticks out is that this is over double the
time of an 8.2 buildworld on the same machine with the
same kernel:
ULE: 11m12.76s real 1h27m59.39s user 28m59.57s sys

This was run with a 9.0-PRERELEASE kernel, due to 4BSD
panicking on boot under 8.2.

So for this use case, ULE vs 4BSD is neither here nor there,
but 9.0 buildworld being very slow (2x slower) compared
with 8.2 is the bigger question in my mind.

Regards
Steve

Lars Engels
2011-12-15 15:06:43 UTC
Permalink
Post by Steven Hartland
With all the discussion I thought I'd give a buildworld
benchmark a go here on a spare 24 core machine. ULE
tested fine, but with 4BSD it won't even boot, panicking
with the following:-
http://screensnapr.com/v/hwysGV.png
This is on a clean 8.2-RELEASE-p4
Upgrading to RELENG_9 fixed this, but it's a bit concerning
that just changing the scheduler would cause the machine
to panic on boot.
It's only a single run so variance could be high, but here's
the result of a buildworld on this machine running the
two different schedulers:-
4BSD: 24m54.10s real 2h43m12.42s user 56m20.07s sys
ULE: 23m54.68s real 2h34m59.04s user 50m59.91s sys
What really sticks out is that this is over double that
of an 8.2 buildworld on the same machine with the same
kernel
ULE: 11m12.76s real 1h27m59.39s user 28m59.57s sys
9.0 ships with gcc and clang which both need to be compiled, 8.2 only
has gcc.
Post by Steven Hartland
This was run with a 9.0-PRERELEASE kernel due to 4BSD panicking
on boot under 8.2.
So for this use ULE vs 4BSD is neither here-nor-there
but 9.0 buildworld is very slow (x2 slower) compared
with 8.2, which is the bigger question in my mind.
Regards
Steve
Steven Hartland
2011-12-15 15:33:23 UTC
Permalink
Post by Lars Engels
9.0 ships with gcc and clang which both need to be compiled, 8.2 only
has gcc.
Ahh, any reason we need both, and is it possible to disable clang?

Regards
Steve

Eitan Adler
2011-12-15 15:44:22 UTC
Permalink
On Thu, Dec 15, 2011 at 10:32 AM, Steven Hartland
Post by Steven Hartland
Post by Lars Engels
9.0 ships with gcc and clang which both need to be compiled, 8.2 only
has gcc.
Ahh, any reason we need both, and is it possible to disable clang?
man src.conf
add WITHOUT_CLANG=yes to /etc/src.conf
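A minimal sketch of what that looks like; WITHOUT_CLANG only matters on 9.x, where clang is part of the base system, and the -j value is just an example for the 24-core box discussed above:

# /etc/src.conf -- skip building clang as part of buildworld
WITHOUT_CLANG=yes

# then re-time the build for a number comparable to 8.2:
cd /usr/src && time make -j24 buildworld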
--
Eitan Adler
Matthew Seaman
2011-12-15 09:14:34 UTC
Permalink
Post by Jeremy Chadwick
It is already easy to switch schedulers. You change the option in your
kernel config, rebuild kernel (world isn't necessary as long as you
haven't csup'd between your last rebuild and now), make installkernel,
shutdown -r now, done.
If what you're proposing is to make the scheduler changeable in
real-time? I think that would require a **lot** of work for something
that very few people would benefit from (please stop for a moment and
think about the majority of the userbase, not just niche environments; I
say this politely, not with any condescension BTW). Sure, it'd be
"nice to have", but should be extremely low on the priority list (IMO).
Somewhere in between might be a good idea it seems to me: viz, change a
setting in loader.conf and reboot to switch to a new scheduler. Having
to juggle different kernels is no big deal for the likes of you and me,
but it is quite a barrier in many environments.

Cheers,

Matthew
--
Dr Matthew J Seaman MA, D.Phil. 7 Priory Courtyard
Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey Ramsgate
JID: ***@infracaninophile.co.uk Kent, CT11 9PW
Tom Evans
2011-12-15 09:46:09 UTC
Permalink
On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
Post by Tom Evans
I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?
It is already easy to switch schedulers.  You change the option in your
kernel config, rebuild kernel (world isn't necessary as long as you
haven't csup'd between your last rebuild and now), make installkernel,
shutdown -r now, done.
Your definition of 'easy' differs wildly from mine. How is that in any
way 'easy' to do across 200 servers?
If what you're proposing is to make the scheduler changeable in
real-time?  I think that would require a **lot** of work for something
that very few people would benefit from (please stop for a moment and
think about the majority of the userbase, not just niche environments; I
say this politely, not with any condescension BTW).  Sure, it'd be
"nice to have", but should be extremely low on the priority list (IMO).
Real time scheduler changing would be insane! I was thinking that
both/any/all schedulers could be compiled into the kernel, and the
choice of which one to use becomes a boot time configuration. You
don't have to recompile the kernel to change timecounter.

Cheers

Tom
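To make the timecounter analogy concrete (these sysctls are real; HPET is only an example value, assuming the machine exposes one):

# list the timecounters the kernel found, and the one currently in use
sysctl kern.timecounter.choice
sysctl kern.timecounter.hardware

# switch at runtime -- no rebuild, no reboot
sysctl kern.timecounter.hardware=HPET

# persist across reboots in /etc/sysctl.conf
#   kern.timecounter.hardware=HPET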
Oliver Pinter
2011-12-15 02:34:11 UTC
Permalink
Post by O. Hartmann
Post by Tom Evans
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, and brought forward more complaints about
interactivity in X (I've never noticed this, and use a FreeBSD desktop
daily).
I would highly appreciate a decision against SCHED_ULE as the default
scheduler! SCHED_4BSD is considered a more mature entity and obviously
it seems that SCHED_ULE needs some refinements to achieve a better level
of quality.
Post by Tom Evans
On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my environment
is clearly the best default", but I think there are probably more
users typically running IO bound processes than CPU bound processes.
You compare SCHED_ULE on FBSD 8.1 with SCHED_4BSD on FBSD 7.0? Shouldn't
you compare SCHED_ULE and SCHED_4BSD on the very same platform?
Development of SCHED_ULE has been focused very much on DB like
PostgreSQL, no wonder the performance benefit. But this is also a very
specific scenario where SCHED_ULE shows a real benefit compared to
SCHED_4BSD.
Post by Tom Evans
I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?
Many people more expert in the issue than myself revealed some issues
in the code of both SCHED_ULE and even SCHED_4BSD. It would be a pity
if all the discussions got flushed away like "toilet business", as
has been done all the way in the past.
Well, I'd like to see a kind of "standardized" benchmark. Like on
openbenchmark.org or at phoronix.com. I know that Phoronix' way of
performing benchmarks is questionable and does not reveal much of the
issues, but it is better than nothing. I'm always surprised by the poor
performance of FreeBSD when it comes to threaded I/O. The differences
between Linux and FreeBSD of the same development maturity are
tremendous and scary!
It is a long time since I saw a SPEC benchmark on a FreeBSD driven HPC
box. Most benchmark around for testing hardware are performed with Linux
and Linux seems to make the race in nearly every scenario. It would be
highly appreciable and interesting to see how Linux and FreeBSD would
perform in SPEC on the same hardware platform. This is only an idea.
Without a suitable benchmark with a codebase understood the discussion
is in many aspects pointless -both ways.
Post by Tom Evans
Cheers
Tom
http://people.freebsd.org/~kris/scaling/mysql-freebsd.png
http://suckit.blog.hu/2009/10/05/freebsd_8_is_it_worth_to_upgrade
Hi!

Can you try with these settings:

***@opn ~> sysctl kern.sched.
kern.sched.cpusetsize: 8
kern.sched.preemption: 0
kern.sched.name: ULE
kern.sched.slice: 13
kern.sched.interact: 30
kern.sched.preempt_thresh: 224
kern.sched.static_boost: 152
kern.sched.idlespins: 10000
kern.sched.idlespinthresh: 16
kern.sched.affinity: 1
kern.sched.balance: 1
kern.sched.balance_interval: 133
kern.sched.steal_htt: 1
kern.sched.steal_idle: 1
kern.sched.steal_thresh: 1
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="2" mask="3">0, 1</cpu>
<children>
<group level="2" cache-level="2">
<cpu count="2" mask="3">0, 1</cpu>
</group>
</children>
</group>
</groups>

Most of them are from 7-STABLE settings, and with this, it "works for me".
This is a laptop with a Core 2 Duo CPU (with powerd enabled), and my kernel
config is here:
http://oliverp.teteny.bme.hu/freebsd/kernel_conf
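For anyone who wants to experiment with values like these, a minimal sketch of persisting the runtime-tunable ones. Note that kern.sched.cpusetsize only shows up on newer branches (as discussed below), and kern.sched.preemption reflects the PREEMPTION kernel option rather than being a runtime switch, so that part still needs a kernel rebuild:

# /etc/sysctl.conf -- ULE knobs that can be changed without a rebuild
kern.sched.preempt_thresh=224
kern.sched.interact=30
kern.sched.slice=13
kern.sched.steal_thresh=1

# for the PREEMPTION part, in a custom kernel config that includes GENERIC:
#   nooptions PREEMPTION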
Jeremy Chadwick
2011-12-15 02:43:39 UTC
Permalink
Post by Oliver Pinter
Post by O. Hartmann
Post by Tom Evans
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, and brought forward more complaints about
interactivity in X (I've never noticed this, and use a FreeBSD desktop
daily).
I would highly appreciate a decission against SCHED_ULE as the default
scheduler! SCHED_4BSD is considered a more mature entity and obviously
it seems that SCHED_ULE needs some refinements to achieve a better level
of quality.
Post by Tom Evans
On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my environment
is clearly the best default", but I think there are probably more
users typically running IO bound processes than CPU bound processes.
You compare SCHED_ULE on FBSD 8.1 with SCHED_4BSD on FBSD 7.0? Shouldn't
you compare SCHED_ULE and SCHED_4BSD on the very same platform?
Development of SCHED_ULE has been focused very much on DB like
PostgreSQL, no wonder the performance benefit. But this is also a very
specific scneario where SCHED_ULE shows a real benefit compared to
SCHED_4BSD.
Post by Tom Evans
I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?
Many people more experst in the issue than myself revealed some issues
in the code of both SCHED_ULE and even SCHED_4BSD. It would be a pitty
if all the discussions get flushed away like a "toilette-busisness" as
it has been done all the way in the past.
Well, I'd like to see a kind of "standardized" benchmark. Like on
openbenchmark.org or at phoronix.com. I know that Phoronix' way of
performing benchmarks is questionable and do not reveal much of the
issues, but it is better than nothing. I'm always surprised by the worse
performance of FreeBSD when it comes to threaded I/O. The differences
between Linux and FreeBSD of the same development maturity are
tremendous and scaring!
It is a long time since I saw a SPEC benchmark on a FreeBSD driven HPC
box. Most benchmark around for testing hardware are performed with Linux
and Linux seems to make the race in nearly every scenario. It would be
highly appreciable and interesting to see how Linux and FreeBSD would
perform in SPEC on the same hardware platform. This is only an idea.
Without a suitable benchmark with a codebase understood the discussion
is in many aspects pointless -both ways.
Post by Tom Evans
Cheers
Tom
http://people.freebsd.org/~kris/scaling/mysql-freebsd.png
http://suckit.blog.hu/2009/10/05/freebsd_8_is_it_worth_to_upgrade
Hi!
I'm replying with a list of each setting which differs compared to
RELENG_8 stock on our ULE systems. Note that our ULE systems are 1
physical CPU with 4 cores.
Post by Oliver Pinter
kern.sched.cpusetsize: 8
I see no such tunable/sysctl on any of our RELENG_8 and RELENG_7
systems. Nor do I find any references to it in /usr/src (on any
system). Is this a RELENG_9 setting? Please explain where it comes
from. I hope it's not a custom kernel patch...
Post by Oliver Pinter
kern.sched.preemption: 0
This differs; default value is 1.
Post by Oliver Pinter
kern.sched.name: ULE
kern.sched.slice: 13
kern.sched.interact: 30
kern.sched.preempt_thresh: 224
This differs; default value is 64. The "magic value" of 224 has been
discussed in the past, in this thread even.
Post by Oliver Pinter
kern.sched.static_boost: 152
This differs; on our systems it's 160.
Post by Oliver Pinter
kern.sched.idlespins: 10000
kern.sched.idlespinthresh: 16
This differs; on our systems it's 4.
Post by Oliver Pinter
Most of them from 7-STABLE settings, and with this, "works for me".
This an laptop with core2 duo cpu (with enabled powerd), and my kernel
http://oliverp.teteny.bme.hu/freebsd/kernel_conf
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
O. Hartmann
2011-12-15 07:35:35 UTC
Permalink
Just saw this short benchmark on phoronix.com today:

http://www.phoronix.com/scan.php?page=news_item&px=MTAyNzA

It may be worth discussing the sad performance of FreeBSD in some parts of
the benchmark. A difference of a factor of 10 or 100 is far beyond
disappointing; it is unacceptable, and just reading those benchmarks
makes me want to drop the idea of using FreeBSD even as a backend
server in scientific and business environments. In detail, some of the
SciMark benches look disappointing. The fact that FreeBSD performs
better in C-Ray can't make up for the overall picture.

From the compiler alone, I'd expect a drop of maybe 10 - 15% in
performance, but not a factor of 10 or 100.

I was just thinking about the SCHED_ULE discussion and all the sore
spots we covered when I stumbled over the test.

Regards,
Oliver
Adrian Chadd
2011-12-15 07:43:36 UTC
Permalink
Post by O. Hartmann
http://www.phoronix.com/scan.php?page=news_item&px=MTAyNzA
It may be worth to discuss the sad performance of FBSD in some parts of
the benchmark. A difference of a factor 10 or 100 is simply far beyond
Well, the only way it's going to get fixed is if someone sits down,
replicates it, and starts to document exactly what it is that these
benchmarks are/aren't doing.

Sometimes it's because the benchmark is very much tickling things
incorrectly. In a lot of cases though, the benchmark is testing
something synthetic that Linux just happens to have micro-optimised.

So if you care about this a lot, someone needs to stand up, work with
Phronix to get some actual feedback about what's going on, and see if
it can be fixed. Maybe you'll find ULE is broken in some instances; I
bet you'll find something like "the disk driver is suboptimal." For
example, I remember seeing someone mess up a test because they split
their filesystems across raid5 boundaries, and this was hidden by the
choice of raid controller and stripe size. This made FreeBSD look
worse; when this was corrected for, it sped up far past Linux.



Adrian
Samuel J. Greear
2011-12-15 12:58:25 UTC
Permalink
Post by Adrian Chadd
Well, the only way it's going to get fixed is if someone sits down,
replicates it, and starts to document exactly what it is that these
benchmarks are/aren't doing.
I think you will find that investigation is largely a waste of time,
because not only are some of these benchmarks just downright silly,
there are huge differences in the environments (compiler versions),
etc., etc., leading to a largely apples/oranges comparison. But also,
the analysis and reporting of the results by Phoronix is simply
moronic to the point of being worse than useless; they are spreading
misinformation.

Take the first test as an example, Blogbench read. This doesn't raise
any red flags, right? At least not until you realize that Blogbench
isn't a read test, it's a read/write test. So what they have done here
is run a read/write test and then thrown away the write results for
both platforms and reported only the read results. If you dig down
into the actual results,
http://openbenchmarking.org/result/1112113-AR-ORACLELIN37 -- you will
see two Blogbench numbers, one for read and another for write. These
were both taken from the same Blogbench run, so FreeBSD optimizes
writes over reads, that's probably a good thing for your data but a
bad thing when someone totally misrepresents benchmark results.

Other benchmarks in the Phoronix suite and their representations are
similarly flawed, _ALL_ of these results should be ignored and no time
should be wasted by any FreeBSD committer further evaluating this
garbage. (Yes, I have been down this rabbit hole).

Best,
Sam
Jeremy Chadwick
2011-12-15 13:50:13 UTC
Permalink
Post by Samuel J. Greear
Post by Adrian Chadd
Well, the only way it's going to get fixed is if someone sits down,
replicates it, and starts to document exactly what it is that these
benchmarks are/aren't doing.
I think you will find that investigation is largely a waste of time,
because not only are some of these benchmarks just downright silly,
there are huge differences in the environments (compiler versions),
etc., etc. leading to a largely apples/oranges comparison. But also
the the analysis and reporting of the results by Phoronix is simply
moronic to the point of being worse than useful, they are spreading
misinformation.
Take the first test as an example, Blogbench read. This doesn't raise
any red flags, right? At least not until you realize that Blogbench
isn't a read test, it's a read/write test. So what they have done here
is run a read/write test and then thrown away the write results for
both platforms and reported only the read results. If you dig down
into the actual results,
http://openbenchmarking.org/result/1112113-AR-ORACLELIN37 -- you will
see two Blogbench numbers, one for read and another for write. These
were both taken from the same Blogbench run, so FreeBSD optimizes
writes over reads, that's probably a good thing for your data but a
bad thing when someone totally misrepresents benchmark results.
Other benchmarks in the Phoronix suite and their representations are
similarly flawed, _ALL_ of these results should be ignored and no time
should be wasted by any FreeBSD committer further evaluating this
garbage. (Yes, I have been down this rabbit hole).
For sake of argument, let's say we throw out the Phoronix benchmarks as
a data source (I don't think the benchmark specifically implied or
stated "this is all because of SCHED_ULE" though; remember, that's what
we're supposed to be focusing on. There may not be a direct correlation
between the Phoronix benchmarks and the ULE issue reported here...).
That said: thrown out, data ignored, done.

Now what? Where are we? We're right back where we were a day or two
ago; meaning no closer to solving the dilemma reported by users and
SCHED_ULE. Heck, we're not even sure if there is an issue, other than
some folks confirming that SCHED_4BSD performs better for them (that's
what started this whole thread), and there are at least a couple which
have stated this.

So given the above semi-devil's-advocate response -- Sam, do you have
something positive or progressive to offer so we can move forward on the
ULE vs. 4BSD debacle? :-) The smiley is meant to be sincere, not
sarcastic.

I'm getting to the point where I'm considering formulating a private
mail to Jeff Roberson, requesting that he be aware of the discussion
that's happening (not that he necessarily follow or read it), and that
based on what I can tell we're at a roadblock -- nobody so far is
absolutely certain how to "benchmark" and compare ULE vs. 4BSD in
multiple ways, so that those of us involved here can run such utilities
and provide the data somewhere central for devs to review. I only
mention this because so far I haven't seen anyone really say "okay, this
is what we should be using for these kinds of tests". Yay nature of the
beast.
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
Daniel Kalchev
2011-12-15 13:59:35 UTC
Permalink
On Dec 15, 2011, at 3:48 PM, Jeremy Chadwick wrote:

[…]
Post by Jeremy Chadwick
That said: thrown out, data ignored, done.
Now what? Where are we? We're right back where we were a day or two
ago; meaning no closer to solving the dilemma reported by users and
SCHED_ULE. Heck, we're not even sure if there is an issue, other than
some folks confirming that SCHED_4BSD performs better for them (that's
what started this whole thread), and there are at least a couple which
have stated this.
But, are any of these benchmarks really engaging the 4BSD/ULE scheduler differences? Most such benchmarks are run on a system with no other load whatsoever and in no way represent real world experience.

What is more, I believe "the system feels sluggish" is not measured at all in such benchmarks. Even if it were, a run that finishes "better" -- that is, faster, even if it freezes the system for the user for the duration of the test -- will be counted as a "win", because the benchmark suite ran faster on that particular system; whereas a system that stayed responsive for the user but finished the benchmark a bit slower would be counted as the "loser".

I think it is not a good idea to hijack this thread; instead we should focus on the other SCHED_ULE bashing thread to define a reasonable benchmark, or rather a set of benchmarks, so that many people would run it and provide feedback.


Daniel
Volodymyr Kostyrko
2011-12-15 14:38:32 UTC
Permalink
Post by Jeremy Chadwick
I'm getting to the point where I'm considering formulating a private
mail to Jeff Roberson, requesting that he be aware of the discussion
that's happening (not that he necessarily follow or read it), and that
based on what I can tell we're at a roadblock -- nobody so far is
absolutely certain how to "benchmark" and compare ULE vs. 4BSD in
multiple ways, so that those of us involved here can run such utilities
and provide the data somewhere central for devs to review. I only
mention this because so far I haven't seen anyone really say "okay, this
is what we should be using for these kinds of tests". Yay nature of the
beast.
I'll try to summarize and propose a test scenario. I don't know whether
this helps or not.

We should have two different task types for this one. The first would be
Super Affine tasks. They should use few to no syscalls, do a medium
amount of math, and have a low memory footprint. No syscalls means these
tasks will never block on memory/disk or other activity, so every time
the queue is examined such a task will be ready to run. Medium math
means it shouldn't be just a simple big loop, so that the processor
really computes something with the data. Low memory footprint means the
task and its data can reside in the CPU's L1 cache for eons. I'm not
sure about branch prediction, whether it should be perturbed or not...

The other task type would be Worker. It doesn't matter what it does, but
it aggressively uses syscalls, e.g. working with files/directories.

There should be at least one SA-task per core and at least 10 (?)
W-tasks per core.
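A rough sh sketch of such a load generator, just to make the proposal concrete; the awk loop stands in for the compute-bound SA-task and the touch/stat/rm loop for the syscall-heavy Worker, and all counts and iteration sizes are arbitrary:

#!/bin/sh
# Toy load generator: one compute-bound "SA" task per core plus ten
# syscall-heavy "Worker" tasks per core, all started in the background.
NCPU=$(sysctl -n hw.ncpu)

i=0
while [ $i -lt $NCPU ]; do
    # SA task: medium math, tiny working set, essentially no syscalls
    awk 'BEGIN { x = 0; for (n = 0; n < 200000000; n++) x += sin(n) }' &
    i=$((i + 1))
done

w=0
while [ $w -lt $((NCPU * 10)) ]; do
    # Worker task: hammer the filesystem with create/stat/unlink
    ( d=/tmp/worker.$w; mkdir -p "$d"
      n=0
      while [ $n -lt 5000 ]; do
          touch "$d/f.$n"; stat "$d/f.$n" > /dev/null; rm "$d/f.$n"
          n=$((n + 1))
      done
      rmdir "$d" ) &
    w=$((w + 1))
done

wait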
--
Sphinx of black quartz judge my vow.
Michael Ross
2011-12-15 09:07:44 UTC
Permalink
Am 15.12.2011, 08:32 Uhr, schrieb O. Hartmann
Post by O. Hartmann
http://www.phoronix.com/scan.php?page=news_item&px=MTAyNzA
It may be worth to discuss the sad performance of FBSD in some parts of
the benchmark. A difference of a factor 10 or 100 is simply far beyond
disapointing, it is more than inacceptable and by just reading those
benchmarks, I'd like to drop thinking of using FreeBSD even as a backend
server in scientific and business environments. In detail, some of the
SciMark benches look disappointing.
Why SciMark?

SciMark FreeBSD : Oracle, Mflops

Composite 884.79 : 844.03 (Faster: FreeBSD)
FFT 236.17 : 213.65 (Faster: FreeBSD)
Jacobi 970.76 : 974.84 (Faster: Oracle)
Monte Carlo 443.00 : 246.27 (Faster: FreeBSD)
Sparse Matrix 1213.64 : 1228.22 (Faster: Oracle)
Dense LU 1560.39 : 1557.18 (Faster: FreeBSD)


The threaded I/O results (Oracle outperforms FreeBSD by x10 on one, by
x100 on another test)
or the disc TPS ( 486 : 3526 ) sure look worse and are worth looking into.


Anyway these tests were performed on different hardware, FWIW.
And with different filesystems, different compilers, different GUIs...



Regards,

Michael
Michael Ross
2011-12-15 10:42:36 UTC
Permalink
Am 15.12.2011, 11:10 Uhr, schrieb Michael Larabel
Post by Michael Ross
Anyway these tests were performed on different hardware, FWIW.
And with different filesystems, different compilers, different GUIs...
No, the same hardware was used for each OS.
The picture under the heading "System Hardware / Software" does not
reflect that.

Motherboard description differs, Chipset description for FreeBSD is empty.


Regards,

Michael
In terms of the software, the stock software stack for each OS was used.
-- Michael
Michael Larabel
2011-12-15 10:58:16 UTC
Permalink
Post by Michael Ross
Am 15.12.2011, 11:10 Uhr, schrieb Michael Larabel
Post by Michael Ross
Anyway these tests were performed on different hardware, FWIW.
And with different filesystems, different compilers, different GUIs...
No, the same hardware was used for each OS.
The picture under the heading "System Hardware / Software" does not
reflect that.
Motherboard description differs, Chipset description for FreeBSD is empty.
I was the one that carried out the testing and know that it was on the
same system.

All of the testing, including the system tables, is fully automated.
Under FreeBSD the parsing of some component strings sometimes isn't as
clean as on Linux and the other operating systems supported by the
Phoronix Test Suite. For the BSD motherboard string parsing it's
grabbing hw.vendor/hw.product from sysctl. Is there a better place to
read the motherboard DMI information from?

-- Michael
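One place worth trying is the SMBIOS data the loader puts into the kernel environment, which kenv(1) can show; a minimal sketch:

# dump all SMBIOS strings the loader collected
kenv | grep '^smbios\.'

# or fetch individual values, e.g. the motherboard maker/model
kenv smbios.planar.maker
kenv smbios.planar.product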
Post by Michael Ross
Regards,
Michael
In terms of the software, the stock software stack for each OS was used.
-- Michael
Steven Hartland
2011-12-15 11:17:26 UTC
Permalink
----- Original Message -----
Post by Michael Larabel
I was the on that carried out the testing and know that it was on the
same system.
All of the testing, including the system tables, is fully automated.
Under FreeBSD sometimes the parsing of some component strings isn't as
nice as Linux and other supported operating systems by the Phoronix Test
Suite. For the BSD motherboard string parsing it's grabbing
hw.vendor/hw.product from sysctl. Is there a better place to read the
motherboard DMI information from?
dmidecode may provide better info?

Regards
Steve

Michael Larabel
2011-12-15 11:21:45 UTC
Permalink
----- Original Message ----- From: "Michael Larabel"
Post by Michael Larabel
I was the on that carried out the testing and know that it was on the
same system.
All of the testing, including the system tables, is fully automated.
Under FreeBSD sometimes the parsing of some component strings isn't
as nice as Linux and other supported operating systems by the
Phoronix Test Suite. For the BSD motherboard string parsing it's
grabbing hw.vendor/hw.product from sysctl. Is there a better place to
read the motherboard DMI information from?
dmidecode may provide better info?
Regards
Steve
dmidecode is used on Linux for parsing some of the hardware information.
I think I looked at using it for BSD too, but offhand I don't recall
what the problem was. I'll check into it again with the latest release
when time allows.

Michael
Gót András
2011-12-15 12:08:24 UTC
Permalink
It would also be nice to see whether compiling the kernel and the world
for the specific machine makes a difference. I think it's an advantage of
FreeBSD, but I never could do a benchmark comparing this.

Andras
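For what it's worth, the usual knob for that experiment is CPUTYPE in /etc/make.conf before rebuilding world and kernel; a minimal sketch (core2 is only an example value, see /usr/share/examples/etc/make.conf for the list), and whether it measurably helps is exactly the open question:

# /etc/make.conf -- tune world/kernel builds for the local CPU
CPUTYPE?=core2

# then rebuild as usual (followed by installkernel/installworld and a reboot)
cd /usr/src
make buildworld buildkernel KERNCONF=GENERIC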
Post by Michael Larabel
----- Original Message ----- From: "Michael Larabel"
Post by Michael Larabel
I was the on that carried out the testing and know that it was on
the same system.
All of the testing, including the system tables, is fully
automated. Under FreeBSD sometimes the parsing of some component
strings isn't as nice as Linux and other supported operating systems
by the Phoronix Test Suite. For the BSD motherboard string parsing
it's grabbing hw.vendor/hw.product from sysctl. Is there a better
place to read the motherboard DMI information from?
dmidecode may provide better info?
Regards
Steve
dmidecode is used on Linux for parsing some of the hardware
information. I think I looked at using it for BSD too, but offhand I
don't recall what the problem was. I'll check into it again with the
latest release when time allows.
Michael
Steven Hartland
2011-12-15 12:30:28 UTC
Permalink
Having a quick look at those results, aren't there a few anomalies? E.g.
THREADED I/O TESTER for Oracle reports 10255.75MB/s,
which is clearly impossible for a single HD system, meaning
it's basically caching the entire data set?

Regards
Steve

Andriy Gapon
2011-12-15 16:06:37 UTC
Permalink
Having a quick look at those results aren't there a few annomolies e.g. THREADED
I/O TESTER for Oracle reports 10255.75MB/s
Which is clearly impossible for a single HD system meaning
its basically caching the entire data set?
I think that Stefan Esser's post in this thread has likely nailed this one.
--
Andriy Gapon
Michael Ross
2011-12-15 11:20:24 UTC
Permalink
Am 15.12.2011, 11:55 Uhr, schrieb Michael Larabel
Post by Michael Larabel
Post by Michael Ross
Am 15.12.2011, 11:10 Uhr, schrieb Michael Larabel
Post by Michael Ross
Anyway these tests were performed on different hardware, FWIW.
And with different filesystems, different compilers, different GUIs...
No, the same hardware was used for each OS.
The picture under the heading "System Hardware / Software" does not
reflect that.
Motherboard description differs, Chipset description for FreeBSD is empty.
I was the on that carried out the testing and know that it was on the
same system.
No offense. I'm not doubting you.
Post by Michael Larabel
All of the testing, including the system tables, is fully automated.
Under FreeBSD sometimes the parsing of some component strings isn't as
nice as Linux and other supported operating systems by the Phoronix Test
Suite. For the BSD motherboard string parsing it's grabbing
hw.vendor/hw.product from sysctl.
so maybe you can understand how I got my impression.
NVidia Audio and Realtek Audio.
Looks different to me :-)
Post by Michael Larabel
Is there a better place to read the motherboard DMI information from?
Following Steven Hartland's suggestion,
from one of my machines:

/usr/ports/sysutils/dmidecode/#sysctl -a | egrep "hw.vendor|hw.product"

/usr/ports/sysutils/dmidecode/#dmidecode -t 2
# dmidecode 2.11
SMBIOS 2.6 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: FUJITSU
Product Name: D2759
Version: S26361-D2759-A13 WGS04 GS02
Serial Number: 35838599
Asset Tag: -
Features:
Board is a hosting board
Board is removable
Location In Chassis: -
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0


Nice. Didn't know about that.

Regards,

Michael
Patrick M. Hausen
2011-12-15 11:29:05 UTC
Permalink
Hi, all,
Post by Michael Ross
Following Steven Hartlands' suggestion,
/usr/ports/sysutils/dmidecode/#sysctl -a | egrep "hw.vendor|hw.product"
/usr/ports/sysutils/dmidecode/#dmidecode -t 2
# dmidecode 2.11
SMBIOS 2.6 present.
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: FUJITSU
Product Name: D2759
Version: S26361-D2759-A13 WGS04 GS02
Serial Number: 35838599
Asset Tag: -
Board is a hosting board
Board is removable
Location In Chassis: -
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
Without the need to install an additional port:

datatomb2# kenv

smbios.bios.reldate="11/03/2011"
smbios.bios.vendor="FUJITSU // American Megatrends Inc."
smbios.bios.version="V4.6.4.1 R1.18.0 for D3034-A1x"
smbios.chassis.maker="FUJITSU"
smbios.chassis.serial="YLAP004857"
smbios.chassis.tag="System Asset Tag "
smbios.chassis.version="RX100S7R2"
smbios.memory.enabled="8388608"
smbios.planar.maker="FUJITSU"
smbios.planar.product="D3034-A1"
smbios.planar.serial="LJ1B-P00996"
smbios.planar.version="S26361-D3034-A100 WGS01 GS02"
smbios.socket.enabled="1"
smbios.socket.populated="1"
smbios.system.maker="FUJITSU"
smbios.system.product="PRIMERGY RX100 S7"
smbios.system.serial="YLAP004857"
smbios.system.uuid="f0493081-f5ca-e011-b8a5-a1c4d143da5f"
smbios.system.version="GS02"
smbios.version="2.7"


Kind regards,
Patrick
--
punkt.de GmbH * Kaiserallee 13a * 76133 Karlsruhe
Tel. 0721 9109 0 * Fax 0721 9109 100
***@punkt.de http://www.punkt.de
Gf: Jürgen Egeling AG Mannheim 108285
Jeremy Chadwick
2011-12-15 11:54:35 UTC
Permalink
Post by Michael Larabel
Post by Michael Ross
Am 15.12.2011, 11:10 Uhr, schrieb Michael Larabel
Post by Michael Ross
Anyway these tests were performed on different hardware, FWIW.
And with different filesystems, different compilers, different GUIs...
No, the same hardware was used for each OS.
The picture under the heading "System Hardware / Software" does
not reflect that.
Motherboard description differs, Chipset description for FreeBSD is empty.
I was the on that carried out the testing and know that it was on
the same system.
All of the testing, including the system tables, is fully automated.
Under FreeBSD sometimes the parsing of some component strings isn't
as nice as Linux and other supported operating systems by the
Phoronix Test Suite. For the BSD motherboard string parsing it's
grabbing hw.vendor/hw.product from sysctl.
Is there a better place to read the motherboard DMI information from?
I *think* what you're referring to is SMBIOS strings -- and these are
available from kenv(1) / kenv(2), not sysctl. But keep reading for why
SMBIOS data is not 100% reliable (greatly depends on the hardware). For
actual device strings/etc. for all devices on busses (PCI, AGP, etc.)
you can use pciconf -lvcb.

That's about as good as it's going to get via software. SMBIOS data
(e.g. smbios.{bios,chassis,planar,system}) is never going to give you
fully-identifiable data; I can point you to tons of systems where the
data inserted there is nonsense, sometimes even just ASCII spaces (and
that is the fault of the system vendor/BIOS manufacturer, not FreeBSD).
Sometimes identical strings are used across completely different
systems/boards (sometimes even server-class boards like ones from
Supermicro). And PCI vendor strings don't give you things like speeds,
frequency/voltages, etc.. Sometimes this matters. For example (just
making something up): "the video benchmark was horrible on FreeBSD",
when in fact it turned out that a run of "pciconf -lvcb" showed your
PCIe card was running at x4 link speed instead of x16.

The best place to get your specifications from are:

* The box
* The physical hardware (by physically inspecting it)
* The user manual / product documentation
* Purchase orders from whoever bought the hardware
* And, of course, operational speed (if possible) from the OS/userland
utilities

When I read a benchmark/review, I have to assume the person is doing
them on a system they have 100% control over, all the way down to the
hardware. Thus, they should know what exact hardware they have.

Also, when publishing results online, you should take the time to
proofread everything (with a 2nd set of eyes if possible) and be patient
and thorough. People like accuracy, especially when there's hard
data/evidence to back it up that can be made available for download.

Try to understand: so many review-esque sites consist of individuals who
do not understand even remotely what they're doing.

I'm going to give you two examples -- one personal, one word-of-mouth
but from someone I trust dearly.

I have a "reverse analysis" of Anantech's Intel 510 SSD review that has been
sitting in my "draft" folder on my blog for a month now because I'm
downright afraid to publish how their data seems completely and totally
wrong (with evidence to prove it). I'm afraid/stalling because I want
to make absolutely damn sure I'm not missing some key piece of evidence
that explains it, and I've had multiple people read it and go "...wow, I
didn't notice that, that benchmark data makes no sense", but I'm STILL
reluctant. The last thing I want to do is "publish" something that
sparks a controversy where it turns out I'm wrong (and I AM wrong, quite
often!).

As for the other:

http://www.overclockers.com/bulldozer-architecture-explained/

The author of this "review" talks about CPU arch and is praised for
writing a "wonderful article that speaks the truth". But sadly that
doesn't appear to be the case. A colleague of mine is long-time friends
with another individual who is getting his Ph.D. in computer architecture
and recently submitted a paper (which was accepted) to a journal
which has published papers on things like RAID (when it was first
introduced as a concept/method), and hardware watchpoints. Said
individual read the above "review" and described it as, quote, "the
worst article on computer architecture on the entire Internet". One of
the amusing quotes (that got me laughing since I did understand it; my
understanding of CPUs on a silicon level is limited, I'm just an old
65xxx assembly programmer...) was how the article states "this is the
first time AMD has implemented branch prediction". Sigh.

Here's the kicker: said individual immediately recognised that the
article was a near dry cut-and-paste from one of two commonly-used
computer architecture books in college/universities; the first book is
basically a "beginner's guide to CPU architecture". The book is also a
bit old at that. Individual proceeded to look up where the article
author went to school, and noted that said school's CPU architecture
course **ends** with that book.

The user/viewer demographic of overclockers.com is going to be
significantly different from that of phoronix.com -- you know that I'm
sure. The point is that you should be aware that there is going to
be significant discussions that come from publishing such benchmark
comparisons with such a demographic. Things that indicate severe
performance differential (e.g. "10x to 100x worse") are going to be
focused on and criticised -- and hopefully in a socially-agreeable
manner[1] -- and in a much different way than, say, a 3D video card
review site ("lol ur pc sux if u spend onl $4000 on it lol").

The first step is to try and figure out what exactly you're seeing and
why it's so significantly different when compared to other OSes.

[1]: I'm sure by now you know that the BSDs in general tend to harbour a
community of folks who are more argumentative/aggressive than, say,
Linux (generally speaking). In this thread though, I think all of us
really want to assist in some way to figure out what exactly is going on
here, scheduler-wise, and see if we can put something together to hand
developers who are "responsible" for said code and see what comes of it.
Remember, we're all here to try and make things better... I hope. :-)

Footnote: It's nice meeting you (indirectly), I was always curious who
did the phoronix.com reviews/"stuff" when it came to FreeBSD.
Greetings!
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
Michael Larabel
2011-12-15 11:04:31 UTC
Permalink
Post by Michael Ross
Am 15.12.2011, 08:32 Uhr, schrieb O. Hartmann
Post by O. Hartmann
http://www.phoronix.com/scan.php?page=news_item&px=MTAyNzA
It may be worth to discuss the sad performance of FBSD in some parts of
the benchmark. A difference of a factor 10 or 100 is simply far beyond
disapointing, it is more than inacceptable and by just reading those
benchmarks, I'd like to drop thinking of using FreeBSD even as a backend
server in scientific and business environments. In detail, some of the
SciMark benches look disappointing.
Why SciMark?
SciMark FreeBSD : Oracle, Mflops
Composite 884.79 : 844.03 (Faster: FreeBSD)
FFT 236.17 : 213.65 (Faster: FreeBSD)
Jacobi 970.76 : 974.84 (Faster: Oracle)
Monte Carlo 443.00 : 246.27 (Faster: FreeBSD)
Sparse Matrix 1213.64 : 1228.22 (Faster: Oracle)
Dense LU 1560.39 : 1557.18 (Faster: FreeBSD)
The threaded I/O results (Oracle outperforms FreeBSD by x10 on one, by
x100 on another test)
or the disc TPS ( 486 : 3526 ) sure look worse and are worth looking into.
Anyway these tests were performed on different hardware, FWIW.
And with different filesystems, different compilers, different GUIs...
No, the same hardware was used for each OS.

In terms of the software, the stock software stack for each OS was used.

-- Michael
Post by Michael Ross
Regards,
Michael
Michael Larabel
2011-12-15 13:37:44 UTC
Permalink
No, the same hardware was used for each OS.
In terms of the software, the stock software stack for each OS was used.
Just curious: Why did you choose ZFS on FreeBSD, while UFS2 (with
journaling enabled) should be an obvious choice since it is more similar
in concept to ext4 and since that is what most FreeBSD users will use
with FreeBSD?
I was running some ZFS vs. UFS tests as well and this happened to have
ZFS on when I was running some other tests.
Did you tune the ZFS ARC (e.g. vfs.zfs.arc_max="6G") for the tests?
The OS was left in its stock configuration.
And BTW: Did your measured run times account for the effect, that Linux
keeps much more dirty data in the buffer cache (FreeBSD has a low limit
on dirty buffers since under realistic load the already cached data is
much more likely to be reused and thus more valuable than freshly
written data; aggressively caching dirty data would significantly reduce
throughput and responsiveness under high load). Given the hardware specs
of the test system, I guess that Linux accepts at least 100 times the
dirty data in the buffer cache, compared to FreeBSD (where this number
is at most in the tens of megabyte range).
If you did not, then your results do not represent a server load (which
I'd expect relevant, if you are testing against Oracle Linux 6.1
server), where continuous performance is required. Tests that run on an
idle system starting in a clean state and ignoring background flushing
of the buffer cache after the timed program has stopped are perhaps
useful for a very lowly loaded PC, but not for a system with high load
average as the default.
I bet that if you compared the systems under higher load (which
admittedly makes it much harder to get sensible numbers for the program
under test) or with reduced buffer cache size (or raise the dirty buffer
limit in FreeBSD accordingly, which ought to be possible with sysctl
and/or boot time tuneables, e.g. "vfs.hidirtybuffers").
And a last remark: Single benchmark runs do not provide reliable data.
FreeBSD comes with "ministat" to check the significance of benchmark
results. Each test should be repeated at least 5 times for meaningful
averages with acceptable confidence level.
The Phoronix Test Suite runs most tests a minimum of three times and if
the standard deviation exceeds 3.5% the run count is dynamically
increased, among other safeguards.

-- Michael
Regards, STefan
Michael Larabel
2011-12-15 14:31:49 UTC
Permalink
Post by Michael Larabel
No, the same hardware was used for each OS.
In terms of the software, the stock software stack for each OS was used.
Just curious: Why did you choose ZFS on FreeBSD, while UFS2 (with
journaling enabled) should be an obvious choice since it is more similar
in concept to ext4 and since that is what most FreeBSD users will use
with FreeBSD?
I was running some ZFS vs. UFS tests as well and this happened to have
ZFS on when I was running some other tests.
Can we look at the tests?
My opinion is ZFS without tuning is much slower than UFS2.
http://www.phoronix.com/scan.php?page=news_item&px=MTAyNjg
Sergey Matveychuk
2011-12-15 14:47:10 UTC
Permalink
Post by Michael Larabel
No, the same hardware was used for each OS.
In terms of the software, the stock software stack for each OS was used.
Just curious: Why did you choose ZFS on FreeBSD, while UFS2 (with
journaling enabled) should be an obvious choice since it is more similar
in concept to ext4 and since that is what most FreeBSD users will use
with FreeBSD?
I was running some ZFS vs. UFS tests as well and this happened to have
ZFS on when I was running some other tests.
Can we look at the tests?
My opinion is ZFS without tuning is much slower than UFS2.
Stefan Esser
2011-12-15 13:39:38 UTC
Permalink
No, the same hardware was used for each OS.
In terms of the software, the stock software stack for each OS was used.
Just curious: Why did you choose ZFS on FreeBSD, while UFS2 (with
journaling enabled) should be an obvious choice since it is more similar
in concept to ext4 and since that is what most FreeBSD users will use
with FreeBSD?

Did you tune the ZFS ARC (e.g. vfs.zfs.arc_max="6G") for the tests?

And BTW: Did your measured run times account for the effect that Linux
keeps much more dirty data in the buffer cache (FreeBSD has a low limit
on dirty buffers since under realistic load the already cached data is
much more likely to be reused and thus more valuable than freshly
written data; aggressively caching dirty data would significantly reduce
throughput and responsiveness under high load). Given the hardware specs
of the test system, I guess that Linux accepts at least 100 times the
dirty data in the buffer cache, compared to FreeBSD (where this number
is at most in the tens of megabyte range).

If you did not, then your results do not represent a server load (which
I'd expect relevant, if you are testing against Oracle Linux 6.1
server), where continuous performance is required. Tests that run on an
idle system starting in a clean state and ignoring background flushing
of the buffer cache after the timed program has stopped are perhaps
useful for a very lowly loaded PC, but not for a system with high load
average as the default.

I bet that if you compared the systems under higher load (which
admittedly makes it much harder to get sensible numbers for the program
under test), or with reduced buffer cache size (or with the dirty buffer
limit in FreeBSD raised accordingly, which ought to be possible with
sysctl and/or boot time tunables, e.g. "vfs.hidirtybuffers"), the
results would look quite different.

And a last remark: Single benchmark runs do not provide reliable data.
FreeBSD comes with "ministat" to check the significance of benchmark
results. Each test should be repeated at least 5 times for meaningful
averages with acceptable confidence level.

Regards, STefan
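To illustrate that last point, a minimal sketch of how such a comparison could be driven; "./benchmark" is only a placeholder for whatever is actually being measured, and ministat(1) is in the FreeBSD base system:

#!/bin/sh
# Five timed runs per configuration, one "real" time per line, then let
# ministat decide whether the difference between data sets is significant.
for i in 1 2 3 4 5; do
    /usr/bin/time -o run.$i ./benchmark
done
awk '{ print $1 }' run.1 run.2 run.3 run.4 run.5 > ule.txt

# ...reboot into the other scheduler, repeat into 4bsd.txt, then:
ministat ule.txt 4bsd.txt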
Daniel Kalchev
2011-12-15 13:52:25 UTC
Permalink
No, the same hardware was used for each OS.
In terms of the software, the stock software stack for each OS was used.
Just curious: Why did you choose ZFS on FreeBSD, while UFS2 (with
journaling enabled) should be an obvious choice since it is more similar
in concept to ext4 and since that is what most FreeBSD users will use
with FreeBSD?
Or perhaps, since it is a "server" Linux distribution, use ZFS on Linux as well, with identical tuning on both Linux and FreeBSD. Having the same FS used by both OSes would help make the comparison more sensible for FS I/O.

Daniel
Oliver Pinter
2011-12-15 07:44:30 UTC
Permalink
Post by Jeremy Chadwick
Post by Oliver Pinter
Post by O. Hartmann
Post by Tom Evans
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, and brought forward more complaints about
interactivity in X (I've never noticed this, and use a FreeBSD desktop
daily).
I would highly appreciate a decission against SCHED_ULE as the default
scheduler! SCHED_4BSD is considered a more mature entity and obviously
it seems that SCHED_ULE needs some refinements to achieve a better level
of quality.
Post by Tom Evans
On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my environment
is clearly the best default", but I think there are probably more
users typically running IO bound processes than CPU bound processes.
You compare SCHED_ULE on FBSD 8.1 with SCHED_4BSD on FBSD 7.0? Shouldn't
you compare SCHED_ULE and SCHED_4BSD on the very same platform?
Development of SCHED_ULE has been focused very much on DB like
PostgreSQL, no wonder the performance benefit. But this is also a very
specific scneario where SCHED_ULE shows a real benefit compared to
SCHED_4BSD.
Post by Tom Evans
I believe the correct thing to do is to put some extra documentation
into the handbook about scheduler choice, noting the potential issues
with loading NCPU+1 CPU bound processes. Perhaps making it easier to
switch scheduler would also help?
Many people more experst in the issue than myself revealed some issues
in the code of both SCHED_ULE and even SCHED_4BSD. It would be a pitty
if all the discussions get flushed away like a "toilette-busisness" as
it has been done all the way in the past.
Well, I'd like to see a kind of "standardized" benchmark. Like on
openbenchmark.org or at phoronix.com. I know that Phoronix' way of
performing benchmarks is questionable and do not reveal much of the
issues, but it is better than nothing. I'm always surprised by the worse
performance of FreeBSD when it comes to threaded I/O. The differences
between Linux and FreeBSD of the same development maturity are
tremendous and scaring!
It is a long time since I saw a SPEC benchmark on a FreeBSD driven HPC
box. Most benchmark around for testing hardware are performed with Linux
and Linux seems to make the race in nearly every scenario. It would be
highly appreciable and interesting to see how Linux and FreeBSD would
perform in SPEC on the same hardware platform. This is only an idea.
Without a suitable benchmark with a codebase understood the discussion
is in many aspects pointless -both ways.
Post by Tom Evans
Cheers
Tom
http://people.freebsd.org/~kris/scaling/mysql-freebsd.png
http://suckit.blog.hu/2009/10/05/freebsd_8_is_it_worth_to_upgrade
Hi!
I'm replying with a list of each setting which differs compared to
RELENG_8 stock on our ULE systems. Note that our ULE systems are 1
physical CPU with 4 cores.
On the other system that has 4 cores I use 7-STABLE, because I have not
had enough time to upgrade it, and the system has some custom patches.
The values that I sent in the previous mail are mostly based on this
4-core system.
Post by Jeremy Chadwick
Post by Oliver Pinter
kern.sched.cpusetsize: 8
I see no such tunable/sysctl on any of our RELENG_8 and RELENG_7
systems. Nor do I find any references to it in /usr/src (on any
system). Is this a RELENG_9 setting? Please explain where it comes
from. I hope it's not a custom kernel patch...
Yes, this is 9-STABLE.
Post by Jeremy Chadwick
Post by Oliver Pinter
kern.sched.preemption: 0
This differs; default value is 1.
PREEMPTION is disabled via kernel config.
Post by Jeremy Chadwick
Post by Oliver Pinter
kern.sched.name: ULE
kern.sched.slice: 13
kern.sched.interact: 30
kern.sched.preempt_thresh: 224
This differs; default value is 64. The "magic value" of 224 has been
discussed in the past, in this thread even.
This magic value was discussed here 1 or 1.5 years ago, first for 8-STABLE.
Post by Jeremy Chadwick
Post by Oliver Pinter
kern.sched.static_boost: 152
This differs; on our systems it's 160.
Post by Oliver Pinter
kern.sched.idlespins: 10000
kern.sched.idlespinthresh: 16
This differs; on our systems it's 4.
Post by Oliver Pinter
Most of them come from my 7-STABLE settings, and with this, it "works for me".
This is a laptop with a Core 2 Duo CPU (with powerd enabled), and my kernel
config is at http://oliverp.teteny.bme.hu/freebsd/kernel_conf
--
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
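For anyone who wants to experiment with the values above without
rebuilding, most of the kern.sched.* knobs listed are run-time sysctls.
A minimal sketch of persisting them, assuming they are read-write on the
running kernel (kern.sched.name and kern.sched.preemption are read-only,
since the scheduler and PREEMPTION are compile-time options; PREEMPTION
is removed with e.g. "nooptions PREEMPTION" in the kernel configuration):

    # /etc/sysctl.conf -- values taken from the settings quoted above
    kern.sched.preempt_thresh=224
    kern.sched.slice=13
    kern.sched.interact=30
    kern.sched.static_boost=152
    kern.sched.idlespins=10000
    kern.sched.idlespinthresh=16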
Ivan Klymenko
2011-12-15 08:26:35 UTC
Permalink
On Thu, 15 Dec 2011 03:05:12 +0100
Post by Oliver Pinter
Post by O. Hartmann
Post by Tom Evans
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether
to change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs
poorly in very specific scenarios where the server is loaded with
NCPU+1 CPU bound processes, and brought forward more complaints
about interactivity in X (I've never noticed this, and use a
FreeBSD desktop daily).
I would highly appreciate a decision against SCHED_ULE as the
default scheduler! SCHED_4BSD is considered a more mature entity
and it obviously seems that SCHED_ULE needs some refinement to
achieve a better level of quality.
Post by Tom Evans
On the other hand, we have very many benchmarks showing how poorly
4BSD scales on things like postgresql. We get much more load out of
our 8.1 ULE DB and web servers than we do out of our 7.0 ones. It's
easy to look at what you do and say "well, what suits my
environment is clearly the best default", but I think there are
probably more users typically running IO bound processes than CPU
bound processes.
You compare SCHED_ULE on FBSD 8.1 with SCHED_4BSD on FBSD 7.0?
Shouldn't you compare SCHED_ULE and SCHED_4BSD on the very same
platform?
Development of SCHED_ULE has focused very much on database workloads
like PostgreSQL, so the performance benefit is no surprise. But this
is also a very specific scenario in which SCHED_ULE shows a real
benefit compared to SCHED_4BSD.
Post by Tom Evans
I believe the correct thing to do is to put some extra
documentation into the handbook about scheduler choice, noting the
potential issues with loading NCPU+1 CPU bound processes. Perhaps
making it easier to switch scheduler would also help?
Many people more expert in the issue than myself have revealed
issues in the code of both SCHED_ULE and even SCHED_4BSD. It would
be a pity if all these discussions were simply flushed away, as has
happened in the past.
Well, I'd like to see a kind of "standardized" benchmark, like those on
openbenchmark.org or at phoronix.com. I know that Phoronix' way of
performing benchmarks is questionable and does not reveal much of the
underlying issues, but it is better than nothing. I'm always surprised
by the poor performance of FreeBSD when it comes to threaded I/O. The
differences between Linux and FreeBSD of the same development
maturity are tremendous and frightening!
It has been a long time since I saw a SPEC benchmark on a FreeBSD-driven
HPC box. Most benchmarks for testing hardware are performed
with Linux, and Linux seems to win the race in nearly every
scenario. It would be highly appreciated and interesting to see how
Linux and FreeBSD would perform in SPEC on the same hardware
platform. This is only an idea. Without a suitable benchmark whose
codebase is understood, the discussion is in many respects pointless,
in both directions.
Post by Tom Evans
Cheers
Tom
http://people.freebsd.org/~kris/scaling/mysql-freebsd.png
http://suckit.blog.hu/2009/10/05/freebsd_8_is_it_worth_to_upgrade
_______________________________________________
Hi!
kern.sched.cpusetsize: 8
kern.sched.preemption: 0
kern.sched.name: ULE
kern.sched.slice: 13
kern.sched.interact: 30
kern.sched.preempt_thresh: 224
kern.sched.static_boost: 152
kern.sched.idlespins: 10000
kern.sched.idlespinthresh: 16
kern.sched.affinity: 1
kern.sched.balance: 1
kern.sched.balance_interval: 133
kern.sched.steal_htt: 1
kern.sched.steal_idle: 1
kern.sched.steal_thresh: 1
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="2" mask="3">0, 1</cpu>
<children>
<group level="2" cache-level="2">
<cpu count="2" mask="3">0, 1</cpu>
</group>
</children>
</group>
</groups>
Most of them come from my 7-STABLE settings, and with this, it "works for me".
This is a laptop with a Core 2 Duo CPU (with powerd enabled), and my kernel
config is at http://oliverp.teteny.bme.hu/freebsd/kernel_conf
Now you try to do the same there,

so that your mouse cursor and Xorg do NOT freeze for a split second
or more...
Then I'll see how good your ULE really is ;)
Daniel Kalchev
2011-12-15 07:47:56 UTC
Permalink
Post by O. Hartmann
Post by Tom Evans
On Wed, Dec 14, 2011 at 11:06 AM, George Mitchell
Post by George Mitchell
Dear Secret Masters of FreeBSD: Can we have a decision on whether to
change back to SCHED_4BSD while SCHED_ULE gets properly fixed?
Please do not do this. This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, and brought forward more complaints about
interactivity in X (I've never noticed this, and use a FreeBSD desktop
daily).
I would highly appreciate a decision against SCHED_ULE as the default
scheduler! SCHED_4BSD is considered a more mature entity, and it
obviously seems that SCHED_ULE needs some refinement to achieve a better
level of quality.
My logic would be: if SCHED_ULE works better on multi-CPU systems, or if
SCHED_4BSD works poorly on multi-CPU systems, then by all means keep
SCHED_ULE as the default scheduler. We are at the end of 2011 and the number
of single- or dual-core CPU systems is decreasing. Most people would just
try the newest FreeBSD version on their newest hardware and on that basis
make an "informed" decision about whether it is worth it. If on newer hardware
SCHED_ULE gives better performance, then again it should be the default.

Then again, FreeBSD is used in an extremely wide set of different
environments. A scheduler that benefits a single-CPU, simple-architecture
X workstation may be damaging for the performance of a multi-CPU,
NUMA-based server with a large number of non-interactive
processes running.

Perhaps a knob should be provided, with sufficient documentation, for
those who will not go so far as to recompile the kernel (the majority of
users, I would guess).
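Short of such a knob, the scheduler the running kernel was built with
can at least be checked at run time (the output here matches the ULE
settings quoted earlier in the thread):

    $ sysctl kern.sched.name
    kern.sched.name: ULE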

I tried switching my RELENG_8 desktop from SCHED_ULE to SCHED_4BSD
yesterday and cannot see any measurable difference in responsiveness. My
'stress test' is typically a Flash game that gets Firefox into an
almost unresponsive state and eats one of the CPU cores -- but no
difference. Well, Flash has its own set of problems on FreeBSD, but
these are typical "desktop" uses. Running 100% compute-intensive
processes in the background is not.

Daniel

PS: As to why Linux is "better" in these use cases: they do not care as
much about doing things "right" as about achieving performance. In my
opinion, most of us are with FreeBSD for the "do it right" attitude.
George Mitchell
2011-12-15 00:13:20 UTC
Permalink
[...] This thread has shown that ULE performs poorly
in very specific scenarios where the server is loaded with NCPU+1 CPU
bound processes, [...]
Minor correction: Problem occurs when there are nCPU compute-bound
processes, not nCPU + 1. -- George Mitchell
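A minimal sketch of reproducing that scenario, with the source tree at
/usr/src and the GENERIC configuration assumed, and the
distributed-computing client stood in for by plain busy loops at nice 20:

    #!/bin/sh
    ncpu=$(sysctl -n hw.ncpu)
    pids=""
    i=0
    while [ "$i" -lt "$ncpu" ]; do
        # one compute-bound job per CPU, as dnetc would start
        nice -n 20 sh -c 'while :; do :; done' &
        pids="$pids $!"
        i=$((i + 1))
    done
    cd /usr/src && time make buildkernel KERNCONF=GENERIC
    kill $pids          # stop the busy loops afterwards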
Mark Linimon
2011-12-14 19:30:09 UTC
Permalink
I'm not on the Release Engineering Team, and in fact don't have a src
commit bit ... but this close to a major release, no, it's too late to
change the default.

mcl
Eitan Adler
2011-12-10 01:12:37 UTC
Permalink
dnetc is an open-source program from http://www.distributed.net/.  It
tries a brute-force approach to cracking RC4 puzzles and also computes
optimal Golomb rulers.  It starts up one process per CPU and runs at
nice 20 and is, for all intents and purposes, 100% compute bound.
Try idprio as well (at the moment it requires root to use, though).

nice only means "play nice". idprio means "only run when nothing else
wants to run".
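A minimal sketch of that, assuming the dnetc binary is on the PATH
(see idprio(1) for details):

    # start dnetc at the lowest idle priority instead of merely niced:
    idprio 31 dnetc
    # an already-running instance can be re-classed by process ID:
    idprio 31 -1234     # where 1234 is the dnetc PID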
--
Eitan Adler
Attilio Rao
2011-12-15 16:27:14 UTC
Permalink
dnetc is an open-source program from http://www.distributed.net/.  It
tries a brute-force approach to cracking RC4 puzzles and also computes
optimal Golomb rulers.  It starts up one process per CPU and runs at
nice 20 and is, for all intents and purposes, 100% compute bound.
[Posting on the first message of the thread]

I basically went through all the e-mails you just sent and identified 4
real reports we could work on, summarized in the attached
Excel file.
I'd like George, Steve, Doug, Andrey and Mike to review the
data there and add more, if they want, or make further
clarifications, in particular about the presence (or absence) of Xorg
in their workload.

I've read a couple of messages in the thread pointing the finger at
Xorg as being excessively CPU-intensive, and I think they are right; we
might try to find a solution for that at some point, but it is really
an edge case.
George's and Steve's cases, instead, look very different from this, and
I want to analyze them in detail.
George already provided schedgraph traces; as for the others, if they
cannot provide traces directly, I'd really appreciate it if they would
at least describe the workload in detail so that I get a chance to
reproduce it.
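For anyone who wants to produce such a trace, a rough sketch of the
commonly described KTR/schedgraph procedure follows; the exact option
names and flags should be checked against ktr(4), ktrdump(8) and
src/tools/sched/schedgraph.py in your source tree:

    # kernel configuration additions (rebuild and boot the new kernel):
    options KTR
    options KTR_ENTRIES=262144
    options KTR_COMPILE=(KTR_SCHED)
    options KTR_MASK=(KTR_SCHED)

    # after reproducing the problem, dump the buffer and plot it:
    ktrdump -ct > ktr.out
    python /usr/src/tools/sched/schedgraph.py ktr.out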

If someone else thinks they have a specific problem that is not
characterized by one of the cases above, please let me know and I will
add it to the chart.

Thanks for the hard work you guys put into pointing out ULE's problems; I
think we will get to the bottom of this if we keep sharing thoughts
and reports.

Attilio
--
Peace can only be achieved by understanding - A. Einstein