VM2 needs more than 800 queries for the frontpage? (btw less than 200 now)

Started by Milbo, November 22, 2011, 21:38:13 PM


Milbo

First, I ask you to consider that the front page is usually the page with the highest load. Since more and more people complain that VM2 needs so many queries, I think it is very important to explain why the old rule "the less SQL, the faster the application" is sometimes wrong. I will now explain different models/ideas for walking a data tree, and different strategies for storing it, taking the category tree as an example. I will show how it can happen that you do more SQL queries and yet need less time, and that this depends on how you organise your database.

The simplest idea is just to store every category in a row, keeping the parent_category_id and an array of children_category_ids in that same row. This leads to problems, for example that the parent category must be updated whenever a child is created or deleted. There are various other methods. The moment you decide to store the children_category_ids cleanly, you turn that array into a table of its own, which simply leads to an xref table holding the relation between the categories.

In VM we use the method of storing the children arrays in an xref table.
Name: _virtuemart_category_categories
Fields:
   `id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
   `category_parent_id` INT(1) UNSIGNED NOT NULL DEFAULT '0',
   `category_child_id` INT(1) UNSIGNED NOT NULL DEFAULT '0',
   `ordering` smallint(1) NOT NULL DEFAULT '0'


The category table, which holds the actual information, is
Name: _virtuemart_categories
   `virtuemart_category_id` SMALLINT(1) UNSIGNED NOT NULL AUTO_INCREMENT,
   `virtuemart_vendor_id` SMALLINT(1) UNSIGNED NOT NULL DEFAULT '0',
   `category_template` CHAR(24) NULL DEFAULT NULL,
   `category_layout` CHAR(16) NULL DEFAULT NULL,
   `category_product_layout` CHAR(16) NULL DEFAULT NULL,
   `products_per_row` TINYINT(2) NULL DEFAULT NULL,
   `limit_list_start` SMALLINT(1) UNSIGNED NULL DEFAULT NULL,
   `limit_list_step` SMALLINT(1) UNSIGNED NULL DEFAULT NULL,
   `limit_list_max` SMALLINT(1) UNSIGNED NULL DEFAULT NULL,
   `limit_list_initial` SMALLINT(1) UNSIGNED NULL DEFAULT NULL,
   `hits` INT(1) UNSIGNED NOT NULL DEFAULT '0',
   `metarobot` CHAR(40) NOT NULL,
   `metaauthor` CHAR(64) NOT NULL,
   `ordering` INT(2) NOT NULL DEFAULT '0',
   `shared` TINYINT(1) NOT NULL,
   `published` TINYINT(1) NOT NULL DEFAULT '1',
   `created_on` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
   `created_by` INT(11) NOT NULL DEFAULT '0',
   `modified_on` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
   `modified_by` INT(11) NOT NULL DEFAULT '0',
   `locked_on` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
   `locked_by` INT(11) NOT NULL DEFAULT '0',


Additionally, we hold all the language information here:
Name: _virtuemart_categories_en_gb
   `virtuemart_category_id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
   `category_name` CHAR(128) NOT NULL DEFAULT '',
   `category_description` VARCHAR(2024) NOT NULL DEFAULT '',
   `metadesc` CHAR(128) NOT NULL DEFAULT '',
   `metakey` CHAR(128) NOT NULL DEFAULT '',
   `slug` CHAR(144) NOT NULL DEFAULT '',

I think it is already clear that this is a completely different table layout than throwing everything into the same table. So now let's see how we can browse the data.
Our first method used the same query every time. When you search for category_parent_id = 0 you get all root categories (one query for the whole list of them); for the children we just called the same query again, with the found virtuemart_category_id as the new category_parent_id.
So our first method used 2 queries per category, regardless of whether it has children or not. Some numbers on my system for this task (145 categories):
vmdebug vmTime: getCategories: 0.0030829906463623
vmdebug vmTime: getCategories: 0.00089716911315918
vmdebug vmTime: getCategories: 0.0012059211730957
vmdebug vmTime: getCategories: 0.00093698501586914
vmdebug vmTime: getCategories: 0.0011470317840576

It depends a bit on how many categories are in the list, and the first result is an outlier. So I think we can say this task takes approximately 0.001 s, or 1 ms.
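
To make this concrete, here is a minimal sketch of that first strategy. This is not the actual VM2 model code; the jos_ table prefix, the PDO connection and the function names are my assumptions:

<?php
// Hypothetical sketch, not VM2 code: fetch one level of the tree,
// joining the xref, core and language tables shown above.
function getCategoryList(PDO $db, $parentId = 0) {
    $stmt = $db->prepare(
        'SELECT c.virtuemart_category_id, l.category_name
           FROM jos_virtuemart_category_categories AS cc
           JOIN jos_virtuemart_categories AS c
             ON c.virtuemart_category_id = cc.category_child_id
           JOIN jos_virtuemart_categories_en_gb AS l
             ON l.virtuemart_category_id = c.virtuemart_category_id
          WHERE cc.category_parent_id = ? AND c.published = 1
          ORDER BY cc.ordering');
    $stmt->execute(array((int) $parentId));
    return $stmt->fetchAll(PDO::FETCH_OBJ);
}

// First strategy: recurse blindly. Every leaf still fires one full
// listquery that merely comes back empty, so each category costs
// roughly one extra ~1 ms query.
function getCategoryTree(PDO $db, $parentId = 0) {
    $tree = array();
    foreach (getCategoryList($db, $parentId) as $cat) {
        $id = (int) $cat->virtuemart_category_id;
        $tree[$id] = array(
            'name'     => $cat->category_name,
            'children' => getCategoryTree($db, $id),
        );
    }
    return $tree;
}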

Let's assume we have only root categories, without children. Then one query covers all of them; I call it the listquery (1 ms). The first variant, however, fired a listquery for every category on every level, even when it came back empty. 100 categories without children are all on one level. 2 categories with 3 children each are only 8 categories, but 3 levels of queries: 1 for the root list, 2 for the children lists, and 6 empty ones for the leaves - already 9 queries.

At first view it is surprising that using an additional query, to check whether a lower level (children) exists, can be faster. At second view it is not surprising at all. The code we use now first checks whether there are children and only then executes the listquery (1 ms).

The query checking for children uses the small xref table. Here are some numbers for it on my machine, with 145 categories:
vmdebug vmTime: hasChildren NO: 0.00012993812561035
vmdebug vmTime: hasChildren NO: 0.00013494491577148
vmdebug vmTime: hasChildren NO: 0.00011801719665527
vmdebug vmTime: hasChildren NO: 0.00011301040649414
vmdebug vmTime: hasChildren YES: 0.00011801719665527

We can round that to 0.1 ms. I think it is fair to say it is faster than the listquery by approximately a factor of 10. So how does it look for 100 root categories now? We have the slow query (1 ms) once and the fast query (0.1 ms) 100 times. That is clearly faster than 101 times 1 ms (11 ms against 101 ms).
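
In code, the improved strategy differs only by one cheap probe on the small xref table. Again just a sketch, under the same assumptions as above:

// Cheap existence probe (~0.1 ms here) on the small xref table.
function hasChildren(PDO $db, $categoryId) {
    $stmt = $db->prepare(
        'SELECT 1
           FROM jos_virtuemart_category_categories
          WHERE category_parent_id = ?
          LIMIT 1');
    $stmt->execute(array((int) $categoryId));
    return (bool) $stmt->fetchColumn();
}

// Second strategy: the expensive listquery (~1 ms) runs only for
// categories that really have children; a leaf costs just the probe.
function getCategoryTreeChecked(PDO $db, $parentId = 0) {
    $tree = array();
    foreach (getCategoryList($db, $parentId) as $cat) {
        $id = (int) $cat->virtuemart_category_id;
        $tree[$id] = array(
            'name'     => $cat->category_name,
            'children' => hasChildren($db, $id)
                ? getCategoryTreeChecked($db, $id)
                : array(),
        );
    }
    return $tree;
}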

The question now is: under which conditions does the extra test cost more than it saves? The answer is 9 categories in a row with only 1 child each. Why?

Before, a final leaf needed its share of the slow query for the list it appears in, plus one slow query to find out that it has no children. So with n siblings sharing the list query, a leaf costs 1ms/n + 1ms.

Now we have the shared list query plus a fast check for children: 1ms/n + 0.1ms.

I think it is quite clear that the more children there are, the more the extra check pays off.

But what is the query and time balance for a category which has children? Then we have 1ms/n + 0.1ms + 1ms, where the old solution had 1ms/n + 1ms. So yes, we lose 0.1 ms per node. The calculation for 10 categories nested in a single chain is:

Old solution:
per leaf: 1ms/n + 1ms, 2 queries
per node: 1ms/n + 1ms, 2 queries
9 nodes + 1 leaf = 20ms

New solution:
per leaf: 1ms/n + 0.1ms, 2 queries
per node: 1ms/n + 0.1ms + 1ms, 3 queries
9 nodes + 1 leaf = 20ms

So we use MORE queries, and even in this worst case it is just as fast! You can argue that it is slower for every category with a child, but a category with children normally has more than one child, and then we play the game again. Except for very deep category trees with often only one child (and this 9 times in a row), we always have more categories without children than categories with children. So you can say that, with more queries, it is practically always faster than before.

Here is the calculation for a category with 2 children (1 node + 2 leaves):
old solution: 6ms
new solution: 3.8ms

So yes, more queries can actually make it faster! Stop following rules blindly; just test it and try to find a faster solution. The number of queries is an indicator, not a measurement. We are always open to enhancements, but we won't change our code just because some people think it must be done this or that way, or because it does not fit into their world view.

Just another example: why should we try to avoid a query with a load time of 1.5E-7 s when there are queries using 0.0015 s (1.5E-3)? This means I can do my fast query 10,000 times instead of one slow one. Okay, overhead, this and that; let's say 1,000 times. And yes, we have queries like that (SHOW COLUMNS is very, very slow, and we are now moving it into stored functions).

It also has a lot to do with the database layout. We use a lot of xref tables, which automatically tends towards many fast queries. Big tables and fewer xref tables mean fewer, but slower, queries.

I hope I have given some good insight into how to actually improve your speed, instead of just following rules and believing in them.

Please also consider that VM2 now always runs in multi-language mode. That means that whether you have 1 or 10 languages installed, the speed should be almost the same.
We are always open to implementing better queries. But the argument should not be "take this, it uses fewer queries". The argument should be "take this, it is faster".

Maybe there is some mega SQL crack laughing at us because he can do recursive SQL. Yeah, great - then donate it to the project.

Implementing this code took a lot less time than writing this post.

Max

PS: LMAO
Should I fix your bug? Please support the VirtueMart project and become a member.
______________________________________
Extensions approved by the core team: http://extensions.virtuemart.net/

nikosdion

Max, as a fellow Mechanical Engineer, I am deeply disappointed in you. Your benchmarking method is full of holes. Let's begin.

Your benchmark ran only once. You know very well that two numbers mean nothing. Each benchmark should be run a few thousand times, and you should calculate the median value and the standard deviation. If the st. dev. is very big, your testing is flawed. Moreover, you get to talk about nanoseconds here. Just think about the accuracy of your metering instrument. If your measurement is below the guaranteed accuracy of your metering instrument, you're basically measuring bullshit. Guess what you were measuring? I'll tell you: bullshit. That's why these benchmarks have to run thousands of times and go through a statistical analysis. It's the very first thing you learn about QA in any Greek polytechnic. I assume you are taught the same stuff in Germany, which makes me kind of amazed at how you missed the point that you've made a huge systematic error - let alone built your whole argument ON THAT ERROR.

The next error in your methodology is that you are using a testing server where there is a MySQL query cache. Since there are no other users/processes busting the cache, all your data is in memory. The driving factor of query latency is, surprise, the slowest component of the server: the hard drive. Given that your software is being used on petty shared hosting servers, overcrowded with thousands of sites, and not on dedicated boxes, your benchmarking environment does not simulate a real-world use case. I told you on Twitter: you have to turn off MySQL caching. If you want to make things realistic, have a process thrash the hard disk and bust the filesystem cache too. This is the server environment your software is called to run in.

I am terribly disappointed because I have met you in person, I know you are smart, I know you have a Mechanical Engineering background, and I just can't understand how someone like you failed so badly at a. identifying the proper testing environment and b. implementing a statistically sound benchmarking method.

Nicholas

P.S. I would LMAO too, but your benchmarking method is far from funny; it's a tragedy.

Milbo

#2
Quote from: nikosdion on November 22, 2011, 22:49:20 PM
Your benchmark ran only once.
No, why do you assume that? I benchmarked it many times. This is not doctoral work, and the factor of 10 stays; maybe it is sometimes a factor of 9 and another time a factor of 11. You remind me of the student who tried to cause a big commotion because some BMW engine had only 95% of the power noted on the datasheet (btw, he almost did not get his diploma).

Quote from: nikosdion on November 22, 2011, 22:49:20 PM
Moreover, you get to talk about nanoseconds here.
No. Read the wiki: nano = n = 10^-9 = 0.000000001. I am talking about 10^-5, far away from nano.

Quote from: nikosdion on November 22, 2011, 22:49:20 PM
Just think about the accuracy of your metering instrument.
Hmm, the thing is measuring in nanoseconds and I am talking about milliseconds. Yes, I was quite good at error calculation. You can assume that it has an accuracy of at least 10^-7, let's say 10^-6; it would not affect any of my argumentation.

Quote from: nikosdion on November 22, 2011, 22:49:20 PM
The next error in your methodology is that you are using a testing server where there is a MySQL query cache.
Why do you think that? My personal (yes, that is subjective) experience is that a pure Linux vserver runs faster than my machine.

Quote from: nikosdion on November 22, 2011, 22:49:20 PM
Since there are no other users/processes busting the cache, all your data is in memory. The driving factor of query latency is, surprise, the slowest component of the server: the hard drive.
Yeah, use your crystal ball. Just assume something - very professional.

Quote from: nikosdion on November 22, 2011, 22:49:20 PM
Given that your software is being used on petty shared hosting servers, overcrowded with thousands of sites, and not on dedicated boxes, your benchmarking environment does not simulate a real-world use case.
I told you on Twitter: you have to turn off MySQL caching. If you want to make things realistic, have a process thrash the hard disk and bust the filesystem cache too. This is the server environment your software is called to run in.
Strange; maybe you should try a German provider for 3 euros per month. This argument sounds to me like someone with a jeep telling me that my Ferrari needs bigger wheels, like his jeep's, and that I should take the tyres off, and then the jeep would be a lot faster. Yeahaha.

Quote from: nikosdion on November 22, 2011, 22:49:20 PM
I am terribly disappointed because I have met you in person, I know you are smart, I know you have a Mechanical Engineering background, and I just can't understand how someone like you failed so badly at a. identifying the proper testing environment and b. implementing a statistically sound benchmarking method.

Yeah, and I am sorry too that you only have arguments like this: http://en.wikipedia.org/wiki/Straw_man or http://en.wikipedia.org/wiki/Ad_hominem .

Nothing concrete. You assume this, you assume that. Very interesting. Do you actually think that a query with 3 joins and more than 30 fields is as fast as a query with 0 joins and 3 fields? And I have not even talked about modern servers, or how programming has had to change since multicore arrived, with different processor architectures and so on.

Really strange: you know that I am smart, so you could assume that I did more tests than just one and that I know how to estimate the error on a result. You could argue with that if the result were not clear. But a factor of 10 is not within the measurement error. Another idea: reread how a http://en.wikipedia.org/wiki/Wheatstone_bridge works.

When I read something written by you, I assume that you did your best. If you stated one value, I would assume that you chose one representative value. I am sorry that you do not act the same way.
Should I fix your bug? Please support the VirtueMart project and become a member.
______________________________________
Extensions approved by the core team: http://extensions.virtuemart.net/

jjk

One important point: even if Milbo were wrong, VM2 seems to be significantly faster than the old version and any competition (I'm currently testing it with 500+ products and approximately 1500 images)   ;);D
Non-English Shops: Are your language files up to date?
http://virtuemart.net/community/translations

nikosdion

Max,

You are calling me a straw man?! Now it gets personal. First, I would urge you to open a book on statistics, or at least consult the Wikipedia page on standard deviation.

You gave a bunch of numbers, but not the dataset or the characteristics of your sample (population, median value, standard deviation). The only people who do that in benchmarking are those who haven't bothered collecting sample data and have a flawed analysis. Let me explain why. Your "fast" query takes 0.1 msec to complete. If the standard deviation of your sample is 0.5 and the slow query has a median value of 1 msec and a stdev of 0.01 msec, then your "fast" query takes up to 1.1 msec with 95% confidence and your "slow" query takes up to 1.02 msec with 95% confidence, which would overturn your results. Therefore, the lack of proper statistical data suggests that you do not know how to perform statistical analysis.

In order to calculate the standard deviation and perform a reliable statistical analysis, you need a large dataset. The bare minimum of data points you need, just to make sure that you are following a Normal Distribution ("bell curve"), is 30. You never said how many samples you took, and you never shared your dataset. It doesn't take a psychic to know what that means...

Moreover, I don't use a crystal ball or anything like that. I have simply been in this business long enough and have had the misfortune of dealing with all the craphosts out there. I know that the fastest shared host is shared among "just" 300 websites and has at least 50% CPU and RAM load. I am not pulling these numbers out of thin air; this is ALWAYS the case on the blazing-fast Rochen shared server where my blog is hosted. While doing work for clients, I found out that the typical shared host has 2,000-3,000 sites (more if it's GoDaddy) crammed onto a single machine. Now, you tell me what the statistical probability is of your data file being in the file cache and/or the MySQL query cache.

You might also wonder how on earth I know that on YOUR test server all the data was served from memory and not from disk. I mean, dude, think!!! Even if you have a 7200 RPM drive, the time it takes to move the head to the next sector is 0.00833 sec, or 8.33 msec. Obviously the disk cannot be involved, as your queries take LESS than that IN TOTAL. Therefore YOU ARE NOT SIMULATING A SHARED HOST. Which means that your "feeling" that your PC is slower than a shared host is bollocks, simply because it is NOT based on numbers and NOT based on thorough analysis.

Another thing you might be forgetting is that MySQL comes preconfigured with a query cache. The size of the cache is tiny by default, just 1 MB. Of course, this is enough to cache your tiny queries (remember that you are doing small queries with tiny result sets) and skew your results.

And that is JUST the measurement part of things. Now, let's get to the coding part.

You have chosen the single most inefficient method of creating a hierarchical structure. With your method, you have to constantly query the database over a finite, small dataset in order to create a hierarchy in memory. This is stupid! Here are some other ways to do it:
1. With your current schema, load all categories into memory. You only need to load the _virtuemart_category_categories rows; build an object array keyed by the id field and you have a blazing-fast lookup which needs only a little RAM to work properly. On a site with 150 categories you get to save 150*3.8 msec = 570 msec of page load time (which is actually MUCH MORE on the typical shared host). When you get the categories you want to load, just implode their IDs using a comma as the glue and use the resulting list in a single SELECT query which fetches all the relevant records from the _virtuemart_categories and _virtuemart_categories_(languageCode) tables (see the sketch after this list).
2. Use linked lists instead of a hierarchy table. If you don't know what linked lists are and how they work, feel free to use Google - I have given enough programming advice for one day.
3. Use a flat level field instead of a hierarchy table. Take a look at how NinjaBoard handles infinitely nested forums.
4. If you don't value your sanity, you can always use the left/right approach used by Joomla!, albeit I consider it a brain-twister (it's linked lists done backwards and very hard to manage).
All of the above are infinitely faster than running 800 database queries. If you don't believe me, implement the first suggestion (I drew the lines; I think you can colour them in...) and do a benchmark.
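
To be clear about what I mean by suggestion 1, here is a rough sketch - only a sketch, written against the schema Max quoted above, assuming a PDO connection in $db and the jos_ prefix, and not tested against VM2:

<?php
// Two fixed queries for the whole tree, regardless of its depth
// (plus one per additional language table you need).
function getWholeTree(PDO $db) {
    // 1) Pull the tiny xref table completely into memory.
    $rows = $db->query(
        'SELECT category_parent_id, category_child_id
           FROM jos_virtuemart_category_categories
          ORDER BY ordering')->fetchAll(PDO::FETCH_OBJ);

    $childrenOf = array();
    $ids        = array();
    foreach ($rows as $row) {
        $childrenOf[(int) $row->category_parent_id][] = (int) $row->category_child_id;
        $ids[] = (int) $row->category_child_id;
    }

    // 2) Fetch every category record in ONE query via an imploded id list.
    $names = array();
    if ($ids) {
        $in   = implode(',', array_unique($ids));
        $stmt = $db->query(
            "SELECT c.virtuemart_category_id, l.category_name
               FROM jos_virtuemart_categories AS c
               JOIN jos_virtuemart_categories_en_gb AS l
                 ON l.virtuemart_category_id = c.virtuemart_category_id
              WHERE c.virtuemart_category_id IN ($in)");
        foreach ($stmt->fetchAll(PDO::FETCH_OBJ) as $cat) {
            $names[(int) $cat->virtuemart_category_id] = $cat->category_name;
        }
    }

    // 3) Assemble the hierarchy in plain PHP - no further queries.
    return buildBranch(0, $childrenOf, $names);
}

function buildBranch($parentId, array $childrenOf, array $names) {
    $branch = array();
    if (isset($childrenOf[$parentId])) {
        foreach ($childrenOf[$parentId] as $id) {
            $branch[$id] = array(
                'name'     => isset($names[$id]) ? $names[$id] : '',
                'children' => buildBranch($id, $childrenOf, $names),
            );
        }
    }
    return $branch;
}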

Finally, do note that many shared hosts have something called MySQL query limits. The limit is usually 50k queries per hour. A shop with 50 categories needs 150 queries per page load. With my solution you only need 3 (and mine is a constant-query solution; it's not O(N) like yours). Which one is more likely to cause a problem for the site owner? Q.E.D.

Milbo

#5
Hahaha. Please read that page: http://en.wikipedia.org/wiki/Straw_man. I did not say that you are a straw man, but your answers prove that arguing against a straw man is exactly what you do. You did not completely understand what I wrote... Otherwise you would understand that a standard deviation of 10% or even 20% does not change the results. You argue like someone who has heard about statistics but never actually used them. Everyone can download VM2, enable the timing lines, and test it again. The numbers I gave are just there to show how big the standard deviation is - to give an idea of it, a feeling. You may be surprised, but as an engineer I learned to trust my intuition. When I do the test 10 times in a row, I just look at the differences. Why should I put hours into measurement? It wouldn't convince you anyway. When someone argues against the form of the content and not the content itself, it already shows the level of the discussion that follows.

Quote from: nikosdion on November 23, 2011, 07:20:27 AM
1. With your current schema, load all categories into memory. You only need to load the _virtuemart_category_categories rows; build an object array keyed by the id field and you have a blazing-fast lookup which needs only a little RAM to work properly. On a site with 150 categories you get to save 150*3.8 msec = 570 msec of page load time (which is actually MUCH MORE on the typical shared host). When you get the categories you want to load, just implode their IDs using a comma as the glue and use the resulting list in a single SELECT query which fetches all the relevant records from the _virtuemart_categories and _virtuemart_categories_(languageCode) tables.
Funny - a search for text using a pure index table? And 150*3.8 msec = 570 msec: this shows clearly that you did not understand what I wrote down; the calculation is simply nonsense. The function I am talking about is neither for searching nor anything like it. It is there to show a tree of all categories, with their names. Again you argue against your own straw man; you did not understand what I wrote above, and that says it all.

Quote from: nikosdion on November 23, 2011, 07:20:27 AM
All of the above are infinitely faster than running 800 database queries. If you don't believe me, implement the first suggestion (I drew the lines; I think you can colour them in...) and do a benchmark.
This shows again that you do not understand anything and that you are fighting straw men. With my dataset of 145 categories I need ca. 20 queries - not 800, what nonsense. It only goes over 800 with Joomla included.

Quote from: nikosdion on November 23, 2011, 07:20:27 AM
Finally, do note that many shared hosts have something called MySQL query limits. The limit is usually 50k queries per hour. A shop with 50 categories needs 150 queries per page load. With my solution you only need 3 (and mine is a constant-query solution; it's not O(N) like yours). Which one is more likely to cause a problem for the site owner? Q.E.D.

This shows again that you understand neither the purpose of the function (to list a category tree) nor my explanations above. You are making yourself ridiculous; that was not my intention.
Should I fix your bug? Please support the VirtueMart project and become a member.
______________________________________
Extensions approved by the core team: http://extensions.virtuemart.net/

Radek Suski

#6
I don't know if it is such a good idea to answer, because - I may be wrong - I have the bad feeling that you are taking this all very personally. But I'll try.

First of all, I didn't question your design, for the simple reason that I don't even know it.

So you did some statistics and measurements. I don't quite know what you measured there, but it struck me a bit like you were comparing apples with pears. But I may be wrong about that.

Here are some things I have learned about statistics.

The first thing I learned from my wife (I think you even studied the same subject as she did). It's the old German saying you probably know too: "Wer viel misst, der misst viel Mist" (he who measures a lot measures a lot of rubbish).
The funny thing is that during my studies I learned a very similar rule in software development: every measuring program influences the measured system.
Although you can more or less calculate the influence of a measuring device in electromechanics, it's barely possible to do so in software development.
A good example of this is to run the program "top" on a *nix system: you'll see that this application is one of the busiest in the process list.

The last thing is about statistics: never trust a statistic you didn't forge yourself.
Now don't take this personally, but I know that people unknowingly tend to create statistics in a way that always proves their point.
This is also the reason why a software developer is the worst tester of his own software.

From what you are describing, you are trying to implement a standard Codd database design (maybe apart from the language table). Which is perfectly fine by me.

But there is also a thing I learned: there are three ways of developing software:
- The right one: follow all the design patterns, Codd's 12 rules, create strictly object-oriented software, and so on
- The efficient one: mostly ignore all these rules and concentrate only on performance
- The best one: somewhere between the two previous ways

I think it's clear to every software developer that, for example, although object-oriented programming has a lot of advantages, performance is certainly not one of them.
Likewise, a relational database design is very helpful and allows you to avoid many problems, but it requires more queries to get data out of the database.
And yes, more queries definitely means slower.

Just imagine what happens when your script executes a single query. These are two different applications communicating via a socket. In the best case they are both on the same local machine, but even then the CPU is not executing both threads at the same time. So, very roughly described, you send a query from the PHP thread and wait until the MySQL thread can pick up the request. Then it returns the result and waits until the PHP thread can pick that up. Now imagine what happens when MySQL is running on another server.
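
To illustrate the point, here is a throwaway sketch that measures just that round trip. The connection settings are placeholders; even a query that computes nothing still pays the full socket round trip:

<?php
// Throwaway sketch: measure the bare PHP <-> MySQL round trip.
// Host, database name and credentials are made up.
$db = new PDO('mysql:host=localhost;dbname=test', 'user', 'secret');

$runs  = 1000;
$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    // 'SELECT 1' costs MySQL almost nothing; what we measure is
    // mostly socket round trip plus query parsing overhead.
    $db->query('SELECT 1')->fetchColumn();
}
printf("avg round trip: %.6f sec per query\n",
    (microtime(true) - $start) / $runs);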

So, as I wrote before, I'm not questioning your design, but you cannot be remotely serious when you claim that a large number of queries doesn't affect performance.

And the second big problem is that you should be aware of the fact that many, many providers limit access to the MySQL server, for example the number of queries, or the time. So with 800 queries per page you will very soon get reports from your users about "MySQL server has gone away" and the like.

Quote from: nikosdion on November 23, 2011, 07:20:27 AM
1. With your current schema, load all categories into memory.

This is unfortunately not always so simple, because people tend to have thousands upon thousands of categories, and then you can run into the memory limit :(

Radek Suski

Quote from: Globacide Solutions on November 23, 2011, 12:07:08 PM
How many lines of code did you contribute to VM, Radek, or how much money did you donate? My guess is none. You simply saw a good opportunity in this topic to show off a bit with the "awesome" "saying" your wife taught you.

ROTFL. Not that I have to justify my response, because it's none of your damn business how I spend my time.
Just for your information, we responded to this post because Max asked us to: https://twitter.com/#!/virtuemart/status/139092866122264576


Unbelievable guy :D

alex.s

I love these discussions :)

Well... there were some posts apparently deleted so I can't quote, but here is a point to think about:
Some of us care not because we want to advertise ourselves, but because we have customers with VirtueMart, or, even worse, our dedicated servers host sites with VirtueMart. So it is our business to make this software better.

Friendly advice to Nicholas:
Whenever you use inflammatory words, you devalue your arguments. Take a step back and the tragedy becomes a comedy ;)

Now, on the funny side: Max, you need to rethink this entire thing, man. You cannot seriously expect everyone to use a good hosting company. Reality is not the lab. Your results are only indicative; the real results can only come from the real world, which unfortunately is full of crappy, overloaded, insecure, outdated or badly configured servers.

Please try to ignore the personal attacks and reconsider everything everyone has said.

One thing I have to add from the real world is the maximum number of open tables, which is 100 by default. Imagine you have users on 5 sites, each opening various tables from Joomla, VirtueMart or anything else. Your test would break down because MySQL would have to serve many users at the same time while enforcing the 100-open-tables limit.

Also, I am not sure how you ran your tests, but let me say a couple of things about tests and measurements:
1) Process/thread schedulers (in Windows or Unix) work on their own. When you run a test, your system may be doing all sorts of things in the background. To eliminate this, you have to run your tests many, many times.
2) Your logical results should not be based on absolute numbers, but on percentage differences. Your computer may be stronger or weaker than another user's computer, so the absolute numbers do not count as much as the percentage differences between tests.

alatak

Hi,

Quote from: alex.s on November 23, 2011, 14:18:22 PM
Well... there were some posts apparently deleted so I can't quote, but here is a point to think about:
I have moderator rights on this forum. I can assure you that no posts were deleted.
We are open to discussion, and we do not censor posts, unless they are spam posts.

Globacide Solutions

I removed my own post, since Nicholas got upset and started cursing at us in Greek on his Twitter :)

I had intended my posts to be amusing, but later realised they might be misinterpreted.

Kisses and hugs ;)
Emilian Manolache
Globacide Solutions
VirtueMart Affiliate

alatak

#11
Hi,

Quote from: Globacide Solutions
I removed my own post, since Nicholas got upset and started cursing at us in Greek on his Twitter :)
Well, then I am not really used to this forum :). On the French Joomla forum, where I am also a moderator, I can see when a post has been deleted.

But I maintain what I said: we (the development team) do not censor you. :)

We do hope that the discussion will continue in a "polite" way, with technical arguments.

Milbo

#12
Heyhoo Radek,
now I'll take the time to answer your post, which has some good points.

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
It's the old German saying you probably know too: "Wer viel misst, der misst viel Mist" (he who measures a lot measures a lot of rubbish).
The funny thing is that during my studies I learned a very similar rule in software development: every measuring program influences the measured system.
Of course, yes, I know this too. But that saying is aimed exactly at people who measure for so long that they finally get the result they want. It is not aimed at me. That every measurement changes the result is very, very old news to me. In my youth I was a pure physicist; I just say hello to http://en.wikipedia.org/wiki/Uncertainty_principle. I learned this 15 years ago, and then I learned that this principle can be found everywhere (nice books about that were written by Terry Pratchett, btw). Technical explanations will follow.

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
Although you can more or less calculate the influence of a measuring device in electromechanics, it's barely possible to do so in software development.
I already had the idea of writing about that earlier. I know that measuring machines is quite tricky, but to set the standard so high that no one can fulfil it, and to say that benchmarking is so complex that only true professionals with the right tools can do it, is not the right answer for me. Even if I cannot measure the exact speed of my car, my speedometer says I am driving at approximately 70 km/h, and then I know I am too fast for the city. Whether it is 68.78 or 71.3... who cares? I also worked as an assistant for my professor and taught students how to measure weights, currents and so on. So in addition to the 50 different measurements every student does in his practical course, I checked the results of hundreds of students. "Wer viel misst, der misst viel Mist" is aimed exactly at people who have no intuition for it.

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
The last thing is about statistics: never trust a statistic you didn't forge yourself.
Now don't take this personally, but I know that people unknowingly tend to create statistics in a way that always proves their point.
This is also the reason why a software developer is the worst tester of his own software.
This is a valid point. In fact, it says that you do not trust me. On the other hand, I know it is true that people unconsciously try to present themselves in a good light. You can believe me that I avoided that. In fact there is not much room to do so: in more than 20 tests (I don't know exactly how many, I did a lot altogether) the numbers were always in the same range. But I think this does not depend on how many tests I do and how much data I release. In fact, it is not the point of the whole thread. Originally I just used a real example to underline this: "I think it is very important to explain why the old rule, the less SQL the faster the application, is sometimes wrong."

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
From what you are describing, you are trying to implement a standard Codd database design (maybe apart from the language table). Which is perfectly fine by me.
Ah, thanks :-). To be clear, I do not claim that we have the perfect implementation.

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
But there is also a thing I learned: there are three ways of developing software:
- The right one: follow all the design patterns, Codd's 12 rules, create strictly object-oriented software, and so on
- The efficient one: mostly ignore all these rules and concentrate only on performance
- The best one: somewhere between the two previous ways
I am the guy for no. 1, Patrick is the guy for no. 2, and Valerie takes the middle way, no. 3 - so together we should hit it :-)

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
So, as I wrote before, I'm not questioning your design, but you cannot be remotely serious when you claim that a large number of queries doesn't affect performance.
I did not say that. I said that it is sometimes better to make one query more, and that our database layout leads to this situation. In fact, of course we try to reduce the MySQL queries. But it is also a question of what you show. The demo front page, right after you install VM2, shows a lot of features: the manufacturer module, the top-ten module, categories, 3 groups of products. There is still room for optimisation, and hints are welcome, but they should be based on facts. One of my fast queries shows this time: 7.1999999999961E-5; of course the trailing digits are meaningless, but this query was clearly faster than this one: 0.000802. Btw, the first query is the one for the rules; the long-lasting query is Joomla's getSession. That I should optimise the loading of the rules is a hint I did find somewhere in this discussion.

Quote from: Radek Suski on November 23, 2011, 11:06:59 AM
And the second big problem is that you should be aware of the fact that many, many providers limit access to the MySQL server, for example the number of queries, or the time. So with 800 queries per page you will very soon get reports from your users about "MySQL server has gone away" and the like.
Then those people can't show everything on the front page. Part of this is the web administrator's job: the shop must fit the hoster, or vice versa. You can also configure the front page to use only 500 queries. Once I have optimised the rules loading, it will be better.

Quote from: alex.s on November 23, 2011, 14:18:22 PM
Now, on the funny side: Max, you need to rethink this entire thing, man. You cannot seriously expect everyone to use a good hosting company.
I did not say that. A 3-euro-per-month host is very cheap in my eyes. You cannot expect to run a multilingual shop with 100 categories, 50 manufacturers, 1000 products and 4000 media files at a proper speed on such a host.

Quote from: alex.s on November 23, 2011, 14:18:22 PM
Reality is not the lab. Your results are only indicative; the real results can only come from the real world, which unfortunately is full of crappy, overloaded, insecure, outdated or badly configured servers.
Exactly, and therefore you should never compare the situation of a normal Joomla page, with maybe a forum on it, to a webshop. The answer here is that you should NEVER run a webshop on such hosters. In case of a hack you could be sued, because you were too greedy to pay 10 euros a month for a secure server and chose instead the crappy, overloaded, insecure, outdated or badly configured one for 1 euro.

Quote from: alex.s on November 23, 2011, 14:18:22 PM
One thing I have to add from the real world is the maximum number of open tables, which is 100 by default. Imagine you have users on 5 sites, each opening various tables from Joomla, VirtueMart or anything else. Your test would break down because MySQL would have to serve many users at the same time while enforcing the 100-open-tables limit.
VM2 with 1 language plus Joomla has 60 tables. You will never hit this problem. If you install a lot of VM plugins that use tables, plus other components such as a forum, then you need a better hoster without these limits.

Quote from: alex.s on November 23, 2011, 14:18:22 PM
Also, I am not sure how you ran your tests, but let me say a couple of things about tests and measurements:
1) Process/thread schedulers (in Windows or Unix) work on their own. When you run a test, your system may be doing all sorts of things in the background. To eliminate this, you have to run your tests many, many times.

Now I come to the things I wanted to mention about benchmarking:
A long time ago I also worked for a quite famous German institute, in the numerical simulation section. I worked for more than a year with really crazy mathematicians, and in fact that is where I learned most of my programming. At that time I was paid to benchmark my results for a whole month. These people were mostly doctors of mathematics specialised in numerics. I produced a long series of data, but in the end I displayed only one sample, as I did here. Back then I decided to present the numbers you are most likely to see when you redo the test. There are always some thread problems, especially when you use Windows like me... So you need experience with your OS and your hardware. Then you can very quickly state a relative result which fits your machine. Porting this knowledge to other machines again needs a lot of knowledge about computers. I also wrote it that way: "Some numbers on my system for this task (145 categories)." So if you take it exactly, then read my statement exactly too, and consider that I said exactly that (in fact I wrote it that way to make clear that the numbers are only valid for my machine). I know that my results are only correct for my Phenom II processor, my motherboard, and my PHP, MySQL and Apache versions. On an iCore, ARM or old Pentium, with a different PHP or OS, the result will be different.
But in fact experience says that things would more likely turn out even better for my side: a specialised Linux machine, even with weaker hardware, is better than what I use now. At the moment I use a Windows desktop machine. When I am testing, there are a lot of processes running in the background. This is on purpose; when I tested graphics card stuff, I disabled what I could, but testing PHP here is a completely different thing. I am not interested in the best possible result; I am only interested in the relations.

Quote from: alex.s on November 23, 2011, 14:18:22 PM
2) Your logical results should not be based on absolute numbers, but on percentage differences. Your computer may be stronger or weaker than another user's computer, so the absolute numbers do not count as much as the percentage differences between tests.
This is exactly what I am doing. I wrote that the ratio is approximately 1:10; in fact it is maybe 1:8. The calculation I did is always valid; only the ratio changes. Just replace the ms with a unit u, define the factor as f, and you get:

Old solution:
per leaf: u/n + u
per node: u/n + u
1 node and 2 leaves = 3u/n + 3u

New solution:
per leaf: u/n + uf
per node: u/n + uf + u
1 node and 2 leaves = 3u/n + 3uf + u

So when f is higher than 2/3, the new solution is slower for this case. Checked? What do the numbers say?
vmdebug vmTime: getCategories: 0.0009 => 0.0006 would be enough; we measure 0.00012, which is only 1/6.
vmdebug vmTime: getCategories: 0.0012 => 0.0008 would be enough; we measure 0.00012, which is only 1/8.

It is funny, btw, that the number of children gets eliminated.
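
If you want to play with the formulas yourself, here is a small sketch of them in code (u, n and f exactly as defined above; the example values are mine):

<?php
// The cost formulas from above for 1 node + 2 leaves. The new
// solution wins whenever f < 2/3, independent of the number of
// children n, because n cancels out of the comparison.
function oldCost($u, $n)     { return 3 * $u / $n + 3 * $u; }
function newCost($u, $n, $f) { return 3 * $u / $n + 3 * $u * $f + $u; }

$u = 0.001; // ~1 ms listquery
$n = 2;     // children per node
$f = 0.1;   // measured fast/slow ratio, roughly 1:10
printf("old: %.4f s, new: %.4f s\n", oldCost($u, $n), newCost($u, $n, $f));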

Okay, now you doubt the numbers and say we cannot measure the time, right? So it seems I need to explain how the time measurement works. The time measurement takes the nanoseconds of the system time at the moment it is called. This is a PHP command implemented in C and quite direct. If we pick up noise between taking the time and storing it, we pick up almost the same noise on the same machine every time. That means the time measurement is, relatively speaking, quite correct for one machine. The problem is knowing the smallest unit you can measure. When I simply call the time-measuring line twice in a row, I usually get 5.0E-6. This means I must add this value to every measurement... or only a part of it. But look at my numbers above: I just compare 0.001 with 0.0001, and I need to add 0.000005 to both results. How much does that change the approximate 1:10 ratio? Not a lot, right?
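
You can estimate that smallest measurable unit yourself with a couple of lines. This is a sketch using PHP's microtime(true); the actual vmdebug code may take the time differently:

<?php
// Sketch: estimate the timer floor by calling it twice in a row.
// The difference (about 5.0E-6 s on my machine) is the uncertainty
// you have to allow for on every single measurement.
$a = microtime(true);
$b = microtime(true);
printf("timer floor: %.1E sec\n", $b - $a);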

And again, the maths above shows that the solution with more queries is faster as long as the fast query takes less than 66% of the time of the long query.

Ah, and something on http://en.wikipedia.org/wiki/Propagation_of_uncertainty : the formula above uses only linear terms, and n is a fixed integer, so there are no big surprises. When the measurement tool makes an error of 10%, the formula also makes an error of 10%. There is no d/dx stuff here, which is what usually makes it so complex.

And now back to work, guys.
Should I fix your bug? Please support the VirtueMart project and become a member.
______________________________________
Extensions approved by the core team: http://extensions.virtuemart.net/

alex.s

The numbers we have all been talking about are so tiny that one could think we are talking about nonsense and micro-optimisations. But the outcome of this discussion will probably draw the line for important guidelines for the entire project, so it matters.

I think you hit the basic point made in this discussion. When you have a very tiny measurement, don't run it 2 times, but 1,000 or 1,000,000 or 1,000,000,000 times. But even then it may still not be a good measurement, because MySQL's cache and internal optimisations can give you misleading results.

The only good answer would come from a server with several Joomla sites and concurrent users (or even bots, for the sake of the test) browsing each site's pages. The variables are so many that only a test like this would convince me.

My opinion is that you should leave the conclusion of this discussion open until a full-blown test has been run.

Milbo

You did not understand what I meant. I stated it above, and similarly on Twitter: "I think it is very important to explain why the old rule, the less SQL the faster the application, is sometimes wrong."

I proved that with pure maths; it has nothing to do with measurement.

Old solution:
per leaf: u/n + u
per node: u/n + u
1 node and 2 leaves = 3u/n + 3u

New solution:
per leaf: u/n + uf
per node: u/n + uf + u
1 node and 2 leaves = 3u/n + 3uf + u

So when f is higher than 2/3 it is slower; otherwise it is faster.

But I see you believe what you want to believe. If someone did such a big test, you would say "give me the sheets". If he gave you the sheets, you would say "you may have tweaked them". You would only believe it if you tested it yourself, but for that you say "there is no time...", so it is quite useless to discuss this further with you. The measurements say that the ratio is around 1:8-11, so when I say it is at least 1:5, I am on the safe side.
Should I fix your bug? Please support the VirtueMart project and become a member.
______________________________________
Extensions approved by the core team: http://extensions.virtuemart.net/