Post by account_disabled on Mar 11, 2024 13:32:24 GMT 8
Also, on the bright side, some of the improvements we made while trying to find the problem have increased the speed of our crawlers, and we are now fetching just over a billion pages a day. We had a bug. There was a small bug in our scheduling code (this is separate from the code that creates the index, so our metrics were still good). Previously this bug had been benign, but due to several other minor issues (when it rains, it pours) it had a snowball effect and caused some large problems. This made identifying and tracking down the original problem relatively hard.
The bug had far-reaching consequences: it caused sites to be crawled more frequently than they should have been. This happened because we crawled a huge number of low-quality sites over a period of days (we'll elaborate on this further down) and then generated an index with them. In turn, this raised all these sites' Domain Authority above a certain threshold; below that threshold they would have been ignored and the bug would have stayed benign. Once they crossed the threshold, the bug acted on them, and when crawls were scheduled these domains were treated as if they had a much higher DA than they actually did.
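To make the failure mode concrete, here is a minimal toy sketch of threshold-based crawl scheduling with the kind of bug described above. All names, thresholds, and DA values are invented for illustration; the post does not give the actual numbers involved.

```python
# Hypothetical sketch of the scheduling bug: domains below a DA
# threshold are ignored (where the bug was benign), but domains just
# over the threshold hit a buggy branch that treats them as if they
# had a much higher DA, so they are scheduled far too often.

DA_SCHEDULING_THRESHOLD = 30  # assumed value for the sketch


def crawls_per_week(domain_authority: int) -> int:
    """Toy model: return how often a domain gets scheduled."""
    if domain_authority < DA_SCHEDULING_THRESHOLD:
        # Sub-threshold domains are skipped entirely, so the
        # bug below never fires on them.
        return 0
    # BUG: instead of using the domain's real DA, the scheduler
    # effectively substitutes an inflated one, so a barely-qualifying
    # domain is crawled as often as a top-tier one.
    inflated_da = 90
    return inflated_da // 10


# A low-quality domain whose DA was briefly inflated past the
# threshold now gets the same crawl frequency as a DA-90 domain.
print(crawls_per_week(31))  # prints 9, same as crawls_per_week(90)
print(crawls_per_week(10))  # prints 0: below threshold, bug is benign
```

The fix, in this toy framing, would be to derive the crawl frequency from the domain's actual DA rather than the inflated value.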
Billions of low-quality sites were flooding the schedule with pages, which caused us to crawl fewer pages on high-quality sites because we were spending the crawl budget on lots of low-quality sites. ...And index quality was affected. We noticed the drop in the number of high-quality domain pages being crawled. As a result, we started using more and more data to build the index, increasing the size of our crawler fleet to expand daily capacity, offset the low numbers, and make sure we had enough pages from high-quality domains to produce a quality index that accurately reflected PA/DA for our customers.
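The crowding-out effect of a fixed crawl budget can be sketched in a few lines. The numbers and the proportional-split policy are purely illustrative assumptions, not the actual scheduler's allocation logic.

```python
# Toy sketch: a fixed daily crawl budget split proportionally to
# how many pages of each kind sit in the schedule. When low-quality
# pages flood the queue, the high-quality share collapses even
# though the total budget is unchanged.


def pages_crawled_per_group(budget: int, queue: dict) -> dict:
    """Split a fixed daily budget proportionally to queue size."""
    total = sum(queue.values())
    return {group: budget * count // total for group, count in queue.items()}


DAILY_BUDGET = 1_000_000_000  # ~a billion pages a day, per the post

# Before the bug: the queue is mostly high-quality pages.
before = pages_crawled_per_group(
    DAILY_BUDGET, {"high_quality": 600, "low_quality": 400}
)

# After the bug: low-quality pages flood the schedule.
after = pages_crawled_per_group(
    DAILY_BUDGET, {"high_quality": 600, "low_quality": 9_400}
)

print(before["high_quality"])  # 600,000,000 pages/day
print(after["high_quality"])   # 60,000,000 pages/day: a 10x drop
```

This is why expanding fleet capacity, as described above, can paper over the symptom: a bigger budget restores the absolute number of high-quality pages crawled without fixing the skewed queue.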