There you go again – the MLS doesn’t scale

[photopress:Reagan.jpg,thumb,alignright]Ever since Zearch, I’ve been bombarded with work to update or create MLS search web sites for various brokers & agents across the country. Because of this, I’ve had the opportunity to deal with other MLSes in the Bay Area (EBRDI) and Central Virginia (CAARMLS). Before I begin another MLS rant (and cause the ghost of the Gipper to quip one of his more famous lines), I want to say the IT staff at both EBRDI & the NWMLS have been helpful whenever I’ve had issues, and the primary purpose of this post is to shine a light on the IT challenges that an MLS has (and the hoops that application engineers have to jump through to address them).

After working with the EBRDI and the NWMLS, I can safely say the industry faces some interesting technical challenges ahead. Both MLSes have major bandwidth issues, and downloads from their servers can be so slow that it makes me wonder if they’re using Atari 830 acoustic modems instead of network cards.

The EBRDI provides data to members via FTP downloads. They provide a zip file of text files for all listing data (which appears to be updated twice daily), and a separate file with all the images for that day’s listings (updated nightly). You can request a DVD-R of all the images to get started, but there is no online mechanism to get older images. This system is frustrating because if you miss a day’s worth of image downloads, there’s no way to recover other than bothering the EBRDI’s IT staff. If the zip file gets corrupted or the download is otherwise interrupted, you get to download the multi-megabyte monstrosity again (killing any benefit that zipping the data might have had). Furthermore, zip compression of images offers no major benefit; the 2-3% size savings is offset by the inconvenience of dealing with large files. The nightly data file averages about 5 MB (big but manageable), but the nightly image file averages about 130 MB (a bit big for my liking, considering the bandwidth constraints the EBRDI is operating under).
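
To make the workflow concrete, here is roughly what the broker-side nightly import ends up looking like under this scheme. This is just a minimal C# sketch: the host name, file names, paths, and credentials are made up, and error handling and resume logic are omitted (which, as noted above, is exactly the problem).

```csharp
// Nightly EBRDI-style pull: grab the zip over FTP, then unpack it for import.
// All names below (host, credentials, paths) are placeholders for illustration.
using System.IO;
using System.IO.Compression;
using System.Net;

class NightlyPull
{
    static void Main()
    {
        var request = (FtpWebRequest)WebRequest.Create("ftp://ftp.example-mls.org/listings.zip");
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        request.Credentials = new NetworkCredential("memberId", "password");

        // No resume support here: if the connection drops partway through,
        // the whole multi-megabyte file has to be downloaded again.
        using (var response = (FtpWebResponse)request.GetResponse())
        using (var ftpStream = response.GetResponseStream())
        using (var file = File.Create(@"C:\mlsdata\listings.zip"))
        {
            ftpStream.CopyTo(file);
        }

        // Unpack the text files so the local database import can pick them up.
        ZipFile.ExtractToDirectory(@"C:\mlsdata\listings.zip", @"C:\mlsdata\extracted");
    }
}
```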

As much as I complain about the NWMLS, I have to admit they probably have the toughest information distribution challenge. The NWMLS is probably the busiest MLS in the country (and probably one of the largest as well). According to Alexa.com, their servers get more traffic than Redfin or John L. Scott. If that wasn’t load enough, the NWMLS is the only MLS that I’m aware of that offers sold listing data [link removed]. If that wasn’t load enough, they offer access to live MLS data (via a SOAP-based web service) instead of the daily downloads that the EBRDI & CAARMLS offer their members. If that wasn’t load enough, I believe they allow up to 16 or 20 photos per active listing (which seems to be more than the typical MLS supports). So, you have a database with over 30,000 active listings & 300,000 sold listings, all being consumed by over 1,000 offices and 15,000 agents (and their vendors or consultants). The NWMLS also uses F5 Networks’ BIG-IP products, so they are obviously attempting to address the challenges of their overloaded information infrastructure. Unfortunately, by all appearances, it doesn’t seem to be enough to handle the load that brokers & their application engineers are creating.

Interestingly, the other MLS I’ve had the opportunity to deal with (the CAARMLS in Central Virginia) doesn’t appear to have a bandwidth problem. It distributes its data in a manner similar to the EBRDI. However, it’s a small MLS (only 2,400-ish residential listings), and I suspect the reason it doesn’t have a bandwidth problem is that it has fewer members to support and less data to distribute than the larger MLSes do. Either that, or the larger MLSes have seriously underinvested in technology infrastructure.

So what can be done to help out the large MLSes with their bandwidth woes? Here are some wild ideas…

Provide data via DB servers. The problem is that as an application developer, you only really want the differences between your copy of the data and the MLS data, and providing a copy of the entire database every day is not the most efficient way of doing this. I think the NWMLS has the right idea with what is essentially a SOAP front end for their listing database. Unfortunately, writing code to talk SOAP, do a data compare, and download the changes is a much bigger pain than writing a SQL stored proc to do the same thing or using a product like Red Gate’s SQL Compare. Furthermore, SOAP is a lot more verbose than the proprietary protocols database servers use to talk to each other. Setting up security might be tricky, but modern DB servers allow you to have view, table, and column permissions, so I suspect that’s not a major problem. Perhaps a bigger problem is that every app developer probably uses a different back end, and getting heterogeneous SQL servers talking to each other is probably as big a headache as SOAP is. Maybe using REST instead of SOAP would accomplish the same result?
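
To make the “only the differences” idea concrete, here is the sort of incremental pull I’d love to be able to run directly against an MLS database. It’s only a sketch, assuming the MLS exposed a read-only SQL view to members; the table and column names (Listings, ModifiedDate, and so on) are invented, and every real MLS schema will differ.

```csharp
// Incremental pull against a hypothetical read-only MLS view: fetch only the
// rows that changed since the last successful sync and merge them locally.
using System;
using System.Data.SqlClient;

class IncrementalPull
{
    static void Main()
    {
        DateTime lastSync = GetLastSuccessfulSyncTime(); // tracked by the local app

        using (var conn = new SqlConnection("Server=mls.example.org;Database=Mls;User Id=member;Password=secret"))
        using (var cmd = new SqlCommand(
            "SELECT ListingNumber, Status, Price, ModifiedDate " +
            "FROM dbo.Listings WHERE ModifiedDate > @lastSync", conn))
        {
            cmd.Parameters.AddWithValue("@lastSync", lastSync);
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    UpsertLocalListing(reader); // insert-or-update the local copy
                }
            }
        }
    }

    static DateTime GetLastSuccessfulSyncTime() { return DateTime.MinValue; /* read from local state */ }
    static void UpsertLocalListing(SqlDataReader row) { /* merge the row into the local database */ }
}
```

Whatever the transport ends up being (direct SQL, SOAP, or REST), the key is that the server lets you ask for “everything modified since my last pull” instead of handing you the whole database every time.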

Provide images as individually downloadable files (preferably over HTTP). I think HTTP would scale better than FTP for many reasons. HTTP is a less chatty protocol than FTP, so there’s a lot less back & forth between the client & server. Also, there’s a lot more tech industry investment in the ongoing Apache & IIS web server war than in improving FTP servers (I don’t see that changing anytime soon).

Another advantage is that most modern web development frameworks make it easy to issue HTTP requests and generate dynamic images at run time. These features mean a web application could create a custom image page that downloads the image file on the fly from the MLS server and caches it on the file system when it’s first requested. All subsequent requests for that image would then be fast, since they are served locally, and more importantly, the app would only download images for properties that were actually searched for. Since nearly all searches are restricted somehow (show all homes in Redmond under $800K, show all homes with at least 3 bedrooms, etc.) and paged (show only 10, 20, etc. listings at a time), an app developer’s/broker’s servers wouldn’t download images from the MLS that nobody was looking at.
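
As an illustration, a cache-on-first-request image handler is only a few lines of ASP.NET. This is a rough sketch, not production code: the MLS image URL pattern, cache directory, and query parameters are invented, and input validation and error handling are left out.

```csharp
// Lazy image cache (e.g. mapped to Photo.ashx): serve a cached copy if we have
// one, otherwise fetch it from the MLS image server once and cache it locally.
using System.IO;
using System.Net;
using System.Web;

public class PhotoHandler : IHttpHandler
{
    const string CacheDir = @"C:\imagecache\";

    public void ProcessRequest(HttpContext context)
    {
        string listing = context.Request.QueryString["listing"];
        string photo = context.Request.QueryString["photo"];
        string localPath = Path.Combine(CacheDir, listing + "_" + photo + ".jpg");

        // Only the first request for this photo ever touches the MLS server.
        if (!File.Exists(localPath))
        {
            string mlsUrl = "http://images.example-mls.org/" + listing + "/" + photo + ".jpg";
            using (var client = new WebClient())
            {
                client.DownloadFile(mlsUrl, localPath);
            }
        }

        // Every subsequent request is served straight off our own disk.
        context.Response.ContentType = "image/jpeg";
        context.Response.TransmitFile(localPath);
    }

    public bool IsReusable { get { return true; } }
}
```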

Data push instead of pull. Instead of all the brokers constantly bombarding the MLS servers, maybe the MLS could upload data to broker servers at predefined intervals and in random order. This would prevent certain brokers from being bandwidth hogs, and perhaps it might encourage brokers to share MLS data with each other (easing the MLS bandwidth crunch), which leads to my next idea.
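
Before I get to that, here’s roughly what the push could look like from the MLS’s side: a scheduled job that shuffles the list of member endpoints and delivers the latest delta file to each one. Again, just a sketch; the endpoint URLs and file name are placeholders, and authentication, retries, and scheduling details are left out.

```csharp
// Push sketch: deliver the latest incremental update to each member broker's
// receiving endpoint, in random order so nobody consistently goes first.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

class PushScheduler
{
    static void Main()
    {
        var brokerEndpoints = new List<string>
        {
            "https://broker-one.example.com/mls/receive",
            "https://broker-two.example.com/mls/receive",
            // ...one entry per member office that opted in
        };

        var rng = new Random();
        foreach (string endpoint in brokerEndpoints.OrderBy(e => rng.Next()))
        {
            using (var client = new WebClient())
            {
                // HTTP POST of the delta file; the broker's endpoint just
                // accepts it and queues it for import on their own schedule.
                client.UploadFile(endpoint, @"C:\mlsdata\latest-delta.zip");
            }
        }
    }
}
```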

BitTorrents? To quote a popular BitTorrent FAQ – “BitTorrent is a protocol designed for transferring files. It is peer-to-peer in nature, as users connect to each other directly to send and receive portions of the file. However, there is a central server (called a tracker) which coordinates the action of all such peers. The tracker only manages connections, it does not have any knowledge of the contents of the files being distributed, and therefore a large number of users can be supported with relatively limited tracker bandwidth. The key philosophy of BitTorrent is that users should upload (transmit outbound) at the same time they are downloading (receiving inbound.) In this manner, network bandwidth is utilized as efficiently as possible. BitTorrent is designed to work better as the number of people interested in a certain file increases, in contrast to other file transfer protocols.”

MLS download usage obviously matches this pattern. The trick would be getting brokers to agree to it and doing it in a way that’s secure enough to prevent unauthorized people from getting at the data. At any rate, the current way of distributing data doesn’t scale. As the public’s and the industry’s appetite for web access to MLS data grows, and as MLSes across the country merge and consolidate, this problem is only going to get worse. If you ran a large MLS, what would you try (other than writing big checks for more hardware)?

16 thoughts on “There you go again – the MLS doesn’t scale”

  1. Robbie, what an insightful post – you’ve nailed the secrets of a scalable system-to-system integration architecture. Having built the systems integrations for Amazon’s drop ship program (i.e. solved a very similar problem), I can confirm that these problems are not unique to MLS’s; most industries make the same mistakes. Push instead of Pull (or “pull-all”) is the best advice here – although it’s hard for a techie to win that argument in the age of RSS. For solving inventory synchronization problems, “pull-all” always ends up more expensive despite the up-front dev. savings. “Pull-only-the-changes-since-last-pull” is probably the holy grail but may be way too expensive to dev. when compared to push.

  2. I’ve heard some similar war stories from people in the data aggregation team at Realtor.com… One of the guys who has been with Move since the beginning tells stories of aggregating MLS data from tapes that were sent via FedEx on a daily basis from MLS organizations all over the country! There is definitely no lack of diversity in the ways that MLSs aggregate their data!

    It is probably worth noting that the Center for Realtor Technology is working on standardizing formats around open standards. Even if it is not BitTorrent (slow down a bit Robbie!!!), it could be helpful in the long run!

  4. Well, it’s an interesting challenge, and you’re right that many industries have similar problems. I think the NWMLS is to be commended for leading the charge. By all appearances, they seem to be the MLS that’s closest to being ready for the challenges of the 21st century.

    I think the real problem with pull is that people always want the latest data, and the constant pulling is going to lead to abuse. With push, the MLS can force the issue and say you’re getting your data when we say you’re getting your data. I’m sure that won’t go over well, but it would stop abuse.

    The problem is complicated by the fact that the MLS isn’t in a position to force its partners to adopt the same platforms that it uses. It appears our MLS and the major traditional brokers in town (JLS, CB Bain) are Windows-based, while many other brokers (Redfin) are Linux/Unix-based. In my former life at Microsoft, the answer to the platform question was always known in advance (Windows, IIS, MS SQL, ASP.net). At Amazon, I suspect the same was true, even if the answer was different (Red Hat, Apache, Oracle, and Mason?). In the outside world, different people use different toolsets, and coming up with something in common that everybody finds acceptable isn’t as easy.

    I think our NWMLS has the right idea, but using SOAP instead of SQL directly just leaves a bad taste in my mouth. And I’m using .net, so I suspect my comrades in Linux land dislike it even more, since the SOAP frameworks in that universe aren’t as friendly as they are in the Windows world.

  5. Simply a question based on (hopefully short-lived) ignorance – what impact, if any, will RETS 2.0 have on this discussion and the future implementation of MLS’s and public IDX sites?

  6. Dustin – Don’t get me started on the schema differences each MLS has! I now understand why the UI on Move’s web sites is lacking: you probably have to employ at least 100 database engineers just to deal with getting everything into a common format! I also understand why it’s taken Redfin so long to invade California (same problem). Also, it shows you should never underestimate the bandwidth of a FedEx truck full of digital media (too bad FedEx trucks have such high latency). 🙂

    Jim – I think RETS has a potentially large impact. The benefit is that application developers would no longer have to redevelop large parts of the back-end portions of their IDX web application infrastructure for each MLS market they serve. Ideally, with RETS, changing Zearch (or some other MLS search app) for your MLS would be as easy as running a new database creation script, changing the URL I call for getting MLS data, and changing the logo at the top of the page. Unfortunately, it’s nowhere near that easy today.

    If RETS gets adopted widely, I suspect it will lead to lower prices and better quality IDX solutions for agents and brokers (a good thing), since vendors can spend more time on improving the part of the product that matters and less time on reinventing the back-end. The problem is you need to get enough MLSes in the country to adopt it before the IDX vendors will embrace it. Getting data via RETS is more complicated than FTP data dumps (although better in the long run, it will be more painful in the short run).

    Although I support what the Center for Realtor Technology is attempting to accomplish, I think their preference for Java- and Linux-based technologies will slow the adoption of their protocols. On planet Windows, the world revolves around .net now, and since half of the world is running Windows, they should release C#/.net AND Java implementations of all their technology to ensure that it’s as widely available as possible. Hell, maybe they can do some Perl or VBScript versions for the sake of completeness! I just think the medicine of RETS would go down much easier if it had some C#/.net sugar. Even if the protocol is open, you need good multi-platform & multi-language implementations available to make it easier for folks to embrace and use it in their environments.

  8. Hey Robbie,

    We deal with some monster MLSes: 10,000+ updates a day. We have direct SQL access to their DBs, and that works very slick. The biggest problem around the country is when the MLS vendor makes a change in the DB design and they update 400,000 records in one shot. This will usually bring the system down. That usually makes for a good time to purge your local system and download a fresh copy.

    I agree the NWMLS is one of the most progressive. However, the RETS standard is coming on strong. The nice thing about RETS is it usually eliminates the weekend warriors from downloading the MLS, because it takes a higher skill set than someone armed with an FTP program. The bad thing about RETS is people try to use it like a live DB and make complicated queries against it. If the MLS would set it up to only allow queries off modified date and listing number, that would increase performance on the entire system. I also think the MLS should approve the app that is hitting the RETS server; this would force developers to be efficient in their use of the servers. In some MLSes, companies hit the RETS server for all modified listings in the last 24 hours. This alone is not a bad thing, but when they run that same task every 10 minutes it really slows down connections.

    Anyway, those are my thoughts, and I’m glad to see others are in the same jungle.

  9. Allen,

    If you don’t mind me asking, did you write your own RETS processor or download it from someplace? I haven’t been able to find a C# version on http://www.crt.realtors.org and my otherwise super googling powers have failed me. 🙁

    I hope to join the RETS revolution one of these days…

  11. I can best explain why I can’t follow this string by letting you know that to get the information to write my first real estate transaction in 1978, I had to dial a phone, hear the computer tone, push the phone into a hub, listen to it connect the modem, and wait an hour for the new listings to get printed on a dot matrix printer!
    Not to mention, when I went to college in ’66, learning computer programming for chemistry meant learning keypunch and DITRAN (specific to UCSD). My first word processor was WordStar, my first spreadsheet was Quattro (had to write all the formulas), and my first hard drive was 2 MB, and I had it turbo-charged in 1980. But I’m still lost trying to understand this string! Bet you guys have never even heard of these programs. The computer crashed more than it ran, and when 3.0 came out, boy what a learning curve, but I really loved Excel.

  12. The biggest problem with RETS, in my opinion, is that there are too many poor implementations of it. We download via RETS from MLSes that use FNIS, Rapattoni, MarketLinx, FBS, and Interealty. Of those 5, there are 2 fairly good implementations. I won’t say which, because I don’t want to ruffle any feathers (and get cut off).

    It seems that the other 3 vendors of RETS servers started with the notion that just any old server they had lying around would be sufficient to handle the load. Then when they discovered that it was nowhere near sufficient, rather than upgrading, they decided to handicap the developers by placing egregious limits on them.

    Some RETS MLSes restrict you to downloading only during certain hours of the day. Others limit your data download to 1,000 listings per download, which would be okay if I were pulling every 30 minutes, but they scream if you pull more than twice a day. We had to argue and haggle to get the limit raised from 400 to 1,000. That was a major improvement.

    They tell you to just use the OFFSET parameter to stagger your download by 1000 (1st pull, offset = 0; 2nd pull, offset = 1000; and so on), but that doesn’t work. That would only work if there were absolutely no listings added or changed during the whole download process (which can take up to half an hour on their slow server).

    One vendor has seen fit to allow us to download data without a limit as long as we only download the MLS Number. So our process first has to download the MLS Numbers of the new and changed listings. From there, we have a formula for calculating how many downloads of each property type we have to do to stay under the limit, and then we write out the download script pulling listings by their MLS Numbers. And because we can only log in once at a time, we have to wait until this is done before we can start downloading photos. It’s like watching the grass grow.

    I guess, my point is, be careful what you wish for.

  13. Please, nobody use Alexa for any traffic rating. The numbers are false and cannot be substantiated. Think about it: how can they possibly know true traffic stats without either having access to the server or code implanted on the site? Cookies and the Alexa toolbar are the only way. Alexa traffic stats are the biggest scam on the net.
    If anyone is looking into buying a domain or ad space on a website and is referred to Alexa stats to support high rates, then walk away and don’t look back.
    Dan

  15. Pingback: Death by a thousand paper cuts | Rain City Guide | A Seattle Real Estate Blog...
