luckyrobot: Semantics, Search and Big Honking Databases

  • terrycojones · 11 months ago
    Oh Gerry, I just flashed through the above before heading off to see my kids. Now I'm going to have to read it at length!

    I'll comment properly tonight. You make me smile & I hope vice-versa.
  • terrycojones · 11 months ago
    Hi again Gerry

    I think you probably know most of my thoughts on all this, but I'll summarize quickly.

    I guess I hate semantics. In fact I don't think there's any such thing as meaning, or understanding. Those are just words. We poor humans take comfort from imagining that they correspond to some underlying "thing" (if I were Husserl or Heidegger, I'd use another word than "thing"), but they do not. As you might put it, we live in an ambiguous world. Deeply ambiguous. I also don't think you can ever answer any question that starts with the word "Why", but that's another (closely related) subject.

    Ahem.

    What I do believe is that if you want to try to build applications that give the illusion of semantics :-) then you should build them on the most flexible architecture possible. Because as your application becomes increasingly heavily used, or as you increasingly realize that you didn't really know what you were doing in the first place, you're really just starting to plumb the depths of ambiguity and if your underlying architecture runs out of flexibility, then you have a problem.

    I also think the base architecture has to be dead simple. Even if it's dead simple it's going to be very hard to build it properly and have it scale. Freebase and SimpleDB didn't come into existence overnight. SimpleDB is certainly not the answer to any of what you're imagining. Freebase is much more interesting. Fluidinfo's FluidDB has some pretty striking departures from Freebase (which I'll save for now). It's safe to say that we're going after different regions of the same space. And it's a very big space. One difference is that Freebase are really into big honking datasets, whereas that's not my initial interest at all.

    I would also add Google's BigTable to the mix, as well as Neo4j (http://neo4j.org/) - let me know if you want an intro. Again, there are differences in emphasis, again within a big honkin' space of possibilities and value (except in the case of my company, which is apparently unfundable :-)).

    Thanks for taking all the time to write this up. Your experience is bigger than N for almost all values of N. I hope Fluidinfo can move quickly enough on the tech side that we'll find a way to do something together before you're off into some other irresistible project.

    Terry
  • gerry campbell · 11 months ago
    I love a vigorous debate, but actually there isn't one here. I pretty much agree with you.

    We choose to state it differently (I think).

    The word "semantics" has been used to cover so many things that it may have lost some precision in its definition. I may be guilty of using a broadened version here.

    I think of it this way: Google solved one of the vexing problems of search: Out of millions of results, which one comes first in the rankings? They created PageRank to approximate what *most humans* would find to be the best result. That was a hard problem and they stepped up to OWN the solution, even creating the "I'm Feeling Lucky" button to emphasize that they had a solution to the problem.

    But it's still wrong a portion of the time for a large percentage of users... So they leave the other 999,999,999 results just in case.

    That simple approximation, and the willingness to accept some error while committing to improvement has revolutionized search.

    The exact same thing applies to understanding meaning. If we can accept error, and use words like "semantics" or otherwise to describe what we're trying to do, we can make progress.

    If you want to coin a new term for this I'll gladly use it and give you attribution. ;-)

    What we seem to agree on is that there's no room for purity and absolutism here...
  • terrycojones · 11 months ago
    Hi again Gerry

    I wasn't being very nuanced in my original comments. That's partly due to lack of time, partly due to liking a more colorful debate. So here are a few more thoughts, and some pointers.

    Consider Artificial Intelligence and its pursuit of intelligence. We once thought it took real intelligence to play chess (for example). But as we got better and better at engineering, and we thought up smarter (but completely mechanical and non-mysterious) algorithms, we moved the goalposts. I.e., we decided that actually you didn't need to be "intelligent" to play chess after all.

    I don't believe that "intelligence" corresponds to any "thing" either, just like I think "meaning" and "understanding" are also just words. What I do believe however is in engineering and tool-building. We're primates, and primates are pretty good tool builders. So I often suggest to people that they spend less time (and investment monies :-)) on chasing abstract words and more time on building tools.

    The lesson of AI seems clear. If your tools are good enough, you can give the illusion of intelligence up to and beyond (i.e., beyond grandmaster) where it matters in any practical sense. The computer plays chess so well that you might as well say it's intelligent, or not - it just doesn't matter anymore.

    And I believe the same is true of semantics, and going after meaning and understanding. Those things can perfectly well not really exist while at the same time we can practically achieve them (i.e., the convenient and practical illusion, as with intelligence for the purposes of chess playing) by just focusing on engineering and tools.

    Make sense?

    From that POV, I argue that huge strides can be made by improving representation. If you get representation right, things that look like problems can simply go away. If you get the representation right, you may not even need a clever algorithm. Can you do an end-run around Google's armies of PhDs by changing representation? I.e., don't challenge them on the algorithm front, where you're bound to lose, but change the ground under them. You won't be surprised to hear that I think the answer is yes. I'm not talking about "beating" Google as a company, but of taking search - and how we work with information in general - to a new level.

    I wrote about this at some length, back before it was so fashionable to be me :-)

    The main posting is http://www.fluidinfo.com/terry/2007/03/19/why-d...

    And there are several others, including some that give very simple examples of why representation is so important, at http://www.fluidinfo.com/terry/category/represe...

    In summary, I don't think the words matter much. I think we can achieve amazing results (things that look like real intelligence, real understanding, that somehow capture meaning, etc) simply by focusing on engineering. My best bet about where to focus is on representation. What are the implications of the various new ways of representing information that we're exploring? I've been pondering that for over a decade! :-) My own bet, via Fluidinfo, definitely has some strong advantages and some strong weaknesses. It's a tradeoff, like so many things in computer science. Other approaches represent different tradeoffs. It's far from clear what will "win". But as I said in my earlier comment, it's a vast space we're starting to explore, and, I like to imagine, there's plenty of value to go around.

    I hope that's a clearer and a more useful answer.
  • direwolff · 11 months ago
    Great post Gerry and I'm dealing w/some of these issues today in the video space. Before I forget, you should list DBpedia as one of the publicly accessible databases. It's an RDF-normalized version of Wikipedia which now enables programmatic use of Wikipedia information.

    Combining your comments w/some of what Terry said below about our ambiguous world, I'm reminded of a magazine story a long time ago describing Microsoft's poor behavior which was titled "The Gates of Hell". Now if you consider the use of "gates" here, it's a double entendre. It's both a play on the doorway and on the person. Not very clean semantics ;)

    Great points you're making and now the business questions have to be satisfied as well as the financial incentives for the participants (content creators or curators) to do all of this work. When I think of all the work that sites have done in the past 5 yrs for SEO purposes, it's been all about findability. Making themselves more clearly indexed by the search engines and in turn more findable by people using search engines.

    Yesterday I spoke w/a stealth startup that is facilitating companies' ability to more easily make their data accessible to apps but w/business rules and metering services layered into their platform. I love that because companies have data that they s/b making more easily accessible for various uses, but also need to have an ability to monetize that and control to whom and how it is made available. Enabling easy access to it, but still having a spigot to open, filter and close its access to apps seems like the balance needed to open things up faster.

    In a world where data access is opened up because the right business rules can be put into place, and findability continues to be important to companies for the content, products and services they offer, the justification for properly marking up their content does exist. The challenge, which as you're pointing out is slowly being remedied, is a "chicken and the egg" one. Until the apps exist that make use of this marked up content in rich and useful ways, publishers and merchants do not want to go through the trouble of doing all of this work. On the other side of the house, app developers feel stifled from doing a lot of work since they don't have easy access to rich content sources for cool apps. An interesting company that is creating some good justifications for doing the mark-up work is Dapper in the semantic advertising space.

    It is getting easier, and I know that in my company's case, we're increasingly finding open access data sources to incorporate into our processes and apps, but there's still a long way to go. One of the best data sources we make use of for company and product information does not make their stuff accessible in an RDF or even XML formatted way. Hence, we have to convert their data and upload it into our databases to then make it useful. It would have been nicer if they played a more open game, but they're small and have a hard time justifying the work to do this.

    I'm a lil' all over the place in this comment, but in essence, I agree w/Terry's comments which echo aspects of yours. Making a light weight technology to make content easily accessible at an organizational level, not so much in a big honking database, is really the way to go. It reminds me of the late '99/'00 time frame when RSS was still very early in its use. The company I was working with had chosen to develop syndication tools for the ICE (Information and Content Exchange) syndication protocol which Vignette was supporting. It was more secure and reliable than RSS, and for the pro content providers of the time (e.g. Reuters), this was very important as they warmed up to distributing their content to online publishers. ICE was a bulky technology however. RSS by contrast was light weight and was easy for almost anyone to use. While it didn't offer much in the way of security, it turns out that this didn't matter. As well, its growth was secured the same way that eBay's and YouTube's was, by individuals w/a need (i.e. bloggers and their readers). After RSS garnered so much attention, the professional content providers realized that they needed to make their content available via RSS if they were to remain relevant (as has happened w/pro merchants on eBay and pro video content providers on YouTube). I believe the same could happen w/a light weight technology that helps content providers make their content available quickly and easily in these marked up ways.

    However, where publishers and merchants won't do the work at all, then a big honking database (a la Freebase) could take the work out of their hands and enable someone else to benefit fm the value of aggregating and structuring all of this information for meaningful uses. The big search engines have an obvious advantage here since they could theoretically start doing work to make their aggregated info available in interesting formats. Microsoft's acquisition of Powerset may have aspects of that to come, but maybe not.

    Anyway, I'll concur w/your thesis that immense biz opportunities do exist to those who figure out how to pull all of this together.
  • David Semeria · 11 months ago
    Gerry's right about 'semantics' being an overly used expression.

    Terry's argument regarding the objective definition of meaning refers to the term's traditional philosophical usage, whereas Gerry and direwolff are talking about contextual ambiguity.

    I believe pursuing semantics (philosophical) in computing is a futile endeavor until machines are able to feel the wind on their faces.

    Resolving contextual ambiguity, however, is a much more attainable and in many ways more useful goal. How many times have you Googled something only to be returned hundreds of pages with the 'other' use of your key word?

    Whether progress is made via changes in representation, better algorithms or even some sort of stochastic analysis is largely irrelevant (to me).

    The key point is that whoever makes progress in this space will, as the VCs like to say, take away a lot of pain.
  • gerry campbell · 11 months ago
    Does the nature of the task (and this discussion) change if we talk about it as codifying *relationships*? That's really where I am going.

    I am not sure it makes any difference at all WHAT the thing is, it's more about the interrelatedness of one word to other words. In that case, the ambiguity is represented in a set of linkages that are more or less exclusive.

    For example - the linkages to gates the thing vs gates the person would be different. Even in the case of that double entendre, the two sets could be statistically separable.
  • gerry campbell · 11 months ago
    and can't we use co-occurrence, etc to establish that relatedness...
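
    A minimal sketch of that idea, assuming a tiny made-up corpus: it counts which words co-occur with "gates" in documents about Microsoft versus documents about gardens, then measures how much the two sets of linkages overlap. The corpus, the `neighbors` and `jaccard` helpers, and the choice of Jaccard overlap are all illustrative assumptions, not anything described in the post.

```python
from collections import Counter

# Hypothetical toy corpus; each string stands in for one document.
DOCS = [
    "bill gates founded microsoft and wrote software",
    "gates spoke about microsoft software strategy",
    "the garden gates and the fence need paint",
    "iron gates guard the garden fence",
]


def neighbors(docs, target, anchor):
    """Count words sharing a document with `target`, using only
    documents that also mention `anchor`."""
    counts = Counter()
    for doc in docs:
        words = set(doc.split())
        if target in words and anchor in words:
            counts.update(words - {target})
    return counts


def jaccard(a, b):
    """Overlap between two sets of linked words, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Linkages for gates-the-person vs gates-the-thing.
person_links = neighbors(DOCS, "gates", "microsoft")
thing_links = neighbors(DOCS, "gates", "garden")
overlap = jaccard(person_links, thing_links)
```

    On this toy data the two sets share only the filler word "and", so the overlap is small. A real corpus would need stop-word filtering and far more documents before the separation meant anything.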
  • David Semeria · 11 months ago
    Gerry, I would say the goal is to *infer* context rather than codify it.

    For example, in a document that is tagged as about MSFT, references to Gates are statistically more likely to refer to the person rather than the object. So, instead of tagging (codifying) each individual reference to Gates in the document, context can be inferred from one single tag, and hence the ambiguity resolved.
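
    One way to sketch this inference, assuming we already had counts of how often "Gates" meant each sense under a given document tag. The counts, the tag names and the `infer_sense` helper below are invented for illustration.

```python
# Hypothetical counts: how often "Gates" meant each sense in
# documents carrying a given tag. These numbers are made up.
SENSE_COUNTS = {
    "MSFT": {"person": 95, "object": 5},
    "gardening": {"person": 2, "object": 98},
}


def infer_sense(doc_tags, sense_counts=SENSE_COUNTS):
    """Pick the most likely sense of an ambiguous word from the
    document-level tags, returning (sense, share of the evidence)."""
    totals = {}
    for tag in doc_tags:
        for sense, n in sense_counts.get(tag, {}).items():
            totals[sense] = totals.get(sense, 0) + n
    if not totals:
        return None, 0.0  # no known tag, so nothing to infer from
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


sense, confidence = infer_sense(["MSFT"])
```

    Every mention of "Gates" in the document inherits the inferred sense, so only the one document-level tag has to be supplied by hand.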
  • Rob Mapstead · 11 months ago
    With the 2010 Census on the horizon, I'm thinking about applying for a job with the Census just to see if I can help make sense of it all. Wouldn't it be great if the Census data actually provided us with data that all Americans could actually benefit from? Your discussion of tagging data is extremely important in this regard.

    As it relates to tagging words, isn't this just XML? And don't we also need to tag whole phrases and not just words?
  • Kingsley Idehen · 10 months ago
    How about the burgeoning cloud of RDF based Linked Data?

    Links:
    1. http://virtuoso.openlinksw.com/images/dbpedia-l...
    2. http://esw.w3.org/topic/SweoIG/TaskForces/Commu...
    3. http://dbpedia.org/resource/Linked_Data - cross linked with Freebase and many other structured data spaces
  • Maxim · 8 months ago
    I read your post and was amazed how close our work is to what you describe here as semantic technology. We have developed a technology for semantic search and text analysis which leverages Wikipedia knowledge to derive concept meaning and relationships. By now Wikipedia has grown into a massive up-to-date database of such relationships. We would like to show our technology to you as it implements nearly everything that you described in your post: disambiguation, semantic tagging, semantic similarity to find related content/concepts and more. Could you please email me at maxim@grinev.net and I will reply with more details. Thank you.