Now, I don't want to be a downer, but we collectively seem to have forgotten that HTML, as a markup language with sufficient semantic elements, is a perfect API in itself. In fact, if we had stuck with XHTML, I would argue that it would have been an even better API than JSON thanks to XPath, XQuery and XSLT.
"HyperText is a way to link and access information of various kinds as a web of nodes in which the user can browse at will. Potentially, HyperText provides a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help. We propose the implementation of a simple scheme to incorporate several different servers of machine-stored information already available at CERN, including an analysis of the requirements for information access needs by experiments… A program which provides access to the hypertext world we call a browser — T. Berners-Lee, R. Cailliau, 12 November 1990, CERN "
Web apps are somewhat backwards in my opinion. We completely lost the idea of a markup language and decided “wait, we need an API”. So we started using JSON instead of the actual document (HTML or XML) to represent the endpoints. And then we patted ourselves on the back, claiming “accessibility”.
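To make the "markup as API" point concrete, here is a minimal sketch that reads a well-formed XHTML page the way a client would read a JSON endpoint. The page markup, the `rel="author"` convention and the `list_posts` helper are all invented for illustration, and Python's ElementTree only supports a limited XPath subset, so treat this as a rough idea rather than a full XPath/XQuery demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page; the markup is invented for illustration.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <article id="post-1">
      <h1>First post</h1>
      <a rel="author" href="/users/alice">alice</a>
    </article>
    <article id="post-2">
      <h1>Second post</h1>
      <a rel="author" href="/users/bob">bob</a>
    </article>
  </body>
</html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}


def list_posts(document):
    """Read the document itself as if it were a JSON endpoint."""
    root = ET.fromstring(document)
    posts = []
    # ElementTree implements only a subset of XPath, which is enough here.
    for article in root.findall(".//x:article", NS):
        author = article.find("x:a[@rel='author']", NS)
        posts.append({
            "id": article.get("id"),
            "title": article.findtext("x:h1", namespaces=NS),
            "author": author.text,
            "author_url": author.get("href"),
        })
    return posts


posts = list_posts(XHTML)
```

The same document still renders in a browser, which is the whole argument: one representation for humans and machines, as long as the markup is well-formed and the semantics are actually there.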
I think it's what we call in French a fausse bonne idée: an idea that looks good only at first sight.
I have scraped many sites, and making an API out of a site is really similar to web scraping. Despite having tried many times to create a universal scraper, or at least a helper class for scraping, I found that starting from scratch for each site was the best strategy.
There is a lot of info on Stack Overflow about that, and there was a post specifically on this issue, but I can't find it.
Maybe one third of the time, everything is OK and I use my favorite tools: python-requests with requests_cache for the connection and lxml for parsing. In those cases, Toapi should work. But there is so much variety among websites that most of the time I end up doing something else: using another way to get the data (WebSocket, PhantomJS, headless Chrome, or going through a proxy or Tor, ...), or using other data extraction methods because XPath does not work (parsing JSON, taking a screenshot and then running OCR to get the position of text, even fastText!). Sometimes there are encoding issues. Sometimes the HTML is malformed. In these two thirds of cases, Toapi won't work out of the box. So you will try to fix it, improve it, etc., and the Toapi code base will grow, but it will never handle all the cases.
As HN sometimes likes to pretend XHTML is the answer to all problems with HTML and the Semantic Web would have worked if only developers hadn't been such idiots at the time, let me reiterate a few things:
First off, a full disclosure: I was (and in principle still am) a big fan of web standards and was a strong believer in XHTML and XHTML 2. I thought XML was going to save us all.
Here's a hard truth about HTML: it was never about semantics. HTML was created to connect digital documents. Most of the initial tags were fairly ad hoc and loosely based on CERN's SGML guide, as is still evident from the awkward h1-h6 elements.
The biggest thing the first W3C spec[HTML32] added to HTML? The FONT element (and arguably APPLET because Java applets were a thing). Note that it really didn't add any elements for additional semantics. It was just another iteration on HTML that merged minor changes to HTML 2.0 with some proprietary extensions (mostly by Netscape).
At this point it's worth emphasising what has become the most important mantra in web standards: Don't Break the Web. HTML 4 Strict was trying to enable a break as an opt-in. The website would behave exactly the same way but the DOCTYPE would tell browsers the page doesn't use certain elements the spec defined as deprecated.
Of course by this point web browsers no longer cared about SGML and were purpose built to handle HTML and be able to render whatever junk a content author would throw at them. The Browser War was raging (Netscape still believed it could make money selling commercial licenses for their Navigator -- yes, surprise, Netscape Navigator wasn't free for commercial use) and if your browser couldn't render a page but a competitor's could, that's where users would go.
So in practice Strict vs Transitional didn't make any difference except as an indicator of how sloppy the author likely was with their markup. Eventually Internet Explorer started using it as a shibboleth to determine whether to fall back to "quirks mode" or follow the standards more closely, but deprecated elements would still render in strict mode and the only things that cared were automated validators that spit out pretty badges.
When the W3C created XHTML[XHTML1] the entire point was to drag HTML into the XML ecosystem the W3C had obsessively created and that just wasn't getting much traction on the web. Instead of having to understand SGML, browsers could just learn XML and they'd be able to handle XHTML and all the other beautiful XML standards that would enable the Semantic Web and unify proprietary XML extensions and open standards side by side.
Of course the only flaw in that logic was that browsers didn't understand SGML in the first place. Adding support for XML actually meant implementing a second language alongside HTML. And because XML had much more strictly defined implementation requirements, this meant, among other things, that web pages that weren't well-formed XML had to break by definition: browsers were only allowed to show a yellow page of death with the syntax error.
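The difference in error handling is easy to reproduce. The sketch below feeds the same invented tag soup to a strict XML parser and to Python's lenient `html.parser`. Note that `html.parser` is only a tolerant tokenizer, not an implementation of browser error recovery, so this illustrates the "must break" versus "keep going" contrast and nothing more.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical tag soup: unclosed <p> and <br>, mis-nested <b>/<i>.
TAG_SOUP = "<html><body><p>Hello <b><i>world</b></i><br><p>Bye</body></html>"


class TextCollector(HTMLParser):
    """Lenient path: collects the text it finds and ignores bad nesting."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_as_xml(markup):
    """Strict path: any well-formedness error is fatal."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as exc:
        return f"error: {exc}"


def parse_as_html(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


xml_result = parse_as_xml(TAG_SOUP)    # a parse error, nothing is extracted
html_result = parse_as_html(TAG_SOUP)  # the text survives the sloppy markup
```

From an author's point of view, the strict path turns one missing slash in a WYSIWYG tool's output into a completely broken page.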
This is why browsers for the most part ended up ignoring XHTML: Firefox supported XHTML but developers were upset with WYSIWYG tools breaking their websites in Firefox and because ensuring all your output is proper XML is hard, the easier fix was to just send your "XHTML" tag soup with the HTML mime type so Firefox would just handle it as HTML (which all the other browsers did anyway -- except Internet Explorer, which for once did the right thing and dutifully refused to render proper XHTML because after all it only understood HTML, not XHTML).
But as I said: XHTML wasn't about semantics. In fact, XHTML 1.0 copied the exact same Transitional/Strict/Frameset model HTML 4 had introduced, allowing developers to write the same sloppy markup (but now with more slashes). And the W3C even specified how authors could make sure their XHTML could be parsed as HTML tag soup by non-XHTML browsers (which led to millions of developers now thinking closing slashes in HTML are "more standards compliant").
A few moments later the W3C created XHTML 1.1[XHTML11], which was mostly just XHTML 1.0 Strict but now split into multiple specs to make it easier to sneak other XML specs into XHTML in the future. Again, nothing new in terms of semantics.
Meanwhile, work began on XHTML 2.0[XHTML2], which would never see the light of day. XHTML 2.0 was finally going to be a backwards incompatible change to HTML, getting rid of all the cruft (more than XHTML 1.1 had tried to shake off by dropping Transitional, even promising to kill a few darlings like h1-h6) and replacing lots of HTML concepts with equivalents already available in other XML specs. XHTML 2.0 would finish the transition that XHTML had started and replace HTML with the full power of the XML ecosystem.
Except obviously that was not going to happen. Browser vendors for the most part had given up on XHTML 1 because authors had no interest in buying into a spec that provided no discernable benefits but would introduce a high risk of breaking the entire website or requiring workarounds for other browsers. Netscape was dead, Internet Explorer had stabilised and didn't seem to be going anywhere, the Browser War that had fueled innovation was largely over.
But even so, XHTML 2 wouldn't really have added anything in terms of semantics. That wasn't the goal of XHTML 2. The goal of XHTML 2 was to eliminate redundancies within the new XML/XHTML ecosystem, generalise some parts of XHTML into reusable XML extensions and specify the rest in a way that integrates nicely with everything else (leaving the decision what XML extensions to support up to the implementors).
The real game changer in semantics was Microformats[MFORM]. Because the W3C was too slow to keep up with the rapid evolution of the real web and too academic to be aware of real world problems, the web community came up with ways to use the existing building blocks of HTML to annotate markup in a machine-readable way, building on existing specs -- all this without XML.
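For anyone who hasn't seen Microformats in practice, here is a rough sketch of the idea: ordinary `class` attributes carry the machine-readable meaning. The sample uses classic hCard class names (`vcard`, `fn`, `org`, `url`), but the page and the `HCardParser` class are invented for illustration, and the parser ignores nesting and scoping that a real Microformats parser would have to handle.

```python
from html.parser import HTMLParser

# Invented sample marked up with classic hCard class names.
PAGE = """
<div class="vcard">
  <a class="url fn" href="https://example.com/jane">Jane Doe</a>
  <span class="org">Example Corp</span>
</div>
"""

# Properties whose value is the element's text content.
TEXT_PROPERTIES = {"fn", "org"}


class HCardParser(HTMLParser):
    """Collects a few hCard properties from plain HTML class attributes."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = []  # property names of the element being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        # The "url" property takes its value from href, not from the text.
        if "url" in classes and "href" in attrs:
            self.card["url"] = attrs["href"]
        self._current = sorted(classes & TEXT_PROPERTIES)

    def handle_endtag(self, tag):
        self._current = []

    def handle_data(self, data):
        for name in self._current:
            self.card[name] = data.strip()


parser = HCardParser()
parser.feed(PAGE)
parser.close()
card = parser.card
```

No XML, no namespaces, no new elements: the annotation rides along in markup that authors were already writing.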
When browser vendors finally gave up relying on the W3C's leadership for HTML and began working on a new HTML spec[HTML] under WHATWG, the main lesson was to pave the cowpaths. Instead of relying on authors to explicitly add annotations (although that is still possible using ARIA[ARIA]) various semantic elements like MAIN and ASIDE finally made it into the language. But the most important change was that the spec finally defined the error handling browsers had previously implemented inconsistently.
But again, even with the new HTML spec, HTML's goal never was to be able to declare semantics for any possible document. Yes, you can embed metadata (and the W3C still likes to push RDF-XML as a solution to do that in HTML) but the core elements are only intended to be good enough to provide sufficient semantics to structure generic documents.
Domain-level semantics are still left as an exercise to the reader. And anyone who's tried to actually parse well-formed, valid, well-structured HTML for metadata can tell you that even for generic documents the HTML semantics just aren't sufficient.
Sorry for the lengthy history lesson, but it never ceases to amaze me how rose-tinted some people's glasses are when looking at the mythical Semantic Web, early HTML and the XML ecosystem. I didn't even go into how much of a mess the latter is but I think my point stands even so.
On h1-h6: The initial list of HTML tags[TAGS] allowed for "at least six" levels of heading, whereas the RFC[HTML2] only defined h1-h6 and explicitly stated how each level was to be rendered. The original SGML guide[SGML] that the tag list was loosely based on used up to six levels of headings in its examples. So in a nutshell, the reason there are six is likely a mix of "good enough" and nobody being able to figure out how to meaningfully define the presentation for anything beyond h6.
I built a similar service called https://Page.REST a couple of months back, which uses CSS selectors instead of XPath for capturing content.
This kind of service is great and technically well done. Of course it "should not be necessary" if people exposed well-structured information in (X)HTML. However, many actors work hard to publish their valuable data only in a way they can control (for instance, human-readable only). Such API efforts are helpful to demonstrate that it is impossible to publish data and keep it fenced in at the same time.
I wish something like this would use asyncio rather than Flask.
Does anyone know a service that does scraping of content behind login gates?
It seems to be missing an option to use local variables, i.e. to use the id of an item to look up related data instead of only the info inside that element.
For the Hacker News example, that would be needed to parse the score and comment count.
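To show what I mean by a local variable, here is a minimal sketch: capture the id from the item row, then use it to find data that lives outside that row. The listing below is a simplified, well-formed stand-in; its structure and the `parse_items` helper are assumptions for illustration, not a copy of the real Hacker News markup or of anything Toapi provides.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a listing page: the item row carries an id, while
# the score and comment count sit in a *sibling* row keyed by that id.
LISTING = """<table>
  <tr class="athing" id="101"><td><a href="https://example.com/a">Story A</a></td></tr>
  <tr><td><span id="score_101">42 points</span> <a href="item?id=101">7 comments</a></td></tr>
  <tr class="athing" id="102"><td><a href="https://example.com/b">Story B</a></td></tr>
  <tr><td><span id="score_102">5 points</span> <a href="item?id=102">0 comments</a></td></tr>
</table>"""


def parse_items(markup):
    root = ET.fromstring(markup)
    items = []
    for row in root.findall(".//tr[@class='athing']"):
        item_id = row.get("id")  # the "local variable" captured from the item
        # Use the captured id to reach data outside the item's own element.
        score = root.find(f".//span[@id='score_{item_id}']")
        comments = root.find(f".//a[@href='item?id={item_id}']")
        items.append({
            "id": item_id,
            "title": row.findtext(".//a"),
            "score": int(score.text.split()[0]),
            "comments": int(comments.text.split()[0]),
        })
    return items


items = parse_items(LISTING)
```

A selector that is only evaluated relative to the item element can never reach the sibling row, which is why some form of variable substitution would be needed.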
This is still somewhat far from what I was hoping to see, but nonetheless a great inspiration and a good start.
I worked on a small Firefox plugin that would enable visually impaired users to use Firefox and interact with websites using voice. It was a small attempt and it was not an easy task...
The two biggest challenges were (1) understanding the semantics of the website and (2) interacting with the browser.
For challenge one, consider the case where a clickable div or span styled with CSS is used instead of the <button> tag. Traditional screen readers try their best to find these buttons, but it still poses a challenge. Likewise, when the “alt” attribute is missing from an <img> tag, there is no description when the image fails to load with a 404. So wouldn’t it be awesome if we could be more responsible and also provide a standardized markup API that screen readers can use?
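Both problems can be caught mechanically, at least in their simplest form. Below is a rough sketch of such a check; the sample page and the `AccessibilityAudit` class are invented, and the heuristics are deliberately naive (handlers attached from JavaScript are invisible to it, and an empty alt can be legitimate for decorative images).

```python
from html.parser import HTMLParser

# Invented snippet showing the two problems described above.
PAGE = """
<div class="btn" onclick="send()">Send</div>
<button type="submit">Save</button>
<img src="logo.png">
<img src="photo.jpg" alt="Team photo">
"""


class AccessibilityAudit(HTMLParser):
    """Flags clickable non-button elements and images without alt text."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Missing or empty alt: nothing to read out if the image fails to load.
        if tag == "img" and not attrs.get("alt"):
            self.problems.append(f"img without alt: {attrs.get('src')}")
        # Inline click handler on an element that is not natively interactive.
        clickable = "onclick" in attrs and tag not in ("button", "a", "input")
        if clickable and attrs.get("role") != "button":
            self.problems.append(f"clickable <{tag}> without role=button")


audit = AccessibilityAudit()
audit.feed(PAGE)
audit.close()
problems = audit.problems
```

The plugin had to guess at these cases at runtime; a check like this at authoring time would have removed a good part of the guessing.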
As part of the thesis, I worked with a few visually impaired users. Whether screen readers have evolved since I do not know, so I’d love to hear feedback from those on HN.
Parsing Gmail’s DOM proved to be very difficult, but fortunately Gmail.js exists. I owe the author so much (although I did help fix a bug, I think).
Next comes the interaction. The idea is:
1. The user says to the computer “find the latest email”.
2. The computer repeats the command back to the user.
3. A short wait gives the user a chance to correct the command before it is too late.
4. When the time expires, the code executes the command.
Because there is no AI and because we need to consider states, I went with what seemed the best choice (especially for such a small experiment): a finite state machine.
However, as expected, the number of states exploded for just 5-6 (??) commands on a single website; this is clearly neither scalable nor maintainable. It is unacceptable to limit a user to just a couple of actions, right?
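The confirm-then-execute loop can be sketched as a tiny state machine. This is a reconstruction under my own assumptions, not the original plugin code: the state names, the event names and the `VoiceSession` class are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONFIRMING = auto()  # command repeated back, waiting for a correction
    EXECUTING = auto()


# (state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "heard_command"): State.CONFIRMING,
    (State.CONFIRMING, "heard_command"): State.CONFIRMING,  # user corrected it
    (State.CONFIRMING, "cancel"): State.IDLE,
    (State.CONFIRMING, "timeout"): State.EXECUTING,
    (State.EXECUTING, "done"): State.IDLE,
}


class VoiceSession:
    def __init__(self):
        self.state = State.IDLE
        self.pending = None
        self.executed = []

    def handle(self, event, payload=None):
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            return self.state  # ignore events that make no sense in this state
        if event == "heard_command":
            self.pending = payload
        elif event == "cancel":
            self.pending = None
        elif event == "timeout":
            # The wait expired without a correction: run the pending command.
            self.executed.append(self.pending)
        self.state = next_state
        return self.state


session = VoiceSession()
session.handle("heard_command", "find the oldest email")
session.handle("heard_command", "find the latest email")  # correction in time
session.handle("timeout")
session.handle("done")
```

This part stays small. The explosion comes from everything `EXECUTING` hides: each command needs its own site-specific sub-states (logged in or not, which view is open, what is selected), so the transition table grows with every command and every page.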
What would be awesome is if every website could expose APIs: not just any API, but one that follows a standard. To start small, we can take sitemap.xml as inspiration. There should be a manifest file that describes how to log in, what is needed to log in, whether there is an “About us” page, how to search, how to post something, etc.
This manifest file is generated on demand and is dynamic, since we cannot write out 10,000 different interactions on Facebook, right? Then a screen reader or my plugin can read the manifest file, let the user tell us what he/she wants to do, and look up in the manifest how to make the call.
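To make the manifest idea a bit more tangible, here is a minimal sketch of what a client could do with one. No such standard exists as far as I know: the JSON layout, every key and URL in it, and the `plan_request` helper are invented for illustration.

```python
import json

# Hypothetical manifest format; every key and URL below is invented.
MANIFEST_JSON = """
{
  "site": "https://example.com",
  "actions": {
    "login":  {"method": "POST", "path": "/session", "fields": ["username", "password"]},
    "search": {"method": "GET",  "path": "/search",  "fields": ["q"]},
    "about":  {"method": "GET",  "path": "/about",   "fields": []}
  }
}
"""


def plan_request(manifest, action, **values):
    """Turn a user intent into a request description, without sending it."""
    spec = manifest["actions"].get(action)
    if spec is None:
        raise KeyError(f"site does not declare an action named {action!r}")
    missing = [name for name in spec["fields"] if name not in values]
    if missing:
        # A voice client could use this list to ask the user for the rest.
        raise ValueError(f"missing fields for {action}: {missing}")
    return {
        "method": spec["method"],
        "url": manifest["site"] + spec["path"],
        "data": {name: values[name] for name in spec["fields"]},
    }


manifest = json.loads(MANIFEST_JSON)
request = plan_request(manifest, "search", q="latest email")
```

With something like this, the plugin would map a spoken intent to an action name and let the manifest supply the rest, instead of hard-coding a state machine per website.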
Toapi seems to be heading in a similar direction, which is great.
As for the second challenge, an add-on can only do so much. I did have to use the more low-level APIs to do certain things, and if I remember correctly service workers weren’t exactly easy back in 2013. Content scripts have limited capabilities. Also, there was no speech recognition API in Firefox in 2013, so I turned to speak.js, which was the only one that worked offline and had no extra dependencies like Node.js. There is now some support for the Web Speech API in Firefox, which is encouraging.
The takeaways are:
- we suck at designing usable websites that respect semantics (I am guilty of that on my own silly homepage)
- those who don’t use ARIA should try it. ARIA does not solve everything but it gives a good start
- the DOM is a horrible place to deduce and find content, and a nightmare to interact with
- add-ons have restricted capabilities for good security reasons, so they are not a great place to implement complex software
- we hate ads, especially overlay and popup ads (they cannot be fully blocked even with Chrome’s new feature). But also consider that we’d have to unblock ads to load some popular websites (ironically, HN search will not work on mobile unless Focus is disabled)
- finally, it would be awesome if there were a way for a machine to consume a website’s or a webpage’s functionality
Is this any different from other frameworks, such as Symfony, that let you create regex routes with a controller?
No support for SQL, to quickly query through complicated data structures?